How Explain Data Works
Use Explain Data as an incremental, jumping-off point for further exploration of your data. The possible explanations that it generates help you to see the different values that make up or relate to an analyzed mark in a view. It can tell you about the characteristics of the data points in the data source, and how the data might be related (correlations) using statistical modeling. These explanations give you another tool for inspecting your data and finding interesting clues about what to explore next.
Note: Explain Data is a tool that uncovers and describes relationships in your data. It can't tell you what is causing the relationships or how to interpret the data. You are the expert on your data. Your domain knowledge and intuition are key in helping you decide what characteristics might be interesting to explore further using different views.
What Explain Data is (and isn’t)
Explain Data is:
- A tool and a workflow that leverages your domain expertise.
- A tool that surfaces relationships in your data and recommends where to look next.
- A tool and a workflow that helps expedite data analysis and make data analysis more accessible to a broader range of users.
Explain Data is not:
- A statistical testing tool.
- A tool to prove or disprove hypotheses.
- A tool that gives you an answer or tells you anything about causality in your data.
When running Explain Data on marks, keep the following points in mind:
- Use granular data that can be aggregated. This feature is designed explicitly for the analysis of aggregated data. This means that your data must be granular, but the marks that you select for Explain Data must be aggregated or summarized at a higher level of detail. Explain Data can't be run on disaggregated marks (row-level data) at the most granular level of detail.
- Consider the shape, size, and cardinality of your data. While Explain Data can be used with smaller data sets, it requires data that is sufficiently wide and contains enough marks (granularity) to be able to create a model.
- Don't assume causality. Correlation is not causation. Explanations are based on models of the data, but are not causal explanations.
A correlation means that a relationship exists between some data variables, say A and B. You can't tell just from seeing that relationship in the data that A is causing B, or B is causing A, or if something more complicated is actually going on. The data patterns are exactly the same in each of those cases and an algorithm can't tell the difference between each case. Just because two variables seem to change together doesn't necessarily mean that one causes the other to change. A third factor could be causing them both to change, or it may be a coincidence and there might not be any causal relationship at all.
However, you might have outside knowledge that is not in the data that helps you to identify what's going on. A common type of outside knowledge would be a situation where the data was gathered in an experiment. If you know that B was chosen by flipping a coin, any consistent pattern of difference in A (that isn't just random noise) must be caused by B. For a longer, more in-depth description of these concepts, see the article Causal inference in economics and marketing by Hal Varian.
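The "third factor" case can be illustrated with a small simulation. This is only a sketch with made-up data, not part of Explain Data: the hidden factor C, the noise levels, and the helper `pearson` are all assumptions chosen for illustration. A and B never influence each other, yet they appear strongly correlated because both depend on C.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical hidden factor C drives both A and B.
# A and B never affect each other directly.
c = [rng.gauss(0, 1) for _ in range(5000)]
a = [ci + rng.gauss(0, 0.5) for ci in c]
b = [ci + rng.gauss(0, 0.5) for ci in c]

r_ab = pearson(a, b)

# Subtracting C's contribution leaves only independent noise.
a_resid = [ai - ci for ai, ci in zip(a, c)]
b_resid = [bi - ci for bi, ci in zip(b, c)]
r_resid = pearson(a_resid, b_resid)

print(f"corr(A, B) = {r_ab:.2f}; after removing C = {r_resid:.2f}")
```

From the data alone, the strong correlation between A and B looks the same as it would if A caused B. Only the outside knowledge that C exists makes it possible to tell the cases apart.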
How explanations are analyzed and evaluated
Explain Data runs a statistical analysis on a dashboard or sheet to find marks that are outliers, or specifically on a mark you select. The analysis also considers possibly related data points from the data source that aren't represented in the current view.
Explain Data first predicts the value of a mark using only the data that is present in the visualization. Next, data that is in the data source (but not in the current view) is considered and added to the model. The model determines the range of predicted mark values, which is within one standard deviation of the predicted value.
What is an expected range?
The expected value for a mark is the median value in the expected range of values in the underlying data in your viz. The expected range is the range of values between the 15th and 85th percentile that the statistical model predicts for the analyzed mark. Tableau determines the expected range each time it runs a statistical analysis on a selected mark.
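As a rough sketch of the percentile definition above, the following assumes you already have a list of values that a model predicts for the mark. How Explain Data produces those predictions is not described here, so `predicted_values` and the function name `expected_range` are hypothetical.

```python
from statistics import median, quantiles

def expected_range(predicted_values):
    """Return (low, expected, high): the 15th percentile, median, and 85th percentile."""
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%.
    cuts = quantiles(predicted_values, n=20, method="inclusive")
    low, high = cuts[2], cuts[16]  # the 15% and 85% cut points
    return low, median(predicted_values), high

# Hypothetical predictions: the integers 0 through 100.
low, expected, high = expected_range(list(range(101)))
print(low, expected, high)
```

The "inclusive" method treats the supplied values as the full set of predictions. Other percentile conventions would give slightly different endpoints on small samples.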
Possible explanations are evaluated on their explanatory power using statistical modeling. For each explanation, Tableau compares the expected value with the actual value.
Value | Description |
---|---|
Higher than expected / Lower than expected | If an expected value summary says the mark is lower than expected or higher than expected, it means the aggregated mark value is outside the range of values that a statistical model is predicting for the mark. If an expected value summary says the mark is slightly lower or slightly higher than expected, or within the range of natural variation, it means the aggregated mark value is within the range of predicted mark values, but is lower or higher than the median. |
Expected Value | If a mark has an expected value, it means its value falls within the expected range of values that a statistical model is predicting for the mark. |
Random Variation | When the analyzed mark has a low number of records, there may not be enough data available for Explain Data to form a statistically significant explanation. If the mark’s value is outside the expected range, Explain Data can’t determine whether this unexpected value is being caused by random variation or by a meaningful difference in the underlying records. |
No Explanation | When the analyzed mark value is outside of the expected range and it does not fit a statistical model used for Explain Data, no explanations are generated. |
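
The first two rows of the table can be read as a simple comparison of the actual mark value against the expected range. The sketch below is an interpretation of that wording, not Tableau's implementation. The function name `summarize` and its parameters are hypothetical, and it does not cover the Random Variation or No Explanation cases, which depend on record counts and model fit.

```python
def summarize(actual, low, expected, high):
    """Describe a mark value relative to its expected range [low, high]."""
    if actual > high:
        return "higher than expected"
    if actual < low:
        return "lower than expected"
    # Inside the range: compare against the median (the expected value).
    if actual > expected:
        return "slightly higher than expected"
    if actual < expected:
        return "slightly lower than expected"
    return "expected"

# Hypothetical expected range of 15 to 85 with a median of 50.
for value in (95, 60, 50, 40, 5):
    print(value, "->", summarize(value, low=15, expected=50, high=85))
```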
Models used for analysis
Explain Data builds models of the data in a view to predict the value of a mark and then determines whether a mark is higher or lower than expected given the model. Next, it considers additional information, like adding additional columns from the data source to the view, or flagging record-level outliers, as potential explanations. For each potential explanation, Explain Data fits a new model, and evaluates how unexpected the mark is given the new information. Explanations are scored by trading off complexity (how much information is added from the data source) against the amount of variability that needs to be explained. Better explanations are simpler than the variation they explain.
Explanation type | Evaluation |
---|---|
Extreme values | Extreme values are aggregated marks that are outliers, based on a model of the visualized marks. The selected mark is considered to contain an extreme value if a record value is in the tails of the distribution of the expected values for the data. An extreme value is determined by comparing the aggregate mark with and without the extreme value. If the mark becomes less surprising by removing a value, then it receives a higher score. When a mark has extreme values, it doesn't automatically mean it has outliers, or that you should exclude those records from the view. That choice is up to you depending on your analysis. The explanation is simply pointing out an interesting extreme value in the mark. For example, it could reveal a mistyped value in a record where a banana cost 10 dollars instead of 10 cents. Or, it could reveal that a particular sales person had a great quarter. |
Number of records | The number of records explanation models the aggregate sum in terms of the aggregate count, whereas the average value explanation models it in terms of the aggregate average. The better the model explains the sum, the higher the score. This explanation describes whether the sum is interesting because the count is high or low, or because the average is high or low. |
Average value of the mark | This type of explanation is used for aggregate marks that are sums. It explains whether the mark is consistent with the other marks in terms of its aggregate count or average, noting the relation SUM(X) = COUNT(X) * AVG(X). This explanation describes whether the sum is interesting because the count is high or low, or because the average is high or low. |
Contributing Dimensions | This explanation models the target measure of the analyzed mark in terms of the breakdown among categories of the unvisualized dimension. The analysis balances the complexity of the model with how well the mark is explained. An unvisualized dimension is a dimension that exists in the data source, but isn't currently being used in the view. This type of explanation is used for sums, counts and averages. The model for unvisualized dimensions is created by splitting out marks according to the categorical values of the explaining column, and then building a model with the value that includes all of the data points in the source visualization. For each row, the model attempts to recover each of the individual components that made each mark. The analysis indicates whether the model predicts the mark better when components corresponding to the unvisualized dimension are modeled and then added up, versus using a model where the values of the unvisualized dimension are not known. Aggregate dimension explanations explore how well mark values can be explained without any conditioning. Then, the model conditions on values for each column that is a potential explanation. Conditioning on the distribution of an explanatory column should result in a better prediction. |
Contributing Measures | This explanation models the mark in terms of this unvisualized measure, aggregated to its mean across the visualized dimensions. An unvisualized measure is a measure that exists in the data source, but isn't currently being used in the view. A Contributing Measures explanation can reveal a linear or quadratic relationship between the unvisualized measure and the target measure. |
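
Two ideas from the table lend themselves to a small worked example: the relation SUM(X) = COUNT(X) * AVG(X), and the comparison of an aggregate with and without an extreme record. The sketch below uses made-up banana prices in cents, echoing the mistyped-value example. The helper names `decompose_sum` and `without_largest` are hypothetical, and the code does not reproduce how Explain Data scores explanations.

```python
def decompose_sum(values):
    """Split SUM(X) into COUNT(X) and AVG(X), per SUM(X) = COUNT(X) * AVG(X)."""
    count = len(values)
    average = sum(values) / count
    return count, average

def without_largest(values):
    """Return the records with one copy of the largest value removed."""
    remaining = list(values)
    remaining.remove(max(remaining))
    return remaining

# Hypothetical banana prices in cents; one record was typed as 10 dollars.
prices = [10, 12, 9, 1000, 11]

count, average = decompose_sum(prices)
trimmed_count, trimmed_average = decompose_sum(without_largest(prices))

print(f"with all records:    count={count}, avg={average}, sum={sum(prices)}")
print(f"without the extreme: count={trimmed_count}, avg={trimmed_average}")
```

Here the count is unremarkable, so the large sum is driven by the average, and the average in turn is driven by a single record. Whether that record is an error or a genuine value is a judgment that depends on your knowledge of the data.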