How Explain Data Works
Use Explain Data as an incremental, jumping-off point for further exploration of your data. The possible explanations that it generates help you to see the different values that make up or relate to a selected mark in a view. It can tell you about the characteristics of the data points in the data source, and how the data might be related (correlations) using statistical modeling. These explanations give you another tool for inspecting your data and finding interesting clues about what to explore next.
Note: Explain Data is a tool that uncovers and describes relationships in your data. It can't tell you what is causing the relationships or how to interpret the data. You are the expert on your data. Your domain knowledge and intuition are key in helping you decide which characteristics might be interesting to explore further using different views. For more information, see How Explain Data Works and Requirements and Considerations for Using Explain Data.
For related information on how Explain Data works, and how to use Explain Data to augment your analysis, see these Tableau Conference presentations:
- From Analyst to Statistician: Explain Data in Practice (1 hour)
- Leveraging Explain Data (45 minutes)
- Explain Data Internals: Automated Bayesian Modeling (35 minutes)
- Machine Learning, Explainable AI, and Tableau (45 minutes), Session Materials
- Making Business More Bayesian (45 minutes)
Explain Data is:
- A tool and a workflow that leverages your domain expertise.
- A tool that recommends where to look next, and that surfaces relationships in your data.
- A tool and a workflow that helps expedite data analysis, and make data analysis more accessible to a broader range of users.
Explain Data is not:
- A statistical testing tool.
- A tool to prove or disprove hypotheses.
- A tool that gives you an answer or tells you anything about causality in your data.
When running and reading the explanations created by Explain Data, keep the following points in mind:
Use granular data that can be aggregated. This feature is designed explicitly for the analysis of aggregated data. This means that your data must be granular, but the marks that you select for Explain Data must be aggregated or summarized at a higher level of detail. Explain Data can't be run on disaggregated marks (row-level data) at the most granular level of detail.
For more information about aggregation, see Data Aggregation in Tableau.
Consider the shape, size, and cardinality of your data. While Explain Data can be used with smaller data sets, it requires data that is sufficiently wide and contains enough marks (granularity) to be able to create a model.
Don't assume causality. Correlation is not causation. Explanations are based on models of the data, but are not causal explanations.
A correlation means that a relationship exists between some data variables, say A and B. You can't tell just from seeing that relationship in the data that A is causing B, or B is causing A, or if something more complicated is actually going on. The data patterns are exactly the same in each of those cases and an algorithm can't tell the difference between each case. Just because two variables seem to change together doesn't necessarily mean that one causes the other to change. A third factor could be causing them both to change, or it may be a coincidence and there might not be any causal relationship at all.
However, sometimes you have outside knowledge that is not in the data that helps you to identify what's going on. A common type of outside knowledge would be a situation where the data was gathered in an experiment. If you know that B was chosen by flipping a coin, any consistent pattern of difference in A (that isn't just random noise) must be caused by B. For a longer, more in-depth description of these concepts, see the article Causal inference in economics and marketing by Hal Varian.
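The point about a third factor can be illustrated with a small simulation. This is only a sketch with made-up data: the variable names, the noise levels, and the fixed random seed are assumptions chosen for illustration, and none of it reflects how Explain Data itself works.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical hidden third factor C that drives both A and B.
c = [rng.gauss(0, 1) for _ in range(2000)]
a = [ci + rng.gauss(0, 0.3) for ci in c]  # A depends on C only
b = [ci + rng.gauss(0, 0.3) for ci in c]  # B depends on C only

# A and B are strongly correlated even though neither causes the other.
r = pearson(a, b)
print(f"correlation between A and B: {r:.2f}")
```

Looking only at A and B, the data pattern is the same as it would be if A caused B, which is why the correlation alone cannot settle the question.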
When you run Explain Data on a mark, a statistical analysis is run on the aggregated mark, and then on possibly related data points from the data source that aren't represented in the current view.
Explain Data first predicts the value of a mark using only the data that is present in the visualization. Next, data that is in the data source (but not in the current view) is considered and added to the model. The model determines the range of predicted mark values, which is within one standard deviation of the predicted value.
If an expected value summary says the mark is lower than expected or higher than expected, it means the aggregated mark value is outside the range of values that a statistical model is predicting for the mark. If an expected value summary says the mark is lower or higher than expected, but within the natural range of variation, it means the aggregated mark value is within the range of predicted mark values, but is lower or higher in that range of values.
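The wording of the expected value summary can be sketched as a simple range check. This is a minimal illustration, not Tableau's implementation: the function name `classify_mark` is hypothetical, and the mean and sample standard deviation of the other marks stand in for the predicted value and spread that the real statistical model would produce.

```python
from statistics import mean, stdev

def classify_mark(actual, predicted, spread):
    """Compare a mark to the range predicted +/- one spread."""
    low, high = predicted - spread, predicted + spread
    if actual < low:
        return "lower than expected"
    if actual > high:
        return "higher than expected"
    if actual < predicted:
        return "lower than expected, but within the natural range of variation"
    if actual > predicted:
        return "higher than expected, but within the natural range of variation"
    return "as expected"

# Hypothetical aggregated values for the other marks in a view.
other_marks = [100.0, 110.0, 90.0, 105.0, 95.0]
predicted = mean(other_marks)  # stand-in for the model's predicted value
spread = stdev(other_marks)    # stand-in for one standard deviation

print(classify_mark(130.0, predicted, spread))  # outside the range, above it
print(classify_mark(104.0, predicted, spread))  # inside the range, upper half
```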
Possible explanations are evaluated on their explanatory power using statistical modeling. Explanations are listed based on how informative they are; simpler explanations with less variability are favored. For each explanation, Tableau compares the expected value with the actual value. Explanations that don't meet the defined threshold are not listed.
Models used for analysis
Explain Data builds Bayesian models of the data in a view to predict the value of a mark, and then determines whether a mark is higher or lower than expected given the model. Next, it considers additional information, like adding additional columns from the data source to the view, or flagging record-level outliers, as potential explanations. For each potential explanation, Explain Data fits a new model, and evaluates how unexpected the mark is given the new information. Explanations are scored by trading off complexity (how much information is added from the data source) against the amount of variability that needs to be explained. Better explanations are simpler than the variation they explain.
How scoring works
When you run Explain Data for a mark, the explanations window is displayed. If multiple explanations are available, they are displayed in descending order based on a numerical score given to each potential explanation. Only explanations with the highest scores are displayed. Scoring works differently for different explanation types.
Extreme values are aggregated marks that are outliers, based on a model of the visualized marks. The selected mark is considered to contain an extreme value if a record value is in the tails of the distribution of the expected values for the data.
The score of an extreme value is determined by looking at the minimum and maximum values that make up the aggregate mark. If the mark becomes less surprising by removing a value, then it receives a higher score.
When a mark has an extreme value, it doesn't automatically mean it's an outlier, or that you should exclude it from the view. That choice is up to you depending on your analysis. The explanation is simply pointing out an interesting extreme value in the mark. For example, it could reveal a mistyped value in a record where a banana cost 10 dollars instead of 10 cents. Or, it could reveal that a particular sales person had a great quarter.
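The idea of scoring an extreme value by removing the minimum or maximum record can be sketched as follows. This is an assumption-laden illustration: the helper names are hypothetical, "surprise" is measured here as a plain distance in standard deviations from the other marks, and the actual feature uses its own model and scoring.

```python
from statistics import mean, stdev

def surprise(value, reference):
    """Distance of value from the reference mean, in standard deviations."""
    return abs(value - mean(reference)) / stdev(reference)

def extreme_value_check(records, other_mark_values):
    """Return (record, reduction in surprise) for whichever of the minimum
    or maximum record makes the mark's SUM least surprising when removed."""
    base = surprise(sum(records), other_mark_values)
    best = None
    for candidate in (min(records), max(records)):
        remaining = list(records)
        remaining.remove(candidate)  # drop one occurrence of the candidate
        reduction = base - surprise(sum(remaining), other_mark_values)
        if best is None or reduction > best[1]:
            best = (candidate, reduction)
    return best

# Hypothetical banana prices in dollars; one record was typed as 10.00.
selected_records = [0.10, 0.12, 0.11, 10.0]
other_sums = [0.40, 0.45, 0.42, 0.44]  # SUM values of the other marks

record, reduction = extreme_value_check(selected_records, other_sums)
print(record, reduction > 0)
```

A large positive reduction means the mark looks far more ordinary without that record, which is what makes the record worth pointing out. Whether to exclude it remains your decision.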
Number of records / Average value of records
This type of explanation is used for aggregate marks that are sums. It explains whether the mark differs from the overall distribution because of the number of records it aggregates, or because of the average value of those records.
Because SUM marks are by definition equal to COUNT(X) * AVG(X), the mark can be broken down into a count of values and multiplied by the average value for the mark. This yields two new distributions: a distribution of counts and a distribution of averages. If the selected mark is an outlier, it will either have a count that is an outlier in the count distribution, an average value that is an outlier in the distribution of averages, or both.
This explanation describes whether the sum is interesting because the count is high or low, or because the average is high or low.
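The decomposition of a sum into a count and an average can be sketched with a few made-up marks. The mark names, record values, and the use of a z-score against the other marks are assumptions for illustration only; the actual feature compares against modeled distributions.

```python
from statistics import mean, stdev

def z_score(value, reference):
    """Signed distance of value from the reference mean, in standard deviations."""
    return (value - mean(reference)) / stdev(reference)

def decompose(selected, marks):
    """Split the selected SUM mark into COUNT and AVG, and compare each
    part with the same quantity computed for the other marks."""
    others = [values for name, values in marks.items() if name != selected]
    count = len(marks[selected])
    avg = mean(marks[selected])
    count_z = z_score(count, [len(values) for values in others])
    avg_z = z_score(avg, [mean(values) for values in others])
    return count, avg, count_z, avg_z

# Hypothetical marks: each name maps to the record values it aggregates.
marks = {
    "East": [20.0, 22.0, 21.0, 19.0],
    "West": [21.0, 20.0, 22.0, 20.0, 19.0],
    "North": [20.0, 21.0, 19.0, 22.0],
    "South": [20.0, 21.0, 19.0, 22.0, 20.0, 21.0,
              20.0, 19.0, 22.0, 21.0, 20.0, 21.0],
}

count, avg, count_z, avg_z = decompose("South", marks)
driver = "count" if abs(count_z) > abs(avg_z) else "average"
print(f"SUM = {count} * {avg} = {count * avg}; unusual because of the {driver}")
```

In this made-up data the selected mark has a typical average but many more records than the others, so the count is what makes its sum stand out.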
Unvisualized and aggregated dimensions
An unvisualized dimension is a dimension that exists in the data source, but isn't currently being used in the view. This type of explanation is used for sums and averages. Aggregated explanations also work on counts.
The model for unvisualized dimensions is created by splitting out marks according to the categorical values of the explaining column, and then building a model with the value that includes all of the data points in the source visualization. For each row, the model attempts to recover each of the individual components that made each mark. The score indicates whether the model predicts the mark better when components corresponding to the unvisualized dimension are modeled and then added up, versus using a model where the values of the unvisualized dimension are not known.
Aggregate dimension explanations explore how well mark values can be explained without any conditioning. Then, the model conditions on the values of each column that is a potential explanation. Conditioning on the distribution of an explanatory column should result in a better prediction. The score reflects how much the prediction improves.
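The notion of a score that measures how much conditioning improves a prediction can be sketched with a plain reduction in squared error. This is a simplified stand-in, not the Bayesian modeling that Explain Data uses: the function names and sample rows are hypothetical, and the sketch ignores the complexity penalty that the real scoring applies.

```python
from collections import defaultdict
from statistics import mean

def squared_error(values, prediction):
    """Total squared error when every value is predicted as `prediction`."""
    return sum((v - prediction) ** 2 for v in values)

def conditioning_gain(rows, column, measure):
    """Fraction of squared error removed by predicting each row with the
    mean of its group in `column` instead of the overall mean."""
    values = [row[measure] for row in rows]
    baseline = squared_error(values, mean(values))
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[measure])
    conditioned = sum(squared_error(g, mean(g)) for g in groups.values())
    return 1 - conditioned / baseline if baseline else 0.0

# Hypothetical record-level data behind a single mark.
rows = [
    {"segment": "Consumer", "color": "red", "sales": 10},
    {"segment": "Consumer", "color": "blue", "sales": 12},
    {"segment": "Corporate", "color": "red", "sales": 50},
    {"segment": "Corporate", "color": "blue", "sales": 52},
]

segment_gain = conditioning_gain(rows, "segment", "sales")
color_gain = conditioning_gain(rows, "color", "sales")
print(f"segment: {segment_gain:.3f}, color: {color_gain:.3f}")
```

In this made-up data, conditioning on segment removes almost all of the prediction error while conditioning on color removes almost none, so segment would be the more informative candidate explanation.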