Inspect a View using Explain Data
Explain Data gives you a new window into your data. Use it to inspect, uncover, and dig deeper into the marks in a viz as you build, explore, and analyze your data. When you select a mark while editing a view and run Explain Data, Tableau builds statistical models and proposes possible explanations for the selected mark, including potentially related data from the data source that isn't used in the current view.
As you build different views, use Explain Data as a jumping-off point to help you explore your data more deeply and ask better questions.
Note: Explain Data is a tool that uncovers and describes relationships in your data. It can't tell you what is causing the relationships or how to interpret the data. You are the expert on your data. Your domain knowledge and intuition are key in helping you decide what characteristics might be interesting to explore further using different views. For more information, see What is Explain Data? and How to use Explain Data in the flow of analysis.
Creators and Explorers with editing permissions can use Explain Data when editing a view in Desktop, or editing a view on the web in Tableau Online or Tableau Server.
Steps to use Explain Data
To use Explain Data in a view, you must be able to edit a view in Tableau Desktop, Tableau Online, or Tableau Server.
Build a visualization. Make sure it uses a measure that is aggregated with SUM, AVG, COUNT, COUNTD, or AGG.
In Tableau Online or Tableau Server, you will need to open a view for editing (click Edit in the toolbar).
Select a mark of interest, and then click the Explain Data icon in the tooltip for the mark. In Tableau Desktop, you can optionally right-click the mark and select Explain Data in the context menu.
Note: You must select a single mark. Multiple mark selections are not supported. If Explain Data cannot analyze the type of mark selected, the Explain Data icon and context menu command will not be available. For more information, see When Explain Data is not available.
Read the explanations. Explanations are generated for each measure in the current view that can be analyzed.
If multiple explanations are available, click each explanation tab to see the related details.
If multiple measures are available, click each measure tab for more explanations.
Click the Open icon in the top right corner of an explanation viz to open the visualization as a new worksheet and explore the data further.
Parts of the explanations window
This image is an example of the explanations window for Explain Data, with multiple explanations available.
A - Selected Mark. Displays the dimension values of the selected mark to indicate what mark is being described and analyzed.
B - Measure in Use. Click to select the measure in use for explanations. Explanations are given for one measure at a time. If multiple measures are available, they are displayed as separate tabs here.
C - Expected Value Summary. Describes whether or not the value is unexpected given the other marks in the visualization. Hover over the text in this statement to see details about the expected value range.
If an expected value summary says the mark is lower than expected or higher than expected, it means the aggregated mark value is outside the range of values that a statistical model is predicting for the mark. If an expected value summary says the mark is lower or higher than expected, but within the natural range of variation, it means the aggregated mark value is within the range of predicted mark values, but is lower or higher in that range of values. For related information, see How explanations are analyzed, evaluated, and scored.
D - Explanations List. Displays a list of the possible explanations for the value in the selected mark that Tableau was able to identify. Click an explanation in the list to see a description in the explanation pane on the right.
E - Explanation Description. Displays the selected explanation with a combination of text and vizzes. Click the icon in the top right corner of the viz thumbnail image to open it as a new sheet in the workbook.
Note: Sometimes a mark can be analyzed with no resulting explanations. This is indicated by No Explanation Found in Data. For information on data characteristics that work well with Explain Data, see Requirements and considerations for Explain Data and How to use Explain Data in the flow of analysis.
Explain Data is a tool that uncovers and describes relationships in your data. It can't tell you what is causing the relationships or how to interpret the data, but it can expose interesting correlations that you might want to explore further. Your domain knowledge and intuition are key in helping you decide what characteristics might be interesting to explore further using different views. You are the expert on your data. Only you can determine the "why" in your data.
What Explain Data is (and isn’t)
Explain Data is not:
- A statistical testing tool
- A tool to prove or disprove hypotheses
- A tool that is giving you an answer or telling you anything about causality in your data
Explain Data is:
- A tool and a workflow that leverages your domain expertise
- A tool that recommends where to look next, and that surfaces relationships in your data
- A tool and a workflow that helps expedite data analysis and makes it more accessible to a broader range of users
Note: For related information on how Explain Data works, and how to use Explain Data to augment your analysis, see these Tableau Conference presentations:
Use Explain Data as an incremental, jumping-off point for further exploration of your data. The possible explanations that it generates help you to see the different values that make up or relate to a selected mark in a view. It can tell you about the characteristics of the data points in the data source, and how the data might be related (correlations) using statistical modeling. These explanations give you another tool for inspecting your data and finding interesting clues about what to explore next.
When running and reading the explanations created by Explain Data, keep the following points in mind:
Use granular data that can be aggregated. This feature is designed explicitly for the analysis of aggregated data. This means that your data must be granular, but the marks that you select for Explain Data must be aggregated or summarized at a higher level of detail. Explain Data can't be run on disaggregated marks (row-level data) at the most granular level of detail.
For more information about aggregation, see Data Aggregation in Tableau.
Consider the shape, size, and cardinality of your data. While Explain Data can be used with smaller data sets, it requires data that is sufficiently wide and contains enough marks (granularity) to be able to create a model.
Don't assume causality. Correlation is not causation. Explanations are based on models of the data, but are not causal explanations.
A correlation means that a relationship exists between some data variables, say A and B. You can't tell just from seeing that relationship in the data that A is causing B, or B is causing A, or if something more complicated is actually going on. The data patterns are exactly the same in each of those cases and an algorithm can't tell the difference between each case. Just because two variables seem to change together doesn't necessarily mean that one causes the other to change. A third factor could be causing them both to change, or it may be a coincidence and there might not be any causal relationship at all.
However, sometimes you have outside knowledge that is not in the data that helps you to identify what's going on. A common type of outside knowledge would be a situation where the data was gathered in an experiment. If you know that B was chosen by flipping a coin, any consistent pattern of difference in A (that isn't just random noise) must be caused by B. For a longer, more in-depth description of these concepts, see the article Causal inference in economics and marketing by Hal Varian.
When you are using Explain Data in a worksheet, remember that Explain Data works with:
Single marks only—Explain Data must be run on a single mark. Multiple mark analysis is not supported.
Aggregated data—The view must contain one or more measures that are aggregated using SUM, AVG, COUNT, COUNTD, or AGG (calculated field). At least one dimension must also be present in the view.
Single data sources only—The data must be drawn from a single, primary data source. Explain Data does not work with blended data sources.
When preparing a data source for a workbook, keep the following considerations in mind if you plan to use Explain Data during analysis.
- The underlying data must be sufficiently wide. An ideal data set has at least 10-20 columns in addition to one (or more) aggregated measures to be explained.
- Give columns easy-to-understand names.
- Eliminate redundant columns and data prep artifacts.
- Don't discard unvisualized columns.
- Low cardinality dimensions work better. The explanation of a categorical dimension is easier to interpret if its cardinality is not too high (< 20 categories).
- Don't pre-aggregate data.
- Do pre-aggregate data to an appropriate LOD if data is massive.
- Extracts run faster than live data sources. With live data sources, creating explanations can generate many queries (roughly one query per candidate explanation), so explanations can take longer to appear.
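Two of the considerations above (width of the data and cardinality of dimensions) can be checked before you connect a data source. The following is a minimal sketch in Python, not part of Tableau: the function name `check_source`, its parameters, and the choice to treat string-valued columns as dimensions are all assumptions made for illustration. The thresholds come from the guidance above (at least 10 columns besides the measure, fewer than 20 categories per dimension).

```python
from typing import Dict, List


def check_source(rows: List[Dict[str, object]], measure: str,
                 min_columns: int = 10, max_categories: int = 20) -> List[str]:
    """Return warnings about width and cardinality for a list of row dicts.

    Hypothetical helper: the thresholds mirror the guidance in the text,
    not a rule enforced by Tableau.
    """
    if not rows:
        return ["data source has no rows"]

    warnings = []
    # Every column except the measure that will be explained
    columns = [name for name in rows[0] if name != measure]
    if len(columns) < min_columns:
        warnings.append(
            f"only {len(columns)} columns besides '{measure}'; "
            f"at least {min_columns} is suggested")

    for name in columns:
        values = {row[name] for row in rows}
        # Assumption: only string-valued columns are treated as dimensions
        is_dimension = all(isinstance(value, str) for value in values)
        if is_dimension and len(values) >= max_categories:
            warnings.append(f"dimension '{name}' has {len(values)} categories")
    return warnings


# A narrow source with one high-cardinality dimension triggers both warnings
rows = [{"region": f"r{i}", "sales": float(i)} for i in range(25)]
for warning in check_source(rows, "sales"):
    print(warning)
```

A check like this only flags candidates for review. Whether a wide, high-cardinality column is worth keeping still depends on your knowledge of the data.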
Sometimes Explain Data will not be available for a selected mark, depending on the characteristics of the data source or the view. If Explain Data cannot analyze the selected mark, the Explain Data icon and context menu command will not be available.
Explain Data can't be run in views that use:
Explain Data can't be run if you select:
Explain Data can't be run when the measure to be used for an explanation:
Explain Data can't offer explanations for a dimension when it is:
Note: The Show Explanation Diagnostics setting (in Settings and Performance menu) is not intended to be used for viewing explanations in Explain Data. This option collects internal diagnostics about explanations for use by customer support.
When you run Explain Data on a mark, a statistical analysis is run on the aggregated mark, and then on possibly related data points from the data source that aren't represented in the current view.
Explain Data first predicts the value of a mark using only the data that is present in the visualization. Next, data that is in the data source (but not in the current view) is considered and added to the model. The model determines the range of predicted mark values, which is within one standard deviation of the predicted value.
If an expected value summary says the mark is lower than expected or higher than expected, it means the aggregated mark value is outside the range of values that a statistical model is predicting for the mark. If an expected value summary says the mark is lower or higher than expected, but within the natural range of variation, it means the aggregated mark value is within the range of predicted mark values, but is lower or higher in that range of values.
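The wording of the expected value summary can be illustrated with a short Python sketch. It assumes the predicted value and its standard deviation are already known; how Tableau's model produces them is not described here. The function name `summarize` and the "as expected" label for an exact match are assumptions made for illustration.

```python
def summarize(actual: float, predicted: float, std_dev: float) -> str:
    """Describe a mark relative to a range of one standard deviation
    around the predicted value (hypothetical helper, not Tableau's code)."""
    low, high = predicted - std_dev, predicted + std_dev
    if actual > high:
        return "higher than expected"
    if actual < low:
        return "lower than expected"
    # Inside the predicted range: report which side of the prediction it is on
    if actual > predicted:
        return "higher than expected, but within the natural range of variation"
    if actual < predicted:
        return "lower than expected, but within the natural range of variation"
    return "as expected"  # assumption: label for an exact match


print(summarize(130.0, 100.0, 20.0))  # outside the range, above it
print(summarize(95.0, 100.0, 20.0))   # inside the range, below the prediction
```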
Possible explanations are evaluated on their explanatory power using statistical modeling. Explanations are listed based on how informative they are; simpler explanations with less variability are favored. For each explanation, Tableau compares the expected value with the actual value. Explanations that don’t meet the defined threshold are not listed.
Models used for analysis
Explain Data builds Bayesian models of the data in a view to predict the value of a mark, and then determines whether a mark is higher or lower than expected given the model. Next, it considers additional information, like adding additional columns from the data source to the view, or flagging record-level outliers, as potential explanations. For each potential explanation, Explain Data fits a new model, and evaluates how unexpected the mark is given the new information. Explanations are scored by trading off complexity (how much information is added from the data source) against the amount of variability that needs to be explained. Better explanations are simpler than the variation they explain.
How scoring works
When you run Explain Data for a mark, the explanations window is displayed. If multiple explanations are available, they are displayed in descending order based on a numerical score given to each potential explanation. Only explanations with the highest scores are displayed. Scoring works differently for different explanation types.
Extreme values are aggregated marks that are outliers, based on a model of the visualized marks. The selected mark is considered to contain an extreme value if a record value is in the tails of the distribution of the expected values for the data.
The score of an extreme value is determined by looking at the minimum and maximum values that make up the aggregate mark. If the mark becomes less surprising by removing a value, then it receives a higher score.
When a mark has an extreme value, it doesn't automatically mean it's an outlier, or that you should exclude it from the view. That choice is up to you depending on your analysis. The explanation is simply pointing out an interesting extreme value in the mark. For example, it could reveal a mistyped value in a record where a banana cost 10 dollars instead of 10 cents. Or, it could reveal that a particular sales person had a great quarter.
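The idea of scoring an extreme value by checking whether the mark becomes less surprising once a record is removed can be sketched in Python. This is a simplified stand-in, not Tableau's scoring: it assumes a SUM mark, takes the predicted sum and its standard deviation as given, keeps the prediction fixed after the record is removed, and measures surprise as distance from the prediction in standard deviations. The function name `extreme_value_candidate` is hypothetical.

```python
from typing import List, Optional, Tuple


def extreme_value_candidate(records: List[float], predicted: float,
                            std_dev: float) -> Optional[Tuple[float, float]]:
    """Return (record_value, score) for the minimum or maximum record whose
    removal makes the SUM of the mark least surprising, or None if removing
    neither helps. Requires a non-empty list and std_dev > 0."""

    def surprise(total: float) -> float:
        # Distance from the predicted sum, in standard deviations
        return abs(total - predicted) / std_dev

    total = sum(records)
    baseline = surprise(total)
    best = None
    for candidate in (min(records), max(records)):
        # Positive improvement means the mark is less surprising without it
        improvement = baseline - surprise(total - candidate)
        if improvement > 0 and (best is None or improvement > best[1]):
            best = (candidate, improvement)
    return best


# Banana prices in dollars, with one record mistyped as 10.00
prices = [0.10, 0.12, 0.11, 10.0]
print(extreme_value_candidate(prices, predicted=0.35, std_dev=0.05))
```

In this example the 10.00 record is returned because removing it brings the sum close to the predicted value, while removing the smallest record barely changes it.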
Number of records / Average value of records
This type of explanation is used for aggregate marks that are sums. It explains whether the mark differs from the distribution overall because:
Because SUM marks are by definition equal to COUNT(X) * AVG(X), the mark can be broken down into a count of values and multiplied by the average value for the mark. This yields two new distributions: a distribution of counts and a distribution of averages. If the selected mark is an outlier, it will either have a count that is an outlier in the count distribution, an average value that is an outlier in the distribution of averages, or both.
This explanation describes whether the sum is interesting because the count is high or low, or because the average is high or low.
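The decomposition SUM(X) = COUNT(X) * AVG(X) can be worked through in a few lines of Python. This sketch uses a plain z-score against the other marks to judge whether the count or the average stands out. The z-score is an assumption chosen for illustration, not the model Tableau uses, and the function name `decompose` is hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def decompose(marks: Dict[str, List[float]], selected: str) -> Dict[str, float]:
    """Split the selected SUM mark into count and average, and report how far
    each sits from the other marks (in population standard deviations)."""
    counts = {name: float(len(values)) for name, values in marks.items()}
    averages = {name: mean(values) for name, values in marks.items()}

    def z_score(per_mark: Dict[str, float]) -> float:
        values = list(per_mark.values())
        spread = pstdev(values)
        if spread == 0:
            return 0.0  # every mark has the same value; nothing stands out
        return (per_mark[selected] - mean(values)) / spread

    return {
        "sum": counts[selected] * averages[selected],  # COUNT(X) * AVG(X)
        "count_z": z_score(counts),
        "average_z": z_score(averages),
    }


marks = {
    "East": [10.0] * 8,   # many records, ordinary average
    "West": [10.0] * 2,
    "North": [10.0] * 2,
    "South": [10.0] * 2,
}
print(decompose(marks, "East"))
```

Here the sum for East is high only because its record count is high: the count z-score is large, while the average z-score is zero.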
Unvisualized and aggregated dimensions
An unvisualized dimension is a dimension that exists in the data source, but isn't currently being used in the view. This type of explanation is used for sums and averages. Aggregated explanations also work on counts.
The model for unvisualized dimensions is created by splitting out marks according to the categorical values of the explaining column, and then building a model with the value that includes all of the data points in the source visualization. For each row, the model attempts to recover each of the individual components that made each mark. The score indicates whether the model predicts the mark better when components corresponding to the unvisualized dimension are modeled and then added up, versus using a model where the values of the unvisualized dimension are not known.
Aggregate dimension explanations explore how well mark values can be explained without any conditioning. Then, the model conditions on values of each column that is a potential explanation. Conditioning on the distribution of an explanatory column should result in a better prediction. The score is basically how much better the prediction becomes.
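The notion that the score reflects how much the prediction improves after conditioning on a column can be illustrated with a deliberately simple Python sketch. It replaces the Bayesian models with plain averages, so it is an analogy rather than Tableau's method. The record layout, the function name `conditioning_score`, and the use of absolute error are all assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import List, Tuple

# (mark, category in the candidate column, record value); hypothetical layout
Record = Tuple[str, str, float]


def conditioning_score(records: List[Record], selected: str) -> float:
    """Return how much closer the prediction for an AVG mark gets to the
    actual value once the candidate column's categories are known."""
    mark_rows = [(cat, val) for mark, cat, val in records if mark == selected]
    actual = mean([val for _, val in mark_rows])

    # Prediction without the column: the average over all records
    unconditioned = mean([val for _, _, val in records])

    # Prediction with the column: average each category over all records,
    # then weight those averages by the selected mark's category mix
    by_category = defaultdict(list)
    for _, cat, val in records:
        by_category[cat].append(val)
    conditioned = mean([mean(by_category[cat]) for cat, _ in mark_rows])

    return abs(actual - unconditioned) - abs(actual - conditioned)


records = [
    ("A", "premium", 100.0), ("A", "premium", 100.0), ("A", "basic", 10.0),
    ("B", "basic", 10.0), ("B", "basic", 10.0), ("B", "premium", 100.0),
    ("C", "basic", 10.0), ("C", "basic", 10.0), ("C", "basic", 10.0),
]
print(conditioning_score(records, "A"))
```

Mark A looks high against the overall average, but once its unusually large share of "premium" records is taken into account the prediction matches the actual value, so the column receives a positive score. A column that left the prediction unchanged would score zero.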