Find Clusters in Data
Cluster analysis partitions marks in the view into clusters, where the marks within each cluster are more similar to one another than they are to marks in other clusters.
For an example that demonstrates the process of creating clusters with sample data, see Example: Create clusters using World Economic Indicators data.
To find clusters in a view in Tableau, follow these steps.
Create a view.
Drag Cluster from the Analytics pane into the view, and drop it in the target area in the view:
You can also double-click Cluster to find clusters in the view.
When you drop or double-click Cluster:
Tableau creates a Clusters group on Color, and colors the marks in your view by cluster. If there is already a field on Color, Tableau moves that field to Detail and replaces it on Color with the clustering results.
Tableau assigns each mark in the view to one of the clusters. In some cases, marks that do not fit well into a cluster are assigned to a "Not Clustered" cluster.
Tableau displays the Clusters dialog box, where you can customize the cluster.
Customize the cluster results by doing either of the following in the Clusters dialog box.
Drag new fields from the Data pane into the Variables area of the Clusters dialog box. You can also drag fields out of the Variables area to remove them.
When you add variables, measures are aggregated using the default aggregation for the field; dimensions are aggregated using ATTR, which is the standard way that Tableau aggregates dimensions.
To change the aggregation for a variable, right-click it.
Specify the number of clusters (between 2 and 50). If you do not specify a value, Tableau will automatically create up to 25 clusters.
When you finish customizing the cluster results, click the X in the upper-right corner of the Clusters dialog box to close it.
Note: You can move the cluster field from Color to another shelf in the view. However, you cannot move the cluster field from the Filters shelf to the Data pane.
Clustering is available in Tableau Desktop, but is not available for authoring on the web (Tableau Server, Tableau Online). Clustering is also not available when any of the following conditions apply:
When you are using a cube (multidimensional) data source.
When there is a blended dimension in the view.
When there are no fields that can be used as variables (inputs) for clustering in the view.
When there are no dimensions present in an aggregated view.
When any of those conditions apply, you will not be able to drag Cluster from the Analytics pane to the view.
In addition, the following field types cannot be used as variables (inputs) for clustering:
Generated latitude/longitude values
Measure Names/Measure Values
To edit an existing cluster, right-click (Control-click on a Mac) a Clusters field on Color and select Edit clusters.
To change the names used for each cluster, you will first need to drag the Clusters field to the Data pane and save it as a group. For details, see Create a group from cluster results.
Right-click the cluster group and select Edit Group to make changes to each cluster.
Select a cluster group in the list of Groups and click Rename to change the name.
If you drag a cluster to the Data pane, it becomes a group dimension in which the individual members (Cluster 1, Cluster 2, etc.) contain the marks that the cluster algorithm has determined are more similar to each other than they are to other marks.
After you drag a cluster group to the Data pane, you can use it in other worksheets.
Drag Clusters from the Marks card to the Data pane to create a Tableau group:
After you create a group from clusters, the group and the original clusters are separate and distinct. Editing the clusters does not affect the group, and editing the group does not affect the cluster results. The group has the same characteristics as any other Tableau group. It is part of the data source. Unlike the original clusters, you can use the group in other worksheets in the workbook. So if you rename the saved cluster group, that renaming is not applied to the original clustering in the view. See Correct Data Errors or Combine Dimension Members by Grouping Your Data.
Constraints on saving clusters as groups
You will not be able to save Clusters to the Data pane under any of the following circumstances:
When the measures in the view are disaggregated and the measures you are using as clustering variables are not the same as the measures in the view. For details, see How to Disaggregate Data.
When the Clusters you want to save are on the Filters shelf.
When Measure Names or Measure Values is in the view.
When there is a blended dimension in the view.
When you save a Clusters field as a group, it is saved with its analytic model. You can use your cluster groups in other worksheets and workbooks; however, they don't refresh automatically.
In this example, a saved cluster group and its analytic model have been applied to a different worksheet. As a result, some of the marks are not included in the clustering yet (indicated by gray marks).
If the underlying data changes, you can use the Refit option to refresh and recompute the data for a saved clusters group.
To refit a saved cluster
Right-click a clusters group in the Data pane, and then click Refit.
Here's an example of updated clustering after refitting the saved cluster:
When you refit saved clusters, new clusters will be created and existing aliases for each cluster group category will be replaced with new, generic cluster aliases. Be aware that refitting saved clusters may alter your visualizations that use existing clusters and aliases.
Cluster analysis partitions the marks in the view into clusters, where the marks within each cluster are more similar to one another than they are to marks in other clusters. Tableau distinguishes clusters using color.
Note: For additional insight into how clustering works in Tableau, see the blog post Understanding Clustering in Tableau 10.
The clustering algorithm
Tableau uses the k-means algorithm for clustering. For a given number of clusters k, the algorithm partitions the data into k clusters. Each cluster has a center (centroid) that is the mean value of all the points in that cluster. K-means locates centers through an iterative procedure that minimizes distances between individual points in a cluster and the cluster center. In Tableau, you can specify a desired number of clusters, or have Tableau test different values of k and suggest an optimal number of clusters (see Criteria used to determine the optimal number of clusters).
K-means requires an initial specification of cluster centers. Starting with one cluster, the method chooses a variable whose mean is used as a threshold for splitting the data in two. The centroids of these two parts are then used to initialize k-means to optimize the membership of the two clusters. Next, one of the two clusters is chosen for splitting and a variable within that cluster is chosen whose mean is used as a threshold for splitting that cluster in two. K-means is then used to partition the data into three clusters, initialized with the centroids of the two parts of the split cluster and the centroid of the remaining cluster. This process is repeated until a set number of clusters is reached.
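As a rough illustration of one splitting step, the sketch below divides a set of points at the mean of a single variable and uses the centroids of the two parts as initial centers. The source does not say how the splitting variable is chosen, so picking the variable with the largest variance is an assumption made only for this example. The names `split_at_mean` and `centroid` are hypothetical, and the points are made up.

```python
from statistics import mean, pvariance


def centroid(points):
    """Return the per-variable mean of a list of equal-length points."""
    return [mean(column) for column in zip(*points)]


def split_at_mean(points):
    """Split points in two, using the mean of one variable as the threshold.

    Assumption: the variable with the largest variance is used. The source
    text does not state which variable Tableau selects.
    """
    variances = [pvariance(column) for column in zip(*points)]
    var_index = variances.index(max(variances))
    threshold = mean(p[var_index] for p in points)
    low = [p for p in points if p[var_index] <= threshold]
    high = [p for p in points if p[var_index] > threshold]
    return low, high


# Made-up, already normalized points with two variables each.
points = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
low, high = split_at_mean(points)

# These two centroids would seed k-means for k = 2.
initial_centers = [centroid(low), centroid(high)]
```

Repeating the same step on one of the resulting clusters would give three starting centers for k = 3, and so on. The sketch does not handle the degenerate case where every point has the same value.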
Tableau uses Lloyd’s algorithm with squared Euclidean distances to compute the k-means clustering for each k. Combined with the splitting procedure to determine the initial centers for each k > 1, the resulting clustering is deterministic, with the result dependent only on the number of clusters.
The algorithm starts by picking initial cluster centers:
It then partitions the marks by assigning each to its nearest center:
Then it refines the results by computing new centers for each partition by averaging all the points assigned to the same cluster:
It then reviews the assignment of marks to clusters and reassigns any marks that are now closer to a different center than before.
The clusters are redefined and marks are reassigned iteratively until no more changes are occurring.
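The assign-and-average loop described above can be sketched in a few lines. This is a minimal, generic version of Lloyd's algorithm with squared Euclidean distances, not Tableau's implementation. The function names, the `max_iter` safeguard, and the sample points are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the point (first one wins on a tie)."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def lloyd(points, initial_centers, max_iter=100):
    """Alternate assignment and re-centering until assignments stop changing."""
    centers = [list(c) for c in initial_centers]  # do not modify the input
    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p, centers) for p in points]
        if new_assignment == assignment:
            break  # no mark changed cluster, so the result is stable
        assignment = new_assignment
        for i in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the previous center if a cluster is empty
                centers[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


# Made-up points and starting centers.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels, centers = lloyd(points, [[0.0, 0.0], [1.0, 1.0]])
```

With these inputs the loop settles after one update: the first two points share a cluster centered at (0.0, 0.5), and the last two share a cluster centered at (5.5, 5.0).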
Tableau uses the Calinski-Harabasz criterion to assess cluster quality. The Calinski-Harabasz criterion is defined as
(SSB / (k - 1)) / (SSW / (N - k))
where SSB is the overall between-cluster variance, SSW the overall within-cluster variance, k the number of clusters, and N the number of observations.
The greater the value of this ratio, the more cohesive the clusters (low within-cluster variance) and the more distinct/separate the individual clusters (high between-cluster variance).
Since the Calinski-Harabasz index is not defined for k=1, it cannot be used to detect one-cluster cases.
If a user does not specify the number of clusters, Tableau picks the number of clusters corresponding to the first local maximum of the Calinski-Harabasz index. By default, k-means will be run for up to 25 clusters if the first local maximum of the index is not reached for a smaller value of k. You can set a maximum value of 50 clusters.
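The selection rule can be illustrated with a short sketch. It uses the standard form of the Calinski-Harabasz index, (SSB / (k - 1)) / (SSW / (N - k)), and treats "first local maximum" as the smallest k whose index is greater than the index for k + 1. That interpretation, the function names, and the index values are assumptions for illustration.

```python
def calinski_harabasz(ssb, ssw, n, k):
    """Standard Calinski-Harabasz index for k clusters over n observations."""
    return (ssb / (k - 1)) / (ssw / (n - k))


def first_local_maximum(scores):
    """Return the first k whose index value exceeds the value for the next k.

    scores is a list of (k, index value) pairs sorted by k. If the index
    never decreases, the largest k that was tested is returned.
    """
    for (k, value), (_, next_value) in zip(scores, scores[1:]):
        if value > next_value:
            return k
    return scores[-1][0]


# Made-up index values for k = 2 through 5.
scores = [(2, 10.0), (3, 14.0), (4, 12.0), (5, 20.0)]
chosen_k = first_local_maximum(scores)

example_index = calinski_harabasz(ssb=80.0, ssw=20.0, n=22, k=2)
```

Here `chosen_k` is 3, even though k = 5 has a higher index value, because the rule stops at the first peak rather than searching for the global maximum.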
Note: If a categorical variable (that is, a dimension) has more than 25 unique values, then Tableau will disregard that variable when computing clusters.
What values get assigned to the "Not Clustered" category?
When there are null values for a measure, Tableau assigns the rows with null values to the Not Clustered category. Categorical variables (that is, dimensions) that return * for ATTR (meaning that not all values are identical) are also not clustered.
Tableau scales values automatically so that columns having a larger range of magnitudes don’t dominate the results. For example, an analyst could be using inflation and GDP as input variables for clustering, but because GDP values are in trillions of dollars, this could cause the inflation values to be almost completely disregarded in the computation. Tableau uses a scaling method called min-max normalization, in which the values of each variable are mapped to a value between 0 and 1 by subtracting the variable's minimum and dividing by its range.
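The rule for the "Not Clustered" category can be pictured as a simple eligibility check. The sketch below is only an illustration of the two conditions described above. The function name `can_be_clustered`, the dictionary layout, and the sample marks are hypothetical.

```python
NOT_CLUSTERED = "Not Clustered"


def can_be_clustered(row, measures, dimensions):
    """Return True when a mark has usable values for every clustering variable."""
    if any(row[name] is None for name in measures):
        return False  # a null measure sends the mark to Not Clustered
    if any(row[name] == "*" for name in dimensions):
        return False  # ATTR returned *, so the dimension has mixed values
    return True


# Made-up marks: B has a null measure, C has a dimension that returns *.
marks = [
    {"Country": "A", "Life Expectancy": 81.2, "Region": "Europe"},
    {"Country": "B", "Life Expectancy": None, "Region": "Asia"},
    {"Country": "C", "Life Expectancy": 74.5, "Region": "*"},
]

status = {
    m["Country"]: (
        "eligible"
        if can_be_clustered(m, ["Life Expectancy"], ["Region"])
        else NOT_CLUSTERED
    )
    for m in marks
}
```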
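A small sketch shows the effect of min-max normalization on two variables with very different magnitudes. The function name and the sample values are made up. Mapping a constant column to 0 is an assumption, since the source does not say how that case is handled.

```python
def min_max_normalize(values):
    """Map values to the range 0..1 by subtracting the minimum and dividing by the range."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Made-up inputs: GDP in dollars and inflation in percent.
gdp = [1.0e12, 3.0e12, 5.0e12]
inflation = [1.0, 2.0, 5.0]

scaled_gdp = min_max_normalize(gdp)
scaled_inflation = min_max_normalize(inflation)
```

After scaling, both variables span the same 0 to 1 range (`[0.0, 0.5, 1.0]` and `[0.0, 0.25, 1.0]`), so neither dominates a distance calculation simply because of its units.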
The Describe Clusters dialog box provides information about the models that Tableau computed for clustering. You can use these statistics to assess the quality of the clustering.
When the view includes clustering, you can open the Describe Clusters dialog box by right-clicking Clusters on the Marks card (Control-clicking on a Mac) and choosing Describe Clusters. The information in the Describe Clusters dialog box is read-only, though you can click Copy to Clipboard and then paste the screen contents into a writeable document.
Describe Clusters – Summary Tab
The Summary tab identifies the inputs that were used to generate the clusters and provides some statistics that characterize the clusters.
Inputs for Clustering
Identifies the fields Tableau uses to compute clusters. These are the fields listed in the Variables box in the Clusters dialog box.
Level of Detail
Identifies the fields that are contributing to the view’s level of detail—that is, the fields that determine the level of aggregation. For details, see How dimensions affect the level of detail in the view.
Scaling
Identifies the scaling method used for pre-processing. Normalized is currently the only scaling method Tableau uses. The formula for this method, also known as min-max normalization, is
(x - min(x))/(max(x) - min(x)).
Number of Clusters
The number of individual clusters in the clustering.
Number of Points
The number of marks in the view.
Between-group sum of squares
A metric quantifying the separation between clusters as a sum of squared distances between each cluster’s center (average value), weighted by the number of data points assigned to the cluster, and the center of the data set. The larger the value, the better the separation between clusters.
Within-group sum of squares
A metric quantifying the cohesion of clusters as a sum of squared distances between the center of each cluster and the individual marks in the cluster. The smaller the value, the more cohesive the clusters.
Total sum of squares
Totals the between-group sum of squares and the within-group sum of squares. The ratio (between-group sum of squares)/(total sum of squares) gives the proportion of variance explained by the model. Values are between 0 and 1; larger values typically indicate a better model. However, you can increase this ratio just by increasing the number of clusters, so it could be misleading if you compare a five-cluster model with a three-cluster model using just this value.
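The three sums of squares can be computed directly from their descriptions. The following sketch is a generic calculation on made-up, one-variable data, not Tableau's code. The function names are hypothetical.

```python
def mean_point(points):
    """Per-variable mean of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sums_of_squares(points, labels):
    """Return (between-group, within-group, total) sums of squares."""
    overall_center = mean_point(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    between = 0.0
    within = 0.0
    for members in clusters.values():
        center = mean_point(members)
        # Distance from cluster center to data center, weighted by cluster size.
        between += len(members) * squared_distance(center, overall_center)
        # Distances from each mark to its own cluster center.
        within += sum(squared_distance(p, center) for p in members)
    return between, within, between + within


# Made-up data: two clusters on a single variable.
points = [[0.0], [2.0], [8.0], [10.0]]
labels = [0, 0, 1, 1]
between, within, total = sums_of_squares(points, labels)
explained = between / total
```

For this data the between-group value is 64.0, the within-group value is 4.0, and the total is 68.0, so about 94% of the variance is explained by the two clusters.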
For each cluster in the clustering, the following information is provided.
The number of marks within the cluster.
The average value within each cluster (shown for numeric items).
The most common value within each cluster (only shown for categorical items).
Describe Clusters – Models Tab
Analysis of variance (ANOVA) is a collection of statistical models and associated procedures useful for analyzing variation within and between observations that have been partitioned into groups or clusters. In this case, analysis of variance is computed per variable, and the resulting analysis of variance table can be used to determine which variables are most effective for distinguishing clusters.
Relevant analysis of variance statistics for clustering include:
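The F-statistic for one-way, or single-factor, ANOVA compares the variance in a variable that is explained by the clusters with the variance that is left unexplained. It is the ratio of the between-group variance (Model Sum of Squares) to the within-group variance (Error Sum of Squares).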
The larger the F-statistic, the better the corresponding variable is distinguishing between clusters.
The p-value is the probability that the F-distribution of all possible values of the F-statistic takes on a value greater than the actual F-statistic for a variable. If the p-value falls below a specified significance level, then the null hypothesis (that the individual elements of the variable are random samples from a single population) can be rejected. The degrees of freedom for this F-distribution are (k - 1, N - k), where k is the number of clusters and N is the number of items (rows) clustered.
The lower the p-value, the more the expected values of the elements of the corresponding variable differ among clusters.
Model Sum of Squares and Degrees of Freedom
The Model Sum of Squares is the ratio of the between-group sum of squares to the model degrees of freedom. The between group sum of squares is a measure of the variation between cluster means. If the cluster means are close to each other (and therefore close to the overall mean), this value will be small. The model has k-1 degrees of freedom, where k is the number of clusters.
Error Sum of Squares and Degrees of Freedom
The Error Sum of Squares is the ratio of within-group sum of squares to the error degrees of freedom. The within-group sum-of-squares measures the variation between observations within each cluster. The error has N-k degrees of freedom, where N is the total number of observations (rows) clustered and k is the number of clusters.
The Error Sum of Squares can be thought of as the overall Mean Square Error, assuming that each cluster center represents the "truth" for each cluster.
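To make these quantities concrete, the sketch below computes them for a single variable. It follows the definitions above for Model Sum of Squares and Error Sum of Squares, and it assumes the conventional one-way ANOVA F-statistic, which is the first divided by the second. The function name and the data are hypothetical.

```python
def anova_for_variable(values, labels):
    """Return (model, error, F) for one variable across clusters.

    model = between-group sum of squares / (k - 1)
    error = within-group sum of squares / (N - k)
    F     = model / error  (conventional one-way ANOVA; assumed here)
    """
    n = len(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    k = len(groups)

    overall_mean = sum(values) / n
    between = sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)

    model = between / (k - 1)
    error = within / (n - k)
    return model, error, model / error


# Made-up values for one variable in two clusters.
values = [0.0, 2.0, 8.0, 10.0]
labels = [0, 0, 1, 1]
model, error, f_stat = anova_for_variable(values, labels)
```

The p-value is not computed here. With SciPy available, it could be obtained from the F-distribution survival function, for example `scipy.stats.f.sf(f_stat, k - 1, n - k)`.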
The Tableau clustering feature partitions marks in the view into clusters, where the marks within each cluster are more similar to one another than they are to marks in other clusters. This example shows how a researcher might use clustering to find an optimal set of marks (in this case, countries) in a data source.
As life expectancy increases around the world, and as older people remain more active, senior tourism can be a lucrative market for companies that know how to find and appeal to potential customers. The World Indicators sample data set that comes with Tableau contains the kind of data that might help companies identify the countries where there are enough of the right kind of customers.
Finding the right countries
Here is an example of how Tableau clustering could help such a company identify the countries where a senior tourism business could succeed. Imagine you are the analyst. Here is how you might proceed.
Open the World Indicators sample data source in Tableau Desktop.
Double-click Country in the Data pane.
Tableau automatically creates a map view, with a mark in each country.
On the Marks card, change the mark type to Map:
You should now see a map projection where all countries are filled with a solid color:
The next step is to identify the fields that you will use as variables for clustering. Here are the fields you choose:
Field | Reason for inclusion
Life Expectancy Female and Life Expectancy Male | Where people are living longer, there are more likely to be people who are interested in traveling later in life.
Population Urban | It is easier to market services in areas with greater population density.
Population 65+ | The target population is older residents with the time and funds to travel.
TourismPerCapita | The average amount each resident spends on international travel (a calculated field, defined next).
TourismPerCapita is a measure that you must create as a named calculated field. The formula is:
SUM([Tourism Outbound])/SUM([Population Total])
Tourism Outbound aggregates the money (in US dollars) that residents of a country spend annually on international travel. But this total must be divided by the population of each country to determine the average amount each resident spends on international travel.
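The arithmetic behind this calculated field is a ratio of two sums. As a minimal sketch outside Tableau, with made-up figures and a hypothetical function name, it could look like this:

```python
def tourism_per_capita(rows):
    """Ratio of summed outbound tourism spending to summed population.

    rows is a list of (tourism_outbound_usd, population_total) pairs for
    one country, mirroring SUM([Tourism Outbound])/SUM([Population Total]).
    """
    total_outbound = sum(outbound for outbound, _ in rows)
    total_population = sum(population for _, population in rows)
    return total_outbound / total_population


# Made-up rows for a single country.
rows = [(2.0e9, 1.0e6), (4.0e9, 2.0e6)]
per_capita = tourism_per_capita(rows)
```

With these figures the result is 2000.0 dollars per resident. Dividing the sums, rather than averaging per-row ratios, matches the way the formula aggregates each field before dividing.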
There is no guarantee that these are the ideal fields to choose, or that these fields will produce cluster results that are clear and unambiguous. Clustering is an iterative process—experimentation leads to discovery which leads, in turn, to more experimentation.
Drag these five fields from the Data pane to Detail on the Marks card.
Click to open the Analytics pane:
Drag Cluster from the Analytics pane and drop it in the view:
Tableau displays the Clusters dialog box and adds the measures in the view to the list of variables:
It also updates the view by adding clusters to Color. In this case, Tableau finds two distinct clusters, and is unable to assign certain countries (colored reddish-pink) to either cluster:
Note: See How clustering works for details on data that Tableau assigns to "Not clustered."
You decide that two clusters isn't enough: you don't have the resources to set up shop in half the countries in the world. So you type 4 in the Number of Clusters field in the Clusters dialog box.
The map becomes more interesting:
But how do these clusters relate to the variables you have chosen? Which one correlates best with the factors that support senior tourism? It's time to look at the statistics behind the clusters.
Close the Clusters dialog box by clicking the X in its upper-right corner:
Click the Clusters field on the Marks card and choose Describe Clusters.
The table at the bottom of the Summary tab in the Describe Clusters dialog box shows the average value for each variable in each cluster:
Cluster 4 has the highest life expectancy (both male and female), the highest concentration of urban population, and the highest expenditure for international tourism: $1360.40 per capita. The only variable for which Cluster 4 does not have the highest value is Population 65+, where Cluster 3 has the advantage: 0.15493 (about 15.5%) to 0.11606 (just over 11%) in Cluster 4.
The clustering algorithm does not know whether you are looking for the maximum value for these variables, the minimum value, or something in the middle; it just looks for correlation. But you know that higher values for these variables are the signal you're looking for, and Cluster 4 is the best choice.
You could attempt to pick out the Cluster 4 countries from the map, but there is an easier way. Close the Describe Clusters dialog box and then click Cluster 4 on the Color legend and choose Keep Only.
Choose Text Table from Show Me.
You now see a list of the countries in Cluster 4:
This list is not the end of the process. You might try clustering again with a somewhat different set of variables and maybe a different number of clusters, or you might add some countries to the list and remove others, based on other factors. For example, if your tours are mostly to tropical locales, you might remove countries like Curacao and the Bahamas from the list, because tropical tours might not appeal to residents of those countries.
Another option is to filter your data before you re-cluster, to only show countries with populations above a certain threshold, or to target countries in a particular geographical area.