Find Good Data Sets
A good way to learn how to use Tableau Desktop (or to build sample or proof-of-concept content) is to work with a data set you find interesting. When you have real questions you want to answer with data, the steps of the analysis become easier and more meaningful.
The reality of data sets
There are two unavoidable facts about trying to find a data set that's not official, business-sanctioned data.
You won’t find what you're looking for.
- Try to avoid strict expectations of what you need.
- Stay flexible and open minded about what you can use for a given project.
- Sometimes the data you want is behind a paywall—decide if it's worth it or not.
You’ll have to clean up the data.
- Be prepared for basic cleaning and shaping to make sure the data is well structured for analysis.
- You may need to bring in other data sets.
- Having a data dictionary or metadata can be vital.
- Calculations may be necessary.
What makes a good data set
A good data set is one that suits your purpose. As long as that need is met, it's a good data set. However, there are some considerations that can help you weed out data sets that are unlikely to suit your purpose. On the whole, look for data sets that meet the following conditions:
- Contain the elements you need
- Are disaggregated data
- Have at least a couple of dimensions and a couple of measures
- Have good metadata or a data dictionary
- Are usable (not in a proprietary format, not too messy, and not too cumbersome)
Superstore is one of the sample data sources that come with Tableau Desktop. Why is it such a good data set?
- Necessary elements: Superstore has dates, geographic data, fields with a hierarchy relationship (Category, Sub-Category, Product), measures that are positive and negative (Profit), etc. There are very few chart types you can't make with Superstore alone, and few features it can't be used to demonstrate.
- Disaggregated: The row-level data is each item in a transaction. Those items can be rolled up to the order level (via Order ID) or by any of the dimensions (such as date, customer, or region).
- Dimensions and measures: Superstore has several dimensions allowing us to "slice and dice" by things like category or city. There are also multiple measures and dates, which open the possibilities for chart types and calculations.
- Metadata: Superstore has well-named fields and values. You don’t need to look up what any values mean.
- Small and clean: Superstore is only a few megabytes so it takes up little room in the Tableau installer. It's also clean data, with only the correct values in each field and a good data structure.
1. A good data set has the elements you need for your purposes
If you're looking for a data set to build a specific visualization or to showcase specific functionalities, make sure the data set has the types of fields you need. For example, maps are a great visual but require geographic data. Basic demos often involve drilling down into dates, so the data would need at least one date field (and it would need to be more granular than just year to show drill down). Not all data sets need all these elements—know what you need for your purpose and don't waste time with data sets that are missing key elements.
Common elements for analysis:
- Dates
- Geographic data
- Hierarchical data
- "Interesting" measures—either substantial variation in magnitude or positive and negative values
The following features and viz types may require data with specific characteristics:
- Clusters
- Forecasting
- Trend lines
- User filters
- Spatial calculations
- Certain calculations
- Bullet charts
- Control charts
2. A good data set is disaggregated (raw) data
If the data is too aggregated, there isn't much you can do for analysis. For example, if you want to look at trends in people Googling "Pumpkin Spice" but have yearly data, you can only look at a very high-level overview. Ideally you want to get your hands on daily data, so you could see the huge spike when Starbucks starts offering #PSL.
What counts as disaggregated can vary by analysis. Note that due to privacy or practicality, some data sets will never be fully granular. For example, you would be unlikely to find a data set with case-by-case reporting of malaria by address, so monthly totals by region might be granular enough.
Understanding aggregation and granularity is critical for many reasons; it affects things like finding useful data sets, building the visualization you want, combining data correctly, and using LOD expressions. Aggregation and granularity are opposite ends of a spectrum.
Aggregation refers to how the data is combined, such as summing all the searches for Pumpkin Spice or taking the average of all the temperature readings around Seattle on a given day.
- By default, measures in Tableau are aggregated. The default aggregation is SUM. You can change the aggregation to things like average, median, count distinct, minimum, etc.
Granularity refers to how detailed the data is. What does a row (aka record) in the data set represent? A person with malaria? A province's total cases of malaria for the month? That's the granularity. Knowing the granularity of the data is crucial.
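To make the difference concrete, here is a minimal Python sketch (not Tableau code). The daily search counts are made-up illustrative numbers, and the names `daily_rows`, `total_by_year`, and `average_by_month` are hypothetical. Each row is one day (the granularity); rolling the rows up by year or by month is aggregation.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up daily rows: (day, number of searches). Granularity: one row per day.
daily_rows = [
    (date(2023, 8, 30), 10),
    (date(2023, 8, 31), 12),
    (date(2023, 9, 1), 90),   # hypothetical spike
    (date(2023, 9, 2), 70),
]

def total_by_year(rows):
    """SUM aggregation at the year level: one value per year."""
    totals = defaultdict(int)
    for day, searches in rows:
        totals[day.year] += searches
    return dict(totals)

def average_by_month(rows):
    """AVG aggregation at the month level: one value per (year, month)."""
    groups = defaultdict(list)
    for day, searches in rows:
        groups[(day.year, day.month)].append(searches)
    return {key: mean(values) for key, values in groups.items()}

print(total_by_year(daily_rows))     # a single yearly total hides the spike
print(average_by_month(daily_rows))  # the jump between months is visible
```

If only the yearly total had been published, the monthly and daily views could not be rebuilt from it, which is why disaggregated data gives you more options.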
For more information, see Data Aggregation in Tableau.
3. A good data set has dimensions and measures
Many visualization types require dimensions and measures.
- If you only have dimensions, you're mostly limited to counting, calculating percentages, or using the Count of Table field.
- If you only have measures, you can't break out the values by anything. You can disaggregate the data entirely or work with the overall SUM or AVG, etc.
Which isn't to say a data set with only dimensions can't be useful. Demographic data is an example of dimension-heavy data, and much analysis around demographics is counting or percentage-based. But for a more analytically rich data set, you want at least a few dimensions and measures.
[Figure: examples of a numeric dimension, a continuous measure, and a discrete measure]
Dimensions and Measures
Fields are broken out into dimensions and measures with a horizontal line in the data pane. In Tableau, dimensions come out into the view as themselves, whereas measures are automatically aggregated; the default aggregation for a measure is SUM.
- Dimensions are qualitative, meaning they are described, not measured.
- Dimensions are often things like city or country, eye color, category, team name, etc.
- Dimensions are usually discrete.
- Measures are quantitative, meaning they can be measured and recorded (numeric).
- Measures are often things like sales, height, number of clicks, etc.
- Measures are usually continuous.
If you could do math with a field, it should be a measure. If you're ever not sure whether a field should be a measure or a dimension, think about whether you can do meaningful math with the values. Is there any meaning to AVG(RowID), the sum of two Social Security numbers, or dividing a postal code by 10? No. Those are dimensions that happen to be written as numbers. Think about how many countries have alphanumeric postal codes: they're just labels, even though in the US they're only numeric. Tableau can recognize many field names that indicate a numeric field is actually an ID or a postal code and tries to make those dimensions, but it's not perfect. Use the "could I do math with this?" test to decide if a numeric field should be a measure or dimension and rearrange the data pane as necessary.
Note: Although you can do math with dates (such as the DATEDIFF calculation), the standard convention is to categorize dates as dimensions.
Discrete and Continuous
Discrete and continuous fields are somewhat aligned with the concepts of dimension and measure, but they’re not identical.
- Discrete fields contain distinct values. They make headers or labels in the view and the pills are blue.
- Continuous fields "form an unbroken whole". They make an axis in the view and the pills are green.
A good way of understanding discrete and continuous is to look at a date field. Dates can be either discrete OR continuous.
- Looking at average temperatures in August across a decade or century means "August" is being used as a discrete, qualitative date part.
- Looking at the overall trend in reported malaria cases since 1960 would take a single, unbroken axis, meaning the date is being used as a continuous, quantitative value.
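The two ways of using a date can be sketched in Python (this is an illustration, not how Tableau implements dates). The temperature readings are made up, and `by_month_part` and `by_month_value` are hypothetical helper names. Grouping by the month number alone treats the date as a discrete date part, while truncating each date to the start of its month keeps every month in its own position on a timeline.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up readings: (date, temperature). Values are illustrative only.
readings = [
    (date(2020, 8, 10), 24.0),
    (date(2021, 8, 12), 26.0),
    (date(2021, 9, 5), 20.0),
]

def by_month_part(rows):
    """Discrete date part: every August lands in the same bucket (8)."""
    groups = defaultdict(list)
    for day, temp in rows:
        groups[day.month].append(temp)
    return {month: mean(temps) for month, temps in groups.items()}

def by_month_value(rows):
    """Continuous date value: each date is truncated to its own month on a timeline."""
    groups = defaultdict(list)
    for day, temp in rows:
        groups[day.replace(day=1)].append(temp)
    return {start: mean(temps) for start, temps in sorted(groups.items())}

print(by_month_part(readings))   # two buckets: August (both years combined) and September
print(by_month_value(readings))  # three points in time order
```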
For more information, see Dimensions and Measures, Blue and Green.
Tableau creates at least three fields, no matter what the data set is:
- Measure Names (a dimension)
- Measure Values (a measure)
- TableName(Count) (a measure)
And if there are geographic fields in the data set, Tableau also creates Latitude (generated) and Longitude (generated) fields.
Measure Names and Measure Values are two useful fields. For more information, see Measure Values and Measure Names.
Count of Table provides the number of records for the table by counting the rows. This enables you to have at least one measure in your data set and can help with some analysis. You must understand the granularity of your data (what a row represents) to be able to define what the number of rows means.
For example, if each row is a day, the Count of Table is the number of days. If each row is a month, the Count of Table is the number of months.
4. A good data set has metadata or a data dictionary
A data set can only be useful if you know what the data is. There are few things more frustrating in the hunt for good data than opening a file whose field names and values are unexplained codes, such as a microbiome data set with a numerically coded Source field and fields with names like OTU3.
A good data set is one that has either well-labeled fields and members or a data dictionary so you can relabel the data yourself. Think of Superstore: it's immediately obvious what the fields and their values are, such as Category and its members Technology, Furniture, and Office Supplies. Or, for the microbiome data set, there is a data dictionary that explains each Source (4 is feces and 12 is stomach) and the taxonomy of each OTU (OTU3 is a bacterium of the genus Parabacteroides).
Data dictionaries can also be called metadata, indicators, variable definitions, glossaries, or any number of other things. At the end of the day, a data dictionary provides information about column names and members in a column. That information can be brought into the data source or viz in several ways, including:
- Rename the columns so they are easier to understand (this can be done in the data set itself or in Tableau).
- Re-alias the members of the field (this can be done in the data set itself or in Tableau).
- Create calculations to add the data dictionary information.
- Comment on the field in Tableau (comments do not appear on published vizzes, only in the authoring environment).
- Use the data dictionary as another data source and combine the two data sources.
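As a rough illustration of the last option, combining a data dictionary with the data behaves like a lookup. The sketch below is plain Python, not Tableau's join or relationship logic. The two Source codes come from the example above; the data rows, the counts, the unmatched code 7, and the function name `add_labels` are made up for illustration.

```python
# Data dictionary entries taken from the example in the text.
source_dictionary = {4: "feces", 12: "stomach"}

# Made-up data rows with coded Source values; the counts are illustrative only.
rows = [
    {"Source": 4, "OTU3": 17},
    {"Source": 12, "OTU3": 2},
    {"Source": 7, "OTU3": 5},   # hypothetical code with no dictionary entry
]

def add_labels(data_rows, dictionary, code_field, label_field):
    """Left-join-style lookup: keep every row, add a label where one exists."""
    labeled = []
    for row in data_rows:
        new_row = dict(row)  # copy so the original coded value is preserved
        new_row[label_field] = dictionary.get(row[code_field], "Unknown")
        labeled.append(new_row)
    return labeled

for row in add_labels(rows, source_dictionary, "Source", "Source Label"):
    print(row)
```

Keeping the coded field next to the new label means both stay available for sorting or further calculations.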
Losing a data dictionary can render a data set useless. If you're bookmarking a data set, bookmark the data dictionary, too. If you're downloading, download both and keep them in the same place.
5. A good data set is one you can use
As long as you can understand the data set and it has the information you need, even a small data set can pack a punch for analysis. Smaller data sets are also easy to store, share, and publish, and are likely to perform well.
Similarly, even if you find the "perfect" data set for your needs, if it requires an unrealistic amount of effort to clean up, it’s not perfect after all. Knowing when to walk away from a data set that is too messy is important.
For example, consider a data set taken from a Wikipedia article on relative letter frequencies. It started as 84 rows and 16 columns (pivoted to be 1,245 rows and 3 columns). The Excel file is 16 KB. But with some groups, sets, calculations, and other manipulations, it enables robust analysis and interesting visuals.
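The pivot mentioned above (many columns turned into a few columns and many rows) can be sketched in Python. This is only an illustration of the reshaping idea: the table below is made up, its numbers are placeholders rather than real letter frequencies, and `pivot_longer` is a hypothetical helper. Dropping empty cells is an assumption of this sketch; the text does not say how the 1,245 rows were produced.

```python
# Made-up wide table: one row per letter, one column per language.
# The numbers are placeholders, not real frequencies.
wide_rows = [
    {"Letter": "a", "English": 8.0, "French": 7.5, "German": None},
    {"Letter": "b", "English": 1.5, "French": None, "German": 2.0},
]

def pivot_longer(rows, id_field, name_field, value_field):
    """Turn each non-empty (row, column) cell into its own row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field or value is None:
                continue  # skip the identifier column and empty cells
            long_rows.append(
                {id_field: row[id_field], name_field: column, value_field: value}
            )
    return long_rows

long_rows = pivot_longer(wide_rows, "Letter", "Language", "Frequency")
print(len(long_rows), "rows x", len(long_rows[0]), "columns")
```

After the pivot, Language becomes a single dimension and Frequency a single measure, which is generally easier to analyze than one column per language.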
Relabel your data
Once you find a good data set, you’ll often need to relabel it. Relabeling data can be useful either to create fake data for samples or proofs of concept, or to make the data more readable.
Renaming a field changes how that field appears in Tableau, such as renaming "Sales" to "Pipeline Sales" or "State" to "Province".
Re-aliasing changes how the members of a field are displayed, such as re-aliasing values in a Country field so that CHN becomes China and RUS becomes Russia.
- The values in a discrete dimension field are called members. Only members can be re-aliased. Consider a measure field for temperature. A value of 54°F can't be changed without changing data itself. But re-aliasing the member "CHN" as "China" in a Country field is the same information, just labeled in another way.
Renaming and re-aliasing mean almost the same thing. It’s the convention in Tableau that fields are named and members are aliased. For more information, see Organize and Customize Fields in the Data Pane and Create Aliases to Rename Members in the View.
Note: Renaming or re-aliasing only changes the appearance in Tableau Desktop; no changes are written back to the underlying data.
Relabel to make fake data
Relabeling existing data sets is a great way to make samples or proof-of-concept content more compelling.
- Use an easy data set (like Superstore) to build what you want (a specific chart type, showing off certain functionality, etc.).
- Rename the relevant fields, change tooltips, and otherwise change the textual aspects to mask what the data actually represents.
Important: Only do this when it's clear the information is fake. Be careful that people don't think it's real data and try to use it for analysis. For example, use silly names or meaningless field names like colors or animals.
Re-alias to make the data easier to use
It's more efficient to store the data as numeric values rather than string values, though numeric encoding can make the data harder to understand. For small data sets the performance difference is probably negligible, so prioritize being able to understand the data easily.
A downside to re-aliasing is that you no longer have access to those numeric values (making it harder to do things like sort, assign color gradients, etc.). Consider duplicating the field and re-aliasing the copy. Alternatively, a calculation in Tableau can be a good way to preserve the original information while also making it more easily understandable.
Re-alias with the CASE function
Calculations can be very powerful for re-aliasing. For example, CASE functions allow you to say, essentially, "when this field has a value of A, give me X. When the value is B, give me Y".
Here, the CASE function looks at the F-scale in a tornado data set and provides the written description associated with each numeric value:
CASE [F-scale]
WHEN "0" THEN "Some damage to chimneys; branches broken off trees; shallow-rooted trees pushed over; sign boards damaged."
WHEN "1" THEN "The lower limit is the beginning of hurricane wind speed; peels surface off roofs; mobile homes pushed off foundations or overturned; moving autos pushed off the roads..."
WHEN "2" THEN "Roofs torn off frame houses; mobile homes demolished; boxcars overturned; large trees snapped or uprooted; highrise windows broken and blown in; light-object missiles generated."
WHEN "3" THEN "Roofs and some walls torn off well-constructed houses; trains overturned; most trees in forest uprooted; heavy cars lifted off the ground and thrown."
WHEN "4" THEN "Well-constructed houses leveled; structures with weak foundations blown away some distance; cars thrown and large missiles generated."
WHEN "5" THEN "Strong frame houses lifted off foundations and carried considerable distances to disintegrate; ... trees debarked; steel reinforced concrete structures badly damaged."
END
Now we can choose to use either the original "F-scale" field (0-5) or the "F-scale damage description" field in the viz.
Tips when looking for data sets
Note: Try to make sure you can answer the question "What does a row (aka record) in the data set represent?" If you can't articulate that, you might not understand the data well enough to be able to use it or it might be structured poorly for analysis.
- Keep track of where the data came from.
- Keep the data dictionary information with the data itself.
- Avoid stale data if you need the content to stay evergreen. Look for:
  - updatable data (stocks, weather, regularly published reports, etc.)
  - timeless data (the average mass of various animals isn’t going to change from year to year)
  - data you can future-proof by artificially changing to historical or future dates
- Try simply Googling what you’re looking for; you might be surprised.
- Don’t be afraid to give up on a data set if it’s too much work to prep.
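One way to future-proof dates, as mentioned in the tips above, is to shift every date by the same offset so that the most recent record lands on a date you choose. The following Python sketch shows the idea; the order dates are made up and `shift_dates` is a hypothetical helper, not a Tableau feature.

```python
from datetime import date

def shift_dates(dates, new_latest):
    """Shift all dates by the same offset so the most recent one equals new_latest."""
    offset = new_latest - max(dates)
    return [d + offset for d in dates]

# Made-up order dates for illustration.
order_dates = [date(2019, 1, 5), date(2019, 3, 20), date(2019, 12, 31)]
shifted = shift_dates(order_dates, new_latest=date(2030, 12, 31))
print(shifted)
```

A uniform offset preserves the gaps between records. It does not preserve the day of the week unless the offset is a multiple of seven days, so weekday patterns may need separate handling. In practice, `new_latest` could be `date.today()` so the data looks current whenever it is refreshed.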
Places to look for data
Where can you look for data? There is a potentially overwhelming number of places to find data sets. Here are some options to get you started. Note that the reality of data sets does apply to these sites: you probably won't find what you're thinking of right now, and you will most likely need to do some cleaning to get the data ready for analysis.
Disclaimer: Although we make every effort to ensure these links to external websites are accurate, up to date, and relevant, Tableau cannot take responsibility for the accuracy or freshness of pages maintained by external providers. Listing a site here is not an endorsement of any content or organization. Contact the external site for answers to questions regarding its content.
Tableau Public: Tableau Public is an amazing resource for Tableau-friendly data sets. Search for workbooks that are on a topic you're interested in, browse for inspiration, then download the workbook to access the data. Or check out the curated Sample Data.
Wikipedia tables: Get data out of Wikipedia tables by: copying and pasting into a spreadsheet, copying and pasting directly into Tableau, or using Google Sheets and the IMPORTHTML function to create a Google spreadsheet of the data.
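If you prefer a script to copying and pasting, the same idea (pulling the rows out of an HTML table) can be sketched with Python's standard library. This is an alternative not described above, shown under simplifying assumptions: the HTML is a made-up stand-in for a table saved from a web page, `TableExtractor` is a hypothetical class, and real tables with nested markup or merged cells will likely need more handling.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row currently being read
        self._cell = None   # text pieces of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up HTML standing in for a table saved from a web page.
sample_html = """
<table>
  <tr><th>Letter</th><th>Frequency</th></tr>
  <tr><td>a</td><td>8.0</td></tr>
  <tr><td>b</td><td>1.5</td></tr>
</table>
"""

parser = TableExtractor()
parser.feed(sample_html)
print(parser.rows)
```

The resulting list of rows can be written to a CSV file and opened in Tableau like any other text file.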
Google Dataset Search: "A search engine to unite the fragmented world of online datasets."
Data is Plural: Subscribe for a weekly newsletter with data sets, or browse the archive.
Makeover Monday: "Join us every Monday to work with a given data set and create better, more effective visualizations and help us make information more accessible." You can see what other people have done with the same data set, kickstarting your analysis or giving inspiration. Use #makeovermonday on Twitter to participate.
Other sites
- Tableau Web Data Connectors
- Data.world and its WDC for Tableau
- GitHub Open Data
- Kaggle
- datahub.io
- r/datasets
- WHO
- Data.UN.org
- WorldBank
- data.gov, data.gov.au, data.gov.uk, etc.
- Airbnb
- Yelp
- Zillow