This article describes how to connect Tableau to .pdf file data and set up the data source.
Note: Tableau doesn’t support right-to-left (RTL) languages. If your PDF includes RTL text, characters might display in reverse order in Tableau.
After you open Tableau, under Connect, click PDF File.
Select the file you want to connect to, and then click Open.
In the Scan PDF File dialog box, specify the pages in the file that you want Tableau to scan for tables. You can choose to scan for tables in all pages, just a single page, or a range of pages.
Note: The scan counts the first page of the file as page 1, similar to most PDF readers. When you scan for tables, specify the page number that the PDF reader displays and not the page number that might be used in the document itself, which may or may not start from page 1.
For example, suppose you want to use "Table 1" from the image below. The PDF reader displays a number, and the .pdf file displays a different number. To correctly scan for this table, specify the page number that the PDF reader displays. In this example, you specify page 15.
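The relationship between the two numbering schemes is simple arithmetic. The sketch below is a minimal Python illustration. It assumes the document's printed numbering starts at 1 immediately after a known number of unnumbered leading pages (cover, table of contents, and so on). The function name and the page counts are hypothetical and are not part of Tableau.

```python
def reader_page(printed_page, unnumbered_leading_pages):
    """Convert a page number printed in the document to the page number
    that a PDF reader displays (the reader counts the first page as 1)."""
    if printed_page < 1 or unnumbered_leading_pages < 0:
        raise ValueError("page numbers must be positive")
    return printed_page + unnumbered_leading_pages


# Hypothetical document: 4 unnumbered front-matter pages, and the table
# sits on the page printed as "11". The scan needs the reader's number.
print(reader_page(11, 4))  # 15
```

If your document numbers its pages differently (for example, roman numerals for front matter), check the page number in the PDF reader directly instead.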
On the data source page, do the following:
(Optional) Select the default data source name at the top of the page, and then enter a unique data source name for use in Tableau. For example, use a data source naming convention that helps other users of the data source figure out which data source to connect to. The default name is automatically generated based on the file name.
If your file contains one table, click the sheet tab to start your analysis. Otherwise, drag a table from the left pane onto the canvas, and then click the sheet tab to start your analysis.
About the tables in the left pane
Tables that are identified in the .pdf file are given unique names and are displayed in the left pane after a scan. For example, you might see a table name like "Page 1, Table 1." The first part of the table name indicates the page in the .pdf file that the table came from. The second part of the table name indicates the order in which the table was identified. If Tableau has identified more than one table on a page, the second part of the table name can indicate one of two things:
- Tableau has identified another unique table or sub-table on the page.
- Tableau has interpreted the table on the page in another way. Tableau might provide multiple interpretations of a table depending on how the table is presented in your .pdf file.
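If you keep a list of the table names shown in the left pane, the naming pattern can be split into its two parts. The following Python sketch assumes the names always follow the "Page N, Table M" pattern quoted above; the helper names are hypothetical.

```python
import re
from collections import defaultdict

# Assumed pattern, based on the example name "Page 1, Table 1".
TABLE_NAME = re.compile(r"^Page (\d+), Table (\d+)$")


def parse_table_name(name):
    """Return (page, order) for a name such as 'Page 1, Table 1'."""
    match = TABLE_NAME.match(name.strip())
    if match is None:
        raise ValueError("unexpected table name: " + repr(name))
    return int(match.group(1)), int(match.group(2))


def tables_by_page(names):
    """Group the table order numbers by the page they came from."""
    grouped = defaultdict(list)
    for name in names:
        page, order = parse_table_name(name)
        grouped[page].append(order)
    return {page: sorted(orders) for page, orders in grouped.items()}


names = ["Page 1, Table 1", "Page 1, Table 2", "Page 3, Table 1"]
print(tables_by_page(names))  # {1: [1, 2], 3: [1]}
```

Pages that list more than one table are the ones to inspect in the data grid, because the extra entries can be either separate tables or alternative interpretations of the same table.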
PDF file data source example
Here is an example of a PDF file data source:
Get more data into your data source by adding more tables or connecting to data in a different database.
Add more data from the current file:
If the pages that were scanned in step 3 of the procedure listed above do not produce the tables that you need in the left pane, click the drop-down arrow next to the PDF File connection, and click Rescan PDF file. This option allows you to create a new scan so that you can specify different pages in the .pdf file to scan for tables.
Add more data from a different database: In the left pane, click Add next to Connections. For more information, see Join Your Data.
If a connector you want is not listed in the left pane, select Data > New Data Source to add a new data source. For more information, see Blend Your Data.
You can set table options. On the canvas, click the table drop-down arrow and then specify whether the data includes field names in the first row. If so, these names will become the field names in Tableau. If field names are not included, Tableau generates them automatically. You can rename the fields later.
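The effect of the field name option can be pictured with a short Python sketch. It is an illustration only, not Tableau's implementation. The generated names F1, F2, and so on mirror the names used in the header example later in this article; the function name and sample rows are hypothetical.

```python
def split_header(rows, first_row_has_names):
    """Return (field_names, data_rows) for a table given as a list of rows."""
    if not rows:
        return [], []
    if first_row_has_names:
        # The first row supplies the field names and is not data.
        return [str(value) for value in rows[0]], rows[1:]
    # No names in the data: generate F1, F2, ... and keep every row.
    width = max(len(row) for row in rows)
    return ["F" + str(i) for i in range(1, width + 1)], rows


rows = [["Year", "Coal"], ["2001", "5"], ["2002", "6"]]
print(split_header(rows, True))   # (['Year', 'Coal'], [['2001', '5'], ['2002', '6']])
print(split_header(rows, False)[0])  # ['F1', 'F2']
```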
If Tableau detects that it can help optimize your data source for analysis, it prompts you to use Data Interpreter. Data Interpreter can detect sub-tables that you can use and remove unique formatting that might cause problems later on in your analysis. For more information, see Clean Data from Excel, CSV, PDF, and Google Sheets with Data Interpreter.
You can union tables in your file. For more information about union, see Union Your Data.
When you use wildcard search to union tables, the result is scoped to the pages that were scanned in the initial file you connected to. For example, suppose you have three files: A.pdf, B.pdf, and C.pdf. The first file you connect to is A and you limit the scan for tables to page 1. When you use wildcard search to union tables from files B and C, the additional tables included in the union can only come from page 1 of B and page 1 of C.
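The scoping rule in the A, B, C example can be expressed as a small filter. The Python sketch below is a hypothetical model of that rule, not Tableau code: each file is represented by a made-up list of (page, table number) pairs, and only tables on the pages scanned in the first file are kept.

```python
def union_candidates(tables_per_file, scanned_pages):
    """Keep only the tables that sit on the pages scanned in the first file.

    tables_per_file maps a file name to a list of (page, table_number) pairs.
    """
    return {
        file_name: [(page, table) for (page, table) in tables if page in scanned_pages]
        for file_name, tables in tables_per_file.items()
    }


# Hypothetical contents of the three files.
tables_per_file = {
    "A.pdf": [(1, 1), (2, 1)],
    "B.pdf": [(1, 1), (3, 1)],
    "C.pdf": [(1, 1), (1, 2)],
}

# The initial scan of A.pdf was limited to page 1.
print(union_candidates(tables_per_file, {1}))
# {'A.pdf': [(1, 1)], 'B.pdf': [(1, 1)], 'C.pdf': [(1, 1), (1, 2)]}
```

To include tables from other pages of B.pdf or C.pdf, the initial scan of the first file needs to cover those page numbers too.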
The following tips can help you work with your .pdf files in Tableau.
Use the PDF File connector to identify just the tables in your .pdf file.
The primary goal of the PDF File connector is to find and identify tables in your .pdf file. Therefore, it ignores any other information in the file that does not appear to be part of a table, including titles, captions, and footnotes. If related data is stored in one of these areas, such as in the table title, you can use Tableau to first export the .pdf file data into a .csv file, manually add the data that was stored in the table title, and then connect to the .csv file instead. For more information, see Export your data to .csv file.
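If you prefer to script the "add the title data" step instead of editing the exported .csv file by hand, something like the following Python sketch could work. The column name, the title text, and the sample CSV content are all hypothetical; the sketch uses in-memory text so that it runs as is.

```python
import csv
import io


def add_title_column(csv_text, column_name, title):
    """Append a column that repeats the table title on every data row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(rows[0] + [column_name])  # header row
    for row in rows[1:]:
        writer.writerow(row + [title])  # data rows
    return out.getvalue()


exported = "Year,Coal\n2001,5\n2002,6\n"  # stand-in for the exported file
print(add_title_column(exported, "Table Title", "Table 1: Energy"))
```

For a real file, read the exported .csv file into a string, pass it to the function, and write the result to a new .csv file before connecting to it.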
Use standard tables.
In general, Tableau works best with standard tables that use a tabular format.
Ideally, the tables in your .pdf file have column headers on a single line and row values on a single line, as demonstrated in the example below.
Colors and shading used in or around the tables can affect how the tables are identified.
Tables that have unique formatting might require some cleanup or manual editing outside of Tableau. Unique formatting can include hierarchical headers, header names that span multiple lines, row values that span multiple lines, angled headers, and stacked tables, as demonstrated in the examples shown below.
Note: Tableau does not support connections to .pdf files generated by scanning (optical character recognition) software.
Validate the data.
Make sure that you validate the data in the tables that Tableau identifies in your .pdf file. You can validate the data by using either the data grid or, if you used Data Interpreter, the results workbook.
Avoid tables that span across pages.
If your .pdf file contains a table that spans across pages, Tableau interprets that table as multiple tables. To resolve this issue, use a union to combine the tables. For more information, see Union Your Data.
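Conceptually, the union stacks the rows of the page-by-page tables into one table. The Python sketch below illustrates that idea with made-up rows. Matching fields by name and filling missing fields with nulls is an assumption made for this sketch; see Union Your Data for how Tableau actually matches fields.

```python
def union_tables(*tables):
    """Stack tables (lists of dict rows), matching fields by name.

    Fields missing from a row are filled with None.
    """
    field_names = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in field_names:
                    field_names.append(name)
    return [
        {name: row.get(name) for name in field_names}
        for table in tables
        for row in table
    ]


# Hypothetical halves of one table that was split across pages 3 and 4.
page_3 = [{"Year": 2001, "Coal": 5}]
page_4 = [{"Year": 2002, "Coal": 6}]
print(union_tables(page_3, page_4))
# [{'Year': 2001, 'Coal': 5}, {'Year': 2002, 'Coal': 6}]
```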
Rename .pdf files whose file names contain Unicode characters.
After connecting to a .pdf file that contains Unicode characters in its file name, you might see the following error.
To resolve this issue, rename the file using non-Unicode characters, and connect to your .pdf file again.
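If you have many files to rename, a script can propose replacement names. The Python sketch below treats "non-Unicode characters" as plain ASCII, which is an assumption on our part. It strips accents where possible and falls back to a placeholder name when nothing usable remains. The function name and the fallback name "renamed" are hypothetical.

```python
import unicodedata
from pathlib import Path


def ascii_file_name(file_name):
    """Suggest an ASCII-only version of a file name, keeping the extension."""
    path = Path(file_name)
    # Decompose accented letters (for example, e-acute becomes "e" plus an
    # accent mark), then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", path.stem)
    stem = decomposed.encode("ascii", "ignore").decode("ascii").strip()
    return (stem or "renamed") + path.suffix


print(ascii_file_name("résumé.pdf"))  # resume.pdf
print(ascii_file_name("report.pdf"))  # report.pdf
```

The function only suggests a name. Rename the file yourself (or with Path.rename) after checking that the suggested name does not collide with an existing file.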
Do not use password-protected .pdf files.
After connecting to and scanning a .pdf file for tables, you might see the following error.
Tableau shows this error when your .pdf file is password protected and Tableau is unable to access its contents. Tableau does not support connections to password-protected .pdf files.
Alias values that are interpreted differently or incorrectly.
In the data grid you might notice that some values are interpreted differently from the .pdf file. You can correct this interpretation by using aliases to rename specific values within a field.
For example, suppose you see the following table after connecting to your .pdf file. Some state abbreviations, highlighted in blue, are being interpreted in lowercase form.
You can resolve this issue by using aliases to change the lowercase abbreviations to uppercase abbreviations. To do this, click the drop-down arrow next to the column name and select Aliases.
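An alias is essentially a lookup from the value as interpreted to the value you want to see. The Python sketch below illustrates that mapping with made-up state abbreviations; it is not how Tableau stores aliases.

```python
def apply_aliases(values, aliases):
    """Replace each value that has an alias; leave the others unchanged."""
    return [aliases.get(value, value) for value in values]


# Hypothetical column in which some abbreviations came through in lowercase.
states = ["WA", "ca", "OR", "tx"]

# Build an alias only for the values that need one.
aliases = {value: value.upper() for value in states if value != value.upper()}

print(aliases)                         # {'ca': 'CA', 'tx': 'TX'}
print(apply_aliases(states, aliases))  # ['WA', 'CA', 'OR', 'TX']
```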
Resolve column headers that are interpreted as table values.
In the data grid you might also notice that some column headers in your .pdf file are interpreted as table values instead. This can occur if your .pdf file contains tables with unique formatting or hierarchical headers. In this scenario, try the Data Interpreter first. If Data Interpreter doesn't resolve the issue, consider manually renaming the columns to their appropriate names and using data source filters to filter out the header names that are being treated as values.
For example, suppose you see the following table after connecting to your .pdf file. The table headers from the .pdf file, highlighted in blue, are being interpreted as table values.
One way you can resolve a header issue like this is to follow steps similar to the following:
Double-click the column name, and then rename F1 to Year. Repeat this step for F2 through F4 for Coal, Gas, and Oil.
Click the data type icon for the Year column and change it to a number data type. This causes the non-numerical values in this column to convert to null values.
In the upper-right corner of the data source page, under Filters, click Add. In the Edit Data Source Filters dialog box, click Add, and then select the Year field.
In the Filter dialog box, select both the Null and Exclude check boxes.
The rows that contain null values in the Year column are removed from the data grid, which also removes the corresponding values from the other columns in the table.
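The rename, convert, and filter steps above can be sketched outside Tableau as follows. This is a minimal Python illustration of the logic, not what Tableau runs internally. The sample rows are made up; the field names Year, Coal, Gas, and Oil come from the steps above.

```python
FIELD_NAMES = ("Year", "Coal", "Gas", "Oil")  # replacements for F1 through F4


def to_number(value):
    """Convert a value to a whole number; non-numeric values become None (null)."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def clean_table(rows):
    """Rename the fields, convert Year to a number, and drop null Year rows."""
    renamed = [dict(zip(FIELD_NAMES, row)) for row in rows]
    for row in renamed:
        row["Year"] = to_number(row["Year"])
    # Equivalent to a filter on Year with Null and Exclude selected.
    return [row for row in renamed if row["Year"] is not None]


# Hypothetical table in which the header row was read as a data row.
rows = [
    ["Year", "Coal", "Gas", "Oil"],
    ["2001", "5", "7", "9"],
    ["2002", "6", "8", "10"],
]
for row in clean_table(rows):
    print(row)
```

Only the Year field is converted here, so the other fields keep their original text values, just as changing one column's data type in Tableau leaves the other columns alone.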
You might notice .ttde or .hhyper files when navigating your computer's directory. When you create a Tableau data source that connects to your data, Tableau creates a .ttde or .hhyper file. This file, also known as a shadow extract, helps improve the speed at which your data source loads in Tableau Desktop. Although a shadow extract contains underlying data and other information similar to the standard Tableau extract, a shadow extract is saved in a different format and can't be used to recover your data.
In certain situations, you might need to delete a shadow extract from your computer. For more information, see Low Disk Space because of shadow extract in the Tableau Knowledge Base.
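To see how much space shadow extracts take up before you decide whether to delete any, you can list them by size. The Python sketch below searches a folder that you supply for files with the .ttde or .hhyper extension. It does not delete anything, and it makes no assumption about where Tableau stores these files; the function name is hypothetical.

```python
from pathlib import Path

SHADOW_EXTENSIONS = {".ttde", ".hhyper"}


def find_shadow_extracts(directory):
    """Return (path, size_in_bytes) pairs for shadow extracts, largest first."""
    root = Path(directory)
    found = [
        (path, path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SHADOW_EXTENSIONS
    ]
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Call find_shadow_extracts with the folder you want to inspect, and then follow the Knowledge Base article before removing any of the files it reports.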