Configure your Data Set
Note: Starting in version 2020.4.1, you can now create and edit flows in Tableau Server and Tableau Cloud. The content in this topic applies to all platforms, unless specifically noted. For more information about authoring flows on the web, see Tableau Prep on the Web in the Tableau Server(Link opens in a new window) and Tableau Cloud(Link opens in a new window) help.
To determine how much of your data set to work with in the flow, you can configure your data set. When you connect to your data or drag tables into the Flow pane, an Input step is automatically added to the flow.
The Input step is where you can decide what and how much data to include in your flow. This is always the first step in the flow.
If you're connected to an Excel or text file, you can also refresh the data from the Input step. For more information, see Add More Data in the Input Step(Link opens in a new window).
In the Input step, you can:
- Right-click or Cmd-click (MacOS) on the Input step in the flow pane to rename or remove it.
- Union multiple files in the same parent or child directory. For more information, see Union files and database tables in the Input step.
- (version 2023.1 and later) Include automatically generated row numbers based on the original sort order of your data set. See Include row numbers from your data set.
- Search for fields.
- See examples of field values.
-
Configure the field properties by changing the field name or configure the text settings for text files.
Note: Field values that include square brackets are automatically converted to parentheses.
- Perform actions to change the data that you work with in your flow. See Set your data sample size.
- Configure the data sample ingested into your flow.
- Remove fields you don't need. You can always go back to the input step and include them later.
- Hide fields that you don't need to clean, but still want to include in your flow output. You can unhide them at any time if you need them.
- Apply filters to selected fields.
-
Change the field data type for data connections that support it.
These include Microsoft Excel, text and PDF files, data from Box, Dropbox, Google Drive, and OneDrive. For other data sources you can change the data type in a Clean step.
For more information, see Review the data types assigned to your data
Include row numbers from your data set
Supported in Tableau Prep Builder version 2023.1 and later and on the web for Microsoft Excel and text (.csv) files.
Note: This option is not currently supported for files included in an input union.
Starting in version 2023.1, Tableau Prep automatically generates row numbers based on the original sort order of your data that you can include as a new field in your flow. This is available for Microsoft Excel or Text (.csv) file types only.
In previous releases, if you wanted to include these row numbers, you had to manually add them to the source before adding the data set to your flow.
This field is generated in the Input step when you connect to your data. By default, it is excluded from the flow, but you can include it in one click. If you choose to include it, it behaves like any other field and can be used in your flow operations and calculated fields.
Tableau Prep also supports the ROW_NUMBER function for calculated fields. This function is useful when there are fields in your data set that can define the sort, such as Row ID or Timestamp. For more information about using this function, see Create level of detail, rank, and tile calculations .
Add the Source Row Number field to your flow
-
Right-click or Cmd-click (MacOS) on the field, or click the More options
menu and select Include Field.
-
The change list is cleared, the field is now part of the flow data, and you can see the generated row numbers in subsequent flow steps.
Source Row Number details
When you include the Source Row Number in your data set, the following options and considerations apply.
- The data source row numbers are applied before any data sampling or filters.
- This creates a new field called Source Row Number that persists throughout the flow. This field name isn't localized, but can be renamed at any time.
- If a field with this name already exists, the new field name is incremented by 1. For example Source Row Number-1, Source Row Number-2, and so on.
- You can change the field's data type in subsequent steps.
- You can use this field in flow operations and calculations.
- This value is regenerated for the whole data set each time the input data is refreshed or the flow is run.
- This field is not available for input unions.
Connect to a custom SQL query
If your database supports using custom SQL, you will see Custom SQL displayed near the bottom of the Connections pane. Double-click Custom SQL to open the Custom SQL tab where you can enter queries to preselect data and use source-specific operations. After the query retrieves the data set, you can select the fields to include, apply filters, or change the data type before adding the data to your flow.
For more information about using custom SQL, see Use Custom SQL to connect to data.
Apply cleaning operations in an input step
Only some cleaning operations are available in an Input step. You can make any of the following changes in the Input field list. Your changes are tracked in the Changes pane and annotations are added to the left of the Input step in the Flow pane and in the Input field list.
- Hide Field: Hide fields instead of removing them to reduce clutter in your flow. You can always unhide them if you need them. Hidden fields will still be included in your output when you run your flow.
- Filter: Use the calculation editor to filter values or starting in version 2023.1, you can also use the Relative Date Filter dialog to quickly specify date ranges for any date or date & time fields.
-
Rename Field: In the Field Name field, double-click or Ctrl-click (MacOS) on the field name and enter a new field name.
-
Change Data Type: Click on the data type for the field and select a new data type from the menu. This option is currently supported for Microsoft Excel, text and PDF files, Box, Dropbox, Google Drive, and OneDrive data sources. All other data sources can be changed in a clean step.
Select fields to include in the flow
Note: Starting in version 2023.1 you can select multiple fields to hide, unhide, remove, or include them. In previous releases, you can work with one field at a time and select or clear the check boxes to include or remove fields.
The Input pane shows you a list of fields in your data set. By default all fields are included except the auto-generated field, Source Row Number. Use the following options to manage your fields.
- Search: Find fields in the list.
- Hide: Click the eye
icon or select Hide Fields from the More options
menu to hide fields that you want to include in your flow output, but don't need to clean. Fields are processed by the flow during run time. You can also Unhide fields any time if you need them. For more information, see Hide fields.(Link opens in a new window)
- Include Fields: Select one or more rows and right-click, Cmd-click (MacOS), or click the More options
menu and select Include Fields to add back fields that are marked as removed.
- Remove Fields: Select one or more rows and right-click, Cmd-click (MacOS), click the "X", or click the More options
menu and select Remove Fields to remove fields that you don't want to include in the flow.
Apply filters to fields in the Input step
Apply filters in the input step to reduce the amount of data that you ingest from your data sources. You can gain interactive performance efficiency and a more useful data sample by eliminating the data you don't want to process when you run the flow.
In the input step you can apply filters using the Calculation Editor. Starting in version 2023.1, you can also use the Relative Date Filter dialog to specify an exact date range of values to include for date and date & time field types. For more information, see "Relative Date filter" in Filter Your Data(Link opens in a new window).
You can use other filter options in the Clean step or other step types. For more information, see Filter Your Data(Link opens in a new window)
Apply a calculation filter
-
In the toolbar click Filter Values, or in the field grid, click the More options
menu and select Filter > Calculation ....
-
Enter your filter criteria in the calculation editor.
Apply a relative date filter
-
In the Input grid, select a field with a data type of Date or Date & Time. Then right-click, Cmd-click (MacOS), or click the More options
menu and select Filter > Relative Dates.
-
In the Relative Date Filter dialog, specify the exact range of years, quarters, months, weeks, or days that you want to include in your flow. You can also configure an anchor relative to a specific date, and include null values.
Note: By default, the filter operates relative to the date that the flow is run or previewed within the authoring experience.
Change field names
To change the name of a field, in the Field Name column, select the name, and then type the new name in the field. An annotation is added in the field grid and in the flow pane to the left of the Input step. Your changes are also tracked in the Changes pane.
Change data types
Currently supported for Microsoft Excel, text and PDF files, Box, Dropbox, Google Drive, and OneDrive data sources. All other data sources can be changed in a clean step.
Note: The data type for Source Row Number (version 2023.1 and later) can only be changed in a Clean step or other step type.
To change the data type of a field, do the following:
-
Click the data type for the field.
-
Select the new data type from the menu.
You can also change the data type for fields in other step types in the flow or assign data roles to help validate your field values. For more information about changing your data type or using data roles, see Review the data types assigned to your data(Link opens in a new window) and Use Data Roles to Validate your Data(Link opens in a new window).
Configure field properties
When you work with text files, you see a Settings tab where you can edit your connection and configure text properties, such as the field separator for text files. You can also edit the file connection in the Connections pane or configure incremental refresh settings. For more information about setting up incremental refresh for your flow, see Refresh Flow Data Using Incremental Refresh .
When you work with text or Excel files, you can correct data types that have been inferred incorrectly before you even start your flow. Data types can always be changed in subsequent steps in the Profile pane after you start your flow.
Configure text settings in text files
To change the settings used to parse text files, select from the following options:
-
First line contains header (default): Select this option to use the first row as the field labels.
-
Generate field names automatically: Select this option if you want Tableau Prep Builder to auto-generate the field headers. The field naming convention follows the same model as Tableau Desktop. For example F1, F2, and so on.
-
Field Separator: Select a character from the list to use to separate the columns. Select Other to enter a custom character.
-
Text Qualifier: Select the character that encloses the values in the file.
-
Character Set: Select the character set that describes the text file encoding.
-
Locale: Select the locale to use to parse the file. This setting indicates which decimal and thousand separator to use.
Set your data sample size
To maintain peak performance, by default, Tableau Prep limits the data included in the flow to a representative sample of your data set. The data sample is determined by calculating the optimal number of rows based on the total number of fields in the data set and the data types for those fields. Tableau Prep then retrieves the top number of rows for the calculated amount as quickly as possible.
The resulting data sample may include all the rows you need, or it may not, depending on how the sample was calculated and returned. If you don't see the data that you expect, you can change the data sample settings to run the query again.
When creating or editing flows on the web, limits are applied to the amount of data you can include in a flow and the options available to change your data sample are slightly different than when working in Tableau Prep Builder. For more information, see Sample data and processing limits in the Tableau Server(Link opens in a new window) or Tableau Cloud(Link opens in a new window) help.
Note: If your data is sampled, a Sampled badge shows in the Profile pane and persists for every step you add. Any changes you make apply to the sample you are working with in the flow. All changes apply to your entire data set when you run the flow.
To change your data sample settings, select an Input step, then on the Data Sample tab select from the following options:
-
(2023.1—Automatic) (2022.4 and earlier—Default sample amount): Tableau Prep calculates the total number of rows to return. This is the default.
-
(2023.1—Maximum) (2022.4 and earlier—Use all data): (Tableau Prep Builder only) Retrieve all rows in your data set regardless of size. This can impact performance or cause Tableau Prep Builder to time out.
Note: To maintain performance, even if you select this setting, a data sample limit of 1 million rows is applied to Aggregate and Union step types and a data sample limit of 3 million rows is applied to Join and Pivot step types.
-
(2023.1—Specify) (2022.4 and earlier—Fixed number of rows): Select the number of rows to return from the data set. The recommended number of rows is 1 million or less. Setting the number of rows to more than 1 million can impact performance.
- In Web authoring: The maximum number of rows that a user can select when using large data sets is configured by the administrator. As a user, you can select the number of rows up to that limit.
-
Quick select (default): The database returns the number of rows requested as quickly as possible. This might be the first N number of rows or the rows that the database had cached in memory from a previous query.
-
Random sample: The database returns the number of rows requested but looks at every row in the data set and returns a representative sample from all of the rows. This option may impact performance when the data is first retrieved.