Remove Duplicate Rows

Supported in Tableau Prep Builder version 2024.1 and later and on the web in Tableau Cloud.

Duplicate data can have an impact on your data quality, bias prediction results, and waste storage space. Removing duplicates ensures that each value is unique, providing an accurate representation of your data. The first step to removing duplicates is to identify the duplicate rows in your data. After you've identified duplicates, you can then remove them from the work flow.

Identify and remove duplicates

Tableau Prep identifies a duplicate as any row that has at least one other row with the same values.

To identify and remove duplicate rows.

  1. (Optional) For Excel and text files, you can click on the Input step and add a Source Row Number field to sort rows by the original data source. For more information, see Add the Source Row Number field to your flow and Sort order for the calculated field.
  2. Click on any step other than an Input or Output step.
  3.  

  4. Identify duplicate rows.
    • To identify duplicate rows across all fields, from the toolbar, click Identify Duplicate Rows.
    • To identify duplicate rows across specific fields, select one or more fields, then click Identify Duplicate Rows. Optionally, in the profile pane, you can click the More options menu from the selected field and select Identify Duplicate Rows.

    A calculated field is created and listed in the Changes pane. The calculated field provides a description and the names of the fields that the duplicate rows span. The Data Grid shows which rows are unique and which rows are duplicates.

    The following image shows the results of selecting the two fields Region and Item Type to identify duplicate rows across only those selected fields.

    If a Source Row Number is available, it’s used to sort the order of the rows. Otherwise, rows are sorted based on the first field in the table schema. To change the values in the calculated field, you can directly edit the calculation or write your own calculation. See Sort order for the calculated field.

  5. To evaluate duplicate rows, click on Duplicate or Unique in the Is Duplicate Row? field or in the Data Grid.
  6. The following image shows duplicate rows across all fields.

    The following image shows duplicate rows across only the fields Region and Item Type.

  7. To filter and keep only the unique rows, click Unique in the Data grid or in the Is Duplicate Row? field, and select Keep Only from the toolbar. Optionally, you can click Unique in the Data grid or in the Is Duplicate Row? field, then right-click to select Keep Only.

    You can also select Duplicate and then select Exclude to filter data for duplicates. This produces the same results as selecting Unique and Keep Only and doesn’t affect which rows are excluded or kept.

    A filter is created in the changes pane showing that only unique rows are kept and duplicate rows are excluded.

  8. From the Is Duplicate Row? calculated field, click the More options menu, and select Remove.

    The duplicate rows are removed when you run the flow and create an output.

    Note: Don’t remove the Is Duplicate Row? calculated field if you’ll be adjusting for sampled data.

Duplicates in sample data

Duplicate rows may not appear in sampled data and could introduce a bias in the model.

To address duplicate rows sampled data:

  1. Don’t remove the Is Duplicate Row? calculated field if you’ll be adjusting for sampled data. This gives you the option to change the sample size or adjust the sample data in the input step.
  2. Adjust the sample size to load as much data as possible for row selection. See Set your data sample size.
  3. Duplicates may not appear in sampled data depending on the size of the sample. Using the Keep Only option for unique rows allows you to remove duplicate rows, even if they aren’t visible in the sampled data.

Sort order for the calculated field

By default, if multiple rows contain the same value, they’re sorted by the first field, or if available, the Source Row Number. The fields are used to identify the first row as Unique and the remaining as Duplicate. Changing the sort order of fields allows you to specify which duplicate row you want to identify as Unique. You can make changes to the order by editing the calculation and changing the ORDERBY field.

For example:

IF ({PARTITION [Field1], [Field2], [Field3]: { ORDERBY [FieldName] ASC: ROW_NUMBER() } } = 1) THEN 'Unique' ELSE 'Duplicate' END

For Excel and text files, you can sort by the original data source by adding a Source Row Number field.

For example:

IF ({PARTITION [Source Row Number], [Field1], [Field2], [Field3]: { ORDERBY [Source Row Number] DESC: ROW_NUMBER() } } = 1) THEN 'Unique' ELSE 'Duplicate' END

For more information, see Add the Source Row Number field to your flow.

Write your own filter calculation to find and remove duplicates

You can write your own filter calculation across multiple fields to find and remove duplicates.

The following filter calculation will only return results that return True. The calculation returns either True or False based on the fields used to PARTITION

{PARTITION [Field1], [Field2], [Field3]: { ORDERBY [Field1] DESC: ROW_NUMBER() } } = 1

Use CASE or IF for identify duplicate calculations

You can use the IF or CASE functions in the calculation editor. For example:

CASE {PARTITION [Field1], [Field2], [Field3] : { ORDERBY [[Field3]]: ROW_NUMBER() } } = 1 WHEN TRUE THEN 'UNIQUE' ELSE 'DUPLICATE' END

IF ({PARTITION [[Field1]], [[Field2]], [[Field3]]: { ORDERBY [[Field3]] DESC: ROW_NUMBER() } } = 1) THEN 'Unique' ELSE 'Duplicate' END

Thanks for your feedback!Your feedback has been successfully submitted. Thank you!