The Bookshop data set

The Bookshop data set was created for Tableau Desktop 2020.2 to showcase relationships—the new data modeling capabilities for combining data.

Download the files

You can choose to download the raw data and start from scratch building your own data model, or download one of the pre-built data sources to jump right into analysis with relationships.

  • Raw data (xlsx)
    • Bookshop.xlsx - the raw data as an .xlsx file
    • BookshopLibraries.xlsx - additional library tables that introduce many-to-many relationships (contains only the additional library tables)
  • Packaged data source (tdsx) files
    • Bookshop.tdsx - a packaged .tdsx file with the related data source already built and some metadata customizations added
    • MinimalBookshop.tdsx - the same tables as Bookshop.tdsx but without any metadata or cleanup
    • Bookshop_libraries.tdsx - a packaged .tdsx file that adds the library tables to Bookshop.tdsx (includes all tables).

To connect to a downloaded .tdsx

  1. Open Tableau Desktop.
  2. In the Connect pane on the left, choose the More... option under the To a File heading.
  3. Navigate to where you downloaded the .tdsx and double click it (or select it then click Open).

About the data set

Bookshop consists of 13 tables, combined in the following manner:

Note: A central concept for this data set is the idea of a book versus an edition. A book is a concept with attributes such as author, title, and genre. An edition is a physical version of the book, with attributes such as format (hardcover, paperback), publication date, and page count.

Data Dictionary

Some fields may need a little explanation.

  • The Rating field in the Ratings table is on a scale of 1-5 with 5 being high.
  • The Format field is a detailed breakdown of the format. Everything other than “Hardcover” could be considered “Paperback”.
  • ISBN stands for International Standard Book Number and is a unique 13 digit identifier given to each edition of a book. The ISBN is represented in the barcode and is tied to the price.
  • The ItemID and OrderID fields are hierarchical. An order may contain multiple items.
  • The Staff Comment field contains summaries and reviews for some of the books.

If you build your own data source

  1. The Publisher and Sales tables must be related to the Edition table.
  2.  The Book and Info tables can be related or joined, but it must be on the calculation BookID = [BookID1]+[BookID2].
    • An inner join is suggested.
    • The Series table can only be used once the Info table is part of the data source.
  • optional:
    1. We recommend renaming the logical table of Book joined with Info “Book” or something similar.
    2. The four Sales tables can be analyzed independently, but if they are unioned, we recommend renaming the union “Sales” or something similar.
    3. Most of the fields used to form relationships aren’t needed for analysis and can be hidden once the data model has been built
      • Any field ending with ID can be hidden (the only “ID” fields that do are ReviewID and ReviewerID from Ratings and ItemID and OrderID from Sales).
      • ISBN should be kept, ideally from the Edition table, as the identifier of each edition. The ISBN field in the unioned Sales table can be hidden.
      • The Title field in the Award table can be hidden.
      • The sheet and table fields from the Union also don’t contain unique information and can be hidden.
    4. To simplify the data model, the core tables are Book, Author, and Edition. The easiest tables to leave out with minimal impact would be Checkoutsand Ratings, followed by Award, Publisher, Sales, or Info and Series.

    Bookshop Libraries

    The BookshopLibraries.xlsx file introduces new tables around libraries for the purpose of supporting many-to-many relationships. The Catalog table relates to the Editions table on ISBN. The Library Profiles table relates to the Catalog table as a many-to-many (n:n) relationship on LibraryID.

    Data Dictionary

    Library Profile

    • The Library field is the name of the library.
    • The Library ID field an ID also used in the Catalog table.
    • The Consortium Member field is a yes/no field that indicates if the library is part of the larger network of libraries that offer inter-library loans and share other services
    • The Private field is a yes/no field that indicates if this is a members-only library or a public institution
    • Staff Type and Number of Staff together describe how many Librarians, Library Assistants, and Library Technicians work for each library.

    Catalog

    • LoanID is a unique identifier for the combination of ISBN and library that is used to track the Number of Copies a library has. For example, if the Idle Hour Library has two paperback copies and one hardcover copy of a title, there will be two Loan IDs.
    • The Library ID field is an ID also used in the Library Profile table.
    Structure of the Library Profile table

    Explore

    This data set, while fictional, supports various analytical scenarios and plenty of exploration. Some suggestions include:

    • What books are the most popular? The least popular? Is that based on sales, reviews, checkouts, or another metric?
    • Who was the youngest debut author? Who was the oldest?
    • Do some publishing houses seem to specialize in any way?
    • What was the longest time between editions of the same book?
    • Are there any seasonal trends for sales? What about checkouts? Do any titles or genres have seasonal fluctuations?
    • Are there any correlations between checkouts, print run size, book review ratings, and sales volume?
    • Do the authors who spend the most time writing have the most successful books? Do they have the highest page count? 
    • When are most books published? Are there any anomalies? 
    • Are there any trends for genre, format, and price?
    • What sort of distributions do the ratings have? Do those distributions vary by book? By genre? Do they seem to align with awards? 
    • How would you calculate the sales price, given that there is sometimes—but not always—a discount given at the time of sale?
    • Do sales approximate the Pareto principle? 
    • Are there any patterns in the discounts? 
    • Do any tables in particular appear to have dirty data?
    Bookshop Data Set Easter Eggs