The Bookshop data set
The Bookshop data set was created for Tableau Desktop 2020.2 to showcase relationships—the new data modeling capabilities for combining data.
Download the files
You can choose to download the raw data and start from scratch building your own data model, or download one of the pre-built data sources to jump right into analysis with relationships.
- Raw data (xlsx)
- Bookshop.xlsx - the raw data as an .xlsx file
- BookshopLibraries.xlsx - additional library tables that introduce many-to-many relationships (contains only the additional library tables)
- Packaged data source (tdsx) files
- Bookshop.tdsx - a packaged .tdsx file with the related data source already built and some metadata customizations added
- MinimalBookshop.tdsx - the same tables as Bookshop.tdsx but without any metadata or cleanup
- Bookshop_libraries.tdsx - a packaged .tdsx file that adds the library tables to Bookshop.tdsx (includes all tables).
To connect to a downloaded .tdsx
- Open Tableau Desktop.
- In the Connect pane on the left, choose the More... option under the To a File heading.
- Navigate to where you downloaded the .tdsx and double click it (or select it then click Open).
About the data set
Bookshop consists of 13 tables, combined in the following manner:
Note: A central concept for this data set is the idea of a book versus an edition. A book is a concept with attributes such as author, title, and genre. An edition is a physical version of the book, with attributes such as format (hardcover, paperback), publication date, and page count.
Data Dictionary
Some fields may need a little explanation.
- The Rating field in the Ratings table is on a scale of 1-5 with 5 being high.
- The Format field is a detailed breakdown of the format. Everything other than “Hardcover” could be considered “Paperback”.
- ISBN stands for International Standard Book Number and is a unique 13 digit identifier given to each edition of a book. The ISBN is represented in the barcode and is tied to the price.
- The ItemID and OrderID fields are hierarchical. An order may contain multiple items.
- The Staff Comment field contains summaries and reviews for some of the books.
If you build your own data source
- The Publisher and Sales tables must be related to the Edition table.
- The Book and Info tables can be related or joined, but it must be on the calculation
BookID = [BookID1]+[BookID2]
.- An inner join is suggested.
- The Series table can only be used once the Info table is part of the data source.
- We recommend renaming the logical table of Book joined with Info “Book” or something similar.
- The four Sales tables can be analyzed independently, but if they are unioned, we recommend renaming the union “Sales” or something similar.
- Most of the fields used to form relationships aren’t needed for analysis and can be hidden once the data model has been built
- Any field ending with ID can be hidden (the only “ID” fields that do are ReviewID and ReviewerID from Ratings and ItemID and OrderID from Sales).
- ISBN should be kept, ideally from the Edition table, as the identifier of each edition. The ISBN field in the unioned Sales table can be hidden.
- The Title field in the Award table can be hidden.
- The sheet and table fields from the Union also don’t contain unique information and can be hidden.
- To simplify the data model, the core tables are Book, Author, and Edition. The easiest tables to leave out with minimal impact would be Checkoutsand Ratings, followed by Award, Publisher, Sales, or Info and Series.
Bookshop Libraries
The BookshopLibraries.xlsx file introduces new tables around libraries for the purpose of supporting many-to-many relationships. The Catalog table relates to the Editions table on ISBN. The Library Profiles table relates to the Catalog table as a many-to-many (n:n) relationship on LibraryID.
Data Dictionary
Library Profile
- The Library field is the name of the library.
- The Library ID field an ID also used in the Catalog table.
- The Consortium Member field is a yes/no field that indicates if the library is part of the larger network of libraries that offer inter-library loans and share other services
- The Private field is a yes/no field that indicates if this is a members-only library or a public institution
- Staff Type and Number of Staff together describe how many Librarians, Library Assistants, and Library Technicians work for each library.
Catalog
- LoanID is a unique identifier for the combination of ISBN and library that is used to track the Number of Copies a library has. For example, if the Idle Hour Library has two paperback copies and one hardcover copy of a title, there will be two Loan IDs.
- The Library ID field is an ID also used in the Library Profile table.
The Library Profile table was originally formatted as an unpivoted table with a column for each staff type, below.
Library | LibraryID | Consortium Member | Private | Librarians | Library Technicians | Library Assistants |
Idle Hour Library |
L-IHL | Yes | No | 53 | 61 | 16 |
The Bibliophile’s Shelves |
L-BS | Yes | Yes | 4 | 3 | 0 |
Armchair Athanaeum |
L-AA | No | Yes | 6 | 0 | 0 |
Old Friend Library | L-OFL | Yes | No | 3 | 5 | 17 |
Bide Awhile | L-BA | Yes | No | 9 | 20 | 6 |
IndieUnBound | L-IUB | Yes | No | 7 | 2 | 47 |
Page Station Book Exchange |
L-PS | No | Yes | 3 | 1 | 4 |
The table was pivoted across the Librarians, Library Technicians, and Library Assistants columns. The final format has a column for Staff Type and a column for Number of Staff. However, this means that there is more than one row for each library. The Library ID values are therefore not unique, and the pivoted version of the table must be related many-to-many to the Catalog table.
While this has the benefit of allowing a many-to-one relationship between Catalog and Profile, it isn’t an ideal structure for data analysis(Link opens in a new window).
Explore
This data set, while fictional, supports various analytical scenarios and plenty of exploration. Some suggestions include:
- What books are the most popular? The least popular? Is that based on sales, reviews, checkouts, or another metric?
- Who was the youngest debut author? Who was the oldest?
- Do some publishing houses seem to specialize in any way?
- What was the longest time between editions of the same book?
- Are there any seasonal trends for sales? What about checkouts? Do any titles or genres have seasonal fluctuations?
- Are there any correlations between checkouts, print run size, book review ratings, and sales volume?
- Do the authors who spend the most time writing have the most successful books? Do they have the highest page count?
- When are most books published? Are there any anomalies?
- Are there any trends for genre, format, and price?
- What sort of distributions do the ratings have? Do those distributions vary by book? By genre? Do they seem to align with awards?
- How would you calculate the sales price, given that there is sometimes—but not always—a discount given at the time of sale?
- Do sales approximate the Pareto principle?
- Are there any patterns in the discounts?
- Do any tables in particular appear to have dirty data?
- Etaoin Shrdlu is a reference to linotype hot type machines. Palimpsest is a manuscript-making reference.
- Author names were derived from a list of famous American authors with their first and last names shuffled.
- Publication dates today are usually on Tuesdays. This was reflected in the data, which assumes this industry standard holds into the year 2178.
- One book has no reviews, checkouts, or sales.
- The Checkouts data is based actual library data, with BookIDs mapped to titles, so the checkout trends are real-world.
- The Ratings data is based on actual book rating data, with BookIDs mapped to titles, so the distribution of ratings for a given book is real.
- The proportion of sales that are for a single book versus multiple books is based on real-world data from an independent bookstore.
- The Sales data is completely generated, but is based on actual seasonal and weekday trends for a resort town with a tourism-based economy (proportionally by month and day of the week, and for spring break and the winter holidays).
- Sales spike over long weekends and spring break.
- Sales are much higher on weekends and at their lowest on Tuesdays and Thursdays.
- Peak season is the summer between the 4th of July and Labor Day.
- Sales spike immediately after Thanksgiving and in the weeks leading up to Christmas.
- The ISBNs in the data set are fictitious but constructed following ISBN-13 principles:
- EAN—989 (not actually in use)
- Group number—28
- Publisher registrant element—varying length (2-4 digits)
- Title element, varying length (3-5 digits, depending on the registrant element length)
- Check digit, created with the ISBN-13 check digit modular arithmetic calculation
- One of the developers working on this feature suggested a book with no editions, and the title The Deep Grey, so its BookID ends with the last three digits of his work phone number.
- There is an (intentional) error in the Edition table where two editions of BookID PA169 were labeled as Palimpsest Printing, but the ISBN indicates they were actually printed by Etaoin Shrdlu Press.
The person who created this data set had way too much fun making it, and clearly has a deep obsession with books. She hopes you enjoy the data set and it helps you appreciate the power, elegance, and beauty of relationships in Tableau.