amz_author_all_books

1298606 rows


Description

This is an Amazon Author All Books table. It has the following fields:
- id: a unique identifier for each row in the table, starting from 1 for each book in alphabetical order of the author’s surname
- order: the page number where this book appears (1 being at the top)
- created_at: the date and time at which this book was added to the table. The column type is timestamptz; values are treated as ISO-8601 datetimes in UTC, so the local time in a given timezone will differ from the stored UTC value.
- author_id: a unique identifier for the author of the book, also starting at 1 and increasing alphabetically by the author’s surname
- book_id: a unique identifier for the particular book that is being described
- media_url_id: a unique identifier for the Media URL associated with this book
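The UTC convention described for created_at can be illustrated with a minimal sketch. It assumes the value reaches the application as an ISO-8601 string; the sample value, the function names, and the +09:00 offset are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def parse_created_at(value: str) -> datetime:
    """Parse an ISO-8601 string and return a timezone-aware UTC datetime."""
    # A trailing "Z" is rewritten so older Python versions can parse it.
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: a value without an offset is already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def to_fixed_offset(moment: datetime, hours: int) -> datetime:
    """Express the same instant in a fixed UTC offset (no DST handling)."""
    return moment.astimezone(timezone(timedelta(hours=hours)))


sample = "2023-05-01T12:00:00Z"  # hypothetical created_at value
utc_moment = parse_created_at(sample)
local_moment = to_fixed_offset(utc_moment, 9)

print(utc_moment.isoformat())    # 2023-05-01T12:00:00+00:00
print(local_moment.isoformat())  # 2023-05-01T21:00:00+09:00
```

Both values denote the same instant; only the wall-clock rendering differs.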

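The table stores information about all books written by any of the authors in Amazon Author All Books. It contains metadata such as the author and book identifiers, page number, and creation date for each book. The columns listed below include no author name or ISBN, so those values would have to come from the referenced tables amz_authors and amz_books. The Media URL associated with each book is used to view the cover image on a Kindle device.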

Consider the following scenario: As a Systems Engineer, you are tasked with optimizing queries against the Amazon Author All Books table in MongoDB under the following rules:
1) You may split large tables into smaller ones.
2) Each sub-table can be divided horizontally by either the author ID or the media URL ID field.
3) The split should make queries more efficient and also save storage space.
4) After the split, every sub-table must keep an equivalent index on the same set of attributes (e.g., if the table is split in two on author ID, each sub-table should still contain an “author_id” field).

Question: What strategy would you adopt to divide the table in order to optimize storage space and query speed, and which sub-tables, if any, would you create?

Start from a simplifying assumption: if a split by media URL and a split by author ID each cost the same in overall performance, then the number of splits should be kept as small as possible.

Compare the candidate strategies directly. The first is to split horizontally on a single key only: either an “author” based or a “media_url” based division.

Then extend the analysis to the limits of each option. With more than 10,000 books for a single author, index queries on an author-based sub-table could become slow because many lookups land on the same sub-table, whereas a media URL-based division spreads the data over several tables and stores the shared attributes redundantly.

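Argue by contradiction: assume that a horizontal split based only on media URL performs better than a horizontal split based on author ID. A query for all books by one author would then generally have to read several media URL-based sub-tables, while the author-based split answers it from one, which undercuts the assumption for author-centred queries.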

Using “tree of thought” reasoning, lay out each division strategy and compare its effect on query time, index size, storage, and data redundancy across different data volumes.

Finally, weighing these possibilities, the most effective option is to split the table horizontally into author-based divisions, since this addresses both query efficiency and data redundancy. Answer: Divide the single base table into separate “author” or “media_url” based sub-tables, preferring the author-based split and keeping rows evenly distributed across the sub-tables to limit performance losses from the split.
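The author-based horizontal split can be sketched in a few lines. This is an illustration under stated assumptions, not a database implementation: rows are plain dictionaries, the sample rows are made up, and sub-tables are chosen by author_id modulo a hypothetical shard_count. Every sub-table keeps the full set of fields, as rule 4 requires.

```python
from collections import defaultdict

# Field names taken from the description above; the sample rows are made up.
FIELDS = ("id", "order", "created_at", "author_id", "book_id", "media_url_id")


def split_by_author(rows, shard_count):
    """Group rows into sub-tables keyed by author_id modulo shard_count."""
    shards = defaultdict(list)
    for row in rows:
        missing = [name for name in FIELDS if name not in row]
        if missing:
            raise ValueError(f"row {row.get('id')} lacks fields: {missing}")
        # All books by one author share an author_id, so they share a sub-table.
        shards[row["author_id"] % shard_count].append(row)
    return dict(shards)


rows = [
    {"id": 1, "order": 1, "created_at": "2023-05-01T12:00:00Z",
     "author_id": 10, "book_id": 100, "media_url_id": 7},
    {"id": 2, "order": 2, "created_at": "2023-05-01T12:05:00Z",
     "author_id": 10, "book_id": 101, "media_url_id": 8},
    {"id": 3, "order": 1, "created_at": "2023-05-02T09:00:00Z",
     "author_id": 11, "book_id": 102, "media_url_id": 7},
]

shards = split_by_author(rows, shard_count=2)
for shard_id, shard_rows in sorted(shards.items()):
    print(shard_id, [row["id"] for row in shard_rows])
# 0 [1, 2]
# 1 [3]
```

In this sample, rows 1 and 3 share media_url_id 7 but land in different sub-tables, so a lookup by media URL has to consult both. That is the trade-off an author-based split accepts.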

Columns

| Column | Type | Size | Nulls | Auto | Default | Children | Parents | Comments |
|---|---|---|---|---|---|---|---|---|
| id | int8 | 19 | | | null | | | |
| order | int2 | 5 | | | null | | | |
| created_at | timestamptz | 35,6 | | | null | | | |
| author_id | int8 | 19 | | | null | | amz_authors.id (amz_author_all_books_author_id_e71045a0_fk_amz_authors_id, R) | |
| book_id | int8 | 19 | | | null | | amz_books.id (amz_author_all_books_book_id_f1163b31_fk_amz_books_id, R) | |
| media_url_id | int8 | 19 | | | null | | amz_media_url.id (amz_author_all_books_media_url_id_5e7ef84c_fk_amz_media_url_id, R) | |
| check_by_validation | bool | 1 | | | null | | | |
| in_data_validation | bool | 1 | | | null | | | |
| status_data_validation | jsonb | 2147483647 | | | null | | | |

Indexes

| Constraint Name | Type | Sort | Column(s) |
|---|---|---|---|
| amz_author_all_books_pkey | Primary key | Asc | id |
| amz_author_all_books_author_id_e71045a0 | Performance | Asc | author_id |
| amz_author_all_books_book_id_f1163b31 | Performance | Asc | book_id |
| amz_author_all_books_media_url_id_5e7ef84c | Performance | Asc | media_url_id |
| idx_author_all_books | Performance | Asc | author_id |
| idx_book_all_books | Performance | Asc | book_id |
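Two pairs of entries in this listing cover the same column with the same sort order. A small sketch shows how such overlaps could be flagged from the listing alone. The index names, sort orders, and columns are copied from the table; the grouping function is a hypothetical helper. Matching column and sort does not prove two indexes are identical, since the listing does not show index method or predicates.

```python
from collections import defaultdict

# (name, type, sort, columns) copied from the index listing above.
INDEXES = [
    ("amz_author_all_books_pkey", "Primary key", "Asc", ("id",)),
    ("amz_author_all_books_author_id_e71045a0", "Performance", "Asc", ("author_id",)),
    ("amz_author_all_books_book_id_f1163b31", "Performance", "Asc", ("book_id",)),
    ("amz_author_all_books_media_url_id_5e7ef84c", "Performance", "Asc", ("media_url_id",)),
    ("idx_author_all_books", "Performance", "Asc", ("author_id",)),
    ("idx_book_all_books", "Performance", "Asc", ("book_id",)),
]


def overlapping_indexes(indexes):
    """Return {(columns, sort): [names]} for keys covered by more than one index."""
    by_key = defaultdict(list)
    for name, _kind, sort, columns in indexes:
        by_key[(columns, sort)].append(name)
    return {key: names for key, names in by_key.items() if len(names) > 1}


overlaps = overlapping_indexes(INDEXES)
for (columns, sort), names in overlaps.items():
    print(", ".join(columns), sort, "->", names)
```

Whether either index in a pair can be dropped would need to be confirmed against the actual index definitions.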

Relationships