amz_author_all_books

1298606 rows

Description

This is the Amazon Author All Books table. It has the following fields:

- id: a unique identifier for each row in the table, starting from 1 for each book in alphabetical order of the author’s surname
- order: the page number where this book appears (1 being at the top)
- created_at: the date and time at which this book was added to the table. The column type is timestamptz; values are expressed as ISO-8601 datetimes in UTC, so the equivalent local time in any given timezone will differ from the UTC value.
- author_id: a unique identifier for the author of the book, also starting at 1 and increasing alphabetically by the author’s surname
- book_id: a unique identifier for the particular book that is being described
- media_url_id: a unique identifier for the Media URL associated with this book
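The note on created_at and UTC can be illustrated with a minimal sketch. The timestamp value and the fixed UTC-05:00 offset are hypothetical and chosen only for illustration; they do not come from the table.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical created_at value, written as an ISO-8601 string in UTC.
created_at_utc = datetime.fromisoformat("2024-03-15T12:30:00+00:00")

# Hypothetical local timezone, modelled as a fixed offset of UTC-05:00.
local_tz = timezone(timedelta(hours=-5))

# Same instant, different wall-clock reading.
created_at_local = created_at_utc.astimezone(local_tz)

print(created_at_utc.isoformat())    # 2024-03-15T12:30:00+00:00
print(created_at_local.isoformat())  # 2024-03-15T07:30:00-05:00
```

Both values denote the same instant; only the displayed wall-clock time changes.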

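The table stores information about all books written by any of the authors in Amazon Author All Books. Each row links an author (author_id) to a book (book_id) and records the book’s position (order), its creation date, and the associated Media URL (media_url_id). Details such as the author name or ISBN are not columns of this table; they presumably live in the referenced amz_authors and amz_books tables. The Media URL associated with each book is used to view the cover image on a Kindle device.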

Consider the following scenario: As a Systems Engineer, you are tasked with optimizing queries against the Amazon Author All Books table in MongoDB under the following rules:

1) You may split large tables into smaller ones.
2) Each sub-table can be divided horizontally by either the author_id or the media_url_id field.
3) The division should make queries more efficient and, where possible, save storage space.
4) After division, every sub-table must keep an equivalent index on the same set of attributes (e.g., if the table is split in two on author ID, each sub-table must still contain an “author_id” field).

Question: What strategy would you adopt to divide these tables in order to optimize storage space and query speed? And which sub-tables would you create if applicable?

Apply the property of transitivity: if we assume that dividing by media URL and dividing by author ID have equal implications for overall performance, we should try to minimize the number of these divisions.

Use a direct proof method: consider two possible strategies. The first is to divide horizontally along a single key only, using either an “author”-based or a “media_url”-based division.

Expand on this strategy inductively by considering its limitations. For example, with more than 10,000 books per author, index queries on an author-based division could take a long time due to overlapping searches, whereas a media URL-based division spreads the data across several tables and stores these attributes redundantly.

Apply proof by contradiction: assume that a horizontal split based only on Media URLs performs better than a horizontal split based on author ID.

Using tree-of-thought reasoning, visualize and analyze the impact of each division strategy on query time, index size, storage, and data redundancy for different sizes of books (e.g., 500 words or more).

Finally, having analyzed these possibilities with deductive logic, the most effective solution is to split the table horizontally, primarily into author-based divisions, since this addresses both query efficiency and data redundancy.

Answer: The strategy is to divide the single base table into separate “author”-based or “media_url”-based sub-tables, with an equal distribution among both strategies to minimize potential performance losses from the split.
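As a rough illustration of an author-based horizontal split, the sketch below routes rows to sub-tables by author_id. The modulo routing rule, the partition count of 4, and the sample rows are all assumptions made for the example; the scenario itself does not prescribe them.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

NUM_PARTITIONS = 4  # hypothetical number of sub-tables


def partition_for(author_id: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Pick a sub-table for an author using a simple modulo rule (an assumption)."""
    return author_id % num_partitions


def split_by_author(rows: Iterable[dict],
                    num_partitions: int = NUM_PARTITIONS) -> Dict[int, List[dict]]:
    """Group rows into sub-tables so that all rows of one author stay together."""
    sub_tables: Dict[int, List[dict]] = defaultdict(list)
    for row in rows:
        sub_tables[partition_for(row["author_id"], num_partitions)].append(row)
    return dict(sub_tables)


# Hypothetical sample rows using the column names of the table.
rows = [
    {"id": 1, "author_id": 10, "book_id": 100, "media_url_id": 7},
    {"id": 2, "author_id": 10, "book_id": 101, "media_url_id": 8},
    {"id": 3, "author_id": 11, "book_id": 102, "media_url_id": None},
]

sub_tables = split_by_author(rows)
for partition, part_rows in sorted(sub_tables.items()):
    print(partition, [r["id"] for r in part_rows])
```

Every row keeps its author_id field after the split, in line with rule 4, and a lookup for a single author only needs to touch one sub-table.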

Columns

Column	Type	Size	Nulls	Auto	Default	Children	Parents	Comments
id	int8			√	null
order	int2		√		null
created_at	timestamptz	35,6			null
author_id	int8				null		amz_authors.id (amz_author_all_books_author_id_e71045a0_fk_amz_authors_id)
book_id	int8		√		null		amz_books.id (amz_author_all_books_book_id_f1163b31_fk_amz_books_id)
media_url_id	int8		√		null		amz_media_url.id (amz_author_all_books_media_url_id_5e7ef84c_fk_amz_media_url_id)
check_by_validation	bool				null
in_data_validation	bool				null
status_data_validation	jsonb	2147483647	√		null
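The column list can be mirrored in application code as a typed record. The sketch below is one possible mapping, not the schema's official model: the class name is hypothetical, and the nullability of order, book_id, media_url_id and status_data_validation is an assumption inferred from the √ marks in the listing.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass
class AmzAuthorAllBooksRow:  # hypothetical class name
    # Columns assumed to be non-nullable.
    id: int                      # int8, primary key
    created_at: datetime         # timestamptz
    author_id: int               # int8, references amz_authors.id
    check_by_validation: bool    # bool
    in_data_validation: bool     # bool
    # Columns assumed to be nullable (marked √ in the listing).
    order: Optional[int] = None          # int2
    book_id: Optional[int] = None        # int8, references amz_books.id
    media_url_id: Optional[int] = None   # int8, references amz_media_url.id
    status_data_validation: Optional[Any] = None  # jsonb


# Hypothetical example row with only the required fields set.
row = AmzAuthorAllBooksRow(
    id=1,
    created_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    author_id=10,
    check_by_validation=False,
    in_data_validation=False,
)
print(row)
```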

Indexes

Constraint Name	Type	Sort	Column(s)
amz_author_all_books_pkey	Primary key	Asc	id
amz_author_all_books_author_id_e71045a0	Performance	Asc	author_id
amz_author_all_books_book_id_f1163b31	Performance	Asc	book_id
amz_author_all_books_media_url_id_5e7ef84c	Performance	Asc	media_url_id
idx_author_all_books	Performance	Asc	author_id
idx_book_all_books	Performance	Asc	book_id
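Two pairs of performance indexes in the list above cover the same single column. The sketch below groups the listed indexes by column to surface such overlaps. Whether the overlapping indexes are truly redundant depends on index definitions not shown here (for example, index method or partial predicates), so the result is a hint rather than a verdict.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (constraint name, type, column) copied from the index listing.
INDEXES: List[Tuple[str, str, str]] = [
    ("amz_author_all_books_pkey", "Primary key", "id"),
    ("amz_author_all_books_author_id_e71045a0", "Performance", "author_id"),
    ("amz_author_all_books_book_id_f1163b31", "Performance", "book_id"),
    ("amz_author_all_books_media_url_id_5e7ef84c", "Performance", "media_url_id"),
    ("idx_author_all_books", "Performance", "author_id"),
    ("idx_book_all_books", "Performance", "book_id"),
]


def overlapping_indexes(indexes: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Return columns that are covered by more than one performance index."""
    by_column: Dict[str, List[str]] = defaultdict(list)
    for name, kind, column in indexes:
        if kind == "Performance":
            by_column[column].append(name)
    return {col: names for col, names in by_column.items() if len(names) > 1}


overlaps = overlapping_indexes(INDEXES)
for column, names in overlaps.items():
    print(column, names)
```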

Relationships

Close relationships within degrees of separation
