今際の国の呵呵君: Big Table

A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

Schema:

Rows: Arbitrary string, every read or write under a single row key is atomic. Support single row transaction
Column Families: Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type
Timestamps: Each cell in Big table can contain multiple versions of the same data; these versions are indexed by timestamp. Different version of cells are stored in decreasing order so that the most recent one can be read first. Bigtable can garbage collect it automatically

Tablet: Bigable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.

SSTable:

Sorted String Table is used internally to stored Bigtable data. An SSTable provide a persistent, ordered immutable map from key to values, where both keys and values are arbitrary byte strings
Each SSTable contains a sequence of blocks(typically 64KB in size but this is configurable)
A block index(B+ tree, store a the end of SSTable) is used to load blocks; the index is loaded into memory when the SSTable is opened
Use bloom filter to optimize reading operations, so we can decide if given key might be in the SSTable quickly. Since it is also small, in can be also loaded into memory for each SSTable
If a SSTable is too big, it will be splitted, background process will also be run to merge SSTables to clean old data and reduce space costs.

Memtable: A sorted buffer in memory to keep track of recent changes.

Compactions:

Minor compaction: When the memtable size reaches threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to a SSTable and written into GFS.
Major compaction: A merging operation, which reads the contents of a few SSTables and the memtable, and writes out a new SSTable.

WAL(Write Ahead Log):

Updates are commit to WAL that stores redo records, in case we lose memtable and we can still restore from it
Like SSTable, the log file is also store in GFS, and we only store a single commit log files for all tablet servers
Don't keep separate log for each tablet server simce a very large of files would be written concurrently in GFS and also reduce the effectiveness of the group commit optimizations
But how to identify log for each table server? Sorting the commit log in order of the keys <table, row name, log sequence number>, so we can do binary search and read sequentially in the output. We partition the log file and sort each segment in parallel on different tablet server, the process is coordinate by master.

Scaling(Master-Slave): Sharding based on row key

Master:

Knows which server to go for each row key
Assigning tablets to tablet servers

Ask tablets servers what tables are stored in the server
Scan metadata table to find tables are not assigned(may result from split or merge)
Reassign them to tablet servers

Detecting addition and expiration of table servers
Load Balancing, sharding
Garbage collection

Slaves: Tablet servers

Handle read and write requests
Call GFS to store SSTables to avoid implementation details for replicas, file restore, etc
Local disk can cache the result from GFS
Stores index and bloom filer for each SSTable(it is client side of GFS in this case)
Memtable, write WAL to GFS

Chubby(Distributed lock service):

To ensure there is only one master
To discover tablet servers and finalize table servers death

Tablet server acquire lock in chubby when start and master monitor the server directory to discover active servers
Master detect status of sever by periodically ask the status of its lock, if the server dies, master delete its server file in chubby

To store metadata tables

Write:

Aquire the lock of the key from Chubby and get the server id
Go to corresponding server, insert update into memtable, which will be serialized to SSTable into GFS later
Write commit log into GFS
Once it is done, unlock the key

Read:

Aquire the lock of the key from Chubby and get the server id
Go to corresponding server, check memtable, bloom filter and index
Binary search in SSTable to get the newest value
Return value, unlock the key

今際の国の呵呵君

Sunday, November 18, 2018

Big Table

No comments:

Post a Comment