Sunday, November 18, 2018

Big Table


A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

Schema:

  • Rows: Arbitrary string, every read or write under a single row key is atomic. Support single row transaction
  • Column Families: Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type
  • Timestamps: Each cell in Big table can contain multiple versions of the same data; these versions are indexed by timestamp. Different version of cells are stored in decreasing order so that the most recent one can be read first. Bigtable can garbage collect it automatically
Tablet: Bigable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.

SSTable: 
  • Sorted String Table is used internally to stored Bigtable data. An SSTable provide a persistent, ordered immutable map from key to values, where both keys and values are arbitrary byte strings
  • Each SSTable contains a sequence of blocks(typically 64KB in size but this is configurable)
  • A block index(B+ tree, store a the end of SSTable) is used to load blocks; the index is loaded into memory when the SSTable is opened
  • Use bloom filter to optimize reading operations, so we can decide if given key might be in the SSTable quickly. Since it is also small, in can be also loaded into memory for each SSTable
  • If a SSTable is too big, it will be splitted, background process will also be run to merge SSTables to clean old data and reduce space costs.
Memtable: A sorted buffer in memory to keep track of recent changes.

Compactions:
  • Minor compaction: When the memtable size reaches threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to a SSTable and written into GFS.
  • Major compaction: A merging operation, which reads the contents of a few SSTables and the memtable, and writes out a new SSTable.
WAL(Write Ahead Log):
  • Updates are commit to WAL that stores redo records, in case we lose memtable and we can still restore from it
  • Like SSTable, the log file is also store in GFS, and we only store a single commit log files for all tablet servers
  • Don't keep separate log for each tablet server simce a very large of files would be written concurrently in GFS and also reduce the effectiveness of the group commit optimizations
  • But how to identify log for each table server? Sorting the commit log in order of the keys <table, row name, log sequence number>, so we can do binary search and read sequentially in the output. We partition the log file and sort each segment in parallel on different tablet server, the process is coordinate by master.
Scaling(Master-Slave): Sharding based on row key
  • Master:
    • Knows which server to go for each row key
    • Assigning tablets to tablet servers
      • Ask tablets servers what tables are stored in the server
      • Scan metadata table to find tables are not assigned(may result from split or merge)
      • Reassign them to tablet servers
    • Detecting addition and expiration of table servers
    • Load Balancing, sharding
    • Garbage collection
  • Slaves: Tablet servers
    • Handle read and write requests
    • Call GFS to store SSTables to avoid implementation details for replicas, file restore, etc
    • Local disk can cache the result from GFS
    • Stores index and bloom filer for each SSTable(it is client side of GFS in this case)
    • Memtable, write WAL to GFS
  • Chubby(Distributed lock service):
    • To ensure there is only one master
    • To discover tablet servers and finalize table servers death
      • Tablet server acquire lock in chubby when start and master monitor the server directory to discover active servers
      • Master detect status of sever by periodically ask the status of its lock, if the server dies, master delete its server file in chubby
    • To store metadata tables
Write:
  1. Aquire the lock of the key from Chubby and get the server id
  2. Go to corresponding server, insert update into memtable, which will be serialized to SSTable into GFS later
  3. Write commit log into GFS
  4. Once it is done, unlock the key
Read:
  1. Aquire the lock of the key from Chubby and get the server id
  2. Go to corresponding server, check memtable, bloom filter and index
  3. Binary search in SSTable to get the newest value
  4. Return value, unlock the key

No comments:

Post a Comment