Scenario:
Decentralized, distributed key-value storage service. High availability, able to accommodate different consistency models.
- "always writable" data store
- no requirement to support hierarchical namespaces or complex relational schema
- latency sensitive, avoid routing requests through multiple nodes
APIs: get(key), put(key, object, context)
- get(key): returns a single object, or a list of objects with conflicting versions, along with a context
- put(key, object, context): writes the data to replicas. The context encodes system metadata that is opaque to the caller and includes information such as the version of the object. The context is stored with the object so the system can verify the validity of the context supplied in the put request
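Below is a minimal sketch of what this client-facing interface could look like in Python. The KeyValueStore class, the VersionedObject record, and the Context wrapper are illustrative assumptions, not Dynamo's actual API.

from dataclasses import dataclass
from typing import Any, List, Tuple

@dataclass
class Context:
    """Opaque system metadata; callers pass it back unmodified on put()."""
    vector_clock: dict  # e.g. {node_id: counter}

@dataclass
class VersionedObject:
    value: Any
    context: Context

class KeyValueStore:
    """Illustrative interface only; storage and replication details omitted."""

    def get(self, key: str) -> Tuple[List[VersionedObject], Context]:
        # Returns all conflicting versions plus a context summarizing them;
        # the caller reconciles conflicts and supplies this context on put().
        raise NotImplementedError

    def put(self, key: str, obj: Any, context: Context) -> None:
        # The context tells the system which version(s) this write supersedes.
        raise NotImplementedError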
Key partitions
- Consistent hashing: the key space is treated as a circular space (a ring) and each machine is assigned a node on the ring; a given key is handled by the first node encountered when walking clockwise from the key's position. Each node is thus responsible for the region of the ring between itself and its predecessor node. The advantage is that the departure or arrival of a node only affects its immediate neighbors; other nodes remain unaffected.
- Advanced version: Each node gets assigned multiple virtual nodes on the ring. So when a machine joins or departs, multiple virtual nodes will be added/deleted on the ring
- If a node is unavailable, the load handled by this node is dispersed across remaining machines
- When a node is added, the new node accepts a roughly equivalent amount of load from each of the other available machines
- Heterogeneity: a more capable machine can be assigned more virtual nodes, a weaker machine fewer
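A minimal sketch of a consistent-hash ring with virtual nodes, assuming MD5 as the hash function; the HashRing class and vnodes_per_node parameter are made-up names for illustration.

import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes_per_node = vnodes_per_node
        self._tokens = []   # sorted positions of all virtual nodes on the ring
        self._owner = {}    # token -> physical machine name

    def add_node(self, node: str):
        for i in range(self.vnodes_per_node):
            token = _hash(f"{node}#vn{i}")
            bisect.insort(self._tokens, token)
            self._owner[token] = node

    def remove_node(self, node: str):
        for i in range(self.vnodes_per_node):
            token = _hash(f"{node}#vn{i}")
            self._tokens.remove(token)
            del self._owner[token]

    def lookup(self, key: str) -> str:
        # First virtual node clockwise from the key's position (wrapping around).
        idx = bisect.bisect_right(self._tokens, _hash(key)) % len(self._tokens)
        return self._owner[self._tokens[idx]]

For example, after add_node("machine-A") and add_node("machine-B"), lookup("user:42") returns whichever machine owns the first virtual node clockwise from the key's hash, and a weaker machine could simply be added with a smaller vnodes_per_node.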
Replication: replication is necessary to achieve high availability and durability
- Each key is replicated at N hosts. The coordinator replicates the key at the N - 1 clockwise successor nodes on the ring. This results in a system where each node is responsible for the region of the ring between itself and its Nth predecessor.
- Because of virtual nodes, N successive positions on the ring may map to fewer than N distinct physical machines, so while walking the ring we skip positions whose machine is already included; only distinct physical machines are contained in the preference list
- A concrete example can be found in this article
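Continuing the HashRing sketch above (reusing its _hash helper and the bisect import), a hypothetical preference_list method that walks clockwise collecting N distinct physical machines, skipping duplicate virtual nodes of the same machine:

class ReplicatedHashRing(HashRing):
    def preference_list(self, key: str, n: int) -> list:
        """Walk clockwise from the key, collecting n distinct physical machines."""
        result = []
        start = bisect.bisect_right(self._tokens, _hash(key))
        for i in range(len(self._tokens)):
            token = self._tokens[(start + i) % len(self._tokens)]
            node = self._owner[token]
            if node not in result:        # skip duplicate virtual nodes
                result.append(node)
            if len(result) == n:
                break
        return result  # result[0] is the coordinator, the rest hold replicas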
Conflicts and Data Versioning: Dynamo provides eventual consistency, which allows updates to be propagated to all replicas asynchronously. A subsequent get() may therefore return data that lacks the latest update, and another update on top of that stale data results in a conflict. The propagation may also fail to reach some nodes, which likewise results in conflicting versions across nodes.
- Dynamo treats the result of each modification as a new and immutable version. Most of the time the new version subsumes the previous one. If version branching happens, all conflicting versions that cannot be resolved are kept in the version tree, and the client performs the reconciliation later when it reads the data.
- Capturing conflicts: vector clocks, essentially lists of (node, counter) pairs. One vector clock is associated with every version of an object.
- An update must specify which version it is updating; this version is obtained from the context returned by a previous read. If there are multiple version branches, all leaf versions are returned on read, and an update using that context is considered to reconcile all the conflicting versions, so the branches are collapsed into a single new version.
- Vector clocks can become really large when too many different nodes have handled writes to the same data (usually there won't be many, so this is fine). In that case we simply limit the size of the clock and truncate entries when it reaches the threshold, trading some inefficiency in reconciliation for bounded space.
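A minimal vector-clock sketch, assuming node ids are strings; the class name, method names, and the truncation limit are illustrative, and the truncation rule (dropping the smallest counter) is only an approximation of dropping the oldest entry.

class VectorClock:
    def __init__(self, counters=None, max_entries=10):
        self.counters = dict(counters or {})   # node_id -> counter
        self.max_entries = max_entries

    def increment(self, node_id: str):
        self.counters[node_id] = self.counters.get(node_id, 0) + 1
        if len(self.counters) > self.max_entries:
            # Truncate when the clock grows past the threshold; this may later
            # force a false conflict that the client has to reconcile.
            oldest = min(self.counters, key=self.counters.get)
            del self.counters[oldest]

    def descends(self, other: "VectorClock") -> bool:
        """True if self subsumes other (counter-wise >= on every node)."""
        return all(self.counters.get(n, 0) >= c for n, c in other.counters.items())

    def conflicts_with(self, other: "VectorClock") -> bool:
        return not self.descends(other) and not other.descends(self)

    @staticmethod
    def merge(a: "VectorClock", b: "VectorClock") -> "VectorClock":
        nodes = set(a.counters) | set(b.counters)
        return VectorClock({n: max(a.counters.get(n, 0), b.counters.get(n, 0)) for n in nodes})

On a read, the returned context would carry the merged clock of all leaf versions; a subsequent put with that context produces a clock that descends all of them, which is how the branches collapse into one version.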
Read/Write:
- For a given key, any node holding a replica is allowed to perform read/write operations on that data. But this definitely brings up consistency issues: if you perform a write on server 1 and then read from a replica on server 2 that doesn't have the latest update, we have a problem.
- Quorum-like protocol: with N replicas, W replicas need to be updated successfully before we return success for a write operation, and R replicas need to be read to get the latest version of the data. If R + W > N, every read quorum overlaps every write quorum, so the latest data will always be returned. R and W are of course configurable to fit different scenarios; a common setting is (N, W, R) = (3, 2, 2).
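A tiny self-check of why R + W > N guarantees that read and write quorums overlap; the values of N, W, R are just the example configuration above.

from itertools import combinations

N, W, R = 3, 2, 2   # example configuration; R + W > N

# Every possible read quorum intersects every possible write quorum,
# so at least one replica in any read set has seen the latest successful write.
nodes = range(N)
assert all(set(w) & set(r)
           for w in combinations(nodes, W)
           for r in combinations(nodes, R))

Tuning W and R within this constraint trades write latency and availability against read cost.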
Handle Failures:
- In case of failures, the quorum-like operations are performed on the first R/W healthy nodes encountered when walking the preference list, skipping nodes that are down (a "sloppy quorum")
- If a node in the first-N list is temporarily down, the corresponding replica is sent to the (N+1)th node instead. The replica sent to the (N+1)th node carries a hint in its metadata indicating which node it was intended for, and it is stored in a separate local database. Once the down node comes back alive, the (N+1)th node tries to deliver the replica back to it; once that succeeds, it deletes the data (hinted handoff, sketched after this list)
- If a node is permanently down, replica synchronization is required. To detect inconsistencies between replicas and minimize the amount of transferred data, an anti-entropy protocol is used to keep the replicas synchronized. Dynamo uses Merkle trees: a Merkle tree is a hash tree whose leaves are hashes of the values of individual keys and whose parent nodes are hashes of their respective children. This allows us to compare subsets of the data instead of the entire set.
- Each node maintains a separate Merkle tree for each key range it hosts. Two nodes exchange the roots of the Merkle trees for the key ranges they host in common, then use tree traversal to find the differences and perform synchronization (see the sketch after this list)
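First, a rough sketch of the hinted-handoff idea described above, assuming a simple in-memory per-node hint store; the Node class and its method names are illustrative.

class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}    # primary data:  key -> value
        self.hints = {}    # hinted data:   intended_node -> {key: value}
        self.alive = True

    def write(self, key, value, intended_node=None):
        if intended_node is None:
            self.store[key] = value
        else:
            # Hinted replicas are kept in a separate local database.
            self.hints.setdefault(intended_node, {})[key] = value

    def deliver_hints(self, nodes_by_name):
        for target_name, data in list(self.hints.items()):
            target = nodes_by_name[target_name]
            if target.alive:
                for k, v in data.items():
                    target.write(k, v)
                del self.hints[target_name]   # delete only after successful handoff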
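Second, a minimal Merkle-tree comparison for anti-entropy: two replicas build hash trees over the same sorted key range and walk them top-down, descending only into subtrees whose hashes differ. The binary layout over sorted keys is a simplification for illustration, not Dynamo's exact key-range scheme.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(items):
    """items: sorted list of (key, value). Returns (hash, node), where node is
    either ('leaf', key, value_hash) or ('inner', left_subtree, right_subtree)."""
    if len(items) == 1:
        key, value = items[0]
        leaf = ("leaf", key, h(value.encode()))
        return h(key.encode() + leaf[2]), leaf
    mid = len(items) // 2
    lh, left = build_tree(items[:mid])
    rh, right = build_tree(items[mid:])
    return h(lh + rh), ("inner", (lh, left), (rh, right))

def diff(a, b, out):
    """Collect keys whose hashes differ between two trees of the same shape."""
    (ha, na), (hb, nb) = a, b
    if ha == hb:
        return                      # identical subtree: skip it entirely
    if na[0] == "leaf":
        out.add(na[1])
    else:
        diff(na[1], nb[1], out)
        diff(na[2], nb[2], out)

# Example: only the differing key needs to be transferred.
replica1 = [("a", "1"), ("b", "2"), ("c", "3"), ("d", "4")]
replica2 = [("a", "1"), ("b", "2"), ("c", "3x"), ("d", "4")]
changed = set()
diff(build_tree(replica1), build_tree(replica2), changed)
print(changed)   # {'c'}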
Membership and Failure Detection
- A gossip-based protocol propagates membership changes and maintains an eventually consistent view of membership.
- Seed nodes bootstrap the gossip process for new nodes joining the cluster; they are known to all nodes
- Node A may consider node B failed if B doesn't respond to its messages. A then uses alternative nodes to serve requests that map to B's range and periodically retries B to check for later recovery.
- For more information on gossip, please check this article.
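A toy sketch of gossip-style membership, assuming each node tracks a heartbeat counter per peer and periodically merges its view with one randomly chosen peer; the class and method names are made up for illustration.

import random

class GossipNode:
    def __init__(self, name, seeds=()):
        self.name = name
        # Membership view: node name -> highest heartbeat counter seen.
        self.view = {name: 0}
        self.view.update({s: 0 for s in seeds})   # seed nodes bootstrap the view

    def tick(self):
        self.view[self.name] += 1                 # bump our own heartbeat

    def gossip_with(self, peer: "GossipNode"):
        # Each side keeps the highest heartbeat seen per node, so membership
        # information spreads epidemically through pairwise exchanges.
        merged = {n: max(self.view.get(n, 0), peer.view.get(n, 0))
                  for n in set(self.view) | set(peer.view)}
        self.view = dict(merged)
        peer.view = dict(merged)

    def gossip_round(self, peers):
        # Called periodically: exchange state with one random peer.
        if peers:
            self.gossip_with(random.choice(peers))

A node whose heartbeat stops increasing for long enough can then be treated as failed, matching the failure-detection behavior described above.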