Monday, December 24, 2018

[System Design] Scaling Memcache

Scenario:

A distributed look-aside cache service that supports highly concurrent read queries and keeps data consistent across write/delete operations.

APIs: read(key), write(key, value), delete(key)

  • read: check the cache first; if the key exists, return it; otherwise retrieve the value from the database and populate the cache
  • write: issue the SQL statement, then send a delete request to memcache. Delete is idempotent and minimizes the probability of serving stale data (see the sketch below)
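
A minimal look-aside sketch of these two paths, assuming a generic memcached-style client (get/set/delete) and a database client (query/update); the class and method names here are illustrative, not Facebook's actual API:

    class LookAsideCache:
        # Look-aside caching: the application, not the cache, talks to the database.
        def __init__(self, cache, db):
            self.cache = cache   # memcached-style client: get/set/delete
            self.db = db         # storage layer: query/update

        def read(self, key):
            value = self.cache.get(key)
            if value is not None:
                return value                 # cache hit
            value = self.db.query(key)       # cache miss: go to the database
            self.cache.set(key, value)       # populate the cache for later reads
            return value

        def write(self, key, value):
            self.db.update(key, value)       # write the source of truth first
            self.cache.delete(key)           # invalidate; idempotent, safe to retry
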
Cluster: consists of thousands of machines, including front-end (web) nodes, memcache nodes and storage nodes.

In Cluster:
  • Items are distributed across the memcached servers through consistent hashing (a hash-ring sketch appears at the end of this section)
  • All-to-all communication: every web server can talk to every memcache server
  • Reduce latency:
    • Parallel requests: construct a DAG representing the data dependencies and fetch independent items concurrently
    • Client-server communication: memcached servers don't talk to each other. The client keeps track of all available servers and handles serialization, compression, request routing, error handling and request batching; this logic is provided either as a library embedded in the application or as a standalone proxy named mcrouter
    • Use UDP and TCP
      • UDP for get operations: the client detects errors via sequence numbers but doesn't try to recover from them; a get error is treated as a cache miss, and the result is not inserted into the cache since the value may already be there and we don't want to add load. UDP is connectionless (no connection establishment or per-connection state), so it is cheaper and faster, and gets are by far the most common request
      • TCP for write (set) and delete requests, since we need those state-changing requests to succeed
      • Keeping TCP connections between every web thread and every memcached server is expensive, so use connection coalescing (reuse an existing connection when possible) to reduce the resources needed
    • All-to-all communication can lead to incast congestion; use a sliding-window mechanism similar to TCP's, with the window size tuned to achieve the best performance
  • Reducing Load:
    • Stale sets and thundering herds
      • Stale sets: imagine key k is not cached; we retrieve it from the DB (1) and then try to set it in the cache (2). If a write request arrives between (1) and (2) and invalidates the key before (2), we end up with a stale value sitting in the cache.
      • Thundering herds: a specific hot key undergoes heavy read and write activity. Writes keep invalidating the cached entry, and many concurrent reads fall through to the database layer, which is a much more expensive path.
      • Memcached uses leases to handle the stale-set problem: on a cache miss it hands the client a lease token that must be presented to set the value back into the cache, which guarantees that only one client is involved in rebuilding the entry. If a delete request comes in before the set, the lease is invalidated and the subsequent lease-set is rejected as stale (a lease sketch appears at the end of this section).
      • The same mechanism mitigates thundering herds: the server can be configured to return a lease token at most once every 10 seconds per key. Other reads arriving within that window get a "hot miss" with slightly stale data: when an item is deleted, it is moved to a data structure holding recently deleted items, where it lives for a short time before being flushed. The client can decide whether to use the stale value or retry later with exponential backoff.
  • Memcache Pools
    • Data differs: some data is accessed often but a cache miss on it is inexpensive, while other data is accessed infrequently but a miss is very expensive.
    • Putting both kinds of data in the same pool causes problems: the high-churn keys keep evicting the low-churn keys, and the expensive cache rebuilds happen again and again.
    • So we put different classes of keys into different pools.
  • Replication within pools
    • Partitioning the key space doesn't help, because requests are usually batched: with 1M requests, each containing 100 keys, splitting the keys (say 40 keys to server 1 and 60 keys to server 2) still leaves both servers handling 1M requests.
    • Replication helps, since we can send a request to any of the replicas. But we don't want to replicate all keys, because that would be memory-inefficient.
    • We choose to replicate a category of keys within a pool when
      • the application routinely fetches many keys simultaneously
      • the entire dataset fits in one or two memcached servers
      • request rate is much higher than what a single server can handle
  • Gutter pool to handle failures
    • Dedicate a small set of machines, named Gutter, to take over the responsibilities of a few failed servers.
    • When a client receives no response to its get request, it assumes the server has failed and reissues the request to the Gutter pool. If that also misses, the client queries the database and inserts the retrieved data into the Gutter machine. Data in Gutter expires quickly to limit staleness (a failover sketch appears at the end of this section).
    • Why not rehash keys to the remaining memcached servers? A single key can be hot and account for 20% of a server's requests, so rehashing risks overloading the remaining servers; shunting the load to otherwise idle Gutter machines limits that risk.
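
As referenced above, items are mapped to memcached servers with consistent hashing. A minimal hash-ring sketch, with virtual nodes to smooth the key distribution; the real routing logic in mcrouter is far more involved:

    import bisect
    import hashlib

    class HashRing:
        def __init__(self, servers, vnodes=100):
            # ring of (hash, server) pairs, sorted by hash position
            self._ring = sorted(
                (self._hash(f"{server}#{i}"), server)
                for server in servers
                for i in range(vnodes)
            )

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def server_for(self, key):
            # first ring position clockwise from the key's hash, wrapping around
            idx = bisect.bisect(self._ring, (self._hash(key), "")) % len(self._ring)
            return self._ring[idx][1]

    ring = HashRing(["mc1", "mc2", "mc3"])
    print(ring.server_for("user:42"))   # every client maps the key to the same server

Adding or removing one server only remaps the keys on its portion of the ring, which keeps cache churn limited when the fleet changes.
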
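Below is a toy, single-process sketch of memcached-style leases to illustrate the stale-set and thundering-herd protection described above; the class and method names (LeaseCache, lease_set) are illustrative, not memcached's real wire protocol:

    import time
    import uuid

    class LeaseCache:
        def __init__(self, lease_interval=10.0):
            self._data = {}              # key -> value
            self._leases = {}            # key -> (token, issue_time)
            self._lease_interval = lease_interval

        def get(self, key):
            if key in self._data:
                return self._data[key], None            # hit
            token, issued = self._leases.get(key, (None, 0.0))
            if token is None or time.time() - issued >= self._lease_interval:
                token = uuid.uuid4().hex                # grant a lease to this client
                self._leases[key] = (token, time.time())
                return None, token                      # miss: the caller should refill the cache
            return None, None                           # "hot miss": someone else is refilling

        def lease_set(self, key, value, token):
            current = self._leases.get(key)
            if current is None or current[0] != token:
                return False                            # a delete invalidated the lease: drop the stale set
            self._data[key] = value
            del self._leases[key]
            return True

        def delete(self, key):
            self._data.pop(key, None)
            self._leases.pop(key, None)                 # any outstanding lease token becomes stale

Only the client holding the token can repopulate the key, and at most one token is handed out per key per lease interval, so concurrent misses on a hot key don't all stampede the database.
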
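And a sketch of the Gutter failover path, assuming primary and gutter are memcached-style clients and db is the storage layer; the helper name, timeout value and expiry are assumptions for illustration:

    GUTTER_TTL = 10   # seconds; Gutter entries expire quickly to limit staleness

    def get_with_gutter(key, primary, gutter, db):
        try:
            value = primary.get(key, timeout=0.05)
        except TimeoutError:
            # no response: assume the server has failed and fall through to Gutter
            pass
        else:
            if value is not None:
                return value              # normal hit
            value = db.query(key)         # normal miss: fill through the usual path
            primary.set(key, value)
            return value
        # failover: consult the Gutter pool instead of rehashing to the remaining servers
        value = gutter.get(key)
        if value is None:
            value = db.query(key)
            gutter.set(key, value, expire=GUTTER_TTL)
        return value
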
Region: web and memcached servers are split into multiple frontend clusters; these clusters, together with a storage cluster containing the databases, define a region
  • Regional invalidations: the storage cluster holds the source of truth, but we still need to invalidate data across all frontend clusters' memcached servers to reduce the amount of time stale data exists.
    • Mcsqueal: invalidation daemons run on every database, monitor the SQL statements the database commits, and broadcast the corresponding deletes via mcrouter (a sketch appears at the end of this Region section)
    • To reduce the packet rate, mcsqueal batches requests and sends them to dedicated servers running mcrouter, which unpack the batches and re-route each delete to the corresponding memcached servers
  • Regional Pools
    • If users' requests are routed randomly to all available frontend clusters, the cached data ends up roughly the same across clusters. That over-replication is memory-inefficient; we can reduce the number of replicas by having multiple frontend clusters share the same set of memcached servers, which we call a regional pool
    • For some data it is more cost-efficient to forgo the advantage of replication and keep a single copy per region, so we place it in the regional pool.
  • Cold cluster warmup
    • Allow clients in the cold cluster to retrieve data from a warm cluster rather than from persistent storage, taking advantage of the data replication that already exists across frontend clusters
    • Consistency issue: the database is updated via the cold cluster, then we retrieve the value for the same key from a warm cluster before the invalidation broadcast has arrived. Memcached's delete supports a non-zero hold-off time during which add operations on the key are rejected; a failed add indicates newer data is in the database, so the client just retrieves it from there (see the sketch after this section)
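
A rough sketch of the mcsqueal idea: a daemon tails the database commit log, extracts the memcache keys embedded in committed statements, batches the deletes, and hands them to mcrouter for fan-out. The comment convention for embedding keys and the names here (tail_commit_log, delete_batch) are assumptions for illustration:

    import re

    DELETE_BATCH_SIZE = 64

    def extract_invalidations(sql_statement):
        # assume each transaction embeds the memcache keys it invalidates,
        # e.g. "UPDATE ... /* mc-invalidate: user:42,post:7 */"
        match = re.search(r"mc-invalidate:\s*([\w:,]+)", sql_statement)
        return match.group(1).split(",") if match else []

    def run_mcsqueal(tail_commit_log, mcrouter):
        batch = []
        for statement in tail_commit_log():        # committed statements only
            batch.extend(extract_invalidations(statement))
            if len(batch) >= DELETE_BATCH_SIZE:
                mcrouter.delete_batch(batch)       # mcrouter fans the deletes out to the right servers
                batch = []
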
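And a sketch of the cold-cluster read path during warmup. Here add is memcached's "store only if not present" operation; the surrounding client code and the two-cache setup are illustrative:

    def cold_get(key, cold_cache, warm_cache, db):
        value = cold_cache.get(key)
        if value is not None:
            return value
        value = warm_cache.get(key)        # ask the warm cluster instead of the database
        if value is None:
            value = db.query(key)
        # add fails if the key exists or was deleted within the hold-off window;
        # a failed add means newer data is in the database, so read it from there
        if cold_cache.add(key, value):
            return value
        return db.query(key)
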
Consistency across regions:
  • Designate one region to hold the master databases and the other regions to contain read-only replicas. The consistency issue comes from replica databases lagging behind the master.
  • Writes from the master region: consider a web server in the master region that writes to the database and then invalidates the stale data in memcache. Doing so within the master region is safe, since the database already contains the latest data and subsequent reads will see it. Letting that web server issue invalidations in replica regions is dangerous, though, because the update may not have propagated to the replica databases yet. The mcsqueal approach solves this elegantly: all invalidation requests come from the database layer, so a replica DB first commits the replicated operation from the master and only then broadcasts the invalidation within its own region. We may serve slightly stale data in replica regions in the meantime, but we avoid the race; this is why invalidations are issued from the DB layer instead of the web-server layer.
  • Writes from a non-master region: this raises the stale-data issue where a client writes via a replica region but the replica DBs haven't caught up with the master yet, so subsequent reads would return stale data. Remote markers solve this (a sketch appears after this list):
    • when updating key k from a replica region, also set a remote marker rk in that region
    • the SQL statement sent to the master embeds k and rk so that both get invalidated
    • delete k in the replica region
    • whenever a request for the key arrives, if k doesn't exist but rk is present, redirect the request to the master region
    • later, when the replica DBs catch up with the update, the replicated invalidation deletes rk
    • the remote marker rk is stored in the regional pool, where all frontend clusters share a single copy
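
A sketch of the remote-marker flow for a write issued from a non-master region. The rk naming follows the list above; the clients (regional_pool, master_db, replica_db) and the SQL comment convention for embedding keys are assumptions carried over from the mcsqueal sketch:

    def remote_marker_key(key):
        return f"rk:{key}"

    def write_from_replica_region(key, value, local_cache, regional_pool, master_db):
        rk = remote_marker_key(key)
        regional_pool.set(rk, 1)                               # 1. set the remote marker
        master_db.execute(
            f"UPDATE ... /* mc-invalidate: {key},{rk} */"      # 2. embed k and rk in the SQL (placeholder statement)
        )
        local_cache.delete(key)                                # 3. delete k in the local cluster

    def read_from_replica_region(key, local_cache, regional_pool, replica_db, master_db):
        value = local_cache.get(key)
        if value is not None:
            return value
        if regional_pool.get(remote_marker_key(key)):          # replica DB may still be behind
            return master_db.query(key)                        # redirect to the master region
        return replica_db.query(key)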
