今際の国の呵呵君: GFS

Google File System is a distributed file system service which supports large data read, write and deletion.

Chunk: Each file are splitted into chunks of size 64MB and stored in GFS

Why 64MB:

If too small:

Two many metadata to store
Clients need to interact with master too many times when it read/write across many chunks

If too large:

Internal fragmentation
Hot spot if too many clients access the same chunk

Master-Slave pattern:

Master:

Store metadata, 64B for 64MB chunk

chunk namespaces, tree like structure, logic representation, file is not actually stored in this path, persistent on disk
mapping from file to chunks, persistent on disk
location of each chunk's replica, don't persist on disk, ask servers when master startup and heartbeat

Namespace management and locking, GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. Each node in the namespace tree has an associate read-write lock(only need read lock for parent directory when create file)
Replica placement: 2 close, 1 far
Creation, Re-replication, Rebalancing

Creation

chunkserver with below-average disk space utilization
limit the number of recent creations on each chunkserver
spread replica across rack

Master re-replicate once number of replicas fall below user-specific goal, this can happen when chunkserver goes down, error happens when write replica, etc. Chunkserver simply as the server which has the correct replica and persist the data on disk
Master rebalancing replicas periodically for better disk space and load balancing

Garbage Collection, master will run garbage collection periodically to clean files below

Deleted files, which will be renamed to hidden files
Unknown files, which is not in metadata. This can happen when creation fails, on some chunkservers, deletion message lost, etc.
Stale replicas, For each chunk, master maintains chunk version number to detect stale replicas, this can happen if a chunkserver fails and misses mutations to the chunk while it is down

Server:

Store chunk replicas
Store chunk metadata and index to search a given chunk in disk
Operation log for metadata change in master, master checkpoints for master recovery. The shadow master can also be initialized based on those information once master is down and can't recover from it

Operation log contains historical record of critical metadata changes. Also serves as a logical time line that defines the order of concurrent operations.

Consistency:

Multiple producer one consumer queue for concurrently mutations
Lease:

Master grants chunk lease to one of the replicas, which is called primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations.

Client asks the master which chunkserver holds the current lease and the location of other replica. Client will cache those information
Client push data to all replicas. Data flow is decoupled from control flow, so the performance can be improved by schedule the data flow based on network topology
Once all replicas receive the data, client send request to primary. Primary pick the order of all pending mutations and apply it to its own local state
The primary forwards the write request to all secondary replicas, all replicas follow the same order as primary
Secondaries reply to primary once the finish. Primary reply to client.
If there is error, retry, and delete corrupted data

Data Integrity:

Checksum:

When write, calculate and write the checksum when writing current chunk
When read, check all checksum of chunks which are covered by current reading range, even though only part of it is covered
Chunkserver can also periodically run process to check it

Read:

Client ask master of the locations of the chunks which consist of the file and cache it
Go to one of the server which has the replica
Check checksum for each chunk and read it

Write:

Client asks the master which chunkserver holds the current lease and the location of other replica. Client will cache those information
Follow steps above to make changes to primary and secondary replicas
Master update metadata if necessary

今際の国の呵呵君

Monday, November 19, 2018

GFS

No comments:

Post a Comment