Google File System is a distributed file system service which supports large data read, write and deletion.
Chunk: Each file are splitted into chunks of size 64MB and stored in GFS
Why 64MB:
- If too small:
- Two many metadata to store
- Clients need to interact with master too many times when it read/write across many chunks
- If too large:
- Internal fragmentation
- Hot spot if too many clients access the same chunk
Master-Slave pattern:
- Master:
- Store metadata, 64B for 64MB chunk
- chunk namespaces, tree like structure, logic representation, file is not actually stored in this path, persistent on disk
- mapping from file to chunks, persistent on disk
- location of each chunk's replica, don't persist on disk, ask servers when master startup and heartbeat
- Namespace management and locking, GFS logically represents its namespace as a lookup table mapping full pathnames to metadata. Each node in the namespace tree has an associate read-write lock(only need read lock for parent directory when create file)
- Replica placement: 2 close, 1 far
- Creation, Re-replication, Rebalancing
- Creation
- chunkserver with below-average disk space utilization
- limit the number of recent creations on each chunkserver
- spread replica across rack
- Master re-replicate once number of replicas fall below user-specific goal, this can happen when chunkserver goes down, error happens when write replica, etc. Chunkserver simply as the server which has the correct replica and persist the data on disk
- Master rebalancing replicas periodically for better disk space and load balancing
- Garbage Collection, master will run garbage collection periodically to clean files below
- Deleted files, which will be renamed to hidden files
- Unknown files, which is not in metadata. This can happen when creation fails, on some chunkservers, deletion message lost, etc.
- Stale replicas, For each chunk, master maintains chunk version number to detect stale replicas, this can happen if a chunkserver fails and misses mutations to the chunk while it is down
- Server:
- Store chunk replicas
- Store chunk metadata and index to search a given chunk in disk
- Operation log for metadata change in master, master checkpoints for master recovery. The shadow master can also be initialized based on those information once master is down and can't recover from it
- Operation log contains historical record of critical metadata changes. Also serves as a logical time line that defines the order of concurrent operations.
Consistency:
- Multiple producer one consumer queue for concurrently mutations
- Lease:
- Master grants chunk lease to one of the replicas, which is called primary. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations.
- Client asks the master which chunkserver holds the current lease and the location of other replica. Client will cache those information
- Client push data to all replicas. Data flow is decoupled from control flow, so the performance can be improved by schedule the data flow based on network topology
- Once all replicas receive the data, client send request to primary. Primary pick the order of all pending mutations and apply it to its own local state
- The primary forwards the write request to all secondary replicas, all replicas follow the same order as primary
- Secondaries reply to primary once the finish. Primary reply to client.
- If there is error, retry, and delete corrupted data
Data Integrity:
- Checksum:
- When write, calculate and write the checksum when writing current chunk
- When read, check all checksum of chunks which are covered by current reading range, even though only part of it is covered
- Chunkserver can also periodically run process to check it
Read:
- Client ask master of the locations of the chunks which consist of the file and cache it
- Go to one of the server which has the replica
- Check checksum for each chunk and read it
Write:
- Client asks the master which chunkserver holds the current lease and the location of other replica. Client will cache those information
- Follow steps above to make changes to primary and secondary replicas
- Master update metadata if necessary
No comments:
Post a Comment