Tuesday, November 20, 2018

MapReduce


MapReduce: a programming framework for processing and generating large data sets.


Master-Slave pattern:

  • Master:
    • Task state (idle, in-progress, completed)
    • Identity of the worker machine assigned to each non-idle task
    • Keeps track of worker status via periodic heartbeats. If a worker fails to respond, any map task it completed is reset to idle, because the intermediate output sits on the failed machine's now-unreachable local disk; any in-progress map or reduce task on it is also reset to idle. Completed reduce tasks are fine, since their output is already stored in GFS. (A bookkeeping sketch follows this list.)
    • If the master dies, a new copy can be started from the last checkpointed state
  • Worker:
    • Performs map and reduce tasks
    • Atomicity
      • If the master receives a completion message for an already completed map task, it ignores it
      • The reduce worker atomically renames its temporary output file to the final output file, relying on the atomic rename operation provided by GFS
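
A minimal bookkeeping sketch of the master described above, assuming an in-memory task table and a heartbeat timeout; the names Task, Master, and HEARTBEAT_TIMEOUT are illustrative, not from the paper:

import time
from dataclasses import dataclass, field
from typing import Dict, Optional

HEARTBEAT_TIMEOUT = 10.0  # seconds; an illustrative value, not from the paper

@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # idle | in-progress | completed
    worker: Optional[str] = None  # identity of the assigned worker machine

@dataclass
class Master:
    tasks: Dict[int, Task] = field(default_factory=dict)
    last_heartbeat: Dict[str, float] = field(default_factory=dict)

    def on_heartbeat(self, worker: str) -> None:
        # Workers ping the master periodically; record the latest time seen.
        self.last_heartbeat[worker] = time.time()

    def check_workers(self) -> None:
        # Called periodically by the master's control loop.
        now = time.time()
        for worker, seen in list(self.last_heartbeat.items()):
            if now - seen > HEARTBEAT_TIMEOUT:
                self._handle_failure(worker)

    def _handle_failure(self, failed: str) -> None:
        del self.last_heartbeat[failed]
        for task in self.tasks.values():
            if task.worker != failed:
                continue
            if task.kind == "map":
                # The failed machine's local disk is unreachable, so even
                # *completed* map tasks must be reset and re-executed.
                task.state, task.worker = "idle", None
            elif task.state == "in-progress":
                # Completed reduce output already lives in GFS, so only
                # in-progress reduce tasks are reset.
                task.state, task.worker = "idle", None

The worker-side atomicity pairs with this: a reduce task writes to a temporary file and makes it visible with one atomic rename (GFS's rename in the paper; os.replace would be the local-filesystem analogue), so a crash never exposes a partial output file.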
Workflow:
  1. The input file (usually in GFS) is split into M pieces. The master picks idle workers and assigns each one a map task or a reduce task. The master tends to assign a map task to a machine that already holds one of the input split's replicas (data locality).
  2. A worker assigned a map task reads the contents of its split, parses key/value pairs out of the input, passes each pair to the user-defined map function, and buffers the resulting intermediate pairs in memory.
  3. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function (by default hash(key) mod R). The locations of these files are passed back to the master.
  4. When the master notifies a reduce worker of those locations, the reducer uses RPCs to read the buffered data from the map workers' local disks. It then sorts the data by the intermediate keys so that all occurrences of the same key are grouped together (an external sort is used when the data is too large to fit in memory).
  5. The reduce worker iterates over the sorted intermediate data and, for each unique key, passes the key and the corresponding set of values to the user's reduce function, whose output is appended to a final output file in GFS. (A condensed single-process sketch of these steps follows.)
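
The five steps condense into a single-process sketch; plain Python lists stand in for GFS files and local disks, and hash(key) % R stands in for the default partitioning function (run_mapreduce and its parameters are illustrative names):

from itertools import groupby

def run_mapreduce(records, map_fn, reduce_fn, M=4, R=2):
    # 1. Split the input into M pieces (stand-in for GFS input splits).
    splits = [records[i::M] for i in range(M)]

    # 2-3. Each map task calls the user map function on every record and
    #      partitions its output into R regions with hash(key) % R
    #      (stand-in for R files on the map worker's local disk).
    regions = [[[] for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for record in split:
            for key, value in map_fn(record):
                regions[m][hash(key) % R].append((key, value))

    # 4-5. Each reduce task reads its region from every map task's output,
    #      sorts by key so identical keys are grouped, and calls the user
    #      reduce function once per key (the result list stands in for a
    #      final output file in GFS).
    output = []
    for r in range(R):
        pairs = sorted(
            (kv for m in range(M) for kv in regions[m][r]),
            key=lambda kv: kv[0],
        )
        for key, group in groupby(pairs, key=lambda kv: kv[0]):
            output.append((key, reduce_fn(key, [v for _, v in group])))
    return output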

Example Code:
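
A word-count sketch, the canonical example from the MapReduce paper (which presents it as C++ pseudo-code); it plugs into the hypothetical run_mapreduce helper sketched above:

def word_count_map(line):
    # map: for each word in the input line, emit (word, 1)
    for word in line.split():
        yield word, 1

def word_count_reduce(word, counts):
    # reduce: sum the counts emitted for the same word
    return sum(counts)

if __name__ == "__main__":
    lines = ["the quick brown fox", "the lazy dog", "the fox"]
    results = run_mapreduce(lines, word_count_map, word_count_reduce)
    for word, count in sorted(results):
        print(word, count)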

