MapReduce is the data processing layer of Hadoop. It is a programming model that processes huge amounts of data in parallel by dividing a job into a set of independent tasks. The MapReduce data flow describes how data moves through the various phases of a MapReduce job. In this article, you will explore how data flows in Hadoop MapReduce.
This article walks through the complete Hadoop MapReduce data flow. It covers the phases of MapReduce job execution: Input Files, InputSplit, RecordReader, Mapper, Shuffling and Sorting, and Reducer.
Before diving into the data flow, let us start with a short introduction to MapReduce and some related terminology.
What is MapReduce?
MapReduce is the Hadoop data processing layer. It is a programming model designed for processing large data sets in parallel by splitting a job into a set of independent tasks. A MapReduce job is a unit of work that a client wants to perform; it consists of the input data, the MapReduce program, and configuration information. The data processed by MapReduce is stored in the Hadoop Distributed File System (HDFS).
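For example, a minimal driver sketch for the classic word-count job shows all three pieces together: the input data path in HDFS, the MapReduce program (the illustrative WordCountMapper and WordCountReducer classes, sketched later in the Mapper and Reducer sections), and the configuration object.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();             // configuration information
    Job job = Job.getInstance(conf, "word count");        // the MapReduce job
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);            // the MapReduce program...
    job.setReducerClass(WordCountReducer.class);          // ...(map and reduce logic)
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input data in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // final output in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```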
MapReduce Data Flow
The figure below depicts the data flow in Hadoop MapReduce. Apache Hadoop runs a MapReduce job by dividing it into tasks.
Let us see how data flows through the various phases of Hadoop MapReduce to process incoming data in a parallel and distributed manner.
1. Input Files
Input files store the data to be processed by a MapReduce program. They reside in HDFS and can have an arbitrary format, such as line-based log files or binary files. Each input file consists of multiple records, and it is the input to the MapReduce program.
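For example, the driver could state the input format explicitly; this is a hedged fragment that assumes the job object from the earlier driver sketch:

```java
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
// import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat; // for binary input

// Line-based text files (e.g. log files); TextInputFormat is also Hadoop's default.
job.setInputFormatClass(TextInputFormat.class);
```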
2. Input Split
The input is divided into InputSplits: logical chunks of the input, each of which is processed by a single Mapper.
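By default, each HDFS block of the input typically becomes one split. A hedged sketch of how the logical split size can be bounded through the standard FileInputFormat configuration properties (the sizes below are arbitrary example values):

```java
import org.apache.hadoop.conf.Configuration;

// FileInputFormat honours these bounds when it computes logical splits;
// by default the split size tracks the HDFS block size.
Configuration conf = new Configuration();
conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024); // 128 MB
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024); // 256 MB
```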
3. RecordReader
After the InputSplit comes the RecordReader. It converts the records in a split into key-value pairs, because the mapper can read and process only key-value pairs.
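For example, with the default TextInputFormat, the LineRecordReader turns each line of a split into one key-value pair, roughly as below:

```java
// key   = byte offset of the line within the file (LongWritable)
// value = the contents of the line                (Text)
//
// Input line at byte offset 0:   "hadoop mapreduce hadoop"
// Pair handed to the mapper:     (0, "hadoop mapreduce hadoop")
```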
4. Mapper
The key-value pairs produced by the RecordReader are passed to the mapper, which processes them. Each mapper processes one input split at a time, and developers write their business logic in the mapper. Mappers run in parallel across the machines in the cluster, and each mapper produces intermediate output that is stored on the local disk of its node, not in HDFS.
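A minimal sketch of a mapper, using the word-count example assumed throughout this article: it consumes the (offset, line) pairs from the RecordReader and emits an intermediate (word, 1) pair for every word.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Business logic: split the line into words and emit (word, 1) for each.
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, ONE);  // intermediate output, written to local disk
      }
    }
  }
}
```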
5. Shuffling and Sorting
The intermediate outputs from the mappers are then shuffled to the reducer nodes. (A reducer node is a normal slave node; it is called a reducer node because it runs the reduce phase.) Shuffling refers to the physical movement of this data over the network.
Once all mappers finish their tasks and their output has been shuffled to the reducer nodes, the intermediate output is merged and sorted. Sorting is done on the keys, so that all values belonging to the same key are grouped together. After shuffling and sorting, this output is made available to the reducer.
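A small worked illustration of shuffle and sort for the word-count example; the combiner line at the end is an optional optimization, not a required part of the flow:

```java
// Mapper 1 emits: ("hadoop", 1), ("mapreduce", 1)
// Mapper 2 emits: ("hadoop", 1), ("hdfs", 1)
//
// After shuffling by key and sorting, the reducer sees grouped pairs:
//   ("hadoop",    [1, 1])
//   ("hdfs",      [1])
//   ("mapreduce", [1])
//
// Optionally, a combiner can pre-aggregate map output before it crosses the
// network; for word count the reducer class itself can serve as the combiner:
// job.setCombinerClass(WordCountReducer.class);
```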
6. Reducer
The reducer executes the reduce function on the shuffled and sorted output received from the mappers to generate the final output. This final output is written to HDFS.
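A minimal sketch of the matching word-count reducer: it receives each word together with all the counts the mappers emitted for it and writes the final (word, total) pair to HDFS.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum all the counts shuffled and sorted onto this key.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    total.set(sum);
    context.write(key, total);  // final output, stored in HDFS
  }
}
```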
Summary
I hope that after reading this article you clearly understand how data flows through the different phases in Hadoop MapReduce. To summarize, the data flow in Hadoop MapReduce is the combination of several phases: Input Files, InputSplit, RecordReader, Mapper, Shuffling and Sorting, and Reducer. All these components play a vital role in how MapReduce works.