MapReduce is a big data processing technique and a programming model for implementing that technique. Its goal is to sort and filter massive amounts of data into smaller subsets, then distribute those subsets to computing nodes, which process the filtered data in parallel.
In data analytics, this general approach is called split-apply-combine.
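As a rough illustration of split-apply-combine on a single machine, the sketch below groups a small list of records, summarizes each group independently, and gathers the results. The `sales` records and the region names are hypothetical sample data, not taken from any real system.

```python
from collections import defaultdict

# Hypothetical records: (region, sale amount).
sales = [("north", 120), ("south", 80), ("north", 50), ("south", 30), ("east", 200)]

# Split: partition the records into one group per region.
groups = defaultdict(list)
for region, amount in sales:
    groups[region].append(amount)

# Apply: compute a summary independently for each group.
totals = {region: sum(amounts) for region, amounts in groups.items()}

# Combine: gather the per-group results into one sorted result.
result = sorted(totals.items())
print(result)  # [('east', 200), ('north', 170), ('south', 110)]
```

Because each group is summarized without reference to the others, the apply step is the part that a distributed system can run in parallel.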
A generic MapReduce procedure has three main steps: map, shuffle, and reduce. Each node on the distributed MapReduce system has local access to an arbitrary small portion of the large data set.
- Map: Each node applies the mapping function to its portion of the data, filtering and sorting it according to parameters. The mapped data is written to temporary storage along with a corresponding set of keys (identifying metadata) that indicate how the data should be redistributed.
- Shuffle: The mapped data is redistributed across the nodes of the system, so that all data sharing a key ends up on the same node.
- Reduce: Data is processed in parallel, per node, per key.
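The three steps above can be sketched in plain Python with a word count, a common teaching example. This is a single-process simulation under stated assumptions, not a distributed implementation: each string in `chunks` stands in for the portion of data held by one node, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (key, value) pair for every word in one node's chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle_phase(mapped_chunks):
    """Shuffle: regroup the pairs from all nodes so each key's values sit together."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return dict(groups)

def reduce_phase(groups):
    """Reduce: combine the values of each key independently of the other keys."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    mapped = [map_phase(chunk) for chunk in chunks]  # one map call per simulated node
    return reduce_phase(shuffle_phase(mapped))

# Each string simulates the data local to one node.
chunks = ["the quick brown fox", "the lazy dog", "the fox jumps"]
counts = word_count(chunks)
print(counts["the"], counts["fox"])  # 3 2
```

In a real cluster the map calls and the per-key reductions would run on different machines, and the shuffle would move data over the network rather than through an in-memory dictionary.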
The following software and data systems implement MapReduce:
- Hadoop — Developed by the Apache Software Foundation. Written in Java, with a language-agnostic API.
- Spark — Developed by AMPLab at UC Berkeley, with APIs for Python, Java, and Scala.
- Disco — A MapReduce implementation originally developed by Nokia, written in Python and Erlang.
- MapReduce-MPI — Developed at Sandia National Laboratories, with bindings for C, C++, and Python.
- Phoenix — A threaded MapReduce implementation developed at Stanford University, written in C++.
Limitations and alternatives
Because each pass shuffles data between nodes, MapReduce is poorly suited to iterative algorithms, such as those used in machine learning (ML), which must repeat that shuffle on every iteration.
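To make the cost concrete, the following sketch runs a one-dimensional k-means clustering as repeated map, shuffle, and reduce passes and counts how many records are regrouped in total. The choice of k-means, the sample `points`, the starting `centroids`, and the helper `kmeans_step` are all assumptions made for illustration; it simulates the passes in a single process.

```python
from collections import defaultdict

def kmeans_step(points, centroids):
    """One MapReduce-style pass: return updated centroids and the records shuffled."""
    # Map: tag each point with the index of its nearest centroid.
    mapped = [
        (min(range(len(centroids)), key=lambda i: abs(p - centroids[i])), p)
        for p in points
    ]
    # Shuffle: every mapped record is regrouped by its key.
    groups = defaultdict(list)
    for key, p in mapped:
        groups[key].append(p)
    # Reduce: average each group; keep the old centroid if a group is empty.
    new_centroids = [
        sum(groups[i]) / len(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]
    return new_centroids, len(mapped)

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [0.0, 5.0]
shuffled = 0
iterations = 0
while True:
    new_centroids, moved = kmeans_step(points, centroids)
    shuffled += moved
    iterations += 1
    if new_centroids == centroids:  # converged: no centroid changed
        break
    centroids = new_centroids

print(centroids, iterations, shuffled)  # [2.0, 11.0] 3 18
```

Even on six points, three iterations regroup the whole data set three times; on a cluster, each of those regroupings would be a full shuffle across nodes.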
Alternatives to traditional MapReduce that aim to alleviate these bottlenecks include:
- Amazon EMR — High-level management for running distributed frameworks, such as Hadoop, Spark, Presto, and Flink, on the AWS (Amazon Web Services) platform.
- Flink — An open-source framework, developed by Apache, for stream processing of incomplete or in-transit data. Optimized for large datasets.
- Google Cloud Dataflow — Distributed data processing for GCP (Google Cloud Platform).
- Presto — An open-source distributed SQL query engine for big data.