MapReduce : Simplified Data Processing on Large Cluster

  • Muthu Dayalan Senior Software Developer, ANNA University, India


MapReduce is a data processing approach in which a single machine acts as a master, assigning map and reduce tasks to the other machines in the cluster. Technically, it can be considered a programming model, with an associated implementation, for processing and generating large data sets. The key concept behind MapReduce is that the programmer states the problem as two basic functions, map and reduce. Scalability is handled within the system rather than by the programmer. By restricting the programming style, MapReduce can provide several functions automatically, such as fault tolerance, locality optimization, load balancing, and massive parallelization. Intermediate key/value (k/v) pairs are generated by the map function and then fed to the reduce workers through the underlying file system. The reduce workers merge the values that share the same key and produce multiple output files for the user (Dean & Ghemawat, 2008). Additionally, the programmer only needs to write the code for this easy-to-understand functionality.
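
The map and reduce roles described above can be illustrated with a minimal single-process sketch of the classic word-count example. This is not the distributed system itself: the names `map_fn`, `reduce_fn`, `run_mapreduce`, and the sample `documents` are hypothetical, and the in-memory grouping step only stands in for the shuffle that a real cluster would perform through its file system.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map: emit one intermediate (word, 1) pair per word in the document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> int:
    """Reduce: merge all values that share the same key into one result."""
    return sum(counts)


def run_mapreduce(
    inputs: Dict[str, str],
    mapper: Callable[[str, str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Run map, group by key, then reduce, all in one process.

    Assumption: a real deployment would split the inputs across workers
    and move intermediate pairs between machines; here a dictionary
    plays the role of that grouping step.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Sort keys so the output order is deterministic.
    return {k: reducer(k, vs) for k, vs in sorted(groups.items())}


# Hypothetical sample input: file name -> file contents.
documents = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}

word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

Only `map_fn` and `reduce_fn` are specific to the problem; in this sketch `run_mapreduce` plays the part that the framework would otherwise take over, which is where parallelization and fault handling would live in a real system.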




[1] Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113.
[2] Dean, J., & Ghemawat, S. (2010). MapReduce: a flexible data processing tool. Communications of the ACM, 53(1), 72-77.
[3] Kolpin, G. (2006). MapReduce: Simplified Data Processing on Large Clusters.
[4] McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., ... & DePristo, M. A. (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20(9), 1297-1303.
[5] Yang, H. C., Dasdan, A., Hsiao, R. L., & Parker, D. S. (2007, June). Map-reduce-merge: simplified relational data processing on large clusters. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data (pp. 1029-1040). ACM.
[6] Bin Saadon, A. G., & Mokhtar, H. M. O. (2017). iiHadoop: An asynchronous distributed framework for incremental iterative computations. Journal of Big Data, 4(1), 1-30.
[7] Boja, C., Pocovnicu, A., & Batagan, L. (2012). Distributed parallel architecture for "big data". Informatica Economica, 16(2), 116-127.
[8] Chang, B., Tsai, H., Tsai, Y., Kuo, C., & Chen, C. (2016). Integration and optimization of multiple big data processing platforms. Engineering Computations, 33(6), 1680-1704.
[9] Chen, L., Liu, Y., Gallagher, M., Pailthorpe, B., Sadiq, S., Shen, H. T., & Li, X. (2012). Introducing cloud computing topics in curricula. Journal of Information Systems Education, 23(3), 315-324.
[10] Ding, S., Li, G., Li, Y., Li, X., Zhai, Q., Champion, A. C., . . . Zheng, Y. F. (2017). SurvSurf: Human retrieval on large surveillance video data. Multimedia Tools and Applications, 76(5), 6521-6549.
[11] Hare, J. S., Samangooei, S., & Lewis, P. H. (2014). Practical scalable image analysis and indexing using Hadoop. Multimedia Tools and Applications, 71(3), 1215-1248.
[12] Islam, A. K. M. T., Jeong, B., Bari, A. T. M., . . . Jeon, S. (2015). MapReduce based parallel gene selection method. Applied Intelligence, 42(2), 147-156.
[13] Lamari, Y., & Said, C. S. (2017). Clustering categorical data based on the relational analysis approach and MapReduce. Journal of Big Data, 4(1), 1-16.
[14] Landset, S., Khoshgoftaar, T. M., Richter, A. N., & Hasanin, T. (2015). A survey of open source tools for machine learning with big data in the Hadoop ecosystem. Journal of Big Data, 2(1), 1-36.
[15] Liu, X., Wang, X., Matwin, S., & Japkowicz, N. (2015). Meta-MapReduce for scalable data mining. Journal of Big Data, 2(1), 1-21.
[16] Mishra, P., & Somani, A. K. (2017). Host managed contention avoidance storage solutions for big data. Journal of Big Data, 4(1), 1-42.
[17] Mohamed, H., & Marchand-Maillet, S. (2014). Distributed media indexing based on MPI and MapReduce. Multimedia Tools and Applications, 69(2), 513-537.
[18] Nagwani, N. K. (2015). Summarizing large text collection using topic modeling and clustering based on MapReduce framework. Journal of Big Data, 2(1), 1-18.
[19] Najafabadi, M. M., Khoshgoftaar, T. M., Villanustre, F., & Holt, J. (2017). Large-scale distributed L-BFGS. Journal of Big Data, 4(1), 1-17.
[20] Rathee, S., & Kashyap, A. (2018). Adaptive-miner: An efficient distributed association rule mining algorithm on Spark. Journal of Big Data, 5(1), 1-17.
[21] Sharma, S., & Toshniwal, D. (2017). Scalable two-phase co-occurring sensitive pattern hiding using MapReduce. Journal of Big Data, 4(1), 1-18.
[22] Singh, D., & Reddy, C. K. (2014). A survey on platforms for big data analytics. Journal of Big Data, 2(1), 1-20.
[23] Suthakar, U., Magnoni, L., David, R. S., Khan, A., & Andreeva, J. (2016). An efficient strategy for the collection and storage of large volumes of data for computation. Journal of Big Data, 3(1), 1-17.
[24] Trifu, M. R., & Ivan, M. (2016). Big data components for business process optimization. Informatica Economica, 20(1), 72-78.
[25] Wang, H., Zhu, F., Xiao, B., Wang, L., & Jiang, Y. (2015). GPU-based MapReduce for large-scale near-duplicate video retrieval. Multimedia Tools and Applications, 74(23), 10515-10534.
How to Cite
DAYALAN, Muthu. MapReduce : Simplified Data Processing on Large Cluster. International Journal of Research and Engineering, [S.l.], v. 5, n. 5, p. 399-403, june 2018. ISSN 2348-7860. Available at: <>. Date accessed: 31 may 2020. doi: