MEMoMR: Accelerate MapReduce via reuse of intermediate results

Abstract

MapReduce is widely regarded as a flexible, scalable, and easy-to-use distributed programming paradigm for big data processing, such as social network data analysis, on cloud computing platforms. With the arrival of the big data era, many efforts have been devoted to accelerating MapReduce from different angles, notably by reusing intermediate results, as in Dache. In this paper, we observe that the existing intermediate result reuse mechanism is not efficient enough because it wastes many I/O operations, and that more efficient reuse of intermediate results could improve MapReduce performance. Motivated by this observation, we propose a framework named MEMoMR (more efficient intermediate result reusing for MapReduce), which introduces a novel reuse mechanism that can substantially reduce I/O overhead. To this end, we devise a new metadata description method and apply it in the reuse phase. We implement MEMoMR and evaluate its performance on a real cluster. The experimental results show that MEMoMR improves system performance by up to 23.4% compared with Dache.

Publication
In Concurrency and Computation: Practice and Experience.