Abstract
MapReduce is a programming model for processing large data sets. It is typically used for distributed computing on clusters of computers, such as cloud computing platforms. Examples of such big data sets include unstructured logs, web indexes, scientific data, and surveillance data. In the MapReduce framework, a computing job is broken down into many smaller Map tasks and a Reduce task. Each Map task processes a partition of the input data set, and the Reduce task aggregates the outputs of the Map tasks to produce the final result. Hadoop is an open-source implementation of MapReduce and is widely used in many cloud-based services. To best utilize the computing resources of a cloud server, a task scheduler is essential for assigning tasks to appropriate processors and for prioritizing resource allocation. The default scheduler of Hadoop is a first-in-first-out (FIFO) scheduler, which is simple but leaves considerable room for performance improvement. Although much research in recent years has aimed to improve the performance of MapReduce platforms, many issues still hinder performance, such as dynamic load balancing, data locality, and the heterogeneity of computing nodes. To improve data locality, we propose a new scheduler for the Hadoop platform called the Data Locality Driven Scheduler (DLDS). DLDS improves Hadoop's performance by allocating Map tasks as close as possible to the data blocks they are to process. We evaluated the proposed DLDS against several other schedulers by simulating workloads on a real 8-node Hadoop system. Experimental results show that DLDS can improve data locality by 10-15%, which results in a significant performance improvement.
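To make the locality-driven idea concrete, the following is a minimal sketch of one way such an assignment rule could be expressed: when a node requests work, a node-local Map task is preferred, then a rack-local one, and only then a remote task. This is only an illustration under assumed, simplified types (MapTask, pickTask); it is not the actual DLDS implementation and does not use Hadoop's scheduler APIs.

    import java.util.*;

    // Illustrative sketch of a data-locality-driven assignment rule
    // (hypothetical types; not the paper's implementation or a Hadoop API).
    public class LocalityDrivenAssignment {

        // A pending Map task, described by the nodes and racks holding
        // replicas of its input block.
        static class MapTask {
            final String id;
            final Set<String> replicaNodes;  // nodes storing a replica of the input block
            final Set<String> replicaRacks;  // racks containing those nodes
            MapTask(String id, Set<String> nodes, Set<String> racks) {
                this.id = id; this.replicaNodes = nodes; this.replicaRacks = racks;
            }
        }

        // Pick a task for the requesting node: node-local first,
        // then rack-local, then any remaining task.
        static MapTask pickTask(List<MapTask> pending, String node, String rack) {
            MapTask rackLocal = null, remote = null;
            for (MapTask t : pending) {
                if (t.replicaNodes.contains(node)) return t;   // node-local: best case
                if (rackLocal == null && t.replicaRacks.contains(rack)) rackLocal = t;
                if (remote == null) remote = t;
            }
            return rackLocal != null ? rackLocal : remote;     // fall back to rack-local, then remote
        }

        public static void main(String[] args) {
            List<MapTask> pending = new ArrayList<>(List.of(
                new MapTask("m1", Set.of("nodeB"), Set.of("rack2")),
                new MapTask("m2", Set.of("nodeA"), Set.of("rack1"))));
            // nodeA on rack1 requests work; the node-local task m2 is chosen over m1.
            System.out.println(pickTask(pending, "nodeA", "rack1").id);
        }
    }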