Abstract
MapReduce is a programming model for processing large data sets. It is typically used for distributed computing on clusters of computers, such as cloud computing platforms. Examples of such big data sets include unstructured logs, web indexes, scientific data, and surveillance data. In the MapReduce framework, a computing job is broken down into many smaller Map tasks and a Reduce task. Each Map task processes a partition of the input data set, and the Reduce task aggregates the outputs of the Map tasks to produce the final result. Hadoop is an open-source implementation of MapReduce and is widely used in many cloud-based services. To best utilize the computing resources of a cloud server, a task scheduler is essential for assigning tasks to appropriate processors and for prioritizing resource allocation. The default scheduler of Hadoop is a first-in-first-out (FIFO) scheduler, which is simple but leaves considerable room for performance improvement. Although much research in recent years has aimed to improve the performance of MapReduce platforms, many issues still hinder performance, such as dynamic load balancing, data locality, and the heterogeneity of computing nodes. To improve data locality, we propose a new scheduler for the Hadoop platform called the Data Locality Driven Scheduler (DLDS). DLDS improves Hadoop's performance by allocating Map tasks as close as possible to the data blocks they are to process. We evaluated the proposed DLDS against several other schedulers by simulating workloads on a real 8-node Hadoop system. Experimental results show that DLDS can improve data locality by 10-15%, which results in a significant performance improvement.
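To make the locality-driven idea concrete, the following is a minimal sketch of one way such an assignment rule could be expressed: when a node requests work, a node-local Map task is preferred, then a rack-local one, and only then a remote task. This is only an illustration under assumed, simplified types (MapTask, pickTask); it is not the actual DLDS implementation and does not use Hadoop's scheduler APIs.

    import java.util.*;

    // Illustrative sketch of a data-locality-driven assignment rule
    // (hypothetical types; not the paper's implementation or a Hadoop API).
    public class LocalityDrivenAssignment {

        // A pending Map task, described by the nodes and racks holding
        // replicas of its input block.
        static class MapTask {
            final String id;
            final Set<String> replicaNodes;  // nodes storing a replica of the input block
            final Set<String> replicaRacks;  // racks containing those nodes
            MapTask(String id, Set<String> nodes, Set<String> racks) {
                this.id = id; this.replicaNodes = nodes; this.replicaRacks = racks;
            }
        }

        // Pick a task for the requesting node: node-local first,
        // then rack-local, then any remaining task.
        static MapTask pickTask(List<MapTask> pending, String node, String rack) {
            MapTask rackLocal = null, remote = null;
            for (MapTask t : pending) {
                if (t.replicaNodes.contains(node)) return t;   // node-local: best case
                if (rackLocal == null && t.replicaRacks.contains(rack)) rackLocal = t;
                if (remote == null) remote = t;
            }
            return rackLocal != null ? rackLocal : remote;     // fall back to rack-local, then remote
        }

        public static void main(String[] args) {
            List<MapTask> pending = new ArrayList<>(List.of(
                new MapTask("m1", Set.of("nodeB"), Set.of("rack2")),
                new MapTask("m2", Set.of("nodeA"), Set.of("rack1"))));
            // nodeA on rack1 requests work; the node-local task m2 is chosen over m1.
            System.out.println(pickTask(pending, "nodeA", "rack1").id);
        }
    }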