Speed Up Big Data Analytics by Unveiling the Storage Distribution of Sub-datasets

Abstract:

In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the Hadoop file system. Our experiments show that the distribution of sub-datasets over HDFS blocks, which is hidden by HDFS, can often cause the corresponding analyses to suffer from seriously imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets results in some computational nodes carrying out much more workload than others; furthermore, it leads to inefficient sampling of sub-datasets, as analysis programs will often read large amounts of irrelevant data. We conduct a comprehensive analysis of how imbalanced computing patterns and inefficient sampling occur. We then propose a storage-distribution-aware method, referred to as DataNet, to optimize sub-dataset analysis over distributed storage systems. Firstly, we propose an efficient algorithm to obtain the metadata of sub-dataset distributions. Secondly, we design an elastic storage structure called ElasticMap, based on the HashMap and BloomFilter techniques, to store the metadata. Thirdly, we employ distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results show the performance benefits of DataNet.

Existing System:

The skewed storage distribution of sub-datasets can lead to imbalanced computing in sub-dataset analysis, because analysis tasks, such as MapReduce tasks, are usually scheduled at HDFS block granularity without considering how sub-datasets are distributed across blocks. In addition, the sampling of sub-datasets can be inefficient or imprecise. To sample a certain percentage or quantity of data from a sub-dataset, traditional Hadoop jobs scan the whole dataset, even if reading a small portion of it is enough to satisfy the sampling requirement. This is because current Hadoop has no mechanism for curtailing its execution and finishing early. Moreover, even if Hadoop were to support early termination, it could still suffer from inefficient sampling when the sub-dataset distribution is skewed. For example, it may process a large portion of the whole dataset but obtain little valid data belonging to a specific sub-dataset. We will analyze how imbalanced computing patterns and inefficient sampling occur and how they are affected by the size of a cluster.
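As a rough illustration of the two problems described above, the following Python sketch uses a made-up layout of one sub-dataset across eight blocks. The numbers, the round-robin assignment, and the function names are assumptions chosen for illustration; they do not reproduce Hadoop's actual scheduler or any measurement from the paper.

```python
def per_node_load(block_sizes, num_nodes):
    """Assign blocks to nodes round-robin (block granularity, blind to content)
    and return how much sub-dataset data each node ends up processing."""
    load = [0] * num_nodes
    for i, size in enumerate(block_sizes):
        load[i % num_nodes] += size
    return load


def blocks_scanned_for_sample(block_sizes, target):
    """Read blocks in storage order until `target` units of the sub-dataset
    have been collected; return how many blocks had to be read."""
    collected = 0
    for scanned, size in enumerate(block_sizes, start=1):
        collected += size
        if collected >= target:
            return scanned
    return len(block_sizes)


# Hypothetical layout: MB of one sub-dataset stored in each of 8 blocks.
# Most of the sub-dataset is clustered in a single block.
clustered = [0, 0, 1, 0, 60, 3, 0, 0]

load = per_node_load(clustered, num_nodes=4)
print(load)                                      # [60, 3, 1, 0]
print(blocks_scanned_for_sample(clustered, 32))  # 5
```

In this toy layout every node receives two blocks, yet one node handles 60 of the 64 MB of relevant data. A sample of 32 MB requires reading five blocks in storage order, although the fifth block alone would have been enough.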

Proposed System:

We theoretically analyzed how an imbalanced data distribution, caused by the clustering of relevant data within HDFS blocks, affects parallel computing.

We developed an efficient algorithm with linear complexity to obtain the complete distribution of sub-datasets through a single scan.

We designed a new light-weight data structure called ElasticMap, which can enable applications to quickly obtain the distribution of desired sub-datasets with a low overhead.

We presented a workload-balanced scheduling algorithm for parallel sub-dataset analysis.

We presented an input data selection algorithm for efficient sub-dataset sampling.

We prototyped and evaluated DataNet against several well-known MapReduce applications. The evaluation results confirm the efficiency and effectiveness of our proposed methods.
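The ideas listed above (a single scan to collect the distribution, workload-balanced scheduling, and input selection for sampling) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not DataNet's implementation: a plain dictionary stands in for ElasticMap, a generic greedy "heaviest block first" heuristic stands in for the scheduling algorithm, data locality is ignored, and all function names and sample data are hypothetical.

```python
import heapq
from collections import defaultdict


def scan_distribution(blocks):
    """Single pass over (block_id, records), where each record is
    (sub_id, size). Returns {sub_id: {block_id: size in that block}}.
    A plain dict is used here as a stand-in for ElasticMap."""
    dist = defaultdict(lambda: defaultdict(int))
    for block_id, records in blocks:
        for sub_id, size in records:
            dist[sub_id][block_id] += size
    return dist


def balanced_assignment(block_sizes, num_nodes):
    """Greedy heuristic (an assumption, not the paper's algorithm):
    take blocks from heaviest to lightest and give each one to the
    currently least-loaded node. Returns {node: [block_ids]}."""
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    plan = {node: [] for node in range(num_nodes)}
    ordered = sorted(block_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for block_id, size in ordered:
        load, node = heapq.heappop(heap)
        plan[node].append(block_id)
        heapq.heappush(heap, (load + size, node))
    return plan


def select_sample_blocks(block_sizes, target):
    """Choose the blocks richest in the sub-dataset until `target`
    units are covered, so fewer irrelevant bytes are read."""
    chosen, covered = [], 0
    ordered = sorted(block_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for block_id, size in ordered:
        if covered >= target:
            break
        chosen.append(block_id)
        covered += size
    return chosen


# Hypothetical input: four blocks holding records of sub-datasets A and B.
blocks = [
    ("b0", [("A", 5), ("B", 1)]),
    ("b1", [("A", 1), ("B", 4)]),
    ("b2", [("A", 4)]),
    ("b3", [("A", 2), ("B", 3)]),
]
dist = scan_distribution(blocks)
plan = balanced_assignment(dist["A"], num_nodes=2)
sample = select_sample_blocks(dist["A"], target=8)
print(plan)    # {0: ['b0', 'b1'], 1: ['b2', 'b3']}
print(sample)  # ['b0', 'b2']
```

With this toy input, both nodes end up with 6 units of sub-dataset A, and a sample of 8 units is satisfied by reading only two of the four blocks.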
