HDM: A Composable Framework for Big Data Processing


In recent years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. However, jobs in these frameworks are coarsely defined and packaged as executable jars, without any of their functionality being exposed or described. This means that deployed jobs are not natively composable or reusable for subsequent development. It also hampers the ability to apply optimizations to the data flow of job sequences and pipelines. In this paper, we present the Hierarchically Distributed Data Matrix (HDM), a functional, strongly-typed data representation for writing composable big data applications. Along with HDM, a runtime framework is provided to support the execution, integration and management of HDM applications on distributed infrastructures. Based on the functional data dependency graph of HDM, multiple optimizations are applied to improve the performance of executing HDM jobs. The experimental results show that our optimizations reduce the Job-Completion-Time by 10% to 40% for different types of applications when compared with the current state of the art, Apache Spark.
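To make the idea of optimizing over a functional data dependency graph more concrete, here is a minimal Python sketch. It is not HDM's actual interface (HDM's API is not shown in this summary); the names `Node`, `fuse_maps`, `count_steps` and `execute` are hypothetical. The sketch builds a lazy chain of operations and then rewrites it so that consecutive `map` steps are fused into one, which is one simple kind of rewrite that becomes possible once the operations of a job are exposed as a graph rather than hidden inside a jar.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Node:
    """One step in a lazily built data-flow graph (hypothetical, not the HDM API)."""
    op: str                          # "source", "map" or "filter"
    fn: Optional[Callable] = None    # function applied by map/filter steps
    parent: Optional["Node"] = None  # upstream dependency
    data: Optional[tuple] = None     # only set on source nodes

    def map(self, fn: Callable) -> "Node":
        return Node("map", fn, self)

    def filter(self, fn: Callable) -> "Node":
        return Node("filter", fn, self)


def fuse_maps(node: Node) -> Node:
    """Return an equivalent graph in which consecutive map steps are merged."""
    if node.parent is None:          # source node: nothing to rewrite
        return node
    parent = fuse_maps(node.parent)
    if node.op == "map" and parent.op == "map":
        f, g = parent.fn, node.fn
        # Compose the two functions and skip the intermediate step.
        return Node("map", lambda x, f=f, g=g: g(f(x)), parent.parent)
    return Node(node.op, node.fn, parent)


def count_steps(node: Optional[Node]) -> int:
    """Number of nodes on the chain from the source to this node."""
    return 0 if node is None else 1 + count_steps(node.parent)


def execute(node: Node) -> list:
    """Evaluate the graph eagerly, one step at a time."""
    if node.op == "source":
        return list(node.data)
    rows = execute(node.parent)
    if node.op == "map":
        return [node.fn(x) for x in rows]
    if node.op == "filter":
        return [x for x in rows if node.fn(x)]
    raise ValueError(f"unknown operation: {node.op}")


source = Node("source", data=(1, 2, 3, 4))
plan = source.map(lambda x: x + 1).map(lambda x: x * 2).filter(lambda x: x > 4).map(str)
optimized = fuse_maps(plan)
print(count_steps(plan), count_steps(optimized))  # 5 4
print(execute(optimized))                         # ['6', '8', '10']
```

The fused plan produces the same result as the original with one fewer pass over the data. A real system would apply such rewrites to distributed, partitioned data, which this sketch does not model.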

Existing System:

In current big data platforms such as MapReduce and Spark, there is no proper way to share and expose a deployed, well-tuned online component to other developers. As a result, big data applications involve a great deal of redundant development, much of which goes unnoticed.

In addition, as a pipeline evolves, each of its online components might be updated or redeveloped, and new components can be added. As a result, it is very hard to track and check the effects of these changes over time.

Google’s recent report shows the challenges and problems that they have encountered in managing and evolving large-scale data analytic applications. Furthermore, as a pipeline becomes more and more complicated, it is almost impossible to manually optimize the performance of each component, let alone the whole pipeline.

Proposed System:

Many real-world applications require a chain of operations or even a pipeline of data processing programs. Optimizing a complicated job is difficult, and optimizing a pipeline of jobs is even harder. Additionally, manual optimization is time-consuming and error-prone, and it is almost impossible to manually optimize every program.

Integration, composition and interaction with big data programs/jobs are not natively supported: Many practical data analytics and machine learning algorithms require a combination of multiple processing components, each of which is responsible for a certain analytical functionality.
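As an illustration of what composing such components could look like, here is a small Python sketch. It does not reproduce HDM's interface; the names `registry`, `component` and `compose` are hypothetical. Each component is published under a name with a described input and output, and a new job is assembled from existing components instead of being rewritten from scratch.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical registry of published components, keyed by name.
registry: Dict[str, Callable] = {}


def component(name: str) -> Callable:
    """Decorator that publishes a function so that other jobs can reuse it."""
    def register(fn: Callable) -> Callable:
        registry[name] = fn
        return fn
    return register


@component("tokenize")
def tokenize(lines: List[str]) -> List[str]:
    """Split lines into lower-cased words."""
    return [word.lower() for line in lines for word in line.split()]


@component("count")
def count(words: List[str]) -> Dict[str, int]:
    """Count how often each word occurs."""
    return dict(Counter(words))


def compose(*names: str) -> Callable:
    """Build a new job by chaining published components in the given order."""
    steps = [registry[name] for name in names]

    def pipeline(data):
        for step in steps:
            data = step(data)
        return data

    return pipeline


word_count = compose("tokenize", "count")
print(word_count(["a b", "B a a"]))  # {'a': 3, 'b': 2}
```

Because each component is a named function with a described input and output, another developer could reuse `tokenize` in a different pipeline without touching its implementation.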

Maintenance and management of evolving big data applications are complex and tedious. In a realistic data analytic process, data scientists need to explore the datasets and tune the algorithms back and forth to find a better solution.
