Main Title: Collaborative cluster configuration for distributed data-parallel processing: A research overview
Author(s): Thamsen, Lauritz
Scheinert, Dominik
Will, Jonathan
Bader, Jonathan
Kao, Odej
Other Contributor(s): Springer Nature
Type: Article
Abstract: Many organizations routinely analyze large datasets using systems for distributed data-parallel processing and clusters of commodity resources. Yet, users need to configure adequate resources for their data processing jobs. This requires significant insights into expected job runtimes and scaling behavior, resource characteristics, input data distributions, and other factors. Unable to estimate performance accurately, users frequently overprovision resources for their jobs, leading to low resource utilization and high costs. In this paper, we present major building blocks towards a collaborative approach for optimization of data processing cluster configurations based on runtime data and performance models. We believe that runtime data can be shared and used for performance models across different execution contexts, significantly reducing the reliance on the recurrence of individual processing jobs or, else, dedicated job profiling. For this, we describe how the similarity of processing jobs and cluster infrastructures can be employed to combine suitable data points from local and global job executions into accurate performance models. Furthermore, we outline approaches to performance prediction via more context-aware and reusable models. Finally, we lay out how metrics from previous executions can be combined with runtime monitoring to effectively re-configure models and clusters dynamically.
Subject(s): scalable data analytics
batch processing
distributed dataflows
runtime prediction
resource allocation
Issue Date: 31-May-2022
Date Available: 5-Aug-2022
Language Code: en
DDC Class: 004 Data processing; Computer science
Sponsor/Funder: TU Berlin, Open Access Funds – 2022
BMBF, 01IS18025A, Joint project BIFOLD-BBDC: Berlin Institute for the Foundations of Learning and Data
DFG, 414984028, CRC 1404: FONDA – Foundations of Workflows for Large-Scale Scientific Data Analysis
Journal Title: Datenbank-Spektrum
Publisher: Springer Nature
Volume: 22
Publisher DOI: 10.1007/s13222-022-00416-z
Page Start: 143
Page End: 151
EISSN: 1610-1995
ISSN: 1618-2162
TU Affiliation(s): Faculty IV Electrical Engineering and Computer Science » Institute of Telecommunication Systems » Research Group Open Distributed Systems
Appears in Collections: Technische Universität Berlin » Publications

This item is licensed under a Creative Commons License.