Please use this identifier to cite or link to this item: http://dx.doi.org/10.14279/depositonce-15981
For citation please use:
Main Title: Data Cleaning and AutoML: Would an Optimizer Choose to Clean?
Author(s): Neutatz, Felix
Chen, Binger
Alkhatib, Yazan
Ye, Jingwen
Abedjan, Ziawasch
Other Contributor(s): Springer Nature
Type: Article
URI: https://depositonce.tu-berlin.de/handle/11303/17202
http://dx.doi.org/10.14279/depositonce-15981
License: https://creativecommons.org/licenses/by/4.0/
Abstract: Data cleaning is widely acknowledged as an important yet tedious task when dealing with large amounts of data. Thus, there is always a cost-benefit trade-off to consider. In particular, it is important to assess this trade-off when not every data point and data error is equally important for a task. This is often the case when statistical analysis or machine learning (ML) models derive knowledge about data. If we only care about maximizing the utility score of the applications, such as accuracy or F1 scores, many tasks can afford some degree of data quality problems. Recent studies analyzed the impact of various data error types on vanilla ML tasks, showing that missing values and outliers significantly impact the outcome of such models. In this paper, we expand the setting to one where data cleaning is not considered in isolation but as an equal parameter among many other hyper-parameters that influence feature selection, regularization, and model selection. In particular, we use state-of-the-art AutoML frameworks to automatically learn the parameters that benefit a particular ML binary classification task. In our study, we see that specific cleaning routines still play a significant role but can also be entirely avoided if the choice of a specific model or the filtering of specific features diminishes the overall impact.
Subject(s): machine learning
data cleaning
AutoML
Issue Date: 13-May-2022
Date Available: 5-Aug-2022
Language Code: en
DDC Class: 004 Data processing; computer science
Sponsor/Funder: TU Berlin, Open-Access-Mittel – 2022
BMBF, 01IS18025A, Verbundprojekt BIFOLD-BBDC: Berlin Institute for the Foundations of Learning and Data
BMBF, 01IS18037A, Verbundprojekt BIFOLD-BZML: Berlin Institute for the Foundations of Learning and Data
Journal Title: Datenbank-Spektrum
Publisher: Springer Nature
Volume: 22
Publisher DOI: 10.1007/s13222-022-00413-2
Page Start: 121
Page End: 130
EISSN: 1610-1995
ISSN: 1618-2162
TU Affiliation(s): Faculty 4 Electrical Engineering and Computer Science » Institute of Software Engineering and Theoretical Computer Science » Database Systems and Information Management Group (DIMA)
Appears in Collections: Technische Universität Berlin » Publications

Files in This Item:
Neutatz_etal_Data_2022.pdf
Format: Adobe PDF | Size: 415.21 kB

This item is licensed under a Creative Commons License.