Data Cleaning and AutoML: Would an Optimizer Choose to Clean?

dc.contributor.author: Neutatz, Felix
dc.contributor.author: Chen, Binger
dc.contributor.author: Alkhatib, Yazan
dc.contributor.author: Ye, Jingwen
dc.contributor.author: Abedjan, Ziawasch
dc.date.accessioned: 2022-08-05T09:15:29Z
dc.date.available: 2022-08-05T09:15:29Z
dc.date.issued: 2022-05-13
dc.description.abstract: Data cleaning is widely acknowledged as an important yet tedious task when dealing with large amounts of data. Thus, there is always a cost-benefit trade-off to consider. In particular, it is important to assess this trade-off when not every data point and data error is equally important for a task. This is often the case when statistical analysis or machine learning (ML) models derive knowledge about data. If we only care about maximizing the utility score of the applications, such as accuracy or F1 scores, many tasks can afford some degree of data quality problems. Recent studies analyzed the impact of various data error types on vanilla ML tasks, showing that missing values and outliers significantly impact the outcome of such models. In this paper, we expand the setting to one where data cleaning is not considered in isolation but as an equal parameter among many other hyper-parameters that influence feature selection, regularization, and model selection. In particular, we use state-of-the-art AutoML frameworks to automatically learn the parameters that benefit a particular ML binary classification task. In our study, we see that specific cleaning routines still play a significant role but can also be entirely avoided if the choice of a specific model or the filtering of specific features diminishes the overall impact. [en]
dc.description.sponsorship: TU Berlin, Open-Access-Mittel – 2022 [en]
dc.description.sponsorship: BMBF, 01IS18025A, Verbundprojekt BIFOLD-BBDC: Berlin Institute for the Foundations of Learning and Data [en]
dc.description.sponsorship: BMBF, 01IS18037A, Verbundprojekt BIFOLD-BZML: Berlin Institute for the Foundations of Learning and Data [en]
dc.identifier.eissn: 1610-1995
dc.identifier.issn: 1618-2162
dc.identifier.uri: https://depositonce.tu-berlin.de/handle/11303/17202
dc.identifier.uri: http://dx.doi.org/10.14279/depositonce-15981
dc.language.iso: en [en]
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/ [en]
dc.subject.ddc: 004 Datenverarbeitung; Informatik [de]
dc.subject.other: machine learning [en]
dc.subject.other: data cleaning [en]
dc.subject.other: AutoML [en]
dc.title: Data Cleaning and AutoML: Would an Optimizer Choose to Clean? [en]
dc.type: Article [en]
dc.type.version: publishedVersion [en]
dcterms.bibliographicCitation.doi: 10.1007/s13222-022-00413-2 [en]
dcterms.bibliographicCitation.journaltitle: Datenbank-Spektrum [en]
dcterms.bibliographicCitation.originalpublishername: Springer Nature [en]
dcterms.bibliographicCitation.originalpublisherplace: Heidelberg [en]
dcterms.bibliographicCitation.pageend: 130 [en]
dcterms.bibliographicCitation.pagestart: 121 [en]
dcterms.bibliographicCitation.volume: 22 [en]
tub.accessrights.dnb: free [en]
tub.affiliation: Fak. 4 Elektrotechnik und Informatik::Inst. Softwaretechnik und Theoretische Informatik::FG Datenbanksysteme und Informationsmanagement (DIMA) [de]
tub.affiliation.faculty: Fak. 4 Elektrotechnik und Informatik [de]
tub.affiliation.group: FG Datenbanksysteme und Informationsmanagement (DIMA) [de]
tub.affiliation.institute: Inst. Softwaretechnik und Theoretische Informatik [de]
tub.publisher.universityorinstitution: Technische Universität Berlin [en]

Files

Original bundle
Name: Neutatz_etal_Data_2022.pdf
Size: 415.21 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 4.86 KB
Format: Item-specific license agreed upon to submission
