Augmenting mathematical formulae for more effective querying & efficient presentation

Schubotz, Moritz

dc.contributor.advisor	Markl, Volker
dc.contributor.author	Schubotz, Moritz
dc.contributor.grantor	Technische Universität Berlin	en
dc.contributor.referee	Markl, Volker
dc.contributor.referee	Youssef, Abdou
dc.contributor.referee	Pitman, Jim
dc.date.accepted	2017-03-31
dc.date.accessioned	2017-07-27T08:41:05Z
dc.date.available	2017-07-27T08:41:05Z
dc.date.issued	2017
dc.description	Print copy also published under ISBN 9783745062083	en
dc.description.abstract	Mathematical Information Retrieval (MIR) is a research area that focuses on the Information Need (IN) of the Science, Technology, Engineering and Mathematics (STEM) domain. Unlike traditional Information Retrieval (IR) research, which extracts information from textual data sources, MIR takes mathematical formulae into account as well. This thesis makes three main contributions: 1. It analyses the strengths and weaknesses of current MIR systems and establishes a new MIR task for future evaluations; 2. Based on the analysis, it augments mathematical notation as a foundation for future MIR systems to better fit the IN from the STEM domain; and 3. It presents a solution for how large web publishers can efficiently present mathematics to satisfy the INs of each individual visitor. With regard to the evaluation of MIR systems, it analyses the first international MIR task and proposes the Math Wikipedia Task (WMC). In contrast to other tasks, which evaluate the overall performance of MIR systems based on an IN described by a combination of textual keywords and formulae, WMC was designed to gain insights into the math-specific aspects of MIR systems. In addition, this thesis investigates how different factors of similarity measures for mathematical expressions influence the effectiveness of MIR results. Based on the aforementioned evaluations, this thesis proposes to rethink the fundamentals of MIR systems. MIR systems should elevate the internal representation of mathematics and use a semantic rather than syntactic representation for the retrieval algorithms. This approach simplifies MIR research by defining three orthogonal MIR research challenges: (1) Augmentation; (2) Querying; and (3) Efficient Execution. As the augmentation target, this thesis proposes the concept of context-free formulae, visualized by the idea of a Formula Home Page (FHP). 
By visiting an FHP, a mathematically literate person can fully understand the formula semantics without context or additional resources. As a first step towards unsupervised formula augmentation, this thesis introduces Mathematical Language Processing (MLP). MLP extracts knowledge about individual formulae from their surrounding text. To achieve that, it borrows concepts from Natural Language Processing (NLP) and adapts them to the specifics of mathematical language. To finally satisfy the user's mathematical IN, formulae (i.e., data representing mathematical semantics) need to be presented to the user. Given the large variety of users and information systems, delivering math in a robust, scalable, fast and accessible way was an open research problem. This thesis investigates different approaches to solve this problem and demonstrates the feasibility of a service-oriented multi-format approach, which was implemented as the Mathoid math rendering service. This implementation improves math rendering for all Wikimedia sites, including Wikipedia, in production.	en
dc.description.abstract	The digital revolution has fundamentally changed how information is obtained. The Internet has become the first port of call for satisfying everyday information needs, in both private and professional life. This also holds for the disciplines of mathematics, engineering, natural sciences and technology (STEM). The high proportion of mathematical expressions, which are an integral part of the written language in STEM subjects, poses a particular challenge for systems such as search engines and literature recommendation services. The research area Mathematical Information Retrieval (MIR) addresses this topic. Some problems, such as disambiguation, can be solved by adapting corresponding methods from computational linguistics. Many aspects, however, require entirely new solutions. The application scenarios for better processing and analysis methods for texts with a high proportion of mathematical notation are manifold, ranging from literature search in scientific texts and the prevention of plagiarism to the improvement of educational software for STEM subjects in schools and universities. This dissertation makes the following contributions to MIR research: 1. An analysis of the strengths and weaknesses of existing MIR systems and the development of a standardized evaluation system for quantifying the effectiveness of MIR systems. 2. An investigation of methods for the automatic semantic enrichment of mathematical expressions. 3. The development of a proposed solution for the efficient and scalable presentation of mathematical content. Based on the analysis of existing MIR systems, this thesis proposes a three-way division of MIR research: (1) augmentation; (2) query generation; and (3) efficient execution. 
An evaluation procedure for quantifying the effectiveness of MIR systems is developed, consisting of a Wikipedia-based test corpus, a list of tasks, and a fully automatic system for evaluating the measurement results. In contrast to conventional evaluation procedures, in which the tasks consist of keyword lists, the procedure presented here uses formula patterns. The evaluation procedure was part of the first official international competition for MIR systems. Beyond that, it has also been used outside the competition to evaluate MIR systems and has been developed further by other researchers. In a process referred to as "Mathematical Language Processing" (MLP), mathematical identifiers are semantically enriched with information from the surrounding text. In a second step, not only the text surrounding an identifier is considered; instead, the entirety of texts from similar subject areas is analysed in order to identify the meanings of individual identifiers and to further improve the effectiveness of the semantic enrichment. In a further step, the presentation and processing of mathematical formulae in Wikipedia is fundamentally improved. To this end, the mathematical expressions, which until then had been rendered as image files, are converted into HTML5 code. This enables faster and more scalable processing of the mathematical content in Wikipedia. Since May 2016, this method has been used worldwide on all Wikipedia pages containing mathematical expressions.	en
dc.identifier.uri	https://depositonce.tu-berlin.de/handle/11303/6526
dc.identifier.uri	http://dx.doi.org/10.14279/depositonce-6034
dc.language.iso	en	en
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	en
dc.subject.ddc	020 Library and information sciences	en
dc.subject.other	information retrieval	en
dc.subject.other	digital libraries	en
dc.subject.other	MathML	en
dc.subject.other	mathematical language processing	en
dc.subject.other	machine learning	en
dc.subject.other	Informationsbeschaffung	de
dc.subject.other	digitale Bibliotheken	de
dc.subject.other	mathematische Sprachverarbeitung	de
dc.subject.other	maschinelles Lernen	de
dc.title	Augmenting mathematical formulae for more effective querying & efficient presentation	en
dc.title.translated	Augmentation mathematischer Formeln zur effizienteren Auffindbarkeit und zur effektiven Darstellung	de
dc.type	Doctoral Thesis	en
dc.type.version	acceptedVersion	en
tub.accessrights.dnb	free	en
tub.affiliation	Fak. 4 Elektrotechnik und Informatik::Inst. Softwaretechnik und Theoretische Informatik::FG Datenbanksysteme und Informationsmanagement (DIMA)	de
tub.affiliation.faculty	Fak. 4 Elektrotechnik und Informatik	de
tub.affiliation.group	FG Datenbanksysteme und Informationsmanagement (DIMA)	de
tub.affiliation.institute	Inst. Softwaretechnik und Theoretische Informatik	de
tub.publisher.universityorinstitution	Technische Universität Berlin	en

Files

Original bundle

Name: schubotz_moritz.pdf
Size: 6.32 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 5.75 KB
Format: Item-specific license agreed upon to submission

Collections

Publications