Automatic glossary term extraction from large-scale requirements specifications
Creating glossaries for large corpora of requirments is an important but expensive task. Glossary term extraction methods often focus on achieving a high recall rate and, therefore, favor linguistic proecssing for extracting glossary term candidates and neglect the benefits from reducing the number of candidates by statistical filter methods. However, especially for large datasets a reduction of the likewise large number of candidates may be crucial. This paper demonstrates how to automatically extract relevant domain-specific glossary term candidates from a large body of requirements, the CrowdRE dataset. Our hybrid approach combines linguistic processing and statistical filtering for extracting and reducing glossary term candidates. In a twofold evaluation, we examine the impact of our approach on the quality and quantity of extracted terms. We provide a ground truth for a subset of the requirements and show that a substantial degree of recall can be achieved. Furthermore, we advocate requirements coverage as an additional quality metric to assess the term reduction that results from our statistical filters. Results indicate that with a careful combination of linguistic and statistical extraction methods, a fair balance between later manual efforts and a high recall rate can be achieved.
Published in: 2018 IEEE 26th International Requirements Engineering Conference (RE), 10.1109/RE.2018.00052, IEEE