Please use this identifier to cite or link to this item: http://dx.doi.org/10.14279/depositonce-12010
For citation please use:
Main Title: Identifying Sensitive URLs at Web-Scale
Author(s): Matic, Srdjan
Iordanou, Costas
Smaragdakis, Georgios
Laoutaris, Nikolaos
Type: Conference Object
Language Code: en
Abstract: Several data protection laws include special provisions for protecting personal data relating to religion, health, sexual orientation, and other sensitive categories. Having a well-defined list of sensitive categories is sufficient for filing complaints manually, conducting investigations, and prosecuting cases in courts of law. Data protection laws, however, do not define explicitly what type of content falls under each sensitive category. Therefore, it is unclear how to implement proactive measures such as informing users, blocking trackers, and filing complaints automatically when users visit sensitive domains. To empower such use cases, we turn to the Curlie.org crowdsourced taxonomy project for drawing training data to build a text classifier for sensitive URLs. We demonstrate that our classifier can identify sensitive URLs with accuracy above 88%, and even recognize specific sensitive categories with accuracy above 90%. We then use our classifier to search for sensitive URLs in a corpus of 1 billion URLs collected by the Common Crawl project. We identify more than 155 million sensitive URLs in more than 4 million domains. Despite their sensitive nature, more than 30% of these URLs belong to domains that fail to use HTTPS. Also, in sensitive web pages with third-party cookies, 87% of the third parties set at least one persistent cookie.
URI: https://depositonce.tu-berlin.de/handle/11303/13215
http://dx.doi.org/10.14279/depositonce-12010
Issue Date: 27-Oct-2020
Date Available: 8-Jun-2021
DDC Class: 000 Informatik, Informationswissenschaft, allgemeine Werke
Subject(s): security
privacy
WWW
World Wide Web
GDPR
General Data Protection Regulation
network measurements
Sponsor/Funder: EC/H2020/679158/EU/Resolving the Tussle in the Internet: Mapping, Architecture, and Policy Making/ResolutioNet
EC/H2020/871370/EU/PIMCity: Building the Next Generation Personal Data Platforms/PIMCITY
License: http://rightsstatements.org/vocab/InC/1.0/
Proceedings Title: Proceedings of the ACM Internet Measurement Conference (IMC 2020)
Publisher: Association for Computing Machinery (ACM)
Publisher Place: New York, NY
Publisher DOI: 10.1145/3419394.3423653
Page Start: 619
Page End: 633
ISBN: 978-1-4503-8138-3
Appears in Collections: FG Internet Measurement and Analysis (IMA) » Publications

Files in This Item:
matic_etal_2020.pdf
Accepted manuscript
Format: Adobe PDF | Size: 1.63 MB

Items in DepositOnce are protected by copyright, with all rights reserved, unless otherwise indicated.