Identifying Sensitive URLs at Web-Scale

dc.contributor.author: Matic, Srdjan
dc.contributor.author: Iordanou, Costas
dc.contributor.author: Smaragdakis, Georgios
dc.contributor.author: Laoutaris, Nikolaos
dc.date.accessioned: 2021-06-08T13:11:58Z
dc.date.available: 2021-06-08T13:11:58Z
dc.date.issued: 2020-10-27
dc.description.abstract: Several data protection laws include special provisions for protecting personal data relating to religion, health, sexual orientation, and other sensitive categories. Having a well-defined list of sensitive categories is sufficient for filing complaints manually, conducting investigations, and prosecuting cases in courts of law. Data protection laws, however, do not define explicitly what type of content falls under each sensitive category. Therefore, it is unclear how to implement proactive measures such as informing users, blocking trackers, and filing complaints automatically when users visit sensitive domains. To empower such use cases we turn to the Curlie.org crowdsourced taxonomy project for drawing training data to build a text classifier for sensitive URLs. We demonstrate that our classifier can identify sensitive URLs with accuracy above 88%, and even recognize specific sensitive categories with accuracy above 90%. We then use our classifier to search for sensitive URLs in a corpus of 1 billion URLs collected by the Common Crawl project. We identify more than 155 million sensitive URLs in more than 4 million domains. Despite their sensitive nature, more than 30% of these URLs belong to domains that fail to use HTTPS. Also, in sensitive web pages with third-party cookies, 87% of the third parties set at least one persistent cookie. [en]
dc.description.sponsorship: EC/H2020/679158/EU/Resolving the Tussle in the Internet: Mapping, Architecture, and Policy Making/ResolutioNet [en]
dc.description.sponsorship: EC/H2020/871370/EU/PIMCity: Building the Next Generation Personal Data Platforms/PIMCITY [en]
dc.identifier.isbn: 978-1-4503-8138-3
dc.identifier.uri: https://depositonce.tu-berlin.de/handle/11303/13215
dc.identifier.uri: http://dx.doi.org/10.14279/depositonce-12010
dc.language.iso: en [en]
dc.rights.uri: http://rightsstatements.org/vocab/InC/1.0/ [en]
dc.subject.ddc: 000 Informatik, Informationswissenschaft, allgemeine Werke [de]
dc.subject.other: security [en]
dc.subject.other: privacy [en]
dc.subject.other: WWW [en]
dc.subject.other: World Wide Web [en]
dc.subject.other: GDPR [en]
dc.subject.other: General Data Protection Regulation [en]
dc.subject.other: network measurements [en]
dc.title: Identifying Sensitive URLs at Web-Scale [en]
dc.type: Conference Object [en]
dc.type.version: acceptedVersion [en]
dcterms.bibliographicCitation.doi: 10.1145/3419394.3423653 [en]
dcterms.bibliographicCitation.originalpublishername: Association for Computing Machinery (ACM) [en]
dcterms.bibliographicCitation.originalpublisherplace: New York, NY [en]
dcterms.bibliographicCitation.pageend: 633 [en]
dcterms.bibliographicCitation.pagestart: 619 [en]
dcterms.bibliographicCitation.proceedingstitle: Proceedings of the ACM Internet Measurement Conference (IMC 2020) [en]
tub.accessrights.dnb: free [en]
tub.affiliation: Fak. 4 Elektrotechnik und Informatik::Inst. Telekommunikationssysteme::FG Internet Measurement and Analysis (IMA) [de]
tub.affiliation.faculty: Fak. 4 Elektrotechnik und Informatik [de]
tub.affiliation.group: FG Internet Measurement and Analysis (IMA) [de]
tub.affiliation.institute: Inst. Telekommunikationssysteme [de]
tub.publisher.universityorinstitution: Technische Universität Berlin [en]

Files

Original bundle
Name: matic_etal_2020.pdf
Size: 1.59 MB
Format: Adobe Portable Document Format
Description: Accepted manuscript

License bundle
Name: license.txt
Size: 5.75 KB
Format: Item-specific license agreed upon to submission

Collections