Focused Crawling for Structured Data
Meusel, Robert; Mika, Peter; Blanco, Roi
DOI:
|
https://doi.org/10.1145/2661829.2661902
|
URL:
|
https://s.yimg.com/ge/labs/v2/uploads/anthelion.pd...
|
Additional URL:
|
http://de.slideshare.net/RobertMeusel/focused-craw...
|
Document type:
|
Conference publication
|
Year of publication:
|
2014
|
Book title:
|
CIKM 2014: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management
|
Page range:
|
1039-1048
|
Event location:
|
Shanghai, China
|
Event date:
|
November 3-7, 2014
|
Place of publication:
|
New York, NY
|
Publisher:
|
ACM
|
ISBN:
|
978-1-4503-2598-1
|
Language of publication:
|
English
|
Institution:
|
School of Business Informatics and Mathematics > Information Systems V: Web-based Systems (Bizer 2012-)
|
Subject:
|
004 Computer science
|
Keywords (English):
|
bandit-based selection, focused crawling, microdata, online learning
|
Abstract:
|
The Web is rapidly transforming from a pure document collection to the largest connected public data space. Semantic annotations of web pages make it notably easier to extract and reuse data and are increasingly used by both search engines and social media sites to provide better search experiences through rich snippets, faceted search, task completion, etc. In our work, we study the novel problem of crawling structured data embedded inside HTML pages. We describe Anthelion, the first focused crawler addressing this task. We propose new methods of focused crawling specifically designed for collecting data-rich pages with greater efficiency. In particular, we propose a novel combination of online learning and bandit-based explore/exploit approaches to predict data-rich web pages based on the context of the page as well as using feedback from the extraction of metadata from previously seen pages. We show that these techniques significantly outperform state-of-the-art approaches for focused crawling, measured as the ratio of relevant pages and non-relevant pages collected within a given budget.
|
| This entry is part of the university bibliography. |
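The abstract mentions a bandit-based explore/exploit strategy driven by feedback from metadata extraction, without giving details in this record. The following is only a minimal sketch of that general idea, not Anthelion's actual method: it assumes an epsilon-greedy policy (a stand-in technique chosen here for illustration) in which each host is a bandit arm and the reward is whether a fetched page contained structured data. The class name `EpsilonGreedyFrontier` and all method names are hypothetical. The online-learning classifier over page context that the abstract also mentions is not modeled.

```python
import random
from collections import defaultdict


class EpsilonGreedyFrontier:
    """Toy crawl frontier (hypothetical): each host is a bandit arm with queued URLs."""

    def __init__(self, epsilon=0.1, seed=0):
        self.epsilon = epsilon                # probability of exploring a random host
        self.rng = random.Random(seed)
        self.queues = defaultdict(list)       # host -> pending URLs
        self.pulls = defaultdict(int)         # host -> pages fetched so far
        self.hits = defaultdict(int)          # host -> fetched pages with structured data

    def add(self, host, url):
        """Queue a discovered URL under its host."""
        self.queues[host].append(url)

    def expected_reward(self, host):
        """Laplace-smoothed share of fetched pages that contained structured data."""
        return (self.hits[host] + 1) / (self.pulls[host] + 2)

    def next_url(self):
        """Pick a host (explore or exploit) and return (host, url), or None if empty."""
        hosts = [h for h, q in self.queues.items() if q]
        if not hosts:
            return None
        if self.rng.random() < self.epsilon:
            host = self.rng.choice(hosts)                 # explore: random host
        else:
            host = max(hosts, key=self.expected_reward)   # exploit: best estimate
        return host, self.queues[host].pop()

    def feedback(self, host, had_structured_data):
        """Record the extraction outcome for a page fetched from this host."""
        self.pulls[host] += 1
        if had_structured_data:
            self.hits[host] += 1
```

In a crawl loop under these assumptions, one would call `next_url()`, fetch and parse the page, then pass the extraction result to `feedback()` so that hosts yielding structured data are preferred on later picks while `epsilon` keeps some budget for unexplored hosts.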