Closing the gap: Sequence mining at scale
Beedkar, Kaustubh
;
Berberich, Klaus
;
Gemulla, Rainer
;
Miliaraki, Iris
DOI:
|
https://doi.org/10.1145/2757217
|
URL:
|
https://madoc.bib.uni-mannheim.de/40019
|
Weitere URL:
|
http://dl.acm.org/citation.cfm?doid=2799368.275721...
|
URN:
|
urn:nbn:de:bsz:180-madoc-400197
|
Dokumenttyp:
|
Zeitschriftenartikel
|
Erscheinungsjahr:
|
2015
|
Titel einer Zeitschrift oder einer Reihe:
|
ACM Transactions on Database Systems : TODS
|
Band/Volume:
|
40
|
Heft/Issue:
|
2
|
Seitenbereich:
|
Art. 8, 1-44
|
Ort der Veröffentlichung:
|
New York, NY
|
Verlag:
|
ACM Press
|
ISSN:
|
0362-5915 , 1557-4644
|
Sprache der Veröffentlichung:
|
Englisch
|
Einrichtung:
|
Fakultät für Wirtschaftsinformatik und Wirtschaftsmathematik > Practical Computer Science I: Data Analytics (Gemulla 2014-)
|
Fachgebiet:
|
004 Informatik
|
Abstract:
|
Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
|
| Dieser Eintrag ist Teil der Universitätsbibliographie. |
| Das Dokument wird vom Publikationsserver der Universitätsbibliothek Mannheim bereitgestellt. |
Suche Autoren in
BASE:
Beedkar, Kaustubh
;
Berberich, Klaus
;
Gemulla, Rainer
;
Miliaraki, Iris
Google Scholar:
Beedkar, Kaustubh
;
Berberich, Klaus
;
Gemulla, Rainer
;
Miliaraki, Iris
ORCID:
Beedkar, Kaustubh, Berberich, Klaus, Gemulla, Rainer ORCID: https://orcid.org/0000-0003-2762-0050 and Miliaraki, Iris
Sie haben einen Fehler gefunden? Teilen Sie uns Ihren Korrekturwunsch bitte hier mit: E-Mail
Actions (login required)
|
Eintrag anzeigen |
|
|