Closing the gap: Sequence mining at scale
Beedkar, Kaustubh; Berberich, Klaus; Gemulla, Rainer; Miliaraki, Iris
DOI: https://doi.org/10.1145/2757217
URL: https://madoc.bib.uni-mannheim.de/40019
Additional URL: http://dl.acm.org/citation.cfm?doid=2799368.275721...
URN: urn:nbn:de:bsz:180-madoc-400197
Document Type: Article
Year of publication: 2015
Journal or publication series: ACM Transactions on Database Systems : TODS
Volume: 40
Issue number: 2
Page range: Art. 8, 1-44
Place of publication: New York, NY
Publishing house: ACM Press
ISSN: 0362-5915, 1557-4644
Publication language: English
Institution: School of Business Informatics and Mathematics > Practical Computer Science I: Data Analytics (Gemulla 2014-)
Subject: 004 Computer science, internet
Abstract:
Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
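To make the notion of a positional gap constraint concrete, the following is a minimal brute-force sketch in Python. It is not MG-FSM and does not reflect the article's partitioning or its MapReduce implementation; it only illustrates, on a single machine, what it means to count the support of subsequences whose consecutive items are separated by a bounded number of intervening items. The function names and the parameters `min_support`, `max_gap`, and `max_len` are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def gap_subsequences(seq, max_gap, max_len):
    """Return the distinct subsequences of `seq` with 2..max_len items in which
    at most `max_gap` items lie between consecutive chosen items.
    Illustrative helper only; the name and signature are assumptions."""
    found = set()

    def extend(prefix, last_pos):
        if len(prefix) >= 2:
            found.add(prefix)
        if len(prefix) == max_len:
            return
        # The next item may skip over at most `max_gap` positions.
        stop = min(len(seq), last_pos + max_gap + 2)
        for nxt in range(last_pos + 1, stop):
            extend(prefix + (seq[nxt],), nxt)

    for start in range(len(seq)):
        extend((seq[start],), start)
    return found


def mine(database, min_support, max_gap, max_len):
    """Naive frequent sequence mining: a pattern's support is the number of
    input sequences that contain it under the gap constraint."""
    support = defaultdict(int)
    for seq in database:
        # Each pattern is counted at most once per input sequence.
        for pattern in gap_subsequences(seq, max_gap, max_len):
            support[pattern] += 1
    return {p: c for p, c in support.items() if c >= min_support}


if __name__ == "__main__":
    db = [("a", "b", "c"), ("a", "c", "b"), ("a", "x", "b")]
    # With no gap allowed, no length-2 pattern occurs in two sequences.
    print(mine(db, min_support=2, max_gap=0, max_len=2))
    # Allowing one intervening item makes ("a", "b") and ("a", "c") frequent.
    print(mine(db, min_support=2, max_gap=1, max_len=2))
```

Tightening `max_gap` shrinks the set of reported patterns, which is the sense in which the abstract describes gap constraints as limiting the output. The enumeration above grows quickly with sequence length and is only suitable for toy inputs.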
This entry is part of the university bibliography.
The document is provided by the publication server of the Mannheim University Library.
ORCID: Gemulla, Rainer: https://orcid.org/0000-0003-2762-0050