Closing the gap: Sequence mining at scale


Beedkar, Kaustubh ; Berberich, Klaus ; Gemulla, Rainer ; Miliaraki, Iris


DOI: https://doi.org/10.1145/2757217
URL: https://madoc.bib.uni-mannheim.de/40019
Additional URL: http://dl.acm.org/citation.cfm?doid=2799368.275721...
URN: urn:nbn:de:bsz:180-madoc-400197
Document Type: Article
Year of publication: 2015
Journal or publication series: ACM Transactions on Database Systems : TODS
Volume: 40
Issue number: 2
Page range: Art. 8, 1-44
Place of publication: New York, NY
Publishing house: ACM Press
ISSN: 0362-5915, 1557-4644
Publication language: English
Institution: School of Business Informatics and Mathematics > Practical Computer Science I: Data Analytics (Gemulla 2014-)
Subject: 004 Computer science, internet
Abstract: Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this article, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called “gap constraints”, which can be used to limit the output to a controlled set of frequent sequences. Both positional and temporal gap constraints, as well as appropriate maximality and closedness constraints, are supported. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of ω-equivalency, which is a generalization of the notion of a “projected database” used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the contexts of text mining and session analysis suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
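To make the problem statement in the abstract concrete, the following is a minimal single-machine sketch of frequent sequence mining under a positional gap constraint. It is a naive baseline for illustration only, not MG-FSM itself: it does no partitioning, uses no MapReduce, and ignores temporal gaps, maximality, and closedness. The function names and the parameters `min_support`, `max_gap`, and `max_len` are assumptions introduced here and do not come from the article.

```python
from collections import Counter


def gap_subsequences(seq, max_gap, max_len):
    """Return the distinct subsequences of `seq` with 2..max_len items in which
    at most `max_gap` items are skipped between consecutive chosen items."""
    found = set()

    def extend(prefix, last_pos):
        if len(prefix) >= 2:
            found.add(prefix)
        if len(prefix) == max_len:
            return
        # The next item may sit at most max_gap + 1 positions after the last one.
        upper = min(last_pos + max_gap + 2, len(seq))
        for nxt in range(last_pos + 1, upper):
            extend(prefix + (seq[nxt],), nxt)

    for start in range(len(seq)):
        extend((seq[start],), start)
    return found


def mine_frequent(database, min_support, max_gap, max_len):
    """Count each subsequence once per input sequence that contains it and
    keep those that occur in at least `min_support` input sequences."""
    counts = Counter()
    for seq in database:
        counts.update(gap_subsequences(seq, max_gap, max_len))
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical toy database of four short sequences.
database = [("a", "b", "c"), ("a", "c"), ("a", "x", "c"), ("b", "c")]
frequent_gap1 = mine_frequent(database, min_support=2, max_gap=1, max_len=3)
frequent_gap0 = mine_frequent(database, min_support=2, max_gap=0, max_len=3)
```

With `max_gap=1`, the pair `("a", "c")` is supported by the first three sequences because one intervening item may be skipped; with `max_gap=0`, only contiguous matches count, so just `("b", "c")` remains frequent. This shows how a tighter gap constraint shrinks the output. The enumeration cost grows quickly with sequence length and gap size, which is the kind of cost a scalable, partitioned approach is meant to avoid.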




This entry is part of the university bibliography.

The document is provided by the publication server of the Mannheim University Library.



