Gesellschaft für Informatik e.V.

Lecture Notes in Informatics

Datenbanksysteme für Business, Technologie und Web (BTW) P-214, 37-56 (2013).

Gesellschaft für Informatik, Bonn

Copyright © Gesellschaft für Informatik, Bonn


MR-DSJ: distance-based self-join for large-scale vector data analysis with MapReduce

Thomas Seidl, Sergej Fries and Brigitte Boden


Data analytics faces huge and rapidly growing amounts of data, for which MapReduce provides a convenient and effective distributed programming model. Various algorithms already support massive data analysis on computer clusters. Distance-based similarity self-joins in particular, however, lack efficient solutions for large vector data sets, although they are fundamental to many data mining tasks, including clustering, near-duplicate detection, and outlier analysis. Our novel distance-based self-join algorithm for MapReduce, MR-DSJ, is based on grid partitioning and delivers correct, complete, and inherently duplicate-free results in a single iteration. Additionally, we propose several filter techniques that reduce the runtime and communication cost of the MR-DSJ algorithm. Analytical and experimental evaluations demonstrate its superiority over other join algorithms for MapReduce.
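
To make the idea of a grid-partitioned, duplicate-free distance self-join concrete, the following single-process Python sketch may help. It is not the MR-DSJ algorithm from the paper: the function name grid_self_join, the choice of cell width equal to the distance threshold eps, and the lexicographic neighbour rule used to avoid duplicates are all assumptions made for illustration, and the "map" and "reduce" steps are only simulated in memory, without the paper's filter techniques.

```python
from collections import defaultdict
from itertools import combinations, product
from math import dist, floor


def grid_self_join(points, eps):
    """Return all index pairs (i, j), i < j, with Euclidean distance <= eps.

    Illustrative sketch only (hypothetical helper, not the paper's method).
    """
    if eps <= 0:
        raise ValueError("eps must be positive")

    # Simulated "map" step: assign each point to the grid cell (side length
    # eps) that contains it. Two points within eps of each other always fall
    # into the same cell or into directly adjacent cells.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(floor(x / eps) for x in p)].append(idx)

    dims = len(points[0]) if points else 0
    zero = (0,) * dims
    # Only look at neighbour offsets that are lexicographically greater than
    # the zero offset, so each unordered pair of adjacent cells is visited
    # exactly once and no result pair is produced twice.
    offsets = [o for o in product((-1, 0, 1), repeat=dims) if o > zero]

    result = []
    # Simulated "reduce" step: join inside each cell, then against the
    # selected half of its neighbours.
    for cell, members in cells.items():
        for i, j in combinations(members, 2):
            if dist(points[i], points[j]) <= eps:
                result.append((i, j))
        for off in offsets:
            other = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in cells.get(other, ()):
                    if dist(points[i], points[j]) <= eps:
                        result.append((min(i, j), max(i, j)))
    return sorted(result)


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.0), (1.2, 0.0), (5.0, 5.0), (0.0, 0.9)]
    print(grid_self_join(sample, 1.0))  # [(0, 1), (0, 4), (1, 2)]
```

In a real MapReduce job, the cell identifier would serve as the shuffle key and points near cell borders would have to be replicated to neighbouring reducers; how MR-DSJ organises that replication and guarantees duplicate-free output is described in the full paper, not in this sketch.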


ISBN 978-3-88579-608-4

Last changed 04.10.2013 18:38:48