Abstract
In entity matching classification, we are given two sets R and S of objects where whether r and s form a match is known for each pair (r, s) in R x S. If R and S are subsets of domains D(R) and D(S) respectively, the goal is to discover a classifier function f: D(R) x D(S) > {0, 1} from a certain class satisfying the property that, for every (r, s) in R x S, f(r, s) = 1 if and only if r and s are a match.
Past research is accustomed to running a learning algorithm directly on all the labeled (i.e., match or not) pairs in R times S. This, however, suffers from the drawback that even reading through the input incurs a quadratic cost. We pursue a direction towards removing the quadratic barrier. Denote by T the set of matching pairs in R times S. We propose to accept R, S, and T as the input, and aim to solve the problem with cost proportional to R+S+T, thereby achieving a large performance gain in the (typical) scenario where T<<RS.
This paper provides evidence on the feasibility of the new direction, by showing how to accomplish the aforementioned purpose for entity matching with linear classification, where a classifier is a linear multidimensional plane separating the matching and nonmatching pairs. We actually do so in the MPC model, echoing the trend of deploying massively parallel computing systems for largescale learning. As a side product, we obtain new MPC algorithms for three geometric problems: linear programming, batched range counting, and dominance join.
BibTeX  Entry
@InProceedings{tao:LIPIcs:2018:8605,
author = {Yufei Tao},
title = {{Massively Parallel Entity Matching with Linear Classification in Low Dimensional Space}},
booktitle = {21st International Conference on Database Theory (ICDT 2018)},
pages = {20:120:19},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959770637},
ISSN = {18688969},
year = {2018},
volume = {98},
editor = {Benny Kimelfeld and Yael Amsterdamer},
publisher = {Schloss DagstuhlLeibnizZentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2018/8605},
URN = {urn:nbn:de:0030drops86057},
doi = {10.4230/LIPIcs.ICDT.2018.20},
annote = {Keywords: Entity Matching, Linear Programming, Range Counting, Dominance Join, Massively Parallel Computation}
}
Keywords: 

Entity Matching, Linear Programming, Range Counting, Dominance Join, Massively Parallel Computation 
Collection: 

21st International Conference on Database Theory (ICDT 2018) 
Issue Date: 

2018 
Date of publication: 

05.03.2018 