Abstract
We study distributed algorithms for some fundamental problems in data summarization. Given a communication graph $G$ of $n$ nodes, each of which may hold a value initially, we focus on computing $\sum_{i=1}^{N} g(f_i)$, where $f_i$ is the number of occurrences of value $i$ and $g$ is some fixed function. This includes important statistics such as the number of distinct elements, frequency moments, and the empirical entropy of the data.
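As a point of reference for the quantity $\sum_{i=1}^{N} g(f_i)$, the following is a minimal centralized sketch in Python, not the distributed algorithm of the paper. It assumes all values are available in one list and that $g(0) = 0$, so values that never occur contribute nothing; the function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def summarize(values, g):
    """Return the sum of g(f_i) over every value i that occurs at least once.

    Assumption: g(0) = 0, so values that never occur contribute nothing.
    """
    frequencies = Counter(values)  # f_i = number of occurrences of value i
    return sum(g(f) for f in frequencies.values())


def distinct_elements(values):
    # F_0: g(f) = 1 for every f >= 1
    return summarize(values, lambda f: 1)


def frequency_moment(values, p):
    # F_p: g(f) = f^p
    return summarize(values, lambda f: f ** p)


def empirical_entropy(values):
    # g(f) = (f / m) * log2(m / f), where m is the total number of values held
    m = len(values)
    return summarize(values, lambda f: (f / m) * math.log2(m / f))


values = ["a", "b", "a", "c", "a", "b"]  # f_a = 3, f_b = 2, f_c = 1
print(distinct_elements(values))    # 3
print(frequency_moment(values, 2))  # 14 (9 + 4 + 1)
print(empirical_entropy(values))
```

Each statistic differs only in the choice of $g$, which is why a single procedure for $\sum_{i=1}^{N} g(f_i)$ covers all three.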
In the CONGEST model, a simple adaptation from streaming lower bounds shows that it requires $\tilde{\Omega}(D + n)$ rounds, where $D$ is the diameter of the graph, to compute some of these statistics exactly. However, these lower bounds do not hold for graphs that are well-connected. We give an algorithm that computes $\sum_{i=1}^{N} g(f_i)$ exactly in $\tau_G \cdot 2^{O(\sqrt{\log n})}$ rounds, where $\tau_G$ is the mixing time of $G$. This also has applications in computing the top $k$ most frequent elements.
We demonstrate that there is a high similarity between the GOSSIP model and the CONGEST model in well-connected graphs. In particular, we show that each round of the GOSSIP model can be simulated almost perfectly in $\tilde{O}(\tau_G)$ rounds of the CONGEST model. To this end, we develop a new algorithm for the GOSSIP model that $(1 \pm \epsilon)$-approximates the $p$-th frequency moment $F_p = \sum_{i=1}^{N} f_i^p$ in $\tilde{O}(\epsilon^{-2} n^{1-k/p})$ rounds, for $p \geq 2$, when the number of distinct elements $F_0$ is at most $O(n^{1/(k-1)})$. This result can be translated back to the CONGEST model with a factor $\tilde{O}(\tau_G)$ blow-up in the number of rounds.
BibTeX Entry
@InProceedings{su_et_al:LIPIcs:2019:11340,
author = {Hsin-Hao Su and Hoa T. Vu},
title = {{Distributed Data Summarization in Well-Connected Networks}},
booktitle = {33rd International Symposium on Distributed Computing (DISC 2019)},
pages = {33:1--33:16},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-126-9},
ISSN = {1868-8969},
year = {2019},
volume = {146},
editor = {Jukka Suomela},
publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2019/11340},
URN = {urn:nbn:de:0030-drops-113400},
doi = {10.4230/LIPIcs.DISC.2019.33},
annote = {Keywords: Distributed Algorithms, Network Algorithms, Data Summarization}
}
Keywords: Distributed Algorithms, Network Algorithms, Data Summarization
Collection: 33rd International Symposium on Distributed Computing (DISC 2019)
Issue Date: 2019
Date of publication: 08.10.2019