Abstract
We consider the problem of learning a discrete distribution in the presence of an epsilon fraction of malicious data sources. Specifically, we consider the setting where there is some underlying distribution, p, and each data source provides a batch of >= k samples, with the guarantee that at least a (1  epsilon) fraction of the sources draw their samples from a distribution with total variation distance at most \eta from p. We make no assumptions on the data provided by the remaining epsilon fraction of sourcesthis data can even be chosen as an adversarial function of the (1  epsilon) fraction of "good" batches. We provide two algorithms: one with runtime exponential in the support size, n, but polynomial in k, 1/epsilon and 1/eta that takes O((n + k)/epsilon^2) batches and recovers p to error O(eta + epsilon/sqrt(k)). This recovery accuracy is information theoretically optimal, to constant factors, even given an infinite number of data sources. Our second algorithm applies to the eta = 0 setting and also achieves an O(epsilon/sqrt(k)) recover guarantee, though it runs in poly((nk)^k) time. This second algorithm, which approximates a certain tensor via a rank1 tensor minimizing l_1 distance, is surprising in light of the hardness of many lowrank tensor approximation problems, and may be of independent interest.
BibTeX  Entry
@InProceedings{qiao_et_al:LIPIcs:2018:8321,
author = {Mingda Qiao and Gregory Valiant},
title = {{Learning Discrete Distributions from Untrusted Batches}},
booktitle = {9th Innovations in Theoretical Computer Science Conference (ITCS 2018)},
pages = {47:147:20},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959770606},
ISSN = {18688969},
year = {2018},
volume = {94},
editor = {Anna R. Karlin},
publisher = {Schloss DagstuhlLeibnizZentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2018/8321},
URN = {urn:nbn:de:0030drops83215},
doi = {10.4230/LIPIcs.ITCS.2018.47},
annote = {Keywords: robust statistics, informationtheoretic optimality}
}
Keywords: 

robust statistics, informationtheoretic optimality 
Collection: 

9th Innovations in Theoretical Computer Science Conference (ITCS 2018) 
Issue Date: 

2018 
Date of publication: 

12.01.2018 