Abstract
Ashtiani et al. proposed a Semi-Supervised Active Clustering framework (SSAC), where the learner is allowed to make adaptive queries to a domain expert. The queries are of the kind "do two given points belong to the same optimal cluster?", where the answers to these queries are assumed to be consistent with a unique optimal solution. There are many clustering contexts where such same-cluster queries are feasible. Ashtiani et al. exhibited the power of such queries by showing that any instance of the k-means clustering problem, with an additional margin assumption, can be solved efficiently if one is allowed to make O(k^2 log k + k log n) same-cluster queries. This is interesting since the k-means problem, even with the margin assumption, is NP-hard.
In this paper, we extend the work of Ashtiani et al. to the approximation setting by showing that a few such same-cluster queries enable one to get a polynomial-time (1+eps)-approximation algorithm for the k-means problem without any margin assumption on the input dataset. Again, this is interesting since the k-means problem is NP-hard to approximate within a factor (1+c) for a fixed constant 0 < c < 1. The number of same-cluster queries used by the algorithm is poly(k/eps), which is independent of the size n of the dataset. Our algorithm is based on the D^2-sampling technique, also known as the k-means++ seeding algorithm. We also give a conditional lower bound on the number of same-cluster queries, showing that if the Exponential Time Hypothesis (ETH) holds, then any such efficient query algorithm needs to make Omega(k/poly log k) same-cluster queries. Our algorithm can be extended to the case where the query answers are wrong with some bounded probability. Another result we show for the k-means++ seeding is that a small modification of it within the SSAC framework converts it into a constant-factor approximation algorithm, instead of the well-known O(log k)-approximation algorithm.
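To make the two ingredients named in the abstract concrete, here is a minimal Python sketch of standard D^2-sampling (each new point is drawn with probability proportional to its squared distance to the nearest center chosen so far), combined with a simulated same-cluster oracle. The oracle built from ground-truth labels, the rejection rule in `query_seeding`, and all function names are illustrative assumptions. This is not the authors' algorithm and carries none of the paper's guarantees.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_sample(points, centers, rng):
    """Sample an index with probability proportional to the squared
    distance from the point to its nearest already-chosen center."""
    weights = [min(sq_dist(p, c) for c in centers) for p in points]
    if sum(weights) == 0:
        # All points coincide with a center; fall back to uniform sampling.
        return rng.randrange(len(points))
    return rng.choices(range(len(points)), weights=weights, k=1)[0]


def make_oracle(labels):
    """Hypothetical same-cluster oracle, simulated from ground-truth labels.
    Returns the oracle and a counter of how many queries were asked."""
    counter = {"queries": 0}

    def same_cluster(i, j):
        counter["queries"] += 1
        return labels[i] == labels[j]

    return same_cluster, counter


def query_seeding(points, k, same_cluster, rng, max_tries=1000):
    """Illustrative seeding (an assumption, not the paper's procedure):
    draw candidates by D^2-sampling and keep a candidate only if the
    oracle says it shares a cluster with none of the chosen points."""
    chosen = [rng.randrange(len(points))]  # first center: uniform at random
    tries = 0
    while len(chosen) < k and tries < max_tries:
        tries += 1
        centers = [points[i] for i in chosen]
        cand = d2_sample(points, centers, rng)
        if not any(same_cluster(cand, i) for i in chosen):
            chosen.append(cand)
    return chosen
```

Calling `query_seeding(points, k, same_cluster, random.Random(0))` returns indices of up to k points, one per distinct cluster according to the oracle. The counter returned by `make_oracle` reports how many same-cluster queries were spent.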
BibTeX Entry
@InProceedings{ailon_et_al:LIPIcs:2018:8335,
author = {Nir Ailon and Anup Bhattacharya and Ragesh Jaiswal and Amit Kumar},
title = {{Approximate Clustering with Same-Cluster Queries}},
booktitle = {9th Innovations in Theoretical Computer Science Conference (ITCS 2018)},
pages = {40:1--40:21},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-060-6},
ISSN = {1868-8969},
year = {2018},
volume = {94},
editor = {Anna R. Karlin},
publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
address = {Dagstuhl, Germany},
URL = {http://drops.dagstuhl.de/opus/volltexte/2018/8335},
URN = {urn:nbn:de:0030-drops-83358},
doi = {10.4230/LIPIcs.ITCS.2018.40},
annote = {Keywords: k-means, semi-supervised learning, query bounds}
}
Keywords: 

k-means, semi-supervised learning, query bounds
Collection: 

9th Innovations in Theoretical Computer Science Conference (ITCS 2018) 
Issue Date: 

2018 
Date of publication: 

12.01.2018 