License: Creative Commons Attribution 3.0 Unported license (CC BY 3.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.ICDT.2019.11
URN: urn:nbn:de:0030-drops-103135
Go to the corresponding LIPIcs Volume Portal

Li, Yuliang ; Wang, Jianguo ; Pullman, Benjamin ; Bandeira, Nuno ; Papakonstantinou, Yannis

Index-Based, High-Dimensional, Cosine Threshold Querying with Optimality Guarantees

LIPIcs-ICDT-2019-11.pdf (1 MB)


Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The present paper considers the efficient evaluation of such queries, providing novel optimality guarantees and exhibiting good performance on real datasets. We take as a starting point Fagin's well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified against the similarity threshold. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, to obtain an improved, tight stopping condition for index traversal and to efficiently compute it incrementally. Then we show that one can take advantage of data skewness to obtain better traversal strategies. In particular, we show a novel traversal strategy that exploits a common data skewness condition which holds in multiple domains including mass spectrometry, documents, and image databases. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general since they can be applied to a large class of similarity functions beyond cosine.

BibTeX - Entry

  author =	{Yuliang Li and Jianguo Wang and Benjamin Pullman and Nuno Bandeira and Yannis Papakonstantinou},
  title =	{{Index-Based, High-Dimensional, Cosine Threshold Querying with Optimality Guarantees}},
  booktitle =	{22nd International Conference on Database Theory (ICDT 2019)},
  pages =	{11:1--11:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-101-6},
  ISSN =	{1868-8969},
  year =	{2019},
  volume =	{127},
  editor =	{Pablo Barcelo and Marco Calautti},
  publisher =	{Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{},
  doi =		{10.4230/LIPIcs.ICDT.2019.11},
  annote =	{Keywords: Vector databases, Similarity search, Cosine, Threshold Algorithm}

Keywords: Vector databases, Similarity search, Cosine, Threshold Algorithm
Collection: 22nd International Conference on Database Theory (ICDT 2019)
Issue Date: 2019
Date of publication: 19.03.2019

DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI