License: Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2023.11
URN: urn:nbn:de:0030-drops-179650
URL: https://drops.dagstuhl.de/opus/volltexte/2023/17965/
Go to the corresponding LIPIcs Volume Portal


Gabory, Esteban ; Mwaniki, Moses Njagi ; Pisanti, Nadia ; Pissis, Solon P. ; Radoszewski, Jakub ; Sweering, Michelle ; Zuba, Wiktor

Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications

pdf-format:
LIPIcs-CPM-2023-11.pdf (1 MB)


Abstract

An elastic-degenerate (ED) string T is a sequence of n sets T[1],…,T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as ℒ(T) = {S_1 ⋯ S_n : S_i ∈ T[i] for all i ∈ [1,n]}. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T₁ and T₂ of lengths n₁ and n₂, cardinalities m₁ and m₂, and sizes N₁ and N₂, respectively, we show the following:
- There is no 𝒪((N₁N₂)^{1-ε})-time algorithm, thus no 𝒪((N₁m₂+N₂m₁)^{1-ε})-time algorithm and no 𝒪((N₁n₂+N₂n₁)^{1-ε})-time algorithm, for any constant ε > 0, for EDSI even when T₁ and T₂ are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false.
- There is no combinatorial 𝒪((N₁+N₂)^{1.2-ε}f(n₁,n₂))-time algorithm, for any constant ε > 0 and any function f, for EDSI even when T₁ and T₂ are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false.
- An 𝒪(N₁log N₁log n₁+N₂log N₂log n₂)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T₁ and T₂ are given in a compact representation, we show that the problem is NP-complete.
- An 𝒪(N₁m₂+N₂m₁)-time algorithm for EDSI.
- An Õ(N₁^{ω-1}n₂+N₂^{ω-1}n₁)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size.
We also show that the techniques we develop have applications outside of ED string comparison.

BibTeX - Entry

@InProceedings{gabory_et_al:LIPIcs.CPM.2023.11,
  author =	{Gabory, Esteban and Mwaniki, Moses Njagi and Pisanti, Nadia and Pissis, Solon P. and Radoszewski, Jakub and Sweering, Michelle and Zuba, Wiktor},
  title =	{{Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications}},
  booktitle =	{34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)},
  pages =	{11:1--11:20},
  series =	{Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN =	{978-3-95977-276-1},
  ISSN =	{1868-8969},
  year =	{2023},
  volume =	{259},
  editor =	{Bulteau, Laurent and Lipt\'{a}k, Zsuzsanna},
  publisher =	{Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address =	{Dagstuhl, Germany},
  URL =		{https://drops.dagstuhl.de/opus/volltexte/2023/17965},
  URN =		{urn:nbn:de:0030-drops-179650},
  doi =		{10.4230/LIPIcs.CPM.2023.11},
  annote =	{Keywords: elastic-degenerate string, sequence comparison, languages intersection, pangenome, acronym identification}
}

Keywords: elastic-degenerate string, sequence comparison, languages intersection, pangenome, acronym identification
Collection: 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)
Issue Date: 2023
Date of publication: 21.06.2023


DROPS-Home | Fulltext Search | Imprint | Privacy Published by LZI