Abstract
Indexable elastic founder graphs have recently been proposed as a data structure for genomics applications that supports fast pattern matching queries. Consider segmenting a multiple sequence alignment MSA[1..m,1..n] into b blocks MSA[1..m,1..j₁], MSA[1..m,j₁+1..j₂], …, MSA[1..m,j_{b-1}+1..n]. The resulting elastic founder graph (EFG) is obtained by merging, in each block, the strings that are equivalent after the removal of gap symbols, taking the resulting strings as the nodes of the block and the original MSA connections as edges. We call an elastic founder graph indexable if a node label occurs as a prefix of only those paths that start from a node of the same block. Equi et al. (ISAAC 2021) showed that such EFGs support fast pattern matching and studied their construction maximizing the number of blocks and minimizing the maximum length of a block, but left open the case of minimizing the maximum number of distinct strings in a block, which we call graph height. For the simplified gapless setting, we give an O(mn)-time algorithm to find a segmentation of an MSA minimizing the height of the resulting indexable founder graph, by combining previous results in segmentation algorithms and founder graphs. For the general setting, the known techniques yield a linear-time parameterized solution on constant alphabet Σ, taking time O(m n² log|Σ|) in the worst case, so we study the refined measure of prefix-aware height, which omits counting strings that are prefixes of another considered string. The indexable EFG minimizing the maximum prefix-aware height provides a lower bound for the original height: by exploiting suffix trees built from the MSA rows and the data structure answering weighted ancestor queries in constant time of Belazzougui et al. (CPM 2021), we give an O(mn)-time algorithm for the optimal EFG under this alternative height.
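The two measures defined in the abstract can be illustrated with a minimal Python sketch. It only evaluates the height and the prefix-aware height of a segmentation that is already given; it does not check indexability and is not the O(mn)-time optimization algorithm of the paper. The function names, the `-` gap symbol, the 0-based columns with exclusive block ends, and the reading of "prefix" as proper prefix are all assumptions made for illustration.

```python
def block_strings(msa, start, end, gap="-"):
    """Distinct strings spelled by the rows in columns start..end-1 after gap removal."""
    return {row[start:end].replace(gap, "") for row in msa}


def height(msa, boundaries, gap="-"):
    """Maximum number of distinct gap-free strings over all blocks.

    `boundaries` lists the exclusive end column of each block (hypothetical
    convention); the last entry should equal the number of MSA columns.
    """
    starts = [0] + boundaries[:-1]
    return max(len(block_strings(msa, s, e, gap)) for s, e in zip(starts, boundaries))


def prefix_aware_height(msa, boundaries, gap="-"):
    """Like height(), but a string that is a proper prefix of another string
    in the same block is not counted (assumed reading of the definition)."""
    starts = [0] + boundaries[:-1]
    best = 0
    for s, e in zip(starts, boundaries):
        strings = block_strings(msa, s, e, gap)
        counted = [
            x for x in strings
            if not any(y != x and y.startswith(x) for y in strings)
        ]
        best = max(best, len(counted))
    return best
```

For the toy alignment `["ACGT", "ACGT", "A-GT", "ACGT"]` segmented with `boundaries = [2, 4]`, the first block yields the strings `AC` and `A` and the second yields only `GT`, so the height is 2 while the prefix-aware height is 1, because `A` is a prefix of `AC`.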
BibTeX Entry
@InProceedings{rizzo_et_al:LIPIcs.CPM.2022.19,
author = {Rizzo, Nicola and M\"{a}kinen, Veli},
title = {{Indexable Elastic Founder Graphs of Minimum Height}},
booktitle = {33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022)},
pages = {19:1--19:19},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-234-1},
ISSN = {1868-8969},
year = {2022},
volume = {223},
editor = {Bannai, Hideo and Holub, Jan},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2022/16146},
URN = {urn:nbn:de:0030-drops-161467},
doi = {10.4230/LIPIcs.CPM.2022.19},
annote = {Keywords: multiple sequence alignment, pattern matching, data structures, segmentation algorithms, dynamic programming, suffix tree}
}
Keywords: 

multiple sequence alignment, pattern matching, data structures, segmentation algorithms, dynamic programming, suffix tree 
Collection: 

33rd Annual Symposium on Combinatorial Pattern Matching (CPM 2022) 
Issue Date: 

2022 
Date of publication: 

22.06.2022 