Abstract
Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of lengthk substrings of W. The ETFS problem (Edit distance, Total order, Frequency, Sanitization) asks us to construct a string X_ED such that: (i) no string of 𝒮 occurs in X_ED; (ii) the order of all other lengthk substrings over Σ (and thus the frequency) is the same in W and in X_ED; and (iii) X_ED has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019].
ETFS can be solved in 𝒪(n²k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in 𝒪(n^{2δ}) time, for any δ > 0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows:
 An 𝒪(n²log²k)time algorithm to solve ETFS.
 An 𝒪(n²log²n)time algorithm to solve AETFS (Arbitrary lengths, Edit distance, Total order, Frequency, Sanitization), a generalization of ETFS in which the elements of 𝒮 can have arbitrary lengths. Our algorithms are thus optimal up to subpolynomial factors, unless SETH fails.
In order to arrive at these results, we develop new techniques for computing a variant of the standard dynamic programming (DP) table for edit distance. In particular, we simulate the DP table computation using a directed acyclic graph in which every node is assigned to a smaller DP table. We then focus on redundancy in these DP tables and exploit a tabulation technique according to dyadic intervals to obtain an optimal alignment in 𝒪̃(n²) total time. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or contextfree grammars.
BibTeX  Entry
@InProceedings{mieno_et_al:LIPIcs.CPM.2021.19,
author = {Mieno, Takuya and Pissis, Solon P. and Stougie, Leen and Sweering, Michelle},
title = {{String Sanitization Under Edit Distance: Improved and Generalized}},
booktitle = {32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
pages = {19:119:18},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959771863},
ISSN = {18688969},
year = {2021},
volume = {191},
editor = {Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
publisher = {Schloss Dagstuhl  LeibnizZentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2021/13970},
URN = {urn:nbn:de:0030drops139709},
doi = {10.4230/LIPIcs.CPM.2021.19},
annote = {Keywords: string algorithms, data sanitization, edit distance, dynamic programming}
}
Keywords: 

string algorithms, data sanitization, edit distance, dynamic programming 
Collection: 

32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021) 
Issue Date: 

2021 
Date of publication: 

30.06.2021 