License:
Creative Commons Attribution 4.0 International license (CC BY 4.0)
When quoting this document, please refer to the following
DOI: 10.4230/LIPIcs.CPM.2021.19
URN: urn:nbn:de:0030-drops-139709
URL: https://drops.dagstuhl.de/opus/volltexte/2021/13970/
Mieno, Takuya ;
Pissis, Solon P. ;
Stougie, Leen ;
Sweering, Michelle
String Sanitization Under Edit Distance: Improved and Generalized
Abstract
Let W be a string of length n over an alphabet Σ, k be a positive integer, and 𝒮 be a set of length-k substrings of W. The ETFS problem (Edit distance, Total order, Frequency, Sanitization) asks us to construct a string X_ED such that: (i) no string of 𝒮 occurs in X_ED; (ii) the order of all other length-k substrings over Σ (and thus the frequency) is the same in W and in X_ED; and (iii) X_ED has minimal edit distance to W. When W represents an individual’s data and 𝒮 represents a set of confidential patterns, the ETFS problem asks for transforming W to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019].
ETFS can be solved in 𝒪(n²k) time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in 𝒪(n^{2-δ}) time, for any δ > 0, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows:
- An 𝒪(n²log²k)-time algorithm to solve ETFS.
- An 𝒪(n²log²n)-time algorithm to solve AETFS (Arbitrary lengths, Edit distance, Total order, Frequency, Sanitization), a generalization of ETFS in which the elements of 𝒮 can have arbitrary lengths. Our algorithms are thus optimal up to subpolynomial factors, unless SETH fails.
In order to arrive at these results, we develop new techniques for computing a variant of the standard dynamic programming (DP) table for edit distance. In particular, we simulate the DP table computation using a directed acyclic graph in which every node is assigned to a smaller DP table. We then focus on redundancy in these DP tables and exploit a tabulation technique according to dyadic intervals to obtain an optimal alignment in 𝒪̃(n²) total time. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
BibTeX - Entry
@InProceedings{mieno_et_al:LIPIcs.CPM.2021.19,
author = {Mieno, Takuya and Pissis, Solon P. and Stougie, Leen and Sweering, Michelle},
title = {{String Sanitization Under Edit Distance: Improved and Generalized}},
booktitle = {32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)},
pages = {19:1--19:18},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {978-3-95977-186-3},
ISSN = {1868-8969},
year = {2021},
volume = {191},
editor = {Gawrychowski, Pawe{\l} and Starikovskaya, Tatiana},
publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2021/13970},
URN = {urn:nbn:de:0030-drops-139709},
doi = {10.4230/LIPIcs.CPM.2021.19},
annote = {Keywords: string algorithms, data sanitization, edit distance, dynamic programming}
}
Keywords: |
|
string algorithms, data sanitization, edit distance, dynamic programming |
Collection: |
|
32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021) |
Issue Date: |
|
2021 |
Date of publication: |
|
30.06.2021 |