Abstract
Recently, due to the genomic sequence analysis in several types of cancer, genomic data based on copy number profiles (CNP for short) are getting more and more popular. A CNP is a vector where each component is a nonnegative integer representing the number of copies of a specific segment of interest. The motivation is that in the late stage of certain types of cancer, the genomes are progressing rapidly by segmental duplications and deletions, and hence obtaining the exact sequences becomes difficult. Instead, the number of copies of important segments can be predicted from expression analysis and carries important biological information. Therefore, significant research has recently been devoted to the analysis of genomic data represented as CNP’s.
In this paper, we present two streams of results. The first is the negative results on two open problems regarding the computational complexity of the Minimum Copy Number Generation (MCNG) problem posed by Qingge et al. in 2018. The Minimum Copy Number Generation (MCNG) is defined as follows: given a string S in which each character represents a gene or segment, and a CNP C, compute a string T from S, with the minimum number of segmental duplications and deletions, such that cnp(T)=C. It was shown by Qingge et al. that the problem is NPhard if the duplications are tandem and they left the open question of whether the problem remains NPhard if arbitrary duplications and/or deletions are used. We answer this question affirmatively in this paper; in fact, we prove that it is NPhard to even obtain a constant factor approximation. This is achieved through a generalpurpose lemma on setcover reductions that require an exact cover in one direction, but not the other, which might be of independent interest. We also prove that the corresponding parameterized version is W[1]hard, answering another open question by Qingge et al.
The other result is positive and is based on a new (and more general) problem regarding CNP’s. The Copy Number Profile Conforming (CNPC) problem is formally defined as follows: given two CNP’s C₁ and C₂, compute two strings S₁ and S₂ with cnp(S₁)=C₁ and cnp(S₂)=C₂ such that the distance between S₁ and S₂, d(S₁,S₂), is minimized. Here, d(S₁,S₂) is a very general term, which means it could be any genome rearrangement distance (like reversal, transposition, and tandem duplication, etc). We make the first step by showing that if d(S₁,S₂) is measured by the breakpoint distance then the problem is polynomially solvable. We expect that this will trigger some related research along the line in the near future.
BibTeX  Entry
@InProceedings{lafond_et_al:LIPIcs:2020:12147,
author = {Manuel Lafond and Binhai Zhu and Peng Zou},
title = {{Genomic Problems Involving Copy Number Profiles: Complexity and Algorithms}},
booktitle = {31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)},
pages = {22:122:15},
series = {Leibniz International Proceedings in Informatics (LIPIcs)},
ISBN = {9783959771498},
ISSN = {18688969},
year = {2020},
volume = {161},
editor = {Inge Li G{\o}rtz and Oren Weimann},
publisher = {Schloss DagstuhlLeibnizZentrum f{\"u}r Informatik},
address = {Dagstuhl, Germany},
URL = {https://drops.dagstuhl.de/opus/volltexte/2020/12147},
URN = {urn:nbn:de:0030drops121471},
doi = {10.4230/LIPIcs.CPM.2020.22},
annote = {Keywords: Computational genomics, cancer genomics, copy number profiles, NPhardness, approximation algorithms, FPT algorithms}
}
Keywords: 

Computational genomics, cancer genomics, copy number profiles, NPhardness, approximation algorithms, FPT algorithms 
Collection: 

31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020) 
Issue Date: 

2020 
Date of publication: 

09.06.2020 