Using N-terminal targeting sequences, amino acid composition, and sequence motifs for predicting protein subcellular localization
Functional annotation of unknown proteins is a major goal in proteomics. A key step in this annotation process is determining a protein's subcellular localization. Consequently, numerous techniques for predicting localization have been developed over the years. These methods typically focus on a single underlying biological aspect or predict only a subset of all possible subcellular localizations. There is a clear need for new methods that utilize and represent available protein-specific biological knowledge from several sources, in order to improve accuracy and localization coverage for a wide range of organisms. Here we present a novel Support Vector Machine (SVM)-based approach for predicting protein subcellular localization, which integrates information about N-terminal targeting sequences, amino acid composition, and protein sequence motifs. By capturing and bringing together biologically relevant information, this approach takes an important step towards emulating the protein sorting process. Our approach has been used to develop two new prediction methods, TargetLoc and MultiLoc. TargetLoc is restricted to the analysis of proteins containing N-terminal targeting sequences, whereas MultiLoc covers all major eukaryotic subcellular localizations for animal, plant, and fungal proteins. TargetLoc performs better than similar methods. MultiLoc performs considerably better than comparable methods that predict all major eukaryotic subcellular localizations, and shows results better than or comparable to those of methods that are specialized for fewer localizations or for a single organism.
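As a rough illustration of one of the information sources named above, the following minimal sketch computes an amino acid composition vector for a whole sequence and for its N-terminal segment, the kind of fixed-length numeric input an SVM could be trained on. It is not the encoding used by TargetLoc or MultiLoc: the function names, the 20-residue N-terminal window, and the example sequence are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so vectors are comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the fraction of each standard amino acid in seq (length-20 list)."""
    counts = Counter(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty sequence or no standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def feature_vector(seq, n_term_len=20):
    """Concatenate whole-sequence and N-terminal composition (length-40 list).

    n_term_len is a hypothetical window size, not a value from the paper.
    """
    return composition(seq) + composition(seq[:n_term_len])


# Hypothetical sequence, used only to show the shape of the output.
example = "MLSLRQSIRFFKPATRTLCSSRYLL"
vec = feature_vector(example)
print(len(vec))
```

Vectors of this form could be passed to any SVM implementation together with localization labels; motif-based features would have to be appended separately.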