Evolutionary-based Feature Generation

Extracting Function from Biological Sequences

This work proposes novel feature generation methods to improve the feature-based classification and annotation of genomic sequences. Earlier work by us and our collaborators has shown that evolutionary algorithms employing genetic algorithm techniques reveal k-mer features that improve the classification of hypersensitive sites and splice sites in DNA sequences. This work has appeared in: Uday Kamath, Amarda Shehu, and Kenneth A. De Jong, “Using Evolutionary Computation to Improve SVM Classification,” IEEE World Congress on Computational Intelligence (WCCI), Barcelona, Spain, 2010; and in Uday Kamath, Kenneth A. De Jong, and Amarda Shehu, “Selecting Predictive Features for Recognition of Hypersensitive Sites of Regulatory Genomic Sequences with an Evolutionary Algorithm,” Genetic and Evolutionary Computation Conference (GECCO), Portland, Oregon, 2010, pp. 179-186.

Our most recent work explores the automatic construction of complex features through genetic programming techniques. Feature generation is a difficult problem: typically, domain experts or exhaustive feature enumeration techniques produce a few relevant features, whose predictive power is then tested in the context of classification. We propose an evolutionary algorithm approach that effectively explores a large feature space, beyond k-mers, and automatically generates predictive features from sequence data. The features generated by the algorithm include compositional, positional, correlational, disjunctive, and conjunctive features. These complex features are evolved in a genetic programming setting through basic operations and evaluated in the context of classification by Support Vector Machines.
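To make the feature types named above more concrete, the following is a minimal Python sketch of how such features could be expressed as composable predicates over a DNA sequence. It is not the FG-EA implementation: all function names (compositional, positional, correlational, conjunction, disjunction, featurize) and the simplified definitions of each feature type are assumptions chosen for illustration, and the evolutionary search and SVM evaluation are omitted.

```python
from typing import Callable, List, Sequence

# A feature maps a DNA sequence to True/False (illustrative assumption).
Feature = Callable[[str], bool]


def compositional(motif: str) -> Feature:
    """True if the motif occurs anywhere in the sequence."""
    return lambda seq: motif in seq


def positional(motif: str, pos: int) -> Feature:
    """True if the motif starts exactly at 0-based index pos (pos >= 0)."""
    return lambda seq: seq.startswith(motif, pos)


def correlational(a: str, pos_a: int, b: str, pos_b: int) -> Feature:
    """True if two motifs co-occur at their respective positions."""
    first, second = positional(a, pos_a), positional(b, pos_b)
    return lambda seq: first(seq) and second(seq)


def conjunction(*features: Feature) -> Feature:
    """True only if every sub-feature is true."""
    return lambda seq: all(f(seq) for f in features)


def disjunction(*features: Feature) -> Feature:
    """True if at least one sub-feature is true."""
    return lambda seq: any(f(seq) for f in features)


def featurize(seqs: Sequence[str], features: Sequence[Feature]) -> List[List[int]]:
    """Turn sequences into 0/1 feature vectors suitable for a classifier."""
    return [[int(f(s)) for f in features] for s in seqs]


if __name__ == "__main__":
    # Hypothetical toy sequences and features, not data from the paper.
    features = [
        compositional("GT"),
        positional("GT", 3),
        correlational("AG", 1, "GT", 3),
        disjunction(positional("GT", 3), positional("GC", 2)),
        conjunction(compositional("GT"), compositional("TA")),
    ]
    for row in featurize(["AAGGTAAGT", "CCGCTTACA"], features):
        print(row)
```

In a setting like the one described, each row of the resulting 0/1 matrix would be the input vector for one sequence, and a classifier such as a Support Vector Machine would be trained on these vectors to score how predictive a candidate set of features is.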

The proposed algorithm, named FG-EA for Feature Generation with an Evolutionary Algorithm, is tested for its effectiveness on an important component of the gene-finding problem, DNA splice-site prediction. While FG-EA is generally relevant for feature-based classification, this particular application is chosen due to the demonstrated complexity of the features needed to obtain high classification accuracy and precision. Additional experiments show that the algorithm is useful beyond the classification setting, in annotating new sequences with splice site information.

This work has appeared in: Uday Kamath, Jack Compton, Rezarta Islamaj Dogan, Kenneth A. De Jong, and Amarda Shehu, “An Evolutionary Algorithm Approach for Feature Generation from Sequence Data and its Application to DNA Splice-Site Prediction,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2012, 9(5):1387-1398.

In response to community interest in the details of the method and the usability of the code, we post further details here. Please feel free to contact any of the authors with questions.

Members of this Project:
Uday Kamath
Jack Compton (Undergraduate student)
Rezarta Islamaj-Dogan
Kenneth De Jong
Amarda Shehu