Machine learning-based protein annotation tool predicts protein function

(Nanowerk News) Microbes drive key processes of life on Earth. They affect global elemental cycles—the movement of carbon, nitrogen, and other elements. They also promote plant growth and affect the development of diseases. These roles are essential in every ecosystem. Research constantly expands the database of microbial DNA sequences but does not provide all the biological information about proteins.
To engineer microbes for sustainable bioenergy and other bioproducts, scientists need a fuller understanding of the function of proteins and other molecules. Scientists infer the function of a protein by comparing it with reference databases of already characterized proteins.
However, these comparisons are difficult and not scalable to massive databases. To address this challenge, scientists have applied machine learning to models that predict protein function. The result is the program Snekmer, which allows scientists to quickly model families of proteins.
Studying biological protein molecules in microbes will help scientists pursue new applications for engineered microbes. Snekmer is easy to deploy in high-performance computing environments. In addition, it is incorporated into the KBase framework as a new application that will allow users to annotate their genome and metagenome sequences. This will help scientists to better model the effects of engineering microbes. That includes these microbes’ effect on the climate and their benefits for crop health and bioproduction. Snekmer will also help scientists study the evolution of microbes and patterns in microbiomes.
The inability of current methods to predict function for 30-50% of bacterial protein sequences is a significant barrier to better understanding of complex systems such as soil microbiomes. Most protocols rely on pair-wise alignments, which are becoming computationally intractable and more challenging to interpret as databases expand.
For alignment-based models of protein families, the sensitivity and accuracy depend on initial training sets, which risk obsolescence as additional sequence diversity is discovered. Many bacterial proteins have either no functional assignment or are only assigned a general function based solely on taxonomic understanding.
To address this need, researchers at Pacific Northwest National Laboratory, Baylor University, and Oregon Health & Science University developed Snekmer, a software tool leveraging redundancy of amino acid residue properties to reduce sequence space and using short protein sequence (kmer) features for machine learning to generate protein family models. Snekmer users can recode protein sequences into reduced alphabet kmer vectors and perform construction of supervised classification models trained on input protein families, or protein functional classification based on Snekmer models.
The research has been published in Bioinformatics Advances ("Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding").
Source: Pacific Northwest National Laboratory (Note: Content may be edited for style and length)