Machine learning helps predict protein functions

(Nanowerk News) Proteins are integral components of all living organisms. They are composed of a sequence of building blocks called amino acids. That sequence determines their function, which can range from setting the structure of cells to regulating metabolism.
Scientists can change a protein sequence and experimentally test if and how that change alters its function. However, there are too many possible amino acid sequence changes to test them all in the laboratory.
Instead, researchers build highly complex computational models that predict protein function based on their amino acid sequence. This is critical for engineering proteins with novel functions.
Scientists have now combined multiple machine learning approaches for building a simple predictive model that often works better than established, complex methods (Nature Biotechnology, "Learning protein fitness models from evolutionary and assay-labeled data").
probability density model
Scientists create a “probability density model” using evolutionary data from related proteins. They also experimentally measure functions of related protein variants to “augment” the model and predict how to modify a protein to improve its function. (Image: University of California, Berkeley)
Naturally occurring proteins serve many crucial functions in maintaining life. But scientists can also engineer natural proteins for desired purposes such as gene editing and the synthesis of valuable chemicals.
This new combined modeling approach to predict protein function will aid in the design and engineering of novel proteins. This approach will allow scientists to easily redesign proteins for a huge range of applications such as new enzymes to convert plant matter into biofuels or bioproducts or to create new biomaterials.
Scientists have several approaches to predict functional properties of a given protein that use the protein’s amino acid sequence to build a computational model. Scientists create such models employing both classical statistical methods and modern-day machine learning computational approaches.
One of those statistical methods, called regression analysis, associates a given amino acid sequence with an experimentally measured functional property of a protein. To increase the amount of data available to make functional predictions for a protein, researchers include sequences of evolutionarily-related proteins as additional input.
In general, those evolutionarily-related proteins are likely to share the property of the protein of interest, albeit often without direct experimental evidence. Researchers use a machine learning modeling approach based on the statistical properties of those sequences.
In the study highlighted here, researchers combined regression analysis and evolutionary data to propose a simple, effective machine learning approach. The researchers found that this simple combination approach is competitive with, and often outperforms, more sophisticated methods.
Source: U.S. Department of Energy, Office of Science