AI system can generate novel proteins that meet structural design targets

(Nanowerk News) MIT researchers are using artificial intelligence to design new proteins that go beyond those found in nature.
They developed machine-learning algorithms that can generate proteins with specific structural features, which could be used to make materials with particular mechanical properties, such as stiffness or elasticity. These biologically inspired materials could eventually replace materials made from petroleum or ceramics, but with a much smaller carbon footprint.
Researchers from MIT, the MIT-IBM Watson AI Lab, and Tufts University used a generative model, the same type of machine-learning architecture used in AI systems such as DALL-E 2. But instead of using it to generate realistic images from natural-language prompts, as DALL-E 2 does, they adapted the architecture to predict amino acid sequences of proteins that achieve specified structural objectives.
A new machine-learning system can generate protein designs that have certain structural features and do not exist in nature. These proteins could be used to make materials with mechanical properties similar to those of existing materials, like polymers, but with a much smaller carbon footprint. (Image: Jose-Luis Olivares)
In a paper published in the journal Chem ("Generative design of de novo proteins based on secondary-structure constraints using an attention-based diffusion model"), the researchers demonstrate how these models can generate realistic yet novel proteins. The models, which learn the biochemical relationships that govern how proteins form, can produce new proteins that could enable unique applications, says senior author Markus Buehler, the Jerry McAfee Professor in Engineering and professor of civil and environmental engineering and of mechanical engineering.
For example, this tool could be used to develop protein-inspired food coatings that keep fruits and vegetables fresh longer while being safe for humans to eat. The models can also generate millions of proteins in a few days, Buehler adds, quickly giving scientists a large portfolio of new ideas to explore.
According to Buehler, who is a member of the MIT-IBM Watson AI Lab, "When considering the creation of proteins that nature has yet to unveil, it is an immense design space that cannot be resolved with manual approaches. It is necessary to comprehend the language of life, how DNA encodes amino acids, and how they combine to produce protein structures. Prior to the advent of deep learning, this was not possible."
Bo Ni, a postdoctoral researcher in Buehler's Laboratory for Atomistic and Molecular Mechanics, and David Kaplan, the Stern Family Professor of Engineering and professor of bioengineering at Tufts, are also authors of the paper.

Adapting new tools for the task

Proteins are formed by chains of amino acids that fold into three-dimensional shapes. The sequence of amino acids determines the mechanical properties of the protein. Although scientists have identified thousands of proteins produced by evolution, they estimate that an enormous number of amino acid sequences remain undiscovered.
To streamline protein discovery, researchers have recently developed deep-learning models that can predict the 3D structure of a protein from its amino acid sequence. But the inverse problem, predicting an amino acid sequence that meets structural design targets, has proven even more challenging.
Buehler and his colleagues tackled this challenge using attention-based diffusion models, a recent advance in machine learning.
According to Buehler, attention-based models are important in protein design because they can learn very long-range relationships: a single mutation in a long amino acid sequence can have a significant impact on the entire design. A diffusion model learns to generate new data through a process that adds noise to training data and then learns to recover the original data by removing the noise. Such models are often more effective than other model types at generating high-quality, realistic data that can be conditioned to meet a set of design targets.
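The noise-adding half of that process can be illustrated with a minimal Python sketch. It uses the standard forward-diffusion formula found in the general diffusion-model literature; the one-hot encoding, the linear noise schedule, its parameters, and the toy sequence are all assumptions made for illustration and are not taken from the paper. The denoising network that learns to reverse the process is not shown.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard amino acids

def one_hot(sequence):
    """Encode each residue as a 20-dimensional one-hot vector (assumed encoding)."""
    return [[1.0 if aa == letter else 0.0 for letter in AMINO_ACIDS] for aa in sequence]

def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Fraction of signal variance left after t steps of a linear noise schedule."""
    product = 1.0
    for step in range(t):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        product *= 1.0 - beta
    return product

def add_noise(x0, t, num_steps, rng):
    """Forward diffusion: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * gaussian_noise."""
    a_bar = alpha_bar(t, num_steps)
    signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [[signal * v + noise * rng.gauss(0.0, 1.0) for v in row] for row in x0]

rng = random.Random(0)
x0 = one_hot("MKTAYIAK")  # hypothetical toy sequence, not a real design
slightly_noisy = add_noise(x0, t=10, num_steps=1000, rng=rng)   # still close to x0
mostly_noise = add_noise(x0, t=1000, num_steps=1000, rng=rng)   # almost pure noise
```

During training, a network would be shown such noisy inputs along with the design targets and asked to recover the clean data; generation then starts from pure noise and removes it step by step.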
Using this architecture, the researchers developed two machine-learning models capable of predicting novel amino acid sequences that form proteins meeting specific structural design targets.
Buehler explains that in the biomedical industry, having a completely unknown protein can be problematic as its properties are not well understood. However, in some applications, it may be desirable to create a novel protein with similar characteristics to those found in nature but with distinct functions. By using the developed models, a range of proteins can be generated and controlled by adjusting certain parameters, allowing for tailored designs to meet specific requirements.
Common folding patterns of amino acids, known as secondary structures, give proteins different mechanical properties. For example, proteins with alpha-helix structures tend to be stretchy, while those with beta-sheet structures are usually rigid. Combining alpha helices and beta sheets in one protein can yield a material that is both stretchy and strong, much like silk.
The researchers developed two models, one that operates on the overall structural properties of the protein and one that operates at the amino acid level. Both work by combining these secondary structures to generate proteins. For the model that operates on overall structural properties, a user inputs the desired percentage of each structure (40 percent alpha helix and 60 percent beta sheet, for instance), and the model generates sequences that meet those targets. For the model that operates at the amino acid level, the scientist also specifies the order of the structures, which gives much finer-grained control over the final product.
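The difference between the two kinds of design target can be made concrete with a small Python sketch. The article does not describe the models' input format, so the representation here is an assumption: a residue-level target is written as a string with one label per position ("H" for helix, "E" for sheet, loosely following common secondary-structure notation), and an overall target is a dictionary of fractions.

```python
from collections import Counter

def structure_fractions(per_residue_labels):
    """Collapse a residue-level target into overall structure fractions."""
    counts = Counter(per_residue_labels)
    total = len(per_residue_labels)
    return {label: count / total for label, count in counts.items()}

# Residue-level target: the order and length of every segment are specified.
residue_target = "H" * 8 + "E" * 12          # helix first, then sheet

# Overall target implied by it: 40 percent helix, 60 percent sheet.
overall_target = structure_fractions(residue_target)

# A different arrangement with the same overall fractions.
rearranged_target = "E" * 6 + "H" * 8 + "E" * 6
same_overall = structure_fractions(rearranged_target)
```

Both strings collapse to the same 40/60 split, which is why specifying the order gives the designer more control than specifying percentages alone.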
The developed models are linked to an algorithm that can predict the folding of proteins. The researchers use this algorithm to determine the 3D structure of the generated proteins. They then calculate the resulting mechanical properties of the protein and compare them against the specified design requirements. This enables them to verify whether the designed proteins meet the desired specifications.
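A generate-then-verify loop of this kind might look like the Python sketch below. It checks secondary-structure content only, as a simplified stand-in for the property check described above. Everything in it is hypothetical: `screen`, `content_error`, the tolerance, and the candidate sequences are illustrative names and values, and `toy_predictor` is a placeholder that merely labels residues by letter. It is not a folding algorithm and does not reflect the tool the researchers used.

```python
def content_error(target, labels):
    """Largest absolute gap between target and observed structure fractions."""
    keys = set(labels) | set(target)
    observed = {k: labels.count(k) / len(labels) for k in keys}
    return max(abs(target.get(k, 0.0) - observed[k]) for k in keys)

def screen(candidates, target, predict_labels, tolerance=0.1):
    """Keep candidates whose predicted structure content is within tolerance."""
    accepted = []
    for sequence in candidates:
        labels = predict_labels(sequence)  # stand-in for folding + structure assignment
        if content_error(target, labels) <= tolerance:
            accepted.append(sequence)
    return accepted

def toy_predictor(sequence):
    """Placeholder only: labels A, E, L, M as helix and everything else as sheet."""
    return "".join("H" if aa in "AELM" else "E" for aa in sequence)

target = {"H": 0.4, "E": 0.6}
candidates = ["AELMVIVTYV", "AAAAAAAAAA"]  # hypothetical generated sequences
accepted = screen(candidates, target, toy_predictor)
```

In this toy run only the first candidate passes, because the placeholder assigns it four helix and six sheet residues, while the second is labeled entirely as helix.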

Realistic yet novel designs

To test their models, the researchers compared the newly generated proteins with known proteins that have similar structural properties. Many of the generated proteins overlapped with existing amino acid sequences by about 50 to 60 percent. The models also produced entirely new sequences, demonstrating that they can design novel proteins. According to Buehler, this level of similarity suggests that many of the generated proteins can be synthesized.
To check that the designed proteins are reasonable, the researchers tried to trick the models by giving them physically impossible design targets. Instead of producing improbable proteins, the models generated the closest synthesizable solution, which suggests they remain robust even when given infeasible design specifications.
Ni notes that the machine-learning algorithm can pick up hidden relationships in nature, which gives the researchers confidence that the generated proteins are likely to be realistic and feasible to synthesize.
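The article does not say how the overlap was measured. As a rough illustration of what a figure like 60 percent means, the Python sketch below computes position-by-position identity between two sequences that are assumed to be already aligned and of equal length. Real comparisons normally rely on sequence-alignment tools, and both example sequences are made up.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two pre-aligned sequences agree."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and aligned to equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

generated = "MKTAYIAKQR"  # hypothetical generated sequence
known = "MKSVYLAKHR"      # hypothetical known sequence

identity = percent_identity(generated, known)  # 6 of 10 positions match: 60.0
```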
In the next stage, the researchers intend to validate some of the newly designed proteins experimentally by synthesizing them in the laboratory. Additionally, they plan to further improve and refine their models, allowing them to design amino acid sequences that meet additional criteria, such as specific biological functions. The ultimate goal is to develop a versatile platform that can generate a wide range of protein designs for use in various applications, including biomedicine and materials science.
Buehler emphasizes that application areas such as sustainability, medicine, food, health, and materials design will require solutions beyond what nature has provided. The new design tool lets researchers design proteins with specific properties for uses ranging from new medicines to sustainable materials, and so could help address pressing societal challenges.
The research received support from several organizations, including the MIT-IBM Watson AI Lab, the U.S. Department of Agriculture, the U.S. Department of Energy, the Army Research Office, the National Institutes of Health, and the Office of Naval Research. The support from these organizations highlights the significance and potential impact of the research in various fields.
Source: By Adam Zewe, MIT (Note: Content may be edited for style and length)