Discovering new materials and drugs typically involves a manual, trial-and-error process that can take decades and cost millions of dollars. To simplify this process, scientists often use machine learning to predict molecular properties and narrow down the molecules they need to synthesize and test in the lab.
Researchers at MIT and the MIT-IBM Watson AI Lab have developed a new, unified framework that can simultaneously predict molecular properties and generate new molecules far more efficiently than popular deep learning approaches.
To teach a machine learning model to predict the biological or mechanical properties of a molecule, researchers must show it millions of labeled molecular structures—a process known as training. Due to the cost of discovering molecules and the challenges of manually labeling millions of structures, large training datasets are often difficult to obtain, limiting the effectiveness of machine learning approaches.
In contrast, the system developed by the MIT researchers can efficiently determine molecular properties using only a small amount of data. Their system has a basic understanding of the rules that dictate how building blocks come together to produce functional molecules. These rules capture similarities between molecular structures, helping the system generate new molecules and predict their properties in a data-efficient way.
This method outperformed other machine learning approaches on both small and large datasets, and it accurately predicted molecular properties and generated viable molecules even when the dataset contained fewer than 100 samples.
“Our goal in this project is to use data-driven methods to speed up the discovery of new molecules, so you can train a model to make predictions without all these expensive experiments,” says lead author Minghao Guo, a graduate student in electrical engineering and computer science (EECS).
Guo’s co-authors include MIT-IBM Watson AI Lab research staff members Veronika Thost, Payel Das, and Jie Chen; recent MIT graduates Samuel Song ’23 and Adithya Balachandran ’23; and senior author Wojciech Matusik, a professor of electrical engineering and computer science and a member of the MIT-IBM Watson AI Lab, who leads the Computational Design and Fabrication Group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the International Conference on Machine Learning (ICML).
Learning the language of molecules
To get the best results from machine learning models, scientists need training datasets with millions of molecules whose properties are similar to those they are trying to predict. In reality, these domain-specific datasets are usually very small, so researchers rely on models pretrained on large datasets of general molecules, which they then apply to much smaller, targeted datasets. But because these models lack domain-specific knowledge, they tend to perform poorly.
The MIT team took a different approach. They created a machine learning system that automatically learns the “language” of molecules—known as molecular grammar—using only small, domain-specific datasets. It uses this grammar to construct viable molecules and predict their properties.
In language theory, words, sentences, and paragraphs are formed according to a set of grammar rules. You can think of molecular grammar the same way: it is a set of production rules that dictate how molecules or polymers are built by combining atoms and substructures.
Just as the grammar of a language can generate a vast set of sentences from the same rules, a single molecular grammar can represent a huge number of molecules. Molecules with similar structures use the same grammar production rules, and the system learns to recognize these similarities.
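To make this concrete, here is a minimal, hypothetical sketch of a molecular grammar as production rules. The real system learns rewriting rules over molecular graphs rather than strings; the SMILES-like fragments and rule names below are illustrative assumptions, not the paper’s grammar.

```python
import random

# Toy "molecular grammar": each rule rewrites a nonterminal into a
# SMILES-like fragment. A handful of shared rules can derive many
# different molecule strings. (Illustrative only; the actual method
# learns graph production rules.)
RULES = {
    "MOL":   ["CHAIN", "CHAIN RING"],
    "CHAIN": ["C", "C CHAIN", "C O CHAIN"],
    "RING":  ["c1ccccc1"],  # a benzene ring as a terminal fragment
}

def expand(symbol):
    """Recursively apply production rules until only terminals remain."""
    if symbol not in RULES:                 # terminal fragment
        return symbol
    production = random.choice(RULES[symbol])
    return "".join(expand(tok) for tok in production.split())

for _ in range(3):
    print(expand("MOL"))  # e.g., "CCOCc1ccccc1", all from the same rules
```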
Since structurally similar molecules often have similar properties, the system uses its knowledge of molecular similarities to more effectively predict the properties of new molecules.
“Once we have this grammar as a representation for all the different molecules, we can use it to power the property prediction process,” says Guo.
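One rough way to picture that: if each molecule is summarized by how often each production rule appears in its derivation, then molecules built from similar rules end up with similar feature vectors, and even a simple regressor can exploit the shared structure. The derivations, rule names, and property values below are made up for illustration; the paper’s predictor is far more sophisticated than this ridge regression.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import Ridge

# Illustrative rule vocabulary and derivations (sequences of rule IDs).
ALL_RULES = ["MOL->CHAIN", "MOL->CHAIN RING", "CHAIN->C",
             "CHAIN->C CHAIN", "CHAIN->C O CHAIN", "RING->c1ccccc1"]

def rule_features(derivation):
    """Count how often each production rule is used in a derivation."""
    counts = Counter(derivation)
    return np.array([counts[r] for r in ALL_RULES], dtype=float)

derivations = [
    ["MOL->CHAIN", "CHAIN->C CHAIN", "CHAIN->C"],
    ["MOL->CHAIN RING", "CHAIN->C", "RING->c1ccccc1"],
    ["MOL->CHAIN", "CHAIN->C O CHAIN", "CHAIN->C"],
]
y = np.array([0.8, 1.9, 1.1])  # made-up property values

# Structurally similar derivations share rule counts, so a small model
# can generalize from very few labeled examples.
X = np.stack([rule_features(d) for d in derivations])
model = Ridge(alpha=1.0).fit(X, y)
print(model.predict(X))
```

The point is only that a shared rule vocabulary turns structural similarity into feature similarity.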
The system learns the production rules of the molecular grammar using reinforcement learning, a trial-and-error process in which the model is rewarded for behavior that brings it closer to a goal.
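A rough sketch of that loop, under stated assumptions: each candidate rule carries a selection probability, a placeholder reward() stands in for whatever score the real system assigns to the molecules a chosen rule set generates (for example, their validity and diversity), and a crude policy-gradient update reinforces rules that score well.

```python
import numpy as np

rng = np.random.default_rng(0)

N_RULES = 6                     # candidate production rules
logits = np.zeros(N_RULES)      # learnable preference for each rule
LR = 0.1

def reward(chosen):
    """Placeholder reward: pretend rules 1 and 4 yield valid, diverse
    molecules. The real score would come from evaluating the molecules
    the chosen rules generate."""
    return float(1 in chosen) + float(4 in chosen)

for _ in range(200):
    probs = np.exp(logits) / np.exp(logits).sum()
    chosen = set(rng.choice(N_RULES, size=3, replace=False, p=probs))
    r = reward(chosen)
    for i in range(N_RULES):    # crude policy-gradient update
        logits[i] += LR * r * (float(i in chosen) - probs[i])

print(np.round(np.exp(logits) / np.exp(logits).sum(), 2))
# Rules 1 and 4 should end up with the highest selection probabilities.
```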
But because there can be billions of ways to combine atoms and substructures, learning the grammar production rules would be too computationally expensive for anything but the smallest datasets.
The researchers split the molecular grammar into two parts. The first part, called a metagrammar, is a general, widely applicable grammar they design by hand and give to the system at the outset. The system then only needs to learn a much smaller, molecule-specific grammar from the domain dataset. This hierarchical approach speeds up the learning process.
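A minimal sketch of that split, with illustrative names (learn_domain_rules is a stand-in, not the paper’s API): the hand-designed metagrammar is fixed and shared, and only a small set of molecule-specific rules is learned from the domain data and merged in.

```python
# Hand-designed metagrammar: general rules, fixed before training.
METAGRAMMAR = {
    "MOL":   ["CHAIN", "CHAIN RING"],
    "CHAIN": ["C", "C CHAIN"],
}

def learn_domain_rules(dataset):
    """Stand-in for the learned part: in the real system, a much
    smaller molecule-specific grammar is induced from the domain
    data (e.g., recurring substructures in the given molecules)."""
    return {"RING": ["c1ccccc1"], "CHAIN": ["C O CHAIN"]}

def combine(meta, learned):
    """Merge learned domain rules into the fixed metagrammar."""
    grammar = {lhs: list(rhs) for lhs, rhs in meta.items()}
    for lhs, rhs in learned.items():
        grammar.setdefault(lhs, []).extend(rhs)
    return grammar

grammar = combine(METAGRAMMAR, learn_domain_rules(dataset=[]))
print(grammar)   # the metagrammar plus the small learned rule set
```

Because only the small learned part varies from domain to domain, far fewer labeled examples are needed.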
Big results, small dataset
In experiments, the researchers’ new system simultaneously generated viable molecules and polymers and predicted their properties more accurately than several popular machine learning approaches, even when the domain-specific datasets contained only a few hundred samples. Some competing methods also required a costly pretraining step that the new system avoids.
The technique was especially effective at predicting the physical properties of polymers, such as the glass transition temperature: the temperature at which a material changes from a hard, glassy state to a softer, rubbery one. Obtaining this information experimentally is often very expensive because the measurements require extremely high temperatures and pressures.
To further test their approach, the researchers cut one training set by more than half, to just 94 samples. Their model still achieved results on par with methods trained on the entire dataset.
“This grammar-based representation is very powerful. And because the grammar itself is a very general representation, it can be applied to many kinds of graph-structured data. We are trying to identify other applications beyond chemistry or materials science,” says Guo.
In the future, they also want to extend the molecular grammar to include the 3D geometry of molecules and polymers, which is key to understanding the interactions between polymer chains. They also plan to develop an interface that shows users the learned grammar production rules and solicits feedback to correct rules that may be wrong, boosting the system’s accuracy.
This work is funded, in part, by the MIT-IBM Watson AI Lab and its member company Evonik.