Science Success Story
AI Uses Language Rules to Simulate Molecular Motions on XSEDE-Allocated Bridges
Recreates known chemical rules, opens door to improved vaccines, drugs, industrial processes
By Ken Chiacchia, Pittsburgh Supercomputing Center
The AI-based method successfully predicted which amino acids (in red, labeled with their symbol letters and their number in the protein's amino-acid chain) would be critical for the binding of benzene (cyan spheres) to the lysozyme protein. From Wang Y et al. Past–future information bottleneck for sampling molecular reaction coordinate simultaneously with thermodynamics and kinetics. Nature Communications (2019) 10:3573.
Better predictions of molecular motions could lead to improved vaccines, drugs, and any number of improved industrial chemical processes. A team from the University of Maryland used natural language processing artificial intelligence (AI) on the XSEDE-allocated Bridges platform at the Pittsburgh Supercomputing Center (PSC) to recreate known chemistry, showing that AI may be able to reduce molecular dynamics to rules of grammar and syntax. The work offers the potential for leaping ahead of current computational limits in the field.
Why It's Important
The movements of molecules—molecular dynamics—loom large for our health, safety, and economy. Our ability to fight COVID-19 relies on our antibodies' ability to wrap around virus proteins. Cleanup of polluted environments can be improved by engineering microbes to eat the bad stuff. Refinery processes can be modified to produce cleaner, more efficient fuels. Better predictions of chemistry could make us safer, cleaner, wealthier.
Modern supercomputers have revolutionized our ability to simulate molecular dynamics. These predictions direct laboratory experiments to create better vaccines, drugs, and chemical processes. But the limits of even powerful computers still restrict the field.
"Natural language prediction is not an easy problem. You just can't look at the words. These algorithms, they're quite fancy. They tend to take into account the context of the words: what is the bigger picture? So the question is, can we take a molecular dynamics trajectory and…map it into an abstract language?…Maybe we can use language processing tools…to learn a better language ‘spoken' by these molecules." — Pratyush Tiwary, University of Maryland
Pratyush Tiwary and his students at the University of Maryland wondered whether they could completely change scientists' approach to molecular dynamics. Would it be possible to harness the power of AI to detect a simpler set of rules? They turned to natural language processing, using so-called recurrent neural networks, tools loosely based on the workings of living brains. It's the technology that powers word suggestions on smartphones. Could natural language processing suggest next molecular movements in the same way? In other words, could the twists and turns of a protein map onto a sequence of virtual characters ("letters," "words" and "sentences") in a way that the computer can understand as a kind of grammar? Learning grammar from such sequences is a mostly solved problem in natural language processing.
They would need access to a powerful supercomputer to make the approach work. They turned to XSEDE resources.
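The mapping idea described above can be illustrated with a minimal sketch. It assumes a single bond-twist angle sampled over time and an alphabet of four letters, one per equal-width range of angles. The function name `discretize`, the toy angle values, and the four-letter alphabet are all hypothetical choices made for illustration; the article does not describe how the team actually encoded their trajectories.

```python
def discretize(values, lo, hi, alphabet="ABCD"):
    """Map each value in [lo, hi] to a letter by splitting the range into equal-width bins."""
    n = len(alphabet)
    width = (hi - lo) / n
    letters = []
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), n - 1)  # clamp values at or beyond the edges
        letters.append(alphabet[idx])
    return "".join(letters)

# Hypothetical toy data: one torsion angle (degrees) at successive time steps
angles = [-170.0, -95.0, -80.0, 10.0, 85.0, 100.0, 175.0, -10.0]
sentence = discretize(angles, -180.0, 180.0)
print(sentence)  # prints AABCCDDB
```

Each letter here stands for a 90-degree range of the angle, so the string is a coarse "sentence" describing how the molecule moved. A real system would track many coordinates at once and would need a far richer encoding.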
How XSEDE Helped
Natural language processing works by understanding what's been said already as well as what's likely to be said next. The AI does this by learning. You begin with a set of sentences in which each word is labeled to explain its meaning and role in the sentence. The computer program processes what's already been "said" via a series of layers, each representing a different concept in language structure, with many connections between the nodes in each layer. The output of these layers tries to predict the next word. When the prediction is wrong, the AI adjusts the strength of its connections and tries again, repeating until it's predicting correctly.
After this training step, the scientists present the AI with an unlabeled testing data set. The AI is scored on its ability to predict next words. Like a student in school, the AI goes back and forth between training and testing until its performance is good enough to try to solve a real problem with new data.
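The train-then-test cycle described above can be sketched in a few lines. This is a deliberately simplified stand-in, not the team's method: instead of a recurrent neural network it uses a table of counts that remembers only the single previous symbol, and the letter sequences are hypothetical toy data. The function names `train`, `predict_next`, and `score` are likewise illustrative.

```python
from collections import Counter, defaultdict

def train(sequence):
    """Count, for every symbol, which symbol followed it in the training text."""
    counts = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, current):
    """Return the most frequently seen successor of `current`, or None if unseen."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]

def score(counts, sequence):
    """Fraction of next-symbol predictions that match a held-out sequence."""
    pairs = list(zip(sequence, sequence[1:]))
    hits = sum(predict_next(counts, cur) == nxt for cur, nxt in pairs)
    return hits / len(pairs)

# Hypothetical toy "trajectories" already written as letters
train_text = "AAABAAABAAAB"
test_text = "AAABAAAB"

model = train(train_text)
accuracy = score(model, test_text)
print(f"{accuracy:.2f}")  # prints 0.71
```

The model misses every jump from A to B because it sees only one symbol of context. A recurrent network can take a longer history into account, which is the "bigger picture" that makes such tools attractive for molecular trajectories.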
Tiwary and his team (graduate students Yihang Wang, Sun-Ting Tsai, Zachary Smith and others) tested their AI-based approaches by moving their computations between the then state-of-the-art NVIDIA Tesla P100 GPU nodes of PSC's Bridges and the XSEDE-allocated platform's CPU nodes. The AI used the GPUs to learn and predict; the CPUs performed molecular dynamics to test the predictions.
"I get students with a very different level of HPC usage profiles…Some don't even know what's a shell script, and having these [workshop] resources which can…get them going on how to use HPC resources and how to use them efficiently is super-critical. I mean, XSEDE…[plays] a big role in this in getting students up to snuff." — Pratyush Tiwary, University of Maryland
Training was an important facet of the work in another sense. Tiwary himself used the XSEDE-allocated Stampede2 supercomputer at the Texas Advanced Computing Center to perform molecular dynamics simulations when he was a postdoctoral fellow. In turn, he's found XSEDE's HPC workshops to be crucial to getting his students up and running with AI programming.
The team trained and tested their AI in a number of trial systems. Among these, they simulated the twists of two critical chemical bonds in the simple molecule alanine dipeptide. They duplicated the binding of a benzene molecule to the protein lysozyme. And they recreated the workings of a riboswitch, a segment of RNA that changes shape when it binds a small molecule and so helps control how much protein a living cell produces.
"We used a lot of Bridges CPU and Bridges GPU [nodes]. We used the P100s heavily. And they are very fast for this. The classical molecular dynamics is run on the CPUs. Bridges has such nice installations of [the software], and scaling is quite efficient…We were doing molecular dynamics on CPUs for an extended amount of time, then coming back to the GPUs to train the AI model, then going back to the CPUs." — Pratyush Tiwary, University of Maryland
The team's AI made excellent predictions in all of these trial systems, providing a crucial proof of concept for the method. But it did more. Examining the text-prediction algorithm created by the AI, the scientists realized that the machine, with no prompting from them, had recreated path entropy, a concept introduced by pioneering scientists like Ludwig Boltzmann, Claude Shannon and Edwin Thompson Jaynes. A rule for how transfers of energy restrict physical processes, entropy is a cornerstone of modern physics and chemistry. This gave the researchers confidence that their AI is fundamentally on the right track. They published their results in two papers in the journal Nature Communications, in August 2019 and October 2020.
The Maryland scientists plan to expand their work to more complicated systems, using the more-advanced V100 GPUs in the XSEDE-allocated Bridges-AI system and Bridges' replacement, the upcoming, greatly expanded Bridges-2 platform. Another goal will be to understand how the language rules created by the AI relate to the exact mechanisms of the protein's movements, such as the twisting of chemical bonds. Their hope is to simplify the task of molecular dynamics predictions to leap ahead of the current limitations of the most powerful computing systems, in a way that chemists can understand and have confidence in. Ultimately, such a method could improve medical treatments and industrial processes.
At a Glance
- Better predictions of molecular motions could lead to improved vaccines, drugs, and any number of improved industrial chemical processes.
- A team from the University of Maryland used natural language processing artificial intelligence (AI) on the XSEDE-allocated Bridges platform at the Pittsburgh Supercomputing Center (PSC) to recreate known chemistry.
- The work showed that AI may be able to reduce molecular dynamics to rules of grammar and syntax, offering the potential for leaping ahead of current computational limits in the field.