Science Success Story
COVID-19 Analysis Performed With Galaxy Bioinformatics Platform
XSEDE provides large-scale compute infrastructure for the analysis of thousands of genomes
By Faith Singer-Villalobos, Texas Advanced Computing Center
Scientists don't know how SARS-CoV-2, the virus that causes COVID-19, will evolve, but they say it's not going away anytime soon.
Human coronaviruses were first identified in the mid-1960s and are named for the crown-like spikes on their surface. The current pathogen is new to the human population.
About 100 organizations worldwide, mainly academic labs and genome sequencing facilities, have already contributed genomic data to the study of the pandemic. Genomic data is critical because it helps identify how the virus is evolving, which can provide important clues to how to stop it. A number of these teams have experience rapidly ramping up genome sequencing in response to past outbreaks of HIV, Ebola, Zika, influenza, and Hepatitis C.
SARS-CoV-2 is different, however.
"The community wasn't expecting this much data this quickly," said Sergei Pond, a professor in Biology at Temple University in Philadelphia. "We're seeing the epidemic develop in real-time. This is a unique feature of the current outbreak. It has never happened before."
|Sergei Pond, Professor in Biology, Temple University|
Pond and his colleague, Anton Nekrutenko of Penn State, are collaborating on the Galaxy project, one of the world's largest, most successful, web-based bioinformatics platforms. More than 30,000 biomedical researchers run approximately 500,000 computing jobs a month via the platform.
The researchers run most of their analyses on the XSEDE-allocated Stampede2 (TACC) and Jetstream (IU/TACC) supercomputers, using parallel processing and big data analytics. In addition, Galaxy employs the Bridges (PSC) platform, also an XSEDE-allocated resource, for assembly jobs that require large amounts of shared memory. XSEDE, which is funded by the National Science Foundation, awards supercomputer resources and expertise to researchers.
"For us, the open public resources on XSEDE are a way to show the value of a public cloud which is built specifically for research," Nekrutenko said. "We're enabling anyone in the world to do analysis using proven tools and robust workflows. We think XSEDE-allocated resources through TACC and PSC are ideal platforms for doing this." Nekrutenko and his close collaborator, James Taylor, started Galaxy in 2005 at Penn State.
Taylor, the Ralph S. O'Connor Professor of Biology and Computer Science at Johns Hopkins University, passed away unexpectedly on April 2, 2020, at the age of 40. The Galaxy Project has published an official eulogy.
|Anton Nekrutenko, Professor of Biochemistry and Molecular Biology, Penn State|
Galaxy uses open source tools and public cyberinfrastructure for transparent, reproducible analyses of viral datasets. "We run hundreds of thousands of analyses per month, and we're spiking now in terms of usage and viral analyses," Nekrutenko said.
As a renowned expert in infectious disease evolution, Pond develops software tools and methods for researchers in the field. Currently, he and Nekrutenko are working feverishly on several federally funded research projects to integrate the tools developed in Pond's lab into Galaxy and bring them to the broader community.
"We're well positioned to address the current issue with SARS-CoV-2 because we've been working in this domain for several years now," Pond said.
Pond's methods allow researchers to trace where viruses come from and how they evolve. He developed a widely used set of tools called HyPhy, specifically for selection analysis in infectious diseases.
With Galaxy and HyPhy working together, researchers can perform robust, reproducible analysis of SARS-CoV-2 genomic sequences.
|James Taylor (pictured) along with Anton Nekrutenko started Galaxy in 2005 at Penn State. Taylor, the Ralph S. O'Connor Professor of Biology and Computer Science at Johns Hopkins University, passed away unexpectedly on April 2, 2020, at the age of 40.|
Nekrutenko also leads an NIH grant that brings tools for HIV analyses into Galaxy. "Conceptually, these are the same tools that you would use for studying SARS-CoV-2. We can essentially solve all genomic data analysis needs for the worldwide research community when it comes to SARS-CoV-2," Nekrutenko said.
In February 2020, there were 100-150 genomes of the virus available. In March, the number started growing exponentially, and it's getting faster because diagnostic and academic labs around the world are sequencing these genomes and depositing them into large central databases. "For all we know next week there could be 50,000 genomes," Pond said.
The goal is to decipher these data to understand in real-time whether there's anything unique happening with the virus before it impacts the course of the pandemic.
In the past, as with SARS, MERS, Ebola, and Zika, many interesting analyses were performed after the outbreak ended. This was mostly because those outbreaks were contained before they became pandemics, unlike the current one. Also, until about five years ago, researchers didn't have the sequencing technology they needed.
"Now you have instruments that you can set up and run very quickly, and public infrastructure to do data analysis," Pond said. "This is all developing live. We're turning around the analysis as quickly as the data come in."
On the positive side, the SARS-CoV-2 virus is mutating more slowly than influenza because it has an enzyme that does proofreading during RNA synthesis and RNA replication. "What this means is that we should be able to design a successful vaccine that's fairly uniform," Pond said. "You could take a sequence from Japan and a sequence from Africa and they'll be very similar to each other, which means we can develop with a high degree of confidence a fairly reliable vaccine."
People are anticipating that SARS-CoV-2 may become a seasonal infection, which means scientists have to look for evolutionary changes and possibly design a new vaccine every season. Eventually, the host population will develop immunity, but that takes time and passage of the virus through the population, on the order of months to years.
"What we're doing is the first step, which is generating the variation of the genome and finding the most important among all of these thousands of positions that we can look at. We're helping to focus the effort on where some of the interesting evolutionary dynamics might be taking place," Pond said.
Researchers know that the virus genome contains 30,000 base pairs, three times larger than that of influenza or HIV. "It's as large as a virus can get before it runs into fundamental constraints in molecular replication," Pond said. "It mutates slower compared to influenza or HIV, but mutates much faster than the genomes in mammals or bacteria simply because it goes through rapid replication cycles."
The idea for Nekrutenko, Pond, and the other collaborators who work on Galaxy is to enable researchers to perform these analyses regardless of locale. "For example, if someone in Africa or China or Brazil generates data sets, they can use Galaxy to perform the analysis in an established, standardized way for free. We're establishing a level playing field so that analyses performed by different labs are comparable," Nekrutenko said.
"Here we have a situation where we don't know what's going to happen," Pond said. "So, we're looking forward, trying to do predictive analysis with these data. It's very exciting for a scientist because it's a unique opportunity that's never occurred before. I feel like everyone has to contribute to the best of their ability. And providing data analytics is definitely something that has to be done."
At A Glance
- The Galaxy project is one of the world's largest, most successful, web-based bioinformatics platforms. More than 30,000 biomedical researchers run approximately 500,000 computing jobs a month on the platform.
- With regard to COVID-19, researchers Anton Nekrutenko (Penn State) and Sergei Pond (Temple University) are deciphering a deluge of data to understand in real time what's unique about the virus before it impacts the course of the pandemic.
- The researchers run most of their analyses on the XSEDE-allocated Stampede2 (TACC) and Jetstream (IU/TACC) supercomputers. In addition, Galaxy employs the XSEDE-allocated Bridges platform from PSC for genome assembly jobs that require large amounts of shared memory.
- "We can essentially solve all genomic data analysis needs for the worldwide research community when it comes to SARS-CoV-2," said Anton Nekrutenko of Penn State.