Science Success Story

XSEDE Systems Power DNA-Based Identification of Surface, Airborne Microbes Worldwide

Samples from 60 cities reveal previously unknown microbes, diversity of antibiotic resistance genes

By Ken Chiacchia, Pittsburgh Supercomputing Center

The fingerprint of the microbial species on a given city's public transit surfaces (cities coded by color) changed over time (samples from 2016 shown as circles, 2017 as triangles). But the cities remained distinct and different from each other. From Danko D, Bezdan D et al. "A Global Metagenomic Map of Urban Microbiomes and Antimicrobial Resistance," Cell, May 26, 2021.

Dangerous microbes can emerge with little warning. Using XSEDE's advanced research computers, a team of scientists led by Weill Cornell Medicine has undertaken a vast analysis of microbial DNA in thousands of urban air and surface samples worldwide. The results revealed city-specific "fingerprints" of bacteria and viruses. They also gave us a first look at the population of dangerous antibiotic-resistance-conveying genes across the globe, as well as thousands of previously undiscovered species in the urban microbial world.

Why It's Important

The rise of COVID-19 has given the world a harsh lesson on the importance of being aware of what the microbial world is doing, and which bacteria and viruses are where. Public health experts would like to monitor the microbial world much as an airport control tower monitors the airspace around it, knowing which aircraft are near and where they're heading.

In a nutshell, we wanted to build a genetic, functional and geospatial map of the DNA of the world's cities, just kind of like the Google Maps of DNA of the Earth.—Christopher Mason, Weill Cornell Medicine

That's why a huge international collaboration, led by Christopher Mason of Weill Cornell Medicine, decided to employ the immense power of advanced research computing to assemble the DNA sequences of bacteria and viruses in the air in six cities and on surfaces in public transit locales in 60 cities worldwide. The team used three of the most powerful XSEDE-allocated supercomputers in three different eras: the Pittsburgh Supercomputing Center's (PSC's) Blacklight from 2010 to 2015, Bridges from 2015 to February 2021, and Bridges-2 since. With these systems they assembled the DNA of thousands of microbes all at once, using the computers' brute force to sort the genes and species electronically into a many-species metagenomic map for use by scientists and public health experts.

How XSEDE Helped

Christopher Mason of Weill Cornell Medicine

Many of the most-used DNA sequencing methods can read at most a few hundred nucleotides (the A, C, T, G alphabet of the genetic code) at a time. Because of that, scientists need to match overlapping fragments of DNA to put the millions of nucleotides in an organism's genome in proper order. When the task is to sort and assemble DNA fragments from thousands of species of bacteria and viruses at once in a sample from the environment, this assembly task becomes enormous. Blacklight, Bridges and Bridges-2 all offered large-memory nodes that made this kind of task possible. As with RAM in a personal computer, larger memory lets the machine compare more fragments at once without wasting time going back to storage (such as a PC's hard drive) for more data.
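The idea of matching overlapping fragments can be illustrated with a toy sketch. The Python below is not the team's pipeline, and the function names, the minimum-overlap parameter, and the three short reads are all invented for illustration. It greedily merges whichever two fragments share the longest suffix-to-prefix overlap until no overlaps remain. Because it compares every fragment against every other, all fragments need to be in memory at once, which hints at why large-memory nodes matter once the fragments number in the millions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no overlap of at least min_len nucleotides exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, -1, -1
        # All-against-all comparison: every fragment must be held in memory.
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        # Join the two fragments, writing the shared nucleotides only once.
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Hypothetical reads cut from one made-up 14-nucleotide sequence.
    toy_reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
    print(greedy_assemble(toy_reads))
```

Production metagenomic assemblers typically rely on more memory-efficient graph-based methods than this pairwise comparison, but the underlying principle of joining fragments where their ends agree is the same.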

Graduate student David Danko at Weill Cornell Medicine and research associate Daniela Bezdan at Weill Cornell Medicine and the Abdulaziz Alsaud Institute for Computational Biomedicine worked with Mason and hundreds of scientists worldwide to collect 4,728 surface samples from mass transit locations in 60 cities in 2016 and 2017, and to analyze their DNA sequences using these systems. In parallel work, M. H. Y. Leung and X. Tong at the City University of Hong Kong and K. O. Bøifot at the Norwegian Defence Research Establishment performed a similar analysis of 259 airborne samples in Denver, Hong Kong, London, New York City, Oslo, and Stockholm.

To do all of our de novo assembly at scale, [XSEDE] gave us the fastest and most expansive computational framework through which we could assemble all the sequences to find what were the real novel species and the novel genetic elements in this data set … It probably literally wouldn't have been possible in this time frame without that infrastructure … Also with these assembly projects, we will hit, occasionally, issues with bugs in the code or challenges with file structures. And Phil [Blood, PSC senior director of research and XSEDE co-PI] was a key and a pivotal collaborator to make sure that we can actually do all the assemblies and get them up and running well … He's our go-to man for ‘What happened, what seems to be breaking?'—Christopher Mason, Weill Cornell Medicine

The assembly results gave scientists their first metagenomic map of urban areas worldwide, opening a new era of disease surveillance. The surface samples contained more than 15,000 species of virus, bacteria and archaea, primitive bacteria-like organisms from which more complex plants and animals evolved. The airborne samples showed evidence of more than 450 microbial species. More interesting, fewer than 10 percent of the microbes identified from their DNA were species known to science, revealing a vast unknown microbial environment.

As expected from earlier research, cities had distinct microbial populations, with varying amounts of 31 species from surfaces and 17 from the air, forming a kind of fingerprint that the scientists could use to identify the city of origin. These fingerprints varied over time, though the cities remained recognizably distinct. The team's results suggest that differences in climate, geography, population density and other factors may help drive these variations. One important facet of the work was to detect and monitor differences in 20 known genes that give bacteria resistance to antibiotics. These also differed widely between the cities.

Together, the results offer the first high-resolution view of the types of microbes that exist in the environment in a way that can be harnessed to public health efforts. The collaborators reported their results in two papers: a coveted cover story in the journal Cell on May 26, 2021, and an upcoming report in the journal Microbiome. The team is expanding its research, now collecting RNA data, which will open up a view of RNA viruses such as the coronavirus that causes COVID-19. They're also investigating artificial-intelligence-driven classification of the results, to automatically detect metagenomic shifts that pose a threat to human health.


For the Cell paper, the authors thank the Tri-I Program in Computational Biology and Medicine (CBM) funded by the NIH grant 1T32GM083937, GitHub, Philip Blood and the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges-2 system, which is supported by NSF award number ACI-1928147, at the Pittsburgh Supercomputing Center (PSC). Funding also came from the WCM SCU and Epigenomics and Genomics Core Facilities, the Vallee Foundation, the WorldQuant Foundation, Igor Tulchinsky, The Pershing Square Sohn Cancer Research Alliance, NASA (NNX14AH50G, NNX17AB26G), the NIH (R01AI151059, R25EB020393, R21AI129851, R35GM138152, U01DA053941), STARR Foundation (I13-0052), LLS (MCL7001-18, LLS 9238-16, LLS-MCL7001-18), the NSF (1840275), the Bill and Melinda Gates Foundation (OPP1151054), the Alfred P. Sloan Foundation (G-2015-13964), Swiss National Science Foundation grant #407540_167331, NIH Award Number UL1TR000457, the US Department of Energy Joint Genome Institute under contract number DE-AC02-05CH11231, the National Energy Research Scientific Computing Center, supported by the Office of Science of the US Department of Energy, Stockholm Health Authority grant SLL 20160933, the Institut Pasteur Korea, an NRF Korea grant (NRF-2014K1A4A7A01074645, 2017M3A9G6068246), the CONICYT Fondecyt Iniciación grants 11140666 and 11160905, the Millennium Science Initiative of the Ministry of Economy, Development and Tourism, Government of Chile, Keio University Funds for Individual Research, funds from the Yamagata prefectural government and the City of Tsuruoka, JSPS KAKENHI Grant Number 20K10436, the bilateral AT-UA collaboration fund (WTZ:UA 02/2019; Ministry of Education and Science of Ukraine, UA:M/84-2019, M/126-2020), Kyiv Academic University, Ministry of Education and Science of Ukraine, project No. 0118U100290 and 0120U101734, the Spanish Ministry of Economy and Competitiveness, ‘Centro de Excelencia Severo Ochoa 2013-2017’, the CERCA Programme / Generalitat de Catalunya, the "la Caixa" Foundation, the CRG-Novartis-Africa mobility programme 2016, TMB Director Eladio De Miguel Sainz, research funds from National Cheng Kung University and the Ministry of Science and Technology, Taiwan (MOST grant No. 106-2321-B-006-016), the Weill Cornell Clinical and Translational Science Center (CTSC), CUNY Hunter College, Macaulay Honors College at CUNY, City College of the City University of New York, Cornell University, Columbia University, the Icahn School of Medicine at Mt. Sinai, Rockefeller University, and New York University (NYU). We thank all the volunteers who made sampling NYC possible, Colciencias (project No. 639677758300), CNPq (EDN - 309973/2015-5), the Open Research Fund of Key Laboratory of Advanced Theory and Application in Statistics and Data Science – MOE, ECNU, the Research Grants Council of Hong Kong through Project 11215017, the National Key R&D Project of China (2018YFE0201603), and the Shanghai Municipal Science and Technology Major Project (2017SHZDZX01) (Leming Shi).

The Microbiome paper used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges-2 system, which is supported by NSF award number ACI-1928147, at the Pittsburgh Supercomputing Center (PSC). PKHL acknowledges the support provided by the Research Grants Council of Hong Kong through Project 11215017. CEM thanks funding from the Bert L and N Kuggie Vallee Foundation, Igor Tulchinsky and the WorldQuant Foundation, the National Institutes of Health (R25EB020393, R01NS076465, 1R21AI129851, 1R01MH117406, U01DA053941), the Bill and Melinda Gates Foundation (OPP1151054), and the Alfred P. Sloan Foundation (G-2015-13964). This research was partly funded by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Health Impacts of Environmental Hazards at King's College London in partnership with Public Health England (PHE).

At a Glance

  • Using XSEDE's advanced research computers, a team of scientists led by Weill Cornell Medicine has undertaken a vast analysis of microbial DNA in thousands of urban air and surface samples worldwide

  • The work garnered a prestigious cover in the journal Cell

  • The results revealed city-specific "fingerprints" of bacteria and viruses

  • They also gave us a first look at the population of dangerous antibiotic-resistance-conveying genes across the globe, as well as thousands of previously undiscovered species in the urban microbial world