Science Success Story

A Galactic Choice

AI Running on XSEDE Systems Surpasses Humans at Classifying Galaxies

By Ken Chiacchia, Pittsburgh Supercomputing Center

Images from the Dark Energy Survey that the AI identified as spiral (top) or elliptical (bottom).

New telescope surveys are discovering hundreds of millions of new galaxies, far more than humans can classify. A National Center for Supercomputing Applications (NCSA)-led team has used deep learning, a form of artificial intelligence (AI), on XSEDE-allocated systems to produce a galaxy-classifying AI with better-than-human accuracy and capacity.

Why It's Important

Astronomers estimate there are at least 100 billion galaxies in the observable universe.

Scientists would like to get a better handle on these huge collections of stars for a number of reasons. For one, most of the mass of the universe seems to be invisible. One way we "see" the presence of this dark matter is through its effects on galaxies. Also, the motions of galaxies tell us that the expansion of the universe is accelerating. The reason for this may be that most of the energy of the universe is in an unknown form called dark energy. Astrophysical surveys, such as the recent Dark Energy Survey (DES) and the upcoming Legacy Survey of Space and Time (LSST), are collecting data to study these fundamental questions.

"Cataloging all the galaxies in the universe is of fundamental interest in science for a number of reasons. For instance, combining gravitational wave observations with large scale galaxy catalogs has enabled the first gravitational wave standard-siren measurement of the Hubble constant which tells us how fast the universe is expanding…Astronomers have been trying to use AI to automate these tasks for quite some time, but traditional machine-learning algorithms, while promising, couldn't achieve human-level accuracy." — Asad Khan, NCSA

As a first step, scientists are studying the shapes of galaxies. The shape of a galaxy tends to be strongly intertwined with the history of its evolution. Shape also sheds light on a galaxy's star-formation rate, its past mergers and interactions with other galaxies, and other properties.

The logical starting point for astronomers in modern surveys is to classify and sort the vast number of galaxies observed. The main classification is whether a galaxy is a spiral, with curving arms like those of the Milky Way, or an elliptical, which looks like a uniform ball of stars.

A method of visualizing how the AI classified galaxies helped give astronomers confidence. Classification of the labeled Dark Energy Survey test set (left), the Sloan Digital Sky Survey test set (center) and the predictions made by the AI for unlabelled galaxies (right).

This simple task is enormous owing to the tremendous number of galaxies. Astronomers initially turned to crowdsourcing to solve it. One highly successful effort, Galaxy Zoo, used thousands of volunteers, who classified 900,000 galaxies in the project's first phase. Volunteers will continue to have a role, but newer surveys of more distant galaxies will dwarf that effort. The earlier Sloan Digital Sky Survey (SDSS) identified 50 million galaxies, and the DES has identified more than 300 million. Even with thousands of volunteers, astronomers could never classify that many.

Graduate student Asad Khan, his advisor Eliu Huerta, and colleagues at NCSA at the University of Illinois Urbana-Champaign, as well as at Argonne National Laboratory, decided to solve this problem using deep learning on the XSEDE-allocated systems Bridges at the Pittsburgh Supercomputing Center and Comet at the San Diego Supercomputer Center.

How XSEDE Helped

Previous attempts to apply AI to galaxy classification couldn't achieve human-level accuracy. To improve on that, the NCSA-led team turned to a type of machine learning called deep learning (DL). In DL, the computer learns a representation of the data using a multi-level artificial neural network. The team employed Comet in the early phases of the work, then moved to Bridges to take advantage of the most advanced processors available for deep learning at the time, NVIDIA Tesla P100 GPUs. Today, both Bridges and Comet contain P100 nodes.
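
To give a rough sense of what a "multi-level" network means, the sketch below passes two input numbers through a hidden layer and an output layer to produce a spiral-versus-elliptical call. It is a toy illustration, not the team's model: the layer sizes, the hand-picked weights, and the function names are all assumptions made here, and a real deep learning system would learn millions of weights from labelled images.

```python
import math

def dense(inputs, weights, biases):
    """One fully connected layer; weights[j] holds the input weights of unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Nonlinearity between layers: negative activations become zero."""
    return [max(0.0, v) for v in values]

def sigmoid(z):
    """Squash the final score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, hand-picked weights (illustration only; real weights are learned).
HIDDEN_W = [[1.0, -1.0], [-1.0, 1.0]]
HIDDEN_B = [0.0, 0.0]
OUT_W = [2.0, -2.0]
OUT_B = 0.0

def spiral_probability(features):
    """Level 1: hidden layer with ReLU. Level 2: single output unit."""
    hidden = relu(dense(features, HIDDEN_W, HIDDEN_B))
    logit = dense(hidden, [OUT_W], [OUT_B])[0]
    return sigmoid(logit)

def classify(features):
    """Turn the probability into one of the two labels used in the article."""
    return "spiral" if spiral_probability(features) >= 0.5 else "elliptical"
```

Each layer transforms the output of the one before it, which is the sense in which the network builds up its own representation of the data.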

"XSEDE was pretty helpful for quickly testing out initial ideas for our project and hence played an important role in shaping our research that eventually resulted in a peer-reviewed publication that has been cited several times, and which has been extensively followed up by specialized magazines in Europe and the U.S. It is useful to have a shared resource for computation at the national level that can quickly respond to the demands of scientists from different and varied disciplines. We were able to submit several jobs to do a hyperparameter search for the best architecture for our problem. The ability to submit several jobs in parallel—and access to several GPUs—was pretty useful to cut back on [computational] time by at least four-fold. We saved about $1,000 that we would have required to do the same computing and data storage on the cloud." — Asad Khan, NCSA

For the data set, the scientists used a subset of the SDSS classified by the volunteers of Galaxy Zoo and verified as being above 90 percent accurate. They divided the data into three subsets: a roughly 36,000-galaxy training data set; a 1,000-galaxy validation data set; and a 12,500-galaxy testing data set. They chose the latter two data sets so that the galaxies in them lie in parts of the sky that both the SDSS and the DES had surveyed, taking advantage of the lessons learned by the earlier study. To generate and process all of the data sets that they used for training and testing, they used the Blue Waters supercomputer at NCSA, an XSEDE SP-2 resource.
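
The three-way split described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions rather than the team's pipeline: the record fields `id` and `in_des_footprint`, the function name `split_catalog`, and the choice to train on every galaxy not held out are all hypothetical.

```python
import random

def split_catalog(galaxies, n_val, n_test, seed=0):
    """Split labelled galaxies into training, validation and testing sets.

    Validation and testing galaxies are drawn only from the region both
    surveys covered (the hypothetical flag 'in_des_footprint'); every
    galaxy not held out is used for training (an assumption).
    """
    overlap = [g for g in galaxies if g["in_des_footprint"]]
    if len(overlap) < n_val + n_test:
        raise ValueError("not enough galaxies in the overlap region")

    rng = random.Random(seed)   # fixed seed keeps the split reproducible
    rng.shuffle(overlap)

    validation = overlap[:n_val]
    testing = overlap[n_val:n_val + n_test]

    held_out = {g["id"] for g in validation + testing}
    training = [g for g in galaxies if g["id"] not in held_out]
    return training, validation, testing
```

Keeping the validation and testing galaxies inside the overlap region means the same objects can be checked in images from both surveys.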

In the testing phase, the AI matched the Galaxy Zoo classifications 85 percent of the time. But when they adjusted for the known error rate in Galaxy Zoo, they found their AI was over 99 percent accurate—better than the humans. As a last step, the scientists applied their AI to predict galaxy types in a set of about 10,000 not-yet-labelled galaxies. In addition, they had built their AI so that its processes for classifying the galaxies could be examined by humans. This step, which explained how the AI works, was important for convincing astronomers that the AI's methods can be trusted.

The team reported their results in the journal Physics Letters B in August 2019. They presented their visualization the following November at the annual SC19 supercomputing conference.

"In order to accelerate the adoption of AI tools for big-data analytics, it is essential to understand how these algorithms process data and extract information to make trustworthy predictions. In this article, we first designed AI algorithms that significantly outperform humans at classification and data labelling tasks, and then produced scientific visualizations that shed new and detailed information about how neural networks perform these tasks." — Eliu Huerta, NCSA

Future work will be to apply the method to larger groups of unidentified galaxies, automating galaxy identification to keep pace with the hundreds of millions expected to be discovered in the near future. The team has also begun using XSEDE-allocated Bridges-AI, whose NVIDIA Tesla V100 GPUs are currently the most advanced GPUs for deep learning. The platform's NVIDIA DGX-2 enterprise AI research system enables high-performance deep learning across 16 V100s.

This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (NSF) (awards OCI-0725070 and ACI-1238993) and the State of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. NVIDIA donated several Tesla P100 and V100 GPUs used for the analysis. This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation (NSF) grant number ACI-1548562. Specifically, it used the Bridges system, which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center (PSC) and Comet, which is supported by NSF award number ACI-1341698, at the San Diego Supercomputer Center (SDSC). Additional support was through grant TG-PHY160053. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.



At a Glance

  • New telescope surveys are discovering hundreds of millions of new galaxies — far more than humans can classify.
  • A National Center for Supercomputing Applications (NCSA)-led team has employed "deep learning" artificial intelligence (AI) on XSEDE-allocated systems to produce a galaxy-classifying AI.
  • The system demonstrates better-than-human accuracy and capacity.