XSEDE is supported by the National Science Foundation
Scheduled for Wednesday, July 16, 2014, 3-4:30PM. Each of the 10 talks will be 8 minutes long.
iCER Interns: Engaging Undergraduates in High Performance Computing
The Institute for Cyber-Enabled Research (iCER) at Michigan State University (MSU) has an internship program to provide undergraduate students with hands-on experience in advanced computational research. The goals of the iCER Intern program are: (1) to give students in-depth exposure to Advanced Computing; (2) to leverage students' knowledge and skills to support iCER administrators, staff and researchers; and (3) to minimize the investment of mentors' time and resources, while maximizing student productivity. This paper details the evolution and structure of the iCER Intern program, which provides a rich educational experience for undergraduates while efficiently managing the efforts of iCER mentors. The program structure helps iCER staff quickly assess the interests and skills of new student interns and encourages productivity among the undergraduates. The result is a net positive return on the investment of mentors' time and effort, and a valuable professional learning experience for students. The methodology described here is based on lessons learned in mentoring more than 50 undergraduate students, and other institutions or high performance computing centers could readily adopt this program structure and approach.
Benefits of Cross Memory Attach for MPI libraries on HPC Clusters
With the number of cores per node increasing in modern clusters, an efficient implementation of intra-node communication is critical for application performance. MPI libraries generally use shared-memory mechanisms for communication inside the node; unfortunately, this approach has some limitations for large messages. Linux kernel 3.2 introduced Cross Memory Attach (CMA), a mechanism that improves communication between MPI processes on the same node. However, because this feature is not enabled by default in the MPI libraries that support it, HPC administrators may leave it disabled, and users then lose its performance benefits.
In this paper, we explain how to use CMA and present an evaluation of CMA using micro-benchmarks and the NAS Parallel Benchmarks (NPB), a set of applications commonly used to evaluate parallel systems.
Our performance evaluation reveals that CMA outperforms shared memory for large messages. Micro-benchmark evaluations show that CMA can improve performance by as much as a factor of four. With NPB, we see up to 24.75% improvement in total execution time for FT and up to 24.08% for IS.
An Open Extensible Multi-Target Application Generation Tool for Simple Rapid Deployment of Multi-Scale Scientific Codes
Combining modules that wrap a diversity of executable codes, derived from various scientific research labs and spanning a range of computational and data scales, into a sustainable framework requires careful consideration. In the described framework, we have separated the modules' executable codes from the user interface and created an application generation tool that produces all the code necessary to create a web-based science gateway simultaneously with a local GUI-based application. The work was driven by requirements related to an international collaborative grant. Ongoing development is producing applications that will be in the hands of beta testers at the time of this publication.
MS-FLUKSS and Its Application to Modeling Flows of Partially Ionized Plasma in the Heliosphere
Nikolai Pogorelov, Sergey Borovikov, Jacob Heerikhuisen, Tae Kim, Igor Kryukov and Gary Zank
Flows of partially ionized plasma are frequently characterized by the presence of both thermal and nonthermal populations of ions. This occurs, e.g., in the outer heliosphere - the part of interstellar space beyond the solar system whose properties are determined by the solar wind (SW) interaction with the local interstellar medium (LISM). Understanding the behavior of such flows requires us to investigate a variety of physical phenomena occurring throughout the solar system. These include charge exchange processes between neutral and charged particles, the birth of pick-up ions (PUIs), the origin of energetic neutral atoms (ENAs), SW turbulence, etc. Collisions between atoms and ions in the heliospheric plasma are so rare that they should be modeled kinetically. PUIs born when LISM neutral atoms charge-exchange with SW ions represent a hot, non-equilibrium component and also require a kinetic treatment. The behavior of PUIs at the SW termination shock (TS) is of major importance for the interpretation of the puzzling data from the Voyager 1 and 2 spacecraft, which are now the only in situ space missions intended to investigate the boundary of the solar system. We have recently proposed an explanation of the sky-spanning "ribbon" of unexpectedly intense emissions of ENAs detected by the Interstellar Boundary Explorer (IBEX) mission. Numerical solution of these problems with the realistic boundary conditions provided by remote and in situ observations of the SW properties requires the application of adaptive mesh refinement (AMR) technologies and petascale supercomputers. Supported by the NSF ITR program and various NASA projects, we have implemented these in our Multi-Scale FLUid-Kinetic Simulation Suite (MS-FLUKSS), which is a collection of problem-oriented routines incorporated into the Chombo AMR framework. For the next 5-10 years, heliophysics research is faced with an extraordinary opportunity that cannot soon be repeated: to make in situ measurements of the SW from the Sun to the heliospheric boundaries and, at the same time, extract information about the global behavior of the evolving heliosphere through ENA observations by IBEX. In this paper, we describe the application of new possibilities provided within our Extreme Science and Engineering Discovery Environment (XSEDE) project to model challenging space physics and astrophysics problems. We used XSEDE supercomputers to analyze flows of magnetized, rarefied, partially ionized plasma, where neutral atoms experience resonant charge exchange and collisions with ions. We modeled the SW flows in the inner and outer heliosphere and compared our results with in situ measurements performed by the ACE, IBEX, and Voyager spacecraft.
Methods For Creating XSEDE Compatible Clusters
Jeremy Fischer, Richard Knepper, Matthew Standish, Craig A. Stewart, Resa Alvord, David Lifka, Barbara Hallock and Victor Hazelwood
XSEDE has created a suite of software that is collectively known as the XSEDE-compatible cluster build. It has been distributed as a Rocks roll for some time. It is now available as individual RPM packages, so that it can be downloaded and installed in portions as appropriate on existing, working clusters. In this paper, we explain the concept of the XSEDE-compatible cluster and describe how to install individual components as RPMs using Puppet and the XSEDE-compatible cluster YUM repository.
Launcher: A Shell-based Framework for Rapid Development of Parallel Parametric Studies
Lucas Wilson and John Fonner
Petascale computing systems have enabled tremendous advances for traditional simulation and modeling algorithms that are built around parallel execution. Unfortunately, scientific domains using data-oriented or high-throughput paradigms have difficulty taking full advantage of these resources without custom software development. This paper describes our solution for rapidly developing parallel parametric studies using sequential or threaded tasks: the Launcher. We detail how to get ensembles executing quickly through the common job schedulers SGE and SLURM, and describe the various user-customizable options that the Launcher provides. We illustrate the efficiency of our tool by presenting execution results at large scale (over 65,000 cores) for varying workloads, including a virtual screening workload with indeterminate runtimes using the drug docking software Autodock Vina.
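The general idea of a parametric study built from independent sequential tasks can be sketched as follows. This is a minimal illustration only, assuming a job list with one shell command per entry; it is not the Launcher's implementation or interface, and the names run_task, run_job_list, and parameter_sweep are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_task(command):
    """Run one shell command and return (command, exit code, stdout)."""
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return command, result.returncode, result.stdout.strip()


def run_job_list(commands, workers=4):
    """Run independent commands concurrently; results keep the input order.

    Idle workers pull the next pending command, so tasks with uneven
    runtimes do not hold up the rest of the ensemble.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, commands))


def parameter_sweep(template, values):
    """Build one command per parameter value from a format template."""
    return [template.format(value=v) for v in values]


if __name__ == "__main__":
    # Hypothetical sweep: each task just echoes its parameter value.
    jobs = parameter_sweep("echo value={value}", range(8))
    for command, code, output in run_job_list(jobs):
        print(code, output)
```

A pool of workers pulling from a shared list is one simple way to cope with tasks of indeterminate runtime; a production tool would also need to span multiple nodes and cooperate with the batch scheduler.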
Computational Anatomy Gateway: Leveraging XSEDE Computational Resources for Shape Analysis
Saurabh Jain, Daniel Tward, David Lee, Anthony Kolasny, Timothy Brown, Tilak Ratnanather, Laurent Younes and Michael Miller
Computational Anatomy (CA) is a discipline focused on the quantitative analysis of the variability in biological shape. Large Deformation Diffeomorphic Metric Mapping (LDDMM) is the key algorithm, which assigns computable descriptors of anatomical shapes and a metric distance between shapes. This is achieved by describing populations of anatomical shapes as a group of diffeomorphic transformations applied to a template, and using a metric on the space of diffeomorphisms. LDDMM is being used extensively in the neuroimaging (www.mristudio.org) and cardiovascular imaging (www.cvrgrid.org) communities. There are two major components involved in shape analysis using this paradigm. The first is the estimation of the template, and the second is calculating the diffeomorphisms mapping the template to each subject in the population. Template estimation is a computationally expensive problem involving an iterative process in which each iteration calculates one diffeomorphism for each target. These can be calculated in parallel and independently of each other, and XSEDE is providing the resources, in particular those of the Stampede cluster, that make these computations possible for large populations. Mappings from the estimated template to each subject can also be run in parallel. In addition, the NVIDIA Tesla GPUs available on Stampede present the possibility of speeding up certain convolution-like calculations that lend themselves well to the general-purpose GPU computation model. We are also exploring the use of the available Xeon Phi coprocessors to increase the efficiency of our codes. This will have a huge impact on both the neuroimaging and cardiac imaging communities as we bring these shape analysis tools online for their use through our web service (www.mricloud.org), with the XSEDE Computational Anatomy Gateway providing the resources to handle the computational demands of large populations.
Fast, low-memory algorithm for construction of nanosecond level snapshots of financial markets
Robert Sinkovits, Tao Feng and Mao Ye
We present a fast, low-memory algorithm for constructing an order-by-order level snapshot of financial markets with nanosecond resolution. This new implementation is 20-50x faster than an earlier version of the code. In addition, since message data are retained only for as long as they are needed, the memory footprint is greatly reduced. We find that even the heaviest days of trading spanning the NASDAQ, NYSE and BATS exchanges can now easily be handled using compute nodes with very modest memory (~4 GB). A tradeoff of this new approach is that the ability to efficiently manage large numbers of small files becomes more critical. We demonstrate how we can accommodate these new I/O requirements using the solid-state storage devices (SSDs) on SDSC's Gordon system.
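The principle of retaining message data only while it is needed can be illustrated with a toy order book. This is a sketch under stated assumptions, not the authors' algorithm: the message layout (timestamp in nanoseconds, a kind code "A" for add, "E" for execute, "X" for cancel, then the fields) is hypothetical, prices are integers, and the class and function names are invented for illustration.

```python
from collections import defaultdict


class OrderBook:
    """Toy order-by-order book; the message layout used here is hypothetical."""

    def __init__(self):
        self.orders = {}  # order_id -> [side, price, remaining_shares]
        self.depth = {"B": defaultdict(int), "S": defaultdict(int)}

    def add(self, order_id, side, price, shares):
        self.orders[order_id] = [side, price, shares]
        self.depth[side][price] += shares

    def reduce(self, order_id, shares):
        """Apply an execution or cancel; drop the order once it is empty."""
        side, price, remaining = self.orders[order_id]
        shares = min(shares, remaining)
        self.depth[side][price] -= shares
        if self.depth[side][price] == 0:
            del self.depth[side][price]
        if remaining == shares:
            del self.orders[order_id]  # state is freed as soon as it is unused
        else:
            self.orders[order_id][2] = remaining - shares

    def best_bid_ask(self):
        bid = max(self.depth["B"]) if self.depth["B"] else None
        ask = min(self.depth["S"]) if self.depth["S"] else None
        return bid, ask


def snapshots(messages):
    """Yield (timestamp_ns, best_bid, best_ask) after every message.

    Messages are consumed one at a time, so the input can be a stream;
    only currently live orders are held in memory.
    """
    book = OrderBook()
    for ts_ns, kind, *fields in messages:
        if kind == "A":
            book.add(*fields)
        elif kind in ("E", "X"):
            book.reduce(*fields)
        bid, ask = book.best_bid_ask()
        yield ts_ns, bid, ask
```

Because the generator never stores past messages, memory use in this sketch scales with the number of live orders rather than with the length of the trading day.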
Slices: Provisioning Heterogeneous HPC Systems
Alexander Merritt, Naila Farooqui, Vishakha Gupta, Magdalena Slawinska, Ada Gavrilovska and Karsten Schwan
High-end computing systems are becoming increasingly heterogeneous, with nodes comprised of multiple CPUs and accelerators, like GPGPUs, and with potential additional heterogeneity in memory configurations and network connectivities. Further, as we move to exascale systems, the view of their future use is one in which simulations co-run with online analytics or visualization methods, or where a high fidelity simulation may co-run with lower order methods and/or with programs performing uncertainty quantification. To explore and understand the challenges when multiple applications are mapped to heterogeneous machine resources, our research has developed methods that make it easy to construct 'virtual hardware platforms' comprised of sets of CPUs and GPGPUs custom-configured for applications when and as required. Specifically, the 'slicing' runtime presented in this paper manages for each application a set of resources, and at any one time, multiple such slices operate on shared underlying hardware. This paper describes the slicing abstraction and its ability to configure cluster hardware resources. It experiments with application scale-out, focusing on their computationally intensive GPGPU-based computations, and it evaluates cluster-level resource sharing across multiple slices on the Keeneland machine, an XSEDE resource.
Calculation of sensitivity coefficients for individual airport emissions
Scott Boone, Mark Reed and Saravanan Arunachalam
Problem: Estimate airport-level results for sensitivity analysis of PM2.5 emissions in the U.S.; however, the drastic increase in computational time severely limits our domain.
Solution: Use of the Stampede computing resource allows us to expand the scope of the project for detailed characterization, as well as drastically decrease runtime.
Fine particulate matter of diameter less than 2.5 micrometers (PM2.5) is a federally regulated air pollutant with well-known impacts on human health. Due to its ability to penetrate deep into the respiratory system, PM2.5 is strongly associated with an increase in lung cancer and cardiopulmonary mortality. The commercial aviation sector is responsible for approximately one percent of total anthropogenic PM2.5, and the Federal Aviation Administration (FAA) projects aviation activity to increase by approximately 2.5% annually.
The FAA's Destination 2025 program seeks to decrease aviation-related health impacts across the United States by 50% by the year 2018. However, any future increase in emissions will differ from airport to airport. Therefore, it is important to understand the sensitivity of atmospheric PM2.5 concentrations to variation in aviation activity and associated emissions on an airport-by-airport basis, in order to find out where health impacts from PM2.5 can be mitigated.
Eulerian atmospheric models, such as the Community Multiscale Air Quality model (CMAQ), are used to estimate the atmospheric concentration of pollutants—such as PM2.5—as a series of three-dimensional, well-mixed cells. At each time step, the model simulates photochemical, meteorological, emissions and deposition processes that occur across the domain. Modeling the continental United States at a horizontal resolution of 36x36 kilometers with about 30 to 40 vertical layers results in a domain of nearly one million grid cells. The core gas-phase chemical mechanism employed in CMAQ models over 50 species and around 150 reactions; additional modules for aerosol and cloud chemistry processes add to this number significantly. In each cell, every environmental process needs to be calculated and recorded hourly, at substantial computational cost.
Sensitivity analysis of these models has long been limited to two forms: subtractive, or "brute force" methods, which calculate the finite difference between two runs; and response surface models (RSM), which build a regression model of a limited number of training runs in which output concentrations are sampled from a series of runs with perturbed inputs. Both of these methods require multiple model runs—on the order of one run per scenario—and thus are better suited to domain-wide variations in inputs (for example, a sector-wide increase in aviation activity). However, they are unable to offer detailed or ad-hoc analysis for changes within a domain, such as changes in emissions on an airport-by-airport basis.
In order to calculate the sensitivity of PM2.5 concentrations to emissions from individual airports, we utilize the Decoupled Direct Method in three dimensions (DDM-3D), an advanced sensitivity analysis tool recently implemented in CMAQ (figure 1). DDM-3D allows calculation of sensitivity coefficients at each time step during the modeling process, eliminating the need for multiple model runs. However, while the output provides results for a variety of input perturbations in a single run, the processing time for each run is dramatically increased compared to model runs without the DDM-3D module.
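The difference between the brute-force approach and a direct sensitivity calculation can be sketched on a toy problem. The example below is an illustration only, assuming a single well-mixed box with a hypothetical nonlinear loss term, dC/dt = E - k*C^2; it is not CMAQ or its DDM-3D module, and all function names and parameter values are invented. The direct approach integrates an extra equation for S = dC/dE alongside the concentration, so one run yields the sensitivity, and each extra parameter would add one more such equation.

```python
def run_model(emission, k=0.5, c0=0.0, dt=0.01, steps=1000):
    """Integrate dC/dt = E - k*C**2 with forward Euler; return final C."""
    c = c0
    for _ in range(steps):
        c += dt * (emission - k * c * c)
    return c


def brute_force_sensitivity(emission, delta=1e-4, **kwargs):
    """Finite difference of two full model runs with perturbed emissions."""
    high = run_model(emission + delta, **kwargs)
    low = run_model(emission - delta, **kwargs)
    return (high - low) / (2.0 * delta)


def run_model_with_sensitivity(emission, k=0.5, c0=0.0, dt=0.01, steps=1000):
    """Single run that carries S = dC/dE alongside C.

    S obeys dS/dt = (df/dC)*S + df/dE = -2*k*C*S + 1, evaluated with the
    concentration from the same time step.
    """
    c, s = c0, 0.0
    for _ in range(steps):
        s_next = s + dt * (1.0 - 2.0 * k * c * s)
        c += dt * (emission - k * c * c)
        s = s_next
    return c, s


if __name__ == "__main__":
    conc, direct = run_model_with_sensitivity(1.0)
    print("concentration:", conc)
    print("direct sensitivity:", direct)
    print("brute-force sensitivity:", brute_force_sensitivity(1.0))
```

In this toy setting the two estimates agree closely, but the brute-force estimate needs two extra model runs per perturbed input, whereas the direct estimate comes from one longer run.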
While the DDM-3D algorithm is well suited for use on parallel-processor computing clusters, each additional sensitivity parameter (in our case, an emitted PM2.5 precursor species from an individual airport) causes a linear increase in processing time. Based on our initial benchmarking tests using local equipment, it would take nearly two million CPU-hours to conduct an experiment with a temporal domain of one year. The modeling process also generates an enormous amount of output data, requiring a great deal of storage and analysis capacity. We aim to compute sensitivity coefficients for each of 139 major airports in the U.S., for each of the six precursor emissions that form PM2.5 in the atmosphere (figure 2).
Use of the XSEDE Stampede computing cluster allows us to calculate sensitivity coefficients for a greater number of individual airports over a longer period of modeled time than would be possible using only local resources at UNC. By using the high-performance Stampede computing cluster, we are able to dramatically increase the domain in our experiment and therefore allow for a much wider variety of aviation policy scenarios to be generated "on the fly" than would be possible using regression-based or subtractive sensitivity analysis methods. Our results will create a useful dataset for both policy makers and health impact researchers.