Technology Track

Sessions will feature technology developments and capabilities that enable increased performance, capability, productivity, and/or reliability of TeraGrid users, applications, and resources. Presenters will describe the technology in detail, discuss achieved or potential impact, and articulate future plans. Everything that is presented is new, previously unpublished work.

Tuesday, Aug. 3, 10 AM - Noon

Abhinav Bhatele and Laxmikant V. Kale. Mapping parallel applications on the machine topology: Lessons learned

Download the presentation(PDF)
Petascale machines with hundreds of thousands of cores are being built. These machines have varying interconnect
topologies and large network diameters. Computation is cheap and communication on the network is becoming the
bottleneck for strong scaling of parallel applications. Most applications typically have a certain communication
topology. Mapping of tasks in a parallel application based on their communication graph, to the physical processors on
the machine can potentially lead to performance improvements.
This talk will demonstrate that it is not wise to assume that message latencies are independent of the distance a
message travels. When there is contention on the network due to sharing of links between messages, all messages
are slowed down and the effective bandwidth available to each link decreases. We will show that even on Cray
machines with a high bandwidth interconnect, contention can affect message latencies and hence considering the
topology is important.
The talk will show success stories for different MPI and Charm++ applications — performance improvements obtained
using topology aware mapping. Weather Research and Forecasting model (WRF) and Parallel Ocean Program (POP) are
the two MPI applications that will be discussed. The Charm++ applications which will be mentioned are NAMD and
OpenAtom. We will discuss how some applications benefit more from topology aware mapping and others less, while
some others are not affected at all.

Karl Schulz, Ernesto Prudencio and William Barth. A Network Latency and Statistical Calibration Study for an InfiniBand
Fat-Tree Topology

Download the presentation(PDF)
Abstract: Over the past decade, InfiniBand has emerged as a popular commodity
network delivering high-bandwidth and low-latency for use in
high-performance computing (HPC) cluster configurations over a range
of scales. This popularity is exemplified in the Top500 rankings for
systems in the Top 50 and that 36.2% of all Top 500 systems are
InfiniBand based. A frequent design choice for building large-scale
clusters is to use a non-blocking fat-tree topology using leaf
switches to connect groups of compute nodes to larger switches which
contain redundant connections to higher levels of the fat-tree. A
large example of this approach is the TACC Ranger system,
deployed in February 2007 with 62,976 cores, a non-blocking InfiniBand
interconnect, and a peak performance of 579 TFlops.
The present study seeks to statistically calibrate a simple network
model which characterizes the point-to-point MPI latency between any
two of Ranger's 3,456 compute hosts using a simple linear model based
on the number of network hops between compute hosts. The impetus for
this work arises from a companion effort to improve application
performance via topology-aware scheduling. This simple model requires
prior knowledge of the number of hops traversed between all endpoint
host permutations. To fulfill this need, an InfiniBand query tool has
been created which uses available tools provided via OFED to expose
the complete linear-forwarding table of each InfiniBand switch and
this data is traversed to give the address of each switch along the
route between any two hosts.
Using the query service, a set of 3,892 latency measurements was first
generated in a quiescent environment, where each value was associated
with a known number of switch hops. After calibration, the latency
model provided an average error of less than 1.5% compared
to the original Ranger measurements.
In a typical deterministic calibration, one gets model coefficients by
solving an optimization problem. However, we treat the calibration
problem with a Bayesian approach where the solution is given by a
posterior joint probability density function (PDF) We chose a uniform
prior and used a Markov chain Monte Carlo algorithm to generated
samples to obtain a calibrated network model for the system in a
quiescent environment.
For the present model and calibration problem, the deterministic
approach is simpler than the statistical one. However, the Bayesian
approach offers the possibility of (a) treating problems with multiple
minima, and (b) quantitatively comparing competing models.

Pablo Mininni, Duane Rosenberg, Raghu Reddy and Annick Pouquet. Investigation of Performance of a Hybrid
MPI-OpenMP Turbulence Code

Abstract: A hybrid scheme that utilizes MPI for distributed memory parallelism and OpenMP
for shared memory parallelism is presented. The work is motivated by the desire
to achieve exceptionally high Reynolds numbers in pseudospectral computations of
fluid turbulence on emerging petascale, high core-count, massively parallel process-
ing systems. The hybrid implementation derives from and augments a well-tested
scalable MPI-parallelized pseudospectral code. The hybrid paradigm leads to a new
picture for the domain decomposition of the pseudospectral grids, which is helpful in
understanding, among other things, the 3D transpose of the global data that is necessary
for the parallel fast Fourier transforms that are the central component of the
numerical discretizations. Details of the hybrid implementation are provided, and
performance tests illustrate the utility of the method. It is shown that the hybrid
scheme achieves near ideal scalability up to 20000 compute cores with a maximum
mean efficiency of 83%. Data are presented that demonstrate how to choose
the optimal number of MPI processes and OpenMP threads in order to optimize
code performance on two different platforms.

Chao Mei, Gengbin Zheng, Filippo Gioachin and Laxmikant V.Kale. Optimizing a Parallel Runtime System for Multicore
Clusters: a Case Study

Download the presentation(PDF)
Abstract: Clusters of multicore nodes have become the most popular option for new
HPC systems due to their scalability and performance/cost ratio. The
complexity of programming multicore systems underscores the need for powerful
and efficient runtime systems that manage resources such as threads and
communication sub-systems on behalf of the applications.
In this paper, we study several multicore performance issues on clusters using
Intel, AMD and IBM processors in the context of the Charm++ runtime system.
We then present the optimization techniques that
overcome these performance issues. The techniques presented are
general enough to apply to other runtime systems as well. We demonstrate the benefits
of these optimizations through both synthetic benchmarks and production quality
applications including NAMD and ChaNGa on several popular multicore platforms. We
demonstrate performance improvement of NAMD and ChaNGa by about 20% and 10%, respectively.

Tuesday, Aug. 3, 2:20 PM - 4 PM

Subhashini Sivagnanam and Kenneth Yoshimoto. TeraGrid Resource Selection Tools - A Road Test

Download the presentation(ODP)

Abstract: On a grid of computers, users often must decide between individual machines for job submission. Usually,
the goal is to minimize time-to-completion. Several tools are available on TeraGrid to help users make this
decision. In this paper, we use these tools to perform actual job submissions on TeraGrid machines. We evaluate the
relative resource selection capability of these tools.

Stuart Martin, Joe Bester and Steve Tuecke. Globus GRAM5 - a scalable and reliable job management service for
Abstract: The Globus GRAM service enables remote job management for thousands of resource providers and large
scale cyberinfrastructures around the world. One of the key grid services offered by TeraGrid since its inception,
Globus GRAM has recently undergone significant transformation to address scalability concerns and improve the user
experience. Globus GRAM5, currently deployed across a broad range of TeraGrid systems, is now capable of efficiently
managing tens of thousands of jobs while providing backward compatibility with GRAM2 clients including Condor-G
and JGlobus. New features, such as returning job exit codes to the client, job manager logging, service version
detection, and auditing of TeraGrid gateway attributes, enhance usability for both end users and administrators. The
presentation will detail these improvements, describe the benefits to TeraGrid users, and provide a better
understanding of GRAM's use through various examples and use case scenarios.

Warren Smith. The Karnak Prediction Service
Abstract: The existence of cyberinfrastructures, such as the TeraGrid, allows users to execute their applications on a
variety of different computer systems. The obvious question users have each time they wish to run an application is
which computer system should they use? Many factors go into making this decision, but one of them is an estimate of
when the application will begin to execute if submitted to specific computer systems. Similarly, after a user submits an
application to a system, a prediction of when the application will begin to execute allows users to better plan their
work. In addition to users, metaschedulers can make use of these predictions to plan and optimize the amount of time
it takes to execute user workflows. The Karnak service addresses these needs by providing predictions of when jobs
will begin to execute on TeraGrid resources.

Tuesday, Aug. 3, 4:15 PM - 5:15 PM

John Hammond, Tommy Minyard and James Browne. End-to-End Framework for Fault Management for Open Source
Clusters: Ranger

Download the presentation(PDF)
Abstract: The scale and complexity of both hardware and software on large open source software systems such as
Ranger make occurrence of faults and failures inevitable. What is not inevitable is that they should be allowed to go
undetected, nor that diagnosis and recovery from failures should continue to be largely manual and effort intensive.
This paper presents a framework for end-to-end fault management for open source clusters which is being developed
on Ranger, but which targets general open source software based clusters. The elements of the framework are: a
rationalized system
logging stack for Linux, low overhead log and status monitoring, and a multilevel suite of diagnostic analyses. This
paper describes this framework, presents the accomplishments to date, the results which have been obtained with the
elements of the framework which are in place, and the plans for future development including a solicitation for
collaboration on the project.

Bilel Hadri, Mark Fahey and Nick Jones. Identifying Software Usage at HPC Centers with the Automatic Library Tracking Database

Download the presentation(PPTX)
Abstract: A library tracking database has been developed to monitor software/library usage. This Automatic Library
Tracking Database (ALTD) automatically and transparently stores, into a database, information about the libraries
linked into an application at compilation time and also the executables launched in a batch job. Information gathered
into the database can then be mined to provide reports. Analyzing the results from the data collected will help to
identify, for example, the most frequently used and the least used libraries and codes, and those users that are using
deprecated libraries or applications. We will illustrate the usage of libraries and executables on the Cray XT platforms
hosted at the National Institute for Computational Sciences and the Oak Ridge Leadership Computing Facility (both
located at Oak Ridge National Laboratory).

Wednesday, Aug. 4, 10 AM - Noon

Nicholas Malaya, Karl Schulz and Robert Moser. Petascale I/O using HDF-5
Abstract: Our work is focused on performing Petascale Direct Numerical Simulations (DNS) of turbulent flows. An
essential performance component of these simulations are the restart files, as a single petascale simulation will write
on the order of a petabyte. Petascale I/O requires both performance and dataset maintainability, for archival and post
processing of the velocity fields for statistics. This paper presents benchmarks and comparisons between a single
shared file written using the HDF-5 library and a POSIX compliant I/O library on several top-10 machines and
filesystems. It is shown that a properly tuned HDF-5 routine provides strong I/O performance, which coupled with the
metadata handling and portability available to the file format, indicates that the lower performance provides a worthy
tradeoff. The benchmarks presented were provided from real turbulence simulations and required extensive tuning for
different platforms, and in particular, between different file systems (GPFS and Lustre).

Jiahua He, Jeffrey Bennett and Allan Snavely. DASH-IO: an Empirical Study of Flash-based IO for HPC

Download the presentation(PDF)
Abstract: HPC applications are becoming more and more data-intensive as a function of ever-growing simulation sizes
and burgeoning data-acquisition. Unfortunately, the storage hierarchy of the existing HPC architecture has a 5-orderof-
magnitude latency gap between main memory and spinning disks and cannot respond to the new data challenge
well. Flash drives are promising to fill the gap with their 2-order-of-magnitude lower latency. However, since all the
existing hardware and software were designed without flash in mind, the question is how to integrate the new
technology into existing architectures. DASH is a new Teragrid resource aggressively leveraging flash technology (and
also distributed shared memory technology) to fill the latency gap. To explore the potentials and issues of integrating
flash into today's HPC systems, we swept a large parameter space by fast and reliable measurements to investigate
varying design options. We here provide some lessons we learned and also suggestions for future architecture design.
Our results show that performance can be improved by 8x with appropriate existing technologies and
probably further improved by future ones.

Josephine Palencia, Robert Budden and Kevin Sullivan. Kerberized Lustre 2.0 over the WAN

Download the presentation(PDF)
Abstract: In this paper, we describe our current implementation of
kerberized Lustre 2.0 over the WAN with partners from the Teragrid (SDSC), the Naval Research Lab, and the Open Science
Grid (University of Florida). After formulating several single
kerberos realms, we enable the distributed OSTs over the WAN,
create local OST pools, and perform kerberized data transfers
between local and remote sites. To expand the accessibility to
the lustre filesystem, we also include our efforts towards
cross-realm authentication and integration of Lustre 2.0 with
the kerberos-enabled NFS4.

Stephen Simms, Joshua Walgenbach, Justin Miller and Kit Westneat. Enabling Lustre WAN for Production Use on the
TeraGrid: A Lightweight UID Mapping Scheme

Abstract: The Indiana University Data Capacitor wide area Lustre file system provides over 350 TB of short- to
mid-term storage of large research data sets. It spans multiple geographically distributed compute, storage, and
visualization resources. In order to effectively harness the power of these resources from various institutions, it has
been necessary to develop software to keep ownership and permission data consistent across many client mounts.
This paper describes the Data Capacitor’s Lustre WAN service and the history, development, and implementation of
IU’s UID mapping scheme that enables Lustre WAN on the TeraGrid.

Wednesday, Aug. 4, 2:20 PM - 4 PM

Stuart Martin, Steve Tuecke, Steve Graham and Lisa Childers. high-performance, reliable data movement
SaaS for TeraGrid

Download the presentation(PPTX)
Abstract: Scientists have an increasing need to move high volume data across wide area networks in the normal
course of their research, but are often frustrated by limited visibility into how transfers are progressing and in dealing
with failed transfers. Leveraging the widely deployed GridFTP service on TeraGrid, provides
high-performance, reliable data movement using a Software-as-a-Service (SaaS) model that benefits end users,
resource providers, and science gateways alike. End users need not be technology experts in order to share data
among multiple sites. The hosted service reliably manages file transfers, transparently handling transient
errors while allowing end users to monitor and manage requests in progress. For science gateways, command line and
REST interfaces allow these tasks to be delegated to, reducing the overall support burden. And resource
providers can offer a turnkey service with minimal changes to back end infrastructure. The user experience is further
enhanced by planned functionality, including improved notifications and integration with common security
infrastructures such as OpenID and Shibboleth. This presentation provides an overview of the service and
illustrates how it is being used in large scale scientific research.

Jim Basney, Terry Fleury, Jon Siwek and Von Welch. Federated Login to TeraGrid

Download the presentation(PPTX)
Abstract: We present a new federated login capability for the TeraGrid. Federated login enables TeraGrid users to
authenticate using their home organization credentials for secure access to TeraGrid high performance computers,
data resources, and high-end experimental facilities. Our novel system design links TeraGrid identities with campus
identities and bridges from SAML to PKI credentials to meet the requirements of the TeraGrid environment. Our
presentation will describe the capability provided by the standalone site as well as the effort
underway to integrate federated login with the TeraGrid User Portal at
This work was previously presented at the 9th Symposium on Identity and Trust on the Internet (IDtrust 2010),
Gaithersburg, MD, April 13-15, 2010,

Rion Dooley and Maytal Dahan. ActionFolders: Automation, Notification, Workflow
Abstract: In the past, complex software stacks were created to provide job submission, monitoring, notification, and
archiving. Action Folders allow application developers to replace multiple layers of software and services with simple
file management. An Action Folder is a content-aware folder that performs actions based on changes to its contents.
Events such as file/folder creations, deletions, and modifications trigger an Action Folder to execute. Action Folders
utilize a flexible grammar for defining conditions under which actions should take place. Conditions can be simple
create, remove, update, or delete (CRUD) operations on files, process existence, job statuses, regular expression
matching, and file size comparisons. Action Folders are brought to life by a background Perl process called, the Action
Daemon. This process is configurable, runs in user space, and manages a dynamic registry of folders to watch. Action
Folders are powerful enough for complex workflows, automation, administration, notifications, and much more.
Normal folders become Action Folders when a .action template is present. They cease to be Action Folders when the
template is removed. The template file consists of two sections: DEPENDENCIES and ACTIONS. The DEPENDENCIES
section contains one or more grammar statements defining the situation when action should take place. The ACTIONS
section contains a shell script that will execute when all of the DEPENDENCIES are satisfied. Thus, the user has total
flexibility over both the timing and nature of the Action Folder.
Here is a snippet of a .action template that provides push notification of a starting job:
file named test.out created
job 123 running

# target file is present in your environment as $EVENT_FILE
# callout to a trigger service
# email a user
SUBJECT="Job 123 update"
/usr/bin/mail -s "$SUBJECT" "$TO" <<EOF
Your job 123 on `uname -n` is now running.
Time: `date`
`qstat 123`
Rather than deploy a job submission service, create a Action Folder that submits every batch submit script copied into
it. Rather than polling for job status, turn your output folder into an Action Folder that notifies your application
when the job changes state. Rather than deploying a complex workflow engine, simply chain together several tasks
with Action Folders that trigger when the previous task has finished.
Action Folders do not replace the need for mature middleware solutions in complex architectures, but in many cases,
Action Folders are a lightweight and simple way to build up functionality while minimizing the dependencies on
third-party services and APIs. A simpler middleware means less complexity in developing gateways and other
value-added services. This in turn means that more time can be spent adding value for the end user rather than wiring
up the basic functionality.
Today Action Folders are used in a wide array of scenarios. The iPlant project ( uses
Action Folders for job submission and monitoring. The TeraGrid Share Service (beta release) uses Action Folders to
keep metadata in sync with file system changes that occur both online and offline. The CIPRES project
( uses Action Folders to prune GRAM log files before they become too large. Several individual
TeraGrid users employ Action Folders both on and off of TeraGrid systems to manage a diverse list of tasks. Action
Folders are being used to manage user space and remove core dumps before they consume valuable disk space. Also,
Action Folders are being employed as an automator for staging files out of archival storage and onto
disk. Furthermore, Action Folders are being used as a way to create "smart" folders that automatically submit jobs,
convert image formats, or backup data when the appropriate content is copied into them.
Action Folders is still a relatively new project. It was designed based on user needs and remains primarily a
user-driven project. As acceptance grows in the gateway and user communities, new user feedback and requirements
will be incorporated into the project's road map. Future work includes an expanded grammar, broader batch scheduler
support, and a simplified installation process. The Action Folder code is available for download at:

Wednesday, Aug. 4, 4:15 PM - 5:15 PM

Kenny Welshons, Patrick Dorn, Andrei Hutanu, Petr Holub, John Vollbrecht and Gabrielle Allen. Design and
Implementation of a Production Dynamically Configurable Testbed

Abstract: Production quality high speed networks connecting end
devices are needed to design and prototype new types of
distributed applications. The eaviv testbed has been
deployed to provide such a facility, connecting Louisiana
State University, National Center for Supercomputing
Applications and Masaryk University, and providing services
for dynamical provisioning of dedicated bandwidth
via Internet2 ION. This paper includes technical challenges
and solutions in deploying eaviv such as dealing
with multiple administrative domains, network addressing,
and end-to-end connectivity.These issues apply not only to
temporary testbeds, but also to networks which
facilitate long-running experiments and research and whose infrastructure is
global and at least partially dynamic.

William Michener, John Cobb, Robert Cook, Rebecca Koskela and Dave Vieglais. DataONE:Building a virtual data center
for the biological, ecological and environmental sciences

Abstract: Addressing the Earth's environmental problems requires that we change the ways that we do science;
harness the enormity of existing data; develop new methods to combine, analyze, and visualize diverse data
resources; create new, long-lasting cyberinfrastructure; and re-envision many of our longstanding institutions.
DataONE (Observation Network for Earth) represents a new virtual organization whose goal is to enable new science
and knowledge creation through universal access to data about life on earth and the environment that sustains it.
DataONE is designed to be the foundation for new innovative environmental science through a distributed framework
and sustainable cyberinfrastructure that meets the needs of science and society for open, persistent, robust, and
secure access to well-described and easily discovered Earth observational data.
Supported by the U.S. National Science Foundation, DataONE will ensure the preservation and access to multi-scale,
multi-discipline, and multi-national science data. DataONE is interdisciplinary, making biological data available from
the genome to the ecosystem; making environmental data available from atmospheric, ecological, hydrological, and
oceanographic sources; providing secure and long-term preservation and access; and engaging scientists,
land-managers, policy makers, students, educators, and the public through logical access and intuitive visualizations.
Most importantly, DataONE will serve a broader range of science domains both directly and through the
interoperability with the DataONE distributed network. DataONE is a five-year project that began in late 2009.
This talk identifies key environmental scientific, cyberinfrastructure, and sociocultural challenges and provides a road
map for how DataONE is addressing these challenges. In particular, the DataONE cyberinfrastructure includes: a global
network of Coordinating Nodes (providing global access to metadata and support of network-wide services) and
Member Nodes (data repositories hosting data from research networks and academic, governmental and
non-governmental organizations) as well as an Investigator Toolkit (enabling access to key tools and services that
support science data life cycle activities).

Thursday, Aug. 5, 10 AM - Noon

Frank Willmore, John Cazes and Carlos Rosales. Experiences with the Distributed Debugging Tool

Download the presentation(PDF)
Abstract: To achieve results at Petascale and beyond, HPC developers are exploring a range of technologies - from
GPU systems to homogeneous multi-core systems at unprecedented scale. This is leading to the most complex of
software development challenges that can only be solved with help from developer tools such as debuggers.
This presentation will discuss the experiences that members of the Texas Advanced Computing Center have had over
the past three years in deploying a distributed debugger - Allinea's DDT - at scale in multiple systems at TACC. In
particular, the presentation will cover the following topics:
• Use of DDT at large scale on Ranger
• Training of TeraGrid users
• Implementation of DDT with the SGE scheduler
• Use of DDT with CUDA enhanced codes
Currently, TACC staff are using the DDT debugger to refine a CUDA/GPGPU-enhanced molecular dynamics application
on Longhorn. TACC staff also use DDT regularly to assist users and investigate MPI scaling issues.

Eric Seidel, Gabrielle Allen, Steven Brandt, Frank Löffler and Erik Schnetter. Simplifying Complex Software Assembly:
The Component Retrieval Language and Implementation

Download the presentation(PDF)
Abstract: Assembling simulation software along with the associated tools and utilities is a challenging endeavor,
particularly when the components are distributed across multiple source code versioning systems. It is problematic for
researchers compiling and running the software across many different supercomputers, as well as for novices in a field
who are often presented with a bewildering list of software to collect and install.
In this paper, we describe a language (CRL) for specifying software
components with the details needed to obtain them from source code
repositories. The language supports public and private access. We
describe a tool called GetComponents which implements CRL and
can be used to assemble software.
We demonstrate the tool for application scenarios with the Cactus
Framework on the NSF TeraGrid resources. The tool itself is
distributed with an open source license and freely available from our
web page.

Doug Roberts, James Rineer and Diglio Simoni. ABM++, A Distributed ABM Software Framework
Abstract: The ABM++ software framework is a tool which allows the developer to implement agent based models
using C++ that are to be deployed on distributed memory Linux clusters. The framework provides the necessary
functionality to allow applications to run on distributed architectures. A C++ message passing API is provided which
provides the ability to send MPI messages between distributed objects. The framework also provides an interface that
allows objects to be serialized into message buffers, allowing them to be moved between distributed compute nodes. A
synchronization method is provided, and both time-stepped and distributed discrete event time update mechanisms
are provided.
ABM++ is completely flexible with respect to how the developer chooses to design his C++ representations of agents.
All of the functionality necessary for distributed computation is provided by the framework; the developer provides the
C++ agent implementations for his application. Distributed computing functionality is provided by the framework to
the application via simple inheritance and containment. The source code includes a simple working example in the
Applications directory to illustrate how the framework is used.
A virtual machine "appliance" version of the ABM++ framwork is also provided. The appliance consists of an Ubuntu
image file that can be run using either VirtualBox or VMWare. The ABM++ application is imbedded in the Eclipse IDE
in the appliance. The appliance may be run on Windows, MacOSX, or Linux desktop/laptop machines. This
environment provides a simple, convenient way to develop, test and debug ABM++ distributed applications without
requiring access to a Linux cluster.
See the ABM++ User's Guide for more information, and for instructions on how to use the appliance.
This code is released under the terms of the GNU General Programming License (GPL).

Emre Brookes and Borries Demeler. Performance Optimization of Large Non-Negatively Constrained Least Squares
Problems with an Application in Biophysics

Download the presentation(PDF)
Abstract: Solving large non-negatively constrained least squares systems is frequently used in the physical sciences
to estimate model parameters which best fit experimental data. Analytical Ultracentrifugation (AUC) is an important
hydrodynamic experimental technique used in biophysics to characterize macromolecules and to determine
parameters such as molecular weight and shape. We previously developed a parallel divide and conquer method to
facilitate solving the large systems obtained from AUC experiments. New AUC instruments equipped with multiwavelength
(MWL) detectors have recently increased the data sizes by three orders of magnitude. Analyzing the MWL
data requires significant compute resources. To better utilize these resources, we introduce a procedure allowing the
researcher to optimize the divide and conquer scheme along a continuum from minimum wall time to minimum
compute service units. We achieve our results by implementing a preprocessing stage performed on a local
workstation before job submission.


Contact Technology Track Co-Chairs, JP Navarro (ANL) and Michael Pflumacher (NCSA).