Click on an individual title to be taken directly to a tutorial description.



Title: Running Parallel Simulations and Enabling Science Gateways with the NSF MATLAB Experimental Resource at Cornell

Length: Half-day tutorial.

Level of the material: Primarily introductory.

Agenda: Participants will learn how to run MATLAB programs remotely on the experimental "MATLAB on the TeraGrid" cluster located at the Cornell Center for Advanced Computing (CAC), as well as learn how to access this NSF-funded resource from a Science Gateway or Web portal.

Following a project overview, participants will download and install the client code required to run MATLAB programs on the NSF experimental computing resource. After ensuring that all users are able to run the basic examples provided with the client code, we will walk through the development and testing of a parallel code that will demonstrate important features of the client software. We will also demonstrate how a user can develop an application locally and scale it to the remote resource, and how to debug problems as they arise. Experimentation with the demo code, as well as trying out personal research codes, will be encouraged.

Participants will then learn about nanoHUB, an NSF Science Gateway, and will run MATLAB applications through the nanoHUB portal. Development details will be covered, including how to detect, catch, and diagnose errors; how to run compiled MATLAB codes remotely; how to run non-interactive sessions; and how to run simulations without creating separate TeraGrid accounts for each user.

Dr. Nathaniel Woody, Cornell University Center for Advanced Computing
Dr. Steven Clark, Purdue University Rosen Center for Advanced Computing
Susan Mehringer, Cornell University Center for Advanced Computing

Software requirement: Participants are encouraged to arrive with a laptop with the required software (MATLAB R2009a or R2009b) and the Parallel Computing Toolbox (http://www.mathworks.com/products/parallel-computing/) installed. Updates on additional supported versions will be shared with registrants in advance of the workshop. We will have a few laptops available for participants who do not have their own. Participants will be provided with accounts allowing them to access the NSF experimental cluster at Cornell.

Prerequisites: Participants should have a working knowledge of MATLAB and a basic understanding of the concepts underlying parallel computing.

Abstract: Cornell University, in partnership with Purdue University, received a National Science Foundation award to deploy a new experimental computing resource called "MATLAB on the TeraGrid." This cluster provides a parallel MATLAB capability that is available from a MATLAB client running on a user's workstation. Scientists and engineers can seamlessly scale applications from a local machine running Linux, Windows, or Mac OS X to the remote cluster. In addition, Science Gateway technologists may use the cluster as a backend to their portal and integrate MATLAB code into their job submission framework. The pervasiveness of MATLAB in a wide variety of fields has created the need for a computational resource that seamlessly scales applications from the desktop to a larger-scale MATLAB resource without a steep learning curve. MATLAB Distributed Computing Server runs effectively on this resource and provides parallel and distributed computational services to interactive desktop users as well as Science Gateways such as NanoNet, a nanoHUB application. The MATLAB on the TeraGrid cluster lowers the barriers to parallel computing for inexperienced users and, at the same time, provides the more advanced features required by experienced parallel programmers. This tutorial will include instruction on how to use the MATLAB on the TeraGrid cluster both as an extension of MATLAB on the desktop and as a simulation tool for Science Gateways. Participants will learn how to seamlessly shift work to a remote cluster operating at CAC, as well as how nanoHUB uses this cluster resource to enable hundreds of users to run parallel MATLAB code.

Title: How to Design an HPC Cluster

Time: Half-day session

Level: Introductory 50%, Intermediate 50%

Software Requirements: None

Prerequisites: None

Daniel LaPine, NCSA
Jeremy Enos, NCSA
Nathaniel Mendoza, SDSC

This tutorial will present participants with information on how to design an HPC cluster. We will focus on the design of small- to mid-range clusters (fewer than 128 nodes) and work through the specific questions that a good design should answer. We will cover hardware, software, and general design, with information on the options available today. Given the focus on GPU acceleration, we will take some time to consider the benefits and pitfalls of using GPUs in HPC clusters. For the final session, we will pick two computing goals and go over the process of designing clusters to meet those goals.

Session 1: Intro and design goals
Session 2: Hardware and software
Session 3: GPU acceleration
Session 4: Design exercise

Title: Scalable Systems Management with Puppet

Download the tutorial (PPT)

LEVEL: Intermediate (Half Day)

REQUIREMENTS: Participants are expected to have basic Linux knowledge. Experience with system administration would be a plus.

Stephen McNally (NICS)
Nick Jones (NICS)

This tutorial will teach participants how to automate common data center tasks using Puppet. Puppet is a configuration management tool that can be used to manage system changes such as adding users, installing packages, and updating server configurations. Additionally, Puppet can ease deployment of new systems, help you recover from hardware failures, provide security benefits, and manage clusters. Throughout this tutorial, we will discuss common challenges that system administrators face and how many of them can be overcome using Puppet. The instructors will give live demonstrations of managing a Puppet server, and participants will be encouraged to participate and ask questions throughout the session.
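
To give a flavor of Puppet's declarative approach, here is a minimal sketch of a manifest that manages a package, its configuration file, and its service; the resource names and paths are illustrative, not taken from the tutorial:

```puppet
# Illustrative manifest: keep Apache installed, configured, and running.
package { 'httpd':
  ensure => installed,
}

file { '/etc/httpd/conf/httpd.conf':
  ensure  => file,
  source  => 'puppet:///modules/apache/httpd.conf',  # served from the Puppet server
  require => Package['httpd'],                       # install the package first
}

service { 'httpd':
  ensure    => running,
  enable    => true,
  subscribe => File['/etc/httpd/conf/httpd.conf'],   # restart when the config changes
}
```

Applying a manifest like this on every managed node keeps them converged to the declared state, which is the property that makes recovery from failures and large-scale deployment easier.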


08:00 Challenges that System Administrators Face: Discuss day-to-day challenges that system administrators face when not using a centralized configuration management tool.
08:30 Why Puppet? Present Puppet's features, show how Puppet can manage a data center environment, and compare Puppet to other configuration management tools.
09:00 Puppet Installation and Configuration: Discuss the installation and configuration of both the Puppet client and the Puppet server. Present an overview of how Puppet works internally.
09:45 Break
10:00 Managing Your Infrastructure with Puppet: How to perform basic tasks with Puppet, including managing files, packages, and services. Discuss examples of managing common applications and tools such as Apache, DNS, DHCP, MySQL, and more.
11:00 Advanced Puppet Topics: Discuss module dependencies and inheritance, system installation, cluster management, and using Puppet as a security tool.

Title: Hands-on Tutorial for Building Cyberinfrastructure-Enabled and Community-Centric Science Gateway Applications

Download the tutorial (PDF)

Type: Half-day hands-on tutorial

Material level: Intermediate

Yan Liu, Shaowen Wang (UIUC)
Nancy Wilkins-Diehr (SDSC)

Software requirements: Web browser (Firefox is strongly recommended), ssh client (e.g., PuTTY)

Prerequisites: General understanding of Grid computing, Web 2.0 technologies (JavaScript, AJAX), Web application development (PHP and Java), and Web services

Tutorial description:
The science gateway approach has been widely adopted to establish bridges between cyberinfrastructure (CI) and domain science communities and to enable end-to-end domain-specific computations on CI through efficient management of CI complexities within science gateways. As CI resources become increasingly available and accessible to researchers and scientists, the effectiveness of gateways depends on their community-wide usability and the degree to which researchers are able to concentrate on their domain problem solving. This tutorial uses SimpleGrid, a toolkit for efficient learning and development of science gateway building blocks, to provide hands-on experience in leveraging TeraGrid resources for scientific computing, developing and integrating TeraGrid-enabled domain-specific science gateway applications, and creating highly usable science gateway user environments. The intended audience for this tutorial includes researchers and developers who are interested in building new CI-powered science gateways.

This tutorial will cover the following objectives:

  • Use TeraGrid computing and data resources for domain-specific scientific computing
  • Develop CI-enabled domain applications by using the SimpleGrid application programming interface (API) to access TeraGrid capabilities, i.e., security, data transfer, job submission and monitoring, and information services
  • Build CI-enabled application Web services to achieve "anytime, anywhere" service access and scalable gateway application integration
  • Compare Web portal development and Web 2.0-based rich-client user environment development
  • Develop a highly interactive Web 2.0 science gateway user environment to enable community-wide sharing and collaboration

This tutorial is carried out through the practice of turning an example domain application from geographic information science (GIScience) into a science gateway application featuring a highly usable Web 2.0 user interface and seamless TeraGrid access. This application is a representative domain-specific application that requires dataset handling, intensive computing, and visualization. The proposed agenda includes:

  1. Illustrate how to manually conduct domain-specific application computation on TeraGrid using command line tools, along with introduction to TeraGrid account access, data resources, HPC clusters and job management services, software environment, and information services.
  2. Streamline the tasks in step 1 programmatically and develop a TeraGrid-enabled application using the SimpleGrid Application Programming Interface (API). Useful features for science gateway application development, such as community account credential management and batch-mode job submission, will be introduced.
  3. Build the TeraGrid-enabled application from step 2 into a REST/SOAP Web service. The Apache Axis2 toolkit will be used to streamline the Web service development process. The advantages of using a service-oriented architecture for gateway application integration will be discussed.
  4. Develop a Web 2.0 user interface to enable community-wide shared access to the Web service developed in step 3. The usability of this interface is improved by using Web 2.0 user interface (Yahoo UI) and communication (AJAX) technologies. A comparison will be given to illustrate the difference between portal applications (portlets), which typically follow the Model-View-Controller (MVC) pattern, and rich-client Web 2.0 applications, in which the control and view parts move to the Web client (browser) side.
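
The service-wrapping idea in step 3 can be sketched with the Python standard library; the tutorial itself uses the Apache Axis2 toolkit and Java, so the endpoint, the wrapped command, and the JSON response shape below are invented purely for illustration:

```python
import json
import subprocess
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# A stand-in "domain application": a simple echo command. A real gateway
# would wrap a scientific command-line tool and submit it to the grid.
class JobHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Run the wrapped command-line tool and return its output as JSON.
        out = subprocess.run(["echo", "hello from the grid"],
                             capture_output=True, text=True)
        body = json.dumps({"stdout": out.stdout.strip()}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

# Serve on an ephemeral local port, then call the service like a client would.
server = HTTPServer(("127.0.0.1", 0), JobHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

port = server.server_address[1]
reply = json.loads(urlopen(f"http://127.0.0.1:{port}/run").read())
print(reply["stdout"])  # hello from the grid
server.shutdown()
```

Once the application sits behind a service interface like this, any portal, gadget, or Web 2.0 client can invoke it, which is the integration advantage the agenda item refers to.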

The design principles and technologies used in SimpleGrid originate from the development of the TeraGrid Geographic Information Science Gateway (http://GISolve.org). Hands-on tutorials and demos using SimpleGrid have been conducted at the TeraGrid 2007 and 2009, Supercomputing 2007, and SciDAC 2009 conferences, where the covered topics were enthusiastically received by the audience. The SimpleGrid toolkit has also been used and evaluated as a learning tool for building new TeraGrid science gateways such as the CIPRES gateway.

Title: Remote Scientific Visualization

Title: Introduction to Scientific Visualization on Longhorn

Longhorn, TACC's Dell XD visualization cluster, has 256 compute nodes plus 2 login nodes, with 240 nodes containing 48 GB of RAM, 8 Intel Nehalem cores (at 2.5 GHz), and 2 NVIDIA Quadro FX 5800 GPUs. Longhorn also has an additional 16 large-memory nodes containing 144 GB of RAM, 8 Intel Nehalem cores (at 2.5 GHz), and 2 NVIDIA Quadro FX 5800 GPUs. Longhorn has a QDR InfiniBand interconnect and an attached Lustre parallel file system. Longhorn users also have access to Ranger's Lustre parallel file system, making it more convenient to work on datasets generated on Ranger.

Tutorial attendees will receive instructions on the use of remote visualization software to visualize data sets generated on systems such as Ranger. A review of the scientific visualization process will precede an overview of the visualization software available to Longhorn users, including the parallel visualization software VisIt and ParaView. Hands-on lab sessions will provide students with the opportunity to prepare data sets to be visualized using these applications. In addition, attendees will be introduced to the Longhorn visualization portal.

Instructor: Dr. Kelly Gaither, Texas Advanced Computing Center

Software Requirements: Attendees must bring their own laptop and install the following software: VisIt, ParaView, a VNC viewer, an ssh client, and an X window manager (e.g., Xming).

Prerequisites: Basic understanding of Linux/Unix.

Duration: Half day


Title: Cloud Technologies, Data Intensive Science and the TeraGrid

Scott McCaulay (Indiana University)
Judy Qiu (Indiana University)
Marlon Pierce (Indiana University)
Rich Knepper (Indiana University)

Level: Intermediate

Format: Hands-on
Duration: Half day

Prerequisites: Experience in scientific computing, including Linux, MPI, and service-oriented architectures; experience working in a Unix environment and developing and running scientific codes written in C, C++, Java, or C#.

Requirements: Students are required to bring their own desktops or laptops (preferably with 2 or more cores) running Windows Vista or Windows 7.

Overview: Several new computing paradigms are emerging from large commercial clouds. These include virtual machine-based utility computing environments such as Amazon AWS and Microsoft Azure, as well as new MapReduce programming paradigms, originating in the information retrieval field, which have been shown to be effective for scientific data analysis. In addition to commercial availability, Indiana University's FutureGrid project makes cloud technologies available on a TeraGrid resource. This tutorial introduces key concepts and provides a common set of simple examples. It is designed to help participants understand and compare the capabilities of these new technologies and infrastructure, and to provide a basis for beginning to use these tools.
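
The MapReduce paradigm mentioned above can be sketched in a few lines. This hedged example uses plain Python (no Hadoop or Dryad) to show the three stages that those frameworks distribute across many machines:

```python
from collections import defaultdict
from itertools import chain

# Minimal word-count illustration of the MapReduce model. Hadoop and Dryad
# run the same map / shuffle / reduce stages in parallel across a cluster.

def map_phase(doc):
    # Emit one (word, 1) pair per word occurrence.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"], counts["fox"])  # 3 2
```

Because map and reduce are pure per-key functions, a framework can partition documents and keys across nodes freely, which is what makes the model attractive for the data-intensive analyses the tutorial covers.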

Proposed Agenda:

  1. Introduction: Current Clouds with Infrastructure, Platform and Software as a Service
  2. Basic Amazon EC2 and S3 and initial discussion of applications that will be used
  3. Academic Cloud environments: open-source Eucalyptus, Nimbus, OpenNebula …
  4. Microsoft Platform as a Service: Azure
  5. Linux Platform as a Service: Advanced Amazon and Google App Engine
  6. MapReduce: Using Hadoop and Dryad
  7. Extensions of MapReduce including Twister (i-MapReduce) from Indiana University
  8. Performance of Life Sciences applications using Azure, EC2, Eucalyptus, Nimbus, and FutureGrid, with a comparison of Cloud and MPI technologies
  9. Accessing Cloud Technologies through Gateways
  10. FutureGrid Implementation of Cloud Technologies

Title: Using vSMP and Flash Technologies for Data Intensive Applications

Duration: Half day

Content level: Introductory 50%, Intermediate 50%

A Hands-on/Demo is planned.

Mahidhar Tatineni, San Diego Supercomputer Center (SDSC), University of California, San Diego (UCSD), mahidhar@sdsc.edu
Jerry Greenberg, San Diego Supercomputer Center (SDSC), University of California, San Diego (UCSD), jpg@sdsc.edu
Arun S. Jagatheesan, San Diego Supercomputer Center (SDSC), University of California, San Diego (UCSD), arun@sdsc.edu

Virtual shared-memory (vSMP) and flash memory technologies can significantly accelerate investigations of a wide range of data-intensive problems. Dash is a new TeraGrid resource at SDSC that showcases both of these technologies. This tutorial will provide a basic introduction to using vSMP and flash technologies and to accessing them on the TeraGrid. Hands-on material will be used to demonstrate their use and performance benefits.
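
The fast-file-I/O and memory-hierarchy ideas behind Dash's flash nodes can be previewed with a memory-mapped file: data larger than RAM stays on (flash) storage and is paged in on demand. This is a hedged, standard-library-only sketch; on Dash the backing file would live on a flash filesystem such as /ssdfs, with a temp file standing in here:

```python
import mmap
import os
import struct
import tempfile

# Write one million doubles (values 0..N-1) to a file in modest chunks,
# so peak memory stays small even for datasets much larger than RAM.
N = 1_000_000
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    for i in range(0, N, 100_000):
        f.write(struct.pack("100000d", *range(i, i + 100_000)))

# Memory-map the file: the OS pages data in from storage on demand,
# so random access does not require loading the whole dataset.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    (value,) = struct.unpack_from("d", m, 123456 * 8)  # read double #123456

print(value)  # 123456.0
os.remove(path)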


  1. Dash Architecture
    1. System Details
    2. Introduction to vSMP
    3. Use of flash and non-volatile memory hierarchy for HPC and High Performance Data (HPD) (fast file I/O, random IOPS, swap space)
  2. Dash user environment
    1. Accessing vSMP node and flash memory
    2. Compilers, Software libraries
    3. Running jobs on regular and vSMP partitions (hands-on examples)
  3. Hands-on examples on the use of the vSMP node
    1. Running MPI, OpenMP, and hybrid codes; process pinning and best practices
    2. Memory pinning
    3. Aggregated filesystems (/ramfs, /ssdfs)
  4. Hands-on examples illustrating flash memory use
    1. Direct access (database example)
    2. Improving HPC and High Performance Data (HPD) performance using flash and vSMP
    3. Programming using flash: as swap space; in the memory hierarchy
  5. Q&A session, including hands-on preliminary work with attendee codes on vSMP nodes and flash I/O nodes

Title: Introducing RDAV & Nautilus: Resources for Remote Visualization, Data Analysis and Workflow Management

RDAV, the UT/NICS Center for Remote Data Analysis and Visualization, provides new TeraGrid resources for remote visualization, data analysis and workflow management. The hub of RDAV is Nautilus, an SGI UltraViolet machine with 4 TB of shared memory and 1024 processors. This half-day tutorial will introduce participants to the capabilities of RDAV and Nautilus in the areas of remote visualization, data analysis and workflow management. Participants will be able to log onto the Nautilus system and complete short hands-on exercises that demonstrate typical usages in these three areas.

Proposed Agenda:

  • Introduction/Overview of RDAV and Nautilus system
  • Remote visualization using VisIt (basics)
    • Getting data into VisIt
    • Plotting and manipulating data
    • Quantitative analysis
  • Data analysis with R (basics)
    • Getting your data into R
    • Objects and handling data
    • Introduction to functions, plots and packages
    • Interactive and batch parallel programming
  • Scientific workflow
    • Managing and monitoring simulations
    • Creating a "Hello World!" workflow from standard components
    • Opening and running an existing workflow
    • Creation of web service based workflows

Instructors: Sean Ahern (ORNL/UT)
Amy Szczepanski (UT)
Gary Liu (ORNL)
Scott Simmerman (UT)

Software Requirements: Participants are encouraged to have VisIt 2.0 installed on their laptops for the remote visualization portion of the tutorial.

Prerequisites: None

Time: Half Day

Level: Introductory (1st half)

Title: Open Grid Computing Environments Software for Science Gateways

Marlon Pierce (Indiana University)
Suresh Marru (Indiana University)

Intended Audience: Science gateway and Web portal developers interested in developing science gadgets, wrapping science applications as services, and managing workflows on grids and clouds. Requires experience with Java and Web programming.

Tutorial level: Intermediate to advanced

Expected Duration: Half Day

Tutorial Format: hands-on exercises and interactive demonstrations supplemented with overview slides and detailed presentations.


Overview: Science Gateways are Web-based environments that enable users to construct, share, execute, and monitor science applications on Grids and Clouds. They consist of Web-based user interfaces and supporting Web services. Typically, the science applications that form the core of the Gateway are designed to run in single-user environments rather than as Web applications. These tasks often need to be tied together into composite applications that span multiple computing resources on the TeraGrid.

In this tutorial, we present a set of packaged, downloadable Science Gateway development tools to address these problems. These tools include the OGCE Gadget Container, a Google gadget-based Web portal container for hosting gadget interfaces; XBaya, a workflow composition, execution, and monitoring tool; GFAC, a Web service for creating and managing other Web services that wrap command-line scientific applications; the OGCE Messaging Service, a remote event management service; and XRegistry, a service registry that is used as a repository for sharing GFAC and other services and workflows. This domain independent tool suite allows developers supporting many different scientific communities to build gateways that allow users to selectively and securely share their gadgets, applications, and workflows.

Tutorial Outline: The tutorial will be divided into the following sections.

  • Introduction and overview of the OGCE project, its software components, its source code organization, and its build and deploy system. These will prepare the students for the hands-on sessions.
  • Demonstration of the OGCE gadget container and gadgets for science gateways, followed by hands-on session with software. Topics include deploying and modifying the software and building science gateway gadgets. We will also show how to build gadgets that interact with TeraGrid Information services. We will discuss OpenID and OAuth security issues for portals and gadgets.
  • Demonstration of the OGCE science application wrapping services for managing jobs and workflows on grids and clouds. The tutorial examples will include simple command line applications as well as the WRF weather forecasting tool. Tutorial participants will set up and run OGCE services (GFAC, XRegistry, Messenger) and user interface components (Registry, Experiment Builder, and XBaya gadgets). These components will be used to wrap, register, and compose service-based workflows through Web front ends.

John McGee
Renaissance Computing Institute, UNC-CH

Jason Reilly
Renaissance Computing Institute, UNC-CH

Mats Rynge
USC, Information Sciences Institute

Title: Computing Across Both Open Science Grid and the TeraGrid

In this tutorial, we will present an overview of computational science in the context of High Performance Computing (HPC), High Throughput Computing (HTC), and an emerging combination known as High Throughput Parallel Computing (HTPC). The lectures will be based on real-world examples of computational research activities conducted during the previous 12 months, including (but not limited to) metagenomics, astronomy, biochemistry, weather, and coastal circulation and sea level rise modeling. We will describe the Workload Management Systems (WMSs) available for both the Open Science Grid and the TeraGrid, and briefly discuss the core infrastructure components and services of these CI programs that enable these WMSs (e.g., information services, common job submission interfaces, and monitoring).

We will discuss existing workflow-based use cases that combine HPC on the TeraGrid with HTC on the Open Science Grid. This tutorial will also provide a brief discussion of Science Gateways and an overview of the RENCI Science Portal (RSP), dissecting recent large-scale usage for high throughput computing across the Open Science Grid, the TeraGrid, local RENCI resources, and an NIH-sponsored cluster.

The tutorial will be roughly 60% lecture and 40% hands-on lab exercises. The hands-on lab exercises will be divided into two sections, one utilizing the OSG Engage hosted infrastructure to run jobs on OSG, and another utilizing web service interfaces in the RENCI Science Portal to run jobs across the TeraGrid and OSG. Attendee computers will require an SSH client for the OSG Engage work and a recent Java environment for accessing the Science Gateway web services. Attendees will also need to add a CA certificate to their Java keystore so that SSL connections can be made to the Science Gateway.

Length: Half Day

Title: Optimization in Multi-Core Systems

Download the tutorial (PDF)

Kent Milfeld
Lars Koesterke
Yaakoub El-Khamra
John Lockman
Carlos Rosales

Affiliation: Texas Advanced Computing Center, The University of Texas at Austin

Duration: Half-day

Level: Intermediate

Prerequisites: Knowledge of C or Fortran, and MPI.

Software: A laptop with an ssh client will be required for the hands-on sessions.

This tutorial will cover simple yet effective optimization strategies for codes running in multi-core and multi-node systems. The tutorial will have a strong hands-on component where exercises and examples will reinforce the theory.

Proposed agenda:
  • Basic optimization
    • Compiler options
    • High performance libraries
    • Hand tuning
    • I/O strategies
    • Hands-on session
      • Comparing kernel/code performance under different optimization strategies
  • Parallel optimization
    • MPI
      • Point-to-point considerations
      • Collective operations
      • Using topologies
      • Efficiency of different MPI implementations
      • Simple approaches to using MPI-IO effectively
    • OpenMP
      • NUMA control: memory policy and process affinity
      • Hybrid codes
    • A few notes on GPUs
      • GPUs as massively threaded processors
      • CUDA and memory management
    • Hands-on session
      • Creating subgroups and communicators
      • Scaling of point-to-point and collective operations
      • Optimizing collective communications using subgroups
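
The "High performance libraries" item under basic optimization can be illustrated with a minimal sketch. This uses Python/NumPy as a stand-in (the tutorial itself targets C and Fortran codes linked against tuned vendor libraries); the point is that a hand-written loop rarely beats the library routine:

```python
import time
import numpy as np

# Same dot product two ways: a hand-written interpreted loop versus a
# BLAS-backed library call. Timings are illustrative, not benchmarks.
n = 200_000
a = np.random.rand(n)
b = np.random.rand(n)

t0 = time.perf_counter()
naive = sum(a[i] * b[i] for i in range(n))   # element-by-element loop
t_naive = time.perf_counter() - t0

t0 = time.perf_counter()
fast = float(a @ b)                          # optimized library routine
t_fast = time.perf_counter() - t0

assert abs(naive - fast) < 1e-6 * n          # same answer either way
print(f"loop: {t_naive:.4f}s  library: {t_fast:.6f}s")
```

The same trade-off holds in compiled codes: linking against a tuned BLAS/LAPACK or FFT library usually outperforms hand-rolled kernels with far less effort.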

Title: Performance Analysis and Tuning for GPUs

Download a PDF of the tutorial (part 1)

Download a PDF of the tutorial (part 2)

Richard Vuduc
Hyesoon Kim
Brandon Hill
Tabitha Samuel

Institutions: Georgia Institute of Technology, Atlanta; University of Tennessee, Knoxville

Motivation and objectives. The goal of this full-day tutorial is to provide a hands-on introduction to the analysis, modeling, and tuning of performance for general-purpose graphics processing unit (GPGPU) platforms. Such systems feature prominently in current and future NSF TeraGrid acquisitions, including Lincoln at NCSA and the upcoming XD Keeneland system at ORNL, so that it is becoming increasingly important for developers to understand how best to use GPU resources. Fortunately, there is a wide variety of high-quality introductory material available on basic GPU programming, as well as numerous examples, making it easy to get started using GPUs. However, we believe there is far less such guidance available on the more advanced topics of performance modeling, analysis, and tuning, so that efficient use of GPUs remains a challenge for most developers. Our proposed tutorial aims to meet this need.

Prerequisites. The level of difficulty of our tutorial is suitable for attendees with an "advanced beginner" or intermediate-level GPU background. That is, we assume some exposure to the basics of CUDA and/or OpenCL programming models, with prior GPU programming experience highly recommended. (We review the basics, but prior exposure helps to facilitate the hands-on exercises.) Note that CUDA and OpenCL are the dominant programming models used for GPGPU programming on both the TeraGrid NCSA/Lincoln and ORNL/Keeneland platforms, which are based on NVIDIA GPUs. To participate in the hands-on exercises, attendees will need to provide their own laptops with ssh software capabilities.

Outline of material. The material we cover draws from numerous sources, including (a) the available information and guidance on GPU performance analysis and tuning; (b) material we have developed and used for our undergraduate and graduate courses at Georgia Tech; (c) insights from our research on performance analysis, modeling, and tuning for GPUs [7, 4, 5, 8, 1, 6, 3, 2]; and (d) experience at the National Institute for Computational Sciences (NICS) at UTK with assisting developers porting codes to GPU systems. We summarize the material we intend to cover by topic below. Each topic has a hands-on component, which we also describe.

  1. (1.5 hours) CUDA review: Review the basics of the CUDA programming model and typical NVIDIA architectures, including the new Fermi architectures that are the basis for ORNL/Keeneland. Attendees will receive accounts on a Georgia Tech GPU cluster similar to NCSA/Lincoln, and will practice logging in and compiling simple CUDA programs that we will provide.
  2. (1.5 hours) Performance tools and analysis: We will demonstrate basic performance analysis tools, with hands-on exercises that show how to diagnose various kinds of bottlenecks on examples that we provide. The tools include NVIDIA's CUDA Profiler, as well as GPUOcelot, a Georgia Tech-based tool suite that provides more advanced analytical capabilities. The key distinction of our tutorial from existing material is its focus on step-by-step analysis.
  3. (1.5 hours) From analysis to tuning: In this segment, we do a step-by-step walkthrough in which we take a computation and, through a series of analysis and transformation steps, tune it to achieve a high level of performance.
  4. (0.5 hours) Application case study: Bruce Loftis and Tabitha Samuel (NICS/UTK), who work closely with NICS customers, will report on their experience helping customers tune for GPUs.
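
The modeling flavor of the analysis and tuning segments above can be previewed with a roofline-style bound, sketched here with illustrative placeholder numbers rather than measured GPU specifications:

```python
# Roofline-style reasoning: attainable performance is capped either by the
# device's compute rate or by memory bandwidth times arithmetic intensity.
# The peaks below are illustrative placeholders, not real GPU specs.
peak_gflops = 933.0    # assumed peak single-precision rate (GFLOP/s)
peak_bw_gbs = 102.0    # assumed memory bandwidth (GB/s)

def attainable_gflops(flops_per_byte):
    # Whichever ceiling is lower binds the kernel.
    return min(peak_gflops, peak_bw_gbs * flops_per_byte)

# SAXPY moves 12 bytes per 2 flops, so it is firmly bandwidth-bound
# (about 17 GFLOP/s under these assumed peaks).
print(attainable_gflops(2 / 12))
# A compute-heavy kernel (100 flops/byte) hits the compute ceiling instead.
print(attainable_gflops(100.0))  # 933.0
```

Estimates like this tell a developer up front whether to attack memory traffic (coalescing, shared-memory blocking) or instruction throughput, which is exactly the diagnosis step the tutorial's profiling exercises practice.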

Instructors. Hyesoon Kim and Rich Vuduc are both faculty members at Georgia Tech, and Bruce Loftis and Tabitha Samuel provide direct support to users of the major computing facilities at ORNL. Kim's research focus is in the area of computer architecture and performance tools, particularly for GPU-based systems [7, 4]. Vuduc's research focus is on parallel algorithms and automated performance tuning (autotuning), with a recent focus on GPU systems [5, 8, 1, 6, 3, 2]. Bruce Loftis leads user support at NICS, and Tabitha Samuel works closely with developers migrating to new systems. In addition, two graduate students from Georgia Tech will accompany Kim and Vuduc to serve as teaching assistants for the hands-on exercises.

[1] N. Arora, A. Shringarpure, and R. Vuduc. Direct n-body kernels for multicore platforms. In Proc. Int'l. Conf. Parallel Processing (ICPP), Vienna, Austria, September 2009.
[2] A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, and R. Vuduc. Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures. In Proc. IEEE Int'l. Parallel and Distributed Processing Symp. (IPDPS), Atlanta, GA, USA, April 2010.
[3] J. W. Choi, A. Singh, and R. W. Vuduc. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP), Bangalore, India, January 2010.
[4] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proc. ACM Int'l. Symp. Comp. Arch. (ISCA), pages 152–163, Austin, TX, USA, June 2009.
[5] S. Kang, D. Bader, and R. Vuduc. Understanding the design trade-offs among current multicore systems for numerical computations. In Proc. IEEE Int'l. Parallel and Distributed Processing Symp. (IPDPS), Rome, Italy, May 2009.
[6] I. Lashuk, A. Chandramowlishwaran, H. Langston, T.-A. Nguyen, R. Sampath, A. Shringarpure, R. Vuduc, L. Ying, D. Zorin, and G. Biros. A massively parallel adaptive fast multipole method on heterogeneous architectures. In Proc. ACM/IEEE Conf. Supercomputing (SC), Portland, OR, USA, November 2009.
[7] C.-K. Luk, S. Hong, and H. Kim. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proc. IEEE/ACM Int'l. Symp. Microarchitecture (MICRO), New York, NY, USA, December 2009.
[8] S. Venkatasubramanian and R. W. Vuduc. Tuned and wildly asynchronous stencil kernels for hybrid CPU/GPU platforms. In Proc. ACM Int'l. Conf. Supercomputing (ICS), New York, NY, USA, June 2009.

Title: PerfExpert: An Automated Approach to Analyzing and Optimizing the Node-Level Performance of HPC Applications

Download the tutorial (PPTX)

James Browne, The University of Texas at Austin, browne@cs.utexas.edu
Martin Burtscher, The University of Texas at Austin, burtscher@ices.utexas.edu

Length: half day

Breakdown: introductory (35%), intermediate (35%), advanced (30%)

Intended audience: application writers and performance support staff for HPC clusters

URL: http://www.tacc.utexas.edu/perfexpert/ (quick-start guide, Supercomputing 10 paper, etc.)

Tentative outline:
Performance evaluation overview (1/4 hour)
PerfExpert introduction and demo (3/4 hour)
Node-level optimizations (1 hour)
Hands-on optimization of applications and porting PerfExpert to new platforms (1 hour)

The goal of this tutorial is to enable application domain experts to readily optimize their HPC codes for multicore chips and multichip nodes by using PerfExpert to analyze and optimize the node-level performance of their applications. PerfExpert is an expert system that captures knowledge of multicore chip architectures and compilers. It automatically detects probable performance bottlenecks in each important procedure and loop and identifies the likely cause of each bottleneck. For each bottleneck type, PerfExpert suggests optimization strategies, code examples, and compiler switches that the application developer can use to improve performance. PerfExpert thus minimizes the effort and knowledge needed to perform node-level performance optimization for a given architecture. We have applied PerfExpert to several HPC production codes on TACC's Ranger supercomputer with considerable success, including accelerating a global Earth mantle convection simulation running on 32,768 cores by 40%. PerfExpert has also been taught successfully in TACC's performance optimization classes for Ranger.

This tutorial will begin by discussing the complexity of node-level performance tuning. Then we will introduce PerfExpert and give a live demonstration of its operation on several HPC applications. We will study in detail a representative set of source-code optimizations that we have found to result in substantial speedups on a variety of HPC applications. Participants are invited to apply PerfExpert on Ranger, with the organizers available to answer questions. (Participants interested in experimental use of PerfExpert on Ranger should contact the organizers at burtscher@ices.utexas.edu well in advance for support with porting an application to Ranger.) The tutorial will conclude with a sketch of the process needed to set up an instance of PerfExpert on a new platform.

Title: Running Applications at Scale on the First Academic Petaflop Supercomputer

Download the tutorial (PDF)

LEVEL: Intermediate (Full Day)

Participants must provide their own laptop computers for the hands-on sessions and are encouraged to bring their own codes. Participants are expected to have experience using a UNIX-type OS and with the basics of running parallel scientific applications, such as compiling, using batch systems, and MPI and/or OpenMP.

Glenn Brook (NICS)
Lonnie Crosby (NICS)
Meng-Shiou Wu (NICS)
and Haihang You (NICS)

This tutorial is focused on the effective utilization of the first academic petaflop supercomputer: Kraken, a Cray XT5 located at the National Institute for Computational Sciences (NICS). Because Kraken is the first and only petaflop supercomputing resource on the TeraGrid, special attention will be given to the scalability of application codes at this scale. A quick introduction to Kraken will provide the framework for lectures and hands-on exercises covering available tools and software, performance optimization techniques for MPI and OpenMP, performance optimization techniques for I/O, and a comparison of available performance monitoring tools such as CrayPAT and TAU. Ample time will be given for individual discovery and the answering of questions.


08:00 Introduction to Kraken: Cray XT5. A quick overview detailing the specifics of this implementation of the Cray XT5 will be presented. Topics will include access, running batch jobs, and the software environment.
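For reference, a minimal batch script for a Cray XT5 system of this kind might look like the sketch below. The job name, account string, core count, and executable are placeholders, and current syntax should be checked against the NICS user documentation:

```shell
#!/bin/bash
#PBS -N example_job
#PBS -j oe
#PBS -l walltime=01:00:00,size=24   # size = number of cores requested (multiples of 12 on XT5 nodes)
#PBS -A XX-YYYYYY                   # placeholder project account

cd $PBS_O_WORKDIR
aprun -n 24 ./my_app                # aprun launches the executable on the compute nodes
```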

08:30 Available Software and Tools. The software and tools available on Kraken will be presented. Topics will include the availability and use of debuggers, diagnostic tools, and performance tools. Compilation methodology will also be discussed.

09:15 Hands-On Session. A hands-on session will be conducted to allow participants to access Kraken and prepare for the upcoming hands-on sessions.

09:45 Break

10:00 MPI/OpenMP Techniques and Performance Optimization. A discussion of various MPI and OpenMP techniques will be presented with emphasis on best practices and scalability. Other topics will include the Cray XT5 interconnect, the MPT library, and common problems.

11:00 Hands-On Session. A hands-on session will be conducted to give participants the opportunity to investigate MPI techniques and performance optimizations. An example code with instructions will be provided to facilitate individual discovery. Additionally, participants may bring and work on their own personal codes.

12:00 Lunch

13:00 Performance Tools. The various performance tools installed on Kraken (FPMPI, CrayPAT, PAPI, IPM, and TAU) will be discussed, covering performance measurement, profiling, tracing, and analysis focused on improving application performance. The strengths of each tool in different scenarios will also be highlighted.
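As an illustration of the kind of workflow covered here, CrayPAT profiling typically follows a build, instrument, run, report cycle. The module name, core count, and file names below are placeholders, and exact options vary by CrayPAT version:

```shell
module load xt-craypat        # CrayPAT module on Cray XT systems (name may vary)
cc -o my_app my_app.c         # build with the Cray compiler wrappers
pat_build -O apa my_app       # instrument the binary; produces my_app+pat
aprun -n 24 ./my_app+pat      # run the instrumented binary inside a batch job
pat_report my_app+pat+*.xf    # generate a profiling report from the collected data
```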

14:00 I/O Techniques and Performance Optimization. A discussion of various I/O paradigms will be presented with emphasis on scalability. Additional topics will include parallel I/O with MPI-IO, the Lustre file system, and common problems.
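For context, file striping is one of the common Lustre tuning knobs discussed in this session; it is controlled per file or per directory with the lfs client utility. The directory path below is a placeholder, and option spellings vary across Lustre versions:

```shell
lfs setstripe -c 8 -s 4m /lustre/scratch/$USER/run01   # stripe new files in run01 across 8 OSTs with 4 MB stripes
lfs getstripe /lustre/scratch/$USER/run01              # verify the striping layout
```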

15:00 Break

15:30 Hands-On Session. A hands-on session will be conducted to give participants the opportunity to investigate I/O techniques and performance optimizations. An example code with instructions will be provided to facilitate individual discovery. Additionally, participants may also investigate the use of performance tools on example codes or their own personal codes.

17:00 Adjourn

The Scientific Computing Group at the National Institute for Computational Sciences (NICS) has ample experience in providing and presenting educational workshop and tutorial material. In 2009, staff contributed to eight workshops or tutorials and gave thirteen presentations (in addition to those given for workshops and tutorials). Venues for these workshops and tutorials included The University of Tennessee, The University of California at Berkeley, Oak Ridge National Laboratory, Oak Ridge Associated Universities, Texas Advanced Computing Center, TeraGrid 2009, and The National Institute for Computational Sciences. Individual presentations were given at various TeraGrid forums, The University of Tennessee, the Cray User Group Meeting, and The National Institute for Computational Sciences.

HPC Advisory Council Tutorial: HPC Applications Best Practices and the Effect of New Acceleration Technologies

Formed in May 2008, the HPC Advisory Council (http://www.hpcadvisorycouncil.com) is a high-performance computing educational and outreach center consisting of 150 organizations. The HPC Advisory Council includes four special interest subgroups – HPC|Works, HPC|Scale, HPC|Storage, and HPC|Cloud – and operates a technology-leading computing center to support development, research, and exploration of high-performance applications, solutions, and technologies. The center also provides free access to users around the world.


  • Gilad Shainer (HPC Advisory Council chairman)
  • Tong Liu (HPC Advisory Council HPC|Works subgroup Chair)
  • David Cownie
  • Jeff Layton

The “HPC Applications Best Practices and the Effect of New Acceleration Technologies” tutorial will provide an overview of the latest technologies reviewed as part of the HPC Advisory Council's activities (such as GPUDirect, MPI offloads, HPC in clouds, RoCE, and more), as well as an in-depth overview of library and application optimizations, profiling at small and large scale (including results from some of the world's fastest supercomputers), and best practices. Hands-on activities around application best practices will be included as part of the tutorial.

8:00-8:15: HPC Advisory Council activities overview
8:15-9:15: Next generation technologies overview: GPUDirect for GPU accelerations, MPI and SHMEM offloading, RoCE (RDMA over Ethernet), InfiniBand FDR/EDR roadmaps, next generation CPUs and large scale integrations, NAS, SAN and cluster file systems
9:15-10:00: Applications profiling 101 – tools, compilers, setup, libraries, and troubleshooting for CPU/memory/networking; application installation best practices (such as OpenFOAM, OpenAtom, CCSM, NWChem, NAMD, etc.)
10:00-10:15: Break
10:15-11:15: Application best practices and lessons learned at large scale, including setup and troubleshooting, different MPI implementations, use of collective operations and offloading, performance optimization, self-built Lustre cluster file systems, power efficiency, and more
11:15-12:00: Hands-on use of the HPC Advisory Council systems center, open discussions, and live troubleshooting. As time permits, an overview of HPC in cloud computing and a demonstration of an HPC cloud environment

For questions on the tutorial material, please contact info@hpcadvisorycouncil.com