XSEDE16 Poster Session

Tuesday 5:15pm – 6:15pm

Biscayne Ballroom

Biomedical disease named entity recognition: An ensemble approach

Thomas Hahn, Hidayat Ur Rahman and Dr. Richard Segall

In biomedical literature, "named entities" are words or sequences of words that denote specific terms such as protein, DNA, RNA or disease names. The process of tagging individual entities is called "named entity recognition" (NER). The performance of biomedical NER is much lower than that of general NER (e.g. names, countries, dates, times, monetary amounts) for two reasons. First, biomedical entities vary widely in length: disease names can be as long as seven words (e.g. "Familial deficiency of the seventh component of complement") or as short as two letters (e.g. "HD"). Second, biomedical entities lack a consistent morphology; they are not simple proper nouns and may contain digits, letters and Greek symbols, which further increases the ambiguity of classification. Our research focuses on disease name recognition using the National Center for Biotechnology Information (NCBI) disease dataset. For this purpose we combined several weak learners: statistical learners (naïve Bayes and Bayesian networks) and rule-based learners (Partial Decision Trees (PART), naïve Bayes combined with a decision table (DTNB), and non-nested generalized exemplars (NNGE)). The performance of these classifiers was evaluated using standard metrics: precision, recall and F-score. We propose a hybrid classification approach that combines the classifiers by majority voting.
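The majority-voting combination step can be sketched as follows. This is an illustrative sketch rather than the authors' implementation, and the per-token label sequences below are invented:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-token NER labels from several classifiers by majority vote.

    predictions: one label sequence per classifier, all the same length.
    Ties go to the label seen first (i.e. the earliest classifier's vote).
    """
    combined = []
    for votes in zip(*predictions):
        label, _ = Counter(votes).most_common(1)[0]
        combined.append(label)
    return combined

# Hypothetical per-token labels from three of the weak learners
nb   = ["B-DISEASE", "O", "O", "B-DISEASE"]
part = ["B-DISEASE", "B-DISEASE", "O", "O"]
nnge = ["O", "O", "O", "B-DISEASE"]

print(majority_vote([nb, part, nnge]))  # ['B-DISEASE', 'O', 'O', 'B-DISEASE']
```

Any odd number of voters avoids two-way ties; with an even number, the tie-breaking rule above simply favors the classifier listed first.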


Thomas Hahn, Shen Lu and Dr. Richard Segall

The use of pattern recognition software is not new in the field of bioinformatics, but it is perhaps not as developed as it should be. Motivated on the one hand by the growing amounts of data produced by various microarray technologies and other devices, and on the other by an appreciation that there is more to the vast amount of 'non-coding' DNA in H. sapiens, we survey microarray experimental data to identify possibilities and problems in controlling the quality of microarray expression data. We use both variable and attribute measures to visualize microarray expression data. Based on the data structure of the attributes, we use control charts to visualize fold-change and t-test attributes in order to find the root causes of analytical variation in microarray data quality.

Computational analysis for modeling bluff-body flames

Lu Chen and Francine Battaglia

Flames are used in a wide range of industrial applications such as engines, boilers and furnaces. One type of burner uses a bluff body that separates the fuel and the air prior to reaction and stabilizes the flame during the industrial process. This unique burner is an ideal case for investigating the interaction between chemical reactions and turbulence, and it connects theoretical problems with engineering applications. However, bluff-body flames remain challenging to model because of the complexity of both the turbulent flow and the chemistry. Computational fluid dynamics (CFD) can be used to understand reacting flows and to help with the analysis and design of burners. The poster will include CFD predictions validated against the experiments of Correa and Gulati (1992), who measured the major species of a non-premixed flame of 27.5% CO, 32.3% H2, 40.2% N2 and air. A comparative study of five turbulence models (standard k-epsilon, modified k-epsilon, realizable k-epsilon, RNG k-epsilon and RSM) is conducted and evaluated against those experiments. With the validated numerical models, novel CFD results on dynamic mechanisms will also be presented for the complex geometry of an industry-designed combustor. Broadly, this research will significantly enhance the understanding of combustion mechanisms inside a complicated prototype burner geometry and could positively impact industrial applications of bluff-body burners by providing cleaner, more efficient energy. CFD is a promising tool for designing, troubleshooting and optimizing reactors to reduce pollutants, and HPC is a key element in enabling efficient simulations of complex systems across large numbers of computational nodes.

SEAGrid User Authentication and Data Management Enhancements Using Globus

Sudhakar Pamidighantam, Stuart Martin and Eric Blau

This poster will highlight the new user authentication enhancements planned for the SEAGrid Gateway (SEAGrid.org) by integrating with the Globus (www.globus.org) Auth service.

The SEAGrid rich client is a JavaFX-based desktop application that provides an interface for executing quantum chemistry applications on XSEDE and other compute resources. It interfaces with the Apache Airavata infrastructure for all file and job management. When executing jobs on XSEDE resources, the SEAGrid gateway uses a community account through which all user jobs are submitted. Individual identities are managed through a WSO2 identity management system in Airavata. Improvements are planned to leverage the new Globus Auth service to provide single sign-on to both SEAGrid and Globus, which will make it straightforward to integrate Globus file transfer and sharing into the SEAGrid rich client. The poster will show the component architecture for the workflow and security.

Globus is delivering innovative cloud-based, web accessible, Software as a Service (SaaS) to support big data management, analysis, and collaboration for science. The services aim to bring sophisticated capabilities to research communities that to date may have found such resources to be out of reach without specialized personnel or infrastructure.

SIMULOCEAN science gateway using Docker on Bridges and Globus for user authentication and data management

Jian Tao, Eric Blau, Shui Yuan, Mona Wong, Stuart Martin and Qin J. Chen

SIMULOCEAN is a web-based scientific application and visualization framework for the management and deployment of software serving the coastal modeling community. The framework helps to collect observational data, schedule modeling codes for execution, manage data transfer, and visualize both observational and numerical results. With all the information collected, SIMULOCEAN can also provide direct validation and verification for models, and generate high quality technical reports.

Globus is delivering innovative cloud-based, web accessible, Software as a Service (SaaS) to support big data management, analysis, and collaboration for science. The services aim to bring sophisticated capabilities to research communities that to date may have found such resources to be out of reach without specialized personnel or infrastructure.

SIMULOCEAN has been deployed as an XSEDE Science Gateway and includes two novel aspects. The first is the use of XSEDE's new High Performance Computing (HPC) system, Bridges, and its support for Docker, which allows the application, data and other dependencies to be packaged into a standardized unit for execution. Second, SIMULOCEAN uses the new Globus Auth, an identity and access management system, to broker authentication and authorization interactions, as well as the new Globus file transfer and sharing capabilities for managing users' large files. The poster will show the component architecture for the workflow and security.

Efficient Seismic Modeling Using a Poroelastic Approach

Khemraj Shukla, Jan S Hesthaven and Priyank Jaiswal

Popular methods of seismic modeling fall into the acoustic or elastic category, which assume the subsurface to be purely fluid or purely solid, respectively. In reality, wave propagation in porous media excites both the mineral grains (solid) and the interstitial fluids. To honor the physics of solid-fluid interaction during wave propagation, the porous medium needs to be described by a poroelastic system of equations, which combines the wave equation with a dissipation potential (originating from Darcy's law). To date, however, very few poroelastic applications have used real field data, mainly due to computational challenges. Numerical methods for modeling the complex poroelastic system have not been fully explored in the scientific literature.

We have developed the poroelastic system for orthorhombic media in conservation form by combining Biot's theory with Hamiltonian mechanics. This system comprises eight time-dependent hyperbolic PDEs in 2D and thirteen in 3D. Eigendecomposition of the poroelastic system predicts the presence of an additional diffusive wave, known as the slow P-wave, with a velocity slower than the shear wave. We solve the poroelastic system using a nodal discontinuous Galerkin (nodal-DG) approach, currently on triangular meshes in 2D, extendable to tetrahedral meshes for 3D problems.

We are implementing the nodal-DG scheme in a parallel environment using two levels of granularity. The coarser level is achieved by partitioning the mesh and distributing it across cores. The nodal numerical scheme is inherently thread-safe, which provides the finer level of granularity automatically; to exploit it we use an NVIDIA Tesla K40 GPU. Although current testing uses the Lax-Friedrichs flux, problem-specific development of an upwind flux and a Roe solver is in progress.

Using Jetstream as a Machine Learning Cloud to Detect Chemicals

Mengyuan Zhu

In this project, we combined fluorescence studies, 3D printing, mobile platform development, machine learning and cloud computing to detect chemicals. We designed a device that attaches to a phone case and captures fluorescent light intensity using the phone camera. The captured image is uploaded to the XSEDE resource Jetstream, which computes the result and sends the concentration data back to the phone for display.
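The intensity-to-concentration step can be illustrated with a simple linear calibration fit. The numbers below are invented, and the project's actual model (machine learning on Jetstream) is presumably more sophisticated than a straight line:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical calibration: known concentrations vs. measured intensities
conc = [0.0, 1.0, 2.0, 4.0]
intensity = [10.0, 30.0, 50.0, 90.0]
a, b = linear_fit(conc, intensity)     # slope and intercept of the curve

def concentration(measured_intensity):
    """Invert the calibration curve for a new phone-camera reading."""
    return (measured_intensity - b) / a

print(concentration(70.0))  # → 3.0
```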

XDMoD Job Viewer: A Tool to Monitor Job Performance

Steven M. Gallo, Matthew Jones, Abani Patra, Jeanette Sperhac, Thomas Yearke, Jeffrey Palmer, Nikolay Simakov, Martins Innus and Ben Plessinger

XDMoD is a web-based tool that provides XSEDE stakeholders with a variety of usage and performance data. One new feature of XDMoD is the Job Viewer, which allows users and user support personnel to view detailed job-level performance information. Users can inspect detailed time-dependent performance data (from TACC_Stats) to determine how efficiently a job ran and, if performance was poor, to gain insight into possible causes of the problem and how to fix it. This poster provides basic information about the job-level performance data available through the XDMoD Job Viewer and how to use it to assess job performance.

Collective dynamics of PKM2 by principal component and contact analysis

Kenneth Huang, Kathy Li, Chunli Yan, Ivaylo Ivanov and Peng Wang

Cancer cells have altered metabolic functions to meet significantly increased energy demands; metastatic tumor cells also need an abundance of phosphometabolites to synthesize components for cellular proliferation. Pyruvate kinase muscle isoform 2 (PKM2) is a key facilitator of this process, known as the Warburg effect. PKM2 is preferentially expressed in cancer cells and has been designated a biomarker in various cancers. Increasing pyruvate kinase activity by activating PKM2 can suppress tumor growth, making PKM2 a promising target for reducing cancer cell proliferation. We previously found that micheliolide (MCL) activates PKM2 by covalently binding to a cysteine conserved in human PKM2, but the precise mechanics of this stability change have been difficult to determine. Using molecular dynamics simulations to observe conformational changes occurring throughout the protein across multiple variants of PKM2, we explored how principal component analysis in conjunction with contact statistics may reveal the exact mechanism that favors tetrameric stability over dissociation.
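The principal component step can be illustrated in miniature: PCA of conformations amounts to diagonalizing the covariance of (aligned) coordinates and reading off the dominant collective motions. A toy 2-D version with a closed-form 2x2 eigendecomposition, not the authors' analysis pipeline:

```python
import math

def leading_pc(points):
    """Leading principal component of 2-D points via the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]]
    lam = 0.5 * (sxx + syy) + math.sqrt(0.25 * (sxx - syy) ** 2 + sxy ** 2)
    vx, vy = sxy, lam - sxx            # corresponding (unnormalized) eigenvector
    norm = math.hypot(vx, vy)
    if norm == 0:                      # covariance already diagonal
        return (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    return (vx / norm, vy / norm)

# Points spread mostly along the y = x direction: the leading PC recovers it.
pc = leading_pc([(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0)])
```

For a real trajectory the covariance matrix is 3N x 3N (N atoms) and is diagonalized numerically after removing overall rotation and translation.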

SeedMe2: Data sharing building blocks

Amit Chourasia, David Nadeau, John Moreland, Dmitry Mishin and Michael Norman

Computational simulations have become an indispensable tool in a wide variety of science and engineering investigations. Nearly all scientific computation and analyses create important transient data and preliminary results. Transient data include information dumped while a job is running, such as coarse output and run statistics. Preliminary results include data output by a running or finished job that needs to be quickly processed to get a view of the job's success or failure. These job output data provide vital guidance that helps scientists review a current job and adjust parameters for the next job to run. Quick and effective assessment of these data is necessary for efficient use of computation resources, but this is complicated when a large collaborating team is geographically dispersed and/or some team members do not have direct access to the computation resource and output data. Current methods for sharing and assessing transient data and preliminary results are cumbersome, labor intensive, and largely unsupported by useful tools and procedures. Each research team is forced to create its own scripts and ad hoc procedures to push data from system to system and user to user, and to make quick plots, images, and videos to guide the next step in their research. These custom efforts often rely on email, ftp, and scp, despite the ubiquity of much more flexible dynamic web-based technologies and the impressive display and interaction abilities of today's mobile devices. Better tools, building blocks, and cyberinfrastructure are needed to support transient data and preliminary results sharing for collaborating computational science teams.

The SeedMe project is developing web-based building blocks and cyberinfrastructure to enable easy sharing and streaming of transient data and preliminary results from computing resources to a variety of platforms, from mobile devices to workstations, making it possible to quickly and conveniently view and assess results and providing an essential missing component in High Performance Computing (HPC) and cloud computing infrastructure. This work is an evolution of the SeedMe project [1, 2, 3] and will ultimately offer modular and flexible data sharing building blocks to the computational community. The building blocks will include authentication/authorization, granular access controls, data sharing and indexing, and microformat ingestion and presentation for dashboard-like functionality.

The SeedMe building blocks are broadly applicable to a diverse set of scientific and engineering communities, and SeedMe will be released as a suite of open-source, modular building blocks that may be extended by others. With this poster we would like to showcase current progress on the project and engage with the HPC community for feedback.

Initial design and documentation for the project are available at the project website.


1. SeedMe. 2016. SeedMe (Stream Encode, Explore and Disseminate My Experiments) Retrieved May 2, 2016 from https://www.seedme.org

2. A. Chourasia, M. Wong-Barnum, M. Norman. SeedMe Preview: Your Results from Disk to Device. In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery (XSEDE '13). ACM, New York, NY, USA, Article 35, 4 pages.

3. A. Chourasia, M. Wong-Barnum, M. Norman. SeedMe: A Cyberinfrastructure for Sharing Results. Presented at the XSEDE 2015 conference, St. Louis, MO, Jul 29, 2015

Connecting XSEDE Resources to Geospatial Data Analysis Workflows using GABBS Building Blocks

Lan Zhao and Carol Song

Geospatial Data Analysis Building Blocks (GABBs) is an NSF-funded Data Infrastructure Building Blocks (DIBBs) project to create a powerful yet easy-to-use web-based system providing software building blocks that allow researchers, who are typically not computing experts, to self-manage, curate, share, analyze, and visualize geospatial data for their research. This opens the way for rapid development of a variety of web-enabled interactive tools for probing and presenting geospatial data. The development of the GABBs building blocks is driven by requirements from several user communities, including hydrologic modeling and data sharing, applied economics modeling workflows, meteorological data management and visualization, K-12 education tools, and higher-education course development with online modeling support. GABBs also integrates XSEDE resources, Globus, iRODS and other cyberinfrastructure, enabling large-scale computation and data capabilities for the user community. This poster will present the design and implementation of the GABBs software, consisting of geospatial data management functions integrated into HUBzero's Project data space, software- and hardware-based mapping libraries and map viewer widgets for geospatial tool development, and dynamic invocation of data analysis tools from the Hub Project data space. Several use cases will illustrate how GABBs interoperates with Globus Online and XSEDE to connect geospatial data analysis workflows with HPC resources, giving a broad community easy access to advanced national computational infrastructure.

Enabling Hydrological Modeling Education Using an XSEDE Science Gateway

Lan Zhao, I Luk Kim, Venkatesh Merwade, Adnan Rajib and Carol Song

Most hydrologists use hydrologic models such as SWAT (Soil Water Assessment Tool) to simulate hydrologic processes and to understand hydrologic pathways and fluxes for research, decision making and engineering design. However, most hydrologic models have steep learning curves, intensive input data needs, and demands for computational resources that are usually not available to researchers or educators. To bridge this gap, an online application called SWATShare was developed and deployed on the WaterHUB science gateway (https://mygeohub.org/groups/water-hub), which aims to enable sharing of hydrologic data and modeling tools in an interactive environment. Users can use SWATShare to search and download existing SWAT models, upload new SWAT models, and run and calibrate a SWAT model and visualize the results. Metadata such as the name of the watershed, the name of the person or agency who developed the model, the simulation period, the time step, and the list of calibrated parameters are also published with each model. The capability for online model sharing and execution reduces duplication of effort in input data preparation, opens doors for new collaborations, and makes it possible to teach hydrologic modeling with real-world examples in classroom settings. In this poster we will present the ECSS effort in connecting SWATShare with the XSEDE HPC resource Comet, and our experience supporting a hydrologic modeling class (CE54900) at Purdue in Spring 2016, in which students gained hands-on experience comparing hydrologic processes under different geographic and climatic settings. Through this new way of teaching, the students not only learned how to use the SWAT simulation tool but also learned about other aspects of scientific collaboration, such as metadata and model sharing.

Heavy-metal contamination and its effects on microbial community structure in soils near Picher, OK, within the Tar Creek Superfund Site

Rachelle Beattie, Wyatt Henke, Conor Davis, Maria Campa, Terry Hazen, Aaron Johnson, Nigel Hoilett, Rex McAliley and James Campbell

The Tri-State Mining District of Missouri, Kansas and Oklahoma was the site of large-scale mining operations for lead, zinc, and other heavy metals until the mid-1950s. Although mining across the area has ceased, high concentrations of heavy metals remain in the region's soil and water systems. The town of Picher, OK, lies within this district and was included in the Tar Creek Superfund Site by the US Environmental Protection Agency in 1980 due to extensive environmental contamination and its effects on residents' health.

To elucidate the extent of heavy-metal contamination, a soil chemistry survey of the town of Picher in Ottawa County, Oklahoma was conducted. Samples (n=111) were collected in August 2015 from mine tailings, locally known as chat, in Picher and along cardinal-direction transects within an 8.05-km radius of the town, and analyzed for soil metal content using Inductively Coupled Plasma Optical Emission Spectrometry (ICP-OES), as well as for pH and moisture content. Phospholipid Fatty Acid (PLFA) analyses, Next-Generation Sequencing (NGS) of prokaryotic 16S rRNA genes and qPCR quantification of total Bacteria and Archaea have been used in a systems biology approach to compare soil chemistry to microbial community structure in these contaminated soils.

Most statistical aspects of this project can be approached using standard laptop computers, but NGS data present an analytical bottleneck. Thus far, more than 500,000 sequences have been generated and must be condensed into "Operational Taxonomic Units" (OTUs) using pairwise comparisons and binning based on genetic distance. While conceptually simple, the magnitude of these calculations requires specialized software and HPC resources. Accordingly, software such as mothur and QIIME is being employed on TACC's Stampede system.
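The distance-based binning step can be sketched in miniature. Mothur and QIIME implement far more sophisticated and better-validated clustering algorithms; the greedy, seed-based scheme below only illustrates how pairwise genetic distance drives OTU assignment, using a toy cutoff and invented aligned sequences:

```python
def distance(a, b):
    """Fraction of mismatched positions between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def greedy_otus(seqs, cutoff=0.03):
    """Assign each sequence to the first OTU whose seed is within `cutoff`,
    or start a new OTU if none is close enough."""
    seeds, otus = [], []
    for s in seqs:
        for i, seed in enumerate(seeds):
            if distance(s, seed) <= cutoff:
                otus[i].append(s)
                break
        else:                      # no seed was close enough
            seeds.append(s)
            otus.append([s])
    return otus

reads = ["ACGTACGT", "ACGTACGA", "TTGTACGT", "TTGTACGG"]
print(greedy_otus(reads, cutoff=0.15))
```

The quadratic number of pairwise comparisons is what turns half a million reads into an HPC problem.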

ICP-OES analyses of 20 metals showed high concentrations of lead (>1000 ppm), cadmium (>40 ppm) and zinc (>4000 ppm) throughout the sampled region. Soil moisture content ranged from 0.30-35.9%, and pH values ranged from 5.14-7.42. MANOVA of the metal profiles determined that soils collected from the north transect were significantly different (p=0.001) from those of the other sampled directions. Lead, cadmium and zinc were correlated with one another; moisture content was significantly correlated with cadmium (p=0.016), and pH was significantly correlated with aluminum (p<0.001) and zinc (p=0.049). These data show an unequal distribution of contamination surrounding the Picher mining site. qPCR analysis of total bacteria indicated a positive, significant correlation with moisture content but negative, significant correlations with lead, cadmium, zinc and magnesium. Total numbers of archaea did not correlate significantly with any measured variables. Illumina sequencing of 16S rRNA genes was used to elucidate changes in community structure. A significant proportion of the variation in these data was explained by pH (12.5%), lead (0.07%), cadmium (0.06%) and zinc (0.08%). Mapping the distribution of heavy-metal contamination and microbial communities in these soils represents the first step in understanding the effects of heavy-metal contamination at a basic trophic level.

Wasp: Intelligent Storage for Gridded Numerical Data

John Clyne, Larry Frank, Tom Lesperance, Alan Norton and Scott Pearse

Advances in HPC are enabling numerical simulations of a broad gamut of scientific phenomena at unprecedented scale. At the same time, digital imaging technologies are revolutionizing a wide range of scientific disciplines by facilitating the acquisition of high-resolution 2D, 3D, and even 4D imagery. These capabilities come at a cost: increasing data size and complexity require more sophisticated methods for data analysis and visualization. This is particularly true in the biological and geosciences, two seemingly very different disciplines that share a common problem, for which we provide a common solution. In this poster we present our work on the NSF-funded Wavelet-enabled progressive data Access and Storage Protocol (WASP). The WASP protocol defines an intelligent storage model for gridded data that supports three key capabilities for enabling highly efficient analysis of large, structured data sets: 1) efficient sub-setting; 2) progressive refinement; and 3) lossy or lossless compression.
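The progressive-refinement idea at the heart of a wavelet storage model can be illustrated with a one-level Haar transform. This is a deliberately minimal sketch; WASP's actual transform, grid handling, and storage layout are more involved:

```python
def haar_forward(data):
    """One level of an (unnormalized) Haar transform: pairwise averages + details."""
    avg = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
    det = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])]
    return avg, det

def haar_inverse(avg, det):
    """Exact reconstruction from averages and details."""
    out = []
    for m, d in zip(avg, det):
        out += [m + d, m - d]
    return out

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 0.0, 2.0]
avg, det = haar_forward(signal)

# Progressive access: the averages alone give a half-resolution preview ...
print(avg)                                   # [5.0, 11.0, 8.0, 1.0]
# ... and adding the details reconstructs the data exactly (lossless).
assert haar_inverse(avg, det) == signal
# Lossy compression: zero out small details before storing them.
stored_det = [d if abs(d) > 0.5 else 0.0 for d in det]
```

Applying the transform recursively to the averages yields a full multiresolution hierarchy, which is what makes coarse previews and sub-setting cheap.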

Evaluation of potential mean force for insertion of peptide into pore in DOPC/DOPG mixed lipid bilayer by molecular dynamics simulation

Yuan Lyu, Ning Xiang and Ganesan Narsimhan

Antimicrobial peptides (AMPs) inactivate microbial cells through pore formation in the cell membrane. Because their mode of action differs from that of antibiotics, AMPs can be used effectively to combat drug-resistant bacteria in human health; they can also replace antibiotics in animal feed and be immobilized on food packaging films. The energetics of adding an AMP to a transmembrane pore is important for evaluating pore growth. This study uses molecular dynamics (MD) simulation to characterize the potential of mean force for the addition of melittin, a naturally occurring AMP, to a DOPC/DOPG mixed bilayer (mimicking a bacterial cell membrane) at different extents of penetration, into either the bilayer or a pore consisting of three to six transmembrane peptides. The energy barrier for insertion of a melittin molecule into the DOPC/DOPG lipid bilayer was highest in the absence of transmembrane peptides and decreased as the number of transmembrane peptides increased from three to six, eventually approaching zero. The driving force for adding a peptide to the pore also increased with aggregate size. Water channel formation occurred only for insertion into pores consisting of four or more transmembrane peptides, with the radius of the water channel being larger for larger numbers of transmembrane peptides. The structure of the pore was found to be paraboloid, with the minimum radius at the center of the lipid bilayer. Estimated free energy barriers for insertion of melittin into an ideal paraboloid pore, accounting for different intermolecular interactions, were consistent with the MD simulation results. In total this study performed 18.4 microseconds of simulation, which consumed 1.0598 million SUs on XSEDE clusters (Comet and Stampede CPU nodes).

Towards Parallelization and Scalability of the Spatially Variant Lattice Algorithm

Henry Moncada, Shirley Moore and Raymond Rumpf

The purpose of this research is to develop a faster and more efficient implementation of a computational electromagnetics algorithm to generate spatially variant lattices. The algorithm synthesizes a spatially-variant lattice (SVL) for a periodic electromagnetic structure. It has the ability to spatially vary, or functionally grade, all of the attributes of a lattice without deforming the unit cells, which would weaken or destroy its electromagnetic properties. Attributes include lattice spacing, unit cell orientation, fill factor, lattice symmetry, the pattern within the unit cell, and material composition. The algorithm produces a lattice that is smooth, continuous, and free of defects, which is important for maintaining consistent properties throughout the lattice. So far, all SVL periodic structures have been built and simulated at small scale, using only a small number of unit cells; simulating such small arrays does not require much computer time. There is, however, an increasing desire for larger and more complex periodic structures, for example to enable 3D printing of devices with electromagnetic functionality. These new, complex configurations are not easily handled by a single computer because of the growing computational time required.

Our current research effort is to write a portable code for spatially variant lattices on parallel computer architectures. To develop the code, we chose a general-purpose programming language that supports structured programming. We began by writing an optimized sequential code that uses FFTW (the Fastest Fourier Transform in the West) for the Fourier decomposition of the unit cell and CSparse (a concise sparse matrix package in C) for the numerical linear algebra operations. For the parallel code, we use FFTW for the Fourier transform of the unit cell and PETSc (Portable, Extensible Toolkit for Scientific Computation) for the numerical linear algebra operations. Using the Message Passing Interface (MPI) for distributed memory improved the performance of the spatially variant code when executed on a parallel system. We show performance and scaling results for our implementation on the TACC Stampede supercomputer.

Scientific Data Management with signac

Carl Simon Adorf, Paul Dodd and Sharon C. Glotzer

Researchers in the fields of computational physics, chemistry, and materials science are regularly faced with the challenge of managing large and heterogeneous data spaces. The amount of data increases in lockstep with computational efficiency multiplied by the amount of available computational resources, shifting the bottleneck within the scientific process from data acquisition to data post-processing and analysis. We present a framework designed to aid in the integration of various specialized formats, tools and workflows. The signac framework provides the basic components required to create a well-defined, and thus collectively accessible, data space, simplifying data access and modification through a homogeneous data interface. The framework's data model is designed not to require absolute commitment to the presented implementation, simplifying adoption for existing datasets and workflows, and it scales from individual workstations to supercomputers. This approach not only increases the efficiency of producing scientific results, but also significantly lowers barriers for collaborations requiring shared data access.
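The core of such a parameter-keyed data space can be sketched in a few lines: each job's data is addressed by a deterministic hash of its parameter dictionary (its "statepoint"). This is a toy illustration of the idea, not signac's actual API or hashing scheme:

```python
import hashlib
import json

def job_id(statepoint):
    """Deterministic id for a parameter dictionary: serialize with sorted keys,
    then hash, so logically identical statepoints always map to the same id."""
    canonical = json.dumps(statepoint, sort_keys=True)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# Key order does not matter -- both point at the same job.
sp_a = {"temperature": 1.0, "n_particles": 1000}
sp_b = {"n_particles": 1000, "temperature": 1.0}
assert job_id(sp_a) == job_id(sp_b)

# A workspace is then just a mapping from job ids to data locations
# (in practice, a directory tree keyed by these ids).
workspace = {job_id(sp_a): {"statepoint": sp_a, "results": None}}
```

Because the id is a pure function of the parameters, collaborators and scripts can locate any job's data from its parameters alone, without a central index.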

Human Placental Gene Expression: Collectins

Tashmay Jones and Raphael Isokpehi

Collectins (collagen-containing C-type lectins) are part of the innate immune system, forming a family of collagenous, calcium-ion-dependent defense lectins found in animals. The collectin (COLEC) gene family contains ten genes. The placenta is a highly specialized but relatively understudied organ of pregnancy that supports the normal growth and development of the fetus; it is the least understood human organ. The objective of this study was to determine the expression levels of individual members of the collectin family in normal placentae across the gestational ages of human pregnancy. We hypothesize that a combination of bioinformatics and visual analytics methods will help identify patterns in placental expression levels of collectins that could be of potential biological significance for placental function and development. Affymetrix probe set expression values (signal intensities) for the gene family were extracted from NCBI GEO DataSet GDS4037 (12 normal placentae at first trimester, second trimester, and term). A total of 7 probe sets were analyzed for the collectin gene family. A box plot representation identified probe set "221019_s_at" from the COLEC12 gene as having the highest expression level in the placenta data set. COLEC12 (CL-P1) is a scavenger receptor that mediates the uptake of oxidized low-density lipoprotein and microbes, and predominantly facilitates phagocytosis of fungi in vascular endothelia. CL-P1 mainly associates with cytotrophoblasts and syncytiotrophoblasts of the placenta. Future research could investigate COLEC12 expression as a non-invasive marker for predicting adverse pregnancy outcomes.

Teaching Scalable and Flexible Text Analysis with R

Tassie Gniady, Grace Thomas and Eric Wernert

Digital Humanities (DH) is often mislabeled as a new field because it has gained many new practitioners in recent years. However, beginning in the 1940s with Father Roberto Busa and his work on the Index Thomisticus, designed to search the corpus of Thomas Aquinas, humanists have been leveraging computers to gain insights into the written word. [1] As the community grows, it is important to acknowledge the history of DH, because this context makes a fundamental understanding of computing methods more appealing and reveals to newcomers that DH has long been marrying humanities analyses with computational ones.

Text analysis tools like Voyant [2] and TAG [3] perform significant computational processing for users while hiding most or all of the implementation details. Voyant 2.0 has a wide array of analysis tools, but all of the computation is "black boxed" unless one downloads and parses the entire Java code set. The idea behind TAG is more open: the user is guided through decisions from data retrieval to post-processing, and code generation is available; however, the overarching principle is still that TAG users do not want to write code or learn to create computational workflows themselves. We believe, to the contrary, that for scholars and students doing original research in this area, an understanding of the fundamentals of the coding behind text analysis is necessary for them to be full participants in the research and to be able to question results adequately. This poster presents our approach to teaching and promoting computational text analysis and documents initial reactions to the approach.

Over the past year, the Cyberinfrastructure for Digital Humanities Group at Indiana University has been developing an open instructional workflow for text analysis that aims to build algorithmic understanding and basic coding skills before scaling up analyses. Like TAG, we have chosen to bootstrap in R, a high level and high productivity language, with methods that are repeatable and sustainable. The aim is to provide code templates that can be adapted, remixed, and scaled to fit a wide range of text analysis tasks.

Typically, the lesson for each text analysis algorithm or workflow progresses through four stages: 1) a basic tutorial with a simple, interactive Shiny interface to build user intuition and understanding; 2) a detailed R Notebook which explicates each line of code and why it is necessary; 3) a lightly commented R script that the user can readily modify with their own data and parameters; and 4) one or more scalable versions of the R script which incorporate basic optimization and parallelization techniques, including multi-core and multi-node implementations. Two sample corpora are used throughout the lessons: Shakespeare's plays and Twitter scrapings from the 2016 Presidential primaries.
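To give a flavor of the progression, the stage-three style of lightly commented script might reduce to a minimal word-frequency template along these lines (sketched here in Python rather than the lessons' R, purely for illustration; the Shakespeare line is a stand-in corpus):

```python
from collections import Counter
import re

def word_frequencies(text, top_n=5):
    """Lowercase the text, tokenize on letters/apostrophes, and return
    the top_n most common words with their counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(top_n)

word_frequencies("To be, or not to be, that is the question.", top_n=3)
# → [('to', 2), ('be', 2), ('or', 1)]
```

A student would then swap in their own corpus and parameters, exactly as the stage-three R scripts are intended to be adapted.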

While the introductory analyses are basic in nature, they make the task of understanding both computational text analysis and the fundamentals of R less daunting. Moreover, they are consistent with the Campus Bridging philosophies of XSEDE, and show promise for expanding the variety of users and types of analyses that are able to utilize XSEDE resources.


1. Hockey, S. "The History of Humanities Computing." In A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, and John Unsworth. Oxford: Blackwell, 2004.


2. Sinclair, S. and G. Rockwell, 2016. Voyant Tools. Web. http://voyant-tools.org/.

3. Black, M. and D. Schmidt. XSEDE Text Analytics Gateway. 2015. https://github.com/XSEDEScienceGateways/TAG.

Drag Reduction Systems for Heavy Vehicles

David Manosalvas and Antony Jameson

Over 65% of the total energy consumed by heavy vehicles at highway speeds goes towards overcoming aerodynamic drag. Just a 12% reduction in fuel consumption across the national fleet of heavy vehicles would reduce diesel fuel consumption by 3.2 billion gallons per year and prevent the production of 28 million tons of CO2 emissions. This poster shows the use of computational tools for the design of add-on flow injection devices which manipulate the flow around the vehicle to reduce the size of the turbulent wake, and ultimately reduce aerodynamic drag.

Magnetic Fields In White Dwarfs

Boyan Hristov, David Collins, Peter Hoeflich, Charles Weatherford, Eva Hengeler and Tiara Diamond

Thermonuclear explosions of White Dwarf stars, Type Ia Supernovae (SNe Ia), are a cornerstone of modern cosmology. The light curves (LCs) and spectra are powered by the radioactive decay of 56Ni -> 56Co -> 56Fe, making gamma and positron transport key for spectral analysis and for understanding the well established diversity of SNe Ia/Iax. Progenitor systems include both single degenerate (SD) and double degenerate (DD) systems. The explosion itself may be triggered during the dynamical merging of two white dwarfs (WDs), or by compressional heat when the WD approaches the Chandrasekhar mass (MCh), which can originate from either SD or DD systems. Although the latter scenario may be favored by the LCs and spectra, including their statistical properties, all current 3D models of the deflagration show strong mixing, which is in conflict with observations. Recently, we have found evidence for high magnetic fields based on late-time NIR and MIR spectra.

We present studies to 1) determine the effects of magnetic fields on the nuclear burning; and 2) decipher the distribution of 56Ni and the magnitude and morphology of the magnetic fields based on the IR line profile of the [FeII] line at 1644 nm and on LCs. The simulations employ detailed non-LTE radiative transfer calculations for thermonuclear SNe together with MHD calculations. Comparisons with observed SNe Ia suggest consistency with WD masses close to the Chandrasekhar mass MCh, but also support high initial magnetic fields.

Developing a Decision Support System for Breast Cancer Detection on Mammograms using Image Processing & Machine Learning

Mahmudur Rahman and Nuh Alpaslan

Breast cancer is the second leading cause of cancer death among women after lung cancer. Earlier detection of breast cancer could reduce not only treatment cost but also patients' mortality and morbidity rates. Currently, screening mammography is the standard and recommended preventive care procedure and is estimated to result in a 3-13% reduction in mortality. However, interpreting mammograms is often subjective and error-prone, with inter-observer variability. Current computer-aided detection and/or diagnosis (CAD) systems have little or no impact in helping radiologists detect the more subtle cancers associated with mass-like abnormalities, due to relatively low performance in mass detection. This work aims to provide radiologists with an interactive visual aid for clinical viewing of mammographic masses that responds to image-based visual queries on an automatically segmented suspicious mass region by displaying mammograms of relevant masses from past cases that are similar to the queried region, as well as predicting the image categories (e.g., malignant, benign, and normal masses). Our hypothesis is that providing a computerized library with a set of pathologically confirmed images of past cases can refresh the radiologist's mental memory and guide them to a precise diagnosis with concrete visualizations, instead of suggesting a second diagnosis like many other CAD systems. However, the most challenging problem in this task is detecting the mass against the background and extracting discriminative local features of clinical importance. For mass detection, image pre-processing is performed with the top-hat morphological operation, and a marker-controlled watershed segmentation algorithm is applied to images with no background artifacts.
To extract features invariant to linear shift and rotation, the histogram of oriented gradients (HOG) and the co-occurrence histogram of oriented gradients (CoHOG) methods are used in addition to several geometric features. For image retrieval and classification, a two-class study (normal and abnormal) and a three-class study (normal, benign, and malignant cases) are carried out. For retrieval, performance evaluation was done using precision and recall curves obtained from comparison between the query and retrieved images. Classification is performed on the individual and combined input feature spaces using Support Vector Machines (SVM) with 10-fold cross validation. The proposed system is tested on the widely used Digital Database for Screening Mammography (DDSM) of 2,604 cases, including craniocaudal and mediolateral-oblique views. The results demonstrate the effectiveness of the proposed system based on precision (79% to 83% considering the first 25% of the retrieved images) and classification accuracy (93.83%), and show the potential of the implemented methodology to serve as a diagnostic aid for mammography in real clinical applications.

Improving Karnak prediction service

Jungha Woo, Shava Smallen and John-Paul Navarro

Karnak (http://karnak.xsede.org/karnak/index.html) is a job queue wait time prediction service for XSEDE resources including Comet, Darter, Gordon, Maverick, and Stampede. Karnak users include individual researchers and science gateways that consult wait time predictions to decide where to submit their computation within XSEDE.

Based on feedback from the community, this XSEDE Software Development and Integration (SD&I) project aims at improving the Karnak service to increase the accuracy of its predictions. This poster will describe Karnak's design, the machine learning technique used, and the accuracy improvement made through this SD&I project.

Karnak uses a decision tree machine learning technique for prediction. It uses 27 predictors (features) to estimate the queue wait time of jobs that have been submitted. We will present a visualization of the decision trees and benchmark results comparing the improved Karnak prediction with a multiple linear regression algorithm. Karnak's decision trees have been enhanced by reducing the number of predictors, since a large number of variables typically leads to overfitting and poor predictions. The multiple linear regression algorithm will be presented as an alternative to the decision tree algorithm, as it produces simple, easy to understand, and globally optimized wait time estimates. Only a small number of key predictors (features) will be utilized in the linear regression equation.
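As a sketch of the linear regression alternative, an ordinary least squares fit can be computed by solving the normal equations directly. The two predictors below (jobs ahead in the queue, requested node count) and the wait times are invented for illustration and are not Karnak's actual 27-feature set:

```python
def fit_linear(X, y):
    """Ordinary least squares: solve the normal equations (X^T X) b = X^T y
    by Gaussian elimination with partial pivoting. Each row of X starts
    with 1 for the intercept term."""
    n = len(X[0])
    # Build the normal-equation system
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    # Forward elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, n))) / A[i][i]
    return coef

# Invented predictors: [intercept, jobs ahead in queue, requested node count]
X = [[1, q, nodes] for q, nodes in [(2, 1), (5, 4), (8, 2), (12, 8), (3, 2)]]
y = [13.0, 34.0, 45.0, 77.0, 20.0]  # synthetic wait times (minutes)
coef = fit_linear(X, y)
pred = coef[0] + coef[1] * 6 + coef[2] * 3  # wait estimate for a new job
```

The appeal noted in the abstract is visible here: the fitted coefficients are directly interpretable (extra minutes of wait per queued job, per requested node), unlike the paths of a deep decision tree.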

High Performance Visualization Pipeline for LiDAR Point Cloud Data

Ayat Mohammed, Faiz Abidi, Srijith Rajamohan and Nicholas Polys

Light Detection and Ranging, or Laser Imaging, Detection and Ranging (LiDAR), is a remote-sensing technique that uses laser light to sample the surface of the earth, generating a highly dense, accurate, georeferenced x,y,z data set. LiDAR, primarily used in airborne laser mapping applications, is cost-effective compared to traditional surveying techniques such as photogrammetry. Numerous fields use LiDAR data, including agriculture, urban planning, forestry, 3D archeological reconstruction, hydrologic modeling, terrain analysis, and infrastructure design. In this poster we provide a High Performance Visualization Pipeline (HPVP) that uses open source tools to manage and visualize LiDAR point clouds. Using this pipeline, researchers from STEM fields as well as the humanities can analyze and visualize massive point clouds using open source tools enabled on Virginia Tech High Performance Computing (HPC) clusters.

Finite Water-Content Module: Teaching Hydrology and HPC Using OnRamp

Jason Regina and Samantha Foley

The OnRamp project has developed a gateway that significantly increases access to high-performance computers. It contains a web application and several educational modules that allow students from diverse backgrounds to begin learning and applying parallel computing concepts quickly. The web application provides an intuitive graphical front-end, so students can focus on learning specific parallel computing concepts without generating complex job scripts run from a command line. Currently, OnRamp has been deployed on flux, a SLURM cluster at the University of Wisconsin-La Crosse, and on a LittleFe mini-cluster using a PBS scheduler. Future targets include XSEDE's Stampede, Blue Waters at the University of Illinois at Urbana-Champaign, and Mount Moran at the University of Wyoming.

As part of the XSEDE scholars program, a module was developed for OnRamp featuring the Finite Water-Content (FWC) method used in hydrological modeling. The FWC method is used to simulate the infiltration of water into the ground, and can serve as a starting point for more complex, large-scale watershed models. Similar models have already seen use by watershed managers to explore flood and drought prediction, as well as the effects of changing land-use on water supply.

The FWC module is intended to introduce computational hydrology students to parallel computing, and provide computer science students with a real physics-based application. The module allows students to alter soil characteristics, rainfall intensity, and other model parameters to test model performance and accuracy. In terms of application, the FWC module acts as an introduction to how parallel computing can be leveraged to solve real-world problems in water resources.

This poster highlights the FWC method, its use in the educational module, and the process of developing a module for OnRamp. A brief overview of the OnRamp project is also presented.

Developing serial and parallel implementations for the construction of suffix arrays

Efraín Vargas Ramos and Ana Gonzalez

A suffix array is a text-indexing data structure. Informally, a suffix array is a list of the starting positions of the suffixes of a text, sorted in lexicographical order of the suffixes; the LCP array stores the length of the longest common prefix between every adjacent pair of suffixes in the suffix array. A suffix array and its corresponding longest common prefix (LCP) array have many applications in bioinformatics and information retrieval. For example, they are useful for finding the occurrences of a pattern string in a given text, one of the fundamental computational tasks in bioinformatics.

In this work, different sorting algorithms were used for the suffix array construction, in both serial and parallel versions. The suffix array construction was implemented using the C programming language and OpenMP; timings were taken on Stampede, and the results were compared and analyzed in each case. Finally, data sets of different sizes were indexed with their corresponding suffix arrays and used to collect the timings for this analysis.
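For illustration only, a naive construction of both structures can be sketched as follows (in Python for brevity; the authors' implementation is in C with OpenMP, and a sort-based construction like this one is far slower than the specialized linear-time algorithms used in practice):

```python
def suffix_array(text):
    """Naive suffix array: starting positions sorted by the suffix they begin."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """LCP[k] = longest common prefix length of the suffixes at sa[k-1] and sa[k]."""
    def lcp(i, j):
        n = 0
        while i + n < len(text) and j + n < len(text) and text[i + n] == text[j + n]:
            n += 1
        return n
    return [0] + [lcp(sa[k - 1], sa[k]) for k in range(1, len(sa))]

sa = suffix_array("banana")   # suffixes in order: a, ana, anana, banana, na, nana
# sa == [5, 3, 1, 0, 4, 2]; lcp_array("banana", sa) == [0, 1, 3, 0, 0, 2]
```

Pattern search then reduces to binary search over the sorted suffix positions, which is what makes the structure useful for locating pattern occurrences in a text.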

Immersive Visualization of big data models using ParaView

Faiz Abidi, Nicholas Polys, Srijith Rajamohan and Ayat Mohammed

ParaView is a capable open source tool for building data analytics and visualization applications; scientific visualizations of complex systems can be created with ParaView and analyzed interactively. Meanwhile, the value and capability of immersive displays is increasingly shown for spatial judgment tasks, embodied interaction, and abstract concept understanding. With immersive technology, such as room-sized, multi-screen projection environments, users can view models with stereoscopy and head-tracked perspective as well as tracked input devices (such as a wand). However, the process of setting up ParaView to work in such venues involves various steps and challenges, from data preparation and software configuration to interaction techniques. In this poster, we present the entire pipeline for enabling ParaView visualization in a CAVE environment using multiple X displays and the Virtual Reality Peripheral Network (VRPN) for handling user input. We present and reflect on our experience at Virginia Tech's Visionarium lab and consider future research and development opportunities.

Using openDIEL as a Workflow Engine

Kwai Wong and Tanner Curren

In this poster, we present a workflow software platform – the open Distributive Interoperable Executive Library (openDIEL). OpenDIEL is built to facilitate execution of multi-disciplinary computational projects suited for a diversified research community on large-scale parallel computers that are available in XSEDE. OpenDIEL allows users to plug in their individual science codes (modules), prescribe the sequence of workflow, and perform computation under a single MPI executable.

Running multiple copies of serial or parallel codes written in C, C++, and FORTRAN under openDIEL is a simple process. Serial executables, including programs written in R, Perl, Python, or Java, can be executed without any modification. A bash script is used to convert users' parallel codes into modules of openDIEL, primarily by adding the needed header files and replacing the MPI communicator in users' programs with a sub-communicator.

Two interfaces of communication functions are also built to accommodate the transfer of data in deterministic or stochastic manners. The presented openDIEL framework is designed to be portable across many computing platforms.

We will demonstrate how openDIEL is used to support computations of two ECSS projects on Darter and Comet.

HPC for Computational chemistry: Semi-Empirical Approaches and Applications to Biology and Material Science

Jorge Alarcon Ochoa and Humberto Terrones

Computer simulations are becoming increasingly important. Experiments provide a macroscopic description of nature, while computer simulations provide a microscopic description.

When dealing with simulations, the two main issues are validation and verification. Verification deals with making sure that the computer model is in accord with the theoretical model; validation, on the other hand, deals with making sure the results of the computer experiment are representative of reality.

We address some issues regarding the development of force fields for molecular dynamics simulations, discussing comparisons between first principle calculations and experimental results. Furthermore, we will emphasize some routes for the optimization of force field parameters with applications to biological systems (proteins in crowded environments) along with uses in the study of other materials. We will discuss the optimization of protein-protein and protein-solvent interactions via simulations using the small peptide trp-cage, along with optimization and modeling of graphenic structures such as minimal surfaces and Schwarzites, which have very peculiar mechanical and electronic properties.

System-level Performance Monitoring Practice on NCAR Computational Resources: an Implementation of XDMoD with Ganglia Data Collection

Shiquan Su, Shawn Needham, Nathan Rini, Martins Innus, Joseph White, Tom Furlani, Tom Engel and Siddhartha Ghosh

In this time of fast-paced development of High Performance Computing (HPC) technology, building a better and faster supercomputer becomes a more and more sophisticated task. Making important design and management decisions requires the analysis of solid and detailed performance data collected first-hand from the supercomputer. It is therefore mission-critical to understand system-level job performance on HPC platforms.

The XSEDE program is a pioneer and major workhorse of HPC research for NSF. The great variety of HPC platforms running under XSEDE provides informative data sources, and XDMoD takes the central role of analyzing the data and feeding knowledge back to XSEDE researchers and managers.

XDMoD provides an open framework for all the HPC centers at XSEDE to inject and analyze their performance data. In this presentation, we will discuss a workflow of integrating Ganglia Data collection into XDMoD on National Center for Atmospheric Research (NCAR) HPC Resources.

Developing Cyberinfrastructure to Improve the Understanding of Building Energy Efficiency for Public Users

Karthik Abbineni, Kelsie Stopak, Shan He and Ulrike Passe

The primary goal of our research project is to develop and use cyberinfrastructure to understand and improve the public's perception of building energy efficiency and sustainable design. The cyberinfrastructure is explained below in the order of development, from data acquisition to interactive online access, and has been divided into completed, present, and future tasks to give a clear overview of the research work.

Our completed tasks include development of a data acquisition system, data cleaning, and data storage. For the data acquisition system, we use a Campbell Scientific CR1000 data logger, which has recorded data from over 100 sensors at 20-second intervals since 2012. The data can be harvested remotely with Campbell Scientific LoggerNet, a software interface used by the research team for temporary data visualization and system diagnostics. For data cleaning, we use the quantile method to filter out both positive and negative extreme outliers. Other processing is involved as well, such as converting the energy consumption sensor data from cumulative readings to minutely averages. Our data was previously stored in a legacy system, which kept monthly raw data in CSV files and accumulated data in DAT files on a server; this made it difficult to examine the seasonality and trend of the data across multiple months or years. We therefore built a relational database to store all the available data in a single repository.
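The two cleaning steps can be sketched as follows (an illustrative sketch, not the project's actual code; the quantile cutoffs and the nearest-rank quantile definition are assumptions):

```python
def quantile_filter(values, lo=0.01, hi=0.99):
    """Keep only readings between the lo and hi empirical quantiles,
    dropping both positive and negative extreme outliers."""
    s = sorted(values)
    def q(p):
        # nearest-rank quantile: the value at position floor(p * n)
        return s[min(int(p * len(s)), len(s) - 1)]
    low, high = q(lo), q(hi)
    return [v for v in values if low <= v <= high]

def cumulative_to_interval(readings):
    """Convert a cumulative energy counter into per-interval consumption."""
    return [b - a for a, b in zip(readings, readings[1:])]
```

For example, `cumulative_to_interval([0, 5, 9, 20])` yields `[5, 4, 11]`, turning the running meter total into the energy used in each interval.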

Our present tasks include data visualization and creating an online interface aimed at user groups with a middle school education or higher. Tableau is used to visualize the time series sensor data across building performance topics such as outdoor conditions, indoor conditions, space heating and cooling, photovoltaic performance, water consumption, and energy consumption. The website, designed with WordPress and Tableau iFrames, provides online access to the building performance topics discussed above, along with the building components of the house and a breakdown of the house into different zones, to give the public a better understanding of the performance and environment of the house. Through cleaned, simplified graphics, the public can understand the historic performance of the Interlock House compared to the Passive House standards and the typical residential energy consumption intensity of homes in Iowa and other homes in the same climate zone. The graphics take the data collected and filter it down to its essence. The general public is exposed to information on consumption at an annual, monthly, daily, and hourly level, while the original data is collected every 20 seconds for research purposes.

Our future work is focused on validating the legibility of the data visualization and user interface. The interface is aimed at people with a middle school education or higher and no background in building energy consumption. Through a series of surveys, we will look at how different user groups respond to the information.

Improving the scalability of mothur for large metagenomic studies

Tara Urner, Kellan Steele, Kristin Muterspaw, Nicholas Arnold, Ashutosh Rai and Charles Peck

Mothur is an open source bioinformatics software package for analyzing 16S microbial DNA sequences. Mothur's workflow takes raw sequencer output and produces the identity and relative/absolute abundance of microbial species in the original samples. Three particularly time- and space-intensive steps in the mothur workflow contribute significantly to the total runtime: pre.cluster(), chimera.uchime(), and cluster.split(). We focus on pre.cluster(), which implements a pseudo-single-linkage clustering algorithm to remove erroneous sequences introduced by pyrosequencing errors. Because abundant sequences are more likely to produce erroneous sequences, the pre.cluster() algorithm first sorts the input sequences in abundance order. The algorithm proceeds through the sorted list and clusters rarer sequences with a more abundant sequence if they are within a preset base-difference threshold (typically 2 out of 600 base pairs). The current implementation of pre.cluster() incorporates process-cloning parallelism. Focusing on the core algorithm, we developed three parallel implementations: one using CPU threads, one using accelerator threads, and a hybrid of distributed memory and CPU threads. We investigate the performance of the pre.cluster() algorithm with each of our parallel approaches on our local cluster, Blue Waters, and XSEDE resources.
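A minimal sketch of the core pre.cluster() idea, assuming exact Hamming distance on equal-length sequences and invented toy data (mothur's actual implementation differs in detail and is written in C++):

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def pre_cluster(seqs, counts, diffs=2):
    """Pseudo-single-linkage pre-clustering sketch: sort by abundance,
    then merge each rarer sequence into the first more-abundant kept
    sequence within `diffs` mismatches, pooling their counts."""
    order = sorted(seqs, key=lambda s: -counts[s])  # most abundant first
    kept, merged = [], {}
    for s in order:
        for k in kept:
            if hamming(s, k) <= diffs:
                merged[s] = k            # s is treated as an error of k
                counts[k] += counts[s]
                break
        else:
            kept.append(s)               # s survives as its own cluster
    return kept, merged
```

The quadratic pairwise comparison in the inner loop is exactly the hotspot that the thread-based and distributed implementations described above target.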



An inexpensive and open platform with UAS + LiDAR + DNG for archeological and ecological surveys

Nicholas Arnold, Erin L. Lewis, Kristin Muterspaw and Charles Peck

The costs of entry for unmanned aircraft systems (UAS) and light detection and ranging (LiDAR) gear have both dropped significantly over the past few years, enabling new advances in archeological and ecological surveying, among other fields. Our work combines a multi-modal sensor system mounted on a UAS with machine learning algorithms to perform archeological and ecological surveys in Skalanes, Iceland, an area where archeologists have had to rely on historical studies and ground probing when searching for potential sites. The images from the camera are output in raw digital negative format (DNG), then transformed into photogrammetric models and combined with the LiDAR point cloud in a geographic information system (GIS); this yields an information-dense structure in which machine learning algorithms are used to search for potential sites. Flights are site-specific, over areas with a known potential for being of archaeological or ecological interest. The UAS will be flown at ~3 m elevation, giving a 2 mm resolution for the point cloud.

Creating and Deploying Virtual Clusters on Infrastructure Clouds

Jonathan Pastor and Kate Keahey

In the last few years, NSF has funded a range of resources exploring the Infrastructure-as-a-Service model. These resources typically allow the user to provision individual virtual machines or (in the case of Chameleon) bare metal instances. One challenge that arises in this context is the creation of complex images representing virtual clusters or similar constructs of multiple related resources. Deploying such complex images is often more difficult than creating individual instances because their configuration involves the exchange of information assigned or generated at deployment time, such as IP addresses, hostnames, or host keys. Multiple solutions with different trade-offs in functionality, ease of use, and ease of maintenance have been proposed. In this poster, we describe and discuss complex image (virtual cluster) management on the Chameleon system.

NSG-R: Programmatic Access to Neuroscience Applications

Subhashini Sivagnanam, Amit Majumdar, Kenneth Yoshimoto and Ted Carnevale

The recent trajectory of neuroscience research has been towards development and adoption of methods and tools requiring computational resources. Examples include: powerful open-source neural simulators that enable creation of empirically based models of unprecedented complexity; neuroscience community projects (NCPs) that enable collaborative model development and promote data and model sharing; fMRI, high resolution light microscopy, and other imaging methods that generate large data sets that require numerically intensive processing and analysis. For growing numbers of neuroscientists, these advances have produced a critical need to use high performance computing (HPC) resources in their research and teaching.

Neuroscience Gateway (NSG) provides neuroscientists convenient access to HPC resources. NSG's browser-based interface shields researchers from tedious technical and administrative details, enabling performance of tasks that exceed local hardware capabilities. However, researchers must first leave their familiar work environments and carry out numerous step-by-step actions that are potentially error prone.

This motivated our next step in refining and extending NSG, which we present here: creation of a software infrastructure that allows seamless access to HPC resources, whether from the familiar environment of widely used NCPs or from the existing work environment on neuroscientists' own laptop or desktop computers. Called NSG-R, this infrastructure is implemented as RESTful web services on top of the NSG. It will allow neuroscientists to access HPC resources to analyze data, run simulations, and retrieve results, all within the context of their familiar workflows. NSG-R will also allow running models directly on HPC resources from one's own laptop or desktop computer.

Open Source Brain is the first NCP that will integrate NSG-R, making HPC resources transparently available to its users. NSG-R will subsequently be integrated with other projects, e.g. ModelDB, the Neuroscience Information Framework, and OpenWorm, and made available to developers of neural simulators and neuroscience data analysis software. This poster will highlight the work done in collaboration with Open Source Brain, including collecting requirements and enabling submission to XSEDE HPC resources through NSG-R.

By enabling seamless, ubiquitous access to HPC, the NSG-R project is likely to have a transformative impact on research, catalyzing progress in neuroscience and widening opportunities for educational and career advancement regardless of the institutional affiliation of those who benefit from it. This project levels the playing field for all students and researchers by democratizing access to HPC.

Supported by NSF 1458495, NSF 1458840, and BBSRC Research Grant BB/N005236/1.

HPCmatlab v2.0: A Platform for Fast Prototyping of Parallel Applications in MATLAB

Ayush Mishra, Mukul Dave and Mohamed Sayeed

Scientific computing now involves increasing scales of computation and data-intensive workloads. MATLAB and Octave are extremely popular for numerical computing due to their programmability and use of highly optimized back-end libraries. However, users often face the challenge of very long run times. The HPCmatlab framework was developed to enable users to carry out distributed memory programming in MATLAB through MPI and shared memory programming through Pthreads. Support for heterogeneous architectures has now been added to the framework using the OpenCL API, allowing offloading of vectorizable computations to accelerators using predefined kernels. MPI I/O functions have also been added to allow reading and writing large amounts of data to file. Moreover, the framework has been made compatible with Octave. Sample programs demonstrating the use of these new functionalities are included along with preliminary benchmark results. Performance scaling for a real application involving big data is shown, and another potential application where HPCmatlab will be used for parallelization is also explained. The results confirm that HPCmatlab/HPCoctave can significantly enhance programmer productivity and enable the execution of data-intensive and computationally intensive applications in MATLAB/Octave.

Computers and Carboranedithiol Self-Assembled Monolayers: Molecules Do the Can-Can

Olivia Irving, Andrew Serino, Dominic Goronzy, Harsharn Auluck, Jacqueline Deirmenjian, Anastassia Alexandrova, Tomáš Baše, Paul Weiss and Elisa Jimenez-Izal

Self-assembly, defined as the spontaneous organization of a disordered molecular system, has proven to be a viable route for bottom-up design approaches in nanotechnology. Recently, carboranethiols have been shown to assemble on Au{111} substrates and to form two-dimensional monolayers made rigid through intermolecular dipole-dipole interactions. Because of molecular symmetry and the resultant reduction in types of defects, these assemblies are simpler than the prototypical and well-studied n-alkanethiols on Au{111}, which have a multitude of distinguishable domains and defects. Within this new family of molecules, ortho-carboranedithiols display two different binding modes after adsorption: doubly and singly bound. The surfaces were studied and explored with scanning tunneling microscopy (STM) and infrared (IR) spectroscopy. A plane-wave density functional theory code is employed to elucidate the energetics of different binding sites on stoichiometric gold; binding energies of the local minima of both binding modalities are obtained. Experimental evidence, shown on the local and ensemble scales, supports majority control of the singly (dual) bound moieties with acid or base, which is further supported by ground-state calculations and binding energies. Visualizations of the charge distributions of the surface and cage are provided, as well as a natural bond orbital (NBO) analysis of the combined system. Simulated STM images using the Tersoff-Hamann method help with identification of the chemical species on the surface. All of this is done utilizing high performance computing.

Investigating Topic Models for Big Data Analysis in Social Science Domain

Nitin Sukhija, Nicole Brown, Paul Rodriguez, Mahidhar Tatineni and Mark Van Moer

In recent years there has been a quantum leap in the amount of digitized data available in the scientific, national security, business, and social community domains. Furthermore, with this surge in the availability of textual data, the challenges involved in summarizing, understanding, and making sense of the rapidly increasing data for advancing new discoveries in political, social, and other areas have also surged. Text analytics has received increased attention over the years for analyzing and integrating the textual data within a collection of documents (known as a corpus) to realize the untapped potential in many real-world domains. The ability to organize a corpus of documents according to some underlying characteristic is an important function of text analytics. Clustering methods are capable of partitioning documents into groups according to similarity metrics [1][2]. However, the more recent method of topic modeling has the more satisfying outcome of modeling the words that appear in documents according to dependencies between documents and topics, and between topics and words. The topics themselves are not given a priori; searching for latent topics is part of the algorithm [3].

The most widely used method for assigning topics is the Latent Dirichlet Allocation (LDA) process [4]. LDA is a probabilistic topic model for performing unsupervised analysis of large document collections and requires no manual construction of training data. Using a Bayesian formulation, documents are treated as bags of words (i.e., without sequence information), and word-document dependencies on topics are given a probabilistic framework. In fact, it is the probabilistic framework that gives LDA expressive and interpretative power over standard clustering algorithms like k-means. However, the cost is a more complicated algorithm, more complex processing, and possibly greater computational demands. Moreover, there are several algorithms for finding topics, with accompanying parameters and metrics, and it is not obvious how to choose among or compare them. While LDA and related models can easily be applied to discover the topics in a corpus and an assignment of topics to documents, one of the major challenges in executing these models on big datasets is their computational complexity and the cost involved in executing them in high performance computing environments. In general, topic modeling is computationally expensive, as it requires both a large amount of memory and considerable computational power. The performance of these models is often essential, and sometimes critical, to achieving the objectives of the domain areas that make use of them. Therefore, many research efforts have attempted to optimize performance. These optimizations include improving per-core performance, increasing the scalability of execution in parallel and distributed environments, and dealing with dynamically changing large datasets.
In recent years, toolkits such as Mallet [5], Apache Spark, Spark-RDMA [6], and others have been developed to address some of the computational challenges related to executing topic models on big datasets [7]. The dataset used in this research work is a collection of documents from JSTOR's [8] Data for Research website (dfr.jstor.org). The corpus comprises almost 200,000 publications written between 1965 and 2014 matching identified search criteria such as realms of power, neoliberalism, consumerism, welfare rights, and others [9].

In this work, we present a case study that involves text analytics of JSTOR's digitized social science articles to understand intersectional political consumerism, defined as consumer activities motivated by one's intersecting social locations (race, class, and gender). The research presented herein encompasses the usage of XSEDE high performance computing resources, big datasets, and topic modeling toolkits to characterize broadly what themes are included within discussions of business/industry (consumerism), the government, economic policy (neoliberalism), and economic condition; specifically, (the feminization and blackening of) poverty.
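
Toolkits such as Mallet implement heavily optimized, parallel versions of LDA inference; the core idea can nevertheless be illustrated with a minimal collapsed Gibbs sampler. The sketch below is a toy illustration under simplifying assumptions (a tiny invented corpus and made-up hyperparameters), not the implementation used in this study:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, iters=100, alpha=0.1, beta=0.01, seed=42):
    """Minimal collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    ndk = [[0] * n_topics for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                # tokens per topic
    z = []                                             # topic of each token
    for di, doc in enumerate(docs):                    # random initialization
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zs)
    for _ in range(iters):
        for di, doc in enumerate(docs):
            for wi, w in enumerate(doc):
                k = z[di][wi]                          # remove current assignment
                ndk[di][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # resample proportional to (doc-topic) * (topic-word) weight
                weights = [(ndk[di][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + V * beta)
                           for t in range(n_topics)]
                r = rng.random() * sum(weights)
                for k, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        break
                z[di][wi] = k
                ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# Toy corpus (invented) loosely echoing consumerism vs. policy themes.
docs = [
    ["consumer", "boycott", "brand", "consumer", "market"],
    ["boycott", "brand", "market", "consumer"],
    ["welfare", "policy", "state", "welfare", "rights"],
    ["policy", "state", "rights", "welfare"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
```

The returned doc-topic counts play the role of the per-document topic mixture; production toolkits add hyperparameter optimization and parallel sampling on top of this same core loop.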

Parallel Constraint Programming (CP) Solver for Optimizing the Performance of CP Problems in Cloud Computing Environments

Tarek Menouer and Nitin Sukhija

The parallelization of Constraint Programming (CP) solvers is widely proposed in the literature to improve the resolution of CP problems by reducing computing time. Moreover, the current CP problems addressed by industry involve big data scenarios, and solving such complex problems in less time requires the use of scalable many-core computing resources that are readily offered by cloud computing environments. However, achieving optimal performance of CP solvers is a major concern when dealing with cloud infrastructure. To address this issue, we propose a parallelization approach that enables communication between several parallel CP solvers to efficiently exploit the massively available computing resources, thus minimizing the performance degradation of CP solvers in cloud environments. The basic principle underlying the approach is to obtain good load balancing across all cores of the computing system and to ensure that no machines in the cloud sit idle while others are overloaded.
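
The underlying idea of splitting one constraint search tree into independent units of work can be sketched with a toy backtracking solver for the classic N-queens problem (a hypothetical illustration, not the authors' solver): fixing the first decision variable to a different value in each sub-search yields subtrees that separate cores could explore independently, with their results simply summed.

```python
def count_nqueens(n, first_col=None):
    """Count N-queens solutions by backtracking. Fixing the row-0 column
    restricts the search to one branch of the tree: one unit of parallel work."""
    def safe(placed, col):
        row = len(placed)
        return all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(placed))

    def rec(placed):
        if len(placed) == n:
            return 1
        return sum(rec(placed + [c]) for c in range(n) if safe(placed, c))

    return rec([first_col] if first_col is not None else [])

# Splitting at the root gives independent subproblems whose counts are
# combined, exactly as parallel workers would combine their results.
total = count_nqueens(6)
split = sum(count_nqueens(6, c) for c in range(6))
```

A real parallel CP solver must additionally rebalance work dynamically, since subtrees are of wildly unequal size; that load-balancing communication is the focus of the approach above.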

Online Health Monitoring Via Wireless Body Area Networks

Yohannes Alemu and Hongmei Chi

In healthcare, Wireless Body Area Networks (WBANs) are wireless networks of heterogeneous wearable medical computing devices that enable remote monitoring of a patient's health status through physiological monitoring of vital signs. An important aspect in the design and development of online health monitoring for such WBANs is the speed and accuracy of responses. Immediate response to changes in the health condition of a patient can save lives. Faulty measurements signal false alarms and create unnecessary interventions by healthcare personnel, which makes the online health monitoring system unreliable. This study combines a Particle Swarm Optimization (PSO) outlier detection method with a Majority Voting (MV) algorithm: PSO detects abnormal data, and the MV algorithm identifies and signals false alarms. The MV is activated only when the PSO detects temporal outliers. A large real-world dataset is used to test our approach, and the results for this method are promising. We also use synthetically generated data to test the false-alarm rate and compare performance with other similar algorithms; the proposed approach outperforms two similar outlier detection algorithms.
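
The majority-voting step can be sketched as follows. This is a simplified illustration with made-up threshold detectors standing in for the PSO stage, and the heart-rate ranges are invented for the example:

```python
def majority_vote(flags):
    """Raise an alarm only if more than half of the detectors agree."""
    return sum(flags) > len(flags) / 2

def threshold_detector(lo, hi):
    """A stand-in detector: flags a vital-sign reading outside [lo, hi]."""
    return lambda reading: not (lo <= reading <= hi)

# Hypothetical heart-rate detectors with slightly different calibrations.
detectors = [threshold_detector(50, 110),
             threshold_detector(55, 105),
             threshold_detector(45, 115)]

def alarm(reading):
    """Signal an alarm only when a majority of detectors flag the reading,
    suppressing false alarms triggered by a single faulty measurement."""
    return majority_vote([d(reading) for d in detectors])
```

A reading flagged by only one miscalibrated detector is outvoted, which is the mechanism by which voting suppresses false alarms from faulty sensors.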

Job Inter-arrival Modeling on Kraken

Gwang Son and Gregory Peterson

As high performance computing (HPC) becomes highly accessible, more researchers from a variety of research fields have started to use supercomputers to solve large scientific problems. The overall performance of a supercomputer depends heavily on the quality of its scheduling system. The performance of a supercomputer scheduler is greatly affected by the workload to which the supercomputer is applied, and no single scheduling algorithm works perfectly for all workloads. Therefore, workload characterization of a supercomputer is an important step in developing and evaluating scheduling strategies. In this paper, we present an analysis of the workload characteristics of Kraken, which was the world's fastest academic supercomputer and 11th on the 2011 Top500 list. In particular, we focus on the job inter-arrival process. We modeled the inter-arrival process as a combination of burst job arrivals, submitted by the same user within a threshold time of that user's previous job submission, and independent singleton job arrivals, submitted more than the threshold time after the same user's previous submission. We found that, overall, burst job arrivals can be modeled with a combination of a scaled Exponential distribution and a scaled two-phase-Uniform distribution. In addition, singleton arrivals can be modeled with a combination of two scaled Exponential distributions. We also present characteristics (i.e., duration, number of jobs, percentage of node mode, and runtime coefficient of variation) of the burst job inter-arrival process by the same users.
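
The burst/singleton decomposition and a simple Exponential fit can be sketched as follows (the submission times, user names, and 60-second threshold are invented for the example; the paper's actual threshold and scaled mixtures are richer than this single-parameter fit):

```python
def split_gaps(times_by_user, threshold):
    """Split per-user inter-arrival gaps into burst gaps (<= threshold since
    the same user's previous submission) and singleton gaps (> threshold)."""
    bursts, singletons = [], []
    for times in times_by_user.values():
        times = sorted(times)
        for prev, cur in zip(times, times[1:]):
            gap = cur - prev
            (bursts if gap <= threshold else singletons).append(gap)
    return bursts, singletons

def exponential_rate(gaps):
    """Maximum-likelihood rate of an Exponential distribution fitted to the
    gaps: lambda_hat = n / sum(gaps), i.e. the reciprocal of the mean gap."""
    return len(gaps) / sum(gaps)

# Hypothetical submission times (seconds) for two users; 60 s burst threshold.
times_by_user = {"alice": [0, 10, 20, 400], "bob": [5, 300, 310]}
bursts, singletons = split_gaps(times_by_user, threshold=60)
```

Fitting each pool separately, as here, mirrors the paper's strategy of modeling burst and singleton arrivals with distinct distributions.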

BigSounds as BigData: Addressing Soundscape Ecology Studies of Hundreds of TeraBytes using HPC Platform Tools

Boyu Zhang, Amandine Gasc and Bryan Pijanowski

Soundscape studies represent an attractive new methodology for biodiversity assessment, as they are non-invasive and well suited to the challenge of global ecological assessment over large spatial and temporal scales. Recent advances in automated acoustic recording, massive data storage, and rapid soundscape measurement result in considerable amounts of data. This massive amount of data has posed new challenges in all aspects of data ingestion, storage, preprocessing, and analysis. Traditional workflows in soundscape studies involve streaming data from external hard drives or High Performance Computing (HPC) cluster storage systems to one's laptop or workstation and performing subsequent sanity checks and analysis. This workflow is no longer feasible with hundreds of terabytes of recording data due to the added cost of data transfer and the limited processing power of workstations. In this paper, we present a modularized system that enables acoustic analysis workflows on HPC platforms over hundreds of terabytes of soundscape recording collections. The system allows users to construct flexible analysis workflows and automatically runs them on HPC clusters in parallel. We present and discuss two representative statistical analysis workflows: 1) the calculation of single-recording acoustic diversity indices (α-diversity), and 2) the calculation of pairwise recording dissimilarities (β-diversity). The results show that our system can efficiently break large-scale workflows into manageable smaller jobs, compute results, and synthesize them for further analysis in reasonable time frames.
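
Acoustic α-diversity indices are often Shannon-style measures over the energy distribution across frequency bands, and β-diversity a pairwise dissimilarity between such profiles. The sketch below is a simplified illustration of that general pattern, not necessarily the exact indices used in this work:

```python
import math

def alpha_diversity(band_energies):
    """Shannon-style diversity over the energy distribution across the
    frequency bands of one recording (higher = energy more evenly spread)."""
    total = sum(band_energies)
    props = [e / total for e in band_energies if e > 0]
    return -sum(p * math.log(p) for p in props)

def beta_dissimilarity(a, b):
    """Normalized Manhattan dissimilarity between two recordings'
    band-energy profiles (0 = identical, 1 = completely disjoint)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    return num / (sum(a) + sum(b))
```

Because each recording's index and each pair's dissimilarity is computed independently, both workflows decompose naturally into the many small parallel jobs the system schedules on HPC clusters.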

An Enhanced Version of Variogram Selection and Kriging

Erin Hodgess and Kendra Mhoon

We will demonstrate a new approach to fitting variogram models and kriging using the R statistical language in conjunction with Fortran (F90) and the Message Passing Interface (MPI) on the Stampede supercomputer. This new approach has led to great improvements in timing when results are compared to those obtained with R alone, or even with R combined with C and MPI. These improvements include processing longer vectors; for example, fitting and forecasting vectors of size 25,000 takes an average of 15 seconds, compared to previous times of nearly one hour. To disseminate our new approach, we will build an R package for general access and a fully online course on the XSEDE website.
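
The empirical semivariogram that underlies variogram fitting can be sketched in plain Python (a serial, one-dimensional illustration with invented data, not the R/Fortran/MPI implementation): for a lag h, average half the squared differences of all pairs of observations separated by approximately h.

```python
def empirical_variogram(coords, values, lag, tol):
    """Empirical semivariance for pairs whose separation distance lies in
    [lag - tol, lag + tol]:
        gamma(h) = sum over pairs of (z_i - z_j)^2 / (2 * |N(h)|)
    Coordinates here are 1-D locations for simplicity."""
    sq_diffs = []
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            h = abs(coords[i] - coords[j])
            if lag - tol <= h <= lag + tol:
                sq_diffs.append((values[i] - values[j]) ** 2)
    return sum(sq_diffs) / (2 * len(sq_diffs))

# Invented 1-D transect: three equally spaced observations.
gamma1 = empirical_variogram([0, 1, 2], [0.0, 1.0, 2.0], lag=1, tol=0.4)
```

The O(n²) pairwise loop is exactly what makes long vectors slow in pure R and what the Fortran/MPI back end parallelizes.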

The Communication Model of Agile Software Development

Michelle Williams, Dominique Stewart, Kerk Kee and Mona Sleiman

As the computational movement gains more traction in the scientific community, there is an increasing need to understand the communication behaviors that facilitate productive agile software development in e-science. This investigation reveals how geographically dispersed teams virtually organize to execute the development and implementation of computational tools. The data consist of 135 interviews with domain scientists, computational technologists, and supercomputer center administrators. A systematic analysis revealed that iterative needs assessment, the defining characteristic of agile software development, depends on three communication processes: synergistic relationship building, strategic organizational design, and transparent information sharing. Each of these processes comprises a range of behaviors that occur during agile software development. For example, synergistic relationship building encompasses building rapport and establishing trust within the team. Strategic organizational design includes items such as role delineation and the establishment of objectives. Lastly, transparent information sharing comprises historical record-keeping and open information flow. This list of communication processes and associated behaviors will serve as a guide for key stakeholders to communicate effectively throughout iterative needs assessment, which involves re-evaluating requirements and providing solutions. This paper is submitted to the "Software and Software Environments" track because it has implications for design strategies and engagement of user communities.

Large-Scale Document Analysis through a Mobile Phone

Joseph Molina and Ritu Arora

Large, heterogeneous, and complex collections of documents can be difficult to manage manually. Tools and techniques that can help automate the management of such collections in a timely manner are therefore required. In this research project, we are developing an Android application that can be used for automating some of the steps required for managing large collections of documents that reside on Google Drive and on large-scale systems at open-science data centers. Examples of the steps in the data management process that would be enabled through our Android application are (1) the classification of documents on the basis of their content, and (2) filtering or searching the documents on the basis of a text pattern. We refer to such steps in the data management process as document analysis. We are incorporating the Optical Character Recognition (OCR) technique in our application for extracting text from born-digital documents. We will enable OCR of the documents in batch mode to reduce the time-to-results.
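
Step (2), pattern-based filtering over extracted text, can be sketched as follows. The file names and text are invented for illustration, and the OCR stage that would produce the text is not shown:

```python
import re

def filter_documents(extracted_text, pattern):
    """Return the names of documents whose (OCR-extracted) text matches a
    regular-expression pattern, case-insensitively."""
    rx = re.compile(pattern, re.IGNORECASE)
    return sorted(name for name, text in extracted_text.items()
                  if rx.search(text))

# Hypothetical OCR output keyed by file name.
corpus = {
    "grant.pdf": "Proposal for NSF grant renewal, fiscal year 2016.",
    "notes.pdf": "Meeting notes on cluster maintenance.",
    "report.pdf": "Annual NSF report: allocations and usage.",
}
matches = filter_documents(corpus, r"\bNSF\b")
```

Because each document is examined independently, this filtering step parallelizes across documents in the same batch mode as the OCR extraction.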

Immersive Tools: Visualizing the Postictal Neural Phase

Marvis A. Cruz, Grace M. Rodriguez Gomez, Andrew Solis and Dr. Brian McCann

Clinicians recognize the 'ictal' neural phase as an epileptic seizure: 'ictal' refers to the seizure itself, and 'postictal' is the neural phase following the seizure. The postictal phase is critical for clinicians to determine the presence of epilepsy in a seizure patient. To date, however, the extent of post-seizure symptoms remains difficult for most to comprehend. Over the summer of 2016, our goal is to develop a scientifically focused virtual reality experience recreating symptoms of the postictal neural phase for users to experience via the HTC Vive headset. To develop a VR experience as closely analogous to the postictal neural phase as possible, we are using Unity3D, a multifaceted game development platform, and C# scripting. Scholarly neuropsychological articles will be utilized for scientific support, as well as an interactive version of the Toronto Empathy Scale to assess empathy before users experience the tool and thereafter.

Analysis of Random Alloy Nanoparticles as Catalysts for Oxygen Reduction Reactions

Fatima Al-Quaiti, Dr. Juliana Duncan, Benjamin Corona, Christopher Lee, Esteban Treviño and Dr. Graeme Henkelman

Fuel cells utilize the energy produced by the reduction of oxygen and the oxidation of hydrogen. Because the oxygen reduction reaction is slow, a metal catalyst, such as platinum, is used to increase its rate. Since platinum is costly, cheaper and more efficient catalysts need to be developed. Alloying, the mixing of two or more metals, can be used to create catalysts that are cheaper and tunable. Computational methods are used to understand trends in reactivity at the (111) hollow sites of the catalysts and to predict candidate alloy catalysts. This is done by modeling a 38-atom nanoparticle composed of a 50/50 ratio of two metals to predict the ideal combination of metals.
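
Generating one such random 50/50 alloy configuration can be sketched as below. This is a hypothetical helper (the metal pair Pd/Au is only an example, and the DFT energy evaluation of each configuration is not shown):

```python
import random

def random_alloy_sites(metal_a, metal_b, n_atoms=38, seed=0):
    """Assign a 50/50 mix of two metal species to the sites of a nanoparticle,
    shuffled to produce one random alloy configuration."""
    sites = [metal_a] * (n_atoms // 2) + [metal_b] * (n_atoms - n_atoms // 2)
    random.Random(seed).shuffle(sites)
    return sites

# One random configuration of a 38-atom, 50/50 binary nanoparticle.
particle = random_alloy_sites("Pd", "Au")
```

Sampling many such configurations with different seeds and averaging their computed binding properties is one common way to characterize a random alloy rather than a single ordered structure.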

Building a Shared HPC Center Across University Schools and Institutes: A Case Study

Glen MacLachlan, Jason Hurlburt, Adam Wong and Marco Suarez

Over the past several years, The George Washington University has recruited a significant number of researchers in a wide variety of domains requiring the availability of advanced computational resources. We discuss the challenges and obstacles encountered in planning and establishing a first-time high performance computing center at the university level and present a set of solutions that will be useful for any university developing a fledgling high performance computing center. We focus on justification and the cost model, strategies for determining anticipated use cases, planning appropriate resources, staffing, user engagement, and metrics for gauging success.