ECSS Symposium Archive
ECSS staff share technical solutions to scientific computing challenges monthly in this open forum.
Previous years' ECSS seminars may be accessed through these links:
December 17, 2019
Extracting Domain Information using Deep Learning
Presenter(s): Amit Gupta (TACC)
In this session we will present an overview of our exploration of using Deep Learning to extract entities of interest from journal article text. Across scientific domains, extracting and curating new knowledge from large bodies of text remains a challenging task. To this end, we have developed a computational tool named DIVE (Domain Informational Vocabulary Extraction) to provide entity extraction and expert curation functionality. The tool has been integrated with the publication pipeline used by the American Society of Plant Biologists. Using the author feedback mechanism in our deployed tool, we were able to create an expert-annotated dataset based on articles submitted over an entire year. This new gold standard dataset for supervised training now enables us to contrast several methods for the entity extraction task. We use the NeuroNER tool to investigate the effectiveness of deep neural networks in this task and contrast it with other tools that use different methods, such as ABNER (using CRF) and DIVE (using an ensemble of regular expression rules, keyword dictionaries and ontology files). Our early results from NeuroNER training with author annotations show very promising improvement in predicting the important words in the documents. This makes it an excellent candidate for future development and integration into the DIVE tool.
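As a rough illustration of the rule-based side of this comparison, the sketch below combines a keyword dictionary with a regular expression rule and merges the matches. It is a minimal sketch only: the vocabulary, the pattern, the labels, and the function name `extract_entities` are hypothetical and are not taken from DIVE.

```python
import re

# Hypothetical keyword dictionary and regex rule, for illustration only;
# these are not DIVE's actual vocabularies or patterns.
KEYWORDS = {"arabidopsis thaliana": "species", "auxin": "hormone"}
RULES = [(re.compile(r"\bAT[1-5CM]G\d{5}\b"), "gene_id")]


def extract_entities(text):
    """Return sorted (start, end, surface, label) tuples found by either method."""
    found = set()
    lowered = text.lower()  # dictionary lookup is case-insensitive
    for term, label in KEYWORDS.items():
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            found.add((m.start(), m.end(), text[m.start():m.end()], label))
    for pattern, label in RULES:
        for m in pattern.finditer(text):
            found.add((m.start(), m.end(), m.group(0), label))
    return sorted(found)


sample = "Auxin signaling in Arabidopsis thaliana requires AT1G04240."
entities = extract_entities(sample)
for start, end, surface, label in entities:
    print(start, end, surface, label)
```

A learned tagger such as NeuroNER or ABNER would replace the fixed dictionary and pattern with a model trained on annotated text, which is the contrast the talk describes.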
The Distant Reader: Reading at scale
Presenter(s): Eric Lease Morgan (Notre Dame)
The Distant Reader is a tool for reading. It takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis -- reading. Given a corpus of just about any size (hundreds of books or thousands of journal articles), the Distant Reader analyzes the corpus, and outputs a myriad of reports enabling the researcher to use & understand the corpus. Designed with college students, graduate students, scientists, or humanists in mind, the Distant Reader is intended to supplement the traditional reading process. This presentation outlines the problems the Reader is intended to address as well as the way it is implemented on the Jetstream platform with the help of both software and personnel resources from XSEDE. The Distant Reader is freely available for anybody to use at https://distantreader.org
October 15, 2019
On Developing Reusable Software Components for the Advanced Cyberinfrastructure
Presenter(s): Ritu Arora (TACC)
Developing reusable software components that can be integrated into unforeseen software projects has the potential to enhance the productivity of the programmers who reuse the software. However, the initial cost of developing such components can be higher than that of developing components for a single use case. In this talk, we will discuss a couple of reusable software components that were developed for the BOINC@TACC and Gateway-In-a-Box (GIB) projects. One component, Greyfish, is a portable, cloud-based filesystem. Another, Midas, is a tool for automating the generation of Docker images from source code. Both components were initially prototyped for predefined needs and were tightly coupled with the other components they interoperated with. However, after determining that the amount of effort involved in teasing out these components and making them available as stand-alone software is insignificant and can help with the sustainability goals of the aforementioned projects, we refactored these components and wrote clear documentation for installing and using them. Doing this helped us improve the software quality: people in the community started using the software and helped us fix some bugs and improve the documentation. In summary, there is often a direct or indirect cost involved in making software reusable, and this cost may vary from project to project. However, the long-term sustainability and maintenance needs of the project may far outweigh the cost associated with software reusability.
Exploring the Dynamics of a Quantum-Mechanical Compton Generator
Presenter(s): Marty Kandes (SDSC)
In 1913, while he was still an undergraduate, American physicist Arthur Compton invented a simple way to measure the rotation rate of the Earth with a tabletop-sized experiment, independent of any astronomical observation. The experiment consisted of a large-diameter circular ring of thin glass tubing filled with water and oil droplets. After placing the ring in a plane perpendicular to the surface of the Earth and allowing the fluid mixture of oil and water to come to rest, Compton abruptly rotated the ring, flipping it 180 degrees about an axis passing through its own plane. The result of the experiment was that the water acquired a measurable drift velocity due to the Coriolis effect arising from the daily rotation of the Earth about its own axis. Compton measured this induced drift velocity by observing the motion of the oil droplets in the water with a microscope. This device, now named after him, is known as a Compton generator. The fundamental research objective of this XSEDE project is to explore the dynamics of a quantum-mechanical analogue to the classical Compton generator experiment through the use of numerical simulations. In this presentation, I describe how the physics of the problem itself drives many of the computational challenges in the simulations; what numerical methods and computational techniques were implemented in the custom simulation code written to explore the problem (and other quantum systems in rotating frames of reference); the performance characteristics and limitations of this code; some challenges in creating a post-simulation visualization pipeline; as well as the latest results and future directions of the project.
September 17, 2019
The "Morelli Machine": A Proposal Testing a Critical, Algorithmic Approach to Art History
Presenter(s): Paul Rodriguez (SDSC)
The Morelli Machine refers to an algorithmic approach to characterizing authorship from the late 19th century, which proposed that fine details of minor items in a painting would reveal particular styles. The PIs set out to test the hypothesis that contemporary computer vision techniques could perform this sort of "stylistic" matching. In order to do this, they sought to mechanize a method that is indigenous to art history and that uses details as a proxy for style. This project approached the question of "style" as one of extracting features that have some discriminatory power for distinguishing paintings or groups of paintings. We used feature discovery from a pretrained convolutional network (VGG19) for object recognition. We processed both whole images and some classes of image parts (e.g., mouths), and performed clustering. In this presentation I will review the image preparation steps, extraction steps, clustering results, and cluster evaluation. The upshot is that all convolutional layers indeed have discriminatory features, and different layers might have different kinds of features, with different interpretability that may be hard to define.
Improving Science Gateways usage reporting for XSEDE
Presenter(s): Amit Chourasia (SDSC)
Science domain-specific gateways have gained wide use by providing easy web-based access to complex cyberinfrastructure. Science Gateways are consuming an increasing proportion of the computational capacity provided by XSEDE. A typical approach used by Science Gateways is to use a single community account with a compute allocation to process compute jobs on behalf of their end users. The computation usage for Science Gateways is compiled from batch job submission systems and reported by the XSEDE Service Providers. However, this reporting does not capture information about the user who actually initiated the computation, as the batch systems do not have this information. To overcome this reporting limitation, Science Gateways use a separate pipeline to submit job-specific attributes to XSEDE, which are later joined with the batch system information submitted by the Service Providers to create detailed usage reports. In this presentation I will describe improvements to the Gateway attribute reporting system, which better serves the needs of the growing Science Gateway community and gives them a simpler, streamlined way to report usage and ultimately publish this information via XDMoD.
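The join described above can be pictured with a small sketch: batch accounting records that only know the community account are matched, by job identifier, with attributes the gateway submitted separately. All field names, values, and the function name `usage_by_user` are assumptions made for illustration; they do not reflect the actual XSEDE reporting schema or API.

```python
# Hypothetical records; field names are assumptions, not the XSEDE schema.
batch_jobs = [
    {"job_id": "1001", "account": "gateway_community", "core_hours": 12.5},
    {"job_id": "1002", "account": "gateway_community", "core_hours": 3.0},
    {"job_id": "1003", "account": "gateway_community", "core_hours": 7.25},
]
gateway_attributes = [
    {"job_id": "1001", "gateway_user": "alice"},
    {"job_id": "1002", "gateway_user": "bob"},
    {"job_id": "1003", "gateway_user": "alice"},
]


def usage_by_user(jobs, attributes):
    """Join batch records with gateway attributes and total core hours per user."""
    user_for_job = {a["job_id"]: a["gateway_user"] for a in attributes}
    totals = {}
    for job in jobs:
        # Jobs with no submitted attributes stay attributed to "unknown".
        user = user_for_job.get(job["job_id"], "unknown")
        totals[user] = totals.get(user, 0.0) + job["core_hours"]
    return totals


report = usage_by_user(batch_jobs, gateway_attributes)
print(report)
```

Without the gateway-side attributes, every job in this sketch would be attributed to the single community account, which is the reporting gap the talk addresses.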
August 20, 2019
Hadoop and Spark on a Shared Resource
Presenter(s): Byron Gill (PSC)
Hadoop, Spark, and the ecosystem of other software that interacts with them are in demand, but many of the assumptions about the typical use case for these programs don't apply to the typical user on a shared HPC cluster. This talk will explore some of the challenges in creating a workable environment within the confines of a shared cluster and describe some of the approaches we've used at PSC to accommodate the needs of our users.
Lessons learned in Developing a coupling interface between Kinetic PUI code (Fortran) and a Global MHD code (C++)
Presenter(s): Laura Carrington (SDSC)
The objective of the PI's team was to obtain a quantitative understanding of the dynamical heliosphere, from its solar origin to its interaction with the LISM, by creating a data-driven suite of models of the Sun-to-LISM connection. To accomplish this, I worked to develop a coupling interface between a Kinetic PUI code (Fortran) and a Global MHD code (C++). The kinetic PUI code models the nonthermal (pickup) ions (PUIs) created as new populations of neutral atoms are born in the SW and LISM. The PUIs generate turbulence that heats up the thermal ions. PUIs are further accelerated to create anomalous cosmic rays (ACRs). This code was originally serial and designed to compute the trajectory of a single particle. The coupling allows the PUI code to get magnetic field data from a large parallel Global MHD simulation code and compute ~5000 trajectories in a single run. The challenges of parallelizing the PUI code and coupling its Fortran77 and Fortran90 code with the C++ Global MHD code are presented, along with lessons learned in working with mixed-language codes and on TACC Stampede2.
June 18, 2019
HPC+Jupyter for Computational Chemistry
Presenter(s): Albert Lu (TACC)
Methods of computational chemistry have demonstrated remarkable power in predicting materials properties, and are therefore widely used in academic research and industrial applications. In 2018 at TACC, for example, over 30% of the computational time used on the supercomputer Stampede2 went to chemistry and materials science applications. Providing a more intuitive way of performing simulations can not only help lower the learning curve for new users, but also create a different user experience and value. In this presentation, Albert Lu (TACC) will give an overview of interactive computing with Jupyter notebooks and demonstrate how to set up and run interactive simulation jobs (of LAMMPS) on Stampede2. Related tools for parallel computing (IPython Parallel) and workflow management (Parsl) will also be discussed in this talk.
The Development of a Mobile Augmented Reality Application for Visualizing the Protein Data Bank
Presenter(s): Max Collins (UC Irvine)
Principal Investigator(s): Alan Craig (U. Illinois and Shodor)
In 2015-2016, then-undergraduate student Max Collins was in the Blue Waters Student Internship Program. In that internship, he received training in high performance computing and developed a project in conjunction with his mentor, Alan Craig. His project was to create a mobile augmented reality application to visualize the Protein Data Bank. This presentation will discuss the technical details and development process of that application. In addition, Max will address how the internship and this application have affected his schooling and career choices. An early version of the application can be seen in the video on this page: http://www.ncsa.illinois.edu/news/story/blue_waters_intern_visualizes_a_career_in_app_development
May 21, 2019
ECSS CGEM project: Experiences & Beyond
Presenter(s): Kent Milfeld (Texas Advanced Computing Center)
Often performance can be analyzed through profilers such as gprof and VTune. At other times it is necessary to observe what is happening in a code with other tools to find performance problems. In this presentation we'll look at a few handy tools used to discover a performance problem in the marine science CGEM code.
SimCCS Science Gateway: Towards Creating a Dynamic Web based Portal for Carbon Capture and Storage
Presenter(s): Sudhakar Pamidighantam (Indiana University)
This presentation will describe the SimCCS Science Gateway, a portal for simulating Carbon Capture, Transport and Storage. We will motivate the need for the simulations, indicate where they might be used, and describe the gateway creation and interfaces in detail. We will also present the evolution of the gateway, from basic optimization of a pipeline network with user-prepared inputs, through a desktop application that drives the workflow, to a web-browser-based interface that drives the workflow using the Apache Airavata integrated Django framework.
April 16, 2019
The Digital Object Architecture and Enhanced Robust Persistent Identification of Data
Presenter(s): Rob Quick (Indiana University)
The research community's ability to collect and store data has grown much more rapidly than its ability to catalog, make accessible, and make use of data. Recent initiatives in Open Science and Open Data have attempted to address the problems of making data discoverable, accessible and re-usable at internet scales. The goal of the Enhanced Robust Persistent Identification of Data (E-RPID) project is to address these deficiencies and enable options for data interoperability and reusability in the current research data landscape by using Persistent Identifiers (PIDs) and a kernel of state information available with PID resolution. Doing this requires integrating a set of preexisting software systems with a small set of newly developed software solutions. The combination of these software components and the core principles of making data FAIR (findable, accessible, interoperable and reusable) will allow us to use Persistent Identifiers to create an end-to-end fabric capable of realizing the Digital Object Architecture for researchers. This presentation will introduce the audience to the concepts of the Digital Object Architecture, describe the software services necessary to enable this architecture, introduce the existing E-RPID testbed that is available for experimental use on the Jetstream cloud environment, and describe the diverse set of use cases already using E-RPID to enhance their data accessibility, interoperability and reusability.
March 19, 2019
SeedMe2: Data Sharing Cyberinfrastructure for Researchers
Presenter(s): Amit Chourasia (San Diego Supercomputer Center)
Data is an integral part of scientific research, and data size problems have become endemic as computation and analyses produce increasingly large amounts of data. Research teams are inevitably tasked with managing these rapidly growing data collections. Existing solutions are largely focused upon providing storage space, whether local or in the cloud, and a familiar folder tree-style hierarchy. While these file system solutions work, they separate the data from essential contextual information, such as metadata, descriptive text and equations, job execution parameters, visualizations, and ongoing data discussion among the researchers. Important discussions, for instance, remain in email logs or forums, while descriptive text is left in README files or embedded in those same email logs and forums. This distribution of contextual information makes it harder to keep track of it all and keep data from being orphaned or misinterpreted. A more unified approach is needed that keeps data and context together within the same storage system. In this talk I will discuss and interactively demonstrate key features of the building blocks for data sharing and data management developed by the SeedMe2 (Stream, Encode, Explore and Disseminate My Experiments) project. It enables research teams to manage, share, search, visualize, and present their data in a web-based environment using an access-controlled, branded, and customizable website they own and control. It supports storing and viewing data in a familiar tree hierarchy, but also supports formatted annotations, lightweight visualizations, and threaded comments on any file/folder. The system can be easily extended and customized to support metadata, job parameters, and other domain- and project-specific contextual items. The software is open source and available as an extension to the popular Drupal content management system. Project website with easy trial option: http://dibbs.seedme.org
February 19, 2019
Sustaining Science Gateway Operations through SciGaP Service
Presenter(s): Suresh Marru (Science Gateways Research Center, Indiana University)
Science Gateways dramatically accelerate scientific discovery by providing crucial user- and science-centric points of entry to cyberinfrastructure resources while shielding users from the technicalities of interacting with XSEDE-like distributed infrastructure. XSEDE's Extended Collaborative Support Services (ECSS) has collaborated in making it as easy as possible for scientific communities to create such Science Gateways and in helping them integrate with XSEDE. However, it is important to sustain these collaborative efforts and assist XSEDE communities in operating these gateways. In this talk we will present ECSS project exemplars that have adopted the hosted Apache Airavata services operated by the NSF-funded Science Gateway Platform (SciGaP) project, thus decreasing the overhead of gateway operations. The talk will conclude by providing references for future ECSS projects to take advantage of an out-of-the-box Gateway platform with customizable user interfaces, or to integrate a la carte via direct programmatic access from existing community Gateway implementations.
Ansible on the Cloud: A match made in heaven
Presenter(s): Eric Coulter (Science Gateways Research Center, Indiana University)
One of the major difficulties facing researchers in getting started with national cyberinfrastructure (CI) is the pain of actually *using* it. For support staff, it is a continual struggle to effectively onboard new users and provide interfaces to compute resources. With the advent of cloudy research CI, it has become possible to provide highly customized resources for a variety of scientific domains, while at the same time giving access to those resources through gateways. I will discuss how customized infrastructure can enable a wide range of scientific projects, from bioinformatics to real-time data gathering. I will also demonstrate how the use of Ansible makes it relatively easy to create configurable, replicable infrastructure on Jetstream's OpenStack cloud, and provide participants with a starting point for building their own customized infrastructure.
January 15, 2019
Searching through the SRA - A focus on the ECSS work
Presenter(s): Mats Rynge (USC)
The Sequence Read Archive (SRA), the world's largest database of sequences, hosts approximately 10 petabases (10^16 bp) of sequence data and is growing at the alarming rate of 10 TB per day. Yet this rich trove of data is inaccessible to most researchers: searching through the SRA requires large storage and computing facilities that are beyond the capacity of most laboratories. Enabling scientists to analyze existing sequence data will provide insight into ecology, medicine, and industrial applications. As a prototype project, we specifically focus on providing a search capability against metagenomic sequences (whole community datasets from different environments). These data represent approximately 46 TB of data in the SRA. We provided two different search algorithms that can be used by domain scientists to explore this data. The presentation includes details on how XSEDE ECSS helped to create a science gateway using the open community science gateway framework Apache Airavata, and an auto-scaled processing setup using Jetstream and directly mounted Wrangler storage for efficient data access for the growing user community of Searching the SRA.
Hyperglyphs: Pushing the Limits of Glyph Structure to Gain Insight Into Large Datasets
Presenter(s): Jeff Sale (SDSC)
The concept of a glyph in scientific visualization is well known and has found numerous applications over the years. However, the limits to the level of complexity of glyph structure have only begun to be fully explored. At the same time, a growing percentage of the big data torrent consists of semi-structured, unstructured, and non-traditional data, presenting a challenge for conventional visualization methods. Some data are so complex that it is difficult to know where to begin to gain insight into the trends and anomalies hidden within. We need new and innovative ways to visually explore such massive amounts of complex data. In this symposium I will provide a brief history of glyphs in scientific visualization and the conditions in which their use is appropriate and beneficial. Then I will make the case that conventional, simple glyphs should be extended and complexified into what I call 'hyperglyphs', highly complex visual structures designed to encapsulate much more information within a single glyph and which, when thousands are arrayed in an interactive 3D space, can significantly enhance perception and information assimilation, leading to new knowledge and insight. I will provide a wide range of examples from diverse fields including education, physiology, meteorology, public health, and social media.