Education and Outreach Blog


Reaching for the Stormy Cloud with Chameleon

Some scientists dream about big data. The dream bridges two divided realms. One realm holds lofty peaks of number-crunching scientific computation. Endless waves of big data analysis line the other realm. A deep chasm separates the two. Discoveries await those who cross these estranged lands.

Unfortunately, data cannot move seamlessly between the Hadoop Distributed File System (HDFS) and parallel file systems (PFS). Scientists who want to take advantage of the big data analytics available on Hadoop must first copy their data over from a parallel file system. That can slow workflows to a crawl, especially those with terabytes of data.

Computer scientists working in Xian-He Sun's group are bridging the file system gap with a cross-platform Hadoop reader called PortHadoop, short for portable Hadoop. "PortHadoop, the system we developed, moves the data directly from the parallel file system to Hadoop's memory instead of copying from disk to disk," said Xian-He Sun, Distinguished Professor of Computer Science at the Illinois Institute of Technology. Sun's PortHadoop research was funded by the National Science Foundation and the NASA Advanced Information Systems Technology Program (AIST).

The concept of 'virtual blocks' helps bridge the two systems by mapping data from parallel file systems directly into Hadoop memory, creating a virtual HDFS environment. These 'virtual blocks' reside in the centralized namespace in the HDFS NameNode. The MapReduce application running on HDFS cannot see the 'virtual blocks'; instead, a map task triggers the MPI file read procedure and fetches the data from the remote PFS before its Mapper function processes it. In other words, a dexterous sleight of hand from PortHadoop tricks HDFS into skipping the costly I/O operations and data replications it usually expects.
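
For readers who think in code, the 'virtual block' idea can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions, not PortHadoop's actual implementation: the names `VirtualBlock`, `VirtualNamespace`, `fetch_block` and `map_task` are hypothetical, a local temporary file stands in for the remote parallel file system, and an ordinary file read stands in for the MPI file read.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualBlock:
    """Metadata only: points at a byte range in a PFS file (hypothetical)."""
    pfs_path: str
    offset: int
    length: int


class VirtualNamespace:
    """Toy stand-in for the NameNode namespace that holds virtual blocks."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = {}  # virtual HDFS path -> list of VirtualBlock

    def register(self, hdfs_path, pfs_path):
        # Record where each block lives in the PFS file; no data is copied.
        size = os.path.getsize(pfs_path)
        self.blocks[hdfs_path] = [
            VirtualBlock(pfs_path, off, min(self.block_size, size - off))
            for off in range(0, size, self.block_size)
        ]
        return self.blocks[hdfs_path]


def fetch_block(block):
    # Assumption: a plain seek-and-read replaces the MPI file read here.
    with open(block.pfs_path, "rb") as f:
        f.seek(block.offset)
        return f.read(block.length)


def map_task(block, mapper):
    # The data is pulled into memory only when the map task runs,
    # and is handed to the mapper without an intermediate disk copy.
    data = fetch_block(block)
    return mapper(data)


def run_demo():
    with tempfile.TemporaryDirectory() as tmp:
        pfs_file = os.path.join(tmp, "simulation.dat")
        with open(pfs_file, "wb") as f:
            f.write(b"a" * 10 + b"b" * 10 + b"c" * 5)  # 25 bytes of "PFS" data

        namespace = VirtualNamespace(block_size=10)
        blocks = namespace.register("/virtual/simulation.dat", pfs_file)
        # Each map task reports how many bytes it received.
        return [map_task(block, len) for block in blocks]


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is the split between metadata and data: registering a file only records offsets and lengths, while the bytes themselves are read when a map task asks for them.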
