DAILY NEWS AND INFORMATION FOR THE GLOBAL GRID COMMUNITY / JUNE 02, 2003: VOL. 2 NO. 22

Special Features:

ENG DIRECTOR REVEALS STATE OF THE TERAGRID

GRIDtoday: As they say in Silicon Valley, what's the elevator pitch on TeraGRID and your goals?

Pete Beckman: Simply put, the TeraGrid is the National Science Foundation's project to build the world's largest, most powerful computing environment for Grid-based computational science. Scientists around the world collaborate each day by using the Internet to exchange programs and data sets and share resources. Grid technologies take that a step further, and provide scientists with simple and seamless access to data and resources across multiple, distributed sites -- improving the pace of innovation and discovery. Combine Grid technologies with mammoth Linux clusters and the fastest optical network on the planet for open science and you have the TeraGrid!

GT: What is it about the TeraGRID that people don't know much about that you think is important?

PB: The hard part of building successful Grid systems, capable of hosting hundreds of science projects, is a common software environment. It was a problem largely unsolved by the Grid community. To build the TeraGrid, we had to invent, design, and build the Common TeraGrid Software Stack (CTSS). It is the CTSS that binds our Grid together -- and permits something we call "TeraGrid Roaming", where scientists can easily move applications and data across our Grid. For example, scientists can use the Grid at the San Diego Supercomputer Center (SDSC) to store enormous data sets on terabytes of fast fibre channel storage while running teraflop computational jobs at the National Center for Supercomputing Applications (NCSA) in Illinois. Making it happen in practice, not just in theory, has been the real challenge, and an important result of our work.

GT: Where is TeraGRID today?

PB: Almost all of the sites have accepted the first wave of hardware, and the Common TeraGrid Software Stack has been defined and is now going through testing, evaluation, and scheduled revisions.

GT: How is the funding going? -- What kind of impact will cluster funding have on the TeraGRID?

PB: A quick look at the Top500 list shows that world-wide, Linux clusters have been a disruptive technology for high performance computation. About one out of every six teraflop machines is already a Linux cluster. The TeraGrid's successful deployment of extreme-performance clusters for the world's largest Grid applications will be an archetype for future Grid and cluster projects seeking funding.

GT: What date are you looking at now before going live?

PB: We are already live with some real-world applications. However, they are the Chuck Yeagers for the TeraGrid -- pushing the software, hardware, and network to the red line, and at this point, occasionally still crossing it. When the test pilots have finished helping the designers refine the software, we will officially "go live". Remember that the TeraGrid actually has two phases. The first phase hardware has already arrived, and the rest will arrive in the late summer or fall. General users can expect to see the TeraGrid officially go into production in the fall.

GT: What are the first applications that are going to run on this?

PB: The first applications will be Grid-based science applications, everything from the molecular dynamics of proteins to simulating the formation of the first stars after the Big Bang. Hundreds of amazing scientific applications will eventually be running daily on parts of the TeraGrid.

GT: What are the biggest technical challenges you face now?

PB: The hardest technical challenges have centered around the difficulty of integrating multiple cutting-edge technologies simultaneously. We are building the world's fastest optical network for open science (four striped lambdas from Chicago to Los Angeles) at the same time as we are building the software environment for hosting Grid applications on the world's most advanced Linux cluster, and while we are working through kernel and hardware problems for first-run IA64 systems. Doing all three at once has been daunting. Any one of the challenges has the potential to overwhelm a single technical team. Lucky for us, the combined strengths of Argonne National Laboratory / University of Chicago, NCSA, SDSC, Caltech, and the Pittsburgh Supercomputing Center have given us a tremendously deep bench for addressing the technical challenges.

GT: Is security an issue?

PB: Clearly, security has been a key component of the design from the beginning of the project, requiring the team to address both social and technical issues. Like all projects, the early going had hidden pitfalls, as even the most basic assumptions about methodologies, processes, and technologies are challenged at the ambitious scale of the TeraGrid. Forming certificate and identification policies and security procedures was only the first step. The project had to evolve procedures for responding to security incidents across multiple sites -- and we have already had folks attempt crude social engineering attacks against the TeraGrid. Fortunately, the sites participating in the TeraGrid have been around since the birth of the Internet, and participated in its design. They have a lot of experience with both traditional Internet and Grid security.

GT: Will TeraGRID help the industry address improved technical and political cooperation?

PB: The TeraGrid has already been an exemplar for how organizations will cooperate in the future. As you probably know, the National Science Foundation is currently evaluating proposals to extend the existing TeraGrid with new sites, additional networking, and additional resources. The TeraGrid will provide the basis for the NSF's Cyberinfrastructure Program to create an international cyberinfrastructure for scientific and engineering research and allied education. The TeraGrid is growing, and will both directly improve the capabilities for scientific discovery and provide an operational example of how diverse, even competing organizations can cooperate to form a Grid.

GT: What's going to be the first big revelation from the TeraGRID?

PB: I think overcoming the organizational and social issues required to build a large production-quality Grid hosting environment proved to be harder than we planned. I'm a computer scientist, and everyone running the TeraGrid comes from a deep technical background -- we live to solve puzzles and invent new things. While solving the technical problems remains challenging, creating the collaborative environment and organizational structures for constructing the TeraGrid was daunting. The five sites forming the TeraGrid each have autonomy and their own culture -- it was a challenge to bootstrap the sites into collaborating -- kind of like trying to get France, China, Germany, Russia, and the US to agree, maybe harder.

GT: What impact will the TeraGRID have on cluster computing?

PB: The TeraGrid has several components -- one of which is to construct a general purpose software environment for cluster and Grid computing. Until this point, every cluster has had a completely different environment for the user. Sure, some things are generally common across Unix-based environments, but the truth is that most cluster computing environments are constructed using ad hoc methods. To provide a hosting environment for HPC Grid applications that is common across multiple sites, the TeraGrid has invested a tremendous amount of work into defining, constructing, and building the computational environment, as well as into automated testing, verification, and reporting on it. I believe our efforts in this area will be reused, and affect the future of large-scale production clusters.

GT: Does the current technology meet the dream?

PB: Almost... While the project has worked very hard to make sure the technology, and more importantly, the required improvements to the technology are well documented and centrally stored in a repository, there is still a measure of duct tape and baling wire that any project of this size, integrating so many cutting-edge pieces, must endure. I believe the challenge for the TeraGrid will be to learn from and share our experiences with the current generation of cluster and grid technologies with the larger community. Fortunately, interacting with the larger Grid community is easy. Charlie Catlett, the Executive Director for the TeraGrid, happens to also be Chair of the Global Grid Forum (GGF), the international organization supporting the development, deployment, and implementation of Grid technologies, standards, and applications. Members of the TeraGrid actively participate in the GGF, so we can learn from the larger community and share our experiences and code.

GT: What are some of the more compelling GRID-related technologies emerging that interest you?

PB: I've been watching two emerging Grid-related technologies with a lot of interest. The first is seamless access to remote data, in place, without additional copies. Currently, most HPC applications must manually move any input data files from a local storage array to the large parallel storage system attached to the remote computational engine before large-scale remote computation can be performed. With the ultra high-speed networks available to the TeraGrid, it may be possible to use Grid-based technologies to directly connect the computational cluster to remote data thousands of miles away. If we can master that, the usability of the Grid will improve in a way not previously possible.

Secondly, the ability to use visualization clusters to post-process the data, or in some cases process the output data in real time, is very exciting for scientists. Providing straightforward, easy-to-use Grid technologies for remote visualization will dramatically affect scientists' understanding of both the behavior of their application and the science driving it.

GT: What's missing in the technology picture today for making GRIDs more useful in the corporate world?

PB: While many corporations are already incorporating Grid technologies into their workflow (just look at the GGF sponsorship list!) there remain several components that need additional development to accelerate adoption. The TeraGrid, like most Grid projects, is using the Globus Toolkit to construct the underlying Grid infrastructure. The Globus project, started at Argonne National Laboratory and USC ISI back in 1995, has become one of the most common Grid toolkits used in production environments. Globus, however, is plumbing. In much the same way the Apache web server provides the HTTP protocol, the Globus Toolkit provides basic Grid services. Just like installing an Apache HTTPD server doesn't create a corporate e-commerce web site, complete with online ordering and customer tracking, installing the Globus Toolkit does not create a usable corporate Grid. The Grid needs the equivalent of the best-practices and work-flow tools that packages such as Dreamweaver and IBM's e-commerce suite provide for web sites.

GT: Compare & contrast your experiences getting TeraGRID up and running to your venture-funded private company experience getting a new product out the door?

PB: In my previous job, I was the VP of Engineering for a company that had well-established development teams spread across six cities and on three different continents. However, in the end, we were all one company, and I could divide our resources, as the projects required, across the basic engineering tasks: research, development, testing and QA, and support. In the TeraGrid, each site is mostly autonomous, and many of the participants (developers) are juggling a handful of multi-year projects in addition to the TeraGrid. This is simply how research at academic and government laboratories is pursued, and it makes creating a sustainable, focused multi-site development team quite challenging. Of course dealing with venture capitalists, customers who want your product for free, and porting software to legacy operating systems with a fondness for the color blue is also a challenge, but that's a different story altogether....

Pete Beckman is the Director of Engineering, TeraGrid, Argonne National Laboratory. Beckman is a featured speaker at ClusterWorld Conference & Expo at the San Jose Convention Center, June 23-26. At ClusterWorld, Beckman will talk about the latest developments with TeraGrid on June 24 at 10:30am. The title of his talk will be "Building the TeraGrid: The World's Largest Grid, Fastest Linux Cluster, and Fastest Optical Network Dedicated to Open Science".
