DAILY NEWS AND INFORMATION FOR THE GLOBAL GRID COMMUNITY / JUNE 02, 2003: VOL. 2 NO. 22
Special Features:
ENG DIRECTOR REVEALS STATE OF THE TERAGRID
GRIDtoday: As they say in Silicon Valley, what's the elevator pitch on
TeraGRID and your goals?
Pete Beckman: Simply put, the TeraGrid is the National Science Foundation's
project to build the world's largest, most powerful computing environment for
Grid-based computational science. Scientists around the world collaborate each
day by using the Internet to exchange programs and data sets and share
resources. Grid technologies take that a step further, and provide scientists
simple and seamless access to data and resources across multiple, distributed
sites -- improving the pace of innovation and discovery. Combine Grid
technologies with mammoth Linux clusters and the fastest optical network on
the planet for open science and you have the TeraGrid!
GT: What is it about the TeraGRID that people don't know much about but that
you think is important?
PB: The hard part of building successful Grid systems, capable of hosting
hundreds of science projects, is a common software environment. It was a
problem largely unsolved by the Grid community. To build the TeraGrid, we had
to invent, design, and build the Common TeraGrid Software Stack (CTSS). It is
the CTSS that binds our Grid together -- and permits something we call
"TeraGrid Roaming", where scientists can easily move applications and data
across our Grid. For example, scientists can use the Grid at the San Diego
Supercomputer Center (SDSC) to store enormous data sets on terabytes of fast
fibre channel storage while running teraflop computational jobs at the
National Center for Supercomputing Applications (NCSA) in Illinois. Making it
happen in practice, not just in theory, has been the real challenge, and an
important result of our work.
GT: Where is TeraGRID today?
PB: Almost all of the sites have accepted the first wave of hardware, and the
Common TeraGrid Software Stack has been defined and is now going through
testing, evaluation, and scheduled revisions.
GT: How is the funding going? -- What kind of impact will cluster funding have
on the TeraGRID?
PB: A quick look at the Top500 list shows that world-wide, Linux clusters have
been a disruptive technology for high performance computation. About one out
of every six teraflop machines is already a Linux cluster. The TeraGrid's
successful deployment of extreme-performance clusters for the world's largest
Grid applications will be an archetype for future Grid and cluster projects
seeking funding.
GT: What date are you looking at now before going live?
PB: We are already live with some real-world applications. However, they are
the Chuck Yeagers for the TeraGrid -- pushing the software, hardware, and
network to the red line, and at this point, occasionally still crossing it.
When the test pilots have finished helping the designers refine the software,
we will officially "go live". Remember that the TeraGrid actually has two
phases. The first phase hardware has already arrived, and the rest will arrive
in the late summer or fall. General users can expect to see the TeraGrid
officially go into production in the fall.
GT: What are the first applications that are going to run on this?
PB: The first applications will be Grid-based science applications,
everything from the molecular dynamics of proteins to simulating the formation
of the first stars after the big bang. Hundreds of amazing scientific
applications will eventually be running daily on parts of the TeraGrid.
GT: What are the biggest technical challenges you face now?
PB: The hardest technical challenges have centered around the difficulty of
integrating multiple cutting edge technologies simultaneously. Building the
world's fastest optical network for open science (four striped lambdas from
Chicago to Los Angeles) at the same time we are building the software
environment for hosting Grid applications on the world's most advanced Linux
cluster, and at the same time we are working through kernel and hardware
problems for first-run IA64 systems, has been daunting. Any one of the challenges has
the potential to overwhelm a single technical team. Lucky for us, the combined
strengths of Argonne National Laboratory / University of Chicago, NCSA, SDSC,
Caltech, and the Pittsburgh Supercomputing Center have given us a tremendously
deep bench for addressing the technical challenges.
GT: Is security an issue?
PB: Clearly, security has been a key component of the design from the
beginning of the project, requiring the team to address both social and
technical issues. Like all projects, the early going had hidden pitfalls, as
even the most basic assumptions about methodologies, processes, and
technologies are challenged at the ambitious scale of the TeraGrid. Forming
certificate and identification policies and security procedures was only the
first step. The project had to evolve procedures for responding to security
incidents across multiple sites -- and we have already had folks attempt crude
social engineering attacks against the TeraGrid. Fortunately, the sites
participating in the TeraGrid have been around since the birth of the Internet,
and participated in its design. They have a lot of experience with both
traditional Internet and Grid security.
GT: Will TeraGRID help the industry address improved technical and political
cooperation?
PB: The TeraGrid has already been an exemplar for how organizations will
cooperate in the future. As you probably know, the National Science Foundation
is currently evaluating proposals to extend the existing TeraGrid with new
sites, additional networking, and additional resources. The TeraGrid will
provide the basis for the NSF's Cyberinfrastructure Program to create an
international cyberinfrastructure for scientific and engineering research and
allied education. The TeraGrid is growing, and will both directly improve the
capabilities for scientific discovery and provide an operational example
of how diverse, even competing organizations can cooperate to form a Grid.
GT: What's going to be the first big revelation from the TeraGRID?
PB: I think overcoming the organizational and social issues required to build
a large production-quality Grid hosting environment proved to be harder than
we planned. I'm a computer scientist, and everyone running the TeraGrid comes
from a deep technical background -- we live to solve puzzles and invent new
things. While solving the technical problems remains challenging, creating the
collaborative environment and organizational structures for constructing the
TeraGrid was daunting. The five sites forming the TeraGrid each have autonomy
and their own culture -- it was a challenge to bootstrap the sites into
collaborating -- kind of like trying to get France, China, Germany, Russia,
and the US to agree, maybe harder.
GT: What impact will the TeraGRID have on cluster computing?
PB: The TeraGrid has several components -- one of which is to construct a
general purpose software environment for cluster and Grid computing. Until
this point, every cluster has had a completely different environment for the
user. Sure, some things are generally common across unix-based environments,
but the truth is that most cluster computing environments are mostly
constructed using ad hoc methods. To provide a hosting environment for HPC
Grid applications that is common across multiple sites, the TeraGrid has
invested a tremendous amount of work into defining, constructing, and building
the computational environment, as well as into automated testing, verification,
and reporting on it. I believe our efforts in this area will be reused, and
will affect the future of large-scale production clusters.
GT: Does the current technology meet the dream?
PB: Almost... While the project has worked very hard to make sure the
technology, and more importantly, the required improvements to the technology
are well documented and centrally stored in a repository, there is still a
measure of duct tape and baling wire that any project of this size,
integrating so many cutting edge pieces, must endure. I believe the challenge
for the TeraGrid will be to learn from and share our experiences with the
current generation of cluster and grid technologies with the larger community.
Fortunately, interacting with the larger Grid community is easy. Charlie
Catlett, the Executive Director for the TeraGrid, happens to also be Chair of
the Global Grid Forum (GGF), the international organization supporting the
development, deployment, and implementation of Grid technologies, standards,
and applications. Members of the TeraGrid actively participate in the GGF, so
we can learn from the larger community and share our experiences and code.
GT: What are some of the more compelling GRID-related technologies emerging
that interest you?
PB: I've been watching two emerging Grid-related technologies with a lot of
interest. The first is seamless access to remote data, in place, without
additional copies. Currently, most HPC applications must manually move any
input data files from a local storage array to the large parallel storage
system attached to the remote computational engine before large-scale remote
computation can be performed. With the ultra high-speed networks available to
the TeraGrid, it may be possible to use Grid-based technologies to directly
connect the computational cluster to remote data thousands of miles away.
If we can master that, the usability of the Grid will improve in a way not
previously possible.
Secondly, the ability to use visualization clusters to post-process the data,
or in some cases process the output data in real time, is very exciting for
scientists. Providing straightforward, easy-to-use Grid technologies for
remote visualization will dramatically affect scientists' understanding of both
the behavior of their application and the science driving it.
GT: What's missing in the technology picture today for making GRIDs more
useful in the corporate world?
PB: While many corporations are already incorporating Grid technologies into
their workflow (just look at the GGF sponsorship list!), there remain several
components that need additional development to accelerate adoption. The
TeraGrid, like most Grid projects, is using the Globus Toolkit to construct the
underlying Grid infrastructure. The Globus project, started at Argonne
National Laboratory and USC ISI back in 1995, has become one of the most common
Grid toolkits used in production environments. Globus, however, is plumbing.
In much the same way the Apache web server implements the HTTP protocol, the
Globus Toolkit provides basic Grid services. Just like installing an Apache
HTTPD server doesn't create a corporate e-commerce web site, complete with
on-line ordering and customer tracking, installing the Globus Toolkit does not
create a usable corporate Grid. The Grid needs the equivalent of the
best-practices and work-flow tools that packages such as Dreamweaver and IBM's
e-commerce suite provide for web sites.
GT: Compare & contrast your experiences getting TeraGRID up and running to
your venture-funded private company experience getting a new product out the
door?
PB: In my previous job, I was the VP of Engineering for a company that had
well-established development teams spread across six cities and on three
different continents. However, in the end, we were all one company, and I
could divide our resources, as the projects required, across the basic
engineering tasks: research, development, testing and QA, and support. In the
TeraGrid, each site is mostly autonomous, and many of the participants
(developers) are juggling a handful of multi-year projects in addition to the
TeraGrid. This is simply how research is pursued at academic and government
research laboratories, and it makes creating a sustainable, focused multi-site
development team quite challenging. Of course, dealing with venture
capitalists, customers who want your product for free, and porting software to
legacy operating systems with a fondness for the color blue is also a challenge,
but that's a different story altogether....
Pete Beckman is the Director of Engineering, TeraGrid, Argonne National
Laboratory. Beckman is a featured speaker at ClusterWorld Conference & Expo at
the San Jose Convention Center, June 23-26. At ClusterWorld, Beckman will
talk about the latest developments with TeraGrid on June 24 at 10:30am. The
title of his talk will be: Building the TeraGrid: The World's Largest Grid,
Fastest Linux Cluster, and Fastest Optical Network Dedicated to Open Science