Special Features:
WEB SERVICES GRID SUPERCOMPUTING
COMING OF AGE By Neil Alger, Special Correspondent, HPCwire
An interview with Dr. Reagan Moore.
HPCwire: What is Storage Resource Broker?
Dr. Reagan Moore: The Storage Resource Broker (SRB) is DataGrid technology
that is used to manage access to data in a grid distributed environment. When
you think about that problem, there are a lot of implications you have to
worry about in the statement: What do you do if the remote site is running a
different type of storage system than you are? What do you do with the fact
that the remote site is running a different administration domain and a
different authentication environment?
I don't necessarily have my own personal user id installed at that remote
site, which means traditionally I wouldn't be able to get to the data. What
do you do about the fact that you have to discover the name of the file at the
remote site somehow? The traditional method is to call someone on the
telephone. Or you can hope that they have put their information up on the Web
somehow, made it accessible so that you can, through Google maybe, discover the
file magically.
Normally I would have to do that through a Web page. We would like to be
able to automate all of those interactions so that they can be done from an
application, so that you can discover data without knowing its name, you can
access data without knowing where it's located, and retrieve data without
knowing what protocol is required by their storage system.
These are all called transparencies. You want discovery transparency, access
transparency, and protocol transparency. The technology we've built into SRB
implements the concepts we needed to make this work. At the same
time, we had to worry about consistency within this grid distributed
environment.
We wanted a system where we wouldn't be at risk if someone at the remote
site decided to change a file, change its name, move it somewhere else. So we
looked for ways to be able to do that, to guarantee that we could manage a
consistent environment where the data is distributed over multiple sites.
We decided the approach that we would take would be to build a collection
and have the data owned by the collection, which means that when the
collection stores data at a remote site, it stores it under a collection ID,
rather than a user or researcher ID.
That means that when the user tries to get to the data, UNIX will lock them
out, because you can only get in if you're the collection ID. We installed a
server in front of the storage system so that we could manage access to that
data, and that server runs under the collection ID.
And then, from the perspective of the user, we provided the mechanisms
needed for the user to build a global identifier, a persistent identifier, for
each digital entity within the collection.
Those global and persistent identifiers are organized as a logical name
space. These are naming conventions which allow us to label digital entities
independently of the storage system they reside on. It also means we can
choose to add attributes to this logical namespace to manage the rest of the
functionality we want to have with the GRID.
That type of functionality could be access control lists, or permissions,
or authorization to get to the data. It can be versioning control so I can
maintain multiple versions of a digital entity. It can be audit trails, so I
can track all the operations that are done, all the people who have accessed a
piece of data, and keep track of when and by whom a digital entity has been
changed. There are a large number of such capabilities that you can implement,
and they all require associated metadata, and we associate that metadata with
the logical name.
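As a rough illustration of hanging this kind of metadata off a logical name, here
is a minimal Python sketch. It is not SRB's actual schema or API: the class
LogicalEntry, its field names, and the permission strings "read" and "write" are
all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogicalEntry:
    """Hypothetical catalog record: metadata is keyed by the logical name."""
    logical_name: str
    acl: dict = field(default_factory=dict)       # user -> set of permissions
    versions: list = field(default_factory=list)  # physical locations, oldest first
    audit: list = field(default_factory=list)     # (timestamp, user, operation)

    def _log(self, user, operation):
        # Every attempt is recorded, including the ones that are refused.
        self.audit.append((datetime.now(timezone.utc), user, operation))

    def grant(self, user, permission):
        self.acl.setdefault(user, set()).add(permission)

    def write(self, user, physical_location):
        if "write" not in self.acl.get(user, set()):
            self._log(user, "write-denied")
            raise PermissionError(f"{user} may not write {self.logical_name}")
        # A write adds a new version instead of replacing the old one.
        self.versions.append(physical_location)
        self._log(user, "write")

    def read(self, user, version=-1):
        if "read" not in self.acl.get(user, set()):
            self._log(user, "read-denied")
            raise PermissionError(f"{user} may not read {self.logical_name}")
        self._log(user, "read")
        return self.versions[version]  # default: the most recent version
```

Because the access list, version history, and audit trail all live on the
logical entry, none of them has to change when a physical copy is moved.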
That meant that we had to worry about how we were going to organize this
logical namespace. Traditionally you organize the namespace as a directory-
subdirectory hierarchy.
In our case we had to manage these metadata attributes also, so we
organized it as a collection-subcollection hierarchy where every subcollection
could have a different set of attributes. Because we have a logical namespace,
we can impose mappings on that logical namespace. We can map from the logical
name to multiple physical names and support replication, each physical name
corresponding to a copy of that data at a different site.
You can apply the same mapping in the other direction, linking multiple
logical names to a single physical name, and support the UNIX concept of soft
links. And you can do structural mappings, where you map a file into a
container, such that when the user requests that digital entity, the system
will know which container to retrieve and put on disk so that you can read
your individual object.
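The three mappings just described (one logical name to several replicas, several
logical names to one object, and a file packed inside a container) can be sketched
in a few lines of Python. This is an illustrative toy, not the SRB catalog: the
class LogicalNamespace, its method names, and the fetch callback that stands in
for a storage driver are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str           # hypothetical site label
    physical_name: str  # name of the copy on that site's storage system


@dataclass(frozen=True)
class ContainerSlot:
    container: str  # logical name of the container holding the file
    offset: int     # byte offset of the file inside the container
    length: int     # size of the file in bytes


class LogicalNamespace:
    """Toy catalog mapping logical names to physical storage."""

    def __init__(self):
        self._replicas = {}  # logical name -> list of Replica
        self._links = {}     # alias -> target logical name (soft link)
        self._slots = {}     # logical name -> ContainerSlot

    def add_replica(self, logical_name, site, physical_name):
        self._replicas.setdefault(logical_name, []).append(
            Replica(site, physical_name))

    def add_link(self, alias, target):
        self._links[alias] = target

    def pack(self, logical_name, container, offset, length):
        self._slots[logical_name] = ContainerSlot(container, offset, length)

    def resolve(self, logical_name):
        """Follow soft links until a non-link logical name is reached."""
        seen = set()
        while logical_name in self._links:
            if logical_name in seen:
                raise ValueError(f"soft-link cycle at {logical_name!r}")
            seen.add(logical_name)
            logical_name = self._links[logical_name]
        return logical_name

    def replicas(self, logical_name):
        return list(self._replicas.get(self.resolve(logical_name), []))

    def read(self, logical_name, fetch):
        """Return the file's bytes; fetch(replica) stands in for a storage driver."""
        name = self.resolve(logical_name)
        slot = self._slots.get(name)
        if slot is None:
            return fetch(self.replicas(name)[0])
        # Stage the whole container, then slice out the requested file.
        blob = fetch(self.replicas(slot.container)[0])
        return blob[slot.offset:slot.offset + slot.length]
```

The caller only ever passes a logical name; which replica is read, whether a link
was followed, and whether a container had to be staged are all decided inside the
catalog.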
This system then has been used to support data sharing, which basically is
federation across multiple resources. It's been used to support digital
libraries, which means discovery based upon metadata attributes, manipulation
of the metadata, and different services you want to provide for the collection
itself.
And it's been used to build persistent archives, where we have to worry
about managing technology evolution.
How do we build an environment where you can choose to change your
particular storage system and yet still preserve the collection, all of the
digital entities that were sitting on that storage system, independently of
your choice of storage repository?
HPCwire: What types of projects are currently utilizing SRB?
RM: The system is being used by a wide number of projects. They range from
a project with the Stanford Linear Accelerator Center to one with the Joint
Center for Structural Genomics (JCSG).
At the JCSG, they take images, crystallographic refractions of objects
put in their beam line, and then take the resulting image, store it in an
archive in San Diego, and register it in this logical namespace so people can
do a search to discover relevant crystallographic images.
Then they put access controls on it so that only the person who took the
image can access it for some period of time, after which it is opened up.
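A time-limited restriction of this kind can be expressed as a simple check. The
sketch below is only an illustration: the function name can_read, its parameters,
and the 180-day period in the usage lines are assumptions, since the interview
does not say how long the restriction lasts or how it is enforced.

```python
from datetime import datetime, timedelta


def can_read(user, owner, registered_at, embargo, now):
    """Owner always has access; everyone else waits until the embargo ends."""
    return user == owner or now >= registered_at + embargo


# Hypothetical values, for illustration only.
registered = datetime(2002, 11, 1)
embargo = timedelta(days=180)

print(can_read("alice", "alice", registered, embargo, datetime(2002, 12, 1)))
print(can_read("bob", "alice", registered, embargo, datetime(2002, 12, 1)))
print(can_read("bob", "alice", registered, embargo, datetime(2003, 6, 1)))
```

The three calls cover the owner during the restricted period, another user during
it, and another user after it has expired.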
There are projects with the Alliance for Cellular Signaling (AfCS), which
also wants to organize image data, which in this case is microarray data. SRB
is used in the Biomedical Informatics Research Network, BIRN, to federate MRI
facilities so that medical practitioners at Duke, Harvard, San Diego and UCLA
can share images and then request imaging devices.
There's a project with the National Archives to use this technology to
build a persistent archive distributed between the University of Maryland and
the National Archives II in San Diego. There is a project with the Library of
Congress to demonstrate the use of DataGrid technology to manage data.
There are also projects with the NASA Information PowerGrid, which is now
being installed at multiple NASA sites to manage very large data collections.
There are projects with the Department of Energy Particle Physics DataGrid.
Also, the Stanford High-Energy Physics Group is using this technology to
create a logical namespace to register digital entities from their archive.
So everywhere you look, there are groups that need the ability to organize
data, manage it, support discovery, and support access and retrieval in a grid
distributed environment through the particular API that that particular
community wants to use.
HPCwire: One of the projects that you have demoed here at SC02 is the Oasis
Project. How does SRB play into this specific project?
RM: We have an application for the National Virtual Observatory, in which
they are providing access to all-sky surveys. An all-sky survey might have 5
million images comprising 10 terabytes of data, and it should be accessible to
all astronomers, all high school students, anybody who wants to look at a
particular image of the sky at a particular wavelength.
In a collaboration with IPAC at Caltech, that is funded both by NPACI's
Digital Sky Project (which wants to support correlation of stars across
multiple catalogs) and the National Virtual Observatory (which is an NSF
funded project that wants to support the integration of multiple surveys and
support services on those surveys), we are providing them access to a 2-micron
all-sky survey.
These are images taken at the 2-micron wavelength, and the images are
stored in an archive in San Diego, then replicated into an archive at Caltech.
There's a Web interface at IPAC/Caltech through which you can access this
collection and retrieve an image for that particular area of the sky you are
interested in.
As part of the National Virtual Observatory, multiple services are being
created that the astronomers can apply to these collections. They range from
cutout services where I can go to a catalog and pull out information about all
the stars in a given region, to a mosaicing service where I can ask the system
to build a mosaic from multiple images into a larger view of the sky. These
efforts are being supported by the National Virtual Observatory across a large
number of sky surveys such that you can use the same service, the same access
methods across multiple servers, and then be able to integrate the
results.
HPCwire: How large is the complete dataset for the Oasis project?
RM: The 2-micron survey is 10 terabytes: each image is 2 MB, and there are 5
million images. There is a digital Palomar Observatory sky survey, which is 3
terabytes of data, but each image is a gigabyte, so it is made up of 3000
images. There is also a Sloan Digital Sky Survey that is being managed by Johns
Hopkins, and it will be on the order of 15 TB of data.
Next generation sky surveys will be factors of 10 larger, so the amount of
data that astronomers are trying to manage is growing.
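The figures quoted for the two surveys are consistent with each other, as a few
lines of arithmetic show. The sketch assumes decimal units (1 MB = 10^6 bytes,
1 GB = 10^9 bytes, 1 TB = 10^12 bytes), which the interview does not specify.

```python
# Decimal storage units (an assumption; the interview gives no definition).
MB, GB, TB = 10**6, 10**9, 10**12

# 2-micron survey: 5 million images at 2 MB each.
two_micron_images = 5_000_000
two_micron_bytes = two_micron_images * 2 * MB
print(two_micron_bytes / TB)   # 10.0 terabytes

# Palomar survey: 3 TB total at 1 GB per image.
palomar_images = (3 * TB) // GB
print(palomar_images)          # 3000 images
```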
HPCwire: Would you like to talk about SRB and its relation to the GRID as a
whole?
RM: SRB is one example of DataGrid technology.
Every major scientific discipline that has distributed data is right now
building its own DataGrid.
In the high-energy physics community there are already five DataGrids, and
each experiment has built the infrastructure to manage the data that they
have. And what they're interested in doing is understanding where there's
commonality across all of these DataGrid implementations such that people can
start migrating towards a common infrastructure.
The Global Grid Forum is addressing the issue of how we can build common
infrastructures that will support the remote access, remote manipulation, and
remote analysis of data that could be used by any discipline for any
experiment.
The GGF has multiple working groups; some are looking at issues of
security, some are looking at issues of GRID services, some are looking at the
issue of how you do accounting, others look at how you manage database
interactions, how you manage persistent archives, etc. There are over 80
working groups now of people looking at different aspects of the problem.
The Storage Resource Broker is an implementation of this technology which
started in 1996. It is used for storing more than 40TB over 6.5 million files
in this environment, across 25 different projects.
So it's an example of the use of GRID technology to federate access to
multiple storage systems. Because many of the projects we are dealing with are
concerned about digital libraries, collections, or preservation, we have had
to add a very large number of features to this environment to be able to
support these other capabilities. One of the challenges for the grid community
right now is to decide what minimal set of features you really need to
implement a digital library, to federate access to data, or to manage a
persistent archive.
We have done a survey across DataGrids, and at the time the survey was
done, we identified some 152 capabilities. When we looked across all of the
high energy physics DataGrids, and the other DataGrids, we could identify 50
capabilities that were in common across at least five of the seven DataGrids.
And those [50 capabilities] provided the core infrastructure that everybody
realized they should use, including the logical namespace, support for access
to an archive, including a variety of APIs, C++, Java browsers interacting
with the environment.
It included management of metadata about the entities; some of the metadata
management is automated by file size and creation topic. There is a list of
those fifty capabilities that all of the GRIDS are tending to support. For the
SRB we supported 90% of those 152 capabilities, because many of them were
being driven by the digital library community.
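The counting behind "in common across at least five of the seven DataGrids" can
be sketched as follows. The survey data itself is not reproduced in the interview,
so the grid names and capability labels in the example are invented placeholders,
and shared_capabilities is a hypothetical helper, not part of any survey tooling.

```python
from collections import Counter


def shared_capabilities(grids, min_grids):
    """Return the capabilities offered by at least min_grids of the grids."""
    # set(caps) guards against a grid listing the same capability twice.
    counts = Counter(cap for caps in grids.values() for cap in set(caps))
    return {cap for cap, n in counts.items() if n >= min_grids}


# Placeholder data: three made-up grids, threshold of two.
toy_grids = {
    "grid_a": ["logical namespace", "archive access", "audit trail"],
    "grid_b": ["logical namespace", "archive access"],
    "grid_c": ["logical namespace", "containers"],
}
print(sorted(shared_capabilities(toy_grids, min_grids=2)))

# 90% of the 152 surveyed capabilities is roughly 137.
print(round(0.90 * 152))
```

With the real survey, grids would hold seven entries and min_grids would be 5.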
HPCwire: Anything else that you would like to add?
RM: The challenge is that these three communities, the digital library
community, the DataGrid community, and the persistent archive community are
all working on data management.
The solution people want crosses all of these communities. They all want
discovery, they all want federation, and they all want persistence. We have to
understand how to get better collaboration between these three groups on
development problems.