GRIDtoday

DAILY NEWS AND INFORMATION FOR THE GLOBAL GRID COMMUNITY / DECEMBER 16, 2002: VOL. 1 NO. 27

Special Features:

WEB SERVICES GRID SUPERCOMPUTING COMING OF AGE
By Neil Alger, Special Correspondent, HPCwire

An interview with Dr. Reagan Moore.

HPCwire: What is Storage Resource Broker?

Dr. Reagan Moore: The Storage Resource Broker (SRB) is DataGrid technology that is used to manage access to data in a grid distributed environment. When you think about that problem, there are a lot of implications you have to worry about in that statement: What do you do if the remote site is running a different type of storage system than you are? What do you do about the fact that the remote site is in a different administrative domain with a different authentication environment?

I don't necessarily have my own personal user id installed at that remote site, which means traditionally I wouldn't be able to get to the data.

What do you do about the fact that you have to discover the name of the file at the remote site somehow? The traditional method is to call someone on the telephone. Or you can hope that they have put their information up on the Web somehow, made it accessible so that you can, through Google maybe, discover the file magically.

Normally I would have to do that through a Web page. We would like to be able to automate all of those interactions so that they can be done from an application, so that you can discover data without knowing its name, access data without knowing where it's located, and retrieve data without knowing what protocol is required by the storage system.

These are all called transparencies. You want discovery transparency, access transparency, and protocol transparency. The technology we've built into SRB implements the concepts we needed to make this work. At the same time, we had to worry about consistency within this grid distributed environment.
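A minimal Python sketch of the three transparencies, not SRB code: every name here (`Catalog`, `Entry`, `discover`, `fetch`, the "disk" and "tape" protocol labels, the site names and paths) is a hypothetical stand-in. It only shows an application finding data by attribute and reading it without knowing the file name, the site, or the storage protocol.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Entry:
    logical_name: str      # global identifier seen by applications
    site: str              # where the bytes actually live
    protocol: str          # which driver can read them
    physical_path: str     # name used by the remote storage system
    attributes: Dict[str, str] = field(default_factory=dict)


class Catalog:
    """Toy metadata catalog (hypothetical, not the SRB interface)."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []
        # One reader function per storage protocol.
        self._drivers: Dict[str, Callable[[str, str], bytes]] = {}

    def add_driver(self, protocol, reader):
        self._drivers[protocol] = reader

    def register(self, entry):
        self._entries.append(entry)

    def discover(self, **wanted):
        """Discovery transparency: search by attributes, not by file name."""
        return [e.logical_name for e in self._entries
                if all(e.attributes.get(k) == v for k, v in wanted.items())]

    def fetch(self, logical_name):
        """Access and protocol transparency: the caller never sees site or driver."""
        entry = next(e for e in self._entries if e.logical_name == logical_name)
        return self._drivers[entry.protocol](entry.site, entry.physical_path)


# Two fake storage systems standing in for different remote sites.
disk_store = {("siteA", "/d/0001.fits"): b"image-A"}
tape_store = {("siteB", "VOL7:42"): b"image-B"}

catalog = Catalog()
catalog.add_driver("disk", lambda site, path: disk_store[(site, path)])
catalog.add_driver("tape", lambda site, path: tape_store[(site, path)])
catalog.register(Entry("/survey/img1", "siteA", "disk", "/d/0001.fits", {"band": "K"}))
catalog.register(Entry("/survey/img2", "siteB", "tape", "VOL7:42", {"band": "J"}))

names = catalog.discover(band="J")   # found by attribute only
data = catalog.fetch(names[0])       # read without naming the site or protocol
```

The application code in the last two lines never mentions a site, a path, or a protocol; those details stay inside the catalog and its drivers.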

We wanted a system where we wouldn't be at risk if someone at the remote site decided to change a file, change its name, or move it somewhere else. So we looked for ways to be able to do that, to guarantee that we could manage a consistent environment where the data is distributed over multiple sites.

We decided the approach that we would take would be to build a collection and have the data owned by the collection, which means that when the collection stores data at a remote site, it stores it under a collection ID, rather than a user or researcher ID.

That means that when the user tries to get to the data, UNIX will lock them out, because you can only get in if you're the collection ID. We installed a server in front of the storage system so that we could manage access to that data, and that server runs under the collection ID.

And then, from the perspective of the user, we provided the mechanisms needed for the user to build a global identifier, a persistent identifier, for each digital entity within the collection.

Those global and persistent identifiers are organized as a logical name space. These are naming conventions which allow us to label digital entities independently of the storage system they reside on. It also means we can choose to add attributes to this logical namespace to manage the rest of the functionality we want to have with the GRID.

That type of functionality could be access control lists, or permissions, or authorization to get to the data. It can be versioning control so I can maintain multiple versions of a digital entity. It can be audit trails, so I can track all the operations that are done, all the people who have accessed a piece of data, and keep track of when and by whom a digital entity has been changed. There are a large number of such capabilities that you can implement, and they all require associated metadata, and we associate that metadata with the logical name.
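A minimal Python sketch of metadata attached to a logical name, not SRB code: the `LogicalEntry` class, its field names, and the user names are all hypothetical. It shows an access control list, retained versions, and an audit trail living together on one logical entry.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set, Tuple


@dataclass
class LogicalEntry:
    """Metadata hung on a logical name (hypothetical structure)."""
    logical_name: str
    acl: Dict[str, Set[str]] = field(default_factory=dict)            # user -> rights
    versions: List[bytes] = field(default_factory=list)               # oldest first
    audit: List[Tuple[str, str, str]] = field(default_factory=list)   # (time, user, op)

    def _log(self, user, op):
        stamp = datetime.now(timezone.utc).isoformat()
        self.audit.append((stamp, user, op))

    def _check(self, user, right):
        if right not in self.acl.get(user, set()):
            self._log(user, "denied:" + right)   # refused attempts are audited too
            raise PermissionError(user + " may not " + right + " " + self.logical_name)

    def write(self, user, data):
        self._check(user, "write")
        self.versions.append(data)   # keep old versions instead of overwriting
        self._log(user, "write")
        return len(self.versions)    # new version number, starting at 1

    def read(self, user, version=None):
        self._check(user, "read")
        self._log(user, "read")
        index = len(self.versions) - 1 if version is None else version - 1
        return self.versions[index]


entry = LogicalEntry("/collection/run42",
                     acl={"alice": {"read", "write"}, "bob": {"read"}})
entry.write("alice", b"v1")
entry.write("alice", b"v2")
latest = entry.read("bob")             # newest version
first = entry.read("bob", version=1)   # an older version is still available
try:
    entry.write("bob", b"oops")        # bob only has read rights
    denied = False
except PermissionError:
    denied = True
ops = [(user, op) for _, user, op in entry.audit]
```

Because every check and every operation goes through the logical entry, the audit list records who did what regardless of where the bytes are stored.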

That meant that we had to worry about how we were going to organize this logical namespace. Traditionally you organize the namespace as a directory-subdirectory hierarchy.

In our case we had to manage these metadata attributes also, so we organized it as a collection-subcollection hierarchy where every subcollection could have a different set of attributes. Because we have a logical namespace, we can impose mappings on that logical namespace. We can map from the logical name to multiple physical names and support replication, each physical name corresponding to a copy of that data at a different site.

You can apply the same mapping but have multiple logical names linked to a single physical name, and support the UNIX concept of soft links. And you can do structural mappings, where you map a file into a container, such that when the user requests that digital entity, the system will know which container to go to, retrieve it, put it on disk, and allow you to read your individual object.
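A minimal Python sketch of these three mappings, not SRB code: the `Namespace` class, the "site:path" naming convention, and the container layout of (container, offset, length) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class Namespace:
    """Toy logical namespace with three kinds of mapping (all names hypothetical)."""

    def __init__(self) -> None:
        self.replicas: Dict[str, List[str]] = {}             # logical -> physical copies
        self.links: Dict[str, str] = {}                      # alias -> target logical name
        self.members: Dict[str, Tuple[str, int, int]] = {}   # logical -> (container, offset, length)

    def add_replica(self, logical, physical):
        self.replicas.setdefault(logical, []).append(physical)

    def add_link(self, alias, target):
        self.links[alias] = target

    def add_member(self, logical, container, offset, length):
        self.members[logical] = (container, offset, length)

    def resolve(self, logical):
        """Follow soft links until a real logical name is reached."""
        seen = set()
        while logical in self.links:
            if logical in seen:
                raise ValueError("link cycle at " + logical)
            seen.add(logical)
            logical = self.links[logical]
        return logical

    def locate(self, logical, available_sites):
        """Return the first replica whose site is currently reachable."""
        for physical in self.replicas[self.resolve(logical)]:
            site = physical.split(":", 1)[0]
            if site in available_sites:
                return physical
        raise LookupError("no reachable replica for " + logical)

    def read_member(self, logical, staged_containers):
        """Slice one object out of a container already staged to disk."""
        container, offset, length = self.members[self.resolve(logical)]
        return staged_containers[container][offset:offset + length]


ns = Namespace()
ns.add_replica("/sky/img1", "siteA:/archive/0001")   # one logical name,
ns.add_replica("/sky/img1", "siteB:/mirror/0001")    # two physical copies
ns.add_link("/home/me/favorite", "/sky/img1")        # soft link to the same entity
ns.add_member("/sky/small7", "bundle-A", 4, 3)       # small object packed in a container

copy = ns.locate("/home/me/favorite", available_sites={"siteB"})
obj = ns.read_member("/sky/small7", {"bundle-A": b"xxxxABCyyyy"})
```

With only `siteB` reachable, the lookup through the soft link falls back to the second replica, and the container read returns just the three bytes that belong to the requested object.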

This system then has been used to support data sharing, which basically is federation across multiple resources. It's been used to support digital libraries, which means discovery based upon metadata attributes, manipulation of the metadata, and different services you want to provide for the collection itself.

And it's been used to build persistent archives, where we have to worry about managing technology evolution.

How do we build an environment where you can choose to change your particular storage system and yet still preserve the collection, all of the digital entities that were sitting on that storage system, independently of your choice of storage repository?

HPCwire: What types of projects are currently utilizing SRB?

RM: The system is being used by a wide number of projects. They range from a project with the Stanford Linear Accelerator Center to one with the Joint Center for Structural Genomics (JCSG).

At the JCSG, they take images, crystallographic refractions of objects put in their beam line, and then take that resulting image and store it in an archive in San Diego, and register it in this logical namespace so people can do a search to discover relevant crystallographic images.

Then they put access controls on it so that just the person who took the image can access it for some period of time, and then open it up.

There are projects with the Alliance for Cellular Signaling (AfCS), which also wants to organize image data, which in this case is microarray data. SRB is used in the Biomedical Informatics Research Network, BIRN, to federate MRI facilities so that medical practitioners at Duke, Harvard, San Diego and UCLA can share images and then request imaging devices.

There's a project with the National Archives to use this technology to build a persistent archive distributed between the University of Maryland and the National Archives II in San Diego. There is a project with the Library of Congress to demonstrate the use of DataGrid technology to manage data.

There are also projects with the NASA Information PowerGrid, which is now being installed at multiple NASA sites to manage very large data collections. There are projects with the Department of Energy Particle Physics DataGrid. Also, the Stanford High-Energy Physics Group is using this technology to create a logical namespace to register digital entities from their archive.

So everywhere you look, there are groups that need the ability to organize data, manage it, support discovery, and support access and retrieval in a grid distributed environment through the particular API that that particular community wants to use.

HPCwire: One of the projects that you have demoed here at SC02 is the Oasis Project. How does SRB play into this specific project?

RM: We have an application for the National Virtual Observatory, in which they are providing access to all-sky surveys. An all-sky survey might have 5 million images comprising 10 terabytes of data, and it should be accessible to all astronomers, all high school students, anybody who wants to look at a particular image of the sky at a particular wavelength.

In a collaboration with IPAC at Caltech that is funded both by NPACI's Digital Sky Project (which wants to support correlation of stars across multiple catalogs) and the National Virtual Observatory (which is an NSF-funded project that wants to support the integration of multiple surveys and support services on those surveys), we are providing them access to a 2-micron all-sky survey.

These are images taken at the 2-micron wavelength, and the images are stored in an archive in San Diego, then replicated into an archive at Caltech. There's a Web interface at IPAC/Caltech through which you can access this collection and retrieve an image for the particular area of the sky you are interested in.

As part of the National Virtual Observatory, multiple services are being created that the astronomers can apply to these collections. They range from cutout services where I can go to a catalog and pull out information about all the stars in a given region, to a mosaicing service where I can ask the system to build a mosaic from multiple images into a larger view of the sky. These efforts are being supported by the National Virtual Observatory across a large number of sky surveys such that you can use the same service, the same access methods across multiple servers, and then be able to integrate the results.

HPCwire: How large is the complete dataset for the Oasis project?

RM: The 2-micron survey is 10 terabytes: each image is 2 MB, and there are 5 million images. There is a Digital Palomar Observatory Sky Survey, which is 3 terabytes of data, but each image is a gigabyte, so it is made up of 3000 images. There is also the Sloan Digital Sky Survey that is being managed by Johns Hopkins, and it will be on the order of 15 TB of data.
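The survey sizes quoted above follow from image count times image size. A quick check in Python, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes, 1 TB = 10^12 bytes); the variable names are illustrative only.

```python
MB = 10**6
GB = 10**9
TB = 10**12

# 2-micron survey: 5 million images at 2 MB each.
two_micron_tb = 5_000_000 * 2 * MB / TB

# Palomar survey: 3000 images at 1 GB each.
palomar_tb = 3_000 * 1 * GB / TB

print(two_micron_tb, palomar_tb)
```

Both figures match the totals given in the answer: 10 TB for the 2-micron survey and 3 TB for the Palomar survey.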

Next generation sky surveys will be factors of 10 larger, so the amount of data that astronomers are trying to manage is growing.

HPCwire: Would you like to talk about SRB and its relation to the GRID as a whole?

RM: SRB is one example of DataGrid technology.

Every major scientific discipline that has distributed data is right now building its own DataGrid.

In the high-energy physics community there are already five DataGrids, and each experiment has built the infrastructure to manage the data that they have. And what they're interested in doing is understanding where there's commonality across all of these DataGrid implementations such that people can start migrating towards a common infrastructure.

The Global Grid Forum is addressing the issue of how we can build common infrastructures that will support the remote access, remote manipulation, and remote analysis of data that could be used by any discipline for any experiment.

The GGF has multiple working groups; some are looking at issues of security, some are looking at issues of GRID services, some are looking at the issue of how you do accounting, others look at how you manage database interactions, how you manage persistent archives, etc. There are over 80 working groups now of people looking at different aspects of the problem.

The Storage Resource Broker is an implementation of this technology which started in 1996. It is used for storing more than 40 TB across 6.5 million files in this environment, across 25 different projects.

So it's an example of the use of GRID technology to federate access to multiple storage systems. Because many of the projects we are dealing with are concerned about digital libraries, collections, or preservation, we have had to add a very large number of features to this environment to be able to support these other capabilities. One of the challenges for the grid community right now is to decide what is the minimal set of features you really need to implement a digital library, to federate access to data, or to manage a persistent archive.

We have done a survey across DataGrids, and at the time the survey was done, we identified some 152 capabilities. When we looked across all of the high energy physics DataGrids, and the other DataGrids, we could identify 50 capabilities that were in common across at least five of the seven DataGrids. And those [50 capabilities] provided the core infrastructure that everybody realized they should use, including the logical namespace, support for access to an archive, including a variety of APIs, C++, Java browsers interacting with the environment.

It included management of metadata about the entities; some of the metadata management is automated by file size and creation topic. There is a list of those fifty capabilities that all of the GRIDS are tending to support. For the SRB we supported 90% of those 152 capabilities, because many of them were being driven by the digital library community.

HPCwire: Anything else that you would like to add?

RM: The challenge is that these three communities, the digital library community, the DataGrid community, and the persistent archive community are all working on data management.

The solution people want crosses all of these communities. They all want discovery, they all want federation, and they all want persistence. We have to understand how to get better collaboration between these three groups on development problems.
