GRIDtoday

DAILY NEWS AND INFORMATION FOR THE GLOBAL GRID COMMUNITY / SEPTEMBER 22, 2003: VOL. 2 NO. 38

Special Features:

BLACKOUT LESSONS AND GRID COMPUTING
By Dmitri Tcherevik, VP, Computer Associates Intl

The idea of Grid computing is rapidly gaining popularity in the IT industry. It promises to make computing power, storage resources, network capacity and business application components available on demand over the Internet. Companies no longer have to provision their IT capacity at levels that are multiples of the capacity required by peak workloads. When additional capacity is needed, it can be dynamically allocated from the worldwide computing Grid. Similarly, enterprises no longer need to locally develop, install and maintain all of the application components used by their business processes. Some of these components can be purchased on a subscription basis from third parties, such as www.salesforce.com or Mappoint, and used remotely over the Internet. A business process then becomes an assembly of locally installed components and components installed elsewhere in the computing Grid.

Mixing business application components and raw computing resources in one paragraph may sound unusual. Yet there is nothing unusual about it. Both types of resources are encapsulated as XML-based Web services. Web services form an abstraction layer on top of the traditional IT infrastructure. This layer makes it possible to share IT resources across enterprise and data center boundaries.

Grid computing is very appealing at many levels. It promises to reduce the cost of computing and increase the productivity of IT personnel. In addition, it allows companies to dynamically allocate enormous computing resources from a shared IT pool and use these resources to solve challenging computing problems. Many people have little doubt that Grid computing is the primary evolutionary direction for the entire computing industry. The main argument seems to be that as the computing industry matures, it becomes similar to other "high-tech" industries of the past such as electricity, railroad transportation and telephony. Therefore, the computing Grid is the equivalent of other grids we see naturally occurring in the economy, such as the electricity grid, the transportation grid and the telephony grid.

This analogy worked well when one had to convince a pessimist, but recently it backfired in an eerie way. I am speaking, of course, about the recent spectacular problems with the transportation and the electricity grids. First, we experienced the biggest electricity blackout in the United States', and probably the world's, history. Second, parts of the railroad transportation network were incapacitated by a computer worm, causing tremendous disruptions in train schedules. In both cases, it was the grid infrastructure that failed miserably. In one case, the grid infrastructure was used to share electricity, and in the other case, the grid infrastructure was used to share the railroad network capacity.

What lessons, aside from the obvious one that we must be careful with analogies, can an IT professional draw from these events? Is Grid computing overrated? Will it ever work or be suitable for mission-critical applications? The idea of a global computing blackout caused by a disruption in a secondary data center can send a chill down anyone's spine. Yet, if history is any indication, there will always be a group of risk takers and reckless souls for whom the promise of sure benefits will outweigh the threat of hypothetical disasters. As soon as they start reaping the benefits, the rest of the industry will follow. Grid computing is bound to happen. The question then is: What can we do as an industry to prevent fiascos of the type experienced by the electricity and transportation grids?

In the search for an answer, it is useful to study the analysis produced by a group of industry officials and experts in the aftermath of the electricity blackout. We may never get to know the real reasons for the blackout, but elements of this analysis are now appearing in the press. These are some of the reasons summarized in a recent New York Times article:

  • Midwest I.S.O. is a regional electricity grid manager responsible for a part of the grid that caused the blackout. It is a new and still-forming agency that operates with limited real-time information about what is happening on the transmission lines it oversees. At the time of the disaster, its operators did not have an up-to-date picture of what was happening in the grid. As a consequence, they did not have the information that would allow them to make the decisions that would prevent the disaster from happening.
  • On the day of the blackout, Midwest I.S.O. faced a computer malfunction at a big utility it monitors, which exacerbated the lack of visibility problem. In addition, it had to deal with its own computer problems. A state estimator, the program that helps operators monitor grid conditions and predict consequences of the various events and changes in the grid, was not working. The operators lacked real-time information to support their decisions and lacked a mechanism for estimating effects of these decisions on the state of the grid.
  • Midwest I.S.O. shares responsibility for the Midwestern part of the electricity grid with a number of other monitoring entities. No single entity has a comprehensive view of the state of the electricity grid in the region. As a consequence, no single entity was in a position to prevent the disaster from happening. Apparently, no mechanism exists that would allow the different entities to cooperate when addressing a problem that spans their respective areas of control.
  • Even if Midwest I.S.O. had been aware of all that was happening and had all the tools to make all the right decisions, its efforts to avert the collapse might have been hindered by its lack of authority: it cannot order power authorities to act in an emergency. Its voice is purely advisory.

These findings have very important implications for the design and implementation of the emerging Grid computing infrastructure.

First, it is very likely that we will see, in the near future, the emergence of monitoring and overseeing agencies responsible for controlling portions of the computing Grid. These agencies will be similar in function to Midwest I.S.O. Providers of IT infrastructure outsourcing and application services may serve as the equivalent of today's electrical power utilities. A single business process in an enterprise may be a client of several such computing "utilities". It is reasonable to assume that enterprises will welcome the emergence of a new class of service intermediaries that will help them manage their relationship with the computing utilities and manage distribution of the shared computing resources across the Internet. Service intermediaries that perform some of these functions already exist, e.g. in the form of value-added networks (VANs) for Web services.

Second, to be effective, service intermediaries will need direct, real-time access to management information describing the computing resources available from the various service providers. For each resource, this information will have to include a general description of its capabilities, its current configuration, its current state, its current workload, the events it can generate and other types of information. The service intermediary will also need access to management operations exposed by the resource, e.g. the ones that can be used to increase or decrease its capacity.

Since the service intermediary will rarely have direct access to the IT infrastructure used by computing utilities, the management information will have to be available at the level of the computing resources and not at the level of the IT infrastructure supporting these resources. This means that Web services representing the computing resources in the Grid will have to be directly manageable. In addition to traditional "business" interfaces, each service will have to expose a collection of management interfaces.
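
As a minimal sketch of what "directly manageable" could mean, the Python below puts a business operation and a separate set of management operations on the same service object. Every name in it (ManagedStorageService, store, describe, set_capacity) is a hypothetical illustration and not part of any Web services standard; a real service would expose these operations over the network.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedStorageService:
    """Hypothetical Grid resource with business and management interfaces."""
    capacity_gb: int
    used_gb: int = 0
    events: list = field(default_factory=list)

    # --- business interface: what ordinary clients call ---
    def store(self, size_gb: int) -> bool:
        """Accept data if it fits; otherwise record an event and refuse."""
        if self.used_gb + size_gb > self.capacity_gb:
            self.events.append("capacity_exceeded")
            return False
        self.used_gb += size_gb
        return True

    # --- management interface: what a service intermediary calls ---
    def describe(self) -> dict:
        """Report current configuration, state, and workload."""
        return {
            "capacity_gb": self.capacity_gb,
            "used_gb": self.used_gb,
            "utilization": self.used_gb / self.capacity_gb,
        }

    def set_capacity(self, capacity_gb: int) -> None:
        """Increase or decrease capacity; refuse to shrink below current use."""
        if capacity_gb < self.used_gb:
            raise ValueError("cannot shrink below current usage")
        self.capacity_gb = capacity_gb
```

In this sketch the intermediary never touches the infrastructure behind the service: it only reads describe() and calls set_capacity(), which is the separation the paragraph above argues for.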

In addition, for the management information to be available to service intermediaries in real time, it must be proactively pushed by the utilities to the service intermediary, as opposed to the intermediary polling the utilities at regular intervals to retrieve updates. This underscores the importance of an efficient event management mechanism. It becomes a critical element of the Grid computing infrastructure. Events will have to be transmitted in XML reliably over the Internet.
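
The push model can be illustrated with a small sketch, again in Python. It assumes a hypothetical Utility that encodes each state change as an XML event and delivers it to subscribed intermediaries. The element and attribute names are made up for this example, and an in-process callback stands in for the network transport; reliable delivery (acknowledgements, retries) is deliberately left out.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple


class Utility:
    """Hypothetical computing utility that pushes events to subscribers."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def emit(self, resource: str, state: str) -> None:
        # Encode the event as XML (element and attribute names are invented).
        event = ET.Element("event", {"source": self.name, "resource": resource})
        ET.SubElement(event, "state").text = state
        payload = ET.tostring(event, encoding="unicode")
        # Push to every subscriber immediately; nobody has to poll.
        for callback in self._subscribers:
            callback(payload)


class Intermediary:
    """Keeps the latest known state per resource, updated on each push."""

    def __init__(self) -> None:
        self.view: Dict[Tuple[str, str], str] = {}

    def on_event(self, payload: str) -> None:
        event = ET.fromstring(payload)
        key = (event.get("source"), event.get("resource"))
        self.view[key] = event.findtext("state")
```

With polling, the intermediary's view can be stale by up to one polling interval; here the view changes as soon as the utility calls emit().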

Third, due to the complex, dynamic and distributed nature of the computing Grid, no deterministic state transition model may exist for the overall computing environment. A management action performed by a service intermediary may percolate through the Grid in unexpected ways and cause undesirable consequences. Our industry will likely have to develop and rely on non-deterministic models reflecting the behavior of the computing Grid. Just as a failure in the electrical grid state estimating program caused trouble during the recent blackout, failures in the computing grid modeling software may lead to unforeseen and large-scale problems in the Grid computing environment. These problems may become especially pronounced when principles of autonomic computing and self-healing become widely adopted.
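
One simple form such a non-deterministic model could take is a Monte Carlo simulation. The toy sketch below is an illustration only: it assumes that every node has the same capacity, that a failed node's load is spread evenly over the surviving nodes, and that initial loads are drawn uniformly between 30% and 90% of capacity. None of these assumptions comes from a real Grid; they exist to show how a failure can percolate and how its likelihood can be estimated by sampling.

```python
import random


def cascade(loads, capacity, first_failure):
    """Fail one node, spread its load evenly over survivors, repeat.

    Returns the number of nodes that end up failed.
    """
    loads = list(loads)
    alive = set(range(len(loads)))
    to_fail = [first_failure]
    while to_fail:
        node = to_fail.pop()
        if node not in alive:
            continue  # already failed earlier in the cascade
        alive.discard(node)
        if not alive:
            break  # total blackout
        share = loads[node] / len(alive)
        for other in alive:
            loads[other] += share
        # Any survivor pushed over capacity fails in a later iteration.
        to_fail.extend(n for n in alive if loads[n] > capacity)
    return len(loads) - len(alive)


def blackout_probability(n_nodes, capacity, trials, seed=0):
    """Estimate P(total blackout) over randomly drawn load profiles."""
    rng = random.Random(seed)
    blackouts = 0
    for _ in range(trials):
        loads = [rng.uniform(0.3, 0.9) * capacity for _ in range(n_nodes)]
        failed = cascade(loads, capacity, rng.randrange(n_nodes))
        blackouts += failed == n_nodes
    return blackouts / trials
```

For example, cascade([0.5, 0.5, 0.5], 1.0, 0) stops after the first failure, while cascade([0.9, 0.9, 0.9], 1.0, 0) takes down all three nodes, because the survivors have no headroom to absorb the extra load.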

Fourth, due to the expected large scale of the computing Grid, it is highly unlikely that we will end up with a single service intermediary. The electricity grid now has a large number of overseeing entities responsible for different geographic regions. Similarly, the computing Grid infrastructure will have a number of service intermediaries, each responsible for its own "region" of computing. The "regions" can be formed in accordance with the type of computing resources distributed in the Grid (e.g. storage versus network capacity), the type of industry vertical (e.g. the financial services Grid as opposed to the health care Grid), administrative or political boundaries, and other principles.

The recent electricity blackout was caused to an extent by a problem that crossed multiple spheres of control, and by the lack of cooperation among the various monitoring entities required to address such a problem. In order to avoid similar disasters in the computing Grid, service intermediaries will have to perform cooperative and federated management of the grid environment. Federated management will require exchange of real-time management information and management policies. It may also require introduction of higher level service intermediaries, which may lead to a hierarchical management structure.
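
A hierarchical arrangement of this kind can be sketched as follows. The classes RegionalIntermediary and Federation are hypothetical, and the placement rule (draw on the region with the most spare capacity first) is an assumption chosen for brevity. The point is only that a higher-level intermediary with a combined view can satisfy a demand that no single region could serve alone.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RegionalIntermediary:
    """Hypothetical intermediary that monitors one 'region' of the Grid."""
    name: str
    # Spare capacity reported by each utility in the region.
    spare_capacity: Dict[str, int] = field(default_factory=dict)

    def total_spare(self) -> int:
        return sum(self.spare_capacity.values())


@dataclass
class Federation:
    """Higher-level intermediary holding a combined view of all regions."""
    regions: List[RegionalIntermediary]

    def combined_view(self) -> Dict[str, int]:
        return {r.name: r.total_spare() for r in self.regions}

    def place(self, demand: int) -> Dict[str, int]:
        """Split a demand across regions, largest spare capacity first.

        Raises RuntimeError if the federation as a whole cannot serve it.
        """
        plan: Dict[str, int] = {}
        remaining = demand
        for region in sorted(self.regions, key=lambda r: -r.total_spare()):
            if remaining == 0:
                break
            take = min(remaining, region.total_spare())
            if take:
                plan[region.name] = take
                remaining -= take
        if remaining:
            raise RuntimeError("insufficient capacity across federation")
        return plan
```

With an "east" region holding 50 spare units and a "west" region holding 30, a demand of 60 exceeds either region but is placed as 50 in the east and 10 in the west.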

Finally, in order to make federated management of the computing Grid possible, computing utilities will have to cede some control over their resources to service intermediaries. Just as Midwest I.S.O. was hindered in its attempts to avert the electricity blackout by its lack of authority over power utilities, a service intermediary will be ill-positioned to prevent a computing blackout if it does not have the authority not only to monitor the computing resources, but also to control their state and behavior.

This raises an interesting security and access control problem. If the behavior and state of computing resources are easily controllable over the Internet by service intermediaries, a security breach in the Grid computing infrastructure can easily be used by a hacker to inflict a "man-made" computing blackout. Since Web services are used to both access and control computing resources in the Grid, the issues of Web services security and access control become increasingly important.

In summary, issues such as Web services manageability and control; real-time event management over the Internet; stochastic modeling of computing Grids; federated management of Web services; and Web services security and access control will play a critical role in the emergence of a reliable Grid computing environment suitable for supporting critical business functions and processes. Until we see these areas sufficiently developed, applications relying on Grid computing principles will be subject to periodic IT disruptions and computing blackouts.
