DAILY NEWS AND INFORMATION
FOR THE GLOBAL GRID COMMUNITY / SEPTEMBER 22, 2003: VOL. 2 NO. 38
Special Features:
BLACKOUT LESSONS AND GRID COMPUTING
By Dmitri Tcherevik, VP, Computer Associates Intl
The idea of Grid computing is rapidly gaining popularity in the IT
industry.
It promises to make computing power, storage resources, network capacity and
business application components available on-demand over the Internet.
Companies no longer have to provision their IT capacity at levels that are
multiples of the capacity required by peak workloads. When additional capacity
is needed, it can be dynamically allocated from the worldwide computing Grid.
Similarly, enterprises no longer need to locally develop, install and maintain
all of the application components used by their business processes. Some of
these components can be purchased on a subscription basis from third parties,
such as www.salesforce.com or Mappoint, and used remotely over the Internet.
A business process then becomes an assembly of locally installed
components and components installed elsewhere in the computing Grid.
The mix of business application components and raw computing resources in one
paragraph may sound unusual. Yet, there is nothing unusual about it. Both
types of resources are encapsulated as XML-based Web services. Web services
form an abstraction layer on top of the traditional IT infrastructure. This
layer makes sharing of IT resources across enterprise and data center
boundaries possible.
Grid computing is very appealing at many levels. It promises to reduce the
cost of computing and increase the productivity of IT personnel. In addition,
it allows companies to dynamically allocate enormous computing resources from
a shared IT pool and use these resources to solve challenging computing
problems. There is little doubt left in many people's minds that Grid
computing is the primary evolutionary direction for all of the computing
industry. The main argument seems to be that as the computing industry
matures, it becomes similar to other "high-tech" industries of the past such
as electricity, railroad transportation and telephony. Therefore, the
computing Grid is an equivalent of other grids we see naturally occurring in
the economy such as the electricity grid, the transportation grid and the
telephony grid.
This analogy worked really well when one had to convince a pessimist, but
recently it backfired in an eerie way. I am speaking, of course, about the
recent spectacular problems with the transportation and the electricity grids.
First, we experienced the biggest electricity blackout in the United States',
and probably the world's, history. Second, parts of the railroad
transportation network became incapacitated by a computer worm, causing
tremendous disruptions in train schedules. In both cases, it was the grid
infrastructure that failed miserably. In one case, the grid infrastructure was
used to share electricity, and in the other case, the grid infrastructure was
used to share the railroad network capacity.
What lessons, aside from the obvious one that we must be careful with
analogies, can an IT professional draw from these events? Is Grid computing
overrated? Will it ever work or be suitable for mission critical applications?
The idea of a global computing blackout caused by a disruption in a secondary
data center can send a chill down anyone's spine. Again, if history is any
indication, there will always be a group of risk takers and reckless souls for
whom the promise of sure benefits will outweigh the threat of hypothetical
disasters. As soon as they start reaping the benefits, the rest of the
industry will follow. Grid computing is bound to happen. The question then is:
What can we do as an industry to prevent fiascos of the type experienced by
the electricity and transportation grids?
In the search for an answer, it is useful to study the analysis produced by a
group of industry officials and experts in the aftermath of the electricity
blackout. We may never get to know the real reasons for the blackout, but
elements of this analysis are now appearing in the press. These are some of
the reasons summarized in a recent New York Times article:
- Midwest I.S.O. is a regional electricity grid manager responsible for a
part of the grid that caused the blackout. It is a new and still-forming
agency that operates with limited real-time information about what is
happening on the transmission lines it oversees. At the time of the disaster,
its operators did not have an up-to-date picture of what was happening in the
grid. As a consequence, they did not have the information that would allow
them to make the decisions that would prevent the disaster from
happening.
- On the day of the blackout, Midwest I.S.O. faced a computer malfunction at
a big utility it monitors, which exacerbated the lack of visibility problem.
In addition, it had to deal with its own computer problems. A state estimator,
the program that helps operators monitor grid conditions and predict
consequences of the various events and changes in the grid, was not working.
The operators lacked real-time information to support their decisions and
lacked a mechanism for estimating effects of these decisions on the state of
the grid.
- Midwest I.S.O. shares responsibility for the Midwestern part of the
electricity grid with a number of other monitoring entities. No single entity
has a comprehensive view of the state of the electricity grid in the region.
As a consequence, no single entity was in a position to prevent the disaster
from happening. Apparently, no mechanism exists that would allow the different
entities to cooperate when addressing a problem that spans their respective
areas of control.
- Even if Midwest I.S.O. had been aware of all that was happening and had all
the tools to make all the right decisions, its efforts to avert the collapse
might have been hindered by its lack of authority: it cannot order power
authorities to act in an emergency. Its voice is purely advisory.
These findings have very important implications for the design and
implementation of the emerging Grid computing infrastructure.
First, it is very likely that we will see, in the near future, the emergence of
monitoring and overseeing agencies responsible for controlling portions of the
computing Grid. These agencies will be similar in function to Midwest I.S.O.
Providers of IT infrastructure outsourcing and application services may serve
as an equivalent of today's electrical power utilities. A single business
process in an enterprise may be a client of several such computing
"utilities". It is reasonable to assume that enterprises will welcome
the emergence of a new class of service intermediaries that will help them
manage their relationships with the computing utilities and manage distribution
of the shared computing resources across the Internet. Service intermediaries
that perform some of these functions already exist, e.g. in the form of
value-added networks (VANs) for Web services.
Second, the service intermediaries, in order to be effective, will need direct
and real-time access to management information describing computing resources
available from the various service providers. This information will have to
include a general description of the resource's capabilities, its current
configuration, its current state, its current workload, events it can generate
and other types of information. The service intermediary will also need access
to management operations exposed by the resource, e.g. the ones that can be
used to increase or decrease its capacity.
Since the service intermediary will rarely have direct access to the IT
infrastructure used by computing utilities, the management information will
have to be available at the level of the computing resources and not at the
level of the IT infrastructure supporting these resources. This means that Web
services representing the computing resources in the Grid will have to be
directly manageable. In addition to traditional "business" interfaces, each
service will have to expose a collection of management interfaces.
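As a rough illustration of the split between "business" and management interfaces, the Python sketch below models a toy storage service that exposes both. Every name in it (ManagementInfo, ManagedStorageService, resize, the field names, the 0.9 threshold) is hypothetical and chosen only for illustration; none of it corresponds to an existing Web services management standard.

```python
from dataclasses import dataclass, field


@dataclass
class ManagementInfo:
    """Snapshot of management data an intermediary might need (hypothetical fields)."""
    description: str
    configuration: dict
    state: str        # "running" or "degraded" in this sketch
    workload: float   # fraction of capacity in use, 0.0 to 1.0
    event_types: list = field(default_factory=list)


class ManagedStorageService:
    """A toy storage service with separate business and management interfaces."""

    def __init__(self, capacity_gb: int):
        if capacity_gb <= 0:
            raise ValueError("capacity_gb must be positive")
        self._capacity_gb = capacity_gb
        self._used_gb = 0

    # Business interface: what ordinary clients call.
    def store(self, size_gb: int) -> bool:
        if self._used_gb + size_gb > self._capacity_gb:
            return False
        self._used_gb += size_gb
        return True

    # Management interface: what a service intermediary calls.
    def get_management_info(self) -> ManagementInfo:
        workload = self._used_gb / self._capacity_gb
        return ManagementInfo(
            description="toy storage service",
            configuration={"capacity_gb": self._capacity_gb},
            state="degraded" if workload > 0.9 else "running",
            workload=workload,
            event_types=["capacity_changed", "workload_high"],
        )

    def resize(self, delta_gb: int) -> int:
        """Increase or decrease capacity; never shrink below what is in use."""
        self._capacity_gb = max(self._used_gb, self._capacity_gb + delta_gb, 1)
        return self._capacity_gb
```

In this sketch an intermediary that sees state "degraded" could call resize to add capacity without ever touching the infrastructure behind the service, which is the kind of resource-level manageability described above.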
Finally, in order for the management information to be available to service
intermediaries in real-time, it must be proactively pushed by the utilities to
the service intermediary as opposed to the intermediary polling the utilities
at regular intervals to retrieve updates. This underscores the importance of
an efficient event management mechanism, which becomes a critical element of
the Grid computing infrastructure. Events will have to be transmitted in XML
reliably over the Internet.
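The push model can be sketched in a few lines of Python. The example below is a minimal in-process illustration, not a network implementation: the XML layout, the Utility and Intermediary classes, and the acknowledge-and-retry scheme are all assumptions made for the sake of the example. It uses only the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET


def build_event(resource_id: str, event_type: str, value: str) -> str:
    """Serialize a management event as a small XML document (made-up schema)."""
    event = ET.Element("event", {"resource": resource_id, "type": event_type})
    ET.SubElement(event, "value").text = value
    return ET.tostring(event, encoding="unicode")


class Utility:
    """Pushes events to subscribers and keeps unacknowledged ones for retry."""

    def __init__(self):
        self._subscribers = []
        self._pending = []  # (callback, xml_text) pairs not yet acknowledged

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, resource_id, event_type, value):
        xml_text = build_event(resource_id, event_type, value)
        for callback in self._subscribers:
            self._deliver(callback, xml_text)

    def _deliver(self, callback, xml_text):
        try:
            acknowledged = callback(xml_text)
        except Exception:
            acknowledged = False
        if not acknowledged:
            self._pending.append((callback, xml_text))

    def retry_pending(self) -> int:
        """Re-send unacknowledged events; return how many are still pending."""
        pending, self._pending = self._pending, []
        for callback, xml_text in pending:
            self._deliver(callback, xml_text)
        return len(self._pending)


class Intermediary:
    """Keeps an up-to-date view built purely from pushed events."""

    def __init__(self):
        self.view = {}
        self.online = True

    def on_event(self, xml_text: str) -> bool:
        if not self.online:
            return False  # simulate an unreachable intermediary
        event = ET.fromstring(xml_text)
        key = (event.get("resource"), event.get("type"))
        self.view[key] = event.findtext("value")
        return True  # acknowledgement
```

The intermediary never polls: its view changes only when the utility publishes. The pending list is a crude stand-in for reliable delivery; a real system would also need ordering, persistence and duplicate handling.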
Third, due to the complex, dynamic and distributed nature of the computing Grid,
no deterministic state transition model may exist for the overall computing
environment. A management action performed by a service intermediary may
percolate through the Grid in unexpected ways and cause undesirable
consequences. Our industry will likely have to develop and rely on
non-deterministic models reflecting the behavior of the computing Grid. Just as a
failure in the electrical grid state estimating program caused trouble during
the recent blackout, failures in the computing grid modeling software may lead
to unforeseen and large scale problems in the Grid computing environment.
These problems may become especially pronounced when principles of autonomic
computing and self-healing become widely adopted.
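One simple form a non-deterministic model could take is a Monte Carlo simulation. The Python sketch below is a deliberately crude toy, assuming a made-up rule for how load moves when a node fails (equal shares to the survivors, each perturbed by a random factor); the function names, the jitter parameter and the redistribution rule are illustrative assumptions, not a description of any real Grid or power system.

```python
import random


def simulate_cascade(loads, capacities, first_failure, rng, jitter=0.2):
    """Return how many nodes have failed after one initial failure.

    When a node fails, its load is split among the surviving nodes. Each
    share is scaled by a random factor to model unpredictable routing.
    """
    loads = list(loads)
    alive = set(range(len(loads)))
    to_fail = [first_failure]
    while to_fail:
        node = to_fail.pop()
        if node not in alive:
            continue  # already failed earlier in the cascade
        alive.remove(node)
        if not alive:
            break
        share = loads[node] / len(alive)
        loads[node] = 0.0
        for other in sorted(alive):
            loads[other] += share * rng.uniform(1 - jitter, 1 + jitter)
            if loads[other] > capacities[other]:
                to_fail.append(other)
    return len(loads) - len(alive)


def blackout_probability(loads, capacities, trials=2000, seed=0):
    """Estimate the chance that a single random failure takes down every node."""
    rng = random.Random(seed)
    n = len(loads)
    blackouts = 0
    for _ in range(trials):
        first = rng.randrange(n)
        if simulate_cascade(loads, capacities, first, rng) == n:
            blackouts += 1
    return blackouts / trials
```

With five nodes of capacity 100, a load of 50 per node leaves enough headroom that no cascade occurs in this model, while a load of 90 per node turns any single failure into a total blackout. Between those extremes the outcome depends on the random shares, which is why an estimate over many trials, rather than a single deterministic prediction, is the natural output.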
Fourth, due to the expected large scale of the computing Grid, it is highly
unlikely that we will end up with a single service intermediary. The
electricity grid now has a large number of overseeing entities responsible for
different geographic regions. Similarly, the computing Grid infrastructure
will have a number of service intermediaries, each responsible for its own
"region" of computing. The "regions" can be formed in accordance with the type
of computing resources distributed in the Grid (e.g. storage versus network
capacity), the type of industry vertical (e.g. the financial services Grid
as opposed to the health care Grid), administrative or political boundaries,
and other principles.
The recent electricity blackout was caused to an extent by a problem that
crossed multiple spheres of control, and by the lack of cooperation among the
various monitoring entities required to address such a problem. In order to
avoid similar disasters in the computing Grid, service intermediaries will
have to perform cooperative and federated management of the grid environment.
Federated management will require exchange of real-time management information
and management policies. It may also require introduction of higher level
service intermediaries, which may lead to a hierarchical management
structure.
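A hierarchical arrangement of this kind might look like the following Python sketch, in which regional intermediaries share compact summaries with a higher-level coordinator. The class names, the summary fields and the 0.9 alert threshold are hypothetical and serve only to make the idea concrete.

```python
class RegionalIntermediary:
    """Manages one hypothetical region of the Grid."""

    def __init__(self, name):
        self.name = name
        self.utilization = {}  # resource id -> fraction of capacity in use

    def report(self, resource_id, utilization):
        """Record the latest utilization pushed by a resource."""
        self.utilization[resource_id] = utilization

    def summary(self):
        """Compact view that is shared outside the region."""
        peak = max(self.utilization.values(), default=0.0)
        return {
            "region": self.name,
            "resources": len(self.utilization),
            "peak_utilization": peak,
        }


class FederationCoordinator:
    """Higher-level intermediary that sees summaries from every region."""

    def __init__(self, regions, alert_threshold=0.9):
        self.regions = list(regions)
        self.alert_threshold = alert_threshold

    def global_view(self):
        return [region.summary() for region in self.regions]

    def regions_at_risk(self):
        return [
            s["region"]
            for s in self.global_view()
            if s["peak_utilization"] >= self.alert_threshold
        ]
```

The point of the sketch is that some entity ends up with a view spanning all regions, which is the view that no single party had during the blackout. What the coordinator is then allowed to do with that view is a separate question of authority.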
Finally, in order to make federated management of the computing Grid possible,
computing utilities will have to cede some control over their resources to
service intermediaries. Just as Midwest I.S.O. was hindered in its attempts to
avert the electricity blackout by its lack of authority over power utilities,
a service intermediary will be ill-positioned to prevent a computing blackout
if it does not have the authority not only to monitor the computing resources,
but also to control their state and behavior.
This raises an interesting security and access control problem. If the
behavior and state of computing resources are easily controllable over the
Internet by service intermediaries, a security breach in the Grid computing
infrastructure can easily be used by a hacker to inflict a "man-made"
computing blackout. Since Web services are used to both access and control
computing resources in the Grid, the issues of Web services security and
access control become increasingly important.
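To make the access control concern concrete, the Python sketch below shows one possible shape for a gateway that checks both who is calling a management operation and whether that caller has been granted it. The ManagementGateway class, the message layout and the grant model are assumptions for illustration; only the hmac and hashlib calls are real standard-library functions.

```python
import hashlib
import hmac


def sign(secret: bytes, caller: str, operation: str, resource: str) -> str:
    """Sign a management request with a shared secret (HMAC-SHA256)."""
    message = f"{caller}|{operation}|{resource}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class ManagementGateway:
    """Authenticates callers and checks explicit grants before any control action."""

    def __init__(self):
        self._secrets = {}    # caller -> shared secret
        self._grants = set()  # (caller, operation, resource) triples

    def register(self, caller: str, secret: bytes):
        self._secrets[caller] = secret

    def grant(self, caller: str, operation: str, resource: str):
        self._grants.add((caller, operation, resource))

    def authorize(self, caller, operation, resource, signature) -> bool:
        secret = self._secrets.get(caller)
        if secret is None:
            return False  # unknown caller
        expected = sign(secret, caller, operation, resource)
        if not hmac.compare_digest(expected, signature):
            return False  # request was not signed with the caller's secret
        return (caller, operation, resource) in self._grants
```

The sketch omits replay protection, key distribution and transport security, all of which a real deployment would need. It illustrates only that monitoring rights and control rights can be granted separately, operation by operation.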
In summary, issues such as Web services manageability and control; real-time
event management over the Internet; stochastic modeling of computing Grids;
federated management of Web services; and Web services security and access
control will play a critical role in the emergence of a reliable Grid computing
environment suitable for supporting critical business functions and processes.
Until we see these areas sufficiently developed, applications relying on Grid
computing principles will be subject to periodic IT disruptions and computing
blackouts.