Special Features:
HOW DOES ONE REALLY CHARACTERIZE GRID COMPUTING?
By Richard F. Freund, CTO, GridIQ
I wish to concentrate primarily on the issue of central control of grids and
the relationship to commercial uses of grids. Foster starts off his article
referring to the "recent explosion of commercial and scientific interest" in
Grid computing. Indeed that is so, and the variety of potential forms and uses
of Grid computing is a testimony to the appeal of the basic concepts involved.
Then Foster gives his most recent definition of a Grid that states, inter
alia, that a Grid is a system "not subject to centralized control". Foster's
concern about including policy issues as part of the definition for Grids
makes me concerned about getting into something like the old Supercomputing /
commercial use quagmire. This quagmire started (and continues) with an
obsession with "peak" FLOPS rather than throughput.
Often this involves finding some exotic architecture and correspondingly
exotic problem dovetailing with the exotic architecture so as to reach
unparalleled levels of FLOPS. That's fine for research and perhaps publicity.
The problem arises when we try to extend these architectural solutions to
commercial computing. Typically, supercomputers achieve only a minuscule
percentage of peak performance in production mode, especially with
high-variety task queues. Clusters composed of pedestrian hardware are often
far more performance- and cost-effective.
The Top 500 Supercomputers web page is dominated by academic and research
sites, not commercial sites. Is that because academicians have greater
computational demands than commercial users, or is it because the commercial
users know that supercomputers are performance- and cost-ineffective for their
production task queues? I would hate to see an analogous split develop in Grid
Computing between the aims of academic and commercial users. Rather, we should
aspire to a modus operandi in which the academic research initially supports
and eventually blends naturally into commercial grid computing.
How does one characterize commercial grid computing? First of all, many
commercial enterprises have research arms that mimic the interests and
concerns of academics. But the heart of much commercial computing deals with
production computing. Certainly this is true in such areas as Electronic
Design, Bioinformatics, Mechanical Design, Energy, and Financial Analyses, to
mention a few. If Grid Computing is to be truly relevant to these domains, it
must enhance production computing.
I see at least three special features of production computing that are neither
generally characteristic of academic computing nor found in Foster's
definition:
- Urgency - Of course academicians often talk about the need to compute
faster to meet deadlines, but the commercial need to bring products, such as
pharmaceuticals or new chip designs, to market ASAP is generally at a higher
level of urgency. Even more stringent can be the urgency of operational
military needs, such as command and control or intelligence functions.
- Central Responsibility and Authority - A commercial production
environment generally features an IT Manager who is held 100% responsible for
the success or failure of the operations under her control.
- Emphasis on Task Queue Throughput - A production environment generally
features a queue of tasks that all need to be computed. Academicians most
commonly look to optimization of individual tasks or projects. The IT Manager
may have 937 tasks to compute over the weekend. Individual optimization is
usually not an option for this number of tasks or for off-the-shelf binaries
that cannot be altered. On the other hand, there are some special features of
the task queue that don't usually occur in academic computing. Viewed as a
large "meta-task", this queue may exhibit thousands-fold concurrency and is
already decomposed into its component, schedulable pieces.
Furthermore, since the overall task queue completion is paramount, it may make
sense to have some tasks run on their 17th best host. Consider an example of a
grid that deals with a planning study we performed for NASA some years ago,
for the Earth Observing System (EOS). Even before EOS "flew", NASA was
attempting to determine whether their collective distributed set of computers
("grid") was sufficient for their daily data analyses. We quantitatively
showed them that if they did not have intelligent, central control, it was
very unlikely that they would meet their daily computational requirements,
unless they persuaded Congress to add $20M worth of computers.
On the other hand, if they used "smart", centralized scheduling, they could
satisfy their daily demand with the computers that they already had in hand.
Would the NASA grid suddenly become an un-grid because they wanted to use it
in a smarter way? It seems to me that whether or not to use central control
is a "policy" question that relates to how you use your grid resources and
what your objectives are, not to whether you meet the inherent definition of
a grid.
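The remark that some tasks may sensibly run on a host other than their best one can be illustrated with a small sketch. Everything here is an assumption made for illustration: the run-time estimates are invented, the function names are hypothetical, and the greedy rule is a generic textbook heuristic rather than the scheduler used in the NASA study or at GridIQ.

```python
def makespan_best_host(etc):
    """Send every task to the host where it runs fastest, ignoring queues."""
    loads = [0.0] * len(etc[0])
    for row in etc:
        host = min(range(len(row)), key=lambda j: row[j])
        loads[host] += row[host]
    return max(loads)


def makespan_min_completion(etc):
    """Greedy rule: send each task to the host where it would finish
    earliest, given the work already queued on that host."""
    loads = [0.0] * len(etc[0])
    for row in etc:
        host = min(range(len(row)), key=lambda j: loads[j] + row[j])
        loads[host] += row[host]
    return max(loads)


# Hypothetical estimated run times: 6 identical tasks on 3 hosts.
# Host 0 is the "best" host for every task.
etc = [[4, 6, 7] for _ in range(6)]

print(makespan_best_host(etc))       # every task queues on host 0
print(makespan_min_completion(etc))  # some tasks run on slower hosts
```

With these invented numbers, insisting on the best host serializes all six tasks on one machine, while letting some tasks run on slower hosts finishes the whole queue in half the time.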
Suppose a company sets up a research grid for their IT research department
(meeting all of Foster's criteria and using Globus). Suppose further that this
research grid is established with the initial aim of learning how various
tasks perform with various implementation strategies. Once this has been done,
the same grid could be transitioned, under central control, to a production
computing environment that makes educated scheduling decisions based on past
task/machine experience.
Does it suddenly cease to be a grid because its use and social policy have
changed? If the answer is "yes," then Grids really have very limited
commercial value and grid researchers seeking support from commercial vendors
such as Sun, IBM, etc., should point that out up front. The need for central
control of commercial grid resources should not only exist, it should be
extended to improve the performance of commercial off-the-shelf software
applications whose binary codes cannot be modified for optimization or for use
with toolkits such as Globus.
This requires a change in thinking regarding:
a. Load-balancing as a throughput implementation mechanism
b. Makespan (the time the last task in a queue finishes) as a measure of
throughput.
Let me
illustrate these points with a notional example that is very similar to ones
that we find in the production technical computing world. Suppose we are
charged to complete 1250 tasks on a multi-site grid of 90 machines. Suppose
further that we first try scheduler A (load-balancing) in which all tasks are
finished in 100 time units, i.e., the makespan = 100.
In addition, assume that A's schedule was so well balanced that no machine was
finished with its portion of the schedule until 97 time units had passed. Now
suppose that we try scheduler B ("smart" scheduling) in which all work was
also finished in 100 time units, i.e., it had the same makespan as scheduler
A. However, all but two tasks and their two corresponding machines finished at or
before 62 time units. Which is the better scheduler?
For those for whom "balance" is the pre-eminent measure, A is clearly the
better scheduler. But for production computing, B is clearly the better
scheduler since almost all tasks were finished much sooner, subsequently
making their supporting hosts available for the new "waves" of work that often
enter production systems throughout the day.
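The comparison of schedulers A and B can be made concrete with a short sketch. The per-machine finish times below are assumptions chosen to match the description (A finishes every machine between 97 and 100 time units; B finishes 88 machines at 62, used as a conservative upper bound, and two at 100). The `freed_machine_time` measure is a hypothetical illustration of why makespan alone hides the difference; it is not the throughput definition proposed in the cited paper.

```python
def makespan(finish_times):
    """Time at which the last machine finishes its share of the queue."""
    return max(finish_times)


def freed_machine_time(finish_times):
    """Machine-time units that become available for new work before the
    makespan is reached (an illustrative measure, not a standard one)."""
    end = max(finish_times)
    return sum(end - t for t in finish_times)


# Scheduler A (load-balanced): 90 machines, all finishing in 97..100.
finish_a = [97 + (i % 4) for i in range(90)]

# Scheduler B ("smart"): 88 machines done by 62, two machines run to 100.
finish_b = [62] * 88 + [100] * 2

print(makespan(finish_a), makespan(finish_b))
print(freed_machine_time(finish_a), freed_machine_time(finish_b))
```

Both schedules report the same makespan, yet under these assumed finish times B releases far more machine time for incoming waves of work, which is the distinction a makespan-only comparison cannot show.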
While B is superior for throughput (and also nimble to respond to sudden new
task loads), it obviously is inherently unbalanced. Part of what is required
here is a new throughput definition to replace makespan [Freund and Braun,
"Production Throughput as a High Performance Computing Meta-Task", PDPTA '02].
At GridIQ we have repeatedly demonstrated that the "unbalanced" schedules
associated with near-optimal task scheduling quickly provide significantly
greater throughput than those provided by load balancing schedulers and/or
months or years of code optimization efforts.
However, to achieve this greater level of throughput, some form of central
control is necessary. High Performance Computing (HPC) suffered greatly
because its development didn't prepare it to satisfy the requirements,
urgencies, and motivations of production technical computing. Excluding
centrally-controlled sites from participating in the evolving scope and
definition of grid computing will deter many production computing companies
from adopting, funding, and expanding grid computing, thus greatly limiting
grid computing's usefulness outside academia.
Doing so will stunt grid computing's growth, and limit its application in the
same way that HPC has historically been limited. Grid computing is an exciting
field, the potential of which we are only beginning to grasp. Its definition
should allow for multiple policy implementations, including centralized,
decentralized, and in-between control. Such a broad definition will ensure that
both academic and production users continue to advance grid computing, for the
ultimate benefit of all types of users.