GRIDtoday

DAILY NEWS AND INFORMATION FOR THE GLOBAL GRID COMMUNITY / AUGUST 19, 2002: VOL. 1 NO. 10

Special Features:

HOW DOES ONE REALLY CHARACTERIZE GRID COMPUTING?
By Richard F. Freund, CTO, GridIQ

I wish to concentrate primarily on the issue of central control of grids and the relationship to commercial uses of grids. Foster starts off his article referring to the "recent explosion of commercial and scientific interest" in Grid computing. Indeed that is so, and the variety of potential forms and uses of Grid computing is a testimony to the appeal of the basic concepts involved. Then Foster gives his most recent definition of a Grid that states, inter alia, that a Grid is a system "not subject to centralized control". Foster's concern about including policy issues as part of the definition for Grids makes me concerned about getting into something like the old Supercomputing / commercial use quagmire. This quagmire started (and continues) with an obsession for "peak" FLOPS, rather than throughput.

Often this involves finding some exotic architecture and a correspondingly exotic problem that dovetails with it so as to reach unparalleled levels of FLOPS. That's fine for research and perhaps publicity. The problem arises when we try to extend these architectural solutions to commercial computing. Typically supercomputers achieve only a minuscule percentage of peak performance in production mode, especially with high-variety task queues. Clusters composed of pedestrian hardware are often far more performance- and cost-effective.

The Top 500 Supercomputers web page is dominated by academic and research sites, not commercial sites. Is that because academicians have greater computational demands than commercial users, or is it because the commercial users know that supercomputers are performance- and cost-ineffective for their production task queues? I would hate to see an analogous split develop in Grid Computing between the aims of academic and commercial users. Rather, we should aspire to a modus operandi in which the academic research initially supports and eventually blends naturally into commercial grid computing.

How does one characterize commercial grid computing? First of all, many commercial enterprises have research arms that mimic the interests and concerns of academics. But the heart of much commercial computing deals with production computing. Certainly this is true in such areas as Electronic Design, Bioinformatics, Mechanical Design, Energy, and Financial Analyses, to mention a few. If Grid Computing is to be truly relevant to these domains, it must enhance production computing.

I see at least three special features of production computing that are neither generally characteristic of academic computing nor found in Foster's definition:

  • Urgency - Of course academicians often talk about the need to compute faster to meet deadlines, but the commercial need to bring products, such as pharmaceuticals or new chip designs, to market ASAP is generally at a higher level of urgency. Even more stringent can be the urgency of operational military needs, such as command and control or intelligence functions.

  • Central Responsibility and Authority - A commercial production environment generally features an IT Manager who is held 100% responsible for the success or failure of the operations under her control.

  • Emphasis on Task Queue Throughput - A production environment generally features a queue of tasks that all need to be computed. Academicians most commonly look to optimization of individual tasks or projects. The IT Manager may have 937 tasks to compute over the weekend. Individual optimization is usually not an option for this number of tasks or for off-the-shelf binaries that cannot be altered. On the other hand there are some special features of the task queue that don't usually occur in academic computing. Viewed as a large "meta-task", this queue may exhibit thousands-fold concurrency and is already decomposed into its component, schedulable pieces.

Furthermore, since the overall task queue completion is paramount, it may make sense to have some tasks run on their 17th best host. Consider an example of a grid that deals with a planning study we performed for NASA some years ago, for the Earth Observing System (EOS). Even before EOS "flew", NASA was attempting to determine whether their collective distributed set of computers ("grid") was sufficient for their daily data analyses. We quantitatively showed them that if they did not have intelligent, central control, it was very unlikely that they would meet their daily computational requirements, unless they persuaded Congress to add $20M worth of computers.

On the other hand, if they used "smart", centralized scheduling, they could satisfy their daily demand with the computers that they already had in hand. Would the NASA grid suddenly become an un-grid because they wanted to use it in a smarter way? It seems to me that whether or not to use central control is a "policy" question that relates to how you use your grid resources and what your objectives are, not to whether you meet the inherent definition of a grid.
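The remark that a task may sensibly run on a host other than its best one can be made concrete with a toy sketch. It is not the method used in the NASA study. The `etc` table (expected time to compute, one row per task, one column per host), its values, and both function names are hypothetical, chosen only for illustration. The second function is a simple greedy earliest-completion-time rule, which stands in for "smart" central scheduling here.

```python
def best_host_only(etc):
    """Send every task to its individually fastest host, ignoring queue state."""
    ready = [0] * len(etc[0])          # time at which each host becomes free
    for times in etc:
        host = min(range(len(times)), key=lambda h: times[h])
        ready[host] += times[host]
    return ready

def earliest_completion(etc):
    """Send each task to the host on which it would finish soonest,
    given the work already queued there (a central view of all hosts)."""
    ready = [0] * len(etc[0])
    for times in etc:
        host = min(range(len(times)), key=lambda h: ready[h] + times[h])
        ready[host] += times[host]
    return ready

# Hypothetical queue: four identical tasks; host 0 is faster (2) than host 1 (3).
etc = [[2, 3], [2, 3], [2, 3], [2, 3]]

print("best host only:      ", best_host_only(etc), "makespan", max(best_host_only(etc)))
print("earliest completion: ", earliest_completion(etc), "makespan", max(earliest_completion(etc)))
```

In this toy queue, one task runs on its second-best host and the whole queue finishes sooner. The decision needs a view of every host's pending work, which is the kind of information a central scheduler has.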

Suppose a company sets up a research grid for their IT research department (meeting all of Foster's criteria and using Globus). Suppose further that this research grid is established with the initial aim of learning how various tasks perform with various implementation strategies. Once this has been done, the same grid could be transitioned, under central control, to a production computing environment that makes educated scheduling decisions based on past task/machine experience.

Does it suddenly cease to be a grid because its use and social policy have changed? If the answer is "yes," then Grids really have very limited commercial value, and grid researchers seeking support from commercial vendors such as Sun, IBM, etc., should point that out up front. The need for central control of commercial grid resources should not only be acknowledged, it should be extended to improve the performance of commercial off-the-shelf software applications whose binary codes cannot be modified for optimization or for use with toolkits such as Globus.

This requires a change in thinking regarding:

  a. Load-balancing as a throughput implementation mechanism

  b. Makespan (the time the last task in a queue finishes) as a measure of throughput.

Let me illustrate these points with a notional example that is very similar to ones that we find in the production technical computing world. Suppose we are charged to complete 1250 tasks on a multi-site grid of 90 machines. Suppose further that we first try scheduler A (load-balancing) in which all tasks are finished in 100 time units, i.e., the makespan = 100.

In addition, assume that A's schedule was so well balanced that no machine was finished with its portion of the schedule until 97 time units had passed. Now suppose that we try scheduler B ("smart" scheduling) in which all work was also finished in 100 time units, i.e., it had the same makespan as scheduler A. However all but 2 tasks and their two corresponding machines finished at or before 62 time units. Which is the better scheduler?

For those for whom "balance" is the pre-eminent measure, A is clearly the better scheduler. But for production computing, B is clearly the better scheduler since almost all tasks were finished much sooner, subsequently making their supporting hosts available for the new "waves" of work that often enter production systems throughout the day.
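The comparison of schedulers A and B can be sketched numerically. The per-machine finish times below are assumptions filled in for illustration: the example only says that under A no machine finished before 97, and that under B all but two machines finished at or before 62. The measure `freed_machine_time` is likewise a hypothetical stand-in for a throughput-oriented metric, not the definition proposed in the cited paper.

```python
def makespan(finish_times):
    """Time at which the last machine finishes its share of the queue."""
    return max(finish_times)

def freed_machine_time(finish_times, horizon):
    """Total machine-time that becomes available for new work before `horizon`."""
    return sum(horizon - t for t in finish_times)

# Assumed finish times for the 90 machines (illustrative only).
finish_a = [97] * 89 + [100]      # scheduler A: tightly balanced
finish_b = [62] * 88 + [100] * 2  # scheduler B: two long-running stragglers

horizon = 100
for name, finish in (("A", finish_a), ("B", finish_b)):
    print(name, "makespan:", makespan(finish),
          "freed machine-time:", freed_machine_time(finish, horizon))
```

Under these assumed numbers both schedules have the same makespan, yet B frees more than ten times as much machine-time for incoming work. Makespan alone cannot tell the two apart.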

While B is superior for throughput (and also nimble in responding to sudden new task loads), it is obviously inherently unbalanced. Part of what is required here is a new throughput definition to replace makespan [Freund and Braun, "Production Throughput as a High Performance Computing Meta-Task", PDPTA '02]. At GridIQ we have repeatedly demonstrated that the "unbalanced" schedules associated with near-optimal task scheduling quickly provide significantly greater throughput than those provided by load-balancing schedulers and/or months or years of code optimization efforts.

However, to achieve this greater level of throughput, some form of central control is necessary. High Performance Computing (HPC) suffered greatly because its development didn't prepare it to satisfy the requirements, urgencies, and motivations of production technical computing. Excluding centrally-controlled sites from participating in the evolving scope and definition of grid computing will deter many production computing companies from adopting, funding, and expanding grid computing, thus greatly limiting grid computing's usefulness outside academia.

Doing so will stunt grid computing's growth and limit its application in the same way that HPC has historically been limited. Grid computing is an exciting field, the potential of which we are only beginning to grasp. Its definition should allow for multiple policy implementations, including centralized, decentralized, and in-between control. Such a broad definition will ensure that both academic and production users continue to advance grid computing, for the ultimate benefit of all types of users.
