The dream of building computers by simply aggregating processors has been around since the earliest days of computing. Progress in building and using effective and efficient parallel processors, however, has been slow. This rate of progress has been limited by difficult parallel software problems as well as by a long process of evolving the architecture of multiprocessors to enhance usability and improve efficiency.
Embedded computers, which consist of a processor, a small memory, and I/O devices, are embedded in mobile phones, TV sets, media players, microwave ovens, and similar devices. Game consoles contain a processor, a limited amount of memory, and a display as the only I/O device. Personal computers (PCs) come in several variants: pocket-sized, portable (notebook), and desktop computers. Most PCs use the x86 architecture. The most powerful PCs are used as servers and workstations. Large general-purpose computers, such as the IBM System/390, constitute the mainframe class. There are many mainframe applications, such as databases, file servers, Web servers, simulations, and multiprogramming/batch processing, amenable to running on more loosely coupled machines than the cache-coherent NUMA machines. These applications often need to be highly available, requiring some form of fault tolerance and repairability. Such applications, plus the similarity of the multiprocessor nodes to desktop computers and the emergence of high-bandwidth, switch-based local area networks, are why large-scale processing uses clusters of off-the-shelf, whole computers. Supercomputers are the class of computers with the highest performance and cost; they are configured as servers and typically cost millions of dollars.
Because of the programming difficulty, most parallel processing success stories are a result of software wizards developing a parallel subsystem that presents a sequential interface. Why is this so? The first reason is that you must get good performance and efficiency from the parallel program on a multiprocessor; otherwise, you would use a uniprocessor, since programming is easier. In fact, uniprocessor design techniques such as superscalar and out-of-order execution take advantage of instruction-level parallelism, normally without involvement of the programmer. Such innovation reduces the demand for rewriting programs for multiprocessors. Another reason why it is difficult to write parallel processing programs is that the programmer must know a good deal about the hardware. On a uniprocessor, the high-level language programmer writes the program largely ignoring the underlying machine organization; that is the job of the compiler. Alas, it is not that simple for multiprocessors. Although this second obstacle is beginning to lessen, our discussion reveals a third obstacle: Amdahl's law. It reminds us that even small parts of a program must be parallelized for the program to reach its full potential; thus coming close to linear speedup involves discovering new algorithms that are inherently parallel.
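As a minimal sketch of that last point, assuming purely illustrative values of the parallel fraction p and the processor count n (neither taken from any measured program), the following Python fragment evaluates Amdahl's law, speedup = 1 / ((1 - p) + p/n):

    def amdahl_speedup(parallel_fraction, processors):
        # Upper bound on speedup when only parallel_fraction of the
        # execution time benefits from the additional processors.
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / processors)

    if __name__ == "__main__":
        for p in (0.90, 0.99, 0.999):      # assumed parallel fractions
            for n in (16, 128, 1024):      # processor counts
                print(f"p = {p:5.3f}  n = {n:4d}  speedup <= {amdahl_speedup(p, n):6.1f}")

Even with 1024 processors, a program that is only 90% parallel cannot exceed a tenfold speedup; the bound approaches linear speedup only as p approaches 1, which is why the remaining sequential fraction of a program matters so much.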
Clusters. One drawback of clusters has been that the cost of administering a cluster of N machines is about the same as the cost of administering N independent machines, while the cost of administering a shared address space multiprocessor with N processors is about the same as administering a single machine. Another drawback is that clusters are usually connected using the I/O bus of the computer, whereas multiprocessors are usually connected on the memory bus of the computer. The memory bus has higher bandwidth, allowing multiprocessors to drive the network link at higher speed and to have fewer conflicts with I/O traffic on I/O-intensive applications.
A final weakness is the division of memory: a cluster of N machines has N independent memories and N copies of the operating system, but a shared address multiprocessor allows a single program to use almost all the memory in the computer. Thus, a sequential program in a cluster has 1/Nth the memory available compared to a sequential program in an SMP.
The major distinction between the two is the purchase price for equivalent computing power for large-scale machines. Since large-scale multiprocessors have small volumes, the extra development costs of large machines must be amortized over few systems, resulting in higher cost to the customer. Since the same switches sold in high volume for small systems can be composed to construct large networks for large clusters, local area network switches have the same economy of scale advantages as small computers.
The weakness of separate memories for program size turns out to be a strength in system availability. Since a cluster consists of independent computers connected through a local area network, it is much easier to replace a machine without bringing down the system in a cluster than in an SMP. Fundamentally, the shared address space means that it is difficult to isolate a processor and replace it without work by the operating system. Since the cluster software is a layer that runs on top of the local operating systems running on each computer, it is much easier to disconnect and replace a broken machine.
Since clusters are constructed from whole computers and an independent, scalable network, this isolation also makes it easier to expand the system without bringing down the application that runs on top of the cluster. High availability and rapid, incremental expandability make clusters attractive to service providers for the World Wide Web.
As is often the case with two competing solutions, each side tries to borrow ideas from the other to become more attractive. On one side of the battle, to combat the high-availability weakness of multiprocessors, hardware designers and operating system developers are trying to offer the ability to run multiple operating systems on portions of the full machine, so that a node can fail or be upgraded without bringing down the whole machine. On the other side of the battle, since both system administration and memory size limits are approximately linear in the number of independent machines, some are reducing the cluster problems by constructing clusters from small-scale SMPs. For example, a cluster of 32 processors might be constructed from eight four-way SMPs or four eight-way SMPs. Such "hybrid" clusters, sometimes called constellations or clustered shared memory, are proving popular with applications that care about cost/performance, availability, and expandability.
It is now widely held that the most effective way to build a computer that offers more performance than that achieved with a single-chip microprocessor is by building a multiprocessor or a cluster that leverages the significant price-performance advantages of mass-produced microprocessors.
On-chip multiprocessing appears to be growing in importance for two reasons. First, in the embedded market, where natural parallelism often exists, such approaches are an obvious alternative to faster, and possibly less efficient, processors. Second, diminishing returns in high-end microprocessor design will encourage designers to pursue on-chip multiprocessing as a potentially more cost-effective direction.
The Future of MPP Architecture. Small-scale multiprocessors built using snooping bus schemes are extremely cost-effective. Microprocessors traditionally have even included much of the logic for cache coherence in the processor chip, and several allow the buses of two or more processors to be directly connected, implementing a coherent bus with no additional logic. With modern integration levels, multiple processors can be placed within a single die, resulting in a highly cost-effective multiprocessor. Recent microprocessors have been including support for NUMA approaches, making it possible to connect small to moderate numbers of processors with little overhead.
What is unclear at present is how the very largest parallel processors will be constructed. The difficulties that designers face include the relatively small market for very large multiprocessors and the need for multiprocessors that scale to larger processor counts to be extremely cost-effective at the lower processor counts, where most of the multiprocessors will be sold. There appear to be four slightly different alternatives for large-scale multiprocessors:
1. Designing a cluster using all off-the-shelf components, which offers the lowest cost. The leverage in this approach lies in the use of commodity technology everywhere: in the processors (PC or workstation nodes), in the interconnect (high-speed local area network technology, such as Gigabit Ethernet), and in the software (standard operating systems and programming languages). Of course, such multiprocessors will use message passing, and communication is likely to have higher latency and lower bandwidth than in the alternative designs. For applications that do not need high bandwidth or low-latency communication, this approach can be extremely cost-effective.
2. Designing clustered computers that use off-the-shelf processor nodes and a custom interconnect. The advantage of such a design is the cost effectiveness of the standard processor node, which is often a repackaged desktop computer; the disadvantage is that the programming model will probably need to be message passing even at very small node counts. The cost of the custom interconnect can be significant and thus make the multiprocessor costly, especially at small node counts. An example is the IBM SP.
3. Large-scale multiprocessors constructed from clusters of midrange multiprocessors with combinations of proprietary and standard technologies to interconnect such multiprocessors. This cluster approach gets its cost-effectiveness using cost-optimized building blocks. Many companies offer a high-end version of such a machine, including HP, IBM, and Sun. Due to the two-level nature of the design, the programming model sometimes must be changed from shared memory to message passing or to a different variation on shared memory, among clusters. This class of machines has made important inroads, especially in commercial applications.
4. Large-scale multiprocessors that simply scale up naturally, using proprietary interconnect and communications controller technology. There are two primary difficulties with such designs. First, the multiprocessors are not cost-effective at small scales, where the cost of scalability is not valued. Second, these multiprocessors have programming models that are incompatible, in varying degrees, with the mainstream of smaller and midrange multiprocessors. The SGI Origin is one example. (SGI - Silicon Graphics Inc.)
Each of these approaches has advantages and disadvantages, and the importance of the shortcomings of any one approach is dependent on the application class. It is unclear which will win out for larger-scale multiprocessors, although the growth of the market for Web servers has made "racks of PCs" the dominant form, at least by number of systems.
The Future of Microprocessor Architecture. Architects are using ever more complex techniques to try to exploit more instruction-level parallelism. The prospects for finding ever-increasing amounts of instruction-level parallelism in a manner that is efficient to exploit are somewhat limited. There are increasingly difficult problems to be overcome in building memory hierarchies for high-performance processors. Of course, continued technology improvements will allow us to continue to advance clock rate. However, the use of technology improvements that allow a faster gate speed alone is not sufficient to maintain the incredible growth of performance that the industry has experienced for over 20 years. Moreover, as power increases over 100 watts per chip, it is unclear how much higher it can go in air-cooled systems. Hence, power may prove to be another limit to performance.
Unfortunately, for more than a decade, increases in performance have come at the cost of ever-increasing inefficiencies in the use of silicon area, external connections, and power. This diminishing-returns phenomenon has only recently appeared to have slowed the rate of performance growth. What is clear is that we cannot sustain the rapid rate of performance improvements without significant innovations in computer architecture. The long-term direction will be to use increased silicon to build multiple processors on a single chip. Such a direction is appealing from the architecture viewpoint: it offers a way to scale performance without increasing hardware complexity. It also offers an approach to easing some of the challenges in memory system design, since a distributed memory can be used to scale bandwidth while maintaining low latency for local accesses. Finally, redundant processors can help with dependability. The challenge lies in software and in what architecture innovations may be used to make the software easier.
If the number of processors per chip grows with Moore's law, dozens of processors are plausible in the near future. The challenge for such "micro-multiprocessors" is the software base that can exploit them, which may lead to opportunities for innovation in program representation and optimization.
Evolution versus Revolution and the Challenges to Paradigm Shifts in the Computer Industry. Fig. 4.21 shows what we mean by the evolution-revolution spectrum of computer architecture innovation. To the left are ideas that are invisible to the user (presumably excepting better cost, better performance, or both) and are at the evolutionary end of the spectrum. At the other end are revolutionary architecture ideas. These are the ideas that require new applications from programmers who must learn new programming languages and models of computation, and must invent new data structures and algorithms. Revolutionary ideas are easier to get excited about than evolutionary ideas, but to be adopted they must have a much higher payoff. Caches are an example of an evolutionary improvement. Within five years after the first publication about caches, almost every computer company was designing a computer with a cache. The RISC ideas were nearer to the middle of the spectrum, for it took more than eight years for most companies to have a RISC product. Most multiprocessors have tended to the revolutionary end of the spectrum, with the largest-scale multiprocessors (MPPs) being more revolutionary than others.
The first four columns (Fig. 4.21) are distinguished from the last column in that applications and operating systems may be ported from other computers rather than written from scratch. For example, RISC is listed in the middle of the spectrum because user compatibility is only at the level of high-level languages (HLLs), while microprogramming allows binary compatibility, and parallel processing multiprocessors require changes to algorithms and extensions to HLLs. Several flavors of multiprocessors appear in this figure. "Timeshared multiprocessor" means a multiprocessor justified by running many independent programs at once. Cache-coherent UMA and cache-coherent NUMA (a nonuniform memory access multiprocessor that maintains coherence for all caches) multiprocessors and clusters run parallel subsystems such as databases or file servers; the same applications are also the target of "message passing." "Parallel processing multiprocessor" means a multiprocessor of some flavor sold to accelerate individual programs developed by users [7], [14], [17], [26], [27], [33], [36].
The challenge for both hardware and software designers who would propose that multiprocessors and parallel processing become the norm, rather than the exception, is the disruption to the established base of programs. There are two possible ways this paradigm shift could be facilitated: if parallel processing offers the only alternative to enhance performance, and if advances in hardware and software technology can construct a gentle ramp that allows the movement to parallel processing, at least with small numbers of processors, to be more evolutionary. Perhaps cost/performance will be replaced with new goals of dependability, security, and/or reduced cost of ownership as the primary justification of such a change. When contemplating the future, and when inventing your own contributions to the field, remember the hardware/software interface. Acceptance of hardware ideas requires acceptance by software people; therefore, hardware people must learn more about software. In addition, if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware designers.
The mainframe or supercomputer business is an example of necessary change manifesting itself in new system architectures. The old-school architecture of supercomputers used a proprietary-core CPU and systems with the sole goal of maximizing MHz or computation cycles. Proprietary operating systems, applications software, design tools, and the lack of a large knowledge base were secondary to the performance-at-all-costs objective. Designers of today's systems use off-the-shelf x86 or PowerPC processors. Brute force has given way to flexibility and time-to-market needs. The original architectures couldn't continue to meet the market demands. Costs and technology-to-market times increased past the acceptance point. Necessity dictated the paradigm shift to distributed processing over multiple smaller, easily producible CPUs.
Performance flexibility and application flexibility are a result of the new architecture. Each machine can be tailored to a certain performance level by removing or adding processing nodes. Infrastructure leverage is another result. These architectures now embrace the knowledge base of PC and workstation CPU architects, software writers, and manufacturers.
The PC's success has largely been a result of a standard, open platform. As the new Web-based economy grows, different platform requirements will evolve or be created. Personalization implies differentiation of platforms and technologies supporting those platforms. As biometric, genetic, and chemical technologies evolve, they will begin to be incorporated with CPU designs. Personalization, flexibility, and quick time to market will dictate a quick-turn design methodology. This is in direct opposition to the design style and methods incorporated in today's CPUs. Time to market for new architectures is measured in three-year increments, while spins of new processors based on known designs occur every six months. This design method requires that the architecture be correct for up to six years in the market: three years for development and three years in production. The market will no longer accept the current long, pipelined design cycle. The Internet has broken down all communication and knowledge barriers. This breakdown has increased the worldwide productivity, manufacturing knowledge, and design base. Design cycles must decrease or become the bottleneck for future progress. A more robust and flexible architecture is required.
Human Interaction. Finally, CPU and system architectures will evolve to achieve more and better interaction with each other and with humans. At the local level, analog technological improvements will allow CPUs to communicate directly with the nervous system to aid in medical, physical, and mental improvements. At the Internet mesh level, systems will take on the capabilities of service or informational agents. Interpute power will allow for humanlike features such as emotion, anticipation of events or results, adaptation to events, and the ability to communicate. Users will interact with personal agents for aid with problems or for advice on medical, travel, emotional, investment, and many other matters.
CPU, microcontroller, and system architectures will become more flexible to meet market and customer requirements. The Moore's law progression of technology runs counter to the historical model of personalization and market-specific requirements of consumer products. Distributed processing and system-on-chip design techniques will allow for a la carte personalization of products. Analog advancements will enhance the user interface experience of personal Internet appliance devices and allow for interaction at the neural-network level. The standardization of data, transcoding of applications, and explosion of computation power on the Internet will let users take advantage of interpute power for high-performance applications. Necessity dictates change. Divergence points dictate change. Both have brought architecture to the cusp of required invention.