The chip set supports up to 16 Itanium processors for optimum 16-way performance. In combination with IA-64 Itanium processors, the chip set provides a powerful platform solution for the backbone of the Internet and enterprise computing. The server can be partitioned into a maximum of four domains, each constituting an isolated and complete computer system. This feature aids the consolidation of many smaller servers into fewer, larger servers.
Fig. 8.1 shows the block diagram of the Itanium server 16-way configuration. The modular construction is composed of four 4-CPU cells interconnected via a data crossbar chip and address snoop network. The 16-way box can be hard-partitioned into a maximum of four domains by fully or partially disconnecting the crossbar and the address interconnect at natural boundaries [1], [10].
Each cell has one system bus that supports up to four Intel Itanium microprocessors with power pods, the Itanium server chip set's north bridge, main memory DIMMs, and four connections to peripheral component interconnect (PCI) adapters via proprietary Gigastream-Links (GSLs). Fig. 8.2 shows the interrelations among those components. Two of the four microprocessors and their associated power pods are located on each side of the cell.
Itanium server's distributed, shared-memory architecture provides each of the four cells with a portion of the main memory. Each cell has 32 DIMM sites, half of which are located on an optional memory daughterboard. The chip set supports up to 128 Gbytes of physical address space. As is typical of a distributed, shared-memory machine, Itanium server's memory follows a cache-coherent, nonuniform memory access (NUMA) model.
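As a rough capacity check, if the full 128-Gbyte space were populated evenly across the four cells, the per-cell and per-DIMM figures would work out as follows (an illustration only, assuming uniform DIMM population):

```c
#include <stdio.h>

/* Back-of-the-envelope capacity check, assuming the full 128-Gbyte
 * physical address space is backed by memory spread evenly over the
 * four cells and their 32 DIMM sites each.                           */
int main(void)
{
    const unsigned total_gbytes   = 128;  /* maximum physical address space    */
    const unsigned cells          = 4;    /* cells in the 16-way configuration */
    const unsigned dimms_per_cell = 32;   /* 16 on the cell + 16 on the board  */

    unsigned per_cell = total_gbytes / cells;          /* 32 Gbytes per cell */
    unsigned per_dimm = per_cell / dimms_per_cell;     /* 1 Gbyte per DIMM   */

    printf("per cell: %u Gbytes, per DIMM: %u Gbytes\n", per_cell, per_dimm);
    return 0;
}
```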
The four cells share the service processor and the base I/O, including the legacy south bridge, also shown in Fig. 8.1. When the server is partitioned into two or more domains, each additional domain receives an add-on base I/O PCI card inserted into a designated PCI slot; the primary domain is the exception, as it is serviced by the original base I/O attached to the service processor board. The shared service processor serves all domains simultaneously.
Each PCI adapter has two 64-bit PCI buses that are configurable as either two-slot 66-MHz buses or four-slot 33-MHz buses, as shown in Fig. 8.3. All of the PCI slots are hot-pluggable. Each PCI adapter has two GSL ports; both ports may be used concurrently for performance or, alternatively, for redundancy. The maximum configuration of an Itanium server system is 128 PCI slots on 32 PCI buses, with a maximum of 16 PCI adapters. The resulting aggregate I/O bandwidth is approximately 8 Gbytes/s.
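One plausible way to arrive at the quoted figure is to take the fully populated 128-slot configuration, in which every bus runs as a four-slot, 64-bit, 33-MHz PCI bus at its theoretical peak (an illustrative derivation, not the vendor's own accounting):

```c
#include <stdio.h>

/* Rough aggregate I/O bandwidth estimate for the 128-slot configuration.
 * Assumes 16 adapters x 2 buses each, every bus a four-slot, 64-bit,
 * 33-MHz PCI bus at its theoretical peak; sustained rates are lower.    */
int main(void)
{
    const int    adapters          = 16;
    const int    buses_per_adapter = 2;
    const double peak_per_bus_mb   = 8 * 33.0;   /* 8 bytes x 33 MHz = 264 Mbytes/s */

    double total_mb = adapters * buses_per_adapter * peak_per_bus_mb;
    printf("aggregate peak: about %.1f Gbytes/s\n", total_mb / 1000.0);  /* ~8.4 */
    return 0;
}
```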
Chip set architecture. Fig. 8.2 shows the chip set components and their interconnections for each cell. The 16-way configuration has four sets of components plus the external data crossbar. The chip set design is optimized for 16-way or 4-cell configurations and employs a snoop-based coherency mechanism for lower snoop latencies. The chip set uses 0.25-micron process technology and operates at a multiple of the system bus clock frequency.
At the heart of the chip set is the system address controller, an ASIC that handles system bus, I/O, and intercell transactions; internal and external coherency control; address routing; and so on. Fig. 8.4 is a high-level block diagram of the system address controller. The system address controller controls the system data controller and transfers data to and from the system bus, main memory, I/O, and other cells. The I/O controller has signal connections to both the system address controller and system data controller. The I/O controller also has four GSLs to the I/Os as well as a Megastream Link to the legacy south bridge and service processor. I/O translation look-aside buffers are integrated in the I/O controller chip and convert a 32-bit address issued by a single-address-cycle PCI device into a full 64-bit address.
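The I/O translation look-aside buffer function can be pictured roughly as follows; the page size, table organization, and names are illustrative assumptions rather than the actual chip design:

```c
#include <stdint.h>

/* Minimal sketch of an I/O TLB translating a 32-bit PCI address into a
 * 64-bit system address.  Page size, entry count, and field layout are
 * illustrative assumptions only.                                        */
#define IOTLB_ENTRIES 256
#define PAGE_SHIFT    12                     /* assume 4-Kbyte pages */

typedef struct {
    uint32_t pci_page;   /* 32-bit PCI address >> PAGE_SHIFT    */
    uint64_t sys_page;   /* 64-bit system address >> PAGE_SHIFT */
    int      valid;
} iotlb_entry_t;

static iotlb_entry_t iotlb[IOTLB_ENTRIES];

/* Returns the translated 64-bit address, or 0 if no mapping exists
 * (a real implementation would signal an error instead).           */
uint64_t iotlb_translate(uint32_t pci_addr)
{
    uint32_t page = pci_addr >> PAGE_SHIFT;
    unsigned idx  = page % IOTLB_ENTRIES;    /* direct-mapped lookup */

    if (iotlb[idx].valid && iotlb[idx].pci_page == page)
        return (iotlb[idx].sys_page << PAGE_SHIFT) |
               (pci_addr & ((1u << PAGE_SHIFT) - 1));
    return 0;
}
```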
There are two memory chip sets: one located on the cell board and one on the optional memory daughterboard. Each consists of an intelligent memory address controller and four interleaving memory data controllers. Each set supports a chip-kill feature as well as a memory scan engine that performs memory initialization and test at power-on and periodic memory patrol and scrubbing.
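Conceptually, the patrol-and-scrub function behaves along these lines; this is a software analogy of the hardware scan engine, and the ecc_read/ecc_write interface is an assumption of the sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* Software analogy of a memory patrol/scrub pass: read each word, let
 * ECC correct any covered error, and write the corrected value back so
 * that errors do not accumulate.  ecc_read()/ecc_write() stand in for
 * the memory controller's ECC data path and are assumed here.          */
extern uint64_t ecc_read(volatile uint64_t *addr, int *corrected);
extern void     ecc_write(volatile uint64_t *addr, uint64_t data);

void scrub_region(volatile uint64_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        int corrected = 0;
        uint64_t data = ecc_read(&base[i], &corrected);
        if (corrected)
            ecc_write(&base[i], data);   /* write back the corrected data */
    }
}
```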
Cells are interconnected tightly and directly for addresses, forming the address network, and via the data crossbar for data. In a two-cell configuration, the data crossbar chip may be omitted by wiring the two cells directly to each other.
To effectively reduce the snoop traffic forwarded to the system bus, each cell has a snoop filter (tag SRAM) that keeps track of the cache contents in the four CPUs on the cell. When a coherent memory transaction is issued in one cell, its address is broadcast to all other cells for simultaneous snooping.
The snoop filter is checked for any chance that the inquired address is cached in the cell. If a hit is possible, the address is forwarded to the system bus for snooping, and the result is returned to the requesting cell. Otherwise, a snoop-miss response is returned instantly as a result of the tag lookup. In either case, the snoop filter is updated by replacing or purging the tag entry associated with the CPU cache line that receives the memory data. On a memory read, the addressed memory line, whether local or remote, is always read speculatively, regardless of whether the line may be cached in a CPU.
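A highly simplified view of this filtering decision, with the cache-line size, tag organization, and helper routines assumed purely for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified snoop-filter decision for an address broadcast from another
 * cell.  The tag organization and helper functions are illustrative
 * assumptions; the real filter is a tag SRAM tracking the contents of
 * the four local CPU caches.                                            */

#define LINE_SHIFT 6                     /* assume 64-byte cache lines */
#define SNOOP_MISS 0

extern bool snoop_filter_may_hit(uint64_t line_tag);  /* tag SRAM lookup  */
extern int  snoop_system_bus(uint64_t addr);          /* snoop local CPUs */

int handle_remote_snoop(uint64_t addr)
{
    uint64_t line_tag = addr >> LINE_SHIFT;

    /* Filter miss: the line cannot be cached locally, so answer at once
     * without disturbing the system bus.                                */
    if (!snoop_filter_may_hit(line_tag))
        return SNOOP_MISS;

    /* Possible hit: forward the address to the local system bus and
     * return the real snoop result to the requesting cell.              */
    return snoop_system_bus(addr);
}
```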
The system address controller has numerous configurable address range registers, which present a single flat memory space to the operating system. Similarly, all the PCI buses can be configured, either by the firmware at boot time or dynamically during online reconfiguration, to form a single, system-wide PCI bus tree. For compatibility reasons, these configuration registers are mapped to the chip set configuration space in a manner similar to Intel's 82460GX chip set. This makes Itanium server a natural 16-way extension of an 82460GX-based 4-way system.
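As an illustration of how such range registers present one flat space, routing a physical address to its owning cell might be sketched as follows (the register format shown is an assumption, not the chip set's actual definition):

```c
#include <stdint.h>

/* Illustrative address range decode: each cell owns one contiguous
 * window of the flat physical address space.  The register layout is
 * an assumption of this sketch, not the chip set's actual format.    */
#define NUM_CELLS 4

typedef struct {
    uint64_t base;    /* inclusive start of the cell's memory window */
    uint64_t limit;   /* exclusive end of the window                 */
} range_reg_t;

static range_reg_t range_regs[NUM_CELLS];

/* Returns the cell number owning the address, or -1 if unmapped. */
int route_address(uint64_t phys_addr)
{
    for (int cell = 0; cell < NUM_CELLS; cell++)
        if (phys_addr >= range_regs[cell].base &&
            phys_addr <  range_regs[cell].limit)
            return cell;
    return -1;
}
```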
From an operating system's viewpoint, our 16-way platform appears as a collection of 16 CPUs on a single virtual system bus, which is also connected to a single large memory and a large PCI bus tree rooted at the system bus. Although there are certain practical compromises such as limiting external task priority register (XTPR)-based interrupt rerouting within each physical system bus, Itanium server's simple system image and near-uniform memory access latency make it easy to achieve very good scalability without elaborate programming.
The chip set architecture maintains compatibility with the Itanium processor and provides features aimed at reliability, availability, and serviceability. These features include cell hot-plug capability and memory mirroring, data paths protected by error-correcting codes, error containment and graceful error propagation for IA-64 machine check abort recovery, parity-protected control logic, hardware consistency checking, detailed log registers, and backdoor access to the chip set.
Partitioning and In-Box Clustering. When Itanium server is hard-partitioned at cell boundaries, creating isolated domains (four maximum), each domain constitutes a complete computer system and runs a separate operating system instance (Fig. 8.5). Each domain can contain an arbitrary number of cells, and each cell may have any number of CPUs.
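In configuration terms, partitioning at cell boundaries can be pictured as a set of non-overlapping cell masks, one per domain; the representation below is purely illustrative, since the real configuration lives in chip set registers:

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative domain map: each domain is a bit mask over the four
 * cells.  Hard partitioning requires the masks to be non-empty and
 * mutually disjoint; this check mirrors that rule.                  */
#define MAX_DOMAINS 4
#define CELL_MASK   0xFu        /* four cells: bits 0..3 */

bool domain_map_valid(const uint8_t masks[], int ndomains)
{
    uint8_t used = 0;
    for (int d = 0; d < ndomains && d < MAX_DOMAINS; d++) {
        uint8_t m = masks[d] & CELL_MASK;
        if (m == 0 || (m & used))   /* empty, or overlaps another domain */
            return false;
        used |= m;
    }
    return true;
}
```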
The integrated service processor configures the domains by programming the chip set registers. Repartitioning may take place dynamically at user request, upon failures, or at boot time. Needless to say, each domain can be booted or brought down separately without affecting the operation of other domains. Altering the domain configuration while an operating system is running, however, requires operating system support. In addition to domain partitioning, the Itanium server chip set supports a true in-box clustering capability; that is, the partitioned domains can communicate with one another through the crossbar, eliminating the need for external interconnects. The internal interconnect is actually made up of partially shared physical memory, custom drivers, and optional user-mode libraries.
Each participating domain contributes a fraction of its physical memory at boot time to share among the nodes as a communication area. Fig. 8.6 shows the conceptual view of the in-box cluster memory in a 4×4 configuration. Users can program the amount of shared memory, as well as the node configurations, in the field.
Custom drivers and user-mode libraries provide standard cluster application program interfaces to the upper layers and use the partially shared memory as the physical medium. This offers main memory latency and bandwidth for cluster node communications while ensuring application compatibility.
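To make the idea concrete, node-to-node communication over the partially shared memory can be pictured as a simple ring buffer placed in the region that both domains map; this is a minimal sketch, and all structure and function names are assumptions rather than the actual driver interfaces:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Minimal sketch of message passing over the partially shared memory:
 * a single-producer, single-consumer ring buffer located in the region
 * that both domains map.  Synchronization details (memory barriers,
 * cache management) are omitted for brevity.                           */
#define RING_SLOTS 64
#define MSG_BYTES  256

typedef struct {
    volatile uint32_t head;                 /* written by the sender   */
    volatile uint32_t tail;                 /* written by the receiver */
    uint8_t  msg[RING_SLOTS][MSG_BYTES];    /* message payload slots   */
} shm_ring_t;

bool shm_send(shm_ring_t *r, const void *buf, uint32_t len)
{
    uint32_t next = (r->head + 1) % RING_SLOTS;
    if (next == r->tail || len > MSG_BYTES)
        return false;                       /* ring full or message too large */
    memcpy(r->msg[r->head], buf, len);
    r->head = next;                         /* publish the new message */
    return true;
}
```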