Most of the work in high-level power estimation has focused on the register-transfer-level (RTL) description in the processor design flow. Only recently have we seen a surge of interest in estimating power at the microarchitecture definition stage, and specific work on power-efficient microarchitecture design has been reported.
Power-Performance Efficiency. The most common (and perhaps obvious) metric to characterize the power-performance efficiency of a microprocessor is a simple ratio, such as MIPS (million instructions per second)/watts (W). This attempts to quantify efficiency by projecting the performance achieved or gained (measured in MIPS) for every watt of power consumed. Clearly, the higher the number, the "better" the machine is [8], [38], [44], [60], [61], [62].
While this approach seems a reasonable choice for some purposes, there are strong arguments against it in many cases, especially when it comes to characterizing higher end processors. Performance has typically been the key driver of such server-class designs, and cost or efficiency issues have been of secondary importance. Specifically, a design team may well choose a higher frequency design point (which meets maximum power budget constraints) even if it operates at a much lower MIPS/W efficiency compared to one that operates at better efficiency but at a lower performance level. As such, (MIPS)²/W or even (MIPS)³/W may be the metric of choice at the high end.
On the other hand, at the lowest end, where battery life (or energy consumption) is the primary driver, designers may want to put an even greater weight on the power aspect than the simplest MIPS/W metric. That is, they may just be interested in minimizing the power for a given workload run, irrespective of the execution time performance, provided the latter doesn't exceed some specified upper limit. The MIPS metric for performance and the watts value for power may refer to average or peak values, derived from the chip specifications. For example, for a 1-gigahertz (10⁹ cycles/sec) processor that can complete up to 4 instructions per cycle, the theoretical peak performance is 4,000 MIPS. If the average completion rate for a given workload mix is p instructions/cycle, the average MIPS would equal 1,000 times p. However, when it comes to workload-driven evaluation and characterization of processors, metrics are often controversial. Apart from the problem of deciding on a representative set of benchmark applications, fundamental questions persist about ways to boil down performance into a single (average) rating that's meaningful in comparing a set of machines.
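As a quick illustration of this arithmetic, the short C++ sketch below computes the peak and average MIPS figures from the clock frequency and instructions-per-cycle numbers quoted above; the average completion rate p used here is an arbitrary value chosen purely for illustration.

    // Worked example of the MIPS arithmetic above: peak MIPS from clock
    // frequency and completion width, and average MIPS from a measured IPC.
    // The figures (1 GHz, 4-wide, p = 1.7) are illustrative only.
    #include <cstdio>

    int main() {
        const double freq_hz   = 1.0e9;  // 1-gigahertz clock (10^9 cycles/sec)
        const int    max_ipc   = 4;      // up to 4 completed instructions per cycle
        const double avg_ipc_p = 1.7;    // hypothetical average completion rate p

        double peak_mips = freq_hz * max_ipc   / 1.0e6;  // = 4,000 MIPS
        double avg_mips  = freq_hz * avg_ipc_p / 1.0e6;  // = 1,000 * p MIPS

        std::printf("peak MIPS = %.0f, average MIPS = %.0f\n", peak_mips, avg_mips);
        return 0;
    }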
Since power consumption varies depending on the program being executed, the benchmarking issue is also relevant in assigning an average power rating. In measuring power and performance together for a given program execution, we may use a fused metric such as the power-delay product (PDP) or the energy-delay product (EDP). In general, the PDP-based formulations are more appropriate for low-power, portable systems in which battery life is the primary index of energy efficiency; PDP, being dimensionally equal to energy, is the natural metric for such systems. The MIPS/W metric is an inverse PDP formulation, where delay refers to the average execution time per instruction.
For higher end systems (workstations), the EDP-based formulation is more appropriate, since the extra delay factor ensures a greater emphasis on performance. The (MIPS)²/W metric is an inverse EDP formulation. For the highest performance server-class machines, it may be appropriate to weight the delay part even more. This would point to the use of (MIPS)³/W, which is an inverse ED²P formulation. Alternatively, we may use (CPI)³ × W (the cube of cycles per instruction times power) as a direct ED²P metric applicable on a per-instruction basis.
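The fused metrics and their inverse MIPS-based formulations can be summarized in a small calculation. The sketch below assumes only an average power and an average MIPS rating for a program run (both placeholder values) and treats delay as the average execution time per instruction, so the PDP, EDP, and ED²P values come out on a per-instruction basis.

    // Minimal sketch of the fused metrics discussed above.
    #include <cmath>
    #include <cstdio>

    struct FusedMetrics {
        double pdp;   // power * delay  (energy per instruction)
        double edp;   // energy * delay
        double ed2p;  // energy * delay^2
    };

    FusedMetrics fused(double watts, double mips) {
        double delay  = 1.0 / (mips * 1.0e6);    // seconds per instruction
        double energy = watts * delay;           // joules per instruction (PDP)
        return { energy, energy * delay, energy * delay * delay };
    }

    int main() {
        double watts = 50.0, mips = 1700.0;      // illustrative numbers only
        FusedMetrics m = fused(watts, mips);
        // The inverse formulations used in the text:
        double mips_per_w  = mips / watts;                 // ~ 1/PDP
        double mips2_per_w = std::pow(mips, 2) / watts;    // ~ 1/EDP
        double mips3_per_w = std::pow(mips, 3) / watts;    // ~ 1/ED^2P
        std::printf("PDP=%g J, EDP=%g Js, ED2P=%g Js^2\n", m.pdp, m.edp, m.ed2p);
        std::printf("MIPS/W=%g, MIPS^2/W=%g, MIPS^3/W=%g\n",
                    mips_per_w, mips2_per_w, mips3_per_w);
        return 0;
    }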
The energy × (delay)² metric, or performance³/power formula, is analogous to the cube-root rule, which follows from constant voltage-scaling arguments: to first order, clock frequency (and hence performance) scales linearly with supply voltage, while dynamic power scales with its cube, so performance³/power is unaffected by the choice of voltage point. Clearly, then, to formulate a voltage-invariant power-performance characterization metric, we need to think in terms of performance³/power.
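A small numerical check makes this argument concrete. Under the idealized scaling assumptions stated in the comments (frequency proportional to V, dynamic power proportional to V³), MIPS/W and MIPS²/W shift with the chosen voltage point while MIPS³/W stays essentially constant; the baseline numbers are arbitrary.

    // Numerical check of the cube-root argument above.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double base_mips = 1000.0, base_watts = 50.0;
        for (double v : {0.8, 1.0, 1.2}) {           // relative supply voltage
            double mips  = base_mips  * v;           // performance ~ f ~ V
            double watts = base_watts * v * v * v;   // power ~ C V^2 f ~ V^3
            std::printf("V=%.1f: MIPS/W=%.1f  MIPS^2/W=%.0f  MIPS^3/W=%.0f\n",
                        v, mips / watts,
                        std::pow(mips, 2) / watts, std::pow(mips, 3) / watts);
        }
        return 0;
    }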
When we are dealing with the Standard Performance Evaluation Corporation (SPEC) benchmarks, we may evaluate efficiency as (SPEC rating)/W, or (SPEC)ˣ/W for short, where the exponent x (equaling 1, 2, or 3) may depend on the class of processors being compared.
Figs. 6.1a and 6.1b show the power-performance efficiency data for a range of commercial processors of approximately the same generation. In Figs. 6.1a and 6.1b, the latest available processor is plotted on the left and the oldest on the right. We used SPEC/W, SPEC²/W, and SPEC³/W as alternative metrics, where SPEC stands for the processor's SPEC rating. For each category, the worst performer is normalized to 1, and the other processors' values are plotted as improvement factors over the worst performer.
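The normalization used in Fig. 6.1 can be sketched as follows. The processor names, SPEC ratings, and power numbers below are placeholders rather than the figure's actual data, and the exponent x selects among SPEC/W, SPEC²/W, and SPEC³/W.

    // Sketch of the normalization used in Fig. 6.1: for a chosen exponent x,
    // compute SPEC^x / W for each processor, then report each value as an
    // improvement factor over the worst performer (normalized to 1).
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    int main() {
        struct Proc { const char* name; double spec; double watts; };
        std::vector<Proc> procs = {
            {"proc-A", 40.0, 30.0},   // placeholder SPEC rating and max power
            {"proc-B", 28.0, 12.0},
            {"proc-C", 18.0,  4.0},
        };
        int x = 3;                    // exponent: 1, 2, or 3 depending on class
        std::vector<double> eff;
        for (const Proc& p : procs)
            eff.push_back(std::pow(p.spec, x) / p.watts);
        double worst = *std::min_element(eff.begin(), eff.end());
        for (size_t i = 0; i < procs.size(); ++i)
            std::printf("%s: %.2fx over worst (SPEC^%d/W)\n",
                        procs[i].name, eff[i] / worst, x);
        return 0;
    }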
The data validates our assertion that, depending on the metric of choice and the target market (determined by workload class and/or power/cost constraints), the conclusions drawn about efficiency can be quite different.
For performance-optimized, high-end processors, the SPEC³/W metric seems fairest, with the very latest Intel Pentium III and AMD Athlon offerings (at 1 GHz) at the top of the class for integer workloads. The older HP PA-8600 (552 MHz) and IBM Power3 (450 MHz) still dominate in the floating-point class. For power-first processors targeted toward integer workloads (such as Intel's mobile Celeron at 333 MHz), SPEC/W seems the fairest. Note that we've relied on published performance and "max power" numbers, and because of differences in the methodologies used in quoting maximum power ratings, the derived rankings may not be completely accurate or fair. This points to the need for standardized methods of reporting maximum and average power ratings for future processors, so customers can compare power-performance efficiencies across competing products in a given market segment.
Microarchitecture-Level Power Estimation. Fig. 6.2 shows a block diagram of the basic procedure used in the power-performance simulation infrastructure (PowerTimer) at IBM Research. The core of such models is a classical trace- or execution-driven, cycle-by-cycle performance simulator. Many academic power-performance models are built upon Burger and Austin's widely used, publicly available, parameterized SimpleScalar performance simulator; PowerTimer, by contrast, is built around IBM's own simulators, as described below.
At IBM, a tool set is being built around the existing research- and production-level simulators used in the various stages of the definition and design of high-end PowerPC processors. The key difference among power projection approaches lies in the nature and detail of the energy models used in conjunction with the workload-driven cycle simulator.
During every cycle of the simulated processor operation, the activated (or busy) microarchitecture-level units or blocks are known from the simulation state. Depending on the particular workload (and the execution snapshot within it), a fraction of the processor units and queue/buffer/bus resources are active at any given cycle.
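The following is a rough sketch, not PowerTimer's actual interface, of how such a per-cycle activity record can drive a power estimate: each simulated cycle, the performance model reports which units were busy, and the estimator charges each active unit its per-access energy and each idle unit a smaller clocking/leakage energy. Unit names and energy values are assumed for illustration.

    // Sketch of layering a cycle-by-cycle power estimate on a performance
    // simulator by charging per-unit active or idle energies each cycle.
    #include <cstdio>
    #include <map>
    #include <set>
    #include <string>

    class PowerEstimator {
    public:
        PowerEstimator(std::map<std::string, double> active_nj,
                       std::map<std::string, double> idle_nj)
            : active_nj_(std::move(active_nj)), idle_nj_(std::move(idle_nj)) {}

        // Called once per simulated cycle with the set of busy units.
        void onCycle(const std::set<std::string>& busy) {
            ++cycles_;
            for (const auto& [unit, e_active] : active_nj_)
                total_nj_ += busy.count(unit) ? e_active : idle_nj_.at(unit);
        }

        // Average power = total energy / total time (nJ per ns = W).
        double averageWatts(double cycle_time_ns) const {
            return total_nj_ / (cycles_ * cycle_time_ns);
        }

    private:
        std::map<std::string, double> active_nj_, idle_nj_;
        double total_nj_ = 0.0;
        long long cycles_ = 0;
    };

    int main() {
        PowerEstimator est({{"fxu", 0.9}, {"fpu", 1.4}, {"dcache", 1.1}},
                           {{"fxu", 0.1}, {"fpu", 0.2}, {"dcache", 0.2}});
        est.onCycle({"fxu", "dcache"});   // cycle 1: integer op plus cache access
        est.onCycle({"fpu"});             // cycle 2: floating-point op only
        std::printf("avg power = %.2f W\n", est.averageWatts(1.0));  // 1-ns cycle
        return 0;
    }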
The usefulness of microarchitectural power estimators hinges on the accuracy of the underlying energy models. We can formulate such models for given functional unit blocks (the integer or floating-point execution data path), storage structures (cache arrays, register files, or buffer space), or communication bus structures (instruction dispatch bus or result bus) using either
• circuit-level or RTL simulation of the corresponding structures, with circuit and technology parameters germane to the particular design, or
• analytical models or equations that formulate the energy characteristics in terms of the design parameters of a particular unit or block (see the sketch following this list).
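As a sketch of the second (analytical) approach, the model below expresses the per-access energy of a generic array structure in terms of its entry count, width, and port count. The linear form and the constants are assumptions chosen for illustration; a real model would be fitted to circuit-level simulation of the actual design in the target technology.

    // A deliberately simplified analytical energy model: per-access energy of
    // an array structure (e.g., a register file or cache subarray) expressed
    // in terms of its design parameters.
    #include <cstdio>

    struct ArrayParams {
        int rows;        // number of entries
        int bits;        // width of each entry in bits
        int read_ports;
        int write_ports;
    };

    // Energy per access in picojoules: an assumed decode term plus a bitline
    // term that grows with the number of rows and bits, replicated per port.
    double arrayAccessEnergyPJ(const ArrayParams& p) {
        const double e_decode_pj  = 0.05  * p.rows;           // assumed decode cost
        const double e_bitline_pj = 0.002 * p.rows * p.bits;  // assumed bitline cost
        const int ports = p.read_ports + p.write_ports;
        return ports * (e_decode_pj + e_bitline_pj);
    }

    int main() {
        ArrayParams regfile{80, 64, 4, 2};    // hypothetical rename register file
        std::printf("energy/access = %.1f pJ\n", arrayAccessEnergyPJ(regfile));
        return 0;
    }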
If detailed power and area measurements for a given chip are available, we can build reasonably accurate energy models based on power-density profiles across the various units. We can use such models to project the expected power behavior of follow-on processor design points by applying standard technology scaling factors. Power-density-based energy models of this kind have been used in conjunction with trace-driven simulators for power analysis and trade-off studies in some Intel processors.
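A minimal sketch of such a power-density-based projection, assuming hypothetical unit areas, measured power densities, and simple area/voltage/frequency scaling factors rather than any particular vendor's data, might look like this:

    // Measured power density (W/mm^2) per unit on an existing chip, multiplied
    // by each unit's area, then scaled to a follow-on technology point.
    #include <cstdio>
    #include <vector>

    struct UnitProfile {
        const char* name;
        double area_mm2;            // measured area on the reference chip
        double power_density_wmm2;  // measured power density on the reference chip
    };

    double projectedPowerW(const std::vector<UnitProfile>& units,
                           double area_scale,      // e.g., 0.5 per full node shrink
                           double voltage_scale,   // V_new / V_old
                           double freq_scale) {    // f_new / f_old
        // Dynamic power ~ C * V^2 * f; area scaling stands in for capacitance.
        double scale = area_scale * voltage_scale * voltage_scale * freq_scale;
        double total = 0.0;
        for (const auto& u : units)
            total += u.power_density_wmm2 * u.area_mm2 * scale;
        return total;
    }

    int main() {
        std::vector<UnitProfile> units = {
            {"fxu", 4.0, 0.8}, {"fpu", 6.0, 0.7}, {"l1-dcache", 10.0, 0.4},
        };
        std::printf("projected power = %.1f W\n",
                    projectedPowerW(units, 0.5, 0.8, 1.3));
        return 0;
    }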
Base Microarchitecture Model. Fig. 6.3 shows the high-level organization of the modeled processor. The model details make it equivalent in complexity to a modern, out-of-order, high-end microprocessor (for example, the Power4 processor). Here, we assume a generic, parameterized, out-of-order superscalar processor model adopted in a research simulator called Turandot. This research simulator was calibrated against a pre-RTL, detailed, latch-accurate processor model (referred to here as the R-model), which served as a validation reference in a real processor development project. The R-model is a custom simulator written in C++ (with mixed VHDL "interconnect code"), and there is a virtual one-to-one correspondence of signal names between the R-model and the actual VHDL (RTL) model. However, the R-model is about two orders of magnitude faster than the RTL model and considerably more flexible; many microarchitecture parameters can be varied, albeit within restricted ranges. Turandot, on the other hand, is a classical trace/execution-driven simulator, written in C, which is one to two orders of magnitude faster than the R-model and supports a much greater number and range of parameter values [55].
Initially, we used the developed energy models in conjunction with the R-model to ensure accurate measurement of the resource-use statistics within the machine. To circumvent the simulator's speed limitations, we used a parallel workstation cluster (farm). We also post-processed the performance simulation output and fed the average resource-use statistics to the energy models to get the average power numbers; looking up the energy models on every cycle during the actual simulation run would have slowed the R-model execution even further. While it would have been possible to get instantaneous, cycle-by-cycle energy consumption profiles through such a method, it would not have changed the average power numbers for entire program runs.
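The post-processing step can be sketched as follows: average per-resource utilization from the performance run is combined with assumed active and idle energies per cycle to yield an average power figure. The resource names and energies are illustrative only.

    // Post-processing sketch: average utilization per resource (fraction of
    // cycles busy) weighted by active/idle energies gives average power.
    #include <cstdio>
    #include <map>
    #include <string>

    double averagePowerW(const std::map<std::string, double>& utilization, // 0..1
                         const std::map<std::string, double>& active_nj,
                         const std::map<std::string, double>& idle_nj,
                         double cycle_time_ns) {
        double nj_per_cycle = 0.0;
        for (const auto& [unit, u] : utilization)
            nj_per_cycle += u * active_nj.at(unit) + (1.0 - u) * idle_nj.at(unit);
        return nj_per_cycle / cycle_time_ns;   // nJ per ns = W
    }

    int main() {
        std::map<std::string, double> util = {{"fxu", 0.55}, {"fpu", 0.20}};
        std::map<std::string, double> ea   = {{"fxu", 0.9},  {"fpu", 1.4}};
        std::map<std::string, double> ei   = {{"fxu", 0.1},  {"fpu", 0.2}};
        std::printf("avg power = %.2f W\n", averagePowerW(util, ea, ei, 1.0));
        return 0;
    }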
Having used the detailed, latch-accurate reference model for the initial energy characterization, we could examine the unit- and queue-level power numbers in detail to understand, test, and refine the various energy models. We have since reverted to using an energy-model-enabled Turandot model for fast CPI (cycles per instruction) versus power trade-off studies with full benchmark traces; Turandot lets us experiment with a wider range and combination of machine parameters.
Our experimental results are based on the SPEC95 benchmark suite and a commercial TPC-C trace. All workload traces were collected on a PowerPC machine. We generated the SPEC95 traces using the Aria tracing facility within the MET toolkit. We created the SPEC trace repository using the full reference input set; however, we used sampling to reduce the total trace length to 100 million instructions per benchmark program. In finalizing the choice of exact sampling parameters, the performance team also compared the sampled traces against the full traces in a systematic validation study.
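Purely as an illustration of interval sampling, and not the actual sampling scheme or parameters used for these traces, a sketch might keep a fixed-size chunk of instructions out of every interval until a target sampled length is reached:

    // Toy interval-sampling sketch over a stand-in "trace" of instruction IDs.
    #include <cstdint>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    std::vector<uint64_t> sampleTrace(const std::vector<uint64_t>& full_trace,
                                      size_t chunk, size_t interval,
                                      size_t target_len) {
        std::vector<uint64_t> sampled;
        for (size_t i = 0; i < full_trace.size() && sampled.size() < target_len;
             i += interval)
            for (size_t j = i; j < i + chunk && j < full_trace.size() &&
                               sampled.size() < target_len; ++j)
                sampled.push_back(full_trace[j]);
        return sampled;
    }

    int main() {
        std::vector<uint64_t> full(1000);
        std::iota(full.begin(), full.end(), 0);      // stand-in trace
        auto s = sampleTrace(full, 10, 100, 50);     // keep 10 of every 100 entries
        std::printf("sampled %zu of %zu entries\n", s.size(), full.size());
        return 0;
    }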
Data Cache Size and Effect of Scaling Techniques. Out-of-order superscalar processors of the class considered rely on queues and buffers to efficiently decouple instruction execution and increase performance. The pipeline depth and the resource size required to support decoupled execution combine to determine the machine performance. Because of this decoupled execution style, increasing the size of one resource without regard to other machine resources may quickly create a performance bottleneck. Fig. 6.4 shows the effects of varying all of the resource sizes within the processor core.
These include issue queues, rename registers, branch predictor tables, memory disambiguation hardware, and the completion table. For the buffers and queues, the number of entries in each resource is scaled by the values specified in the charts (0.6x, 0.8x, 1.2x, and 1.4x). For the instruction cache, data cache, and branch prediction tables, the sizes of the structures are doubled or halved at each data point.
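The scaling experiment can be thought of as a configuration sweep of the kind sketched below, where buffer and queue entry counts are multiplied by a common factor while cache and branch-predictor sizes are halved or doubled at each data point. The baseline sizes are placeholders, not the modeled processor's actual parameters.

    // Configuration-sweep sketch for the resource-scaling study of Fig. 6.4.
    #include <cmath>
    #include <cstdio>

    struct CoreConfig {
        int issue_queue, rename_regs, completion_table;  // entry counts
        int icache_kb, dcache_kb, bpred_entries;         // array sizes
    };

    CoreConfig scaled(const CoreConfig& base, double factor) {
        auto scale = [&](int n) { return (int)std::lround(n * factor); };
        // Caches and predictor tables are doubled or halved at each data point:
        // 0.6x -> 1/4, 0.8x -> 1/2, 1.0x -> 1x, 1.2x -> 2x, 1.4x -> 4x.
        int step = (int)std::lround((factor - 1.0) / 0.2);
        auto scale2 = [&](int n) { return step >= 0 ? n << step : n >> (-step); };
        return { scale(base.issue_queue), scale(base.rename_regs),
                 scale(base.completion_table),
                 scale2(base.icache_kb), scale2(base.dcache_kb),
                 scale2(base.bpred_entries) };
    }

    int main() {
        CoreConfig base{20, 80, 32, 64, 32, 16384};   // placeholder baseline
        for (double f : {0.6, 0.8, 1.0, 1.2, 1.4}) {
            CoreConfig c = scaled(base, f);
            std::printf("%.1fx: issueq=%d rename=%d dcache=%dKB\n",
                        f, c.issue_queue, c.rename_regs, c.dcache_kb);
        }
        return 0;
    }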
From Fig. 6.4a, we can see that performance increases by 5.5% for SPECfp, 9.6% for SPECint, and 11.2% for TPC-C as the sizes of the resources within the core are increased by 40% (with caches that are 4x larger). This configuration dissipates 52% to 55% more power than the baseline core. Fig. 6.4b shows that the most power-efficient core microarchitecture lies somewhere between the 1x and 1.2x cores.