Microarchitecture Level Power-Performance Fundamentals

At the elementary transistor gate level, we can formulate total power dissipation as the sum of three major components: switching loss, leakage, and short-circuit loss.

$$PW_{device} = \frac{1}{2} C \, V_{DD} \, V_{swing} \, a f + I_{leakage} V_{DD} + I_{sc} V_{DD} \tag{6.1}$$

Here, C is the output capacitance, V_DD is the supply voltage, f is the chip clock frequency, and a is the activity factor (0<a<1) that determines the device switching frequency. V_swing is the voltage swing across the output capacitor. I_leakage is the leakage current, and I_sc is the average short-circuit current.
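As a rough illustration of Equation 6.1, the sketch below evaluates the three terms for a single gate. The function name `device_power` and all numeric values are assumptions chosen for illustration, not figures from the text.

```python
def device_power(c, v_dd, v_swing, a, f, i_leakage, i_sc):
    """Return (switching, leakage, short_circuit, total) power in watts.

    Follows the three-term sum of Eq. 6.1: switching loss, leakage,
    and short-circuit loss.
    """
    switching = 0.5 * c * v_dd * v_swing * a * f
    leakage = i_leakage * v_dd
    short_circuit = i_sc * v_dd
    total = switching + leakage + short_circuit
    return switching, leakage, short_circuit, total


# Hypothetical values: 10 fF load, 1.5 V supply, full voltage swing,
# activity factor 0.2, 1 GHz clock, 1 nA leakage, 50 nA short-circuit current.
sw, lk, sc, total = device_power(10e-15, 1.5, 1.5, 0.2, 1e9, 1e-9, 50e-9)
print(f"switching={sw:.3e} W, leakage={lk:.3e} W, short-circuit={sc:.3e} W")
```

With these assumed values the switching term is far larger than the other two, which is the situation the first-order chip-level approximation relies on.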

The literature often approximates V_swing as equal to V_DD (or simply V for short), making the switching loss approximately (1/2)CV²af. Also, for current ranges of V_DD (say, 1 volt to 3 volts), the switching loss (1/2)CV²af remains the dominant component. So as a first-order approximation for the whole chip, we may formulate the power dissipation as

$$PW_{chip} = \frac{1}{2} \sum_i C_i V_i^2 a_i f_i \tag{6.2}$$

C_i, V_i, a_i, and f_i are unit- or block-specific average values. The summation is taken over all blocks or units i at the microarchitecture level (instruction cache, data cache, integer unit, floating-point unit, load-store unit, register files, and buses). For the voltage range considered, the operating frequency is roughly proportional to the supply voltage; C remains roughly the same if we keep the same design but scale the voltage. If a single voltage and clock frequency are used for the whole chip, the formula reduces to
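The per-block summation described above, (1/2) Σ C_i V_i² a_i f_i, can be sketched as follows. The block names come from the text; the `Block` type, the function name, and every capacitance and activity value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    c: float  # average switched capacitance, farads (assumed value)
    v: float  # supply voltage, volts
    a: float  # average activity factor
    f: float  # clock frequency, hertz


def chip_power(blocks):
    """First-order chip power: half the sum of C_i * V_i^2 * a_i * f_i."""
    return 0.5 * sum(b.c * b.v ** 2 * b.a * b.f for b in blocks)


blocks = [
    Block("instruction cache", 2.0e-9, 1.5, 0.30, 1e9),
    Block("data cache", 2.0e-9, 1.5, 0.25, 1e9),
    Block("integer unit", 1.0e-9, 1.5, 0.40, 1e9),
    Block("floating-point unit", 1.5e-9, 1.5, 0.10, 1e9),
]
total = chip_power(blocks)
print(f"chip power ~ {total:.3f} W")
```

Because each block carries its own voltage and frequency, the same function also covers designs with multiple voltage or clock domains.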

$$PW_{chip} = V^3 \sum_i K_i^v a_i = f^3 \sum_i K_i^f a_i \tag{6.3}$$

where the K's are unit- or block-specific constants. If we consider the worst-case activity factor for each unit i, that is, if a_i=1 for all i, then

$$PW_{chip} = K_v V^3 = K_f f^3 \tag{6.4}$$

where K_v and K_f are design-specific constants.

That equation leads to the so-called cube-root rule. This points to the single most efficient method for reducing power dissipation for a processor designed to operate at high frequency: reduce the voltage (and hence the frequency). This is the primary mechanism of power control in Transmeta's Crusoe chip. There's a limit, however, on how much V_DD can be reduced (for a given technology), which has to do with manufacturability and circuit reliability issues. Thus, a combination of microarchitecture and circuit techniques to reduce power consumption, without necessarily employing multiple or variable supply voltages, is of special relevance.
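To make the cubic dependence concrete, the sketch below applies the relation that power is proportional to V³, under the text's assumption that frequency scales linearly with supply voltage. The function names and the scaling factors are illustrative assumptions.

```python
def relative_power(voltage_scale):
    """Power relative to baseline when V (and hence f) is scaled by voltage_scale."""
    return voltage_scale ** 3


def voltage_scale_for_power(power_fraction):
    """Inverse relation: the voltage scale that yields a target power fraction."""
    return power_fraction ** (1.0 / 3.0)


for s in (1.0, 0.9, 0.8):
    print(f"V and f scaled to {s:.0%}: power at {relative_power(s):.1%} of baseline")

print(f"halving power needs V scaled to about {voltage_scale_for_power(0.5):.3f}")
```

Under this model, lowering voltage and frequency by 20% roughly halves power, which is why voltage reduction is so effective as long as V_DD stays within the limits the technology allows.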

Performance Basics. The most straightforward metric for measuring performance is the execution time of a representative workload mix on the target processor. We can write the execution time as

$$T = PL \times CPI \times CT = PL \times CPI \times \frac{1}{f} \tag{6.5}$$

Here, PL is the dynamic path length of the program mix, measured as the number of machine instructions executed. CPI is the average processor cycles per instruction incurred in executing the program mix, and CT is the processor cycle time (measured in seconds per cycle) whose inverse determines clock frequency f. Since performance increases with decreasing T, we may formulate performance PF as
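A minimal sketch of the execution-time relation, T = PL × CPI × (1/f), follows. The function name and the workload numbers are hypothetical.

```python
def execution_time(path_length, cpi, frequency_hz):
    """Execution time in seconds: instructions * cycles/instruction * seconds/cycle."""
    cycle_time = 1.0 / frequency_hz  # CT, seconds per cycle
    return path_length * cpi * cycle_time


# Hypothetical workload: 2 billion instructions, CPI of 1.25, 1 GHz clock.
t = execution_time(2_000_000_000, 1.25, 1e9)
print(f"T = {t:.2f} s")
```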

$$PF_{chip} = K_{pf} f = K_{pv} V \tag{6.6}$$

Here, the K's are constants for a given microarchitecture-compiler implementation. The K_pf value stands for the average number of machine instructions executed per cycle on the machine being measured. PF_chip in this case is measured in MIPS.
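The relation between instructions per cycle and MIPS can be sketched as below, taking K_pf as the reciprocal of CPI and expressing frequency in hertz. The function name and the sample values are assumptions for illustration.

```python
def mips(cpi, frequency_hz):
    """Performance in millions of instructions per second."""
    ipc = 1.0 / cpi  # plays the role of K_pf: instructions per cycle
    return ipc * frequency_hz / 1e6


# Hypothetical machine: CPI of 1.25 at 1 GHz, then at half the clock rate.
full_speed = mips(1.25, 1e9)
half_speed = mips(1.25, 0.5e9)
print(f"{full_speed:.0f} MIPS at 1 GHz, {half_speed:.0f} MIPS at 500 MHz")
```

For a fixed microarchitecture and compiler, performance falls only linearly with frequency, while the power model above falls with its cube.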

Adopting a noncontroversial weighted mix is not easy. Benchmark suites such as SPEC instead rate a machine by the geometric mean of per-program execution-time ratios, where each ratio is calculated as the speedup with respect to execution time on a specified reference machine. This method has the advantage of allowing us to rank different machines unambiguously from a performance viewpoint; that is, the ranking can be shown to be independent of the reference machine used in such a formulation.
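The independence from the reference machine can be checked numerically. The sketch below assumes the per-program speedup ratios are combined with a geometric mean, as SPEC-style ratings do; the function names and all execution times are hypothetical.

```python
from math import prod


def geometric_mean(values):
    values = list(values)
    return prod(values) ** (1.0 / len(values))


def rating(times, reference_times):
    """Geometric mean of per-program speedups over the reference machine."""
    return geometric_mean(ref / t for t, ref in zip(times, reference_times))


# Hypothetical execution times in seconds for three programs.
machine_a = [10.0, 40.0, 5.0]
machine_b = [20.0, 20.0, 8.0]
reference_1 = [100.0, 100.0, 100.0]
reference_2 = [50.0, 400.0, 10.0]

ratio_1 = rating(machine_a, reference_1) / rating(machine_b, reference_1)
ratio_2 = rating(machine_a, reference_2) / rating(machine_b, reference_2)
print(f"A vs. B with reference 1: {ratio_1:.4f}")
print(f"A vs. B with reference 2: {ratio_2:.4f}")
```

The reference times cancel when two geometric-mean ratings are divided, so the relative standing of the two machines is the same for either reference.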

SMT/CMP Differences and Energy-Efficiency Issues. Consider the floating-point loop kernel shown in Table 6.1.

Table 6.1.

Date: 2016-06-12