1. What parameters and measuring units are usually used to estimate the throughput of a computer?
2. For what purposes is benchmark (reference) testing of computers carried out?
3. What do the results of calculating power-performance efficiency allow us to judge?
4. What models are used for microarchitecture-level power estimation?
5. What is the high-level organization of the modelled processor?
6. What is the parallel workstation cluster used for?
7. What parameters may be used for estimating the quality of a processor core?
8. What is the difference between the R-model and the actual VHDL-model?
9. What structure has been used to build the Turandot model for simulating the organization of a PowerPC processor?
10. What are the peculiarities of the latch-accurate processor model?
Multithreading
Power-Efficient Microarchitecture Design Ideas and Trends. With the means to evaluate power efficiency during early-stage microarchitecture evaluations, we can propose and characterize processor organizations that are inherently energy efficient. Also, if we use the right efficiency metric in the right context, we can compare alternate design points and rank them using a single power-performance efficiency number. Here, we first examine the power-performance efficiency trend exhibited by the current regime of superscalar processors. The "SMT/CMP differences and energy efficiency issues" box uses a simple loop-oriented example to illustrate the basic performance and power characteristics of the current superscalar regime. It also shows how follow-on paradigms such as simultaneous multithreading (SMT) may help correct the decline in power-performance efficiency measures.
Single-Core, Wide-Issue, Superscalar Processor Chip Paradigm. One school of thought envisages a continued progression along the path of wider, aggressively speculative superscalar paradigms. Researchers continue to innovate in efforts to exploit single-thread instruction-level parallelism (ILP). Value prediction advances promise to break the limits imposed by data dependencies. Trace processors ease the fetch bandwidth bottleneck, which can otherwise impede scalability.
Nonetheless, increasing the superscalar width beyond a certain limit seems to yield diminishing gains in performance, while continuously reducing the power-performance efficiency metric (for example, SPEC3/W, the SPEC rating cubed divided by power in watts). Thus, studies show that, for the superscalar paradigm, going to wider-issue, more speculative hardware in a core processor ceases to be viable beyond a certain complexity point. Such a point certainly seems to be upon us in the processor design industry. In addition to power issues, more complicated, bigger cores add to verification and yield problems, which all add up to higher cost and delays in meeting time-to-market goals.
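To make the fused metric concrete, here is a minimal sketch in which all SPEC scores and wattages are invented for illustration; it shows how SPEC3/W can fall even while raw performance still rises with issue width:

```python
# Hypothetical design points: widening issue gives diminishing SPEC gains
# while power keeps growing, so the cubed metric declines.
designs = [
    # (issue width, SPEC score, power in watts) -- illustrative values only
    (4,  30.0, 30.0),
    (8,  36.0, 55.0),
    (16, 39.0, 95.0),
]

for width, spec, watts in designs:
    spec3_per_w = spec ** 3 / watts
    print(f"width {width:2d}: SPEC={spec:5.1f}  W={watts:5.1f}  "
          f"SPEC^3/W={spec3_per_w:7.1f}")
```

With these numbers the metric drops from 900 to roughly 624 even though the SPEC score rises, which is exactly the diminishing-efficiency trend the text describes.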
Advanced methods in low-power circuit techniques and adaptive microarchitectures can help us extend the superscalar paradigm for a few more years. Other derivative designs, such as multicluster processors and SMT, may represent more definitive paradigm shifts.
Multicluster Superscalar Processors. As we've implied, the desire to extract more and more ILP using the superscalar approach requires the growth of most of the centralized structures. Among them are instruction fetch logic, register rename logic, the register file, the instruction issue window with wakeup and selection logic, data-forwarding mechanisms, and resources for disambiguating memory references. In most cases, the energy dissipated per instruction grows in a superlinear fashion. None of the known circuit techniques solves this energy growth problem. Given that IPC (instructions per cycle) performance grows sublinearly with issue width (with asymptotic saturation), it's clear why the classical superscalar path will lead to increasingly power-inefficient designs.
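The trend can be made concrete with a toy scaling model; the exponents below are illustrative assumptions, not measured data:

```python
# Toy scaling model: IPC grows sublinearly with issue width while
# energy per instruction grows superlinearly, so efficiency falls.
def ipc(width):
    return width ** 0.5          # sublinear performance growth

def energy_per_instr(width):
    return width ** 1.5          # superlinear energy growth

for w in (2, 4, 8, 16):
    eff = ipc(w) / energy_per_instr(w)   # performance per unit energy
    print(f"width {w:2d}: IPC={ipc(w):4.2f}  "
          f"E/instr={energy_per_instr(w):5.2f}  eff={eff:5.3f}")
```

Under these assumed exponents, efficiency falls as 1/width: each doubling of issue width halves performance per unit energy, even though raw IPC keeps rising.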
One way to address the energy growth problem at the microarchitectural level is to replace a classical superscalar CPU with a set of clusters, so that all key energy consumers are split among clusters. Then, instead of accessing centralized structures in the traditional superscalar design, instructions scheduled to an individual cluster would access local structures most of the time. The primary advantage of accessing a collection of local structures instead of a centralized one is that the number of ports and entries in each local structure is much smaller. This reduces access latency and lowers power.
Current high-performance processors certainly have elements of multiclustering, especially in terms of duplicated register files and distributed issue queues. Simulation results showed that, to compensate for the intercluster communication and to improve power-performance efficiency significantly, each cluster must be a powerful out-of-order superscalar machine by itself. This simulation-based study determined the optimal number of clusters and their configurations for a specified efficiency metric (the energy-delay product, EDP).
The multicluster organization yields IPC performance inferior to a classical superscalar with centralized resources (assuming equal net issue width and total resource sizes). The latency overhead of intercluster communication is the main reason behind the IPC shortfall. Another reason is that centralized resources are always better utilized than distributed ones. Nevertheless, the multicluster organization is potentially more energy-efficient for wide-issue processors, with an advantage that grows with issue width. Given the same power budget, the multicluster organization allows configurations that can deliver higher performance than the best configurations with the centralized design.
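The EDP metric itself is simply energy multiplied by execution time, so a lower value is better: it rewards designs that save energy without giving up too much performance. A minimal sketch, with hypothetical energy and time figures, shows how a clustered design can win on EDP despite an IPC shortfall:

```python
# Energy-delay product comparison; all numbers are hypothetical.
def edp(energy_joules, exec_time_seconds):
    return energy_joules * exec_time_seconds

# Clustered design assumed ~10% slower (the IPC shortfall) but
# spending much less energy in its smaller local structures.
centralized = edp(energy_joules=50.0, exec_time_seconds=1.00)
clustered   = edp(energy_joules=35.0, exec_time_seconds=1.10)

print(f"centralized EDP = {centralized:.1f}")
print(f"clustered   EDP = {clustered:.1f}")
```

With these assumed figures the clustered design's EDP (38.5) beats the centralized one's (50.0), illustrating how a modest performance loss can be outweighed by a large energy saving.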
Chip Multiprocessing. Server product groups such as IBM's PowerPC division have relied on chip multiprocessing as the future scalable paradigm. The Power4 design is the first example of this trend. Its promise is to build multiple processor cores on the same die or package to deliver scalable solutions at the system level.
Multiple processor cores on a single die can operate in various ways to yield scalable performance in a complexity- and power-efficient manner. In addition to shared-memory chip multiprocessing, we may consider building a multiscalar processor, which can spawn speculative tasks (derived from a sequential binary) to execute concurrently on multiple cores. Intertask communication can occur via register value forwarding or through shared memory. Dynamic memory disambiguation hardware is required to detect memory-ordering violations. On detecting such a violation, the offending task(s) must be squashed and restarted. Multiscalar-type paradigms promise scalable, power-efficient designs, provided the compiler-aided task-partitioning and instruction scheduling algorithms can be improved effectively.
Multithreading. Various flavors of multithreading have been proposed and implemented in the past as a means to go beyond single-thread ILP limits. One recent paradigm that promises to provide a significant boost in performance with a small increase in hardware complexity is simultaneous multithreading. In SMT, the processor core shares its execution-time resources among several simultaneously executing threads (programs). Each cycle, instructions can be fetched from one or more independent threads and injected into the issue and execution slots. Because the issue and execution resources can be filled by instructions from multiple independent threads in the same cycle, per-cycle utilization of the processor resources can be significantly improved, leading to much greater processor throughput.
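A toy issue-slot model makes the utilization argument concrete; the per-cycle instruction counts below are invented for illustration:

```python
# Each thread alone can fill only some of the 4 issue slots per cycle;
# under SMT a second thread uses the slots the first leaves empty.
ISSUE_WIDTH = 4
thread_a = [2, 1, 3, 0, 2]   # instructions thread A could issue each cycle
thread_b = [1, 3, 0, 2, 2]   # likewise for thread B (hypothetical)

def utilization(*threads):
    used = 0
    cycles = len(threads[0])
    for c in range(cycles):
        slots = ISSUE_WIDTH
        for t in threads:
            take = min(t[c], slots)  # fill remaining slots, in priority order
            used += take
            slots -= take
    return used / (ISSUE_WIDTH * cycles)

print(f"A alone:     {utilization(thread_a):.0%}")
print(f"A + B (SMT): {utilization(thread_a, thread_b):.0%}")
```

In this sketch thread A alone fills 40% of the issue slots, while interleaving both threads raises utilization to 80%, which is the kind of throughput gain the text describes.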
For occasions when only a single thread is executing on an SMT processor, the processor behaves almost like a traditional superscalar machine. Thus the same power reduction techniques are likely to be applicable. When an SMT processor is simultaneously executing multiple threads, however, the per-cycle use of the processor resources should noticeably increase, offering fewer opportunities for power reduction via such traditional techniques as clock gating. An SMT processor, when designed specifically to execute multiple threads in a power-aware manner, provides additional options for power-aware design.
The SMT processor generally provides a boost in overall throughput performance, and this alone will improve the power-performance ratio for a set of threads, especially in a context (such as server processors) where there is greater emphasis on overall throughput than on low power. Furthermore, a processor specifically designed with SMT in mind can provide even greater power-performance efficiency gains.
Because the SMT processor takes advantage of multiple execution threads, it can be designed to employ far less aggressive speculation in each thread. By relying on instructions from a different thread to provide increased resource use where speculation would have been used in a single-threaded architecture (and accepting higher throughput over the multiple threads rather than single-thread latency performance), the SMT processor can spend more of its effort on nonspeculative instructions. This inherently implies greater power efficiency per thread: that is, the power expended on useful instructions weighs better against that spent on misspeculated instructions on a per-thread basis. It also implies a somewhat simpler branch-unit design (for example, fewer resources devoted to branch speculation), which can further aid development by reducing design complexity and verification effort.
SMT implementations require some overhead in terms of additional hardware to maintain the state of multiple threads. This increase in hardware implies some increase in the processor's per-cycle power requirements. SMT designers must therefore determine whether this increase in processor resources (and thus per-cycle power) can be well balanced by the reduction of other resources (less speculation) and the increase in performance attained across the multiple threads.
Compiler Support and Energy-Efficient Cache Architectures. Compilers can assist the microarchitecture in reducing the power consumption of programs. A number of compiler techniques developed for performance-oriented optimizations can be exploited (usually with minor modifications) to achieve power reduction. Reducing the number of memory accesses, reducing the amount of switching activity in the CPU, and increasing opportunities for clock gating will help here.
We've seen several proposals for power-efficient solutions to the cache hierarchy design. The simplest of these is the filter cache idea.
Deferred branch. The deferred (delayed) branch technique minimizes the overhead of conditional branch instructions; how much it helps depends on how often instructions can be reordered successfully. The instruction slot immediately following a branch instruction is called the branch delay slot. The number of delay slots depends on how much time is needed to complete the branch instruction. The instructions in the delay slots are always fetched from memory, so they are executed regardless of whether the branch is taken. It is advisable to fill these slots with useful instructions; if none are available, NOP instructions are used instead. Logically, the program executes as if the branch instruction followed the instruction in its delay slot. This means the branch takes effect one instruction later than it would in strictly sequential execution of the program instructions, hence the name: a deferred branch.
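The mechanism can be illustrated with a toy simulator; the three-instruction mini-ISA below is invented for this sketch. Even when the branch is taken, the instruction in the delay slot still executes:

```python
# Minimal sketch of one branch delay slot in an invented toy ISA.
def run(program, regs):
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        if op == "add":                        # dst = a + b
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
            pc += 1
        elif op == "beqz":                     # branch if register == 0,
            reg, target = args                 # with one delay slot
            taken = regs[reg] == 0
            op2, dst, a, b = program[pc + 1]   # the delay-slot instruction
            regs[dst] = regs[a] + regs[b]      # executes unconditionally
            pc = target if taken else pc + 2
        elif op == "halt":
            break
    return regs

prog = [
    ("beqz", "r0", 3),          # r0 == 0, so the branch is taken ...
    ("add", "r1", "r1", "r2"),  # ... but this delay-slot add still runs
    ("add", "r3", "r3", "r3"),  # skipped by the taken branch
    ("halt",),
]
regs = run(prog, {"r0": 0, "r1": 1, "r2": 10, "r3": 5})
print(regs)   # r1 becomes 11 despite the taken branch; r3 stays 5
```

The delay-slot instruction updates r1 even though control transfers past it, while the instruction after the slot (the r3 add) is correctly skipped, which is exactly the "branch occurs one instruction later" behavior described above.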