The superscalar realization of RISC (an advanced pipelining technique that enables the processor to execute more than one instruction per clock cycle) yields systems able to process several instructions at a time. Superscalar computer systems, as a rule, look ahead in the instruction stream so as to use their resources better and, where possible, to move operations around in pursuit of optimal throughput (out-of-order execution). Since all the links between operations are expressed through names (registers), the main complication here is posed by the so-called false dependences. Superscalar architecture is further characterized by the speculative character of its computations (speculative execution): on reaching a conditional branch whose condition is not yet known, the system picks one of the branches. If the guess is right, speed is gained; if it is wrong, almost nothing is lost.
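To make the false-dependence problem concrete, here is a minimal sketch in C (illustrative, not from the source): the first function carries a write-after-read hazard on the single name x, and the second removes it by writing to a fresh name, which is exactly what hardware register renaming does with physical registers.

    int f(int x, int a, int b) {
        int y = x + a;    /* reads x                                */
        x = b * b;        /* writes x: WAR hazard, order is forced  */
        return y + x;
    }

    int g(int x, int a, int b) {
        int y  = x + a;   /* reads the old x                        */
        int x2 = b * b;   /* writes a fresh name: the two statements
                             may now run in either order, or in
                             parallel on two ALUs                   */
        return y + x2;
    }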
The basic idea of superscalar processing is that the compiler first builds the dependences and reorders operations, producing a sequential instruction stream in which the parallelism is in effect hidden. The superscalar hardware then carries out its own analysis and, in turn, optimizes execution. Look-ahead execution is effective only up to the next conditional branch, which is why speculative computation is used to take the other, less probable path variants into account as well. The main difficulty in the way of realizing superscalar processing is the absence of any means of describing parallelism at the level of the machine language.
The basic idea of the pipeline in superscalar architecture is to use, at the slower links of the horizontal pipeline, a multiplicity (five or more) of functional units performing what may conventionally be called vertical data processing, while the faster links use single functional units. The point, then, is to accelerate overall processing by "sub-pipelining" each weakest link (which can be divided into additional stages). To this end, at each step one of the slowly operating functional units is assigned the processing function, temporarily "leaving" the common pipeline, and the place temporarily "freed" in the common pipeline is taken by another unit of the same functional purpose, or by a different unit, so that no waiting delay (idling or stall) is introduced into the rhythm of the common pipeline. In effect this is a pipeline with one interchangeable link, which may be a unit of the same kind (another ALU) or one performing an entirely different function (a floating-point ALU, a load unit, a store unit). As a result of such timely substitution (often accompanied by a change of the link's function), the faster links of the pipeline do not stand idle but continue working at their own rhythm, preserving their maximum speed. Thus superscalar architecture can be viewed as an approach to pipelining (and a form of coordination and synchronization of pipeline work) that lets the pipeline's step be set by the operation time of a fast link, neutralizing the influence of the slower functional units (making their presence and participation in the common pipeline non-critical and unnoticeable, i.e. not reducing the overall throughput of the pipeline).
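As a hedged numerical illustration of this substitution (the latencies below are invented, not taken from any particular machine): if the slowest functional unit needs k cycles while the fast links advance every cycle, rotating new instructions among k replicas of the slow unit keeps the pipeline's initiation interval at one cycle.

    #include <stdio.h>

    int main(void) {
        int fast   = 1;   /* latency of the fast pipeline links, cycles */
        int slow   = 3;   /* latency of the slow functional unit        */
        int copies = 3;   /* interchangeable replicas of the slow unit  */

        /* With the copies taking turns, the slow stage accepts a new
           instruction every ceil(slow / copies) cycles; the pipeline's
           overall rhythm is the larger of that and the fast-link
           latency. */
        int slow_rate = (slow + copies - 1) / copies;
        int interval  = slow_rate > fast ? slow_rate : fast;

        printf("initiation interval: %d cycle(s)\n", interval); /* 1 */
        return 0;
    }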
Although the term superscalar architecture was introduced in 1987, such architectures had already been realized in the CDC 6600 (renamed the Cyber 70 Model 74), the CDC 7600 (renamed the Cyber 70 Model 76), and the Elbrus-1 [63]. In the Elbrus-1, all variable-length instructions were first translated into a RISC-oriented code with fixed-length instructions, so that the hardware would not have to parse their syntax sequentially; from there the true superscalar machinery took over, since such code could already be processed in parallel.
Functional parallelism in the Atlas computer took the form of a separate stand-alone 24-bit adder for index calculations, in addition to the basic fixed- and floating-point arithmetic-logic unit built around a single main adder-accumulator [27], [60], [61], [62], [63].
The CDC 6600, which was introduced in the early 1960s, is sometimes given credit as being the first RISC architecture. Seymour Cray designed this machine with a single goal in mind: to attain the maximum speed possible for scientific applications. The CDC 6600 was a large mainframe, costing several million dollars at the time. Although the 6600 was very different from today's inexpensive processors, Cray was motivated by some of the same considerations that influence modern RISC design. In particular, because the pattern of instructions used by scientific programs is very stylized, Cray discovered that he could implement a very small instruction set and still build FORTRAN compilers able to generate efficient code. Long before the term RISC had been coined, Cray found that by simplifying the instruction set, significant improvements in instruction throughput were possible. Among the RISC-like characteristics of the 6600 are the following:
• Only load and store instructions reference memory (these are rather peculiar: an address register is set to the address of the memory location involved, and as a side effect the operand is loaded into, or stored from, the matching data register).
• Register operations are in three-address form, with two input registers and a separate output register.
• The instruction formats are very simple and uniform.
• There are multiple functional units, and instruction execution is pipelined, with a scoreboard used to mark busy registers and create the necessary interlocks (a minimal sketch of this mechanism follows the list).
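The following sketch of scoreboard interlocks is simplified well beyond the 6600's actual issue logic: each register carries a busy bit that is set while some in-flight instruction is still due to write it, and an instruction may issue only when its sources and destination are all free.

    #include <stdbool.h>

    #define NREGS 8
    static bool busy[NREGS];        /* one busy bit per register */

    /* An instruction may issue only if no in-flight instruction
       still has to write its sources (read-after-write) or its
       destination (write-after-write). */
    bool can_issue(int src1, int src2, int dst) {
        return !busy[src1] && !busy[src2] && !busy[dst];
    }

    void issue(int dst)    { busy[dst] = true;  } /* claim destination    */
    void complete(int dst) { busy[dst] = false; } /* release on writeback */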
Despite these similarities with current RISC processors, there are a number of respects in which the 6600 departs significantly from the general RISC model. In particular, it has a rather small number of registers, and they are not uniform. There is also no visible concurrency: from the programmer's point of view, execution is strictly sequential. Most important, the hardware used to implement the 6600 is completely different, since the CDC 6600 predated the development of integrated circuits. The CDC 6600 was a marvel of hand-wired discrete components. On opening the main processor cabinet, one was confronted with a daunting tangle of wires, and hardware problems with the 6600 often had to be solved by slightly lengthening or shortening one of these wires to control signal propagation times.
The CDC 6600 fetched an instruction from memory every 100 ns and dispatched it to one of 10 functional units for concurrent execution (multiplication, division, addition, long (double-precision) addition, shift, logic, branch, and increment/decrement operations). While instructions were executing, the central processing unit was already fetching the next instruction.
Superscalar microprocessors usually have a single program counter. The instructions analyzed for possible joint execution are therefore "tied" to the processor's program counter by an execution window [53], [54], [65].
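A hedged sketch of such a window follows; the helpers ready(), unit_free(), and issue() are hypothetical placeholders, since only the window-scanning structure matters here.

    #include <stdbool.h>

    #define W 4                      /* window size, illustrative */

    typedef struct insn insn;        /* opaque decoded instruction */
    bool ready(const insn *i);       /* operands available?   (hypothetical) */
    bool unit_free(const insn *i);   /* functional unit idle? (hypothetical) */
    void issue(const insn *i);       /* dispatch to a unit    (hypothetical) */

    /* Scan the execution window: only instructions within W slots of
       the program counter are candidates, and any whose operands and
       unit are ready may issue, possibly out of program order. */
    void issue_window(const insn *prog[], int pc, int n) {
        for (int k = pc; k < pc + W && k < n; k++)
            if (ready(prog[k]) && unit_free(prog[k]))
                issue(prog[k]);
    }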
In AMD's RISC86 superscalar architecture, used in its microprocessors, a decoder breaks long CISC constructs into small RISC-like components called RISC operations (ROPs). (A ROP is similar to a microcode instruction of x86 microprocessors.) For executing ROPs there are two ALUs, a floating-point unit, a branch execution unit, and two load/store units.
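By way of illustration (the record layout and register numbers below are hypothetical, not AMD's actual ROP encoding), a single memory-to-register x86 ADD might decompose into three ROPs shared between the load/store units and an ALU.

    /* Hypothetical micro-op record; not AMD's actual ROP encoding. */
    typedef enum { ROP_LOAD, ROP_ADD, ROP_STORE } rop_kind;
    typedef struct { rop_kind kind; int dst, src1, src2; } rop;

    /* x86 "ADD [addr], EBX" split into RISC-like steps:
       t0 <- [addr];  t1 <- t0 + EBX;  [addr] <- t1 */
    static const rop decoded[] = {
        { ROP_LOAD,  10,  0, 0 },  /* load/store unit: t0 <- [addr]   */
        { ROP_ADD,   11, 10, 3 },  /* ALU:             t1 <- t0 + EBX */
        { ROP_STORE,  0, 11, 0 },  /* load/store unit: [addr] <- t1   */
    };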
The SuperSPARC superscalar processor had separate pipelines for integer and floating-point arithmetic, carried separate first-level caches on the chip, and could execute up to 3 instructions at a time.
The POWER3 is a superscalar microprocessor with out-of-order instruction execution. It can execute up to 8 instructions at a time: 2 load/store instructions, 2 floating-point instructions, 2 short integer instructions, 1 long integer instruction, and 1 branch instruction. For this purpose the POWER3 contains 8 execution units.
Vector parallelism. A scalar computer (CDC 6600, IBM 360/91) is a machine whose instructions manipulate data items containing only single numeric elements, while a vector computer has instructions that manipulate data containing ordered sets of numeric elements, i.e. vectors.
Seymour Cray believed that a computer should be adapted not to languages but to optimizing compilers. On the other hand, vector computation considerably complicated the position of Cray supercomputers, since instead of simply parallelizing loop execution, the loops first had to be vectorized, which consumed considerable resources.
A vector processor is an architecture and compiler model, popularized by supercomputers, in which high-level operations work on linear arrays of numbers. Vector computers can, as is well known, work in two different modes. Programs that a compiler can vectorize are executed at high speed in the vector mode; programs containing no vector parallelism, or whose parallelism the compiler fails to discover, are executed at low speed in the scalar mode.
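The difference between the two modes shows up directly in source code; in the illustrative C below, the first loop has no loop-carried dependence and maps onto vector operations, while the second carries a dependence through x[i-1] and is confined to the scalar mode.

    /* saxpy() has no loop-carried dependence, so a vectorizing
       compiler can map each iteration onto vector operations;
       recur() carries a dependence through x[i-1] and is forced
       into the scalar mode. */
    void saxpy(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];      /* vectorizable            */
    }

    void recur(int n, float a, float *x) {
        for (int i = 1; i < n; i++)
            x[i] = a * x[i - 1] + x[i];  /* loop-carried dependence */
    }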
Amdahl's Law implies that the low-speed mode dominates the overall performance of such a two-speed computing system unless the work performed in the scalar mode is almost entirely eliminated.
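In its usual quantitative form (the standard statement of the law, not a formula from the source): if a fraction $f$ of the work runs in the fast vector mode with speedup $s$ and the remainder stays scalar, then

    \text{Speedup} = \frac{1}{(1 - f) + f/s},

so that, for example, with $f = 0.9$ and $s = 10$ the overall speedup is $1/(0.1 + 0.09) \approx 5.3$: a 10% scalar residue already halves the attainable gain.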
In scalar performance, where no vector operations are executed, the supervector Cray-1 exceeds the superscalar CDC 7600 by a factor of two. In vector performance, where only vector operations are used and the memory channel is the limiting factor, the supervector Cray-1 exceeds the CDC 7600 by a factor of four. Supervector performance, where the availability of functional units and vector registers is the limiting factor, multiplies this gap by another factor of three. Despite the substantial gain in performance, this technology did not achieve wide distribution, although it still has some specific applications.
Thus, the paramount architectural requirement for parallelism at the task level is the provision of a correctly balanced set of replicated resources, which falls under the general classification of functional parallelism. It is also very important that activity levels be well managed in all parts of the system, so that bottlenecks can be identified and resources expanded (or reduced) as circumstances require.