The Microprocessor Architecture for Java Computing (MAJC) was designed to exploit scalable parallelism not only at the instruction level but also at the data, thread, and process levels. The result is a flexible architecture that provides compatibility between implementations through simple binary translation, and compatibility with existing instruction set architectures through dynamic binary translation (much like the dynamic translation of Java bytecodes to the MAJC ISA through Sun's HotSpot virtual machine) [35], [61].
Architectural Features. After choosing the best features from the current crop of microprocessors, the designers developed features beyond the state of the art to incorporate in MAJC. MAJC is a scalable, very-long-instruction-word (VLIW) architecture with a hierarchical structure. The lowest level consists of instruction slices that provide the control, status, and architectural registers and the resources required to execute instructions in a VLIW packet. One to four instruction slices, as well as additional control and status registers including the program counter register and the processor unit status register, combine to form a microthread for executing a single VLIW packet stream. One or more microthreads and additional control and status registers, such as the processor unit control register, combine to form a processor unit. Finally, one or more processor units and additional control and status registers combine to form a processor cluster. The architecture supports fast communication between microthreads and processor units in a processor cluster through fast interrupts and virtual channels.
This structure supports parallelism at several levels:
• data-level parallelism through single-instruction, multiple-data (SIMD) instructions,
• instruction-level parallelism through VLIW packets containing from one to four instructions, and
• thread-level and process-level parallelism through vertical multithreading and fast communication between chip multiprocessors (CMPs).
Even with recent research advances in instruction-level parallelism, high-frequency execution remains a dominant factor determining application performance in modern general-purpose microprocessors. Although issuing multiple instructions in the same cycle is an important performance-enhancing technique for most traditional architectures, it typically requires complex logic to handle data and control dependencies and resource allocation. The complexity of this logic has a significant impact on frequencies and on the time required to implement and verify the logic.
MAJC supports issuing multiple instructions in a single cycle with the VLIW approach, in which multiple instructions execute as a single packet. The instructions in each VLIW packet must have no data or control interdependencies and no resource-use interdependencies. A compiler for a MAJC microprocessor is responsible for ensuring that VLIW packets meet these rules.
The compiler must also perform inter-packet dependency checking and resource allocation, except in the cases of variable-latency instructions such as memory instructions, and complicated extended-latency instructions such as division instructions. Thus, dependency checking and resource allocation move from hardware to software, simplifying MAJC implementations and improving time to market (generally, a one-month time-to-market delay requires a compensatory performance increase of approximately 4%). To schedule code efficiently for most modern microprocessors, compilers must have intimate knowledge of the microprocessors' pipelines, so dependency checking and resource allocation do not significantly increase the MAJC compiler's complexity.
Each MAJC VLIW packet contains from one to four 32-bit instructions. Traditional VLIW architectures require each VLIW packet to contain the maximum number of instructions even if there is insufficient instruction-level parallelism to actually issue that number of instructions per cycle. The unused instructions must be padded with no-operation instructions, which increase the executable's size, significantly affecting cache performance and memory bandwidth. In contrast, MAJC uses an encoding in each VLIW packet to indicate how many instructions the packet contains, thus reducing the executable footprint. Therefore, in MAJC, the term VLIW packet can also denote a variable-length-instruction-word packet.
A MAJC executable's size is very close to that of a RISC executable given support for similar 32-bit instructions. A traditional four-issue VLIW architecture (chosen for comparison because the maximum MAJC packet size is four instructions) requires 224 instruction words in the loop. However, with the variable-packet-size approach, MAJC requires only 62 instruction words in the loop, a reduction of more than 70% of the memory footprint, thereby reducing cache and memory bandwidth requirements.
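The footprint comparison above can be checked with a short calculation. The sketch below is illustrative only: the function names and the per-packet size distribution are assumptions chosen to reproduce the quoted totals of 224 and 62 instruction words, not data from the source.

```python
def padded_words(packet_sizes, max_issue=4):
    """Words used when every packet is padded with no-ops to max_issue slots."""
    return len(packet_sizes) * max_issue


def variable_length_words(packet_sizes):
    """Words used when each packet stores only its real instructions."""
    return sum(packet_sizes)


def footprint_reduction(padded, variable):
    """Fractional saving of the variable-length encoding over the padded one."""
    return 1 - variable / padded


# Hypothetical loop schedule (not from the text): 56 packets, of which
# 50 hold one instruction and 6 hold two, chosen to match the quoted totals.
example_sizes = [1] * 50 + [2] * 6

padded = padded_words(example_sizes)             # 56 * 4 = 224
variable = variable_length_words(example_sizes)  # 50 + 12 = 62
print(f"{footprint_reduction(padded, variable):.1%}")  # 72.3%
```

With these totals the saving is 1 - 62/224, or about 72%, consistent with the "more than 70%" figure quoted above.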
Conceptually, each instruction in a VLIW packet uses an instruction slice for its execution. Except for the first slice, the resources in each instruction slice are similar to each other. The first instruction in a VLIW packet has a different instruction slice because it has only one quarter of the opcode space available to other instructions (Fig. 8.7). Part of its opcode space is the VLIW packet header, which encodes the number of instructions available in the VLIW packet.
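To make the role of the packet header concrete, the following sketch walks a stream of 32-bit words and groups them into packets. The bit layout is an assumption made for illustration: the text says only that the first instruction gives up three quarters of its opcode space, which is consistent with a 2-bit header, and the sketch places that header in the two most significant bits and encodes the instruction count minus one. The names `HEADER_SHIFT`, `packet_length`, and `split_packets` are hypothetical.

```python
HEADER_SHIFT = 30   # assumption: header sits in the two most significant bits
HEADER_MASK = 0b11  # assumption: 2-bit header encoding (count - 1)


def packet_length(first_word):
    """Number of instructions in the packet that starts with first_word."""
    return ((first_word >> HEADER_SHIFT) & HEADER_MASK) + 1


def split_packets(words):
    """Group a flat list of 32-bit instruction words into VLIW packets."""
    packets = []
    i = 0
    while i < len(words):
        n = packet_length(words[i])
        if i + n > len(words):
            raise ValueError("instruction stream ends inside a packet")
        packets.append(words[i:i + n])
        i += n  # only the first word of each packet is read as a header
    return packets


# Hypothetical stream: a 1-instruction, a 2-instruction, and a 4-instruction packet.
stream = [0x00000001, 0x40000002, 0x00000003,
          0xC0000004, 0x00000005, 0x00000006, 0x00000007]
lengths = [len(p) for p in split_packets(stream)]
```

Because only the first word of a packet carries the header, no padding words are needed for packets shorter than four instructions.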
In many ways, each instruction slice is an independent execution path. Each instruction slice has its own control and status registers, as well as private registers. Instructions execute completely within their own slices without requiring the resources in another slice, reducing communication requirements between instruction slices and, thus, the wires between slices. However, the global registers are shared so that instruction slices can communicate with each other. An implementation designer can choose to replicate these registers per slice and to broadcast updates from the other slices to maintain each slice's independence.
Instruction slices enable an implementer to use copy-and-paste techniques to design similar instruction slices, reducing design and verification efforts. The architecture lets the implementer choose whether an implementation supports a maximum of one, two, three, or four instructions in a VLIW packet to address a particular application requirement. The copy-and-paste technique supports this freedom by allowing additional instruction slices without significant additional effort. Binary translation is the planned method for maintaining binary compatibility between implementations.
The architecture supports a maximum of four 32-bit instructions in a VLIW packet because extracting instruction-level parallelism beyond four instructions in traditional applications is relatively difficult. However, each instruction can be SIMD, allowing up to four different SIMD instructions in a VLIW packet. This multiple-SIMD capability, coupled with the uniformity of resources in each instruction slice, provides application programmers the flexibility to fully use all the instruction slices. Thus, MAJC significantly speeds up broadband applications without supporting more instructions in a VLIW packet.
MAJC specifies that each instruction can access up to 128 registers, requiring 7 bits per register specifier to encode. An implementation can allow access to 32, 64, 96, or 128 registers per instruction (Fig. 8.8). These registers consist of global and private registers. Global registers are shared by all the instruction slices to permit interslice communication. Private registers are accessible only by instructions in the same slice.
The number of global and private registers per instruction slice is implementation-dependent as long as the implementation meets the following rules:
• there is a minimum of 32 global registers,
• the number of global registers is a multiple of 32,
• the number of private registers is a multiple of 32, and
• the sum of the number of global and private registers per instruction slice does not exceed 128.
Thus, with a maximum of four instruction slices per VLIW packet, the MAJC register file definition allows from 32 to 416 registers but requires only 7 bits of opcode space per register specifier. For example, a VLIW packet supporting three instruction slices with 64 private registers per slice and 32 global registers has a total of 224 registers. Likewise, a VLIW packet supporting four instruction slices with 32 private registers per slice and 96 global registers also has a total of 224 registers.
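The register-file rules and the totals quoted above can be expressed as a small validator. This is a minimal sketch: the function name and parameter names are hypothetical, and it assumes every slice in an implementation has the same number of private registers, as in the examples in the text.

```python
def total_registers(global_regs, private_per_slice, slices):
    """Check a register configuration against the stated rules and
    return the total number of registers it implies."""
    if not 1 <= slices <= 4:
        raise ValueError("an implementation supports one to four instruction slices")
    if global_regs < 32 or global_regs % 32 != 0:
        raise ValueError("global registers: at least 32 and a multiple of 32")
    if private_per_slice < 0 or private_per_slice % 32 != 0:
        raise ValueError("private registers per slice: a multiple of 32")
    if global_regs + private_per_slice > 128:
        raise ValueError("global + private per slice must not exceed 128")
    # Global registers are counted once; private registers once per slice.
    return global_regs + slices * private_per_slice


print(total_registers(32, 64, 3))  # 224: three slices, 64 private each
print(total_registers(96, 32, 4))  # 224: four slices, 32 private each
print(total_registers(32, 96, 4))  # 416: the largest configuration
print(total_registers(32, 0, 1))   # 32: the smallest configuration
```

In every valid configuration an instruction sees at most 128 registers, which is why 7 bits per register specifier suffice regardless of the total.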
The private registers accessed by an instruction depend on the instruction slice that the instruction in the VLIW packet occupies. MAJC's register configuration flexibility permits implementation trade-offs between the area and cycle-time impact of using a large number of registers and the performance achieved by using a register configuration designed for a particular application.
The MAJC architectural registers as well as the instruction slices are "data-type agnostic"; that is, they can store and process information regardless of its data type. The instruction set provides a rich set of instructions to process a variety of data types ranging from 8- to 64-bit data types (Fig. 8.9).
An implementation can choose either 32- or 64-bit register widths. MAJC treats data types smaller than an implementation's register width as SIMD data types. Data types larger than an implementation's register width require a pair of evenly aligned registers to hold the data.
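The relationship between data-type width and register width can be sketched as follows. The function is hypothetical and makes two assumptions beyond the text: data widths are limited to the powers of two 8, 16, 32, and 64 bits, and "evenly aligned" is read as meaning that a register pair starts at an even-numbered register.

```python
def register_layout(data_bits, register_bits):
    """Return (registers_needed, elements_per_register) for one data type."""
    if register_bits not in (32, 64):
        raise ValueError("register width is 32 or 64 bits")
    if data_bits not in (8, 16, 32, 64):  # assumption: power-of-two widths only
        raise ValueError("unsupported data width")
    if data_bits <= register_bits:
        # Narrower types pack several elements into one register (SIMD);
        # a type as wide as the register is a single scalar element.
        return 1, register_bits // data_bits
    # Wider types (64-bit data on a 32-bit implementation) span a register pair.
    return 2, 1


def is_valid_pair_base(register_number):
    """Assumed reading of 'evenly aligned': the pair starts at an even register."""
    return register_number % 2 == 0


print(register_layout(16, 32))  # (1, 2): two 16-bit elements per register
print(register_layout(8, 64))   # (1, 8): eight 8-bit elements per register
print(register_layout(64, 32))  # (2, 1): one value across a register pair
```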
Unlike other architectures, which typically have separate integer, floating-point, and SIMD registers and execution resources, MAJC integrates the processing of various data types in an instruction slice. The compiler can then allocate registers for program variables without regard to data type and can efficiently allocate resources to meet application requirements. In addition, because a VLIW packet need not dispatch instructions to separate execution resources based on the data type it is processing, instruction processing in an instruction slice is streamlined. An instruction slice can share the execution resources for processing various data types without replicating common resources such as multiplier arrays, leading to better resource use.
The architecture specifies a precise execution model in which each VLIW packet executes to completion or does not execute at all in sequential program order. All the instructions in a VLIW packet execute at the same time, so they must not have any read-after-write or write-after-write interdependencies. Write-after-read dependencies between instructions in the same VLIW packet are permitted. Instructions in a VLIW packet generate precise exceptions before the packet updates an architectural state. Interrupts occur asynchronously after a VLIW packet has completed updating any architectural state, but before the next sequential program order VLIW packet updates any architectural state.
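The intra-packet dependency rule lends itself to a simple compiler-side check. The sketch below is an illustration, not the MAJC compiler's method: the `Instr` type and `packet_hazards` function are hypothetical, instructions are reduced to sets of destination and source register numbers, and slot order within the packet is assumed to distinguish a read-after-write hazard from a permitted write-after-read.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dests: frozenset  # register numbers written by the instruction
    srcs: frozenset   # register numbers read by the instruction


def packet_hazards(packet):
    """List forbidden dependencies between instructions in one packet.

    Read-after-write and write-after-write pairs are reported;
    write-after-read pairs are allowed and therefore ignored.
    """
    hazards = []
    for i, earlier in enumerate(packet):
        for j in range(i + 1, len(packet)):
            later = packet[j]
            for reg in sorted(earlier.dests & later.srcs):
                hazards.append(("RAW", i, j, reg))
            for reg in sorted(earlier.dests & later.dests):
                hazards.append(("WAW", i, j, reg))
    return hazards


# Slot 1 overwrites r2, which slot 0 only reads: write-after-read, permitted.
war_only = [Instr(frozenset({1}), frozenset({2, 3})),
            Instr(frozenset({2}), frozenset({4}))]
# Slot 1 reads r1, which slot 0 writes: read-after-write, forbidden.
raw = [Instr(frozenset({1}), frozenset({2})),
       Instr(frozenset({3}), frozenset({1}))]

print(packet_hazards(war_only))  # []
print(packet_hazards(raw))       # [('RAW', 0, 1, 1)]
```

A scheduler built on such a check would move the dependent instruction into a later packet whenever the list is non-empty.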
MAJC supports three trap levels, thus allowing two nested traps. The privilege mode of software executing in a trap level is controlled by a field in the corresponding trap level's processor unit status register and is independent of the trap level. User programs and system software will execute at the first trap level, the former in nonprivileged mode and the latter in privileged mode. The second trap level is for interrupt and exception dispatch software, and the third trap level is for diagnostic software.
The memory consistency model assumed by MAJC is a variant of release consistency. In this model, the sole constraint is that memory operations on the same address from a single execution stream appear to be performed in sequential program order with respect to that stream.
However, VLIW packet fetching is not coherent with the memory operations of an execution stream. Memory operations on different addresses from an execution stream are independent and thus appear to be unordered. Additionally, there is no total order between memory operations from different execution streams. To order memory operations, MAJC requires the use of appropriate memory barrier instructions to provide the semantics of acquire and release operations as defined by release consistency. Release consistency provides the greatest flexibility for implementers to design memory subsystems for CMP and multiprocessor systems efficiently. Since Java programs already assume a relaxed memory model, the architecture does not additionally burden application programmers by using a relaxed memory model such as release consistency.