Architectural and Structural Features of Memory Organization
There are the following architectural and structural varieties of memory organization:
- Shared memory (Symmetrical MultiProcessing, SMP) or distributed memory (Massively Parallel Processor, MPP);
- Shared memory common to several processors connected by a single bus (Uniform Memory Architecture, UMA);
- Physically distributed but logically shared memory (Non-Uniform Memory Architecture, NUMA);
- Cache-Only Memory Architecture (COMA);
- Integration of DRAM fabrication technology with the fabrication, on the same chip, of logic circuits or processors distributed over the entire memory array (Processor-In-Memory, PIM), etc.
Multiple-Module Memories and Interleaving. If main memory is structured as a collection of physically separate modules, each with its own Address Buffer Register (ABR) and Data Buffer Register (DBR), it is possible for more than one module to be performing Read or Write operations at any given time. The average rate of transfer of words to and from the total RAM system can thus be increased. Extra controls are required, but since the RAM speed is often the bottleneck in computation speed, the expense involved is usually justified in large computers.
The way in which individual addresses are distributed over the modules is a critical factor in determining the average number of modules that can be kept busy as computations proceed. Two methods of address layout are indicated in Fig. 1.28. In the first case, the RAM address generated by the CPU is decoded as shown in Fig. 1.28,a. The high-order k bits of the address name one of n modules, and the low-order m bits name a particular word in that module. If the CPU issues Read requests to consecutive locations, as it does when fetching instructions of a straight-line program, then only one module is kept busy by the CPU. However, devices with direct memory access (DMA) ability may be operating in other memory modules.
The second and more effective way of addressing the modules is shown in Fig. 1.28, b. It is called memory interleaving. A module is selected by the low-order k bits of the RAM address, and the high-order m bits name a location within that module. Therefore, consecutive addresses are located in successive modules. Thus, any component of the system, for example, the CPU or DMA device, which generates requests for access to consecutive RAM locations, can keep a number of modules busy at any one time. This results in a higher average utilization of the memory system as a whole.
In the system of Fig. 1.28,b, there must be 2^k modules; otherwise, there will be gaps of nonexistent locations in the RAM address space. This raises a practical issue. The first system described, Fig. 1.28,a, is more flexible than the second in that any number of modules up to 2^k can be used. The modules are normally assigned consecutive RAM addresses from 0 up. Hence, an existing system can be expanded by simply adding one or more modules as required. The second system must always have the full set of 2^k modules, and a failure in any module affects all areas of the address space. A failed module in the first system affects only a localized area of the address space. To take advantage of an interleaved RAM unit, the CPU should be able to issue requests for memory words before these words are actually needed for execution.
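As a hedged illustration of the two address layouts, the following minimal C sketch splits a RAM address into a module number and an address within a module; the field widths K_BITS and M_BITS, and the function names, are assumptions chosen only for this example, not values taken from Fig. 1.28.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative field widths: k module-select bits, m word-address bits. */
#define K_BITS 3          /* 2^3 = 8 modules       */
#define M_BITS 13         /* 2^13 words per module */

/* Layout of Fig. 1.28,a: high-order k bits select the module,
   low-order m bits select the word within that module. */
static void decode_consecutive(uint32_t addr, unsigned *module, unsigned *offset)
{
    *module = (addr >> M_BITS) & ((1u << K_BITS) - 1);
    *offset = addr & ((1u << M_BITS) - 1);
}

/* Layout of Fig. 1.28,b (interleaving): low-order k bits select the
   module, so consecutive addresses fall in successive modules. */
static void decode_interleaved(uint32_t addr, unsigned *module, unsigned *offset)
{
    *module = addr & ((1u << K_BITS) - 1);
    *offset = (addr >> K_BITS) & ((1u << M_BITS) - 1);
}

int main(void)
{
    unsigned mod, off;
    for (unsigned addr = 0; addr < 4; addr++) {
        decode_interleaved(addr, &mod, &off);
        printf("address %u -> module %u, offset %u\n", addr, mod, off);
    }
    return 0;
}
```

With the interleaved decoding, the consecutive addresses 0, 1, 2, 3 land in modules 0, 1, 2, 3, which is what allows a stream of consecutive requests to keep several modules busy at once.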
Cache Memories. Analysis of a large number of typical programs has shown that most of their execution time is spent in a few main routines. When execution is localized within these routines, a number of instructions are executed repeatedly. This may be in the form of a simple loop, nested loops, or a few procedures that repeatedly call each other. The actual detailed pattern of instruction sequencing is not important. The main observation is that many instructions in each of a few localized areas of the program are repeatedly executed, while the remainder of the program is accessed relatively infrequently. This phenomenon is referred to as locality of reference.
Now, if it can be arranged to have the active segments of a program in a fast memory, then the total execution time can be significantly reduced. Such a memory is referred to as a cache (or buffer) memory. It is inserted between the CPU and the RAM, as shown in Fig. 1.29. To make this arrangement effective, the cache must be considerably faster than the RAM. Their relative access times usually differ by a factor of 5 to 10. This approach is more economical than the use of fast memory devices to implement the entire RAM.
Conceptually, operation of a cache memory is very simple. The memory control circuitry is designed to take advantage of the property of locality of reference. When a Read request is received from the CPU, the contents of a block of memory words containing the specified location are transferred into the cache one word at a time. When any of the locations in this block is referenced by the program, its contents are read directly from the cache. Usually, the cache memory can store a number of such blocks at any given time. The correspondence between the RAM blocks and those in the cache is specified by means of a mapping function. When the cache is full and a memory word (instruction or data) is referenced that is not in the cache, a decision must be made as to which block should be removed to create space for the new block that contains the referenced word. The collection of rules for making this decision constitutes the replacement algorithm.
In each of the techniques that we will describe, there are some basic assumptions and operations that are independent of the particular mapping function and replacement algorithm used. It is best to describe them first. The CPU does not need to know explicitly about the existence of the cache. The CPU simply makes Read and Write requests as described previously. The addresses generated by the CPU always refer to locations in the RAM. The memory-access control circuitry shown in Fig. 1.29 determines whether or not the requested word currently exists in the cache. If it does, the Read or Write operation is performed on the appropriate cache location. When the operation is a Read, the main memory is not involved. However, if the operation is a Write, there are two ways that the system can proceed. In the first case, the cache location and the RAM location are updated simultaneously. This is called the store-through method. The alternative is to update the cache location only and to mark it as such through the use of an associated flag bit. Later, when the block containing this marked word is to be removed from the cache to make way for a new block, the permanent RAM location of the word is updated. The store-through method is clearly simpler, but it results in unnecessary Write operations in the RAM when a given cache word is updated a number of times during its cache residency period.
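A minimal sketch of these two write-hit policies is given below; the cache-line structure, the flag-bit name, and the helper ram_write are assumptions introduced only to make the contrast concrete.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical cache line: the data words plus the flag bit used by the
   deferred-update alternative to mark the cache copy as the newer one. */
typedef struct {
    uint16_t tag;
    bool     modified;
    uint8_t  data[16];
} cache_line_t;

/* Stand-in for a word write to the RAM (a real controller would drive
   the memory bus here). */
static void ram_write(uint32_t addr, uint8_t value)
{
    printf("RAM[%lu] <- %u\n", (unsigned long)addr, value);
}

/* Store-through: on a write hit, the cache location and the RAM
   location are updated simultaneously. */
void write_hit_store_through(cache_line_t *line, uint32_t addr,
                             unsigned word, uint8_t value)
{
    line->data[word] = value;
    ram_write(addr, value);
}

/* Deferred update: only the cache location is changed and marked; the
   RAM copy is brought up to date later, when the block is removed
   from the cache. */
void write_hit_deferred(cache_line_t *line, unsigned word, uint8_t value)
{
    line->data[word] = value;
    line->modified = true;
}
```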
Next, consider the case where the addressed word is not in the cache and the operation is a Read. If this happens, the block of words in the RAM that contains the requested word is brought into the cache, and then the particular word requested is forwarded to the CPU. There is an opportunity for some time saving here if the word is forwarded to the CPU as soon as it is available from the RAM instead of waiting for the whole block to be loaded into the cache. This is called load-through. During a Write operation, if the addressed word is not in the cache, the information is written directly into the RAM. In this case, there is little advantage in transferring the block containing the addressed word to the cache. A Write operation normally refers to a location in one of the data areas of a program rather than to the memory area containing the program instructions. The property of locality of reference is not as pronounced in accessing data when Write operations are involved. Finally, we should recall that in the case of an interleaved memory, contiguous block transfers are very efficient. Thus, transferring data in blocks from the RAM to the cache enables an interleaved RAM unit to operate at its maximum possible speed.
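The read-miss case with load-through could be sketched, under the same illustrative assumptions (a hypothetical 64K-word RAM array and a 16-word block), roughly as follows; in real hardware the CPU would resume as soon as the requested word is forwarded, rather than waiting for the loop to finish.

```c
#include <stdint.h>

#define BLOCK_WORDS 16   /* assumed block size for this sketch */

/* Hypothetical stand-ins for the RAM and the cache block being filled. */
static uint8_t ram[1u << 16];
static uint8_t block[BLOCK_WORDS];

/* Read miss with load-through: the requested word is forwarded to the CPU
   as soon as it arrives from the RAM, while the rest of the block keeps
   loading into the cache. */
uint8_t read_miss_load_through(uint16_t addr)
{
    unsigned base   = addr & ~(unsigned)(BLOCK_WORDS - 1);  /* start of block */
    unsigned want   = addr &  (BLOCK_WORDS - 1);            /* requested word */
    uint8_t  result = 0;

    for (unsigned w = 0; w < BLOCK_WORDS; w++) {
        block[w] = ram[base + w];      /* next word of the block arrives      */
        if (w == want)
            result = block[w];         /* load-through: forward it to the CPU */
    }
    return result;
}
```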
Mapping Functions. In order to discuss possible methods for specifying where RAM blocks are placed in the cache, it is helpful to use a specific example. Consider a cache of 2048 (2K) words with a block size of 16 words. This means that the cache is organized as 128 blocks. Let the RAM have 64K words, addressable by a 16-bit address. For mapping purposes, the memory will be considered as composed of 4K blocks of 16 words each.
The simplest way of associating RAM blocks with cache blocks is the direct-mapping technique. In this technique, block k of the RAM maps onto block k modulo 128 of the cache. This is depicted in Fig. 1.30. Since more than one RAM block is mapped onto a given cache block position, contention may arise for that position. This situation occurs even when the cache is not full. Contention is resolved by allowing the new block to overwrite the currently resident block. Thus the replacement algorithm is trivial. The detailed operation of the direct-mapping technique is as follows. Let a RAM address consist of three fields, as shown in Fig. 1.30. When a new block is first brought into the cache, the high-order 5 bits of its RAM address are stored in five tag bits associated with its location in the cache. When the CPU generates a memory request, the 7-bit block address determines the corresponding cache block. The tag field of that block is compared to the tag field of the address. If they match, the desired word (specified by the low-order 4 bits of the address) is in that block of the cache.
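As a minimal, non-authoritative C sketch of this direct-mapped lookup (assuming 16-bit data words; the structure and function names are introduced here purely for illustration), the three address fields of Fig. 1.30 might be extracted and compared as follows:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS      128   /* 2K-word cache with 16-word blocks */
#define WORDS_PER_BLOCK 16

typedef struct {
    uint8_t  tag;                      /* high-order 5 bits of the RAM address */
    uint16_t data[WORDS_PER_BLOCK];
} cache_block_t;

static cache_block_t cache[NUM_BLOCKS];

/* Direct mapping: RAM block k may reside only in cache block k modulo 128,
   so the 7-bit block field selects the cache position and a single tag
   comparison decides between a hit and a miss. */
bool direct_mapped_read(uint16_t addr, uint16_t *value)
{
    unsigned word  =  addr        & 0xF;    /* low-order 4 bits: word in block */
    unsigned block = (addr >> 4)  & 0x7F;   /* next 7 bits: cache block number */
    unsigned tag   = (addr >> 11) & 0x1F;   /* high-order 5 bits: tag          */

    if (cache[block].tag == tag) {
        *value = cache[block].data[word];   /* hit: word read from the cache   */
        return true;
    }
    return false;                           /* miss: the word must come from RAM */
}
```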
If there is no match, the required word must be accessed in the RAM. The direct-mapping technique is easy to implement, but it is not very flexible. Fig. 1.31 shows a much more flexible mapping method, whereby a RAM block can potentially reside in any cache block position. This is called the associative-mapping technique. In this case, 12 tag bits are required to identify a RAM block when it is resident in the cache. The tag bits of an address received from the CPU must be compared to the tag bits of each block of the cache to see if the desired block is present. Since there is complete freedom in block positioning, a wide range of replacement algorithms is possible. However, it might not be practical to make full use of this freedom, because complex replacement algorithms may be too difficult to implement. The cost of implementation is also adversely affected by the requirement for a 128-way associative search of 12-bit patterns. The final mapping method to be discussed is the most practical; it is intermediate between the two techniques above. Blocks of the cache are grouped into sets, and the mapping allows a block of the RAM to reside in any block of a specific set. Hence, having a few choices for block placement eases the contention problem of the direct method. At the same time, decreasing the size of the associative search reduces the hardware cost.
An example of this block-set-associative-mapping technique is given in Fig. 1.32 for the case of two blocks per set. The 6-bit set field of the address determines which set of the cache might contain the desired block, as in the direct-mapping method. The tag field of the address must then be associatively compared to the tags of the two blocks of the set to see if a match occurs signifying block presence. This two-way associative search is not difficult to implement. It is clear that four blocks per set would be accommodated by a 5-bit set field, eight blocks per set by a 4-bit set field, etc. The extreme condition of 128 blocks per set requires no set bits and corresponds to the fully associative technique with 12 tag bits. The other extreme of one block per set is the direct-mapping method.
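Under the same illustrative assumptions as before (names and data types are assumptions, not prescribed structures), a sketch of the two-blocks-per-set case might look as follows; with 128 blocks per set the loop would degenerate into the fully associative search mentioned above, and with one block per set into the direct-mapped comparison.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS        64    /* 128 cache blocks, two blocks per set */
#define WAYS            2
#define WORDS_PER_BLOCK 16

typedef struct {
    uint8_t  tag;                      /* high-order 6 bits of the RAM address */
    uint16_t data[WORDS_PER_BLOCK];
} cache_block_t;

static cache_block_t cache[NUM_SETS][WAYS];

/* Block-set-associative lookup: the 6-bit set field selects one set, and the
   6-bit tag is compared associatively against both blocks of that set
   (a two-way associative search). */
bool set_associative_read(uint16_t addr, uint16_t *value)
{
    unsigned word =  addr        & 0xF;    /* low-order 4 bits: word in block */
    unsigned set  = (addr >> 4)  & 0x3F;   /* 6-bit set field                 */
    unsigned tag  = (addr >> 10) & 0x3F;   /* 6-bit tag field                 */

    for (unsigned way = 0; way < WAYS; way++) {
        if (cache[set][way].tag == tag) {
            *value = cache[set][way].data[word];   /* hit */
            return true;
        }
    }
    return false;                                  /* miss */
}
```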
There has been a tacit assumption that both tags and data are in the cache memory. But it is quite reasonable to have the few tag bits in a separate, even faster memory, especially when associative searches are required. The result of accessing or searching this faster tag directory determines whether or not the desired block is in the cache. If it is, the block is in the cache position that directly corresponds to the directory tag position where the match is found.
Another technical detail that should be mentioned, and that is in fact independent of the mapping function, is that it is usually necessary to have a Valid bit associated with each block. This bit indicates whether or not the block contains valid data. It does not serve the same function as the bit mentioned earlier that is needed to distinguish whether or not the cache contains an updated version of the RAM block. The Valid bits are all set to 0 when power is initially applied to the system or when the RAM is loaded with new programs and data from mass storage devices. These latter transfers normally bypass the cache and are achieved by an I/O channel or some simpler DMA mechanism. The Valid bit of a particular cache block is set to 1 the first time this block is loaded from the RAM. Once set, a Valid bit stays equal to 1 unless a RAM block is updated by a source other than the CPU that bypasses the cache. In that case, a check is made to determine whether the block is currently in the cache; if it is, its Valid bit is set to 0. The introduction of the Valid bit also means that a slight modification should be made to our earlier discussion of cache accesses: in addition to tag matching, accessing should only proceed if the Valid bit is equal to 1.
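As a small illustrative addition (the directory-entry structure and function names are assumed, not prescribed), the modified hit test and the invalidation step described above could be expressed as:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t tag;
    bool    valid;       /* Valid bit: block holds usable data            */
    bool    modified;    /* flag bit for deferred RAM updates (see above) */
} cache_dir_entry_t;

/* Hit condition with the Valid bit included: tag matching alone is no
   longer sufficient; the Valid bit must also be 1. */
bool is_hit(const cache_dir_entry_t *e, uint8_t tag)
{
    return e->valid && e->tag == tag;
}

/* When a RAM block is updated by a source that bypasses the cache
   (e.g. a DMA transfer), a resident copy must be invalidated. */
void invalidate_if_present(cache_dir_entry_t *e, uint8_t tag)
{
    if (e->tag == tag)
        e->valid = false;
}
```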
Flash Memory Device. Flash memory is a form of non-volatile memory that can be electrically erased and reprogrammed; it is a variety of Electrically Erasable Programmable Read-Only Memory (EEPROM), in which stored data can be cancelled by applying electrical pulses. Unlike its counterpart EPROM, this device does not need special apparatus for programming. It is used to store information while power is off. A short description of the different memory types is given in Table 1.1.