Lecture slides prepared for "Computer Organization and Architecture," 9/e, by William Stallings, Chapter 4, "Cache Memory." Korean edition: 컴퓨터시스템구조론, Stallings, translated by 김종현, 9th edition.
Chapter 4: Cache Memory

Although seemingly simple in concept, computer memory exhibits perhaps the widest range of type, technology, organization, performance, and cost of any feature of a computer system. No single technology is optimal in satisfying the memory requirements for a computer system. As a consequence, the typical computer system is equipped with a hierarchy of memory subsystems, some internal to the system (directly accessible by the processor) and some external (accessible by the processor via an I/O module). This chapter and the next focus on internal memory elements, while Chapter 6 is devoted to external memory. To begin, the first section examines key characteristics of computer memories. The remainder of the chapter examines an essential element of all modern computer systems: cache memory.
Key Characteristics of Computer Memory

The complex subject of computer memory is made more manageable if we classify memory systems according to their key characteristics. The most important of these are listed in Table 4.1.

Table 4.1 Key Characteristics of Computer Memory Systems
Characteristics of Memory Systems
• Location: refers to whether the memory is internal or external to the computer. Internal memory is often equated with main memory. The processor requires its own local memory, in the form of registers. Cache is another form of internal memory. External memory consists of peripheral storage devices, such as disk and tape, that the processor accesses via I/O controllers.
• Capacity: memory capacity is typically expressed in bytes.
• Unit of transfer: for internal memory, the unit of transfer is equal to the number of electrical lines into and out of the memory module.

The term location in Table 4.1 refers to whether memory is internal or external to the computer. Internal memory is often equated with main memory. But there are other forms of internal memory. The processor requires its own local memory, in the form of registers (e.g., see Figure 2.3). Further, as we shall see, the control unit portion of the processor may also require its own internal memory. We will defer discussion of these latter two types of internal memory to later chapters. Cache is another form of internal memory. External memory consists of peripheral storage devices, such as disk and tape, that are accessible to the processor via I/O controllers. An obvious characteristic of memory is its capacity. For internal memory, this is typically expressed in terms of bytes (1 byte = 8 bits) or words. Common word lengths are 8, 16, and 32 bits. External memory capacity is typically expressed in terms of bytes. A related concept is the unit of transfer. For internal memory, the unit of transfer is equal to the number of electrical lines into and out of the memory module. This may be equal to the word length, but is often larger, such as 64, 128, or 256 bits. To clarify this point, consider three related concepts for internal memory:
• Word: The "natural" unit of organization of memory. The size of a word is typically equal to the number of bits used to represent an integer and to the instruction length. Unfortunately, there are many exceptions. For example, the CRAY C90 (an older model CRAY supercomputer) has a 64-bit word length but uses a 46-bit integer representation. The Intel x86 architecture has a wide variety of instruction lengths, expressed as multiples of bytes, and a word size of 32 bits.
• Addressable units: In some systems, the addressable unit is the word. However, many systems allow addressing at the byte level. In any case, the relationship between the length in bits A of an address and the number N of addressable units is 2^A = N.
• Unit of transfer: For main memory, this is the number of bits read out of or written into memory at a time. The unit of transfer need not equal a word or an addressable unit. For external memory, data are often transferred in much larger units than a word, and these are referred to as blocks.
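As a hedged illustration of the address-length relation above (2^A = N), the short Python sketch below computes how many address bits a byte-addressable memory of a given capacity needs; the capacity value is only an example, not a figure from the text.

import math

def address_bits(addressable_units: int) -> int:
    # Return the address length A such that 2**A covers N addressable units.
    return math.ceil(math.log2(addressable_units))

# Example (hypothetical values): a 64 KB byte-addressable memory
N = 64 * 1024          # number of addressable units (bytes)
A = address_bits(N)    # 16 address bits, since 2**16 = 65536
print(A)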
Methods of Accessing Units of Data
• Sequential access: memory is organized into units of data called records; access must be made in a specific linear sequence; access time is variable.
• Direct access: involves a shared read-write mechanism; individual blocks or records have a unique address based on physical location; access time is variable.
• Random access: each addressable location in memory has a unique, physically wired-in addressing mechanism; the time to access a given location is independent of the sequence of prior accesses and is constant; any location can be selected at random and directly addressed and accessed; main memory and some cache systems are random access.
• Associative: a word is retrieved based on a portion of its contents rather than its address; each location has its own addressing mechanism, and retrieval time is constant independent of location or prior access patterns; cache memories may employ associative access.

Another distinction among memory types is the method of accessing units of data. These include the following:
• Sequential access: Memory is organized into units of data, called records. Access must be made in a specific linear sequence. Stored addressing information is used to separate records and assist in the retrieval process. A shared read–write mechanism is used, and this must be moved from its current location to the desired location, passing and rejecting each intermediate record. Thus, the time to access an arbitrary record is highly variable. Tape units, discussed in Chapter 6, are sequential access.
• Direct access: As with sequential access, direct access involves a shared read–write mechanism. However, individual blocks or records have a unique address based on physical location. Access is accomplished by direct access to reach a general vicinity plus sequential searching, counting, or waiting to reach the final location. Again, access time is variable. Disk units, discussed in Chapter 6, are direct access.
• Random access: Each addressable location in memory has a unique, physically wired-in addressing mechanism. The time to access a given location is independent of the sequence of prior accesses and is constant. Thus, any location can be selected at random and directly addressed and accessed. Main memory and some cache systems are random access.
• Associative: This is a random access type of memory that enables one to make a comparison of desired bit locations within a word for a specified match, and to do this for all words simultaneously. Thus, a word is retrieved based on a portion of its contents rather than its address. As with ordinary random-access memory, each location has its own addressing mechanism, and retrieval time is constant independent of location or prior access patterns. Cache memories may employ associative access.
Capacity and Performance: the two most important characteristics of memory
Three performance parameters are used:
• Access time (latency): for random-access memory, the time to perform a read or write, from the instant an address arrives at the memory to the instant data have been stored or made available; for non-random-access memory, the time to position the read-write mechanism at the desired location
• Memory cycle time: the access time plus any additional time required before a second access can commence (time for transients to die out on signal lines, or to regenerate data that are read destructively); memory cycle time is concerned with the system bus
• Transfer rate: the rate at which data can be transferred into or out of a memory unit; for random-access memory this is 1/(cycle time)

From a user's point of view, the two most important characteristics of memory are capacity and performance. Three performance parameters are used:
• Access time (latency): For random-access memory, this is the time it takes to perform a read or write operation, that is, the time from the instant that an address is presented to the memory to the instant that data have been stored or made available for use. For non-random-access memory, access time is the time it takes to position the read–write mechanism at the desired location.
• Memory cycle time: This concept is primarily applied to random-access memory and consists of the access time plus any additional time required before a second access can commence. This additional time may be required for transients to die out on signal lines or to regenerate data if they are read destructively. Note that memory cycle time is concerned with the system bus, not the processor.
• Transfer rate: This is the rate at which data can be transferred into or out of a memory unit. For random-access memory, it is equal to 1/(cycle time).
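As a rough worked example of the relation transfer rate = 1/(cycle time) for random-access memory, the sketch below uses assumed timing values (not figures from the text):

# Illustrative only: assumed access and recovery times for a random-access memory
access_time_ns = 60          # time to read or write one word (assumed)
recovery_ns = 20             # extra time before the next access can start (assumed)
cycle_time_ns = access_time_ns + recovery_ns   # memory cycle time = 80 ns

word_size_bits = 32          # assumed width of each transfer
transfers_per_second = 1 / (cycle_time_ns * 1e-9)          # = 1 / cycle time
transfer_rate_bps = transfers_per_second * word_size_bits  # bits per second
print(transfer_rate_bps)     # 4.0e8 bits/s for these assumed values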
Physical Types of Memory
The most common forms:
• Semiconductor memory
• Magnetic surface memory
• Optical
• Magneto-optical
Several important physical characteristics of data storage:
• Volatile memory: information decays naturally or is lost when electrical power is switched off
• Nonvolatile memory: once recorded, information remains without deterioration until deliberately changed; no electrical power is needed to retain information
• Magnetic-surface memories are nonvolatile
• Semiconductor memory may be either volatile or nonvolatile
• Nonerasable memory cannot be altered, except by destroying the storage unit; semiconductor memory of this type is known as read-only memory (ROM)
For random-access memory the organization is a key design issue; organization refers to the physical arrangement of bits to form words.

A variety of physical types of memory have been employed. The most common today are semiconductor memory, magnetic surface memory, used for disk and tape, and optical and magneto-optical. Several physical characteristics of data storage are important. In a volatile memory, information decays naturally or is lost when electrical power is switched off. In a nonvolatile memory, information once recorded remains without deterioration until deliberately changed; no electrical power is needed to retain information. Magnetic-surface memories are nonvolatile. Semiconductor memory (memory on integrated circuits) may be either volatile or nonvolatile. Nonerasable memory cannot be altered, except by destroying the storage unit. Semiconductor memory of this type is known as read-only memory (ROM). Of necessity, a practical nonerasable memory must also be nonvolatile. For random-access memory, the organization is a key design issue. In this context, organization refers to the physical arrangement of bits to form words. The obvious arrangement is not always used, as is explained in Chapter 5.
Memory Hierarchy
The three most important design constraints on a computer's memory:
• How much capacity?
• How fast?
• How expensive?
A trade-off among capacity, cost, and access time is required:
• Faster access time, greater cost per bit
• Greater capacity, smaller cost per bit
• Greater capacity, slower access time
The way out of the memory dilemma is not to rely on a single memory component or technology, but to employ a memory hierarchy.

The design constraints on a computer's memory can be summed up by three questions: How much? How fast? How expensive? The question of how much is somewhat open ended. If the capacity is there, applications will likely be developed to use it. The question of how fast is, in a sense, easier to answer. To achieve greatest performance, the memory must be able to keep up with the processor. That is, as the processor is executing instructions, we would not want it to have to pause waiting for instructions or operands. The final question must also be considered. For a practical system, the cost of memory must be reasonable in relationship to other components. As might be expected, there is a trade-off among the three key characteristics of memory: capacity, access time, and cost. A variety of technologies are used to implement memory systems, and across this spectrum of technologies, the following relationships hold:
• Faster access time, greater cost per bit
• Greater capacity, smaller cost per bit
• Greater capacity, slower access time
The dilemma facing the designer is clear. The designer would like to use memory technologies that provide for large-capacity memory, both because the capacity is needed and because the cost per bit is low. However, to meet performance requirements, the designer needs to use expensive, relatively lower-capacity memories with short access times. The way out of this dilemma is not to rely on a single memory component or technology, but to employ a memory hierarchy.
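The benefit of pairing a small fast level with a large slow level can be quantified with the standard two-level average-access-time estimate. The formula and the numbers below are not given on this slide; they are a minimal sketch under assumed values, with H denoting the fraction of accesses satisfied by the faster level.

# Standard two-level estimate: T_avg = H*T1 + (1 - H)*(T1 + T2).
# The timing values and hit ratio below are assumptions for illustration only.
def average_access_time(hit_ratio: float, t1_ns: float, t2_ns: float) -> float:
    # Average time per access when the faster level is always checked first.
    return hit_ratio * t1_ns + (1.0 - hit_ratio) * (t1_ns + t2_ns)

print(average_access_time(0.95, 1.0, 10.0))   # 1.5 ns with these assumed values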
Cache and Main Memory

Cache memory is designed to combine the fast access time of expensive, high-speed memory with the large size of less expensive, lower-speed memory. The concept is illustrated in Figure 4.3a. There is a relatively large and slow main memory together with a smaller, faster cache memory. The cache contains a copy of portions of main memory. When the processor attempts to read a word of memory, a check is made to determine if the word is in the cache. If so, the word is delivered to the processor. If not, a block of main memory, consisting of some fixed number of words, is read into the cache and then the word is delivered to the processor. Because of the phenomenon of locality of reference, when a block of data is fetched into the cache to satisfy a single memory reference, it is likely that there will be future references to that same memory location or to other words in the block. Figure 4.3b depicts the use of multiple levels of cache. The L2 cache is slower and typically larger than the L1 cache, and the L3 cache is slower and typically larger than the L2 cache.
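The read behaviour described above (check the cache; on a hit deliver the word, on a miss fetch the containing block and then deliver the word) can be sketched as follows. The class and method names are hypothetical, invented only for illustration.

# Minimal sketch of the cache read flow described above; names are not from the text.
class SimpleCache:
    def __init__(self, block_size: int):
        self.block_size = block_size
        self.blocks = {}                      # block number -> list of words

    def read(self, address: int, main_memory: list) -> int:
        block_no = address // self.block_size
        offset = address % self.block_size
        if block_no not in self.blocks:       # miss: fetch the whole block first
            start = block_no * self.block_size
            self.blocks[block_no] = main_memory[start:start + self.block_size]
        return self.blocks[block_no][offset]  # hit (or just-filled block)

# Usage sketch: cache = SimpleCache(4); cache.read(10, list(range(64)))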
Cache/Main Memory Organization

Figure 4.4 depicts the structure of a cache/main-memory system.
Elements of Cache Design

This section provides an overview of cache design parameters and reports some typical results. We occasionally refer to the use of caches in high-performance computing (HPC). HPC deals with supercomputers and their software, especially for scientific applications that involve large amounts of data, vector and matrix computation, and the use of parallel algorithms. Cache design for HPC is quite different from that for other hardware platforms and applications. Indeed, many researchers have found that HPC applications perform poorly on computer architectures that employ caches [BAIL93]. Other researchers have since shown that a cache hierarchy can be useful in improving performance if the application software is tuned to exploit the cache [WANG99, PRES01]. Although there are a large number of cache implementations, there are a few basic design elements that serve to classify and differentiate cache architectures. Table 4.2 lists key elements.

Table 4.2 Elements of Cache Design
Cache Addresses
Virtual memory
• A facility that allows programs to address memory from a logical point of view, without regard to the amount of main memory physically available
• When virtual memory is used, the address fields of machine instructions contain virtual addresses
• For reads from and writes to main memory, a hardware memory management unit (MMU) translates each virtual address into a physical address in main memory

Almost all non-embedded processors, and many embedded processors, support virtual memory, a concept discussed in Chapter 8. In essence, virtual memory is a facility that allows programs to address memory from a logical point of view, without regard to the amount of main memory physically available. When virtual memory is used, the address fields of machine instructions contain virtual addresses. For reads from and writes to main memory, a hardware memory management unit (MMU) translates each virtual address into a physical address in main memory.
Logical and Physical Caches

When virtual addresses are used, the system designer may choose to place the cache between the processor and the MMU or between the MMU and main memory (Figure 4.7). A logical cache, also known as a virtual cache, stores data using virtual addresses. The processor accesses the cache directly, without going through the MMU. A physical cache stores data using main memory physical addresses. One obvious advantage of the logical cache is that cache access speed is faster than for a physical cache, because the cache can respond before the MMU performs an address translation. The disadvantage has to do with the fact that most virtual memory systems supply each application with the same virtual memory address space. That is, each application sees a virtual memory that starts at address 0. Thus, the same virtual address in two different applications refers to two different physical addresses. The cache memory must therefore be completely flushed with each application context switch, or extra bits must be added to each line of the cache to identify which virtual address space this address refers to.
Mapping Function
Because there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines. Three techniques are used:
• Direct: the simplest technique; maps each block of main memory into only one possible cache line
• Associative: permits each main memory block to be loaded into any line of the cache; the cache control logic interprets a memory address simply as a Tag and a Word field; to determine whether a block is in the cache, the cache control logic must simultaneously examine every line's tag for a match
• Set associative: a compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages

Because there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines. Further, a means is needed for determining which main memory block currently occupies a cache line. The choice of the mapping function dictates how the cache is organized. Three techniques can be used: direct, associative, and set associative. Direct mapping: The simplest technique, known as direct mapping, maps each block of main memory into only one possible cache line. Associative mapping: Associative mapping overcomes the disadvantage of direct mapping by permitting each main memory block to be loaded into any line of the cache. Set-associative mapping: Set-associative mapping is a compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages.
Direct Mapping

The mapping is expressed as

i = j modulo m

where
i = cache line number
j = main memory block number
m = number of lines in the cache

Figure 4.8a shows the mapping for the first m blocks of main memory. Each block of main memory maps into one unique line of the cache. The next m blocks of main memory map into the cache in the same fashion; that is, block Bm of main memory maps into line L0 of the cache, block Bm+1 maps into line L1, and so on.
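A one-line rendering of the direct-mapping rule i = j mod m; the block and line counts below are arbitrary example values, not taken from the figure.

# Direct mapping: block j of main memory always lands in cache line i = j mod m.
m = 4                         # number of cache lines (example value)
for j in range(10):           # first few main memory block numbers
    i = j % m                 # cache line that block j maps to
    print(f"block {j} -> line {i}")
# Blocks 0, 4, 8 all compete for line 0; blocks 1, 5, 9 for line 1; and so on.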
Direct-Mapping Cache Organization

The mapping function is easily implemented using the main memory address. Figure 4.9 illustrates the general mechanism.
Direct Mapping Example

Figure 4.10 Direct Mapping Example.
Direct Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Number of blocks in main memory = 2^(s+w)/2^w = 2^s
• Number of lines in cache = m = 2^r
• Size of tag = (s - r) bits

The direct mapping technique is simple and inexpensive to implement. Its main disadvantage is that there is a fixed cache location for any given block. Thus, if a program happens to reference words repeatedly from two different blocks that map into the same line, then the blocks will be continually swapped in the cache, and the hit ratio will be low (a phenomenon known as thrashing).
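The sketch below decomposes a main memory address into the Tag, Line, and Word fields summarized above. The field widths w, r, s are free parameters; the particular values chosen here are illustrative assumptions only.

# Splitting an (s + w)-bit address into Tag (s - r bits), Line (r bits), Word (w bits).
# The field widths below are example values, not taken from the slide.
w = 2    # word field: 2**w = 4 words per block
r = 14   # line field: 2**r = 16384 cache lines
s = 22   # so addresses are s + w = 24 bits, with 2**s main memory blocks

def split_direct(address: int):
    word = address & ((1 << w) - 1)
    line = (address >> w) & ((1 << r) - 1)
    tag = address >> (w + r)                 # the (s - r) high-order bits
    return tag, line, word

print(split_direct(0x16339C))                # -> (tag, line, word) for this address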
Victim Cache
• In a direct-mapped cache, if a program happens to reference words repeatedly from two different blocks that map into the same line, the blocks will be continually swapped in the cache and the hit ratio will be low, a phenomenon known as thrashing
• Originally proposed as an approach to reduce the conflict misses of a direct-mapped cache without affecting its fast access time
• A fully associative cache
• Typical size is 4 to 16 cache lines
• Resides between the direct-mapped L1 cache and the next level of memory

One approach to lower the miss penalty is to remember what was discarded in case it is needed again. Since the discarded data has already been fetched, it can be used again at a small cost. Such recycling is possible using a victim cache. Victim cache was originally proposed as an approach to reduce the conflict misses of direct mapped caches without affecting its fast access time. Victim cache is a fully associative cache, whose size is typically 4 to 16 cache lines, residing between a direct mapped L1 cache and the next level of memory. This concept is explored in Appendix D.
Fully Associative Cache Organization

Associative mapping overcomes the disadvantage of direct mapping by permitting each main memory block to be loaded into any line of the cache (Figure 4.8b). In this case, the cache control logic interprets a memory address simply as a Tag and a Word field. The Tag field uniquely identifies a block of main memory. To determine whether a block is in the cache, the cache control logic must simultaneously examine every line's tag for a match. Figure 4.11 illustrates the logic.
Associative Mapping Example

Figure 4.12 Associative Mapping Example.
Associative Mapping Summary
• Address length = (s + w) bits
• Number of addressable units = 2^(s+w) words or bytes
• Number of blocks in main memory = 2^(s+w)/2^w = 2^s
• Number of lines in cache = not determined by the address format
• Size of tag = s bits

With associative mapping, there is flexibility as to which block to replace when a new block is read into the cache. Replacement algorithms, discussed later in this section, are designed to maximize the hit ratio. The principal disadvantage of associative mapping is the complex circuitry required to examine the tags of all cache lines in parallel.
Set-Associative Mapping
• A compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages
• The cache consists of a number of sets, each containing a number of lines
• A given block maps into any line of a given set
• Example: two lines per set (2-way set associative); a given block can be placed in either of the two lines of exactly one set

Set-associative mapping is a compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages. In this case, the cache consists of a number of sets, each of which consists of a number of lines. The relationships are

m = v × k
i = j modulo v

where
i = cache set number
j = main memory block number
m = number of lines in the cache
v = number of sets
k = number of lines in each set

This is referred to as k-way set-associative mapping.
Mapping from Main Memory to Cache: k-Way Set Associative

Figure 4.13a illustrates this mapping for the first v blocks of main memory. As with associative mapping, each word maps into multiple cache lines. For set-associative mapping, each word maps into all the cache lines in a specific set, so that main memory block B0 maps into set 0, and so on. Thus, the set-associative cache can be physically implemented as v associative caches. It is also possible to implement the set-associative cache as k direct mapping caches, as shown in Figure 4.13b. Each direct-mapped cache is referred to as a way, consisting of v lines. The first v lines of main memory are direct mapped into the v lines of each way; the next group of v lines of main memory are similarly mapped, and so on. The direct-mapped implementation is typically used for small degrees of associativity (small values of k) while the associative-mapped implementation is typically used for higher degrees of associativity [JACO08].
k-Way Set-Associative Cache Organization

For set-associative mapping, the cache control logic interprets a memory address as three fields: Tag, Set, and Word. The d set bits specify one of v = 2^d sets. The s bits of the Tag and Set fields specify one of the 2^s blocks of main memory. Figure 4.14 illustrates the cache control logic. With fully associative mapping, the tag in a memory address is quite large and must be compared to the tag of every line in the cache. With k-way set-associative mapping, the tag in a memory address is much smaller and is only compared to the k tags within a single set.
Set-Associative Mapping Summary
• Address length = (s + w) bits
• Number of blocks in main memory = 2^(s+w)/2^w = 2^s
• Number of lines in set = k
• Number of sets = v = 2^d
• Number of lines in cache = m = kv = k × 2^d
• Size of cache = k × 2^(d+w) words or bytes
• Size of tag = (s - d) bits
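A corresponding sketch for the summary above: the address is split into Tag (s - d bits), Set (d bits), and Word (w bits), and a block is searched for only among the k lines of its set. The field widths are again illustrative assumptions, not values from the slide.

# k-way set-associative mapping: block j goes into set i = j mod v, with v = 2**d sets.
# Example widths only: 24-bit address, 4 words per block, 2-way, 8192 sets.
w, d = 2, 13                 # word and set field widths
k = 2                        # lines per set (2-way)

def split_set_associative(address: int):
    word = address & ((1 << w) - 1)
    set_no = (address >> w) & ((1 << d) - 1)
    tag = address >> (w + d)                  # the (s - d) high-order bits
    return tag, set_no, word

# The tag is compared only against the k tags stored in set set_no.
print(split_set_associative(0x16339C))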
Figure 4.15 shows an example using set-associative mapping with two lines in each set, referred to as two-way set-associative.
Varying Associativity over Cache Size

Figure 4.16 shows the results of one simulation study of set-associative cache performance as a function of cache size [GENU04]. The difference in performance between direct and two-way set associative is significant up to at least a cache size of 64 kB. Note also that the difference between two-way and four-way at 4 kB is much less than the difference in going from 4 kB to 8 kB in cache size. The complexity of the cache increases in proportion to the associativity, and in this case would not be justifiable against increasing cache size to 8 or even 16 kB. A final point to note is that beyond about 32 kB, an increase in cache size brings no significant increase in performance. The results of Figure 4.16 are based on simulating the execution of a GCC compiler. Different applications may yield different results. For example, [CANT01] reports on the results for cache performance using many of the CPU2000 SPEC benchmarks. The results of [CANT01] in comparing hit ratio to cache size follow the same pattern as Figure 4.16, but the specific values are somewhat different.
Replacement Algorithms
• Once the cache has been filled, when a new block is brought into the cache, one of the existing blocks must be replaced
• For direct mapping there is only one possible line for any particular block, so no choice is possible
• For the associative and set-associative techniques, a replacement algorithm is needed
• To achieve high speed, the algorithm must be implemented in hardware

Once the cache has been filled, when a new block is brought into the cache, one of the existing blocks must be replaced. For direct mapping, there is only one possible line for any particular block, and no choice is possible. For the associative and set-associative techniques, a replacement algorithm is needed. To achieve high speed, such an algorithm must be implemented in hardware.
The four most common replacement algorithms:
• Least recently used (LRU): the most effective; replaces the block in the set that has been in the cache longest with no reference to it; because of its simplicity of implementation, LRU is the most popular replacement algorithm
• First-in-first-out (FIFO): replaces the block in the set that has been in the cache longest; easily implemented as a round-robin or circular buffer technique
• Least frequently used (LFU): replaces the block in the set that has experienced the fewest references; can be implemented by associating a counter with each line
• Random: picks a line at random from among the candidate lines

A number of algorithms have been tried. We mention four of the most common. Probably the most effective is least recently used (LRU): Replace that block in the set that has been in the cache longest with no reference to it. For two-way set associative, this is easily implemented. Each line includes a USE bit. When a line is referenced, its USE bit is set to 1 and the USE bit of the other line in that set is set to 0. When a block is to be read into the set, the line whose USE bit is 0 is used. Because we are assuming that more recently used memory locations are more likely to be referenced, LRU should give the best hit ratio. LRU is also relatively easy to implement for a fully associative cache. The cache mechanism maintains a separate list of indexes to all the lines in the cache. When a line is referenced, it moves to the front of the list. For replacement, the line at the back of the list is used. Because of its simplicity of implementation, LRU is the most popular replacement algorithm. Another possibility is first-in-first-out (FIFO): Replace that block in the set that has been in the cache longest. FIFO is easily implemented as a round-robin or circular buffer technique. Still another possibility is least frequently used (LFU): Replace that block in the set that has experienced the fewest references. LFU could be implemented by associating a counter with each line. A technique not based on usage (i.e., not LRU, LFU, FIFO, or some variant) is to pick a line at random from among the candidate lines. Simulation studies have shown that random replacement provides only slightly inferior performance to an algorithm based on usage [SMIT82].
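A minimal sketch of LRU for one fully associative cache, following the list-of-indexes scheme described above: a referenced line moves to the front, and the line at the back is chosen for replacement. The class and method names are hypothetical.

from collections import OrderedDict

# Minimal LRU sketch using an ordered map of tag -> block; names are illustrative.
class LRUCache:
    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.lines = OrderedDict()            # order of keys tracks recency of use

    def access(self, tag, block_loader):
        if tag in self.lines:                 # hit: move the line to the front
            self.lines.move_to_end(tag, last=False)
            return self.lines[tag]
        if len(self.lines) >= self.num_lines: # full: evict the least recently used
            self.lines.popitem(last=True)     # the line at the back of the list
        self.lines[tag] = block_loader(tag)   # fetch the block on a miss
        self.lines.move_to_end(tag, last=False)
        return self.lines[tag]

# Usage sketch: cache = LRUCache(4); cache.access(7, lambda t: [0] * 4)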
Write Policy
When a block that is resident in the cache is to be replaced, there are two cases to consider:
• If the old block in the cache has not been altered, it may be overwritten with a new block without first writing out the old block
• If at least one write has been performed on a word in that line of the cache, main memory must be updated by writing the line of cache out to the block of memory before bringing in the new block
Two problems to contend with:
• More than one device may have access to main memory
• A more complex problem occurs when multiple processors are attached to the same bus and each processor has its own cache; if data in one cache are altered, the corresponding word in other caches becomes invalid

When a block that is resident in the cache is to be replaced, there are two cases to consider. If the old block in the cache has not been altered, then it may be overwritten with a new block without first writing out the old block. If at least one write operation has been performed on a word in that line of the cache, then main memory must be updated by writing the line of cache out to the block of memory before bringing in the new block. A variety of write policies, with performance and economic trade-offs, is possible. There are two problems to contend with. First, more than one device may have access to main memory. For example, an I/O module may be able to read-write directly to memory. If a word has been altered only in the cache, then the corresponding memory word is invalid. Further, if the I/O device has altered main memory, then the cache word is invalid. A more complex problem occurs when multiple processors are attached to the same bus and each processor has its own local cache. Then, if a word is altered in one cache, it could conceivably invalidate a word in other caches.
Write Through and Write Back
Write through
• The simplest technique
• All write operations are made to main memory as well as to the cache
• The main disadvantage is that it generates substantial memory traffic and may create a bottleneck
Write back
• Minimizes memory writes
• Updates are made only in the cache
• Portions of main memory are left invalid, so accesses by I/O modules must be made through the cache
• This makes for complex circuitry and a potential bottleneck

The simplest technique is called write through. Using this technique, all write operations are made to main memory as well as to the cache, ensuring that main memory is always valid. Any other processor–cache module can monitor traffic to main memory to maintain consistency within its own cache. The main disadvantage of this technique is that it generates substantial memory traffic and may create a bottleneck. An alternative technique, known as write back, minimizes memory writes. With write back, updates are made only in the cache. When an update occurs, a dirty bit, or use bit, associated with the line is set. Then, when a block is replaced, it is written back to main memory if and only if the dirty bit is set. The problem with write back is that portions of main memory are invalid, and hence accesses by I/O modules can be allowed only through the cache. This makes for complex circuitry and a potential bottleneck. Experience has shown that the percentage of memory references that are writes is on the order of 15% [SMIT82]. However, for HPC applications, this number may approach 33% (vector-vector multiplication) and can go as high as 50% (matrix transposition). In a bus organization in which more than one device (typically a processor) has a cache and main memory is shared, a new problem is introduced. If data in one cache are altered, this invalidates not only the corresponding word in main memory, but also that same word in other caches (if any other cache happens to have that same word). Even if a write-through policy is used, the other caches may contain invalid data. A system that prevents this problem is said to maintain cache coherency. Possible approaches to cache coherency include the following:
• Bus watching with write through: Each cache controller monitors the address lines to detect write operations to memory by other bus masters. If another master writes to a location in shared memory that also resides in the cache memory, the cache controller invalidates that cache entry. This strategy depends on the use of a write-through policy by all cache controllers.
• Hardware transparency: Additional hardware is used to ensure that all updates to main memory via cache are reflected in all caches. Thus, if one processor modifies a word in its cache, this update is written to main memory. In addition, any matching words in other caches are similarly updated.
• Non-cacheable memory: Only a portion of main memory is shared by more than one processor, and this is designated as non-cacheable. In such a system, all accesses to shared memory are cache misses, because the shared memory is never copied into the cache. The non-cacheable memory can be identified using chip-select logic or high-address bits.
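The two policies can be contrasted in a short sketch: write-through updates main memory on every write, while write-back only sets a dirty bit and writes the line out when it is replaced. All names below are hypothetical, invented for illustration.

# Hypothetical sketch contrasting write-through and write-back on one cache line.
class CacheLine:
    def __init__(self, tag, data):
        self.tag, self.data, self.dirty = tag, data, False

def write_through(line, offset, value, main_memory, base_addr):
    line.data[offset] = value
    main_memory[base_addr + offset] = value      # memory updated on every write

def write_back(line, offset, value):
    line.data[offset] = value
    line.dirty = True                            # memory updated only at replacement

def replace_line(line, new_tag, new_data, main_memory, base_addr):
    if line.dirty:                               # write the old block out first
        main_memory[base_addr:base_addr + len(line.data)] = line.data
    line.tag, line.data, line.dirty = new_tag, new_data, False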
Line Size
• When a block of data is retrieved and placed in the cache, not only the desired word but also some number of adjacent words are retrieved
• As the block size increases, the hit ratio will at first increase because of the principle of locality, since more useful data are brought into the cache
• The hit ratio begins to decrease as the block becomes even bigger and the probability of using the newly fetched information drops below the probability of reusing the information that has to be replaced
Two specific effects come into play:
• Larger blocks reduce the number of blocks that fit into the cache; because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after they are fetched
• As a block becomes larger, each additional word is farther from the requested word and therefore less likely to be needed in the near future

Another design element is the line size. When a block of data is retrieved and placed in the cache, not only the desired word but also some number of adjacent words are retrieved. As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality, which states that data in the vicinity of a referenced word are likely to be referenced in the near future. As the block size increases, more useful data are brought into the cache. The hit ratio will begin to decrease, however, as the block becomes even bigger and the probability of using the newly fetched information becomes less than the probability of reusing the information that has to be replaced. Two specific effects come into play:
• Larger blocks reduce the number of blocks that fit into a cache. Because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after they are fetched.
• As a block becomes larger, each additional word is farther from the requested word and therefore less likely to be needed in the near future.
The relationship between block size and hit ratio is complex, depending on the locality characteristics of a particular program, and no definitive optimum value has been found. A size of from 8 to 64 bytes seems reasonably close to optimum [SMIT87, PRZY88, PRZY90, HAND98]. For HPC systems, 64- and 128-byte cache line sizes are most frequently used.
Multilevel Caches
• As logic density has increased, it has become possible to have a cache on the same chip as the processor
• The on-chip cache reduces the processor's external bus activity and therefore speeds up execution times and increases overall system performance
• When the requested instruction or data is found in the on-chip cache, the bus access is eliminated
• Because the data paths internal to the processor are short compared with bus lengths, on-chip cache accesses complete faster than even zero-wait state bus cycles; during this period the bus is free to support other transfers
• Two-level cache: an internal cache designated as level 1 (L1) and an external cache designated as level 2 (L2)
• The potential savings due to the use of an L2 cache depends on the hit rates in both the L1 and L2 caches
• The use of multilevel caches complicates all of the design issues related to caches, including size, replacement algorithm, and write policy

As logic density has increased, it has become possible to have a cache on the same chip as the processor: the on-chip cache. Compared with a cache reachable via an external bus, the on-chip cache reduces the processor's external bus activity and therefore speeds up execution times and increases overall system performance. When the requested instruction or data is found in the on-chip cache, the bus access is eliminated. Because of the short data paths internal to the processor, compared with bus lengths, on-chip cache accesses will complete appreciably faster than would even zero-wait state bus cycles. Furthermore, during this period the bus is free to support other transfers. The inclusion of an on-chip cache leaves open the question of whether an off-chip, or external, cache is still desirable. Typically, the answer is yes, and most contemporary designs include both on-chip and external caches. The simplest such organization is known as a two-level cache, with the internal cache designated as level 1 (L1) and the external cache designated as level 2 (L2). The reason for including an L2 cache is the following: If there is no L2 cache and the processor makes an access request for a memory location not in the L1 cache, then the processor must access DRAM or ROM memory across the bus. Due to the typically slow bus speed and slow memory access time, this results in poor performance. On the other hand, if an L2 SRAM (static RAM) cache is used, then frequently the missing information can be quickly retrieved. If the SRAM is fast enough to match the bus speed, then the data can be accessed using a zero-wait state transaction, the fastest type of bus transfer. Two features of contemporary cache design for multilevel caches are noteworthy. First, for an off-chip L2 cache, many designs do not use the system bus as the path for transfer between the L2 cache and the processor, but use a separate data path, so as to reduce the burden on the system bus. Second, with the continued shrinkage of processor components, a number of processors now incorporate the L2 cache on the processor chip, improving performance. The potential savings due to the use of an L2 cache depends on the hit rates in both the L1 and L2 caches. Several studies have shown that, in general, the use of a second-level cache does improve performance (e.g., see [AZIM92], [NOVI93], [HAND98]). However, the use of multilevel caches does complicate all of the design issues related to caches, including size, replacement algorithm, and write policy; see [HAND98] and [PEIR99] for discussions.
Total Hit Ratio (L1 and L2) for 8-Kbyte and 16-Kbyte L1 Caches

Figure 4.17 shows the results of one simulation study of two-level cache performance as a function of cache size [GENU04]. The figure assumes that both caches have the same line size and shows the total hit ratio. That is, a hit is counted if the desired data appears in either the L1 or the L2 cache. The figure shows the impact of L2 on total hits with respect to L1 size. L2 has little effect on the total number of cache hits until it is at least double the L1 cache size. Note that the steepest part of the slope for an L1 cache of 8 Kbytes is for an L2 cache of 16 Kbytes. Again for an L1 cache of 16 Kbytes, the steepest part of the curve is for an L2 cache size of 32 Kbytes. Prior to that point, the L2 cache has little, if any, impact on total cache performance. The need for the L2 cache to be larger than the L1 cache to affect performance makes sense. If the L2 cache has the same line size and capacity as the L1 cache, its contents will more or less mirror those of the L1 cache. With the increasing availability of on-chip area available for cache, most contemporary microprocessors have moved the L2 cache onto the processor chip and added an L3 cache. Originally, the L3 cache was accessible over the external bus. More recently, most microprocessors have incorporated an on-chip L3 cache. In either case, there appears to be a performance advantage to adding the third level (e.g., see [GHAI98]). Further, large systems, such as the IBM mainframe zEnterprise systems, now incorporate three on-chip cache levels and a fourth level of cache shared across multiple chips [CURR11].
Unified Versus Split Caches
• Split caches have become common: one cache dedicated to instructions and one dedicated to data, both at the same level, typically as two L1 caches
• Advantages of a unified cache: a higher hit rate, because it automatically balances the load between instruction and data fetches; and only one cache needs to be designed and implemented
• The trend is toward split caches at L1 and unified caches at higher levels
• The key advantage of the split cache design is that it eliminates contention for the cache between the instruction fetch/decode unit and the execution unit, which is important for pipelining

When the on-chip cache first made an appearance, many of the designs consisted of a single cache used to store references to both data and instructions. More recently, it has become common to split the cache into two: one dedicated to instructions and one dedicated to data. These two caches both exist at the same level, typically as two L1 caches. When the processor attempts to fetch an instruction from main memory, it first consults the instruction L1 cache, and when the processor attempts to fetch data from main memory, it first consults the data L1 cache. There are two potential advantages of a unified cache:
• For a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically. That is, if an execution pattern involves many more instruction fetches than data fetches, then the cache will tend to fill up with instructions, and if an execution pattern involves relatively more data fetches, the opposite will occur.
• Only one cache needs to be designed and implemented.
The trend is toward split caches at the L1 and unified caches for higher levels, particularly for superscalar machines, which emphasize parallel instruction execution and the prefetching of predicted future instructions. The key advantage of the split cache design is that it eliminates contention for the cache between the instruction fetch/decode unit and the execution unit. This is important in any design that relies on the pipelining of instructions. Typically, the processor will fetch instructions ahead of time and fill a buffer, or pipeline, with instructions to be executed. Suppose now that we have a unified instruction/data cache. When the execution unit performs a memory access to load and store data, the request is submitted to the unified cache. If, at the same time, the instruction prefetcher issues a read request to the cache for an instruction, that request will be temporarily blocked so that the cache can service the execution unit first, enabling it to complete the currently executing instruction. This cache contention can degrade performance by interfering with efficient use of the instruction pipeline. The split cache structure overcomes this difficulty.
Pentium 4 Cache

The evolution of cache organization is seen clearly in the evolution of Intel microprocessors (Table 4.4). The 80386 does not include an on-chip cache. The 80486 includes a single on-chip cache of 8 Kbytes, using a line size of 16 bytes and a four-way set-associative organization. All of the Pentium processors include two on-chip L1 caches, one for data and one for instructions. For the Pentium 4, the L1 data cache is 16 Kbytes, using a line size of 64 bytes and a four-way set-associative organization. The Pentium 4 instruction cache is described subsequently. The Pentium II also includes an L2 cache that feeds both of the L1 caches. The L2 cache is eight-way set associative with a size of 512 kB and a line size of 128 bytes. An L3 cache was added for the Pentium III and became on-chip with high-end versions of the Pentium 4.

Table 4.4 Intel Cache Evolution
Pentium 4 Block Diagram

Figure 4.18 provides a simplified view of the Pentium 4 organization, highlighting the placement of the three caches. The processor core consists of four major components:
• Fetch/decode unit: Fetches program instructions in order from the L2 cache, decodes these into a series of micro-operations, and stores the results in the L1 instruction cache.
• Out-of-order execution logic: Schedules execution of the micro-operations subject to data dependencies and resource availability; thus, micro-operations may be scheduled for execution in a different order than they were fetched from the instruction stream. As time permits, this unit schedules speculative execution of micro-operations that may be required in the future.
• Execution units: These units execute micro-operations, fetching the required data from the L1 data cache and temporarily storing results in registers.
• Memory subsystem: This unit includes the L2 and L3 caches and the system bus, which is used to access main memory when the L1 and L2 caches have a cache miss and to access the system I/O resources.
Unlike the organization used in all previous Pentium models, and in most other processors, the Pentium 4 instruction cache sits between the instruction decode logic and the execution core. The reasoning behind this design decision is as follows: As discussed more fully in Chapter 16, the Pentium processor decodes, or translates, Pentium machine instructions into simple RISC-like instructions called micro-operations. The use of simple, fixed-length micro-operations enables the use of superscalar pipelining and scheduling techniques that enhance performance. However, the Pentium machine instructions are cumbersome to decode; they have a variable number of bytes and many different options. It turns out that performance is enhanced if this decoding is done independently of the scheduling and pipelining logic. We return to this topic in Chapter 16. The data cache employs a write-back policy: Data are written to main memory only when they are removed from the cache and there has been an update. The Pentium 4 processor can be dynamically configured to support write-through caching.
Pentium 4 Cache Operating Modes

The L1 data cache is controlled by two bits in one of the control registers, labeled the CD (cache disable) and NW (not write-through) bits (Table 4.5). There are also two Pentium 4 instructions that can be used to control the data cache: INVD invalidates (flushes) the internal cache memory and signals the external cache (if any) to invalidate. WBINVD writes back and invalidates internal cache and then writes back and invalidates external cache. Both the L2 and L3 caches are eight-way set-associative with a line size of 128 bytes. Note: CD = 0; NW = 1 is an invalid combination.

Table 4.5 Pentium 4 Cache Operating Modes
ARM Cache Features

The ARM cache organization has evolved with the overall architecture of the ARM family, reflecting the relentless pursuit of performance that is the driving force for all microprocessor designers. Table 4.6 shows this evolution. The ARM7 models used a unified L1 cache, while all subsequent models use a split instruction/data cache. All of the ARM designs use a set-associative cache, with the degree of associativity and the line size varying. ARM cached cores with an MMU use a logical cache for processor families ARM7 through ARM10, including the Intel StrongARM and Intel Xscale processors. The ARM11 family uses a physical cache. The distinction between logical and physical cache is discussed earlier in this chapter (Figure 4.7).

Table 4.6 ARM Cache Features
ARM Cache and Write Buffer Organization

An interesting feature of the ARM architecture is the use of a small first-in-first-out (FIFO) write buffer to enhance memory write performance. The write buffer is interposed between the cache and main memory and consists of a set of addresses and a set of data words. The write buffer is small compared to the cache, and may hold up to four independent addresses. Typically, the write buffer is enabled for all of main memory, although it may be selectively disabled at the page level. Figure 4.19, taken from [SLOS04], shows the relationship among the write buffer, cache, and main memory. The write buffer operates as follows: When the processor performs a write to a bufferable area, the data are placed in the write buffer at processor clock speed and the processor continues execution. A write occurs when data in the cache are written back to main memory. Thus, the data to be written are transferred from the cache to the write buffer. The write buffer then performs the external write in parallel. If, however, the write buffer is full (either because there are already the maximum number of words of data in the buffer or because there is no slot for the new address) then the processor is stalled until there is sufficient space in the buffer. As non-write operations proceed, the write buffer continues to write to main memory until the buffer is completely empty. Data written to the write buffer are not available for reading back into the cache until the data have transferred from the write buffer to main memory. This is the principal reason that the write buffer is quite small. Even so, unless there is a high proportion of writes in an executing program, the write buffer improves performance.
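A hedged sketch of the FIFO write-buffer behaviour described above: writes to bufferable regions are queued at processor speed and drained to main memory in parallel, and the processor stalls when the buffer is full. The four-entry limit follows the text; all names are illustrative, not from the ARM documentation.

from collections import deque

# Illustrative model of a small FIFO write buffer between cache and main memory.
class WriteBuffer:
    def __init__(self, capacity: int = 4):     # "may hold up to four independent addresses"
        self.capacity = capacity
        self.entries = deque()                  # (address, data) pairs in FIFO order

    def enqueue(self, address: int, data: int) -> bool:
        # Accept a write at processor speed; return False if the processor must stall.
        if len(self.entries) >= self.capacity:
            return False                        # buffer full: processor stalls
        self.entries.append((address, data))
        return True

    def drain_one(self, main_memory: dict):
        # Performed in parallel with execution: write the oldest entry to memory.
        if self.entries:
            address, data = self.entries.popleft()
            main_memory[address] = data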
Summary: Chapter 4, Cache Memory
• Characteristics of memory systems: location, capacity, unit of transfer
• Memory hierarchy: how much? how fast? how expensive?
• Cache memory principles
• Elements of cache design: cache addresses, cache size, mapping function, replacement algorithms, write policy, line size, number of caches
• Pentium 4 cache organization
• ARM cache organization