In the last module we discussed instruction set architecture. For the next two weeks we are going to discuss memory hierarchy design, which mainly consists of cache memory design and DRAM-based main memory.

Why do we need to study memory hierarchy design? We know that programs exhibit the principle of locality, which states that if the processor accesses some data or instructions now, there is a high chance that the same data, or neighboring data, will be required in the near future.

For example, consider a simple piece of code that performs matrix multiplication. Here we multiply two matrices B and C and store the result in matrix A. In this particular example we assume the data is stored in row-major order: all the elements of a row are stored first, then the next row, and so on. Given this, we know that B[0][0] is accessed first, then B[0][1], B[0][2] and so on, because the k value changes in the innermost for loop while the i value stays fixed for all iterations of the two inner for loops. This says that if I access element B[0][0] now, I am going to access B[0][1] in the next iteration, and similarly B[0][2], B[0][3] and so on in the following iterations.
Similarly, when I consider the elements of array A, A[0][0] is used repeatedly for all iterations of the innermost for loop. That means for k equal to 0 to n, A[0][0] is used, and A[1][0] will be used only after the complete iterations of the innermost for loop are done, and also the complete iterations of the second for loop. So in the second outer iteration, when i equals 1, again for j equal to 0 to n and k equal to 0 to n, we are going to use A[1][0], and after that the same pattern repeats for A[1][1], A[1][2] and so on. This says that when I access an element of array A now, I am going to access it again in the near future. The same is true for the elements of the other arrays.

From this example it is clear that for elements accessed now, there is a high chance that the same elements will be used repeatedly in the near future, or that elements neighboring the previously accessed elements will be required in the near future. So the principle of locality can be exploited either in time or in space. Accesses to the same memory location that occur close together in time exhibit temporal locality.
Whereas if it happens close in space, that is, accesses to nearby memory locations that occur close together, we call it spatial locality. The rule of thumb says that a program spends ninety percent of its execution time in only ten percent of its code. So, to exploit this principle of locality, available in most programs, we need to come up with a memory hierarchy. Rather than storing the entire code in one large memory and always accessing it from there, it is better to keep the repeatedly accessed data in a faster memory very close to the processor, so that the access time is very short. Effectively, all the cache memory designs, and the memory hierarchy designs we consider, exploit the principle of locality available in programs. So we organize the memory system into a hierarchy, with the faster but smaller memories closer to the processor, and the larger memories with high access latency kept farther from the processor.

Let us look at the memory hierarchy. It consists of multiple levels, and this particular example is typical of server-class systems.
At the top of the pyramid we have the processor registers, or CPU registers, which typically take little space, on the order of less than 1 KB, but whose access time is very fast, typically on the order of picoseconds. The next level in the memory hierarchy is the level 1 cache, whose size is typically around 32 KB to 64 KB, with an access time of about 1 nanosecond. As we move further down the pyramid, the size of the memory increases and the access time also increases, but the cost per bit reduces significantly. The size of the L2 cache is 256 KB to 512 KB, depending on the system we consider, with access times of 3 to 10 nanoseconds, and some systems have a third level of cache, on the order of 2 to 4 megabytes, which takes 10 to 20 nanoseconds.

Remember that the CPU registers and the L1, L2, and L3 caches are all designed using SRAM technology; because of that, their access times are very low compared to the main memory, which is typically designed using DRAM technology. In the case of main memory, as I mentioned earlier, we consider DRAM-based technology, where a single bit is stored in a DRAM cell consisting of one transistor and one capacitor.
The access times for main memory are typically in the range of 50 to 100 nanoseconds, but the size of DRAM-based memory is 4 to 16 GB, which is very large. Below that in the pyramid we have disk storage, with sizes in the range of 4 to 16 terabytes, but with access times in the range of milliseconds, mainly because of the mechanical components involved in a hard disk.

Of course, these days we also have solid state drives, which are based on flash technology and provide non-volatility. We know that the caches and main memory are volatile memories: when the power is off, the data is lost. We consider flash-based memories as a replacement for the hard disk, and that is why the latest laptops have flash-based solid state drives. The capacity of flash-based solid state drives is not in the range of HDD capacities, but it is decent enough to provide good storage space. And the advantage of solid state drives is that their access times are much smaller than those of HDDs.

So, we have our memory organized in multiple levels of hierarchy, with the faster and smaller memories closer to the processor.
We are going to keep repeatedly accessed data in these faster memories, but because their size is very small, we have to be selective about what data we keep in them. As part of our cache memory designs, we are going to discuss different mechanisms for keeping selected data.

Is it really required to have so many levels of memory hierarchy in our computer design? Yes, it is, mainly because of the famous memory wall problem. This graph shows that processor performance has increased significantly over the years, but the performance of memory has not increased at the same pace. As a result, as time progresses, we have a significant gap between processor performance and DRAM performance. In the graph, the x axis shows the timeline and the y axis shows performance. The processor line specifies the rate at which the average number of memory requests per second increases, and the memory line shows the number of DRAM accesses per second. We can clearly see that the gap is widening significantly, and in order to deal with this wide gap between processor and memory performance, we definitely need multiple levels of cache hierarchy.
And remember, this graph is only for single-core processor performance, but these days we have multi-core systems, where two, four, eight, or sixteen processors sit on a single chip. When all these cores execute different applications, they require a great deal of memory bandwidth, and supporting that puts significant pressure on the memory. As a result, we definitely need large intermediate memories in the memory hierarchy. This says that the aggregate bandwidth requirement grows with the number of cores, so we need an efficient memory hierarchy design that takes care of the demands from the multiple cores of a chip multi-processor, so that the overall performance of the system can be improved.

Having discussed the memory hierarchy, and before discussing the internals of the cache memory, the DRAM memory, and so on, let us see what performance improvement we get if a cache memory is provided in the system. To do that, we need a quantification of cache performance. We already discussed the performance equation for CPUs, where CPU time can be expressed as CPU clock cycles multiplied by the clock cycle time. This expression assumes that our cache memory is perfect.
When I say the cache memory is perfect, I mean that whatever memory request is issued by the processor is satisfied by the cache memory; effectively, the processor never stalls to service a memory request. This is an ideal scenario, but in reality the processor may be stalled while servicing memory requests. When you have memory stalls, the equation becomes: CPU time equals the sum of CPU clock cycles and memory stall cycles, multiplied by the clock cycle time. The CPU clock cycles are the cycles the processor spends performing ALU operations and servicing requests that hit in the cache. The memory stall cycles are the cycles the processor sits idle waiting for data from the first level of cache, the second level of cache, or any other level of the memory hierarchy.

The memory stall cycles are nothing but the number of misses incurred by the processor times the total time it takes to service one cache miss; effectively, the number of misses times the miss penalty. This can be further expressed as the instruction count times the number of misses per instruction times the miss penalty, because previously we gave the expression for CPU time in terms of the instruction count.
Effectively, we can rewrite our memory stalls in terms of the instruction count: the total number of instructions in the program, times the number of misses incurred per instruction, times the miss penalty. This can be further rewritten as the instruction count times the memory accesses per instruction times the misses per memory access times the miss penalty. The ratio of misses to memory accesses is called the miss rate. So our total memory stall cycles can be expressed as the instruction count times the memory accesses per instruction times the miss rate times the miss penalty. We can substitute this memory stall expression into our CPU time equation to get the overall CPU time when we have a cache memory, and using this equation we can compute the performance improvement from using a cache.

To illustrate this, let us consider an example. Assume that the CPI of a computer is 1 when all memory accesses hit in the cache. This is effectively an ideal scenario, where we never wait for any memory request to be serviced, so the memory stalls are 0. But if thirty percent of the instructions are loads and stores, the miss penalty is 100 cycles, and the miss rate is five percent, how much faster would the computer be if all instructions were cache hits?
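Written out symbolically, the two equations we have just derived are (with IC denoting the instruction count):

```latex
\text{CPU time} = \left(\text{CPU clock cycles} + \text{Memory stall cycles}\right) \times \text{Clock cycle time}

\text{Memory stall cycles} = \text{IC} \times \frac{\text{Memory accesses}}{\text{Instruction}} \times \text{Miss rate} \times \text{Miss penalty}
```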
First, start with the ideal scenario, where all memory accesses hit in the cache, so the memory stall cycles equal 0. CPU time is CPU clock cycles times the clock cycle time, which equals IC times CPI times the clock cycle time, where CPI equals 1; so it is effectively IC times the clock cycle time. Remember, we are not given a clock frequency for the processor, nor the number of instructions in the program, so we keep IC and the clock cycle time as symbols.

Now consider the scenario with an imperfect cache, where some percentage of memory requests miss and there is a miss penalty. Every instruction requires one instruction fetch, and thirty percent of instructions also make a data access, so we have 1.3 memory accesses per instruction. The memory stall cycles are then IC times memory accesses per instruction times miss rate times miss penalty, that is, IC times 1.3 times 0.05 times 100, which is effectively 6.5 times IC. The overall CPU time including these memory stalls equals the ideal CPU clock cycles plus the memory stall cycles, times the clock cycle time, which is effectively 7.5 times IC times the clock cycle time. So the speedup the perfect cache achieves compared to this imperfect cache is 7.5. This motivates us to keep useful data in the cache memory so that the overall performance can be improved.
And if you do not have a cache at all in this scenario, every request has to go to memory, incurring the hundred-cycle latency, so the performance penalty would be significant. With a cache memory that has some miss rate, the performance improves compared to the case where no cache is present; and if we have a perfect cache, we can improve the performance even further compared to a cache with some percentage of misses. This motivates us to place an efficient cache memory between the processor and the main memory, or perhaps multiple levels of cache memory, to improve the overall performance of the system.

With this I am concluding this module; in the next module we are going to discuss the internals of the cache memory design. Thank you.