Kai Kaltenbach, Microsoft Premier Corporate Support, 9/12/95
When a microprocessor asks for information faster than system RAM can deliver it, the processor goes into a wait state. Essentially the processor is sitting around doing nothing until the system RAM is ready to deliver the information it asked for. This greatly slows down system operation. When a system is running without encountering wait states, it is said to be in zero wait-state operation, and runs much faster.
Memory speed is measured in nanoseconds (ns). The fastest affordable DRAM (Dynamic RAM) memory chips are 60-70ns. For a processor to operate in zero wait-states at a system-board speed of 33MHz (as with a 486DX/33, 486DX2/66, 486DX4/100) the system RAM would have to have a speed of 30ns - prohibitively expensive. For zero wait-states at a system board speed of 66MHz (as with a Pentium 66, 100 or 133) the memory would have to operate at 15ns! What's more, it would have to be more expensive SRAM (Static RAM), which is faster than DRAM because it doesn't require the system to refresh its contents periodically. At the time of this writing, 15ns SRAM is over ten times the cost of standard 70ns DRAM.
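The arithmetic behind these figures is straightforward: the RAM speed needed for zero wait-state operation is roughly one bus clock period, i.e. the reciprocal of the bus frequency. The short sketch below (an illustration of the arithmetic only, not part of the original figures) shows the calculation:

```python
# Rough rule of thumb: zero wait-state RAM must respond within one
# bus clock period.  Period (ns) = 1000 / frequency (MHz).

def zero_wait_state_speed_ns(bus_mhz):
    """Approximate RAM speed (ns) needed to keep pace with a system
    bus running at bus_mhz megahertz."""
    return 1000.0 / bus_mhz

print(zero_wait_state_speed_ns(33))   # ~30 ns, matching the 486DX/33 figure
print(zero_wait_state_speed_ns(66))   # ~15 ns, matching the Pentium 66 figure
```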
This is where memory caching comes in, making today's systems possible at a reasonable cost. You may be familiar with using a disk cache, such as Microsoft SmartDrive, which uses a small RAM buffer to speed up access to a large hard disk. Memory caching uses a small buffer of very fast RAM to speed up a large bank of slower RAM.
All Intel processors since the advent of the 486 are equipped with an integral cache of 8K-16K in size. When a RAM cache is built into a CPU, it's known as a Level 1 (L1) cache.
Most systems today use a second RAM cache built onto the system board, called a Level 2 (L2) cache.
The cache is managed by an 'intelligent' circuit called the cache controller. A system with both an L1 and an L2 cache has two cache controllers: one on the CPU chip itself, and one on the motherboard. The cache controller uses various prediction algorithms to enhance cache performance. For example, it attempts to predict what memory segments the processor will ask for next, and read those segments into the cache before the processor asks for them. This is known as read-ahead caching.
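Read-ahead caching can be illustrated with a toy model. The sketch below is a deliberate simplification of my own (an unbounded cache and a single-block prefetch, not any real controller's algorithm), but it shows how reading the next block ahead of the processor turns a purely sequential scan from all misses into nearly all hits:

```python
# Toy model of read-ahead caching.  On every access, the controller
# also fetches the next block, betting that access will be sequential.
# (Simplification: the cache here grows without bound.)

def run(accesses, read_ahead):
    cache, hits = set(), 0
    for block in accesses:
        if block in cache:
            hits += 1
        else:
            cache.add(block)          # fetch the missed block
        if read_ahead:
            cache.add(block + 1)      # read ahead of the processor
    return hits

sequential = list(range(100))         # e.g. scanning an array in order
print(run(sequential, read_ahead=False))  # 0 hits: every access misses
print(run(sequential, read_ahead=True))   # 99 hits: only the first access misses
```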
When the processor asks for some data from memory, and that data can be delivered directly from the cache RAM, that's a cache hit. When the system has to take the performance hit of going to the main bank of memory to retrieve the data, that's a cache miss. The ratio of cache hits to cache misses largely determines how a system performs relative to other systems with the identical CPU.
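The impact of the hit ratio can be quantified as a simple weighted average of cache speed and main-memory speed. The sketch below uses assumed speeds (15ns cache SRAM and 70ns DRAM, consistent with the figures quoted earlier, but not measurements from any particular system):

```python
# Average memory access time as a blend of cache and DRAM speed.
# Speeds are assumptions for illustration: 15ns SRAM cache, 70ns DRAM.

def avg_access_ns(hit_rate, cache_ns, dram_ns):
    """Average access time given the fraction of accesses served
    from the cache (hit_rate between 0.0 and 1.0)."""
    return hit_rate * cache_ns + (1 - hit_rate) * dram_ns

for hit_rate in (0.0, 0.80, 0.95):
    # As the hit rate rises, average access time approaches cache speed.
    print(hit_rate, avg_access_ns(hit_rate, 15, 70))
```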
The cache hit/miss ratio, and therefore overall system performance, is determined by several factors (see below). One of the crucial factors is the ratio between the size of the cache and the size of system RAM. As previously noted, L1 caches are generally 8K-16K in size. This tiny cache is not sufficient to produce a high cache hit/miss ratio with any significant amount of system RAM. Therefore, performance suffers significantly without an L2 cache. It is not uncommon, for example, for a 486 system with an efficient L2 cache to far outperform a Pentium system without a cache. In a recent industry magazine test of notebook computers, a 486 machine (with L2 cache) outperformed a Pentium 90 machine (without L2 cache) by 30%.
L2 cache sizes range from 64K-1024K, with 256K being by far the most common size. More on L2 cache sizing later.
The following factors influence the performance of a cached system:
All caches are not created equal, even if they are of equal size. Given the trade press focus on cache size, most purchasers simply ask for a cache of a particular size, and don't focus specifically on performance measurements. Unfortunately, this has led some system vendors to develop very low-cost caching systems that allow them to advertise a 256K cache without regard to the performance of that cache. It's entirely possible, in fact common, for a smaller, well designed cache to outperform a larger, badly designed cache.
All other things being equal as far as cache architecture and controller design are concerned, a larger cache-to-system RAM ratio will provide better system performance, up to a point; you quickly reach a point of diminishing returns. The important thing to remember, in general, is that to maintain the same cache hit ratio when you double the amount of system RAM, you would have to double the amount of cache RAM as well (although other factors keep this from being a strictly linear relationship).
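This relationship can be demonstrated with a small simulation. The model below is an assumption of mine, not a measurement: accesses show locality (90% go to a 'hot' working set covering 5% of RAM) and the cache uses least-recently-used replacement. With a fixed cache, doubling system RAM lowers the hit rate; doubling the cache along with it roughly restores it:

```python
import random
from collections import OrderedDict

# Assumed access model: 90% of accesses fall in a hot set (5% of RAM),
# 10% are scattered uniformly; the cache evicts least-recently-used blocks.

def hit_rate(cache_blocks, ram_blocks, accesses=20000, seed=1):
    rng = random.Random(seed)
    hot = ram_blocks // 20                # hot working set = 5% of RAM
    lru = OrderedDict()                   # LRU cache of block numbers
    hits = 0
    for _ in range(accesses):
        if rng.random() < 0.9:
            block = rng.randrange(hot)    # 90%: hot working set
        else:
            block = rng.randrange(ram_blocks)  # 10%: anywhere in RAM
        if block in lru:
            hits += 1
            lru.move_to_end(block)        # mark as most recently used
        else:
            lru[block] = True
            if len(lru) > cache_blocks:
                lru.popitem(last=False)   # evict least recently used
    return hits / accesses

# Same cache, twice the RAM: the hit rate drops.
print(hit_rate(256, 8192), hit_rate(256, 16384))
# Doubling the cache along with the RAM roughly restores it.
print(hit_rate(512, 16384))
```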
Most system boards are designed for a particular speed of system RAM and cache RAM. There are some exceptions that allow you to tune the system's cache parameters to different speeds of memory. For example, it is increasingly common for system boards to offer a 70ns/60ns switchable memory speed option. Without such an option, adding faster system RAM than the board is designed for won't provide any performance benefit. In Pentium systems, 20ns cache SRAM is generally used for 50-60MHz system boards (using the Pentium 75/90/100/120), and 15ns cache SRAM is normally used for 66MHz system boards (using the Pentium 100/133). Cache SRAM as fast as 8ns has recently become available, although it remains rare and expensive.
Cache controllers are usually programmed with algorithms based on statistical analysis of memory access by popular operating systems. Many cache controllers are optimized for either 16-bit or 32-bit software systems. If your particular software accesses memory in a different pattern than the cache controller was optimized for, you can get significantly higher or lower than theoretical (benchmarked) efficiency. Upgrading an operating system from 16-bit to 32-bit can change system hardware performance dramatically in some cases. When evaluating systems for purchase, make sure to benchmark the systems under your operating system of choice, and if possible, the operating system you plan to implement next.
Software tools are available for measuring cache efficiency, such as those from Sofwin Laboratories (800-339-2579). Sofwin tools in particular have a feature to show whether a system cache is optimized for 16-bit or 32-bit operations. While such measurements can lend insight into system design, they are arguably less useful for purchasing decisions, because your real-world performance will depend on the software and operating system being used.
Several new RAM and cache technologies have recently been introduced. These include:
Extended Data Out (EDO) DRAM provides faster data throughput that partially obviates the need for an L2 cache. Systems using EDO DRAM and no L2 cache will be faster than similar systems using regular DRAM, but not as fast as systems with an L2 cache. EDO DRAM also provides a performance benefit when used with an L2 cache, but industry magazine test centers have reported that the performance difference in that case is less than 5%. Theoretically, EDO DRAM doesn't cost any more to manufacture than regular DRAM, so it may eventually replace regular DRAM. At the time of this writing, however, EDO DRAM carried a price premium well over 5%, making it a questionable value on systems that already have an L2 cache.
You can think of Enhanced DRAM (EDRAM) as RAM that carries its own cache on each module. In an EDRAM-based system, essentially the entire system memory bank is the cache. This can provide dramatic performance improvements. However, at this time, EDRAM is scarce, very expensive and has not been adopted by many system vendors.
Burst cache technology brings a very large performance advantage to the Pentium playing field, made possible by Intel's recent introduction of the Triton chipset for Pentium systems, and also supported by other chipset vendors. Industry magazine tests show that burst cache equipped systems outperform their standard cache counterparts by 20% or more. In fact, the performance benefit is frequently greater than the difference between Pentium chip classes; for example, a Pentium 90 with burst cache has been shown to outperform a Pentium 100 with normal cache. Since the difference in price between normal cache and burst cache is usually less than the difference in price between Pentium chip classes, it only makes sense to standardize on burst cache systems. There are other considerations, of course, because the Intel Triton chipset does not support some features that are required by corporate standards, such as multiprocessor operation, memory parity, and over 128MB of system RAM.
The following general guidelines will help you specify systems that will give you the best possible performance under Windows 95 and Windows NT. However, it's important to remember that the key measurement is how your software performs on a given system versus that system's cost, service and warranty, reliability and compatibility. And needless to say, the other components of a system (hard disk, video card, and so on) can affect performance as much as anything else. The key factor is balance: all the components of the system should be matched in performance, with no significant bottlenecks. That's why a real-world benchmark of your particular operating system and applications is so important.
Industry publications clearly show the large performance advantage of an L2 cache. Since L2 caching is essentially an industry standard today, the only difficult choices you may have to make will be in the area of notebook computers, which have not yet embraced the L2 cache in significant numbers.
You will find a lot of varying opinions on the benefits of various L2 cache sizes. The consensus among industry insiders seems to be that you can get by with 128K of L2 cache up to 8MB DRAM, with 256K of L2 the standard from 16MB-32MB, and 512K optimal for 32MB and up. Again, these figures are rough estimates, and performance can vary widely due to the cache performance considerations discussed earlier.
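For convenience, that rule of thumb can be written as a simple lookup. Note that the guideline leaves the 8MB-16MB range open; mapping it to 256K below is my own reading, and the figures are the rough estimates quoted above, not guarantees:

```python
# Rough L2 cache sizing guideline (industry rule of thumb, not a spec).
# The 8MB-16MB gap in the original guideline is mapped to 256K here.

def suggested_l2_kb(system_ram_mb):
    """Suggested L2 cache size in KB for a given amount of system RAM."""
    if system_ram_mb <= 8:
        return 128
    elif system_ram_mb <= 32:
        return 256
    else:
        return 512

print(suggested_l2_kb(8), suggested_l2_kb(24), suggested_l2_kb(64))
```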
Published benchmarks clearly point to the superiority of burst cache. And since its performance boost costs less than the equivalent investment in CPU power, it is widely considered a smart choice for desktop machines. You may want to forego burst caching for servers, since the system board chipset that supports burst caching doesn't provide some mission-critical features at this time (see above).
Since alternative memory technologies (EDRAM, and EDO DRAM in systems with an L2 cache) have not yet been shown in published tests to provide a demonstrable price/performance advantage over standard DRAM, standard DRAM remains the memory of choice today. If you want the fastest possible system, and you're buying from a hardware vendor that doesn't charge a premium for EDO DRAM, then by all means use it.