Updated: March 13, 1996
Presented by: Scott B. Suhy
Senior Consultant with Microsoft Consulting Services, responsible
for enterprise architecture, design, and optimization for Fortune
500 companies.
Baseline
Tuning for "Memory" Performance
Tuning for "Processor" Performance
Tuning for "Disk" Performance
Tuning for "Network" Performance
Capacity Planning
Summary
References
Wouldn't it be nice if there were no traffic bottlenecks during your everyday trip to work? No traffic lights, fender benders, car problems, detours, people pulling out in front of you, people in the left lane going less than the speed limit, four-lane highways narrowing down to two lanes.... This is unrealistic, of course, just as it is unrealistic to expect that a computer system will never run up against a limit on the amount of memory, CPU, or I/O being consumed by internal or external processes.
It would also be nice to know how long it will take you to get to work in the morning (within some expected normal variation). Users of a computer system have the same expectation. They expect their jobs to finish in an acceptable amount of time, without bottlenecks in the system slowing them down.
If there were bottlenecks on your way to work each day, I suppose you could optimize or tune the trip (reduce bottlenecks) by finding an alternate route, car pooling, taking advantage of a car pool lane, taking a bus, or even changing your working hours (possibly to the evening, when there is no traffic and the only things keeping you from getting to work any faster are the speed limit and possibly the size of your engine). Computer systems have the same optimizations (run jobs during off-peak hours, and so on). As with transportation systems, there is also the same lack of environmental control with a computer system. For example, it is not realistic to think that there will always be the same amount of traffic (on the road or in your computer system), nor is it realistic to think that you have control over that traffic. Problems always occur (a rain storm slowing traffic on the road, or one user consuming a great deal of the server's memory, CPU, or I/O bandwidth). Expecting the problems, managing them, and knowing what to do when they occur is the key.
Once you feel you have the trip optimized, you might also think about gathering some statistics, daily, weekly, or monthly, such as the amount of time it takes you to arrive at the office, the number of red lights you hit rather than green, and so on. This type of information will allow you to make future decisions on such things as "If I stop to get gas in the morning, how much earlier will I have to leave the house?" Of course, you would also have to know how much time you spend at the gas station (another set of statistics). The same thing goes for your computer system. It's called Capacity Planning.
The following information provides you with tips on areas of the Microsoft® Windows NT operating system to which you should pay attention (What to Watch). It also gives you a few rules/guidelines to use to optimize the system (What You Can Do). Once you take each of these areas into consideration, your system should be optimized. Once you feel your system is optimized, it is then time to gather data on current capacity. The data will allow you to do the following:
This information is rather technical in nature and assumes that you already know a great deal about Microsoft Windows NT Workstation and Microsoft Windows NT Server operating systems. However, it only touches the surface of optimization. Many books could be written on the subject. Consequently, this paper neglects to explain many details and assumes you know where to get information about the hardware and software concepts mentioned. If you stumble upon a concept that is not explained in detail, you may want to refer to the Microsoft Windows NT Resource Kit, Server Message Block specification (which can be obtained from Microsoft), Microsoft TechNet, or any book that details network architecture (such as the book Local Area Networks by James Martin or LAN Times Encyclopedia of Networking by Tom Sheldon).
Before diving into any Performance Tuning, it is necessary to go over some definitions and terms.
For the purpose of this paper, I use the word task to mean a series of computer instructions whose execution involves work performed by one or more computer components or resources (for example, the CPU, memory, hard disk, and network adapters).
The amount of time it takes to complete a task can be divided among the several resources involved in the task's execution; some resources will be responsible for small amounts of the total time, others for larger amounts.
The single resource that consumes the most time during a task's execution is that task's bottleneck. Bottlenecks can occur because resources are not being used efficiently, resources are not being used fairly, or a resource is too slow or too small. Let me try to elaborate on this point with the following example.
Example. If a task takes 2.2 seconds to complete, with 0.2 seconds spent executing instructions in the CPU and 2 seconds retrieving data from the disk (assuming the two do not overlap in time), the disk is the bottleneck in the task. If the CPU were replaced with one twice as fast, task execution time would drop from 2.2 to 2.1 seconds, approximately a 4.5% reduction in task time. However, if the disk controller were replaced with one twice as fast, disk access time would drop from 2 seconds to 1 second, cutting total execution time from 2.2 to 1.2 seconds, approximately a 45% reduction in task time.
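To make the arithmetic concrete, here is a minimal sketch in C of the calculation above. The times come from the example; everything else is illustrative:

#include <stdio.h>

int main(void)
{
    double cpu = 0.2, disk = 2.0;        /* seconds, from the example above */
    double total = cpu + disk;           /* 2.2 seconds */

    double fastCpu  = cpu / 2 + disk;    /* 2.1 seconds with a CPU twice as fast */
    double fastDisk = cpu + disk / 2;    /* 1.2 seconds with a disk twice as fast */

    printf("Faster CPU:  %.1f%% reduction in task time\n",
           (total - fastCpu) / total * 100);     /* ~4.5% */
    printf("Faster disk: %.1f%% reduction in task time\n",
           (total - fastDisk) / total * 100);    /* ~45.5% */
    return 0;
}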
It would be easy if the previous example were a workstation running the Microsoft MS-DOS® operating system, but we are dealing with a multitasking OS. One thing to always keep in mind, especially in a multitasking OS, is that resolving one bottleneck will always reveal the next one.
The goal in tuning Windows NT is to determine which hardware resource is experiencing the greatest demand (the bottleneck), and then to adjust operations to relieve that demand and maximize total throughput. A system should be structured so that its resources are used efficiently and distributed fairly among the users. This is not as difficult as it sounds, assuming you use a few good rules/guidelines and have a thorough understanding of the computing environment.

For example, in a file and print server environment, most of the activity at the server is in support of file and print services. This tends to cause high disk utilization because of the large number of files being opened and closed, and it causes the network interface card(s) to endure a heavy load because of the large amount of data being transferred. Memory typically does not experience a heavy load in this environment (memory usage can be high, however, because of the large amount of system memory that may be allocated to the file system cache). Processor utilization is also typically low in this environment.

In contrast, a server application environment (for example, other Microsoft BackOffice products such as Microsoft SQL Server database server for PC networks, Microsoft Mail electronic mail system [Mail 3.5 or Exchange], Microsoft Systems Management Server centralized management for distributed systems, and Microsoft SNA Server) is much more processor and memory bound than a typical file and print server environment, because much more actual processing is taking place at the server. The disk and network tend to be less utilized, due to a smaller amount of data being sent over the wire and to the disk.

Understanding these generalizations is not enough; the only way to get an idea of the utilization of the resources is to monitor them, and one of the most powerful tools you can use is the Windows NT Performance Monitor.
Performance Monitor is a graphical tool for measuring the performance of your own Windows NT-based computer or other Windows NT-based computers on a network. It is located in the Administrative Tools group of both the Windows NT Workstation and Windows NT Server products. On each computer, you can view the behavior of objects such as processors, memory, cache, threads, and processes. Each of these objects has an associated set of counters that provide information on such things as device usage, queue lengths, and delays, as well as information used for throughput and internal congestion measurements. It provides charting, alerting, and reporting capabilities that reflect current activity along with ongoing logging. You can also open log files at a later time for browsing and charting as if they were reflecting current activity.
Before spending money to add more hardware or replace existing hardware with faster hardware, it's best to use Performance Monitor to first tune the system to make the most efficient use of existing resources. Here are a couple of examples of where the tool may be useful:
Example. If we find that the CPU is 100% utilized, before replacing it with a faster CPU or adding another one, we should identify and analyze the process that is consuming the bulk of the CPU time. We may find that processor cycles are being consumed by a disk controller that requires programmed I/O (PIO). In that case, replacing it with a DMA disk controller will reduce processor utilization.
Example. If we determine the hard disk is full, before adding disk drives we should identify how much of the page file is actually being used. We may find that the system page file is initialized at 100 MB, but that no more than 40 MB of it is ever in use. Instead of purchasing another disk, we could adjust the size of the page file.
If you talk to our product support engineers or our consultants in the field and ask them about the tuning questions they most frequently hear, you may find the following:
1. How do I determine how well an application is performing?
2. How can I support my environment in a proactive manner?
3. How do I know what component of my system is the most limiting (the bottleneck)?
4. How can I ensure my system is performing the best it possibly can perform?
5. How do I determine what size system I need based on the following criteria?
6. How do I know when to upgrade?
All of these questions play some part in performance tuning. We are going to focus mostly on answering questions 2, 3, and 4, primarily by exploring each of the primary components of a computer system: the memory, the processor, and the I/O subsystem (for example, disks and networks). From this standpoint, performance tuning means ensuring that every user gets a fair share of the available resources of the entire system. Once you feel you have 2, 3, and 4 under control, you can start focusing on 5 and 6, which are more capacity planning issues. Once you have 5 and 6 under control, you will be able to answer number 1 and, more importantly, do "What If" analysis.
Lack of memory is by far the most common cause of serious performance problems in computer systems. If you read no further in this document, you could simply answer "Memory!" whenever anyone asks you how to improve the performance of a system.
Memory contention arises when the memory requirements of the active processes exceed the physical memory available on the system; at this point, the system is out of memory. To handle this lack of memory, the system starts paging (moving portions of active processes to disk in order to reclaim physical memory). At this point, performance decreases dramatically. Consider the following example: if the average instruction in a computer takes approximately 100 nanoseconds to execute and a disk access takes somewhere on the order of tens of milliseconds, how many times slower would the machine run if there were one paging operation per instruction? If you answered 100,000, you would be correct! Let's hope things don't get that bad....
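The arithmetic, assuming a 10-millisecond disk access (a figure within the "tens of milliseconds" range above):

slowdown = disk access time / instruction time
         = 10 ms / 100 ns
         = 10,000,000 ns / 100 ns
         = 100,000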
To optimize overall performance, steps must be taken to ensure that main memory is used as efficiently as possible and thus paging is held to a minimum. As you will see in the next section, you can tell how loaded system memory is by watching how the system pages.
If "Memory Pages/sec" is increasing, yet "Memory Available bytes" is not decreasing then you may actually not have a memory bottleneck. In this case you would want to check for an application that is doing a great deal of disk IO (reads or writes) and the data is not in cache.
The dialog details the total memory in your system, the memory currently available for allocation to applications you may start, the available space within your page file, and the Memory Load Index. The Memory Load Index is a number between 0 and 100 that gives a general idea of current memory utilization, in which 0 indicates no memory use and 100 indicates full memory use. The index is based on the number of memory pages that are free in the system. If AvailablePages < 100 (~400K, assuming a 4K page), the index is 100. If AvailablePages > 1100 (~4.4 MB, assuming a 4K page), the index is 0. Any value in between is calculated with the following formula:
100 - ((AvailablePages - 100) / 10)
This dialog is built with a call to the Microsoft Win32® application programming interface GlobalMemoryStatus() in the SDK.
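As a rough sketch (this is not the actual source of the dialog), any Win32 program can retrieve the same figures with GlobalMemoryStatus() and reproduce the index formula above. The 4K page size and the 100 and 1100 page thresholds are taken from the discussion above; the system's own dwMemoryLoad value may be computed differently:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    MEMORYSTATUS ms;
    DWORD availPages, index;

    ms.dwLength = sizeof(ms);
    GlobalMemoryStatus(&ms);   /* fills in physical, page file, and virtual figures */

    /* Reproduce the Memory Load Index formula from the page counts. */
    availPages = ms.dwAvailPhys / 4096;
    if (availPages < 100)
        index = 100;
    else if (availPages > 1100)
        index = 0;
    else
        index = 100 - (availPages - 100) / 10;

    printf("Reported load index:   %lu\n", ms.dwMemoryLoad);
    printf("Calculated load index: %lu\n", index);
    printf("Available physical:    %lu bytes\n", ms.dwAvailPhys);
    printf("Available page file:   %lu bytes\n", ms.dwAvailPageFile);
    return 0;
}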
If the system continually runs low on available memory or page file space, the cause is generally a memory leak in another process. To determine the process at fault, monitor each process's Page File Bytes or Working Set.
2019: The server was unable to allocate from the system nonpaged pool because the pool was empty.
Nonpaged pool pages cannot be paged out to the paging file, but instead remain in main memory as long as they are allocated. NonPagedPoolSize is calculated using complex algorithms based on physical memory size. However, you can use the following formulas to 'approximate' these values for an X86-based computer.
MinimumNonPagedPoolSize = 256K
MinAdditionNonPagedPoolPerMB = 32K
DefaultMaximumNonPagedPool = 1 MB
MaxAdditionNonPagedPoolPerMB = 400K
PAGE_SIZE = 4096

NonPagedPoolSize = MinimumNonPagedPoolSize +
    ((Physical MB - 4) * MinAdditionNonPagedPoolPerMB)
Example. On a 32 MB x86-based computer:
MinimumNonPagedPoolSize = 256K
NonPagedPoolSize = 256K + ((32 - 4) * 32K) = 1152K (about 1.1 MB)
MaximumNonPagedPoolSize = DefaultMaximumNonPagedPool +
    ((Physical MB - 4) * MaxAdditionNonPagedPoolPerMB)

If MaximumNonPagedPoolSize < (NonPagedPoolSize + PAGE_SIZE * 16),
then MaximumNonPagedPoolSize = (NonPagedPoolSize + PAGE_SIZE * 16)
Example. On a 32 MB x86-based computer:

MaximumNonPagedPoolSize = 1 MB + ((32 - 4) * 400K) = 12224K (about 11.9 MB)
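The approximation above can be captured in a few lines of C. All sizes are in KB, the constants are those given above, and the result is only the approximation described here, not what the kernel actually computes:

#include <stdio.h>

#define PAGE_SIZE_KB 4   /* 4096 bytes */

int main(void)
{
    unsigned long physMb = 32;        /* physical memory in MB */
    unsigned long minPool = 256;      /* MinimumNonPagedPoolSize, in KB */
    unsigned long minPerMb = 32;      /* MinAdditionNonPagedPoolPerMB */
    unsigned long defMax = 1024;      /* DefaultMaximumNonPagedPool (1 MB) */
    unsigned long maxPerMb = 400;     /* MaxAdditionNonPagedPoolPerMB */

    unsigned long pool = minPool + (physMb - 4) * minPerMb;  /* 1152K, about 1.1 MB */
    unsigned long max  = defMax + (physMb - 4) * maxPerMb;   /* 12224K, about 11.9 MB */

    if (max < pool + PAGE_SIZE_KB * 16)   /* enforce the floor described above */
        max = pool + PAGE_SIZE_KB * 16;

    printf("NonPagedPoolSize        = %luK\n", pool);
    printf("MaximumNonPagedPoolSize = %luK\n", max);
    return 0;
}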
You can monitor the system's nonpaged pool allocation with the "Memory Pool Non Paged Bytes" counter. If there is a shortage of nonpaged pool, you may also see the following error on a remote system or even the local system:
Not enough storage available to process this command.
If this occurs, start looking at each process's nonpaged pool allocation. This is generally caused by an application incorrectly making system calls and using up all allocated nonpaged pool.
"Paging File % Usage MAX" * Page file size = number of bytes used
Add together the bytes used for all page files. This is roughly the amount of memory that would need to be added to allow all of the applications to perform their operations with minimal paging. For example, if your page file is 100 MB and the % Usage MAX is 20%, then you would need about 20 MB of additional RAM to have a system that does minimal paging. The reason this formula only gives you an idea ABOUT how much memory to add is that a) not all page file "in use" code is accessed all of the time; and b) the formula ignores the requirements for code and mapped files not backed by the paging file. Therefore this estimate is neither an upper bound nor a lower bound; it is only an "indication." The truth is that there is no good way to know how much memory to add at this time. A more accurate way to measure the amount of memory an application requires is to run the application on a very large machine and measure its needs under slight memory pressure. (There is a tool in the Windows NT Resource Kit volume 3 utilities called Response Probe that can aid in this area.)
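As a trivial sketch of the formula in C (the 100 MB page file and 20% usage are the numbers from the example above):

#include <stdio.h>

int main(void)
{
    double pageFileMb = 100.0;   /* total page file size */
    double usageMax   = 0.20;    /* "Paging File % Usage MAX" from Performance Monitor */

    /* Only an "indication," as noted above; neither an upper nor a lower bound. */
    printf("Consider about %.0f MB of additional RAM\n", pageFileMb * usageMax);
    return 0;
}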
Gotcha. Adding memory without upgrading the secondary cache size sometimes degrades processor performance. This is because the secondary cache now has to map the larger memory space, usually resulting in lowered hit rates in the cache. This slows down processor-bound programs because they are scattered more widely in memory after memory has been added. (Secondary cache refers to the physical cache memory chip(s) usually located on the motherboard, as opposed to within the processor itself. In the future, processors will be built with secondary cache on the same substrate as the processor chip, or even within the processor chip itself.)
• If you determine that a great deal of memory is being consumed by an application for which you have the source code, you may want to investigate tuning the application to be less memory intensive. Good tools to use to profile your applications' memory allocation are the Working Set Tuner and the VADUMP tools in the Win32 SDK.
• Spreading paging files across multiple disk drives and controllers generally improves performance, because multiple disks can then process I/O requests concurrently. After all, you can have up to 16 separate page files. Also, since Windows NT has several system files that are frequently accessed, you may want to experiment with locating the paging file on one disk and the Windows NT system files on another. You should also locate the page file(s) on separate disk(s) from application files so that page file I/O and application file I/O can occur concurrently. This will only work if the disk driver(s) and controller(s) used can accommodate asynchronous I/O requests. Keep in mind that most IBM-compatible "non-super servers" have ATDISK as the default, and the ATDISK driver can have only one I/O request pending at a time. If your system mixes high-speed disks and low-speed disks, use the fastest disks for all your paging.
• Use Control Panel | System | Virtual Memory to set the page file size so that it will rarely need to be extended.
• Use Control Panel | Services to turn off unnecessary Windows NT services, and Control Panel | Network to uninstall any unnecessary Windows NT device drivers. This can free up both CPU and memory. One example might be the Spooler service: if you don't have a printer connected to the workstation or server, there may be no reason to have it running. Turning it off can save you 560K of committed memory and about 691K of nonpaged pool. Check it out with the PMON.exe tool in the Windows NT Resource Kit; look at the 'Commit Charge' column (this is the same as the "Process: Private Bytes" counter in Performance Monitor).
• User accounts are stored in a registry hive, which means each account consumes paged pool on a Primary Domain Controller or Backup Domain Controller. Therefore, the limit on the number of user accounts depends on the amount of memory and swap file space on your PDC and BDCs. User accounts take about 1K each, so 10,000 accounts take about 10 MB. You may want to consider a second domain (possibly a different domain model) if you have more than 15,000 user accounts. However, the only answer may be to add more memory.
• Some machines provide a ROM BIOS shadowing option. While this feature provides an advantage with MS-DOS, it is NOT an advantage with Microsoft Windows NT. ROM BIOS shadowing is the process of copying the BIOS from ROM into RAM and using either hardware or 386 enhanced mode to remap the RAM into the normal address space of the BIOS. Because reading RAM is much faster than reading ROM, BIOS-intensive operations are substantially faster. For example, MS-DOS uses the BIOS to write to the screen; therefore, with ROM BIOS shadowing, directory listings run more quickly. Windows NT does not use the BIOS (except during startup); therefore, no performance is gained by shadowing, and if ROM BIOS shadowing is not used, more RAM is available. With Windows NT, there is thus an advantage to disabling the ROM BIOS shadowing option. This applies to other BIOS shadowing schemes as well; typically the CMOS settings allow the system to shadow any BIOS, including the System BIOS, the Video BIOS, and other adapters' ROM BIOS (in a given select range).
A processor (running at a given clock speed) can execute a set number of instructions per second. Therefore, if a processor is switched among multiple threads that all have work to do, a given thread will take roughly x times longer to complete a given task (where x is the number of threads competing for the processor).
There are times when a thread has no work to do, such as when waiting for user input, or when waiting for another thread to finish a related operation. As long as the thread is in this waiting state, it will not be scheduled for execution and, thus, does not take up any CPU time. Since most Microsoft Windows®-type applications spend a considerable amount of time with their threads in this waiting state, there may be little performance degradation when running multiple Windows-based applications.
Some applications are considered CPU intensive. A CPU-intensive application almost always has work to do and spends very little, if any, time in the waiting state. For example, the following C program consumes 100% of the CPU. When additional applications are started, their performance, and that of the CPU-intensive application, will be less than optimal, since all must share the processor's time. This is an example of how NOT to write an application; a better approach, sketched after the figure below, would be to create an event (or a semaphore) and wait on it.
int main(void)
{
    while (1) { /* spin forever; the thread never waits, so it consumes 100% of one CPU */ }
}
The figure below shows the application's utilization of the CPU.
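By contrast, here is a minimal sketch of the better approach mentioned above. The thread waits on an event and consumes no CPU time until some other thread (hypothetical here, and not shown) signals that there is work to do:

#include <windows.h>

int main(void)
{
    /* Auto-reset event, initially nonsignaled; another thread would signal it. */
    HANDLE hWork = CreateEvent(NULL, FALSE, FALSE, NULL);
    if (hWork == NULL)
        return 1;

    /* The thread blocks here in the waiting state, using no CPU time,
       until the event is signaled. */
    WaitForSingleObject(hWork, INFINITE);

    /* ...perform the work here... */

    CloseHandle(hWork);
    return 0;
}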
If the "Processor % Processor Time" counter consistently registers at or near 100%, the processor may be the bottleneck. ('System % Total processor time" can be viewed for multiprocessor systems.) If this occurs you need to determine WHO or WHAT is consuming the CPU. To determine which process is using up most of the CPU's time, monitor the "Process objects % Processor Time" for all of the process instances (as in the previous figure).
• You can tell if the CPU activity is due to applications or to servicing hardware interrupts by monitoring "Processor Interrupts/sec." This is the number of device interrupts the processor is experiencing. A value over 1000 should cause you to look at the efficiency of hardware I/O devices such as the disk controllers and network cards.
• You can also monitor "System System Calls/sec." System Calls/sec is the frequency of calls to Windows NT system service routines. These routines perform all of the basic scheduling and synchronization of activities on the computer and provide access to nongraphical devices, memory management, and name space management. If there are many more interrupts per second than system calls, it could indicate that a hardware device is generating an excessive number of interrupts.
w Monitor the "System Context Switches/sec" as well. Too frequent context switching can be caused if semaphores or critical sections (see the Windows NT SDK for more information) are placed at too low a level in order to attain high concurrency. The only way to solve this problem is to re-evaluate the priority place on the source code.
• Schedule CPU-intensive applications during off-peak hours. You can use the AT scheduler that ships with Windows NT, as in the example below.
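For example, to run a hypothetical batch job (myjob.cmd and its path are made-up names) at 11:00 P.M. every weeknight, with the Schedule service running:

at 23:00 /every:M,T,W,Th,F "c:\jobs\myjob.cmd"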
• If you have control over the application source, you may want to investigate tuning the application to be less CPU intensive. There are a number of tools available with the Windows NT SDK that allow you to do this, such as WAP (Windows API Profiler), CAP (Call Attributed Profiler), FIOSAP (File I/O and Synchronization Win32 API Profiler), and the Win32 API Logger.
• Distribute applications and processes across multiple machines.
• Upgrade the processor if possible. Keep in mind that Windows NT runs on MIPS and Digital Alpha AXP machines as well as the Intel family (386, 486, and Pentium). Most servers are either file servers or application servers; even though they use the same operating system, each uses the machine's resources in a different way. A file server generally maximizes system bus utilization and under-utilizes the processor. A 486 clock-doubler chip in this machine would not provide a big performance enhancement over a typical 486 chip. An application server (such as a database server running Microsoft SQL Server and Systems Management Server), however, utilizes the processor subsystem significantly more than the file server does.