Investigating How CPU Stress Affects The Performance of Windows Computers

Introduction

Performance problems on Windows computers seem to have peaked around the time Windows Vista was released. Things have improved over the years; however, slowdowns remain a significant issue for many people. An app or an entire computer may slow down for a number of reasons, most often because of a bottleneck in system resources. A poorly written app locks up trying to access the internet when it is unavailable; video processing software takes up available memory so that everything else is swapped out; an antivirus with a poorly implemented file caching mechanism ends up scanning every single file every time it is accessed. Occasionally, a runaway thread of a resource-intensive process consumes all the time of the available CPUs (such a thread is said to be CPU-bound) [1, p. 510]. How such a “greedy” process affects the entire system depends on many variables, such as the number of threads it runs, their priorities, their processor affinity, and the number of processors available, to name a few. We set out to investigate how these parameters affect the real-life performance of a computer running one or more CPU-bound threads.

Processes and Threads

In Microsoft Windows terms, an application has one or more processes (or executing programs) associated with it. A process consists of one or more threads, and it is those threads that get CPU time. Each thread executes a part of the application’s code, and threads are allowed to run in parallel. There are dozens of threads running seemingly in parallel on any particular computer. This is achieved by the Windows kernel scheduler, which allocates a short time slice for each thread to run on an available CPU and then rapidly switches between these threads (Figure 1). The length of this time slice, called a quantum, depends on the system. It is typically 2 clock intervals for consumer Windows versions and 12 clock intervals for Windows servers. A clock interval, in turn, varies from about 10 ms on x86 single-core processors to about 15 ms on most x86 and x64 multiprocessors. Switching between threads means the CPU needs to save the local data of the previous thread and load the data of the next thread, which takes time. This context switch overhead means quanta cannot be too short. At the same time, quanta need to be short enough for all threads to get CPU time reasonably often; otherwise the UI becomes unresponsive, audio stutters, and other undesired consequences follow.
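To put the figures above into milliseconds, here is a small Python sketch that multiplies the quantum length (in clock intervals) by the clock interval. The dictionary keys and the function name are illustrative only, and the values are the approximate ones quoted in the text rather than measurements.

```python
# Approximate values quoted in the text; names are illustrative.
CLOCK_INTERVAL_MS = {"x86 uniprocessor": 10, "multiprocessor": 15}
QUANTUM_CLOCK_INTERVALS = {"client": 2, "server": 12}


def quantum_ms(edition: str, hardware: str) -> int:
    """Return the approximate quantum length in milliseconds."""
    return QUANTUM_CLOCK_INTERVALS[edition] * CLOCK_INTERVAL_MS[hardware]


for edition in QUANTUM_CLOCK_INTERVALS:
    for hardware in CLOCK_INTERVAL_MS:
        print(f"{edition} on {hardware}: about {quantum_ms(edition, hardware)} ms")
```

With these numbers, a thread on a consumer multiprocessor system runs for roughly 30 ms before the scheduler reconsiders it, compared with roughly 180 ms on a server.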

Figure 1. CPU-bound threads are less of an issue in computers with multiple processors. Intel’s hyperthreading technology enables a single CPU core to act as two logical processors capable of executing threads in parallel.

How often a thread is scheduled to run depends on the number of threads in the system, the number of available CPUs, and the priority of the thread. Since the CPUs are a limited resource, the scheduler queues up threads to be executed on each available CPU. Threads with higher priority get in first. In fact, a thread that’s already running can be preempted (interrupted and replaced) by another thread with a higher priority. A thread’s dynamic priority depends on the static priority of its process (the priority class), the relative priority of the thread, and an additional priority boost that the scheduler may give to certain threads, for example, those that haven’t run in a while, to avoid CPU starvation or to ensure a smooth UI experience. On a multiprocessor system, a thread may be allowed to run on one or more CPUs. This is determined by the processor affinity of the thread. A thread with an affinity for a single processor will only be allowed to run on that particular processor.
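As an illustration of how the priority class and the relative thread priority combine, the Python sketch below reproduces the base priority levels listed in Microsoft's "Scheduling Priorities" documentation. It covers only the base priority; the temporary boosts applied by the scheduler are not modelled, and the function and table names are ours.

```python
# Base priority of a thread with THREAD_PRIORITY_NORMAL in each priority class.
CLASS_BASE = {
    "IDLE_PRIORITY_CLASS": 4,
    "BELOW_NORMAL_PRIORITY_CLASS": 6,
    "NORMAL_PRIORITY_CLASS": 8,
    "ABOVE_NORMAL_PRIORITY_CLASS": 10,
    "HIGH_PRIORITY_CLASS": 13,
    "REALTIME_PRIORITY_CLASS": 24,
}

# Offsets applied by the ordinary relative thread priorities.
RELATIVE_OFFSET = {
    "THREAD_PRIORITY_LOWEST": -2,
    "THREAD_PRIORITY_BELOW_NORMAL": -1,
    "THREAD_PRIORITY_NORMAL": 0,
    "THREAD_PRIORITY_ABOVE_NORMAL": 1,
    "THREAD_PRIORITY_HIGHEST": 2,
}


def base_priority(priority_class: str,
                  relative: str = "THREAD_PRIORITY_NORMAL") -> int:
    """Return the base priority level (1..31) of a thread."""
    realtime = priority_class == "REALTIME_PRIORITY_CLASS"
    # The two extreme relative priorities map to fixed levels.
    if relative == "THREAD_PRIORITY_IDLE":
        return 16 if realtime else 1
    if relative == "THREAD_PRIORITY_TIME_CRITICAL":
        return 31 if realtime else 15
    return CLASS_BASE[priority_class] + RELATIVE_OFFSET[relative]


for cls in CLASS_BASE:
    print(cls, base_priority(cls))
```

Note that every thread in a REALTIME_PRIORITY_CLASS process sits at level 16 or above, that is, above every thread of every other priority class.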

The Experiment

To test as many scenarios as possible, we wrote a program that stress tests the CPU by creating a process running a number of CPU-intensive threads. We ran this process with different priorities, processor affinities, and per-CPU load targets. In one of the experiments, we used a dual-core Intel system with hyperthreading (effectively exposing four logical processors) and ran a process with 1 to 8 threads, 6 priority classes from IDLE_PRIORITY_CLASS to REALTIME_PRIORITY_CLASS, threads either constrained to a single logical processor or not constrained at all, and a per-CPU load target varying from 0 to 100% in 10% increments. Overall, this resulted in 1,056 tests (8 thread counts × 6 priority classes × 2 affinity settings × 11 load targets). The CPUs were stressed with simple arithmetic operations (Figure 2). A controlled load per logical CPU was achieved by putting the thread to sleep for an appropriate fraction of time.
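The size of the test matrix can be checked by enumerating the parameter combinations. The Python sketch below does this; the variable names and the iteration order are assumptions made for illustration, and the two affinity masks are the ones quoted in the caption of Figure 4.

```python
from itertools import product

THREAD_COUNTS = range(1, 9)             # 1 to 8 threads
PRIORITY_CLASSES = [
    "IDLE_PRIORITY_CLASS",
    "BELOW_NORMAL_PRIORITY_CLASS",
    "NORMAL_PRIORITY_CLASS",
    "ABOVE_NORMAL_PRIORITY_CLASS",
    "HIGH_PRIORITY_CLASS",
    "REALTIME_PRIORITY_CLASS",
]
AFFINITY_MASKS = [0x0001, 0xFFFF]       # single CPU vs. unconstrained
LOAD_TARGETS = range(0, 101, 10)        # 0%, 10%, ..., 100%

# One tuple per test: (threads, priority class, affinity mask, load target).
test_matrix = list(product(THREAD_COUNTS, PRIORITY_CLASSES,
                           AFFINITY_MASKS, LOAD_TARGETS))

print(len(test_matrix))  # 8 * 6 * 2 * 11 = 1056
```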

Figure 2. The C# code used to task a CPU to a particular load.
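Figure 2 is not reproduced in this text, so the following is not the authors' C# code. It is a minimal Python sketch of the same duty-cycle idea, under our own assumptions: within each period the thread does simple arithmetic for the target fraction of the time and sleeps for the rest. The function name, the 100 ms period, and the arithmetic itself are hypothetical. Note also that CPython threads share a global interpreter lock, so loading several logical CPUs this way would require one process per CPU rather than one thread.

```python
import time


def run_duty_cycle(load_percent: float, duration_s: float,
                   period_s: float = 0.1) -> float:
    """Keep one CPU busy for load_percent of each period, sleep for the rest.

    Returns the total time spent in the busy loop, in seconds.
    """
    busy_target = period_s * load_percent / 100.0
    deadline = time.perf_counter() + duration_s
    busy_total = 0.0
    x = 0
    while time.perf_counter() < deadline:
        start = time.perf_counter()
        # Busy phase: simple integer arithmetic until the busy share is used up.
        while time.perf_counter() - start < busy_target:
            x = (x * 31 + 7) % 1_000_003
        busy_total += time.perf_counter() - start
        # Idle phase: at a 100% target this becomes sleep(0), which only
        # offers the rest of the time slice to other ready threads.
        time.sleep(max(0.0, period_s - busy_target))
    return busy_total
```

For example, `run_duty_cycle(50, 2.0)` should spend roughly one second out of two in the busy loop, although the exact split depends on timer resolution and on what else the system is running.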

As a gauge of real-life computer performance, we measured the time from issuing the command that launches Notepad until the app rendered its window. We allowed 10 seconds for the threads to kick in before launching Notepad. We then launched Notepad 10 times to ensure memory caching did not affect our results significantly. Finally, we launched Notepad 20 more times and averaged the time over those 20 launches.
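The measurement procedure above (settle, warm up, then average) can be sketched in Python as follows. The `launch_once` callable is hypothetical: it stands for whatever code starts Notepad, waits for its window to appear, closes it, and returns the elapsed time in seconds. How the authors detected the rendered window is not described in the text, so that part is deliberately left out.

```python
import statistics
import time
from typing import Callable


def average_launch_time(launch_once: Callable[[], float],
                        settle_s: float = 10.0,
                        warmup_runs: int = 10,
                        measured_runs: int = 20,
                        sleep: Callable[[float], None] = time.sleep) -> float:
    """Return the mean launch time over measured_runs launches."""
    sleep(settle_s)                      # let the stress threads kick in
    for _ in range(warmup_runs):         # warm the caches, discard timings
        launch_once()
    timings = [launch_once() for _ in range(measured_runs)]
    return statistics.mean(timings)
```

Passing `sleep` as a parameter is a design choice that makes the procedure easy to exercise with a fake launcher and no real delay.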

We ran the tests on virtual Windows computers as well as a few physical computers. In this article, we discuss the results obtained on a Windows 10 system running on Amazon WorkSpaces with the following configuration:

Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz, 2500 Mhz, 2 Core(s), 4 Logical Processor(s); 16 GB RAM.

Note that Amazon WorkSpaces use SSD drives; therefore, memory swapping will be relatively fast compared to a computer with a slower HDD [2, p. 73]. Our CPU-bound threads did not allocate memory, and we launched Notepad a few times prior to measuring the results, so we expect that the type of data storage made no significant difference.

The Results

The 1,056 tests took over 10 hours to run on an Amazon WorkSpaces Windows 10 instance. The resulting CSV file can be downloaded from our server. Bear in mind that the “CPU load” values in the CSV are the per-CPU load targets of each thread rather than the combined CPU load you’d see in Task Manager. A CPU load of 50% therefore means that each thread kept a single logical processor busy about 50% of the time.

Let’s take a bird’s-eye view of the results with the correlation graph obtained in R (Figure 3). You only need to look at the bottom row. As expected, the maximum total CPU load imposed by a single thread running at full throttle is 100% divided by the number of logical processors [1]. As this was a system with four logical CPUs, our single “greedy” thread could only consume up to 25% of the overall CPU capacity of the system. Running such a thread at 100% throttle, we saw roughly 25% utilisation on each of the four logical processors. Notice how the average time (to start Notepad.exe) rapidly goes up as the number of threads increases, starting with four threads.
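The 25% figure follows from simple arithmetic, which the Python sketch below generalises to several threads. It is a back-of-the-envelope upper bound under our own simplifying assumptions (each thread can keep at most one logical CPU busy, and nothing else competes for the CPUs), not a model of the scheduler; the function and parameter names are hypothetical.

```python
from typing import Optional


def max_total_load(threads: int, per_thread_load: float, logical_cpus: int,
                   allowed_cpus: Optional[int] = None) -> float:
    """Upper bound on overall CPU utilisation, as a fraction between 0 and 1.

    per_thread_load is the per-CPU load target of each thread (0..1).
    allowed_cpus is the number of CPUs permitted by the affinity mask
    (None means all logical CPUs).
    """
    usable = logical_cpus if allowed_cpus is None else min(allowed_cpus, logical_cpus)
    demand = threads * per_thread_load   # in units of one fully busy logical CPU
    return min(demand, usable) / logical_cpus


print(max_total_load(1, 1.0, 4))                  # one greedy thread, four CPUs
print(max_total_load(8, 0.9, 4, allowed_cpus=1))  # eight threads pinned to one CPU
```

Both calls give 0.25: a single unconstrained thread and eight threads pinned to one logical CPU are each limited to a quarter of the machine's total capacity.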

Figure 3. The correlation matrix between the average time to launch Notepad.exe and various parameters such as per-CPU load, the number of threads, the priority of the CPU-intensive process, and its processor affinity.

Overall, the average launch time increased roughly exponentially as the CPU load per thread approached 100%. The slowdown became noticeable at about 60% CPU load, though this depends a lot on other factors such as the priority of the process, the number of threads, and the processor affinity. Surprisingly, the system slowed down even when our CPU-bound threads ran in the context of a process with IDLE_PRIORITY_CLASS (Figure 4). Running threads at 90-100% CPU load doubled or tripled the average launch time, which is significant enough for most users to notice. There was only a minor difference in the performance impact between a process with threads constrained to a single CPU (via an affinity mask) and an unconstrained one.

Figure 4. The average launch time of Notepad.exe on a dual-core Intel system with 4 logical CPUs. The process responsible for stressing the CPU was launched with different priority classes and had 1 to 8 CPU-bound threads running. The blue line denotes the results when our process was constrained to a single CPU with an affinity mask of 0x0001, whereas the red line corresponds to an affinity mask of 0xFFFF allowing its threads to run on any or all available CPUs.
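An affinity mask is a bit field in which bit n stands for logical CPU n. The short Python sketch below decodes the two masks mentioned in the caption for a machine with four logical CPUs; the helper names are ours.

```python
from typing import Iterable, List


def cpus_in_mask(mask: int, logical_cpus: int) -> List[int]:
    """List the logical CPU indices that are both present and set in the mask."""
    return [cpu for cpu in range(logical_cpus) if mask & (1 << cpu)]


def mask_for_cpus(cpus: Iterable[int]) -> int:
    """Build an affinity mask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


print(cpus_in_mask(0x0001, 4))          # [0]
print(cpus_in_mask(0xFFFF, 4))          # [0, 1, 2, 3]
print(hex(mask_for_cpus([0, 1, 2, 3]))) # 0xf
```

On the four-CPU test machine, only the lowest four bits of 0xFFFF correspond to existing logical processors, so that mask effectively means "any CPU".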

Things look very different for processes running with priority classes above NORMAL_PRIORITY_CLASS (Figure 5). In such cases, threads that are not constrained to a single CPU affect system performance dramatically, increasing the average start time from less than 0.1 seconds up to 80 seconds. Notepad.exe was able to launch (although it took 80 seconds) despite 8 threads running at 100% CPU intensity in a REALTIME_PRIORITY_CLASS process. While Windows schedules higher priority threads to run first (and preempts a lower priority thread if a higher priority thread becomes available), any thread can enter a waiting state, either voluntarily or, for example, while the system resolves a paging I/O, and thus yield the CPU to another thread. Windows will also boost the priority of some threads in a foreground application as well as occasionally boost low-priority threads to avoid CPU starvation and priority inversion scenarios [3, pp. 411-447]. However, this temporary boost happens rarely and only for a brief instant. In our case, Notepad.exe was able to launch because our CPU-bound threads still called the Sleep(0) function at a 100% per-CPU load target (Figure 2). This allowed Windows to run other threads. When we removed Sleep altogether while running four or more threads with above-normal priority, the computer became unresponsive.

Figure 5. The same process with 1 to 8 CPU-bound threads with priorities ranging from above normal to real-time. Red lines correspond to threads with unrestricted processor affinity.

When one or more threads (say, running at 90%) are restricted to a single CPU on a system with 4 processors, we would expect only that particular CPU to be occupied, leaving the other three available and therefore having no significant impact on system performance. However, as seen in Figure 6, the performance hit is very significant and grows with the number of threads, but only when more than 4 of our CPU-intensive threads are present. When the threads are not restricted to a particular CPU, the performance hit is linear and less significant (at least for this particular per-CPU load target of 90%).

Figure 6. Average launch time on a system with 1 to 8 concurrent threads running at a 90% per-CPU load target.

Based on all of the above, we recommend that software developers do not raise the priority of their processes and threads above normal unless the work is time-critical, and even then, the priority should be dropped back down to normal once the time-critical part is dealt with. Threads that perform lengthy calculations and therefore do not have a chance to yield execution should be used sparingly. This is less of an issue now that most if not all computers feature multiple processors [4]. Applications that take advantage of multiple CPUs should, at the very least, drop the priority of their resource-intensive threads below normal.
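As one possible way to follow this recommendation, the Python sketch below lowers the priority of the current process before it starts heavy work. On Windows it calls the Win32 functions GetCurrentProcess and SetPriorityClass through ctypes with BELOW_NORMAL_PRIORITY_CLASS; on other platforms it falls back to os.nice, which is our stand-in and not something the article tested. The function name and the niceness increment of 5 are arbitrary choices for illustration.

```python
import os
import sys

BELOW_NORMAL_PRIORITY_CLASS = 0x00004000  # value from the Win32 headers


def lower_own_priority() -> bool:
    """Drop the current process below normal priority; return True on success."""
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = ctypes.c_void_p
        kernel32.SetPriorityClass.argtypes = [ctypes.c_void_p, ctypes.c_uint32]
        kernel32.SetPriorityClass.restype = ctypes.c_int
        handle = kernel32.GetCurrentProcess()  # pseudo-handle, no need to close
        return bool(kernel32.SetPriorityClass(handle, BELOW_NORMAL_PRIORITY_CLASS))
    # POSIX fallback (assumption): a positive increment lowers the priority.
    os.nice(5)
    return True
```

A CPU-heavy tool would call `lower_own_priority()` once at startup, before spawning its worker threads, so that every thread it creates starts from the lowered priority class.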

An effective strategy for computer optimization tools may lie in detecting CPU-bound threads with normal or above-normal base priorities and then restricting the processor affinity and/or reducing the priority of the process running these threads. This, of course, can affect time-critical applications. While this may be a minor nuisance for a regular computer user (e.g. a music player working in the background may stutter), it can have devastating consequences in time-critical equipment such as that used in health care. Note that changing the processor affinity or the priority of an application whose threads are not CPU-bound, or whose threads occupy fewer logical CPUs than are available, is not likely to appreciably improve system performance.

Further research is needed to investigate how CPU-intensive processes affect other aspects of system performance such as rendering the UI, responsiveness to mouse or keyboard input, file operations, browsing experience, and so on. Stressing CPUs with other types of tasks, e.g. fast Fourier transform routines, may be explored, as well as using machine language to ensure that specific instructions (e.g. floating-point instructions, CLMUL, etc.) are being tested.

References

[1] Russinovich, M. E., & Margosis, A. (2016). Troubleshooting with the Windows Sysinternals tools. Microsoft Press.
[2] Eilam, E. (2005). Reversing: Secrets of reverse engineering. John Wiley & Sons.
[3] Russinovich, M. E., Solomon, D. A., & Ionescu, A. (2012). Windows internals. Pearson Education, pp. 411-447.
[4] Ghuman, S. S. (2016). Comparison of Single-Core and Multi-Core Processor. International Journal of Advanced Research in Computer Science and Software Engineering, 6(6).