All through the ’90s, the performance of CPUs grew at a breakneck pace. Each generation added more hardware features to make code run faster. As transistor technology continued to advance, CPU manufacturers pushed exotic and complex tricks into their CPUs: logic to predict the next instruction, to dynamically reorder instructions, and to merge and break apart groups of instructions, all in an attempt to cram yet more work into each fraction of a second. In recent years, however, CPU makers have found it increasingly hard to make meaningful gains in raw execution speed.
To make use of the extra transistors, CPU makers have instead glued together many CPU backends into a single unit. Instead of building a bigger, more powerful engine for a car, we cram several engines into the car and make them work together.
We call each of these engines a “core”. This has given us in the software world a bonanza of resources to work with. Unfortunately, taking advantage of those resources is a significant challenge.
What is a thread?
To take advantage of CPUs and CPU cores, we have a software concept called a “thread”. At a simple level, a thread is a container for a sequence of instructions that an operating system can schedule onto a core. We can have many threads running on one core, many threads spread across many cores, or just one thread running on one core while the other cores sit idle.
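A minimal sketch in Python of what this looks like in code: we describe a sequence of instructions as a function, wrap it in threads, and let the operating system decide which core(s) run them. The function and names here are illustrative, not part of any particular program.

```python
import threading

# Shared dict; each thread writes only its own key, so no lock is needed here.
results = {}

def sum_up(n, label):
    # The sequence of instructions this thread will run.
    results[label] = sum(range(n))

# Create two threads; the operating system decides which core(s) run them.
workers = [
    threading.Thread(target=sum_up, args=(100_000, "thread-1")),
    threading.Thread(target=sum_up, args=(100_000, "thread-2")),
]
for w in workers:
    w.start()   # begin running on some core
for w in workers:
    w.join()    # wait for the thread to finish
```

Whether these two threads run on one core (taking turns) or two cores (truly in parallel) is entirely up to the operating system's scheduler.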
This is a lot to digest, so let’s look at some examples, using some real world analogies.
Several threads, one core
Let’s imagine a TA grading student assignments. In this analogy, the TA is a core, and each paper is a thread. He starts grading paper one, works on it for a minute, then stops. He grades paper two for a minute, then stops, and so on. When he has worked on every paper for one minute, he goes back to the first one.
Several threads, several cores
As you can imagine, this poor TA is taking a very long time to finish. So the professor decides to speed things up by adding a second core: another TA.
Now, TA1 takes a paper and works on it for one minute, while TA2 takes another paper and does the same. Each then repeats the pattern as before: grabbing a paper from the top of the pile, working on it for a minute, and putting it on the bottom.
Now, as you have probably noticed, this is not very efficient. It would probably be faster for each TA to start a paper, grade it all the way through, then start the next one. It takes time to put a paper back, pick up a new paper, read it, figure out how far along it has been graded, and start working again. This is, however, the way computers work, and it is a very real performance cost, called “context switching”. So why do computers work like this? Because usually, the tasks each thread needs to run don’t all take the same amount of time. Some tasks take a very long time to complete (such as burning a CD). Some tasks are very fast, like scrolling down a document or webpage. Sure, if we didn’t switch threads, burning a CD might complete a few seconds sooner; but would you really want your entire computer to lock up for several minutes while that task completed?
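The responsiveness trade-off above can be sketched in a few lines of Python. The names are hypothetical stand-ins: `burn_cd` simulates a long-running task with a sleep, while the main thread keeps handling quick, interactive work instead of waiting for it.

```python
import threading
import time

def burn_cd():
    # Stand-in for a long-running task like burning a disc.
    time.sleep(0.2)

def scroll_page():
    # Stand-in for a quick, interactive task.
    return "scrolled"

burner = threading.Thread(target=burn_cd)
burner.start()

# While the long task runs on its own thread, we keep handling
# quick tasks, so the machine stays responsive instead of locking up.
responses = [scroll_page() for _ in range(3)]

burner.join()
```

If the scrolling work had to wait for `burn_cd` to finish, the user would see a frozen screen; by giving the long task its own thread, the quick tasks keep flowing.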
As you can see, using multiple threads can speed up completing a task, but at a cost of context switching. There are many ways we can use threads to speed up tasks, but they come with many other pitfalls, as well. Let’s look at ways to use multiple threads to speed up tasks.
For this example, we will look at cashiers ringing up groceries at a store. Let’s ignore the problem of context switching; let’s pretend we have infinite cores.
Let’s take the case of one cashier (thread). There is only one line. The one cashier rings up the groceries, bags them, and helps carry them to the car. Obviously, this does not scale well as the line gets longer. So how do we use two cashiers?
First, we can use them in parallel. Each cashier gets a line, and both assist shoppers at the same time. Another way to use them is to create a pipeline: one cashier rings up shoppers, while the second bags the groceries and assists shoppers to the car.
Note that these can be done together. If we have four cashiers, we can create two pipelines, and assist two lines of shoppers at the same time. Also note that while this can improve throughput, it also increases cost. However, it doesn’t always increase throughput; if there aren’t enough shoppers, most of the cashier threads are going to be idle, waiting for a job to do. Another issue is that, when using a pipeline, the work can be stalled if one task, such as taking groceries to a car, takes much longer than other tasks, such as ringing up, or bagging groceries.
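A pipeline like this is usually built with a queue between the stages. Here is a minimal sketch in Python, with made-up names: one thread rings up customers and pushes them onto a queue, a second thread pulls from the queue and bags. A `None` sentinel tells the second stage there is no more work.

```python
import queue
import threading

rung_up = queue.Queue()   # hands customers from stage 1 to stage 2
bagged = []

def ring_up(customers):
    # Stage 1: one cashier rings up every customer.
    for name in customers:
        rung_up.put(name)
    rung_up.put(None)      # sentinel: no more customers today

def bag():
    # Stage 2: a second cashier bags whatever stage 1 has finished.
    while True:
        name = rung_up.get()
        if name is None:
            break
        bagged.append(name + ": bagged")

cashier1 = threading.Thread(target=ring_up, args=(["Alice", "Bob"],))
cashier2 = threading.Thread(target=bag)
cashier1.start()
cashier2.start()
cashier1.join()
cashier2.join()
```

Notice how the stall problem shows up here: if `bag` were much slower than `ring_up`, items would pile up in the queue, and stage 1 would finish long before stage 2.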
How can we use threads more efficiently? Instead of tying each person to a specific task, we let each of them take whatever task is available. We have three tasks (ringing up, bagging, assisting to the car) and a group of three cashiers. This group of cashiers is called a thread pool.
Each cashier then takes whatever task is available. If the cashier who assists to the car is not back when the cashier finishes bagging, that cashier then assists to the car, and so on. Of course, this can lead to issues, as well. If assisting to the car takes a very long time, it’s possible that all three cashiers are busy assisting, and no one is ringing up customers. So we need to prioritize the tasks. We assign ringing up as a high priority, bagging as a medium priority, and assisting to the car as a low priority. Now, a cashier will only take the highest priority task that is available. The cashier bagging groceries will only assist someone to the car if there are no more groceries to bag. This is called priority scheduling.
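A thread pool with priority scheduling can be sketched in Python using a `PriorityQueue`: each idle worker takes the highest-priority task still waiting. The task names and priorities below are illustrative.

```python
import queue
import threading

tasks = queue.PriorityQueue()
completed = []
completed_lock = threading.Lock()

# Lower number = higher priority: ringing up beats bagging beats carrying out.
for job in [(2, "assist to car"), (0, "ring up"), (1, "bag groceries")]:
    tasks.put(job)

def cashier():
    # Each pool worker repeatedly takes the highest-priority task available.
    while True:
        try:
            _priority, job = tasks.get_nowait()
        except queue.Empty:
            return  # no work left; this cashier goes home
        with completed_lock:
            completed.append(job)

pool = [threading.Thread(target=cashier) for _ in range(3)]
for t in pool:
    t.start()
for t in pool:
    t.join()
```

Because all three workers pull from one shared priority queue, a cashier never starts bagging while a higher-priority ring-up job is still waiting.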
So, how do we use two cashiers in parallel? Well, let’s try having them both check out people on the same register. That doesn’t work at all, because one person’s grocery bill gets mixed in with another’s, leaving the bill in an entirely incoherent state. So we need to manage access to the register. In programming terms, the first cashier will “lock” the register while ringing up groceries, then “unlock” it when finished. Then the next cashier can take control and start ringing up the next customer. Now we have fixed the data integrity issue, but introduced a very serious performance bottleneck. This is called resource contention.
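In Python this is a `threading.Lock`. A minimal sketch, with hypothetical names: two cashier threads add prices to one shared register total, and the lock guarantees that only one of them updates it at a time.

```python
import threading

register_total = 0
register_lock = threading.Lock()

def ring_up(prices):
    global register_total
    for price in prices:
        # Lock the register, update it, unlock; the `with` block
        # ensures only one cashier touches the total at a time.
        with register_lock:
            register_total += price

cashier1 = threading.Thread(target=ring_up, args=([5, 10, 15],))
cashier2 = threading.Thread(target=ring_up, args=([20, 25],))
cashier1.start()
cashier2.start()
cashier1.join()
cashier2.join()
```

Without the lock, two threads could read the same old total, each add their price, and write back, losing one of the updates; with the lock, the final total is always correct.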
There is also another potential issue this has introduced. Imagine the cashier who has locked the register gets an emergency phone call. The cashier leaves to handle the emergency but forgets to unlock the register. Now the other cashier and all the customers have to wait. If the first cashier never comes back, they will be waiting forever. This is called deadlocking.
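The usual defense against a forgotten unlock is to guarantee the release happens no matter how the cashier leaves. A sketch of the idiom in Python (the emergency here is a deliberately raised exception, standing in for anything that interrupts the work):

```python
import threading

register_lock = threading.Lock()

def ring_up_with_emergency():
    register_lock.acquire()
    try:
        # Mid-transaction, the cashier is called away.
        raise RuntimeError("emergency phone call")
    finally:
        # The finally clause runs no matter what, so the register
        # is always unlocked, even when the cashier never returns.
        register_lock.release()

try:
    ring_up_with_emergency()
except RuntimeError:
    pass  # the emergency happened, but the lock was still released

# The next cashier can take the register instead of waiting forever.
register_is_free = register_lock.acquire(blocking=False)
if register_is_free:
    register_lock.release()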
Solving Resource Contention
The easiest way to solve resource contention is to not share resources at all. In our grocery store, we solve the issue by installing a second cash register. Now both clerks can process customers at the same time, without interfering with each other. But this came with a cost, the cost of buying a cash register; in computers this comes with a cost too: more memory, and more time spent building the extra resources. Additionally, we now have two sets of sales records that have to be reconciled.
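A sketch of the two-register approach in Python, with illustrative names: each thread writes only to its own slot, so no lock is needed while the cashiers work; the price is a reconciliation step at the end.

```python
import threading

# Two registers: each cashier records sales on their own, so no lock is needed.
records = [0, 0]

def ring_up(register_index, prices):
    records[register_index] = sum(prices)

cashier1 = threading.Thread(target=ring_up, args=(0, [5, 10]))
cashier2 = threading.Thread(target=ring_up, args=(1, [20]))
cashier1.start()
cashier2.start()
cashier1.join()
cashier2.join()

# The cost of not sharing: at the end of the day, the two sets of
# records still have to be reconciled into one grand total.
grand_total = sum(records)
```

This trade is common in real programs: give each thread private data, accept the extra memory, and merge the results once at the end.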
Sometimes, we simply have to share resources. In this example, let’s add a manager. We want there to be only one manager, as only one person should have the accountability and responsibility for certain problems. Now, only one clerk can use the manager at a time. Imagine a clerk calls the manager over and says, “Hey, I might have a problem, so stay here just in case.” Now that manager is “in use”, and any other clerk who needs manager assistance has to wait. This is another bottleneck. To alleviate it, we instruct our clerks to call the manager only when actually needed, and to let the manager leave as soon as the problem is resolved. Minimizing the time a shared resource is locked can significantly improve performance, and minimize the chance of deadlocking.
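In code, this means keeping the critical section as small as possible. A sketch in Python, with hypothetical names: the slow preparation happens outside the lock, and the manager (the lock) is held only for the brief final step.

```python
import threading

manager_lock = threading.Lock()
resolved = []

def handle_problem(problem):
    # Do the slow preparation *before* calling the manager over...
    details = problem.upper()   # stand-in for slow, lock-free work
    # ...then hold the manager only for the brief final decision.
    with manager_lock:
        resolved.append(details)

clerks = [
    threading.Thread(target=handle_problem, args=("price check",)),
    threading.Thread(target=handle_problem, args=("refund",)),
]
for c in clerks:
    c.start()
for c in clerks:
    c.join()
```

Had the `with manager_lock:` block wrapped the whole function, the clerks would effectively take turns, and we would lose most of the benefit of having more than one.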
There are many more possible problems when dealing with multithreading: livelock, false sharing, and race conditions, to name a few. Even when we write correct multithreaded code, it can be hard to extract serious performance benefits. CPU makers continue to give us more and more raw power, but using that power to actually benefit the user is a significant challenge for programmers to overcome.