Kurt Seifried

Python Performance Part 2

In part one I covered a basic introduction to threads and the producer-consumer pattern. Now in part two I’m going to cover some (unfortunately common) pathological cases where threading in Python can make things much worse.

An example of when threading makes things worse

Last week we covered basic Python threading strategies and work queues. So if you have a program you want to speed up, simply thread it and it'll run faster, right? Wrong. For certain types of workloads and computations, using threading in Python will make things worse (in some cases much worse).

Take for example a simplified program (courtesy of David Beazley) that does some number crunching (e.g. counting from a large number down to zero) several times. This eliminates a lot of variables (memory, disk IO, network IO, etc.) and should run very quickly on any modern CPU. If we want to count down from the large number twice it should take about as long to do it sequentially as in parallel, since our problem is CPU bound, right?

Let’s try it:

#!/usr/bin/env python
import time

# a very simple count down function
def count(n):
    while n > 0:
        n -= 1

start_time = time.time()                 # start time
count(100000000)
count(100000000)
stop_time = time.time()                  # stop time
elapsed_time = stop_time - start_time    # elapsed time (roughly)
print elapsed_time

On my machine with three runs I got run times of 17.37, 17.10 and 17.34 seconds which is pretty consistent.

And the threaded version:

#!/usr/bin/env python
import time
from threading import Thread

# the same count down function as before
def count(n):
    while n > 0:
        n -= 1

start_time = time.time()
t1 = Thread(target=count, args=(100000000,))
t2 = Thread(target=count, args=(100000000,))
t1.start(); t2.start()
t1.join(); t2.join()
stop_time = time.time()
elapsed_time = stop_time - start_time
print elapsed_time

On my machine with three runs I got run times of 23.14, 23.11 and 23.51 seconds, which is pretty consistent as well. It is also quite a bit slower than the sequential version of this program (roughly 33% slower!). David Beazley reports the threaded version taking almost twice as long on his system. Like David Beazley, I also tried running the threaded version of the Python code on a single CPU; on Linux you can use the "taskset" command to bind a program to a particular CPU. I got roughly the same results he did: on a single CPU the threaded version of the counting program took only 19.15, 19.38 and 19.29 seconds.

So we have two rather interesting questions: Why does the sequential version of the program take so much less time, and why does the threaded version run faster on a single CPU than when it runs across multiple CPU cores?

Why threading doesn’t always work as expected – the GIL

This unexpected behavior is caused by the Global Interpreter Lock (GIL):

A Global Interpreter Lock (GIL) is a mutual exclusion lock held by a programming language interpreter thread to avoid sharing code that is not thread-safe with other threads. There is always one GIL for one interpreter process.

Usage of a Global Interpreter Lock in a language effectively limits concurrency of a single interpreter process with multiple threads — there is no or very little increase in speed when running the process on a multiprocessor machine. Due to signaling with a CPU-bound thread, it can cause a significant slowdown, even on single processors.

This is largely what makes threaded programming in Python easy (you have no real worries about concurrency since only one thread at a time runs), and also what can make it challenging for performance (since only one thread at a time runs). The last sentence also offers an explanation as to why our sequential code is running faster than the threaded code: "Due to signaling with a CPU-bound thread, it can cause a significant slowdown" or in English: the time needed to constantly swap between the threads (even on an efficient system like Linux) is noticeable (and only gets worse with larger numbers of threads). But why does Python swap CPU intensive tasks so much?

When running a thread the Python GIL is held until one of several things happens: the thread completes running, the thread enters an IO operation, or 100 ticks are reached. The first case is simple: a thread can call sys.exit() and kill itself off; this of course is the end of things and the GIL is yielded back. The second case is also simple: if a thread blocks on an IO bound operation (disk, network, etc.) it is put to sleep (chances are the IO will take time anyway, so there is no point waiting idly by); it yields the GIL and another thread is given a chance to run. The third case is a little more nuanced. For starters, what exactly is a tick? A tick is basically one interpreter instruction. By limiting the number of ticks a thread can run, a CPU bound task that doesn't want to exit or isn't going to do IO won't end up hogging the CPU. The number of ticks a thread is allowed to use before being stopped can be controlled with sys.setcheckinterval(), which can be set on the fly. The most important thing to remember is that ticks are NOT time constrained.
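A quick sketch of adjusting the check interval (note: sys.setcheckinterval() is the CPython 2.x mechanism described here; CPython 3.2 replaced ticks with a time-based slice and sys.setswitchinterval(), so the hasattr() guards below are only there to let the snippet run on either interpreter):

```python
import sys

# CPython 2.x: the check interval is measured in ticks (default 100)
if hasattr(sys, "setcheckinterval"):
    sys.setcheckinterval(1000)   # switch less often: check every 1000 ticks

# CPython 3.2+: ticks were replaced by a time slice measured in seconds
if hasattr(sys, "setswitchinterval"):
    sys.setswitchinterval(0.05)  # ask for a thread switch roughly every 50 ms
```

Raising the interval reduces switching overhead for CPU-bound threads at the cost of responsiveness; lowering it does the opposite.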

If all of the above seems rather complicated there is a good reason for it. Python internally doesn't really handle the specific scheduling of threads; it leaves this up to the operating system. The rationale for this is simple: the Python developers feel that reinventing the wheel (handling thread scheduling) is probably not the best way to go, since modern operating systems have spent a significant amount of time and energy on getting threads to run efficiently and quickly (especially Linux). So internally Python essentially uses a form of cooperative multitasking (the thread yields the CPU when it exits, blocks on IO or hits the tick limit).
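The IO case is easy to observe in practice: a blocking call such as time.sleep() releases the GIL, so waiting threads genuinely overlap. A minimal sketch (the half-second sleep is a stand-in for a real disk or network wait):

```python
import threading
import time

def io_task():
    time.sleep(0.5)  # a blocking wait; the GIL is released while sleeping

start = time.time()
threads = [threading.Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
# four 0.5 second waits overlap, so the total is close to 0.5 seconds,
# not the 2 seconds a sequential version would take
print(elapsed)
```

This is why threading in Python works well for IO bound workloads even though it falls flat for the CPU bound counting example above.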

The trouble with ticks

So if a tick isn’t time constrained what’s the worst that can happen? Some CPU instructions take very little time to run, and conversely some take a very long time to run. The following program is very simple:

#!/usr/bin/env python

def bigtick():
    nums = xrange(100000000)
    -1 in nums

bigtick()

When disassembled (more on this later) it turns out to consist of:

>>> dis.dis(bigtick)
 2           0 LOAD_GLOBAL              0 (xrange)
 3 LOAD_CONST               1 (100000000)
 6 CALL_FUNCTION            1
 9 STORE_FAST               0 (nums)

 3          12 LOAD_CONST               2 (-1)
 15 LOAD_FAST                0 (nums)
 18 COMPARE_OP               6 (in)
 21 POP_TOP             
 22 LOAD_CONST               0 (None)
 25 RETURN_VALUE        
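For reference, the listing above comes from the standard library's dis module; the xrange fallback below is only a compatibility shim so the sketch also runs on Python 3 (where xrange was renamed to range), and the exact opcodes printed will differ between interpreter versions:

```python
import dis

try:
    xrange          # Python 2
except NameError:
    xrange = range  # Python 3 compatibility shim

def bigtick():
    nums = xrange(100000000)
    -1 in nums

dis.dis(bigtick)  # prints the bytecode listing for the function
```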

As you can see it's about 25 instructions, far fewer than the default limit of 100. So what happens when we run this? Well, not much of anything, but it does take a while (about 4 seconds on my machine). Much too slow; let's hit ctrl-c and get out of it.

As you may have noticed, hitting ctrl-c doesn't work; the program runs till it's done, taking its own sweet time. Why is this? The "-1 in nums" comparison is a single interpreter instruction that iterates over all 100,000,000 values, and Python only checks for pending signals between ticks, so the interrupt can't be delivered until that one instruction finishes. Just imagine what happens if you have 10 child threads running code like this (now we're deep into pathological-behavior land).

Working with the GIL

So how can we deal with some of this odd and downright bad behavior caused by the GIL? The first trick is to minimize the number of threads you are using. Because the operating system controls which thread is executed, it may execute a thread that has nothing to do (i.e. one still blocking on IO), meaning you pay (computationally speaking) for a thread swap, the (short) execution time of a thread that doesn't do anything, and then the task swap to another thread (hopefully one that does some actual work). Or worse, you have a thread holding a shared lock, but it is waiting on something (e.g. a database write to complete) before it yields the lock. You can end up with threads being run that can't do anything, leaving your computer spinning until the thread holding the lock is run and eventually completes, freeing the lock. Having threads sitting idly by will generally reduce performance (but of course if you get surges or spikes in work, having a spare pool of threads can be a very good thing).

As for dealing with the signal handling problem, the best advice I can find is to have the main thread handle signals and spawn off child threads that do the actual work (in other words, have the main program start the threads and then passively loop while the child threads work). This way the main thread won't be stuck in a long-running operation when a signal arrives, resulting in the signal being ignored.
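A minimal sketch of that pattern (the names here are my own, not from any library): the main thread starts the workers, then polls with a short sleep instead of a bare join(), so it stays idle and free to run a signal handler (e.g. for ctrl-c) promptly:

```python
import threading
import time

def count(n):
    while n > 0:
        n -= 1

# start the child threads that do the actual work
workers = [threading.Thread(target=count, args=(1000000,)) for _ in range(2)]
for t in workers:
    t.start()

# the main thread loops passively; because it wakes up every 0.1 seconds
# it gets regular chances to process pending signals between sleeps
while any(t.is_alive() for t in workers):
    time.sleep(0.1)
```

A join() with a short timeout inside a loop achieves the same effect; the key point is that the main thread never commits to one long uninterruptible wait.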

Like any performance advice, I suggest you take it with a grain of salt: first measure your performance, identify any issues and then fix them (which is covered in the third article).

