
Python Performance Part 3

May 31, 2010

Python Performance Part 1
Python Performance Part 2
Python Performance Part 4

In part one I covered a basic introduction to threads and the producer-consumer pattern. In part two I covered the Global Interpreter Lock (GIL) and some of the more pathological behavior it can cause. Now we’re going to go over how to profile code and how to instrument it to measure performance, so you can know with some certainty whether the code actually needs to be fixed (and how it can be fixed). Finally we’ll go over some significant ways to improve our handling of the producer-consumer pattern so it is more efficient and reliable.

Profiling vs. Performance measurement

Profiling and performance measurement are closely related, but there is a difference. Profiling will generally give you a great deal of detail: function calls, function returns and exception events, and since Python supports deterministic profiling it will do so with an extremely high degree of accuracy. The output of Python profiling is staggering: every single detail, including the number of calls (ncalls), total time (tottime), per-call time (tottime divided by ncalls), cumulative time (including sub-calls for a function, cumtime) and the filename, line number and function called (filename:lineno(function)). In other words, everything (this is one advantage of using an interpreted language). For most of us profiling should be used once the code has been written (premature optimization and all that) and used a bit (testing or otherwise).

Performance measurement tends to be coarser grained, for example adding hooks that record how long a database operation takes to return or how long it takes to generate a web page in total. You can also insert performance measuring code into your application and use it to continuously monitor system state/health (i.e. average DB query time over the last 5 minutes, etc.) and alert you if a problem occurs, something that profiling is not really suited for.

Profiling Python code

Python provides two main profilers: “profile” and “cProfile” [PROFILE] (a third one, “hotshot”, has been deprecated and may be removed in a future version of Python according to the documentation). “profile” and “cProfile” are essentially identical with respect to usage; however, the “cProfile” module (written in C) is much faster and won’t slow your code down significantly. The “profile” module will slow your code down significantly, but as it is written in Python it can easily be modified or extended should you need to. Because Python is an interpreted language the interpreter is active during code execution, so you don’t need to add instrumentation to your code; it is already present.

A simple example of profiling
>>> def count():
...     n = 100000
...     while n > 0:
...             n -= 1
...
>>> import cProfile
>>> cProfile.run('count()', 'countprof')
>>> import pstats
>>> p = pstats.Stats('countprof')
>>> print p
<pstats.Stats instance at 0x254fb48>
>>> p.print_stats()
Mon Jan  4 00:19:11 2010    countprof

 3 function calls in 0.012 CPU seconds

 Random listing order was used

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1    0.000    0.000    0.012    0.012 <string>:1(<module>)
 1    0.012    0.012    0.012    0.012 <stdin>:1(count)

An excellent video covering this is available from Mike Fletcher [Mike C. Fletcher]:

http://us.pycon.org/2009/conference/schedule/event/15/

What we can take away from this and other sources appears to be:

  1. You need to be able to properly measure performance before you profile code (luckily Python makes this trivial)
  2. Save your profiling results so you can compare things before and after (in other words did your optimization actually help)
  3. Check your algorithms; a stupid algorithm will always be slow
  4. Find function calls that are used often or are slow (and thus where you should start looking at first)
  5. Look for operations or results that can be pooled (i.e. using a single database query with a JOIN statement instead of multiple queries to get a collection of data)
  6. Look for operations that can be cached (keep your results around if you will use them again)
  7. Anytime you rely upon external code (especially opaque or closed source code) you are very much at the mercy of someone else
  8. Anytime you rely upon an external service (a database, reading or writing a file from the local system, getting a web page, etc.) you are very much at the mercy of someone else (in other words watch out for IO which may show up as lots of time spent sleeping)
  9. Only optimize one thing at a time unless you are incredibly brave
  10. Profiling threaded code can be problematic, particularly on multiprocessor systems

Of course I could be wrong. One last note: you can convert all this information into nice dot graphs using Gprof2Dot [GPROF2DOT].
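
As a quick illustration of points 2 and 4 above, here is a minimal sketch (assuming a profile was saved to the file “countprof”, as in the earlier example) showing how pstats can sort saved results so the most expensive functions float to the top:

import pstats

# load the stats saved earlier by cProfile.run('count()', 'countprof')
p = pstats.Stats('countprof')

# strip long path names, sort by cumulative time and show the ten worst offenders
p.strip_dirs().sort_stats('cumulative').print_stats(10)

# sorting by 'time' (tottime) instead highlights functions that are slow themselves,
# rather than functions that merely call slow code
p.sort_stats('time').print_stats(10)

Saving each run to a differently named file makes it easy to compare the numbers before and after an optimization.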

Using cProfile

Using cProfile is easy; you import it as a module and then run code using it as a wrapper:

import cProfile

cProfile.run('foo()', 'foo_profile')

This will execute the function foo() and output the data from the profile run to the file foo_profile. This is great if you just want to profile a specific class or function within your program. Alternatively if you want to profile the entire program you can use cProfile to execute the Python program in question and profile the entire thing:

# /usr/lib64/python2.6/cProfile.py -o base64-profile /usr/lib64/python2.6/base64.py /etc/resolv.conf
# python

>>> import pstats
>>> p = pstats.Stats('base64-profile')
>>> p.print_stats()
Tue Jan  5 02:37:04 2010    test-output.1

 282 function calls in 0.008 CPU seconds

 Random listing order was used

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 2    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/getopt.py:16(<module>)
 1    0.000    0.000    0.002    0.002 /usr/lib64/python2.6/base64.py:3(<module>)
 4    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
 1    0.000    0.000    0.000    0.000 {open}
 3    0.000    0.000    0.000    0.000 {len}
 1    0.006    0.006    0.008    0.008 {execfile}
 256    0.000    0.000    0.000    0.000 {chr}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/getopt.py:39(GetoptError)
 2    0.000    0.000    0.000    0.000 {binascii.b2a_base64}
 1    0.000    0.000    0.008    0.008 <string>:1(<module>)
 1    0.001    0.001    0.001    0.001 /usr/lib64/python2.6/base64.py:326(test)
 2    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
 1    0.000    0.000    0.000    0.000 {method 'sort' of 'list' objects}
 1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/base64.py:285(encode)
 1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
 1    0.000    0.000    0.000    0.000 {range}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/getopt.py:52(getopt)

<pstats.Stats instance at 0x7fad1e6c1ab8>

As you can see I ran the file /etc/resolv.conf through the Python base64 encode/decode utility. The first thing to jump out is the heavy use of the chr() function, which converts an integer into the corresponding character (it is called 256 times here), but it appears to be extremely efficient (virtually no CPU time was used). The only thing that really appears to have taken any time is the execfile() function, which parses and executes the script being profiled. As you can imagine, a Python application of any real size will generate a lot of output. Use of cProfile is covered in detail at http://docs.python.org/library/profile.html.

Measuring Python code performance

There are a couple of ways to measure Python code execution; the most accurate for small pieces of code is the timeit module’s Timer class. You basically wrap your statement in it and it reports how long the code took to run; it can also repeat the run, so for operations that are relatively fast you can run them a few thousand (or million) times to get an idea of how long they take. This module is generally best used when testing algorithms or other short pieces of code, since it won’t tell you where things are slow (just give an overall picture).
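
For example, a minimal sketch comparing two ways of building a list of squares might look like this; the statements being timed are purely hypothetical placeholders for whatever short piece of code you are interested in:

import timeit

# two candidate implementations; each is passed to Timer as a string
list_comp = timeit.Timer('result = [i * i for i in range(1000)]')
map_lambda = timeit.Timer('result = map(lambda i: i * i, range(1000))')

# run each statement 10,000 times and print the total elapsed time in seconds
print list_comp.timeit(number=10000)
print map_lambda.timeit(number=10000)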

But what about instrumenting code for long-term monitoring, or measuring the speed of external software and services such as database queries? One simple strategy is to use the time.time() function to get the current time at the start and end of the code you are interested in; subtract the two and you get the elapsed time. This allows you to wrap a call to a database: if the query takes more than, say, 5 seconds you can send a warning of some sort (log an alert, send an email, etc.). If you are processing database entries you can record the primary key and the start time when you start processing the entity, and once you are done processing it (possibly after shoving it through several work queues, etc.) record the finish time, ideally logging both results.
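
A rough sketch of this idea follows; the timed_query() helper, the 5-second threshold and the use of a generic DB-API cursor are all assumptions made for illustration:

import time
import logging

QUERY_WARN_SECONDS = 5

def timed_query(cursor, sql, params=()):
    # run a database query and warn if it takes longer than the threshold;
    # cursor is assumed to be any DB-API 2.0 cursor (MySQLdb, sqlite3, etc.)
    start_time = time.time()
    cursor.execute(sql, params)
    elapsed = time.time() - start_time
    if elapsed > QUERY_WARN_SECONDS:
        logging.warning("slow query (%.2f seconds): %s", elapsed, sql)
    return cursor.fetchall()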

A note on long term performance monitoring

A major advantage of architecting performance monitoring in from the start is that you can generate long-term statistics about your code, application and environment: things like how long database queries take (swap the disks for solid state disks and see what difference, if any, it makes), network latency/speed issues, and so on. My advice would be to measure the performance of external libraries and services, since profiling won’t really address this. I would also measure the overall performance of things (how long from page request to page generation, or page generation plus writing of back-end data, etc.). Especially if you are splitting work up using work queues, profiling won’t help since you can’t easily track the processing of a specific message/request/etc. within that context.

Additionally, if you do this using function decorators you can simply create a timing function and wrap your database query class (or whatever function you want to monitor) without having to modify your code significantly.
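
A minimal sketch of such a timing decorator (the names report_timing and fetch_report_data are made up for this example):

import time
import functools

def report_timing(func):
    # decorator that prints how long the wrapped function took to run
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            print "%s took %.3f seconds" % (func.__name__, time.time() - start_time)
    return wrapper

@report_timing
def fetch_report_data():
    # stands in for a real database query or other slow operation
    time.sleep(0.5)

fetch_report_data()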

Disassembling Python code

I thought I should add a quick note on disassembling Python code (so you can find out approximately how many ticks it takes). In cases where you suspect a computation or task is causing a lot of context switching you may wish to look at increasing the check interval via sys.setcheckinterval(). But simply twiddling the knobs until you get a better result is probably not the best way to accomplish this. Using the dis module [DIS] you can disassemble code; see “The trouble with ticks” in the second part of this series for an example. You may be lucky and find that raising the check interval helps (e.g. if your computation takes 120 ticks it can now complete in one run rather than being interrupted), but I would urge extreme caution as it may have unintended side effects that are difficult to track down.
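
For reference, a tiny sketch of checking and raising the check interval (the value of 1000 is an arbitrary example, not a recommendation):

import sys

# the default is 100 ticks between checks for a thread switch
print sys.getcheckinterval()

# raise it so CPU-bound threads are interrupted less often
# (use with caution, as discussed above)
sys.setcheckinterval(1000)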

Shared resources, locking and performance

So what happens if you profile or time your code in testing and find that it is reasonably fast, but when you run it in a multi-threaded or multi-process environment your code crawls to a standstill or takes far more time than it should? Chances are you have some form of shared resource or locking that doesn’t create contention in small trials but does in production. The best way to deal with this is, of course, to avoid having any resources that require locking or can’t easily be shared. I have also found that the fewer shared resources, locks and other synchronization issues I have to deal with, the fewer bugs I create (imagine that). The two best strategies here are to profile and to monitor your code: code that spends a lot of time sleeping may be waiting on a shared lock, and if something takes too long to run (such as a database query) you may have an external resource that doesn’t share well (e.g. a MySQL MyISAM table forces table-level locking for inserts, which can really kill performance when you have multiple INSERT or UPDATE statements happening at the same time).

Mapping critical code sections

If you must have shared resources you want to minimize their impact as much as possible: identify the minimal amount of code that needs to hold a lock on a shared resource and wrap only that code with the lock. One note: if your critical section relies on external code or services all bets are off (if you hold a lock, then do a DB query and the DB query is very slow or fails, you are out of luck and your application will grind to a halt). Look for code that establishes a lock using locking primitives such as acquire() and release() or threading.Lock() and multiprocessing.Lock(). Additionally, any databases you connect to may use table-level locking, so if one thread or process is running an INSERT or an UPDATE no other thread or process will be able to write to that table. You also want to ensure there are no deadlocks (i.e. thread #1 holds lock A while waiting to acquire lock B while thread #2 holds lock B and is waiting to acquire lock A; neither one gives up its lock and neither one can move forward). If you must have multiple locks I would strongly suggest using a locking hierarchy (e.g. always lock A and then lock B so no deadlock can occur); there are several excellent articles available on this topic. [LOCKING HIERARCHY]
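
A small sketch of both ideas, keeping the critical section tiny and always acquiring the locks in the same order (the shared_totals dictionary and the two locks are invented for this example):

import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
shared_totals = {}

def update_both(key, value):
    # do any slow preparation outside the critical section
    prepared = value * 2
    # always acquire lock A before lock B (a fixed locking hierarchy) so two
    # threads can never deadlock by grabbing them in the opposite order
    lock_a.acquire()
    try:
        lock_b.acquire()
        try:
            # keep the locked region as small as possible: just the shared update
            shared_totals[key] = shared_totals.get(key, 0) + prepared
        finally:
            lock_b.release()
    finally:
        lock_a.release()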

A note on external C libraries

According to some sources Python is generally capable of running at around 10-20% of the speed of a compiled C program implementing the same algorithm (due to interpreter overhead, etc.). Generally speaking this won’t be a significant problem since many applications are IO bound (networks, disks, etc.) and modern CPUs are quite fast. But what happens when you start pegging the CPU because of computation being done in Python? Again there is good news here: Guido van Rossum comes from a scientific computing (a.k.a. number crunching) background and this is one of the first things Python was designed to handle. You can (with relative ease) write C or C++ code and use it to extend Python [PYTHON AND C]. This has already been done with a number of core modules; as mentioned above, the cProfile and profile modules accomplish the same work, but as cProfile is written in C it runs much faster. If you need to do heavy computation you will want to check out NumPy and SciPy, which are largely implemented in C (and Fortran) and are quite fast. [SCIPY]
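
As a rough illustration of the difference (assuming NumPy is installed; the loop counts and statements are arbitrary), you can compare a pure Python summation against the same work pushed down into C by NumPy:

import timeit

# pure Python loop
python_sum = timeit.Timer('total = 0\nfor i in xrange(1000000): total += i')

# the same computation done in C via NumPy
numpy_sum = timeit.Timer('numpy.arange(1000000).sum()', setup='import numpy')

print python_sum.timeit(number=10)
print numpy_sum.timeit(number=10)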

An Extension to the Advanced Threading pattern – spinlocks

Normally we would use the threading module’s join() method, but in this case we need a little more flexibility, so we use a spinlock: the thread simply spins, checking the lock repeatedly, and once the lock is released it continues doing whatever it is supposed to do.

One of the main problems with the producer-consumer pattern is starvation. What if the producer can’t go fast enough to feed the workers and the workers have to wait for work? This in and of itself is not an insurmountable problem; however, we also want the Python script to run and exit in a timely manner (i.e. when all the work is done) and not before it is done, so how do we avoid a starvation situation triggering an early end of the program? The simplest method is to use a lock hierarchy: we have the producer set a spinlock that it clears once it is done feeding data into the queue. The consumers also set a spinlock; if they notice that the queue is empty they then check the producer spinlock. If it is not set they know it is a simple starvation problem and that they should keep rechecking the queue until there is more work. However, if they check the producer spinlock and it is marked as done, then they know there is no more work coming and they can also exit.

One of the simpler ways to implement this is to create an array for each pool of threads (i.e. one for producers and one for consumers). Use the thread ID as the index and a value of 0 or 1 to indicate whether the thread is done; then you can simply use something like “while (0 in spinlockProducerThread)” to see if any threads are still running. Ultimately this pattern can also be applied to the main thread: we simply have the main thread spawn off the producer and consumer threads and then spin until all the locks are released, indicating that all the work is done and the threads are finished. Additionally this allows us to have threads exit when they are done (i.e. once the producer thread has pushed all the work into the queue there is no reason to leave it running; because the spinlock is external to the thread it can simply exit, unlike if we were relying on join()).

This pattern also works really well if you have multiple levels of producers and consumers, i.e. a chain of modules that downloads a web page, processes the content within it, then creates something based on the content and so on. It also lends itself to distributed systems: taking the locking mechanism and making it external to the running code (for example by using a database table as a shared blackboard to communicate) allows multiple systems to join, do work and then leave (allowing you to dynamically allocate resources).

web_get_producer_consumer_basic_pattern.py (outline of code)
#!/usr/bin/env python
import sys
import time
import thread
import Queue

# GetWorkFromDB(), ProcessWork(), number_of_producer_threads and
# number_of_consumer_threads are assumed to be defined elsewhere

class ProducerThread:
    def __init__(self, myID):
        self.myID = myID
        while 1:
            work_todo = GetWorkFromDB()
            for work_item in work_todo:
                workQueue.put(work_item)
            break
        # mark this producer as done before exiting
        spinlockProducerThread[myID] = 1
        sys.exit()

class ConsumerThread:
    def __init__(self, myID):
        self.myID = myID
        while 1:
            try:
                work_item = workQueue.get(block=False)
            except Queue.Empty:
                """
                The queue is empty, but it may only be temporarily empty (the
                producers are still running) or it may be empty because all
                the work is done
                """
                if (0 not in spinlockProducerThread):
                    # all producers are done and the queue is empty, so exit
                    spinlockConsumerThread[myID] = 1
                    sys.exit()
            else:
                ProcessWork(work_item)

if __name__ == '__main__':

    spinlockProducerThread = [0] * number_of_producer_threads
    spinlockConsumerThread = [0] * number_of_consumer_threads

    workQueue = Queue.Queue()

    for i in range(number_of_producer_threads):
        try:
            thread.start_new_thread(ProducerThread, (i,))
        except thread.error, e:
            # mark the slot as done so the main thread does not wait forever
            spinlockProducerThread[i] = 1
            print "ProducerThread thread creation failed"
            print e

    for i in range(number_of_consumer_threads):
        try:
            thread.start_new_thread(ConsumerThread, (i,))
        except thread.error, e:
            spinlockConsumerThread[i] = 1
            print "ConsumerThread thread creation failed"
            print e

    # main thread spins until every producer and consumer marks itself done
    while (0 in spinlockProducerThread or 0 in spinlockConsumerThread):
        time.sleep(1)

[PROFILE]

http://docs.python.org/library/profile.html

[Mike C. Fletcher]

Mike C. Fletcher (http://www.vrplumber.com)

[GPROF2DOT]

http://code.google.com/p/jrfonseca/wiki/Gprof2Dot

[DIS]

http://docs.python.org/library/dis.html

[LOCKING HIERARCHY]

http://www.ddj.com/architect/204801163

[PYTHON AND C]

http://docs.python.org/extending/extending.html

[SCIPY]

http://www.scipy.org/

Python Performance Part 2

May 31, 2010

Python Performance Part 1
Python Performance Part 3
Python Performance Part 4

In part one I covered a basic introduction to threads and the producer-consumer pattern. Now in part two I’m going to cover some (unfortunately common) pathological cases where threading in Python can make things much worse.

An example of when threading makes things worse

Last week we covered basic Python threading strategies and work queues. So if you have a program and want to speed it up, simply thread it and it’ll run faster, right? Wrong. For certain types of workloads and computation, using threading in Python will make things worse (in some cases much worse).

Take for example a simplified program (courtesy of David Beazley [David Beazley]) that does some number crunching (e.g. counting down from a large number to zero) several times. This eliminates a lot of variables (memory, disk IO, network IO, etc.) and should run very quickly on any modern CPU. If we want to count down from the large number twice it should take about as long to do it sequentially as in parallel, since our problem is CPU bound, right?

Let’s try it:

count_sequential.py
#!/usr/bin/env python
import time

# a very simple count down function
def count(n):
    while n > 0:
        n -= 1

# start time
start_time = time.time()
count(100000000)
count(100000000)
# stop time
stop_time = time.time()
# elapsed time (roughly)
elapsed_time = stop_time - start_time
print elapsed_time

On my machine with three runs I got run times of 17.37, 17.10 and 17.34 seconds which is pretty consistent.

And the threaded version:

count_threaded.py
#!/usr/bin/env python
import time
from threading import Thread

def count(n):
    while n > 0:
        n -= 1

start_time = time.time()
t1 = Thread(target=count, args=(100000000,))
t1.start()
t2 = Thread(target=count, args=(100000000,))
t2.start()
t1.join(); t2.join()
stop_time = time.time()
elapsed_time = stop_time - start_time
print elapsed_time

On my machine with three runs I got run times of 23.14, 23.11 and 23.51 seconds, which is pretty consistent as well. It is also quite a bit slower than the sequential version of this program (roughly 33% slower!). David Beazley reports the threaded version taking almost twice as long on his system. Like him, I also tried running the threaded version of the Python code on a single CPU; on Linux you can use the “taskset” command to bind a program to a particular CPU. I got roughly the same results he did: on a single CPU the threaded version of the counting program took 19.15, 19.38 and 19.29 seconds.

So we have two rather interesting questions: why does the sequential version of the program take so much less time, and why does the threaded version run faster on a single CPU than when it runs across multiple CPU cores?

Why threading doesn’t always work as expected – the GIL

This unexpected behavior is caused by the Global Interpreter Lock (GIL):

A Global Interpreter Lock (GIL) is a mutual exclusion lock held by a programming language interpreter thread to avoid sharing code that is not thread-safe with other threads. There is always one GIL for one interpreter process.

Usage of a Global Interpreter Lock in a language effectively limits concurrency of a single interpreter process with multiple threads — there is no or very little increase in speed when running the process on a multiprocessor machine. Due to signaling with a CPU-bound thread, it can cause a significant slowdown, even on single processors.

This is largely what makes threaded programming in Python easy (you have no real worries about concurrency since only one thread runs at a time), and also what can make it challenging for performance (since only one thread runs at a time). The last sentence also offers an explanation as to why our sequential code runs faster than the threaded code: “Due to signaling with a CPU-bound thread, it can cause a significant slowdown”, or in English: the time needed to constantly swap between threads (even on an efficient system like Linux) is noticeable (and only gets worse with larger numbers of threads). But why does Python swap CPU-intensive tasks so much?

When a thread is running it holds the GIL until one of several things happens: the thread finishes running, the thread enters an IO operation, or 100 ticks are reached. The first case is simple: a thread can call sys.exit() and kill itself off; this of course is the end of things and the GIL is given back. The second case is also simple: if a thread blocks on an IO-bound operation (disk, network, etc.) it is put to sleep (chances are the IO will take time anyway, so there is no point waiting idly by), it yields the GIL and another thread is given a chance to run. The third case is a little more nuanced. For starters, what exactly is a tick? A tick is basically an interpreter instruction. By limiting the number of ticks a thread can run, a CPU-bound task that doesn’t want to exit or isn’t going to do IO won’t end up hogging the CPU. The number of ticks a thread is allowed to use before being asked to yield can be controlled with sys.setcheckinterval(), which can be set on the fly. The most important thing to remember is that ticks are NOT time constrained.

If all of the above seems rather complicated there is a good reason for it. Python internally doesn’t really handle the specific scheduling of threads; it leaves this up to the operating system. The rationale for this is simple: the Python developers feel that reinventing the wheel (handling thread scheduling) is probably not the best way to go, since modern operating systems have spent a significant amount of time and energy on getting threads to run efficiently and quickly (especially Linux). So internally Python essentially uses a form of cooperative multitasking (a thread yields the CPU when it exits, blocks on IO or hits the tick limit).

The trouble with ticks

So if a tick isn’t time constrained what’s the worst that can happen? Some CPU instructions take very little time to run, and conversely some take a very long time to run. The following program is very simple:

#!/usr/bin/env python

def bigtick():
    nums = xrange(100000000)
    -1 in nums

Which, when disassembled (more on this later), turns out to consist of:

>>> dis.dis(bigtick)
 2           0 LOAD_GLOBAL              0 (xrange)
 3 LOAD_CONST               1 (100000000)
 6 CALL_FUNCTION            1
 9 STORE_FAST               0 (nums)

 3          12 LOAD_CONST               2 (-1)
 15 LOAD_FAST                0 (nums)
 18 COMPARE_OP               6 (in)
 21 POP_TOP             
 22 LOAD_CONST               0 (None)
 25 RETURN_VALUE        

As you can see it is only about ten bytecode instructions, far fewer than the 100-tick limit imposed by default. So what happens when we run this? Well, not much of anything, but it does take a while (about 4 seconds on my machine). Much too slow; let’s hit ctrl-c and get out of it.

As you may have noticed, hitting ctrl-c doesn’t work; the program runs until it’s done, taking its own sweet time. Why is this? While the thread is running, signals sent to it are essentially blocked. Just imagine what happens if you have 10 child threads running code like this (now we’re deep into pathological-behavior land).

Working with the GIL

So how can we deal with some of this odd and downright bad behavior caused by the GIL? The first trick is to minimize the number of threads you are using. Because the operating system controls which thread is executed, it may execute a thread that has nothing to do (i.e. one still blocking on IO), meaning you pay (computationally speaking) for a thread swap, the (short) execution time of a thread that doesn’t do anything, and then the swap to another thread (hopefully one that does some actual work). Or worse, you have a thread holding a shared lock, but it is waiting on something (e.g. a database write to complete) before it yields the lock. You can end up with threads being run that can’t do anything, leaving your computer spinning until the thread holding the lock is run and eventually completes, freeing the lock. Having threads sitting idly by will generally reduce performance (but of course if you get surges or spikes in work, having a spare pool of threads can be a very good thing).

As for a way of dealing with the signal handling problem, the best advice I can find is to have the main thread handle signals and spawn off child threads that do the actual work (in other words have the main program start the threads and then passively loop while the child threads work). This way the main thread won’t be busy working on something when a signal arrives, which would result in the signal being ignored.
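
A minimal sketch of that arrangement (the counting worker, the sleep interval and the use of daemon threads are assumptions chosen for illustration):

#!/usr/bin/env python
import signal
import sys
import time
from threading import Thread

def worker(n):
    while n > 0:
        n -= 1

def handle_sigint(signum, frame):
    print "interrupted, exiting"
    sys.exit(1)

# the signal handler is registered in (and runs in) the main thread
signal.signal(signal.SIGINT, handle_sigint)

workers = [Thread(target=worker, args=(100000000,)) for i in range(2)]
for t in workers:
    t.setDaemon(True)   # let the process exit even if a worker is still counting
    t.start()

# the main thread does no heavy work; it just sleeps in short intervals so it
# can notice ctrl-c promptly
while any(t.isAlive() for t in workers):
    time.sleep(0.1)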

Like any performance advice I suggest you take it with a grain of salt and first measure your performance, identify any issues and then fix them (which is covered in the third article).

Appendix

[David Beazley]

http://www.dabeaz.com/python/GIL.pdf

Python Performance Part 1

May 31, 2010

Python Performance Part 2
Python Performance Part 3
Python Performance Part 4

If you’ve been programming in Python, chances are you have run into a situation where your code takes a lot longer to run than it probably should; and if you haven’t used Python, I’m going to show you how easy it is to write high-performance and scalable code that will run fast.

These articles do not cover why you should use Python (it’s cool, it ships standard on every Linux distribution, all the kids at Google use it, etc.). They specifically cover how you can speed up Python performance (parallelization, code profiling, performance measurement, etc.) and include code examples (for brevity I will use pseudo-code for some of the longer examples). I will also cover some Python internals so that you will understand why speeding up code doesn’t always work as expected (and what you can do about it). Please note that older versions of Python (e.g. 2.4.3 as shipped with Red Hat Enterprise Linux and CentOS) do not support some of the things discussed in these articles. The reference platform used for this article is Fedora 12 (basically Python 2.6.4, Beanstalkd 1.4.2, memcached 1.4.4 and MySQL 5.1.41).

A (super) quick introduction to Python threading

You have two basic modules that provide threading capabilities in Python. The first (and oldest) module is “thread” [THREAD], which provides low-level primitives such as creating a thread, exiting, and some lock primitives so you can use shared resources safely. You don’t want to use this module, for reasons I will explain shortly. The second module is “threading” [THREADING], which builds upon the low-level primitives provided by “thread” (similar to how SocketServer, HTTPServer, etc. build on lower-level modules). In Python the main program runs as a single thread; from this thread you can create additional child threads (these run within the same process space and thus have shared memory). The main trick is that you need the main thread to keep running while the child threads are doing their thing; if the main thread exits then all the child threads get killed off unceremoniously, potentially leaving a mess. This is why you don’t want to use “thread”, as you would need to create your own spinlock to hold the main thread open while the child threads run. With the “threading” module the system provides the join() method, which “blocks the calling thread until the thread whose join() method is called is terminated.”

It should be noted that in many cases the main thread will continue running until threads started with threading’s Thread.start() are done, but there are no guarantees, so I strongly suggest you use the join() method or create your own spinlock (especially if using older versions of Python).

Basic Threading Pattern

The basic threading pattern in Python (and most languages) is to take a bunch of work and split it up among different threads (insightful, yes?). The idea is that instead of doing things sequentially (one at a time) you do them in parallel (more than one at a time). For many tasks this works well, especially if the tasks are largely bound by IO (Input and Output) delays (especially for network related stuff).

web_get_parallel.py (no join())

#!/usr/bin/env python
import urllib2
from threading import Thread

# list of web pages to get
hosts = ["http://lwn.net/", "http://seifried.org/", "http://google.com/"]

# define a get URL function
def getURL(URL):
    urllib2.urlopen(URL)
    print "got URL: " + URL

# loop through the list of URLs
for item in hosts:
    # define a thread
    t = Thread(target=getURL, args=(item,))
    # start the thread
    t.start()
But is this really faster than doing the work sequentially?

web_get_sequential.py

#!/usr/bin/env python
import urllib2

# list of web pages to get
hosts = ["http://lwn.net/", "http://seifried.org/", "http://google.com/"]

# define a get URL function
def getURL(URL):
    urllib2.urlopen(URL)
    print "got URL: " + URL

# loop through the list of URLs and get each one in turn
for item in hosts:
    getURL(item)

So based on this, the parallel threaded example (not using join()) takes about 0.5 seconds according to the UNIX “time” command, and the sequential version takes 6 seconds. This is obviously not right: assuming the sequential version takes about the same amount of time to get each URL (best case), the parallel version should still take about 2 seconds (6/3 = 2).

So why is it only taking 0.5 seconds? Probably because the main thread exited before the child threads were done running (which means we didn’t finish our work). So let’s fix that:

web_get_parallel.py (with join())
#!/usr/bin/env python
import urllib2
from threading import Thread

thread_list = []
# list of web pages to get
hosts = ["http://lwn.net/", "http://seifried.org/", "http://google.com/"]

# define a get URL function
def getURL(URL):
    urllib2.urlopen(URL)
    print "got URL: " + URL

# loop through the list of URLs
for item in hosts:
    # define a thread
    t = Thread(target=getURL, args=(item,))
    # start the thread
    t.start()
    thread_list.append(t)

# loop through the list of threads
for thread in thread_list:
    # join each thread
    thread.join()

Great, the code now takes about 2.5 seconds, and we are actually getting the web pages correctly! As you can see this is a significant speedup: the parallel version takes only as long as the slowest web page (probably my site), while the sequential example takes the time needed to get all the pages added together (we’ll cover how to measure performance properly later in this series).

Why Threading Helps Performance

The reason that we get such a significant speedup in this case is that the program spends the majority of its time waiting for the remote web servers to respond to the requests it has made, and only about 1% (in other words not very much) of the time is spent creating and sending those requests. When a Python thread does something IO related (reading or writing a file, sending a network request, waiting for a response, etc.) it essentially goes to sleep, at which point a different thread is given a chance to execute (so instead of simply waiting around, the program can do other work as well). Additionally, because threads share memory space the program won’t use much more memory (whereas if you split the work between different processes each one would have its own memory, leading to much duplication). A final advantage of memory sharing is that variables and objects can be accessed by multiple threads; there is no need to engineer inter-process communication (although with the multiprocessing module this is trivial, more on that in the fourth article of this series).

But what happens if we have a much larger workload (instead of 3 domains to get we have 3 million) and a much more complex one (we have to get the web pages, extract data from them and then process that data, for example)? If we try to start up 3 million threads to handle each domain individually, I can pretty much guarantee you are not going to increase performance.

Advanced Threading pattern

So starting up one thread per task is probably not the most optimal way to go, especially if the workload is large or if it varies. But how can we efficiently divvy up the workload among a (possibly indeterminate) number of threads? Taking the total number of things to do, dividing by the number of threads and then assigning that many things to each thread is not optimal. What if one thread gets saddled with URLs that all take much longer than average to download? We are then left with one thread still running while the rest have finished their work and are waiting idly. Also, the amount of work may not be known in advance, or we may want to be able to add more work; we may be reading URLs from a database or a file and not know in advance how many we have.

This is where Python work queues come in [WORKQUEUE]. A work queue provides a way to handle and share multiple objects, you put items into a queue and then retrieve them, the queue ensures that objects are entered correctly and removed correctly (so with multiple threads accessing a queue you don’t need to worry about locking and so on, you simply write to the queue or read from it and it works). Queues in Python can be FIFO (first in, first out, think of a pipe), LIFO (last in first out, think of a stack) and priority based (you can assign priorities to objects and then make sure that higher priority items are processed before lower priority items and so on).
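
A quick sketch of the three queue types (LifoQueue and PriorityQueue assume Python 2.6 or later):

import Queue

fifo = Queue.Queue()          # first in, first out (a pipe)
lifo = Queue.LifoQueue()      # last in, first out (a stack)
prio = Queue.PriorityQueue()  # lowest value retrieved first

for n in (3, 1, 2):
    fifo.put(n)
    lifo.put(n)
    prio.put(n)

print [fifo.get() for i in range(3)]   # [3, 1, 2]
print [lifo.get() for i in range(3)]   # [2, 1, 3]
print [prio.get() for i in range(3)]   # [1, 2, 3]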

Queue syntax is very simple: you create a queue, then you put objects into it and get objects out of it. This allows you to use the producer/consumer pattern [Producer-consumer problem]. The following example has a producer thread that reads the URL entries into a queue, and a group of worker threads that each pull an item from the queue, process it and then repeat until the queue is empty, at which time they exit. In order to ensure that the main thread runs long enough for the producer and consumer child threads to finish their work, we’ll simply use the join() method to hold the main thread open until all the children have exited.

web_get_producer_consumer_basic_pattern.py
#!/usr/bin/env python

import urllib2
import sys
import Queue
from threading import Thread

host_list = ["http://lwn.net/", "http://seifried.org/", "http://google.com/", "http://yahoo.com/", "http://gmail.com/"]

thread_list = []
URL_work_queue = Queue.Queue()
number_of_consumer_threads = 3

def putURL(URL_list):
    for URL_item in URL_list:
        URL_work_queue.put(URL_item)

def getURL(thread_id):
    while 1:
        try:
            URL_toget = URL_work_queue.get(block=False)
        except Queue.Empty:
            print "thread exiting, id: " + str(thread_id)
            sys.exit()
        urllib2.urlopen(URL_toget)
        print "got URL: " + URL_toget

# fill the queue with work and block until we are done filling the queue

producer_thread = Thread(target=putURL, args=(host_list,))
producer_thread.start()
producer_thread.join()

# we can now start consumers

for i in range(number_of_consumer_threads):    
    t = Thread(target=getURL, args=(i,))
    t.start()
    thread_list.append(t)

for thread in thread_list:
    thread.join()

This pattern is efficient and can easily be extended to have multiple sets of threads and queues connecting them. Generally speaking I try to break the work up into computationally intensive pieces (such as parsing a web page for content) and IO-intensive tasks (such as requesting a web page or reading or writing a file). This allows tasks that do not require a lot of computation to quickly yield the GIL (more on this in the next article), allowing another thread to run (you want to do IO, especially network IO, in parallel as much as possible since it is so slow). If we were to download the web page and process it within the same thread, for example, we would be limiting the number of simultaneous downloads since we’d be processing web pages when we could also be downloading them.

Now, as for code quality, the above code has a number of significant problems (mostly architectural, but one or two in the implementation) that will be addressed in later articles in this series.

Work Queues

As you can see, work queues provide an ideal mechanism for linking pools of threads. Implementing queues from scratch in a language such as C is a surprisingly complex task: what if two threads try to write to the queue at the same time? How do you make sure that two separate entries are created properly, or, if multiple threads are reading from the queue, how do you ensure that objects are handed out properly? The good news is that in Python, because it provides higher-level primitives like threads and queues (and because of the Global Interpreter Lock), you really don’t have to care or spend much time making sure you get it right, since it’s built in.

Appendix

[THREAD]

http://docs.python.org/library/thread.html

[THREADING]

http://docs.python.org/library/threading.html

[WORKQUEUE]

http://docs.python.org/library/queue.html

[Producer-consumer problem]

http://en.wikipedia.org/wiki/Producer-consumer_problem
