Python Performance Part 1

(This is the first article in a four-part series; see also Python Performance Part 2, Python Performance Part 3 and Python Performance Part 4.)

If you’ve been programming in Python, chances are you have run into a situation where your code takes a lot longer to run than it should. And if you haven’t used Python, I’m going to show you how easy it is to write high performance, scalable code that will run fast.

These articles do not cover why you should use Python (it’s cool, it ships standard on every Linux distribution, all the kids at Google use it, etc.). They specifically cover how you can speed up Python (parallelization, code profiling, performance measurement, etc.) and include code examples (for brevity I will use pseudo code for some of the longer examples). I will also cover some Python internals so that you understand why speeding up code doesn’t always work as expected (and what you can do about it). Please note that older versions of Python (e.g. 2.4.3 as shipped with Red Hat Enterprise Linux and CentOS) do not support some of the things discussed in these articles. The reference platform used for these articles is Fedora 12 (Python 2.6.4, Beanstalkd 1.4.2, memcached 1.4.4 and MySQL 5.1.41).

A (super) quick introduction to Python threading

There are two basic modules that provide threading capabilities in Python. The first (and oldest) is “thread” [THREAD], which provides low-level primitives such as creating a thread, exiting a thread and some lock primitives so you can use shared resources safely. You don’t want to use this module, for reasons I will shortly explain. The second is “threading” [THREADING], which builds upon the low-level primitives provided by “thread” (much as SocketServer, HTTPServer, etc. build on lower level modules). In Python the main program runs as a single thread; from this thread you can create additional child threads (these run within the same process space and thus share memory). The main trick is that the main thread must keep running while the child threads are doing their thing; if the main thread exits then all the child threads get killed off unceremoniously, potentially leaving a mess. This is why you don’t want to use “thread”: you would need to create your own spinlock to hold the main thread open while the child threads run. With the “threading” module the system provides the join() method, which “blocks the calling thread until the thread whose join() method is called is terminated.”

It should be noted that in many cases the main thread will continue running until threads started with Thread.start() are done, but there are no guarantees, so I strongly suggest you use the join() method or create your own spinlock (especially if you are using an older version of Python).
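
To make the pattern concrete before we get to the web examples, here is a minimal sketch (the file name and the sleeping worker are mine, purely for illustration):

join_sketch.py

#!/usr/bin/env python
# A minimal sketch of the join() pattern; the sleeping worker is
# just a stand-in for real work.
import time
from threading import Thread

def worker():
    time.sleep(2)  # pretend to do two seconds of work
    print "worker done"

t = Thread(target=worker)
t.start()
t.join()  # block the main thread until the worker has finished
print "main thread exiting"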

Basic Threading Pattern

The basic threading pattern in Python (and most languages) is to take a bunch of work and split it up among different threads (insightful, yes?). The idea is that instead of doing things sequentially (one at a time) you do them in parallel (more than one at a time). For many tasks this works well, especially if the tasks are largely bound by IO (input and output) delays, as network related work usually is.

web_get_parallel.py (no join())

#!/usr/bin/env python
import urllib2
from threading import Thread

# list of web pages to get
hosts = ["http://lwn.net/", "http://seifried.org/", "http://google.com/"]

# define a get URL function
def getURL(URL):
    urllib2.urlopen(URL)
    print "got URL: " + URL

# loop through the list of URLs
for item in hosts:
    # define a thread
    t = Thread(target=getURL, args=(item,))
    # start the thread
    t.start()

But is this really faster than doing the work sequentially?

web_get_sequential.py

#!/usr/bin/env python
import urllib2

# list of web pages to get
hosts = ["http://lwn.net/", "http://seifried.org/", "http://google.com/"]

# define a get URL function
def getURL(URL):
    urllib2.urlopen(URL)
    print "got URL: " + URL

# loop through the list of URLs and get each one
for item in hosts:
    getURL(item)

Based on this, the parallel threaded example (not using join()) takes about 0.5 seconds according to the UNIX “time” command, and the sequential version takes 6 seconds. This is obviously not right: assuming each URL takes about the same amount of time to get (the best case), the parallel version should take 2 seconds (6/3 = 2).

So why is it only taking 0.5 seconds? Probably because the main thread exited before the child threads were done running (which means we didn’t actually finish our work). So let’s fix that:

web_get_parallel.py (with join())

#!/usr/bin/env python
import urllib2
from threading import Thread

thread_list = []
# list of web pages to get
hosts = ["http://lwn.net/", "http://seifried.org/", "http://google.com/"]

# define a get URL function
def getURL(URL):
    urllib2.urlopen(URL)
    print "got URL: " + URL

# loop through the list of URLs
for item in hosts:
    # define a thread
    t = Thread(target=getURL, args=(item,))
    # start the thread
    t.start()
    thread_list.append(t)

# loop through the list of threads and join each one
for thread in thread_list:
    thread.join()

Great, the code now takes about 2.5 seconds, and we are actually getting the web pages correctly! As you can see this is a significant speed-up: the parallel version takes only as long as the slowest web page (probably my site), while the sequential version takes as long as all the pages combined (we’ll cover how to measure performance properly later in this series).

Why Threading Helps Performance

The reason we get such a significant speed-up in this case is that the program spends the majority of its time waiting for the remote web servers to respond with the web pages it has requested, and only about 1% (in other words not very much) of its time creating and sending those requests. When a Python thread does something IO related (reading or writing a file, sending a network request, waiting for a response, etc.) it essentially goes to sleep, at which point a different thread is given a chance to execute (so instead of simply waiting around, the program can do other work as well). Additionally, because threads share memory space, the program won’t use much more memory (whereas if you split the work between different processes each one would have its own memory, leading to much duplication). A final advantage of memory sharing is that variables and objects can be accessed by multiple threads; there is no need to engineer inter-process communications (although with the multiprocessing module this is trivial; more on this in the fourth article of this series).
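
As an aside, here is a minimal sketch of the shared memory point (the file name and the Counter class are my own illustration, not part of the web examples): several threads update a single object, and since the update is not atomic we guard it with a lock from the threading module.

shared_counter_sketch.py

#!/usr/bin/env python
# A minimal sketch showing threads sharing an object in one process.
# The Counter class and file name are illustrative only.
from threading import Thread, Lock

class Counter:
    def __init__(self):
        self.value = 0
        self.lock = Lock()
    def increment(self):
        # "+=" is not atomic, so guard it with a lock
        self.lock.acquire()
        try:
            self.value = self.value + 1
        finally:
            self.lock.release()

counter = Counter()

def work():
    for i in range(10000):
        counter.increment()

thread_list = []
for i in range(5):
    t = Thread(target=work)
    t.start()
    thread_list.append(t)

for thread in thread_list:
    thread.join()

# all threads updated the same object in shared memory
print "final count: " + str(counter.value)  # prints 50000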

But what happens if we have a much larger workload (instead of 3 domains to get we have 3 million) and a much more complex workload (we have to get the web pages, extract data from them and then process that data, for example)? If we try to start up 3 million threads to handle each domain individually, I can pretty much guarantee you are not going to increase performance.

Advanced Threading Pattern

So starting up one thread per task is probably not the optimal way to go, especially if the workload is large or if it varies. But how can we efficiently divvy up the workload among a (possibly indeterminate) number of threads? Taking the total number of things to do, dividing by the number of threads and assigning that many things to each thread is not optimal: what if one thread gets saddled with URLs that all take much longer than average to download? We are then left with one thread still running while the rest have finished their work and sit idle. Also, the amount of work may not be known in advance, or we may want to be able to add more work; we may be reading URLs from a database or a file and not know in advance how many we have.

This is where Python work queues come in [WORKQUEUE]. A work queue provides a way to handle and share multiple objects: you put items into a queue and then retrieve them, and the queue ensures that objects are added and removed correctly (so with multiple threads accessing a queue you don’t need to worry about locking and so on; you simply write to the queue or read from it and it works). Queues in Python can be FIFO (first in, first out, think of a pipe), LIFO (last in, first out, think of a stack) or priority based (you assign priorities to objects, and higher priority items are processed before lower priority items).
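
As a quick illustration of the three queue types (a minimal sketch; the file name and the values are arbitrary):

queue_types_sketch.py

#!/usr/bin/env python
# A minimal sketch of the three queue types; the values are arbitrary.
import Queue

fifo = Queue.Queue()         # first in, first out
lifo = Queue.LifoQueue()     # last in, first out
prio = Queue.PriorityQueue() # lowest priority value comes out first

for item in [1, 2, 3]:
    fifo.put(item)
    lifo.put(item)

print fifo.get(), fifo.get(), fifo.get()  # prints: 1 2 3
print lifo.get(), lifo.get(), lifo.get()  # prints: 3 2 1

# priority queues take (priority, data) tuples, lowest number first
prio.put((2, "medium"))
prio.put((1, "high"))
prio.put((3, "low"))
print prio.get()[1], prio.get()[1], prio.get()[1]  # prints: high medium low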

Queue syntax is very simple: you create a queue, then you put objects into it and get objects out of it. This allows you to use the producer/consumer pattern [Producer-consumer problem]. The following example has a producer thread that reads the URL entries into a queue, and a group of consumer threads that each pull an item from the queue, process it and repeat until the queue is empty, at which point they exit. To ensure that the main thread runs long enough for the producer and consumer child threads to finish their work, we simply use the join() method of each thread to hold the main thread open until all the children have exited.

web_get_producer_consumer_basic_pattern.py
#!/usr/bin/env python

import urllib2
import sys
import Queue
from threading import Thread

host_list = ["http://lwn.net/", "http://seifried.org/", "http://google.com/", "http://yahoo.com/", "http://gmail.com/"]

thread_list = []
URL_work_queue = Queue.Queue()
number_of_consumer_threads = 3

def putURL(URL_list):
    for URL_item in URL_list:
        URL_work_queue.put(URL_item)

def getURL(thread_id):
    while 1:
        try:
            # grab an item without blocking; raises Queue.Empty when
            # there is no work left
            URL_toget = URL_work_queue.get(block=False)
        except Queue.Empty:
            # no work left, so exit this consumer thread
            print "thread exiting, id: " + str(thread_id)
            sys.exit()
        urllib2.urlopen(URL_toget)
        print "got URL: " + URL_toget

# fill the queue with work and block until we are done filling the queue

producer_thread = Thread(target=putURL, args=(host_list,))
producer_thread.start()
producer_thread.join()

# we can now start consumers

for i in range(number_of_consumer_threads):
    t = Thread(target=getURL, args=(i,))
    t.start()
    thread_list.append(t)

for thread in thread_list:
    thread.join()

This pattern is efficient and can easily be extended to have multiple sets of threads with queues connecting them. Generally speaking I try to break the work up into computationally intensive pieces (such as parsing a web page for content) and IO intensive pieces (such as requesting a web page or reading or writing a file). This allows tasks that do not require a lot of computation to quickly yield the GIL (more on this in the next article), allowing another thread to run (you want to do IO, especially network IO, in parallel as much as possible since it is so slow). If we were to download a web page and process it within the same thread, for example, we would limit the number of simultaneous downloads, since we’d be processing web pages when we could also be downloading them. A sketch of this two-stage idea follows below.
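
Here is a minimal sketch of that two-stage pipeline (the file name, the queue names and the trivial “processing” step, counting bytes, are my own illustration): one pool of threads downloads pages into a second queue, and a second pool processes them.

web_get_pipeline_sketch.py

#!/usr/bin/env python
# A minimal sketch of a two-stage pipeline: one thread pool downloads
# pages, a second pool processes them. The names and the trivial
# "processing" step are illustrative only.
import urllib2
import sys
import Queue
from threading import Thread

host_list = ["http://lwn.net/", "http://seifried.org/", "http://google.com/"]

URL_queue = Queue.Queue()   # URLs waiting to be downloaded
page_queue = Queue.Queue()  # downloaded pages waiting to be processed

# fill the URL queue before starting any workers
for URL_item in host_list:
    URL_queue.put(URL_item)

def download_worker():
    # IO intensive stage: fetch pages, hand them to the next stage
    while 1:
        try:
            URL_toget = URL_queue.get(block=False)
        except Queue.Empty:
            sys.exit()
        page_data = urllib2.urlopen(URL_toget).read()
        page_queue.put((URL_toget, page_data))

def process_worker():
    # CPU intensive stage: do something with each downloaded page.
    # A simple 5 second timeout decides when the pipeline is drained;
    # a real program would use sentinel values instead.
    while 1:
        try:
            URL_got, page_data = page_queue.get(timeout=5)
        except Queue.Empty:
            sys.exit()
        print URL_got + ": " + str(len(page_data)) + " bytes"

thread_list = []
for i in range(3):
    t = Thread(target=download_worker)
    t.start()
    thread_list.append(t)
for i in range(2):
    t = Thread(target=process_worker)
    t.start()
    thread_list.append(t)

for thread in thread_list:
    thread.join()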

Now, as for code quality, the producer/consumer example above has a number of significant problems (mostly architectural, but one or two implementation-wise) that will be addressed in later articles in this series.

Work Queues

As you can see, work queues provide an ideal mechanism for linking pools of threads. Implementing queues from scratch in other languages such as C or Java is an incredibly complex task: what if two threads try to write to the queue at the same time? How do you ensure that objects are added properly, or, if multiple threads are reading from the queue, that objects are handed out properly? The good news is that because Python provides primitives like threads and queues (and because of the Global Interpreter Lock (GIL)), you really don’t have to care or spend much time making sure you get it right, since it’s built in.
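
For completeness, the Queue module also provides task_done() and join() (available since Python 2.5), which let the main thread block until every item put into the queue has been fully processed. A minimal sketch (the file name, the worker and the items are illustrative):

queue_join_sketch.py

#!/usr/bin/env python
# A minimal sketch of Queue.task_done()/Queue.join() (Python 2.5 and
# later); the worker and the items are illustrative only.
import Queue
from threading import Thread

work_queue = Queue.Queue()

def worker():
    while 1:
        item = work_queue.get()  # blocks until an item is available
        print "processing: " + str(item)
        work_queue.task_done()   # tell the queue this item is finished

for i in range(3):
    t = Thread(target=worker)
    t.setDaemon(True)  # daemon threads die when the main thread exits
    t.start()

for item in range(10):
    work_queue.put(item)

# block until every item has been retrieved and marked done
work_queue.join()
print "all work complete"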

Appendix

[THREAD]

http://docs.python.org/library/thread.html

[THREADING]

http://docs.python.org/library/threading.html

[WORKQUEUE]

http://docs.python.org/library/queue.html

[Producer-consumer problem]

http://en.wikipedia.org/wiki/Producer-consumer_problem


3 Responses to “Python Performance Part 1”

  1. s7v7nislands Says:

    I tested the no-join version using Python 2.6.5 on Gentoo, and it behaves just like the join version.

    From the Python docs: “The entire Python program exits when no alive non-daemon threads are left.”

    So you need to call t.setDaemon(True) before t.start(); then the no-join version will exit like you say.

  2. kurtseifried Says:

    Yes, the point was to specifically show that you need to somehow keep the main thread running (if it exits, everything gets killed).

  3. Alexey Lavrenuke (@Direvius) Says:

    “Implementing queues from scratch in other languages such as C or Java is an incredibly complex task” — why implement a queue from scratch? BlockingDeque and BlockingQueue solve the problem. And they are in the standard Java library. And GIL is not an advantage at all.
