Python Performance Part 4

Python Performance Part 1
Python Performance Part 2
Python Performance Part 3

In part one I covered a basic introduction to threads and the producer-consumer pattern. In part two I covered some (unfortunately common) pathological cases where threading in Python can make things much worse. In part three I covered how to profile and measure performance within your code and some strategies to address them. Finally I covered a more advanced producer-consumer pattern that uses a spinlocks and a locking hierarchy. Now in part four we’re going to explore how you can escape a single core using Python and distribute your application across multiple CPU’s and multiple systems. I’m also going to cover a few of the more common strategies and tools for distributing applications and work loads across multiple systems. I have a lot of ground to cover so I will be unable to go in depth with detailed code examples/etc., however the Python documentation includes excellent examples which I will reference.

Escaping the core – multiprocessing

So as you have probably learned by now the GIL giveth (easy threading) and the GIL taketh (you can only run threaded code on one CPU core). Oddly enough the Python developers were not too happy about this either, so in Python 2.6 they introduced the “multiprocessing” module [MULTIPROCESSING]. The multiprocessing module provides a very similar set of primitives as the “thread” and “threading” modules. You get a process class which allows you to spawn multiple processes, a queue class which allows you to create queues that can be accessed from multiple processes, pipes that allow for bi-directional and uni-directional communications, locking, and shared memory in the form of values and arrays. You also get a manager which allows you to control a server process and pools which allow you to create and control multiple processes. Using this you can extend the consumer-producer pattern to multiple processes running across multiple CPU cores on multiple systems.

A quick note on subprocess[SUBPROCESS], this is primarily designed for executing external scripts and programs (e.g. ping, traceroute, whois, etc.) that are not written in Python. You can launch programs and scripts and talk to their stdin/stdout which is great for non python programs but extremely limited when compared to the multiprocessing module.

Inter-process communication

One major advantage of using the multiprocessing module is that inter-process communication mechanisms provided. You have several options; pipes (pipe()), locks (lock()), shared memory (Value() and Array()) and the server process (Manager()). The shared memory facilities provided by multiprocessing such as Value() and Array() are relatively fast since they are implemented as a shared memory structure, but of course by definition (shared memory space) they cannot be shared between processes running on different systems. If you want to share Python objects (such as variables, arrays, etc.) across processes running on separate servers you can use Manager() to create a server process that will hold and share the Python objects.

Facility Provides Pros Cons
Lock() locks Provides locking for operations that are not thread or multiple process safe Limited to single system
Queue() Work queues Fast, good for sharing work Limited to single system
Pipe() Bi-directional and uni-directional communication pipes Simple, send() and recv(), you can pass pretty much anything through it (variables, arrays, etc.) Not completely safe, each end of pipe must be accessed one at a time only
Shared memory Value() and Array() Very fast Limited to single system, limited to standard values and arrays (no other Python objects)
Manager() list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value and Array Very flexible, provides everything, works across multiple processes and systems Slower than shared memory

One of the more common patterns for designing communication capabilities into a program that uses multiple processes is to use a single control process which then has a bi-directional pipe to each worker processes allowing it to communicate with them. Now if you want to communicate across processes running on separate system you will need some external communications method such as xmlrpclib or asynchat (to name two possibilities). Like many open source projects one of the problems in Python is that you have numerous options that can lead to mental overload. My advice is that you list out what you need (i.e. do you need to pass the data through firewalls/etc, in which case something like xmlrpclib that can use HTTP as a transport may be a good idea) and go with the simplest option (which you may have to replace anyways).

Inter-process signals

Python provides a signal() module that gives you the ability to send signals, unfortunately there are some issues with handling signals in Python. You cannot intentionally block signals in Python, meaning that you can’t take a critical code section and tell it to ignore signals while it is running (ensuring that the code finishes executing) thus it is possible for critical code to only partially execute before yielding the GIL lock and another thread is run, which means you may need to think about transactions and methods ensure that partially completed tasks do not cause problems should another thread be executed. As well only the main thread can set a new signal handler, a common design pattern in Python is to have the main thread loop and handle signals while it spawns of child threads that do the actual work.

Pyro – Python Remote Objects

One disadvantage of using the Manager() module to provide communication between different systems is that it is somewhat limited. Although it provides a basic level of security (in the form of an authkey) it doesn’t appear to provide anything beyond that. Pyro[PYRO] supports SSL encrypted communications but most importantly it provides greater flexibility with objects and types and also includes some important capabilities like automatic reconnection in case a connection drops. Pyro also provides a name service so you can register objects and easily share them and an event service which is similar to queues but can be used for many to many communications, however it does not guarantee delivery of messages which can be a problem.

Beanstalkd message queue – sharing messages (and work)

Now one of my favorite pieces of software. One reason is that the beanstalkd [BEANSTALKD] server binary itself is around 59k. I like small and simple code since it generally has fewer bugs and other surprises. Beanstalkd provides a distributed work queue, you can put jobs in, take them out, bury them (they are essentially hidden until you pull them back out, a great way to put problematic jobs out of the way for a human to inspect later) and create multiple tubes, essentially unique queues (a single beanstalkd server can handle multiple related or unrelated queues). The major downside to beanstalkd is that it doesn’t provide any encryption of network traffic or authentication capabilities, so any client with access to a running instance of a beanstalkd server will have access to all the queues in that running instance. This can be gotten around by wrapping beanstalkd in SSL (using stunnel or a similar service) to secure communications and limiting access to beanstalkd queues based on either IP address or by requiring SSL client authentication. The beanstalkc [BEANSTALKC] Python client provides a very stable client for beanstalkd and also importantly has excellent documentation.

Formatting messages and work – pickle and JSON

So assuming you are not using the Python Value(), Array(), Pipe() or Manager() or something like Pyro (all of which support native Python objects) to communicate between processes you will need to format your data so that it can be passed around. If you need to pass an actual Python object such as a class (as opposed to a variable) then you will need to pickle() [PICKLE] it, which will serialize the object. Python pickle() supports several protocols for pickling objects, the first of which is text based (protocol 0) and is extremely useful if you need to pass an object over a text based protocol such as HTTP. The other two protocols are binary in nature.

If however you merely need to pass data and variables around you can use JSON instead of pickling. I really, really like JSON [JSON] because the specification is simple (it fits on one page) and it is human readable (so debugging it is simple, you can print and read the messages directly). Two Python libraries for JSON are available (simplejson and pyson), and as of Python 2.6 the JSON encoder (called “json”) is included.

Caching – increase speed and decrease workload

Of course if you can do work in advance, or do work once and keep the answer for later use again you will be able to significantly speed up your application. The real trick to caching an answer or a result is to know what to cache, whether or not caching it will improve things (to make it worthwhile) and finally how to expire answers (assuming this needs to be done). For example if you display photos online and want to display a thumbnail image you can create it when the image is uploaded, meaning that when someone asks to view it the thumbnail will be available almost immediately, the downside being that you potentially store thumbnails that are never requested by anyone. Alternatively you can generate the thumbnail as needed and keep the thumbnail for later use, you run don’t store thumbnails that are never asked for, but it will take longer for the system to display the thumbnail the first time it is requested (since it must be generated). Either way you also need to decide how long to keep the thumbnail image for; forever, or do you expire it at some point? The answer to these types of questions varies hugely of course depending on your data, usage and what is important to you (latency? speed? storage? processing time?). A general rule of thumb is to cache data that is as finished as possible, e.g. rather than caching a database query you can cache the compiled result (such as a piece of a generated web page) to reduce processing time and latency. As for cache expiry you can use tricks such as checking to see if the source data has changed since you last used it, saving you the processing time if it’s still valid (but not the time needed to get the data). Another strategy is to expire on update, when new data comes in have it trigger an expiry of any cached data based on the old data it has replaced. It should also be noted that some services and servers have built-in caching, e.g. MySQL’s query cache can be used to cache query results.

Caching with memcache

If you need to cache objects then memcache [MEMCACHED] is (usually) the cache for you. It’s basically an object store that uses a unique key (text string) to locate the object. It keeps objects in memory and does not write them to disk which makes it very fast. One trick is to use some calculable value for the key such as a URL, username, etc. to make it easy to query the cache for an object (and if the object doesn’t exist or is stale you can then generate a new one). For most sites you can get away with using spare amounts of memory on various servers, although some sites with especially heavy usage will dedicate entire servers to memcached. To access memcached from Python you want python-memached [PYTHON MEMCACHED]. One interesting project that recently came to light is Cachepy which is a caching solution based on Google App Engine. [CACHEPY]

A note on MySQL

I can’t put it any more simply then this: use InnoDB rather than MyISAM most of the time. MyISAM is sometimes faster for reads, but for writes (INSERT/UPDATE/etc.) to a table it can be muchh slower since the entire table must be locked for a write. InnoDB uses row level locking so as long as different rows of data are being updated the writes can occur concurrently rather than sequentially which is a huge advantage.

A note on PostgreSQL

The VACUUM FULL command will lock your database table and basically make it perform slower than mud while the VACUUM FULL is running. In other words avoid VACUUM FULL and use AUTOVACUUM, VACUUM, CLUSTER, or a SELECT INTO to copy data to a new table or a TRUNCATE TABLE if you want to clear it out. These options are covered in the PostgreSQL wiki. [VACUUM FULL]

A note on Google App Engine

If you want to start playing with large scale Python applications you should definitely check out Google App Engine. [APPENGINE]

Shared nothing/Sharding and other (potentially complicated) scaling strategies

My final performance tip is to use shared nothing or sharded data architectures. If you need to do user authentication and allow users to update their account info (in other words you can’t outsource authentication to something like OpenID) than you will need to store this data somewhere (most likely within a database). So you have a single central database for all user accounts, well that doesn’t scale to well (especially if you end up with more than one location hosting servers). So you replicate the database, creating multiple slaves with copies of the master so at least servers can get good read performance. But you’re still stuck with a single master for write operations. So you end up using something like MySQL Cluster which supports multi-master replication (you can write to any of the masters), but you still need to be careful about race conditions.

However there is a simpler method to spread out the read and write operations on the database. Using a fast hash function or even just a simple “algorithm” like all accounts starting with the letters “A” through “E” are on server 1, “F” through “J” are on server 2 and so on would allow you to split user accounts up into shards that can be handled by separate servers. Using a hash function is better, if done right you end up with accounts very evenly spread out across the servers (so if a whole bunch of people with accounts starting with the letter “A” join up it won’t mess things up). One such system that implemented this was the Cache Array Routing Protocol (CARP) [CARP], the idea being you could take a URL, hash it and determine which cache server would contain a copy of the data (and if the cache server didn’t have it then it would go fetch it). The advantage of this is that no communication with a central server (that tracks information location) or every single server (brute force method) is needed to determine which server hosts the information. One important note on sharding: few tools or services implicitly support it, and it can add significant complexity to your system, so use with caution (ideally only if absolutely necessary).

The future of Python performance – Unladen Swallow

So finally we come to the end, or rather the beginning of Python performance. As you may or may not know Google uses Python a lot internally (YouTube is mostly built using Python). Google appears to be in the situation where it is cheaper to re-architect and improve the language they are using rather than to rewrite all their existing code in some faster language. This is ultimately a VERY good thing for us Python users. So what is this “Unladen Swallow” [UNLADEN SWALLOW] and what does it do? From the FAQ:

We want to make Python faster, but we also want to make it easy for large, well-established applications to switch to Unladen Swallow.

1. Produce a version of Python at least 5x faster than CPython.

2. Python application performance should be stable.

3. Maintain source-level compatibility with CPython applications.

4. Maintain source-level compatibility with CPython extension modules.

5. We do not want to maintain a Python implementation forever; we view our work as a branch, not a fork.

The largest aspects of this project are moving Python to a new JIT compiler and using the LLVM [LLVM]. The advantage of moving to the Low Level Virtual Machine (LLVM) is that it creates very fast code and any improvements made to the LLVM will result in improvements to Python. Additionally Unladen Swallow is looking at improving startup time (which can be slow due to imports) and speeding up regular expressions to name a few other aspects. Finally one of the bigger goals was to remove the GIL, the idea being that a single Python process with multiple threads would then be able to use multiple CPU cores with no hassle, but as it turns out this is more complicated than expected so I guess we may have to wait a while for this.

Some more articles on performance: – the ten thousand client connection problem – Python memory profiler – Sharding Blog – Building scalable web applications with Google App Engine – High Scalability blog – Parallel Processing and Multiprocessing in Python

















Tags: , , , ,

2 Responses to “Python Performance Part 4”

  1. Katja Hofmann Says:

    Hi, thanks for the great article! I also found that with multiprocessing.sharedctypes, objects other than standard values and arrays can be created. I’m wondering whether this is still much faster than using Manager. Do you have any experience with this?



  2. “Security? That’s the OS’s/Networks Job!” « Steve's Personal Blog Says:

    […] didn’t end up trying beanstalk but did notice it had similar pitfalls. As Kurt Seifried points out in his blog: The major downside to beanstalkd is that it doesn’t provide any encryption of […]

Leave a Reply

Please log in using one of these methods to post your comment: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: