Posts Tagged ‘performance’

Python Performance Part 4

May 31, 2010

Python Performance Part 1
Python Performance Part 2
Python Performance Part 3

In part one I covered a basic introduction to threads and the producer-consumer pattern. In part two I covered some (unfortunately common) pathological cases where threading in Python can make things much worse. In part three I covered how to profile and measure performance within your code and some strategies to address problems, and finished with a more advanced producer-consumer pattern that uses spinlocks and a locking hierarchy. Now in part four we’re going to explore how you can escape a single core and distribute your Python application across multiple CPUs and multiple systems. I’m also going to cover a few of the more common strategies and tools for distributing applications and workloads across multiple systems. I have a lot of ground to cover, so I won’t be able to go into depth with detailed code examples for everything; however, the Python documentation includes excellent examples which I will reference.

Escaping the core – multiprocessing

So as you have probably learned by now, the GIL giveth (easy threading) and the GIL taketh away (you can only run threaded code on one CPU core). Oddly enough the Python developers were not too happy about this either, so in Python 2.6 they introduced the “multiprocessing” module [MULTIPROCESSING]. The multiprocessing module provides a very similar set of primitives to the “thread” and “threading” modules. You get a Process class which allows you to spawn multiple processes, a Queue class which allows you to create queues that can be accessed from multiple processes, pipes that allow for bi-directional and uni-directional communications, locking, and shared memory in the form of values and arrays. You also get a manager which allows you to control a server process, and pools which allow you to create and control multiple processes. Using this you can extend the producer-consumer pattern to multiple processes running across multiple CPU cores and multiple systems.
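
To give a concrete feel for the API, here is a minimal sketch (not taken from the Python docs; process_item() and the range of work items are made up for illustration) of the producer-consumer pattern using multiprocessing.Process and multiprocessing.Queue:

#!/usr/bin/env python
import multiprocessing

def process_item(item):
    return item * item                      # stand-in for real work

def worker(work_queue, result_queue):
    # pull items until we see the None sentinel
    while True:
        item = work_queue.get()
        if item is None:
            break
        result_queue.put(process_item(item))

if __name__ == '__main__':
    num_workers = multiprocessing.cpu_count()
    work_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()

    workers = [multiprocessing.Process(target=worker,
                                       args=(work_queue, result_queue))
               for i in range(num_workers)]
    for w in workers:
        w.start()

    for item in range(100):                 # the producer side
        work_queue.put(item)
    for w in workers:                       # one sentinel per worker
        work_queue.put(None)

    results = [result_queue.get() for i in range(100)]
    for w in workers:
        w.join()
    print len(results), "results"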

A quick note on subprocess [SUBPROCESS]: it is primarily designed for executing external scripts and programs (e.g. ping, traceroute, whois, etc.) that are not written in Python. You can launch programs and scripts and talk to their stdin/stdout, which is great for non-Python programs but extremely limited when compared to the multiprocessing module.

Inter-process communication

One major advantage of using the multiprocessing module is the inter-process communication mechanisms it provides. You have several options: pipes (Pipe()), locks (Lock()), shared memory (Value() and Array()) and the server process (Manager()). The shared memory facilities such as Value() and Array() are relatively fast since they are implemented as shared memory structures, but of course by definition (a shared memory space) they cannot be shared between processes running on different systems. If you want to share Python objects (such as variables, arrays, etc.) across processes running on separate servers you can use Manager() to create a server process that will hold and share the Python objects.

Lock(): locks. Pros: provides locking for operations that are not thread or multiple-process safe. Cons: limited to a single system.

Queue(): work queues. Pros: fast, good for sharing work. Cons: limited to a single system.

Pipe(): bi-directional and uni-directional communication pipes. Pros: simple send() and recv(); you can pass pretty much anything through it (variables, arrays, etc.). Cons: not completely safe; each end of the pipe must be accessed by only one process at a time.

Shared memory (Value() and Array()): Pros: very fast. Cons: limited to a single system, and limited to standard values and arrays (no other Python objects).

Manager(): list, dict, Namespace, Lock, RLock, Semaphore, BoundedSemaphore, Condition, Event, Queue, Value and Array. Pros: very flexible, provides everything and works across multiple processes and systems. Cons: slower than shared memory.

One of the more common patterns for designing communication into a program that uses multiple processes is to use a single control process which has a bi-directional pipe to each worker process, allowing it to communicate with them. If you want to communicate across processes running on separate systems you will need some external communications method such as xmlrpclib or asynchat (to name two possibilities). Like many open source projects, one of the problems in Python is that you have numerous options, which can lead to mental overload. My advice is to list out what you need (e.g. do you need to pass the data through firewalls, in which case something like xmlrpclib that can use HTTP as a transport may be a good idea) and go with the simplest option (which you may have to replace anyway).
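
As a rough illustration of the control-process pattern, the sketch below gives each worker process its own Pipe() back to the parent; the 'ping'/'quit' messages are invented for the example:

#!/usr/bin/env python
import multiprocessing

def worker(conn):
    # each worker owns one end of a bi-directional pipe to the control process
    while True:
        msg = conn.recv()
        if msg == 'quit':
            break
        conn.send(('pong', msg))            # reply to the control process
    conn.close()

if __name__ == '__main__':
    workers = []
    pipes = []
    for i in range(4):
        parent_end, child_end = multiprocessing.Pipe()
        p = multiprocessing.Process(target=worker, args=(child_end,))
        p.start()
        workers.append(p)
        pipes.append(parent_end)

    for conn in pipes:                      # only the parent touches this end
        conn.send('ping')
    for conn in pipes:
        print conn.recv()
    for conn in pipes:
        conn.send('quit')
    for p in workers:
        p.join()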

Inter-process signals

Python provides a signal module that gives you the ability to handle signals; unfortunately there are some issues with handling signals in Python. You cannot block signals in Python, meaning that you can’t take a critical code section and tell it to ignore signals while it is running (ensuring that the code finishes executing). It is therefore possible for critical code to only partially execute before the GIL is yielded and another thread is run, which means you may need to think about transactions and methods that ensure partially completed tasks do not cause problems should another thread be executed. As well, only the main thread can set a new signal handler, so a common design pattern in Python is to have the main thread loop and handle signals while it spawns off child threads that do the actual work.
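
A bare-bones sketch of that pattern might look like the following; the worker body is a placeholder and the handler simply sets a flag that the loops check:

#!/usr/bin/env python
import signal
import time
from threading import Thread

shutdown_requested = False

def handle_signal(signum, frame):
    # only the main thread can install handlers, and it is the thread
    # that will actually receive the signal
    global shutdown_requested
    shutdown_requested = True

def worker():
    while not shutdown_requested:
        time.sleep(1)                       # placeholder for real work

if __name__ == '__main__':
    signal.signal(signal.SIGTERM, handle_signal)
    signal.signal(signal.SIGINT, handle_signal)
    threads = [Thread(target=worker) for i in range(4)]
    for t in threads:
        t.start()
    while not shutdown_requested:
        time.sleep(0.5)                     # main thread just loops and handles signals
    for t in threads:
        t.join()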

Pyro – Python Remote Objects

One disadvantage of using Manager() to provide communication between different systems is that it is somewhat limited. Although it provides a basic level of security (in the form of an authkey) it doesn’t appear to provide anything beyond that. Pyro [PYRO] supports SSL-encrypted communications, but more importantly it provides greater flexibility with objects and types and also includes some important capabilities like automatic reconnection in case a connection drops. Pyro also provides a name service so you can register objects and easily share them, and an event service which is similar to queues but can be used for many-to-many communications; however, the event service does not guarantee delivery of messages, which can be a problem.

Beanstalkd message queue – sharing messages (and work)

Now one of my favorite pieces of software. One reason is that the beanstalkd [BEANSTALKD] server binary itself is around 59k; I like small and simple code since it generally has fewer bugs and other surprises. Beanstalkd provides a distributed work queue: you can put jobs in, take them out, bury them (they are essentially hidden until you pull them back out, a great way to put problematic jobs aside for a human to inspect later) and create multiple tubes, essentially separate queues (a single beanstalkd server can handle multiple related or unrelated queues). The major downside to beanstalkd is that it doesn’t provide any encryption of network traffic or authentication capabilities, so any client with access to a running beanstalkd server will have access to all the queues in that instance. You can work around this by wrapping beanstalkd in SSL (using stunnel or a similar service) to secure communications, and by limiting access to beanstalkd queues based on either IP address or SSL client authentication. The beanstalkc [BEANSTALKC] Python client provides a very stable client for beanstalkd and, importantly, has excellent documentation.
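
Working with beanstalkd from Python looks roughly like this (a sketch based on the beanstalkc client; the tube name and job body are made up, and a beanstalkd server is assumed to be listening on localhost:11300):

import beanstalkc

beanstalk = beanstalkc.Connection(host='localhost', port=11300)

# producer side: pick a tube and put a job in it
beanstalk.use('thumbnails')
beanstalk.put('{"image_id": 42}')

# consumer side: watch the tube and reserve a job
beanstalk.watch('thumbnails')
job = beanstalk.reserve()
try:
    print job.body                          # do the actual work here
    job.delete()                            # done, remove it from the queue
except Exception:
    job.bury()                              # park the problem job for a human to inspect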

Formatting messages and work – pickle and JSON

So assuming you are not using Value(), Array(), Pipe() or Manager(), or something like Pyro (all of which support native Python objects), to communicate between processes, you will need to format your data so that it can be passed around. If you need to pass an actual Python object such as a class instance (as opposed to a simple variable) then you will need the pickle module [PICKLE], which will serialize the object. Pickle supports several protocols for pickling objects: protocol 0 is text based and is extremely useful if you need to pass an object over a text-based protocol such as HTTP, while the other two protocols are binary.
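
A quick sketch of pickling with the text and binary protocols (the Job class is just an invented example object):

import pickle

class Job(object):
    def __init__(self, url, priority):
        self.url = url
        self.priority = priority

job = Job('http://seifried.org/', 5)

# protocol 0 is plain text, safe to push over text-based transports like HTTP
text_pickle = pickle.dumps(job, 0)

# protocol 2 is binary and more compact (available since Python 2.3)
binary_pickle = pickle.dumps(job, 2)

# either form round-trips back into an equivalent object
restored = pickle.loads(text_pickle)
print restored.url, restored.priority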

If however you merely need to pass data and variables around you can use JSON instead of pickling. I really, really like JSON [JSON] because the specification is simple (it fits on one page) and it is human readable (so debugging is simple: you can print and read the messages directly). Third-party Python libraries for JSON are available (simplejson and pyson), and as of Python 2.6 a JSON module (called “json”) is included in the standard library.
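
A quick sketch of encoding and decoding a work message with the standard json module (the message contents are made up):

import json    # standard library as of Python 2.6; simplejson offers the same API

message = {'task': 'resize', 'image_id': 42, 'sizes': [64, 128]}

encoded = json.dumps(message)
print encoded                               # human readable, easy to debug

decoded = json.loads(encoded)
print decoded['task'], decoded['sizes']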

Caching – increase speed and decrease workload

Of course if you can do work in advance, or do work once and keep the answer for later use, you will be able to significantly speed up your application. The real trick to caching an answer or a result is knowing what to cache, whether or not caching it will actually improve things (to make it worthwhile) and finally how to expire answers (assuming this needs to be done). For example, if you display photos online and want to display a thumbnail image you can create it when the image is uploaded, meaning that when someone asks to view it the thumbnail will be available almost immediately; the downside is that you potentially store thumbnails that are never requested by anyone. Alternatively you can generate the thumbnail as needed and keep it for later use: you don’t store thumbnails that are never asked for, but it will take longer for the system to display the thumbnail the first time it is requested (since it must be generated). Either way you also need to decide how long to keep the thumbnail image for; forever, or do you expire it at some point? The answer to these types of questions varies hugely, of course, depending on your data, usage and what is important to you (latency? speed? storage? processing time?).

A general rule of thumb is to cache data that is as finished as possible, e.g. rather than caching a database query you can cache the compiled result (such as a piece of a generated web page) to reduce processing time and latency. As for cache expiry, you can use tricks such as checking to see if the source data has changed since you last used it, saving you the processing time if it’s still valid (but not the time needed to get the data). Another strategy is to expire on update: when new data comes in, have it trigger an expiry of any cached data based on the old data it has replaced. It should also be noted that some services and servers have built-in caching, e.g. MySQL’s query cache can be used to cache query results.
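
As a toy illustration of the expire-on-update strategy, the sketch below caches a rendered page fragment per user and simply drops the cached entry whenever the underlying data changes; render_profile() and save_to_database() are placeholders for your own code:

page_cache = {}

def render_profile(user_id):
    # stand-in for an expensive operation (DB queries, templating, etc.)
    return "<div>profile for user %d</div>" % user_id

def get_profile_page(user_id):
    if user_id not in page_cache:
        page_cache[user_id] = render_profile(user_id)   # cache miss
    return page_cache[user_id]                          # cache hit

def update_profile(user_id, new_data):
    save_to_database(user_id, new_data)     # placeholder for the real write
    page_cache.pop(user_id, None)           # expire on update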

Caching with memcache

If you need to cache objects then memcached [MEMCACHED] is (usually) the cache for you. It’s basically an object store that uses a unique key (a text string) to locate the object. It keeps objects in memory and does not write them to disk, which makes it very fast. One trick is to use some calculable value for the key, such as a URL, username, etc., to make it easy to query the cache for an object (and if the object doesn’t exist or is stale you can then generate a new one). Most sites can get away with using spare memory on various servers, although some sites with especially heavy usage dedicate entire servers to memcached. To access memcached from Python you want python-memcached [PYTHON MEMCACHED]. One interesting project that recently came to light is Cachepy, a caching solution based on Google App Engine. [CACHEPY]
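
Going back to the thumbnail example, using python-memcached might look roughly like this (generate_thumbnail() is a placeholder and a memcached server is assumed to be running on localhost):

import hashlib
import memcache                             # the python-memcached client

mc = memcache.Client(['127.0.0.1:11211'])

def get_thumbnail(image_url):
    # a calculable key: anyone holding the URL can compute it
    key = 'thumb:' + hashlib.md5(image_url).hexdigest()
    thumbnail = mc.get(key)
    if thumbnail is None:                   # cache miss or expired entry
        thumbnail = generate_thumbnail(image_url)   # placeholder for real work
        mc.set(key, thumbnail, time=3600)   # cache it for an hour
    return thumbnail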

A note on MySQL

I can’t put it any more simply than this: use InnoDB rather than MyISAM most of the time. MyISAM is sometimes faster for reads, but for writes (INSERT/UPDATE/etc.) to a table it can be much slower since the entire table must be locked for a write. InnoDB uses row-level locking, so as long as different rows of data are being updated the writes can occur concurrently rather than sequentially, which is a huge advantage.

A note on PostgreSQL

The VACUUM FULL command will lock your database table and basically make it perform slower than mud while the VACUUM FULL is running. In other words, avoid VACUUM FULL and use AUTOVACUUM, VACUUM, CLUSTER, a SELECT INTO to copy data to a new table, or TRUNCATE TABLE if you want to clear a table out. These options are covered in the PostgreSQL wiki. [VACUUM FULL]

A note on Google App Engine

If you want to start playing with large scale Python applications you should definitely check out Google App Engine. [APPENGINE]

Shared nothing/Sharding and other (potentially complicated) scaling strategies

My final performance tip is to use shared-nothing or sharded data architectures. If you need to do user authentication and allow users to update their account info (in other words you can’t outsource authentication to something like OpenID) then you will need to store this data somewhere, most likely within a database. So you have a single central database for all user accounts; well, that doesn’t scale too well (especially if you end up with more than one location hosting servers). So you replicate the database, creating multiple slaves with copies of the master so that at least reads scale across servers. But you’re still stuck with a single master for write operations. So you end up using something like MySQL Cluster which supports multi-master replication (you can write to any of the masters), but you still need to be careful about race conditions.

However there is a simpler method to spread out the read and write operations on the database. Using a fast hash function, or even just a simple “algorithm” like all accounts starting with the letters “A” through “E” are on server 1, “F” through “J” are on server 2 and so on, would allow you to split user accounts up into shards that can be handled by separate servers. Using a hash function is better: if done right you end up with accounts spread very evenly across the servers (so if a whole bunch of people with accounts starting with the letter “A” join up it won’t skew things). One such system that implemented this was the Cache Array Routing Protocol (CARP) [CARP], the idea being you could take a URL, hash it and determine which cache server would contain a copy of the data (and if the cache server didn’t have it then it would go fetch it). The advantage of this is that no communication with a central server (that tracks information location) or with every single server (the brute force method) is needed to determine which server hosts the information. One important note on sharding: few tools or services support it out of the box, and it can add significant complexity to your system, so use it with caution (ideally only if absolutely necessary).
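
A minimal sketch of hash-based sharding (the server names are made up). Note that with simple modulo hashing adding or removing a shard remaps most keys, which is exactly the problem consistent-hashing schemes like CARP are designed to minimize:

import hashlib

SHARDS = ['db1.example.com', 'db2.example.com',
          'db3.example.com', 'db4.example.com']

def shard_for(username):
    # hash the username so accounts spread evenly across the shards,
    # then map the hash onto one of the configured servers
    digest = hashlib.md5(username.lower()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print shard_for('alice')    # every client computes the same answer,
print shard_for('bob')      # no central lookup service required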

The future of Python performance – Unladen Swallow

So finally we come to the end, or rather the beginning of Python performance. As you may or may not know Google uses Python a lot internally (YouTube is mostly built using Python). Google appears to be in the situation where it is cheaper to re-architect and improve the language they are using rather than to rewrite all their existing code in some faster language. This is ultimately a VERY good thing for us Python users. So what is this “Unladen Swallow” [UNLADEN SWALLOW] and what does it do? From the FAQ:

We want to make Python faster, but we also want to make it easy for large, well-established applications to switch to Unladen Swallow.

1. Produce a version of Python at least 5x faster than CPython.

2. Python application performance should be stable.

3. Maintain source-level compatibility with CPython applications.

4. Maintain source-level compatibility with CPython extension modules.

5. We do not want to maintain a Python implementation forever; we view our work as a branch, not a fork.

The largest aspect of this project is moving Python to a new JIT compiler built on LLVM [LLVM]. The advantage of moving to the Low Level Virtual Machine (LLVM) is that it creates very fast code, and any improvements made to LLVM will result in improvements to Python. Additionally, Unladen Swallow is looking at improving startup time (which can be slow due to imports) and speeding up regular expressions, to name a few other aspects. Finally, one of the bigger goals was to remove the GIL, the idea being that a single Python process with multiple threads would then be able to use multiple CPU cores with no hassle; as it turns out this is more complicated than expected, so I guess we may have to wait a while for it.

Some more articles on performance:

http://wiki.python.org/moin/PythonSpeed/PerformanceTips

http://www.kegel.com/c10k.html – the ten thousand client connection problem

http://ajbrown.org/blog/2008/11/05/micro-optimize-your-time-not-your-code/

http://stackoverflow.com/questions/110259/python-memory-profiler – Python memory profiler

http://dbshards.com/database-sharding-blog/ – Sharding Blog

http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine – Building scalable web applications with Google App Engine

http://highscalability.com/ – High Scalability blog

http://wiki.python.org/moin/ParallelProcessing – Parallel Processing and Multiprocessing in Python

[MULTIPROCESSING]

http://docs.python.org/library/multiprocessing.html

[SUBPROCESS]

http://docs.python.org/library/subprocess.html

[PYRO]

http://pyro.sourceforge.net/

[BEANSTALKD]

http://kr.github.com/beanstalkd/

[BEANSTALKC]

http://github.com/earl/beanstalkc

[PICKLE]

http://docs.python.org/library/pickle.html

[JSON]

http://www.json.org/

[MEMCACHED]

http://memcached.org/

[PYTHON MEMCACHED]

http://www.tummy.com/Community/software/python-memcached/

[CACHEPY]

http://appengine-cookbook.appspot.com/recipe/cachepy-faster-than-memcache-and-unlimited-quota/

[VACUUM FULL]

http://wiki.postgresql.org/wiki/VACUUM_FULL

[APPENGINE]

http://code.google.com/appengine/

[CARP]

http://en.wikipedia.org/wiki/Cache_Array_Routing_Protocol

[PYSHARDS]

http://code.google.com/p/pyshards/wiki/Pyshards

[UNLADEN SWALLOW]

http://code.google.com/p/unladen-swallow/

[LLVM]

http://llvm.org/

Python Performance Part 3

May 31, 2010

Python Performance Part 1
Python Performance Part 2
Python Performance Part 4

In part one I covered a basic introduction to threads and the producer-consumer pattern. In part two I covered the Global Interpreter Lock (GIL) and some of the more pathological behavior it can cause. Now we’re going to go over how to profile code and how to instrument code so you can measure its performance and know with some certainty whether the code actually needs to be fixed (and how it can be fixed). Finally we’ll go over some significant ways in which we can improve our handling of the producer-consumer pattern so that it is more efficient and reliable.

Profiling vs. Performance measurement

Profiling and performance measurement are closely related but there is a difference. Profiling will generally give you a great deal of detail: every function call, function return and exception event, and since Python supports deterministic profiling it will do so with an extremely high degree of accuracy. The output of Python profiling is staggering: every single detail, the number of calls (ncalls), total time (tottime), per-call time (tottime divided by ncalls), cumulative time (including sub-calls for a function, cumtime) and the filename, line number and function called (filename:lineno(function)). In other words, everything (this is one advantage of using an interpreted language). For most of us profiling should be used once the code has been written (premature optimization and all that) and used a bit (in testing or otherwise). Performance measurement tends to be coarse grained, for example adding hooks that record how long a database operation takes to return or how long it takes to generate a web page in total. You can also insert performance-measuring code into your application and use it to continuously monitor system state and health (i.e. average DB query time over the last 5 minutes, etc.) and alert you if a problem occurs, which is something that profiling is not really suited for.

Profiling Python code

Python provides two main profilers: “profile” and “cProfile” [PROFILE] (a third one, “hotshot”, has been deprecated and may be removed in the next version of Python according to the documentation). “profile” and “cProfile” are essentially identical with respect to usage; however, the “cProfile” module (written in C) is much faster and won’t slow your code down significantly. The “profile” module will slow your code down significantly, but as it is written in Python it can easily be modified or extended should you need to. Because Python is an interpreted language the interpreter is active during code execution, so you don’t need to add instrumentation to your code; it is already present.

A simple example of profiling
>>> def count():
...     n = 100000
...     while n > 0:
...             n -= 1
...
>>> import cProfile
>>> cProfile.run('count()', 'countprof')
>>> import pstats
>>> p = pstats.Stats('countprof')
>>> print p
<pstats.Stats instance at 0x254fb48>
>>> p.print_stats()
Mon Jan  4 00:19:11 2010    countprof

 3 function calls in 0.012 CPU seconds

 Random listing order was used

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1    0.000    0.000    0.012    0.012 <string>:1(<module>)
 1    0.012    0.012    0.012    0.012 <stdin>:1(count)

An excellent video covering this is available from Mike Fletcher [Mike C. Fletcher]

http://us.pycon.org/2009/conference/schedule/event/15/

What we can take away from this and other sources appears to be:

  1. You need to be able to properly measure performance before you profile code (luckily Python makes this trivial)
  2. Save your profiling results so you can compare things before and after (in other words did your optimization actually help)
  3. Check your algorithms; a stupid algorithm will always be slow
  4. Find function calls that are used often or are slow (and thus where you should start looking at first)
  5. Look for operations or results that can be pooled (i.e. using a single database query with a JOIN statement instead of multiple queries to get a collection of data)
  6. Look for operations that can be cached (keep your results around if you will use them again)
  7. Anytime you rely upon external code (especially opaque or closed source code) you are very much at the mercy of someone else
  8. Anytime you rely upon an external service (a database, reading or writing a file from the local system, getting a web page, etc.) you are very much at the mercy of someone else (in other words watch out for IO which may show up as lots of time spent sleeping)
  9. Only optimize one thing at a time unless you are incredibly brave
  10. Profiling threaded code can be problematic, especially on multiprocessor systems

Of course I could be wrong. One last note, you can convert all this information into nice dot graphs using Gprof2Dot [GPROF2DOT].

Using cProfile

Using cProfile is easy; you import it as a module and then run code using it as a wrapper:

import cProfile
cProfile.run('foo()', 'foo_profile')

This will execute the function foo() and output the data from the profile run to the file foo_profile. This is great if you just want to profile a specific class or function within your program. Alternatively if you want to profile the entire program you can use cProfile to execute the Python program in question and profile the entire thing:

# /usr/lib64/python2.6/cProfile.py -o base64-profile /usr/lib64/python2.6/base64.py /etc/resolv.conf
# python

>>> import pstats
>>> p = pstats.Stats('base64-profile')
>>> p.print_stats()
Tue Jan  5 02:37:04 2010    test-output.1

 282 function calls in 0.008 CPU seconds

 Random listing order was used

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 2    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/getopt.py:16(<module>)
 1    0.000    0.000    0.002    0.002 /usr/lib64/python2.6/base64.py:3(<module>)
 4    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
 1    0.000    0.000    0.000    0.000 {open}
 3    0.000    0.000    0.000    0.000 {len}
 1    0.006    0.006    0.008    0.008 {execfile}
 256    0.000    0.000    0.000    0.000 {chr}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/getopt.py:39(GetoptError)
 2    0.000    0.000    0.000    0.000 {binascii.b2a_base64}
 1    0.000    0.000    0.008    0.008 <string>:1(<module>)
 1    0.001    0.001    0.001    0.001 /usr/lib64/python2.6/base64.py:326(test)
 2    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
 1    0.000    0.000    0.000    0.000 {method 'sort' of 'list' objects}
 1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/base64.py:285(encode)
 1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
 1    0.000    0.000    0.000    0.000 {range}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/getopt.py:52(getopt)

<pstats.Stats instance at 0x7fad1e6c1ab8>

As you can see I ran the file /etc/resolv.conf through the Python base64 encode/decode utility. The first obvious thing to jump out is the heavy use of the “chr()” function. This function returns a one-character string for a given integer and appears to be extremely efficient (virtually no CPU time was used). The only thing that really appears to have taken any time is the execfile() call, which parses and executes the file. As you can imagine, a Python application of any real size will generate a lot of output. Use of cProfile is covered in detail in http://docs.python.org/library/profile.html

Measuring Python code performance

There are a couple of ways to measure Python code execution; the most accurate for small pieces of code is the timeit module’s Timer class. You basically wrap your function in it and it reports how long things took to run; it can also take a repeat count, so for operations that are relatively fast you can run them a few thousand (or million) times to get an idea of how long they take. This module is generally best used when testing algorithms or other short pieces of code, since it won’t give details of where things are being slow (just an overall picture).
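
For example, a rough sketch of comparing two small snippets with timeit (the statements being timed are arbitrary):

import timeit

# time a list comprehension against map()/lambda, 100,000 runs each
t1 = timeit.Timer('[i * i for i in range(100)]')
t2 = timeit.Timer('map(lambda i: i * i, range(100))')

print t1.timeit(number=100000)
print t2.timeit(number=100000)

# repeat() runs the whole measurement several times; take the minimum
print min(t1.repeat(repeat=3, number=100000))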

But what about instrumenting code for long-term monitoring, or measuring the speed of external software and services such as database queries? One simple strategy is to use time.time() to get the current time at the start and end of the code you are interested in; subtract the two and you get the elapsed time. This allows you to wrap a call to a database: if the query takes more than, say, 5 seconds you can send a warning of some sort (log an alert, send an email, etc.). If you are processing database entries you can log the primary key and the start time when you start processing the entry, then once you are done processing (possibly after shoving it through several work queues) log the finish time, and ideally record these results somewhere.

A note on long term performance monitoring

A major advantage of architecting performance monitoring into your application is that you can generate long-term stats regarding your code, application and environment: things like how long database queries take (swap the disks for solid state disks and see what difference, if any, it makes), network latency and speed issues, etc. My advice would be to measure the performance of external libraries and services, since profiling won’t really address this. I would also measure the overall performance of things (how long from page request to page generation, or page generation plus writing of back-end data, etc.). Especially if you are splitting work up using work queues, profiling won’t help much since you can’t easily track the processing of a specific message or request within that context.

Additionally if you do this by using function decorators you can simply create a timing function and wrap your database query class or whatever function it is you want to monitor without having to modify your code significantly.
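
Here is one possible sketch of such a timing decorator (the threshold, the logging call and the decorated function are all just illustrative):

import time
import logging
from functools import wraps

def timed(threshold=1.0):
    # log a warning if the wrapped function takes longer than threshold seconds
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.time() - start
                if elapsed > threshold:
                    logging.warning("%s took %.3f seconds", func.__name__, elapsed)
        return wrapper
    return decorator

@timed(threshold=5.0)
def run_report_query(connection):
    pass                                    # placeholder for the real database work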

Disassembling Python code

I thought I should add a quick note on disassembling Python code (so you can find out approximately how many ticks it takes). In some cases where you suspect a computation or a task is causing a lot of context switching you may wish to look at increasing sys.setcheckinterval(). But simply twiddling the knobs until you get a better result is probably not the best way to go about this. Using the dis [DIS] module you can disassemble code; see “The trouble with ticks” in the second part of this series for an example. You may be lucky and find that raising the check interval helps (e.g. if your computation takes 120 ticks it can now complete in one run rather than being interrupted), but I would urge caution as it may have unintended side effects that are difficult to track down.

Shared resources, locking and performance

So what happens if you profile or time your code in testing and find that it is reasonably fast, but when you run it in a multi-threaded or multi-process environment your code crawls to a standstill or takes far more time than it should? Chances are you have some form of shared resource or locking that doesn’t create contention in small trials but does in production. The best way to deal with this is of course to avoid having any resources that require locking or can’t easily be shared. Additionally, I have found that the fewer shared resources, locks and other synchronization issues I have to deal with, the fewer bugs I create (imagine that). The two best tools for dealing with this are profiling and monitoring your code: code that spends a lot of time sleeping may be waiting on a shared lock, and if something takes too long to run (such as a database query) you may have an external resource that doesn’t share well (e.g. a MySQL MyISAM table forces table-level locking for writes, which can really kill performance when you have multiple INSERT or UPDATE statements happening at the same time).

Mapping critical code sections

If you must have shared resources you want to minimize their impact as much as possible: identify the minimal amount of code that needs to hold a lock on a shared resource and only wrap that code with the lock. One note: if your critical section relies on external code or services, all bets are off (if you hold a lock and then do a DB query that is very slow or fails, you are out of luck and your application will grind to a halt). Look for code that establishes a lock using locking primitives such as acquire() and release(), or threading.Lock() and multiprocessing.Lock(). Additionally, any databases you connect to may use table-level locking, so if one thread or process is running an INSERT or an UPDATE no other thread or process will be able to write to that table. You also want to ensure there are no deadlocks (i.e. thread #1 holds lock A while waiting to acquire lock B and thread #2 holds lock B while waiting to acquire lock A; neither one gives up its lock and neither one can move forwards). If you must have multiple locks I would strongly suggest using a locking hierarchy (e.g. always acquire lock A and then lock B so no deadlock can occur); there are several excellent articles available on this topic. [LOCKING HIERARCHY]
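
As a small illustration of a locking hierarchy, the sketch below always acquires lock_a before lock_b in every code path (the transfer() function and its dictionaries are invented for the example):

import threading

# lock_a is always acquired before lock_b, in every thread, so the classic
# A-waits-for-B / B-waits-for-A deadlock cannot occur
lock_a = threading.Lock()
lock_b = threading.Lock()

def transfer(source, destination, amount):
    lock_a.acquire()
    try:
        lock_b.acquire()
        try:
            # keep the critical section as small as possible; do any slow
            # external work (DB queries, network calls) outside of it
            source['balance'] -= amount
            destination['balance'] += amount
        finally:
            lock_b.release()
    finally:
        lock_a.release()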

A note on external C libraries

According to some sources Python is generally capable of running at around 10-20% of the speed of a compiled C program implementing the same algorithm (due to interpreter overhead, etc.). Generally speaking this won’t be a significant problem since many applications are IO bound (networks, disks, etc.) and modern CPUs are quite fast. But what happens when you start pegging the CPU because of computation being done in Python? Again there is good news here: Guido van Rossum comes from a scientific computing (a.k.a. number crunching) background and this is one of the first things Python was designed to handle. You can (with relative ease) write C or C++ code and use it to extend Python [PYTHON AND C]. This has already been done with a number of core modules; as mentioned above, the cProfile and profile modules accomplish the same work, but as cProfile is written in C it runs much faster. If you need to do heavy computation you will want to check out SciPy and NumPy, which are largely written in C and are quite fast. [SCIPY]

An Extension to the Advanced Threading pattern – spinlocks

Normally we would use the Thread.join() method, but in this case we need a little more flexibility, so we use a spinlock: the thread simply spins, checking the lock repeatedly, and once the lock is released it continues doing whatever it is supposed to.

One of the main problems with the producer-consumer pattern is starvation: what if the producer can’t go fast enough to feed the workers and the workers have to wait for work? This in and of itself is not an insurmountable problem, but we also want the Python script to run and exit in a timely manner (i.e. when all the work is done) and not before. So how do we avoid a starvation situation triggering an early end to the program? The simplest method is to use a lock hierarchy: the producer holds a spinlock that it releases once it is done feeding data into the queue. The consumers also hold spinlocks; if a consumer notices that the queue is empty it checks the producer spinlock, and if that is not yet released it knows this is a simple starvation problem and that it should keep rechecking the queue until there is more work. However, if the producer spinlock is marked as done, the consumer knows that no more work is coming and it can exit as well.

One of the simpler ways to implement this is to create an array for each pool of threads (i.e. one for producers and one for consumers), use the thread ID as the index and a value of 0 or 1 to indicate whether that thread is done; you can then use something like “while (0 in spinlockProducerThread)” to see if any threads are still running. This pattern can also be applied to the main thread: we simply have the main thread spawn off the producer and consumer threads and then spin until all the locks are released, indicating that all the work is done and the threads are finished. Additionally, this allows threads to exit when they are done (once the producer thread has pushed all the work into the queue there is no reason to leave it running; because the spinlock is external to the thread we can exit, unlike if we were to use the Thread.join() method). This pattern also works well if you have multiple levels of producers and consumers, e.g. a chain of modules where one downloads a web page, another processes the content within it, another creates something based on the content and so on. Finally, it lends itself to distributed systems: taking the locking mechanism and making it external to the running code (for example by using a database table as a shared blackboard for communication) allows multiple systems to join, do work and then leave (letting you dynamically allocate resources).

web_get_producer_consumer_basic_pattern.py (outline of code)
#!/usr/bin/env python

import sys
import time
import thread
import Queue

# GetWorkFromDB() and ProcessWork() are placeholders for your own code
number_of_producer_threads = 1
number_of_consumer_threads = 3

class ProducerThread:
    def __init__(self, myID):
        self.myID = myID
        while 1:
            work_todo = GetWorkFromDB()
            for work_item in work_todo:
                workQueue.put(work_item)
            break
        # mark this producer as finished before the thread exits
        spinlockProducerThread[myID] = 1
        sys.exit()

class ConsumerThread:
    def __init__(self, myID):
        self.myID = myID
        while 1:
            try:
                work_item = workQueue.get(block=False)
            except Queue.Empty:
                # The queue is empty, but it may only be temporarily empty;
                # only exit once every producer has marked itself as done
                if (0 not in spinlockProducerThread):
                    spinlockConsumerThread[myID] = 1
                    sys.exit()
            else:
                ProcessWork(work_item)

if __name__ == '__main__':

    spinlockProducerThread = [0] * number_of_producer_threads
    spinlockConsumerThread = [0] * number_of_consumer_threads

    workQueue = Queue.Queue()

    for i in range(number_of_producer_threads):
        try:
            thread.start_new_thread(ProducerThread, (i,))
        except thread.error, e:
            myID = i
            spinlockProducerThread[myID] = 1
            print "ProducerThread thread creation failed"
            print e

    for i in range(number_of_consumer_threads):
        try:
            thread.start_new_thread(ConsumerThread, (i,))
        except thread.error, e:
            myID = i
            spinlockConsumerThread[myID] = 1
            print "ConsumerThread thread creation failed"
            print e

    # main thread spins until every producer and consumer reports done
    while (0 in spinlockProducerThread or 0 in spinlockConsumerThread):
        time.sleep(1)

[PROFILE]

http://docs.python.org/library/profile.html

[Mike C. Fletcher]

Mike C. Fletcher (http://www.vrplumber.com)

[GPROF2DOT]

http://code.google.com/p/jrfonseca/wiki/Gprof2Dot

[DIS]

http://docs.python.org/library/dis.html

[LOCKING HIERARCHY]

http://www.ddj.com/architect/204801163

[PYTHON AND C]

http://docs.python.org/extending/extending.html

[SCIPY]

http://www.scipy.org/

Python Performance Part 2

May 31, 2010

Python Performance Part 1
Python Performance Part 3
Python Performance Part 4

In part one I covered a basic introduction to threads and the producer-consumer pattern. Now in part two I’m going to cover some (unfortunately common) pathological cases where threading in Python can make things much worse.

An example of when threading makes things worse

So last week we covered basic Python threading strategies and work queues, so if you have a program and want to speed it up simply thread it and it’ll run faster, right? Wrong. For certain types of workloads and computation using threading in Python will make things worse (in some cases much worse).

Take for example a simplified program (courtesy of David Beazley [David Beazley]) that does some number crunching (e.g. counting down from a large number to zero) several times. This eliminates a lot of variables (memory, disk IO, network IO, etc.) and should run very quickly on any modern CPU. If we want to count down from the large number twice, it should take about as long to do it sequentially as in parallel, since our problem is CPU bound, right?

Let’s try it:

count_sequential.py
#!/usr/bin/env python
import time

# a very simple count down function
def count(n):
    while n > 0:
        n -= 1

start_time = time.time()                # start time
count(100000000)
count(100000000)
stop_time = time.time()                 # stop time
elapsed_time = stop_time - start_time   # elapsed time (roughly)
print elapsed_time

On my machine with three runs I got run times of 17.37, 17.10 and 17.34 seconds which is pretty consistent.

And the threaded version:

count_threaded.py
#!/usr/bin/env python
import time
from threading import Thread

def count(n):
    while n > 0:
        n -= 1

start_time = time.time()
t1 = Thread(target=count, args=(100000000,))
t1.start()
t2 = Thread(target=count, args=(100000000,))
t2.start()
t1.join(); t2.join()
stop_time = time.time()
elapsed_time = stop_time - start_time
print elapsed_time

On my machine with three runs I got run times of 23.14, 23.11 and 23.51 seconds, which is pretty consistent as well. It is also quite a bit slower than the sequential version of this program (roughly 33% slower!). In David Beazley’s case he reports the threaded version taking almost twice as long on his system. Much like David Beazley I also tried running the threaded version of the Python code on a single CPU; on Linux you can use the “taskset” command to bind a program to a particular CPU. I got roughly the same results he did: on a single CPU the threaded version of the counting program took only 19.15, 19.38 and 19.29 seconds.

So we have two rather interesting questions: why does the sequential version of the program take so much less time, and why does the threaded version run faster on a single CPU than when it runs across multiple CPU cores?

Why threading doesn’t always work as expected – the GIL

This unexpected behavior is caused by the Global Interpreter Lock (GIL):

A Global Interpreter Lock (GIL) is a mutual exclusion lock held by a programming language interpreter thread to avoid sharing code that is not thread-safe with other threads. There is always one GIL for one interpreter process.

Usage of a Global Interpreter Lock in a language effectively limits concurrency of a single interpreter process with multiple threads — there is no or very little increase in speed when running the process on a multiprocessor machine. Due to signaling with a CPU-bound thread, it can cause a significant slowdown, even on single processors.

This is largely what makes threaded programming in Python easy (you have no real worries about concurrency since only one thread runs at a time), and also what can make it challenging for performance (since only one thread runs at a time). The last sentence also offers an explanation as to why our sequential code is running faster than the threaded code: “Due to signaling with a CPU-bound thread, it can cause a significant slowdown”, or in English: the time needed to constantly swap between the threads (even on an efficient system like Linux) is noticeable (and only gets worse with larger numbers of threads). But why does Python swap CPU-intensive tasks so much?

When a thread is running it holds the GIL until one of several things happens: the thread completes, the thread goes into an IO operation, or 100 ticks are reached. The first case is simple: a thread can call sys.exit() and kill itself off, which is the end of things, and the GIL is yielded back. The second case is also simple: if a thread blocks on an IO-bound operation (disk, network, etc.) it is put to sleep (chances are the IO will take time anyway, so there is no point waiting idly by), it yields the GIL and another thread is given a chance to run. The third case is a little more nuanced. For starters, what exactly is a tick? A tick is basically an interpreter instruction. By limiting the number of ticks a thread can run, a CPU-bound task that doesn’t want to exit or isn’t going to do IO won’t end up hogging the CPU. The number of ticks a thread is allowed to use before being stopped can be controlled with sys.setcheckinterval(), which can be changed on the fly. The most important thing to remember is that ticks are NOT time constrained.
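
For reference, the check interval can be read and changed at runtime; raising it means CPU-bound threads are interrupted less often, at the cost of slower thread switching and slower signal handling:

import sys

print sys.getcheckinterval()    # defaults to 100 ticks
sys.setcheckinterval(1000)      # let each thread run longer between switches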

If all of the above seems rather complicated, there is a good reason for it. Python internally doesn’t really handle the specific scheduling of threads; it leaves this up to the operating system. The rationale for this is simple: the Python developers feel that reinventing the wheel (handling thread scheduling) is probably not the best way to go, since modern operating systems have spent a significant amount of time and energy on getting threads to run efficiently and quickly (especially Linux). So internally Python essentially uses a form of cooperative multitasking (the thread yields the CPU when it exits, blocks on IO or hits the tick limit).

The trouble with ticks

So if a tick isn’t time constrained what’s the worst that can happen? Some CPU instructions take very little time to run, and conversely some take a very long time to run. The following program is very simple:

#!/usr/bin/env python

def bigtick():
 nums = xrange(100000000)
 -1 in nums

Which, when disassembled (more on this later), turns out to consist of:

>>> dis.dis(bigtick)
 2           0 LOAD_GLOBAL              0 (xrange)
 3 LOAD_CONST               1 (100000000)
 6 CALL_FUNCTION            1
 9 STORE_FAST               0 (nums)

 3          12 LOAD_CONST               2 (-1)
 15 LOAD_FAST                0 (nums)
 18 COMPARE_OP               6 (in)
 21 POP_TOP             
 22 LOAD_CONST               0 (None)
 25 RETURN_VALUE        

As you can see it is only about ten bytecode instructions, far less than the 100-tick limit imposed by default. So what happens when we run this? Well, not much of anything, but it does take a while (about 4 seconds on my machine). Much too slow; let’s hit ctrl-c and get out of it.

As you may have noticed, hitting ctrl-c doesn’t work; the program runs till it’s done, taking its own sweet time. Why is this? While the thread is running that long operation it basically blocks on any signals sent to it. Just imagine what happens if you have 10 child threads running code like this (now we’re deep into pathological-behavior land).

Working with the GIL

So how can we deal with some of this odd and downright bad behavior caused by the GIL? The first trick is to minimize the number of threads you are using. Because the operating system controls which thread is executed, it may execute a thread that has nothing to do (i.e. is still blocking on IO), meaning you pay (computationally speaking) for a swap to a thread that does nothing and then another swap to a thread that hopefully does some actual work. Or worse, you have a thread holding a shared lock that is waiting on something (e.g. a database write to complete) before it yields the lock; you can end up with threads being run that can’t do anything, leaving your computer spinning until the thread holding the lock is run and eventually completes, freeing the lock. Having threads sitting idly by will generally reduce performance (but of course if you get surges or spikes in work, having a spare pool of threads can be a very good thing).

As for dealing with the signal handling problem, the best advice I can find is to have the main thread handle signals and spawn off child threads that do the actual work (in other words, have the main program start the threads and then passively loop while the child threads work). This way the main thread won’t be busy working on something when a signal arrives, which would result in the signal being ignored.

Like any performance advice I suggest you take it with a grain of salt and first measure your performance, identify any issues and then fix them (which is covered in the third article).

Appendix

[David Beazley]

http://www.dabeaz.com/python/GIL.pdf

Python Performance Part 1

May 31, 2010

Python Performance Part 2
Python Performance Part 3
Python Performance Part 4

If you’ve been programming in Python chances are you have run into a situation where your code takes a lot longer to run than it probably should; and if you haven’t used Python, I’m going to show you how easy it is to write high performance and scalable code that will run fast.

These articles do not cover why you should use Python (it’s cool, it ships standard on every Linux distribution, all the kids at Google use it, etc.). They will specifically cover how you can speed up Python performance (parallelization, code profiling, performance measuring, etc.) and include code examples (for brevity I will use pseudo-code for some of the longer examples). I will also cover some Python internals so that you will understand why speeding up code doesn’t always work as expected (and what you can do about it). Please note that older versions of Python (e.g. 2.4.3 as shipped with Red Hat Enterprise Linux and CentOS) do not support some of the things discussed in these articles. The reference platform used for this article is Fedora 12 (basically Python 2.6.4, Beanstalkd 1.4.2, memcached 1.4.4 and MySQL 5.1.41).

A (super) quick introduction to Python threading

You have two basic modules that provide threading capabilities in Python. The first (and oldest) module is “thread” [THREAD], which provides low-level primitives such as creating a thread, exiting and some lock primitives so you can use shared resources safely. You don’t want to use this module, for reasons I will shortly explain. The second module is “threading” [THREADING], which builds upon the low-level primitives provided by “thread” (similar to how SocketServer, HTTPServer, etc. build on lower-level modules). In Python the main program runs as a single thread; from this thread you can create additional child threads (these run within the same process space and thus have shared memory). The main trick is that you need the main thread to keep running while the child threads are doing their thing; if the main thread exits then the child threads can get killed off unceremoniously, potentially leaving a mess. This is why you don’t want to use “thread”, as you would need to create your own spinlock to hold the main thread open while the child threads run. With the “threading” module the system provides the join() method, which “blocks the calling thread until the thread whose join() method is called is terminated.”

It should be noted that in many cases the main thread will continue running until threads started with threading.Thread.start() are done, but there are no guarantees, so I strongly suggest you use the join() method or create your own spinlock (especially if using older versions of Python).

Basic Threading Pattern

The basic threading pattern in Python (and most languages) is to take a bunch of work and split it up among different threads (insightful, yes?). The idea is that instead of doing things sequentially (one at a time) you do them in parallel (more than one at a time). For many tasks this works well, especially if the tasks are largely bound by IO (Input and Output) delays (especially for network related stuff).

web_get_parallel.py (no join())

#!/usr/bin/env python
import urllib2
from threading import Thread

# list of web pages to get
hosts = ["http://lwn.net/", "http://seifried.org/", "http://google.com/"]

# define a get URL function
def getURL(URL):
    urllib2.urlopen(URL)
    print "got URL: " + URL

# loop through the list of URLs, defining and starting a thread for each
for item in hosts:
    t = Thread(target=getURL, args=(item,))
    t.start()

But is this really faster than doing the work sequentially?

web_get_sequential.py

#!/usr/bin/env python
import urllib2

# list of web pages to get
hosts = ["http://lwn.net/", "http://seifried.org/", "http://google.com/"]

# define a get URL function
def getURL(URL):
    urllib2.urlopen(URL)
    print "got URL: " + URL

# loop through the list of URLs and get each one in turn
for item in hosts:
    getURL(item)

So based on this the parallel threaded example (not using join()) takes about 0.5 seconds according to the UNIX “time” command, and the sequential version takes 6 seconds. This is obviously not right: assuming each URL takes about the same amount of time to get (best case), the parallel version should take about 2 seconds (6/3 = 2).

So why is it only taking 0.5 seconds? Probably because the main thread exited before the child threads were done running (which means we didn’t finish our work). So let’s fix that:

web_get_parallel.py (with join())

#!/usr/bin/env python
import urllib2
from threading import Thread

thread_list = []
# list of web pages to get
hosts = ["http://lwn.net/", "http://seifried.org/", "http://google.com/"]

# define a get URL function
def getURL(URL):
    urllib2.urlopen(URL)
    print "got URL: " + URL

# loop through the list of URLs, defining and starting a thread for each
for item in hosts:
    t = Thread(target=getURL, args=(item,))
    t.start()
    thread_list.append(t)

# loop through the list of threads and join each one
for thread in thread_list:
    thread.join()

Great, the code now takes about 2.5 seconds, and we are actually getting the web pages correctly! As you can see this is a significant speedup: the parallel version takes only as long as the slowest web page (probably my site), while the sequential example takes as long as all the page fetches put together (we’ll cover how to measure performance properly later in this series).

Why Threading Helps Performance

The reason we get such a significant speedup in this case is that the program spends the majority of its time waiting for the remote web servers to respond to the requests it has made, and only about 1% of the time (in other words not very much) creating and sending those requests. When a Python thread does something IO related (reading or writing a file, sending a network request, waiting for a response, etc.) it essentially goes to sleep, at which point a different thread is given a chance to execute (so instead of simply waiting around, the program can do other work as well). Additionally, because threads share memory space the program won’t use much more memory (whereas if you split the work between different processes each one would have its own memory, leading to much duplication). A final advantage of memory sharing is that variables and objects can be accessed by multiple threads; there is no need to engineer inter-process communications (although with the multiprocessing module this is trivial, more on that in the fourth article of this series).

But what happens if we have a much larger workload (instead of 3 domains to get we have 3 million) and a much more complex workload (we have to get the web pages, extract data from them and process the data for example). If we try to start up 3 million threads to handle each domain individually I can pretty much guarantee you are not going to increase performance.

Advanced Threading pattern

So starting up one thread per task is probably not the most optimal way to go, especially if the workload is large or if it varies. But how can we efficiently divvy up the workload among a (possibly indeterminate) number of threads? Taking the total number of things to do, dividing by the number of threads and assigning that many things to each thread is not optimal: what if one thread gets saddled with URLs that all take much longer than average to download? We are then left with one thread still running while the rest have finished their work and are waiting idly. Also, the amount of work may not be known in advance, or we may want to be able to add more work; we may be reading URLs from a database or a file and not know in advance how many we have.

This is where Python work queues come in [WORKQUEUE]. A work queue provides a way to handle and share multiple objects, you put items into a queue and then retrieve them, the queue ensures that objects are entered correctly and removed correctly (so with multiple threads accessing a queue you don’t need to worry about locking and so on, you simply write to the queue or read from it and it works). Queues in Python can be FIFO (first in, first out, think of a pipe), LIFO (last in first out, think of a stack) and priority based (you can assign priorities to objects and then make sure that higher priority items are processed before lower priority items and so on).
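
For example, the three queue flavors in the standard Queue module (LifoQueue and PriorityQueue were added in Python 2.6):

import Queue

fifo = Queue.Queue()            # first in, first out
lifo = Queue.LifoQueue()        # last in, first out (a stack)
prio = Queue.PriorityQueue()    # lowest priority value comes out first

for n in (1, 2, 3):
    fifo.put(n)
    lifo.put(n)
    prio.put((10 - n, 'item %d' % n))   # (priority, payload) tuples

print fifo.get()    # 1
print lifo.get()    # 3
print prio.get()    # (7, 'item 3')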

Queue syntax is very simple: you create a queue, then you put objects into it and get objects out of it. This allows you to use the producer-consumer pattern [Producer-consumer problem]. The following example has a producer thread that reads the URL entries into a queue, and a group of worker threads that each pull an item from the queue, process it and repeat until the queue is empty, at which point they exit. In order to ensure that the main thread runs long enough to let the producer and consumer child threads finish their work, we’ll simply use the Thread.join() method to hold the main thread open until all the children have exited.

web_get_producer_consumer_basic_pattern.py
#!/usr/bin/env python

import urllib2
import sys
import Queue
from threading import Thread

host_list = ["http://lwn.net/", "http://seifried.org/", "http://google.com/", "http://yahoo.com/", "http://gmail.com/"]

thread_list = []
URL_work_queue = Queue.Queue()
number_of_consumer_threads = 3

def putURL(URL_list):
    for URL_item in URL_list:
        URL_work_queue.put(URL_item)

def getURL(thread_id):
    while 1:
        try:
            URL_toget = URL_work_queue.get(block=False)
        except Queue.Empty:
            print "thread exiting, id: " + str(thread_id)
            sys.exit()
        urllib2.urlopen(URL_toget)
        print "got URL: " + URL_toget

# fill the queue with work and block until we are done filling the queue

producer_thread = Thread(target=putURL, args=(host_list,))
producer_thread.start()
producer_thread.join()

# we can now start consumers

for i in range(number_of_consumer_threads):    
    t = Thread(target=getURL, args=(i,))
    t.start()
    thread_list.append(t)

for thread in thread_list:
    thread.join()

This pattern is efficient and can easily be extended to have multiple sets of threads and queues connecting them. Generally speaking I try to break the work tasks up into computationally intensive pieces (such as parsing a web page for content) and IO intensive tasks (such as requesting a web page or reading or writing a file). This allows tasks that do not require a lot of computation to quickly yield the GIL lock (more on this in the next article) allowing another thread to run (you want to do IO, especially network IO in parallel as much as possible since it is so slow). If we were to download the web page and process it within the same thread for example we would be limiting the number of simultaneous downloads since we’d be processing web pages when we could also be downloading them.

Now as for code quality the above code has a number of significant problems (mostly architecturally but one or two implementation-wise) that will be addressed in the later articles of this series.

Work Queues

As you can see, work queues provide an ideal mechanism for linking pools of threads. Implementing a thread-safe queue from scratch in a language such as C is a surprisingly complex task: what if two threads try to write to the queue at the same time? How do you ensure that the two objects are stored properly, or, if multiple threads are reading from the queue, how do you ensure that objects are handed out properly? The good news is that because Python provides these primitives (threads and queues), along with the Global Interpreter Lock (GIL), you really don’t have to care or spend much time getting this right; it’s built in.

Appendix

[THREAD]

http://docs.python.org/library/thread.html

[THREADING]

http://docs.python.org/library/threading.html

[WORKQUEUE]

http://docs.python.org/library/queue.html

[Producer-consumer problem]

http://en.wikipedia.org/wiki/Producer-consumer_problem

It feels slow – testing and verifying your network connection

May 1, 2010

So I have a backup network link (working from home means you need two network links) and it was feeling kind of slow. I had a Linksys BEFSX41 connected to it, which according to the specifications is an OK unit (does VPN, etc.), but in practice it felt really slow (web browsing was not fun). So let’s test this objectively, I thought.

First, obviously, was to check the speed: am I getting what I paid for? A quick visit to www.speedtest.net showed that I was indeed getting the 4 megabits down and 1 megabit up that I pay for (it’s a wireless link, so not super fast, but I don’t have to worry about backhoe fade). So if I’m getting good upload/download speeds why would it feel slow?

DNS

Luckily DNSSEC has been in the news a lot recently and several DNS testing sites have come up in various blogs and conversations. So I headed over to the ICSI Netalyzr, which promises to “Debug your Internet.” It’s a Java-based test and takes a while, but I have to say the results are worth it. It checks connection speed, filtering, DNS speed and filtering, and a few other things. Turns out DNS lookups were horribly slow (on the order of several thousand milliseconds... aka seconds). No wonder web browsing felt slow!

Turns out the BEFSX41 intercepts DNS lookups and proxies them, good for filtering, terrible for performance.

So I tried out a D-Link EBR-2310, which had even worse DNS performance. To add insult to injury it doesn’t support routing properly: on the BEFSX41 I can specify static routes, i.e. a router on 192.168.1.1 can get to 10.1.2.0/255.255.255.0 through the machine at 192.168.1.2, while the EBR-2310 simply doesn’t support any static routing. It also does the DNS proxy intercept, only worse than the BEFSX41 (about twice as slow, in other words completely unusable).

So off to the store I go for a Netgear RP614v4. I was hoping that because it was a relatively recent device it would have slightly better hardware and firmware. Luckily I was right. It’s a fairly limited device; you can set it up as a DHCP server but you don’t really have many (well, any) options as to what it serves out via DHCP (domain, DNS servers, default gateway, etc.; it does these all with a brain-dead default set). But it does DNS lookups in an average of 70-80ms (as opposed to 1-3 seconds).

On my main subnet Internet access is brokered through a pretty vanilla OpenBSD machine (apart from having IPv6 enabled it’s pretty bog standard) and DNS lookups are much faster. If anything this experience has taught me that if you want performance, go find a small cheap machine, load it up with OpenBSD and be happy. Time to buy a Soekris, I suppose. Oh, and if you want DNSSEC these hardware firewalls aren’t going to do the trick; they all pretty much only support short DNS replies, meaning that longer DNSSEC replies will be truncated (and thus broken). To test this you can use the OARC reply size test:

dig +short rs.dns-oarc.net txt

I also decided to test my network links for traffic shaping; it turns out my primary ISP does shape traffic and my backup ISP doesn’t. To see if yours does, check out the EFF page covering this.