Python Performance Part 3

Python Performance Part 1
Python Performance Part 2
Python Performance Part 4

In part one I covered a basic introduction to threads and the producer-consumer pattern. In part two I covered the Global Interpreter Lock (GIL) and some of the more pathological behavior it can cause. Now we’re going to go over how to profile code and how to instrument code to measure its performance, so you can know with some certainty whether the code actually needs to be fixed (and how it can be fixed). Finally we’ll go over some significant ways to improve our handling of the producer-consumer pattern so that it is more efficient and reliable.

Profiling vs. Performance measurement

Profiling and performance measurement are closely related, but there is a difference. Profiling will generally give you a great deal of detail: function calls, returns and exception events, and since Python supports deterministic profiling it will do so with an extremely high degree of accuracy. The output of Python profiling is staggering in its detail: the number of calls (ncalls), total time (tottime), per-call time (tottime divided by ncalls), cumulative time (including sub-calls, cumtime), and the filename, line number and function called (filename:lineno(function)). In other words, everything (this is one advantage of using an interpreted language). For most of us profiling should be used once the code has been written (premature optimization and all that) and used a bit (testing or otherwise).

Performance measurement tends to be coarser grained, for example adding hooks that record how long a database operation takes to return or how long it takes to generate a web page in total. You can also insert performance-measuring code into your application and use it to continuously monitor system state and health (i.e. average DB query time over the last 5 minutes, etc.) and alert you if a problem occurs, which is something that profiling is not really suited for.

Profiling Python code

Python provides two main profilers: “profile” and “cProfile” [PROFILE] (a third one, “hotshot”, has been deprecated and may be removed in the next version of Python according to the documentation). “profile” and “cProfile” are essentially identical with respect to usage, however the “cProfile” module (written in C) is much faster and won’t slow your code down significantly. The “profile” module will slow your code down significantly, but as it is written in Python it can easily be modified or extended should you need to. Because Python is an interpreted language the interpreter is already active during code execution, so you don’t need to add instrumentation to your code; it is effectively already present.

A simple example of profiling
>>> def count():
...     n = 100000
...     while n > 0:
...             n -= 1
...
>>> import cProfile
>>> cProfile.run('count()', 'countprof')
>>> import pstats
>>> p = pstats.Stats('countprof')
>>> print p
<pstats.Stats instance at 0x254fb48>
>>> p.print_stats()
Mon Jan  4 00:19:11 2010    countprof

 3 function calls in 0.012 CPU seconds

 Random listing order was used

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1    0.000    0.000    0.012    0.012 <string>:1(<module>)
 1    0.012    0.012    0.012    0.012 <stdin>:1(count)

An excellent video covering this is available from Mike Fletcher [Mike C. Fletcher]

http://us.pycon.org/2009/conference/schedule/event/15/

What we can take away from this and other sources appears to be:

  1. You need to be able to properly measure performance before you profile code (luckily Python makes this trivial)
  2. Save your profiling results so you can compare things before and after (in other words did your optimization actually help)
  3. Check your algorithms; a stupid algorithm will always be slow
  4. Find function calls that are used often or are slow (and thus where you should start looking at first)
  5. Look for operations or results that can be pooled (i.e. using a single database query with a JOIN statement instead of multiple queries to get a collection of data)
  6. Look for operations that can be cached (keep your results around if you will use them again; a minimal caching sketch follows below)
  7. Anytime you rely upon external code (especially opaque or closed source code) you are very much at the mercy of someone else
  8. Anytime you rely upon an external service (a database, reading or writing a file from the local system, getting a web page, etc.) you are very much at the mercy of someone else (in other words watch out for IO which may show up as lots of time spent sleeping)
  9. Only optimize one thing at a time unless you are incredibly brave
  10. Profiling threaded code can be problematic, especially on multi-processor systems

Of course I could be wrong. One last note: you can convert all this information into nice dot graphs using Gprof2Dot [GPROF2DOT].
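On the subject of caching (takeaway six above), here is a minimal sketch of result caching. The get_page function and the use of urllib2 are just illustrative assumptions; the point is simply that the result of a slow operation is stored and reused instead of being recomputed:

import urllib2

_cache = {}

def get_page(url):
    # Return the body of url, caching results so repeated requests
    # for the same page don't redo the slow network fetch.
    if url not in _cache:
        _cache[url] = urllib2.urlopen(url).read()
    return _cache[url]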

Using cProfile

Using cProfile is easy; you import it as a module and then run code using it as a wrapper:

import cProfile

cProfile.run('foo()', 'foo_profile')

This will execute the function foo() and output the data from the profile run to the file foo_profile. This is great if you just want to profile a specific class or function within your program. Alternatively if you want to profile the entire program you can use cProfile to execute the Python program in question and profile the entire thing:

# /usr/lib64/python2.6/cProfile.py -o base64-profile /usr/lib64/python2.6/base64.py /etc/resolv.conf
# python

>>> import pstats
>>> p = pstats.Stats('base64-profile')
>>> p.print_stats()
Tue Jan  5 02:37:04 2010    test-output.1

 282 function calls in 0.008 CPU seconds

 Random listing order was used

 ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 2    0.000    0.000    0.000    0.000 {method 'items' of 'dict' objects}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/getopt.py:16(<module>)
 1    0.000    0.000    0.002    0.002 /usr/lib64/python2.6/base64.py:3(<module>)
 4    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
 1    0.000    0.000    0.000    0.000 {open}
 3    0.000    0.000    0.000    0.000 {len}
 1    0.006    0.006    0.008    0.008 {execfile}
 256    0.000    0.000    0.000    0.000 {chr}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/getopt.py:39(GetoptError)
 2    0.000    0.000    0.000    0.000 {binascii.b2a_base64}
 1    0.000    0.000    0.008    0.008 <string>:1(<module>)
 1    0.001    0.001    0.001    0.001 /usr/lib64/python2.6/base64.py:326(test)
 2    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
 1    0.000    0.000    0.000    0.000 {method 'sort' of 'list' objects}
 1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/base64.py:285(encode)
 1    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
 1    0.000    0.000    0.000    0.000 {range}
 1    0.000    0.000    0.000    0.000 /usr/lib64/python2.6/getopt.py:52(getopt)

<pstats.Stats instance at 0x7fad1e6c1ab8>

As you can see I ran the file /etc/resolv.conf through the Python base64 encode/decode utility. The first obvious thing to jump out is heavy use of the “chr()” function, which was called 256 times. This function returns the one-character string for a given integer code, and it appears to be extremely efficient (virtually no CPU time was used). The only thing that really appears to have taken any time is the execfile() call, which loads and runs the script being profiled. As you can imagine, a Python application of any real size will generate a lot of output. Use of cProfile is covered in detail in http://docs.python.org/library/profile.html

Measuring Python code performance

There are a couple of ways to measure Python code execution; the most accurate for small pieces of code is the timeit module (via its Timer class). You basically wrap your statement in a Timer and it reports how long it took to run; it can also repeat the measurement, so for operations that are relatively fast you can run them a few thousand (or million) times to get an idea of how long they take. This module is generally best used when testing algorithms or other short pieces of code since it won’t give details of where things are being slow (just an overall picture).
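A minimal timeit session might look like this (the statement being timed is just an illustrative placeholder):

import timeit

# Time a small statement by running it one million times.
t = timeit.Timer("x = 2 ** 20")
print t.timeit(number=1000000)                 # total seconds for one million runs
print min(t.repeat(repeat=3, number=1000000))  # best of three such runs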

But what about instrumenting code for long-term monitoring, or measuring the speed of external software and services such as database queries? One simple strategy is to call time.time() at the start and end of the code you are interested in; subtract the two and you get the elapsed time. This allows you to wrap a call to a database; if the query takes more than, say, 5 seconds you can send a warning of some sort (log an alert, send an email, etc.). If you are processing database entries you can log the primary key and the start time when you begin processing the entity, then log the finish time once processing is done (possibly after shoving it through several work queues), and ideally store these results somewhere for later analysis.
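As a rough sketch of this idea (the run_query name, the DB-API cursor and the five-second threshold are illustrative assumptions, not part of any particular library), wrapping a database call might look like this:

import time
import logging

SLOW_QUERY_THRESHOLD = 5.0  # seconds; pick a value that makes sense for your system

def run_query(cursor, sql):
    # Time a single database query and warn if it was suspiciously slow.
    start = time.time()
    cursor.execute(sql)
    elapsed = time.time() - start
    if elapsed > SLOW_QUERY_THRESHOLD:
        logging.warning("slow query (%.2f seconds): %s", elapsed, sql)
    return cursor.fetchall()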

A note on long term performance monitoring

A major advantage of architecting in performance monitoring is that you can generate long-term stats regarding your code, application and environment: how long database queries take (swap the disks for solid state disks and see what difference, if any, it makes), network latency and speed issues, etc. My advice would be to measure the performance of external libraries and services, since profiling won’t really address this. I would also measure the overall performance of things (how long from page request to page generation, or page generation plus the writing of back-end data, etc.). Especially if you are splitting work up using work queues, profiling won’t help much since you can’t easily track the processing of a specific message or request within that context.

Additionally, if you do this using function decorators you can simply create a timing function and wrap your database query class (or whatever function it is you want to monitor) without having to modify your code significantly.
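A minimal sketch of such a timing decorator (the names timed and get_user are purely illustrative) could be:

import time
import functools

def timed(func):
    # Wrap func so every call reports how long it took, without touching func itself.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        try:
            return func(*args, **kwargs)
        finally:
            print "%s took %.3f seconds" % (func.__name__, time.time() - start)
    return wrapper

@timed
def get_user(user_id):
    time.sleep(0.1)  # stand-in for a real database query
    return {"id": user_id}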

Disassembling Python code

I thought I should add a quick note on disassembling Python code (so you can find out approximately how many bytecode instructions, or “ticks”, it consists of). In cases where you suspect a computation or a task is causing a lot of context switching you may wish to look at increasing sys.setcheckinterval(). Of course, simply twiddling the knobs until you get a better result is probably not the easiest way to accomplish this. Using the dis [DIS] module you can disassemble code; see “The trouble with ticks” in the second part of this series for an example. You may be lucky and find that raising sys.setcheckinterval() helps (e.g. if your computation takes 120 ticks it can now complete in one run rather than being interrupted), but I would urge extreme caution as it may have unintended side effects that are difficult to track down.
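For instance, disassembling the count() function from earlier shows the individual bytecode instructions the interpreter executes:

import dis

def count():
    n = 100000
    while n > 0:
        n -= 1

# Prints one line per bytecode instruction; the check interval set by
# sys.setcheckinterval() is counted in these instructions.
dis.dis(count)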

Shared resources, locking and performance

So what happens if you profile or time your code in testing and find that it is reasonably fast, but when you run it in a multi-threaded or multi-process environment it crawls to a standstill or takes far more time than it should? Chances are you have some form of shared resource or locking that doesn’t create contention in small trials but does in production. The best way to deal with this is of course to avoid having any resources that require locking or can’t easily be shared. Additionally, I have found that the fewer shared resources, locks and other synchronization issues I have to deal with, the fewer bugs I create (imagine that). Beyond that, the two best ways to find these problems are to profile and to monitor your code: code that spends a lot of time sleeping may be waiting on a shared lock, and if something takes too long to run (such as a database query) you may have an external resource that doesn’t share well (i.e. a MySQL MyISAM table forces table-level locking for inserts, which can really kill performance when you have multiple INSERT or UPDATE statements happening at the same time).

Mapping critical code sections

If you must have shared resources you want to minimize their impact as much as possible: identify the minimal amount of code that needs to hold a lock on a shared resource and only wrap that code with the lock. One note: if your critical section relies on external code or services all bets are off (if you hold a lock and then do a DB query that turns out to be very slow or fails, you are out of luck and your application will grind to a halt). Look for code that establishes a lock using locking primitives such as acquire() and release() on threading.Lock() or multiprocessing.Lock() objects. Additionally, any databases you connect to may use table-level locking, so if one thread or process is running an INSERT or an UPDATE no other thread or process will be able to write to that table. You also want to ensure there are no deadlocks (i.e. thread #1 holds lock A while waiting to acquire lock B, thread #2 holds lock B while waiting to acquire lock A, neither one gives up its lock and neither one can move forward). If you must hold multiple locks I would strongly suggest using a locking hierarchy (e.g. always acquire lock A before lock B so no deadlock can occur); there are several excellent articles available on this topic. [LOCKING HIERARCHY]
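A minimal sketch of a locking hierarchy with a small critical section (the two locks and the shared counter are just illustrative) might look like this:

import threading

lock_a = threading.Lock()
lock_b = threading.Lock()
shared_counter = 0

def update_both():
    global shared_counter
    # Always acquire lock_a before lock_b; with a fixed ordering two threads
    # can never each hold one lock while waiting for the other.
    with lock_a:
        with lock_b:
            shared_counter += 1  # keep the critical section as small as possible
    # Slow work (DB queries, file IO, etc.) happens outside the locks.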

A note on external C libraries

According to some sources Python is generally capable of running at around 10-20% of the speed of a compiled C program for the same algorithm (due to interpreter overhead, etc.). Generally speaking this won’t be a significant problem since many applications are IO bound (networks, disks, etc.) and modern CPUs are quite fast. But what happens when you start pegging the CPU because of computation being done in Python? Again there is good news: Guido van Rossum comes from a scientific computing (a.k.a. number crunching) background and this is one of the first things Python was designed to handle. You can (with relative ease) write C or C++ code and use it to extend Python [PYTHON AND C]. This has already been done with a number of core modules; as mentioned above, cProfile and profile accomplish the same work, but because cProfile is written in C it runs much faster. If you need to do heavy computation you will want to check out NumPy and SciPy, which are largely implemented in C (and Fortran) and are quite fast. [SCIPY]
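If you don’t want to write a full extension module, the ctypes module in the standard library is a lighter-weight way to call into existing C libraries. This sketch assumes a Linux system where the C library is available as libc.so.6:

import ctypes

# Load the platform C library and call one of its functions directly.
libc = ctypes.CDLL("libc.so.6")
print libc.strlen("hello world")  # prints 11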

An Extension to the Advanced Threading pattern – spinlocks

Normally we would use the join() method from the threading module, but in this case we need a little more flexibility, so we use a spinlock: the thread simply spins, repeatedly checking a flag, and once the flag is released it continues doing whatever it is supposed to do.

One of the main problems with the producer-consumer pattern is starvation. What if the producer can’t go fast enough to feed the workers and the workers have to wait for work? This in and of itself is not an insurmountable problem; however, we also want the Python script to run and exit in a timely manner (i.e. when all the work is done) and not before, so how do we avoid a starvation situation triggering an early end of the program?

The simplest method is to use a lock hierarchy: the producer sets a spinlock flag that it releases once it is done feeding data into the queue. The consumers also set spinlock flags. If a consumer notices that the queue is empty it checks the producer’s flag; if that flag is not yet set the consumer knows this is a simple starvation problem and that it should keep rechecking the queue until there is more work. However, if the producer’s flag is marked done, the consumer knows that no more work is coming and it can also exit. One of the simpler ways to implement this is to create an array for each pool of threads (i.e. one for producers and one for consumers), use the thread ID as the index and a value of 0 or 1 to indicate whether the thread is done, and then use something like “while (0 in spinlockProducerThread)” to see if any threads are still running.

Ultimately this pattern can also be applied to the main thread: we simply have the main thread spawn the producer and consumer threads and then spin until all the flags are set, indicating that all the work is done and the threads are finished. It also allows threads to exit as soon as they are done (once the producer thread has pushed all the work into the queue there is no reason to leave it running; because the spinlock is external to the thread it can simply exit, unlike with the join() approach). This pattern works really well if you have multiple levels of producers and consumers, i.e. a chain of modules that downloads a web page, processes the content within it, then creates something based on the content, and so on. It also lends itself to distributed systems: making the locking mechanism external to the running code (for example using a database table as a shared blackboard) allows multiple systems to join, do work and then leave (letting you dynamically allocate resources).

web_get_producer_consumer_basic_pattern.py (outline of code)
import sys
import time
import thread
import Queue

# GetWorkFromDB() and ProcessWork() are placeholders for your own code;
# this is an outline of the pattern, not a complete program.

class ProducerThread:
    def __init__(self, myID):
        self.myID = myID
        while 1:
            work_todo = GetWorkFromDB()
            for work_item in work_todo:
                workQueue.put(work_item)
            break
        # Flag this producer as finished so the consumers and the main
        # thread know that no more work is coming from it.
        spinlockProducerThread[myID] = 1
        sys.exit()

class ConsumerThread:
    def __init__(self, myID):
        self.myID = myID
        while 1:
            try:
                work_item = workQueue.get(block=False)
            except Queue.Empty:
                # The queue is empty, but it may only be temporarily empty
                # (a producer is still running) or it may be empty because
                # all of the work really is done.
                if (0 not in spinlockProducerThread):
                    # Every producer has finished, so no more work is coming.
                    spinlockConsumerThread[myID] = 1
                    sys.exit()
            else:
                ProcessWork(work_item)

if __name__ == '__main__':

    number_of_producer_threads = 2   # example values, tune for your workload
    number_of_consumer_threads = 4

    # 0 = thread still running, 1 = thread finished (or failed to start)
    spinlockProducerThread = [0] * number_of_producer_threads
    spinlockConsumerThread = [0] * number_of_consumer_threads

    workQueue = Queue.Queue()

    for i in range(number_of_producer_threads):
        try:
            thread.start_new_thread(ProducerThread, (i,))
        except thread.error, e:
            # Mark the slot as done so the main loop does not wait forever.
            spinlockProducerThread[i] = 1
            print "ProducerThread thread creation failed"
            print e

    for i in range(number_of_consumer_threads):
        try:
            thread.start_new_thread(ConsumerThread, (i,))
        except thread.error, e:
            spinlockConsumerThread[i] = 1
            print "ConsumerThread thread creation failed"
            print e

    # Spin until every producer and consumer has flagged itself as done.
    while (0 in spinlockProducerThread or 0 in spinlockConsumerThread):
        time.sleep(1)

[PROFILE]

http://docs.python.org/library/profile.html

[Mike C. Fletcher]

Mike C. Fletcher (http://www.vrplumber.com)

[GPROF2DOT]

http://code.google.com/p/jrfonseca/wiki/Gprof2Dot

[DIS]

http://docs.python.org/library/dis.html

[LOCKING HIERARCHY]

http://www.ddj.com/architect/204801163

[PYTHON AND C]

http://docs.python.org/extending/extending.html

[SCIPY]

http://www.scipy.org/
