Why is code using intermediate variables faster than code without?
I have encountered this weird behavior and failed to explain it. These are the benchmarks:
py -3 -m timeit "tuple(range(2000)) == tuple(range(2000))" 10000 loops, best of 3: 97.7 usec per loop py -3 -m timeit "a = tuple(range(2000)); b = tuple(range(2000)); a==b" 10000 loops, best of 3: 70.7 usec per loop
How come comparison with variable assignment is faster than using a one liner with temporary variables by more than 27%?
By the Python docs, garbage collection is disabled during timeit so it can't be that. Is it some sort of an optimization?
The results may also be reproduced in Python 2.x though to lesser extent.
Running Windows 7, CPython 3.5.1, Intel i7 3.40 GHz, 64 bit both OS and Python. Seems like a different machine I've tried running at Intel i7 3.60 GHz with Python 3.5.0 does not reproduce the results.
Running using the same Python process with
timeit.timeit() @ 10000 loops produced 0.703 and 0.804 respectively. Still shows although to lesser extent. (~12.5%)
My results were similar to yours: the code using intermediate variables was pretty consistently at least 10-20 % faster in the Python 3.4 that I tired. However when I used IPython on the very same Python 3.4 interpreter, I got these results:
In : %timeit -n10000 -r20 tuple(range(2000)) == tuple(range(2000)) 10000 loops, best of 20: 74.2 µs per loop In : %timeit -n10000 -r20 a = tuple(range(2000)); b = tuple(range(2000)); a==b 10000 loops, best of 20: 75.7 µs per loop
Notably, I never managed to get even close to the 74.2 µs for the former when I used
-mtimeit from the command line.
So this Heisenbug turned out to be something quite interesting. I decided to run the command with
strace and indeed there is something fishy going on:
% strace -o withoutvars python3 -m timeit "tuple(range(2000)) == tuple(range(2000))" 10000 loops, best of 3: 134 usec per loop % strace -o withvars python3 -mtimeit "a = tuple(range(2000)); b = tuple(range(2000)); a==b" 10000 loops, best of 3: 75.8 usec per loop % grep mmap withvars|wc -l 46 % grep mmap withoutvars|wc -l 41149
Now that is a good reason for the difference. The code that does not use variables causes the
mmap system call be called almost 1000x more than the one that uses intermediate variables.
withoutvars is full of
munmap for a 256k region; these same lines are repeated over and over again:
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000 munmap(0x7f32e56de000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000 munmap(0x7f32e56de000, 262144) = 0 mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000 munmap(0x7f32e56de000, 262144) = 0
mmap call seems to be coming from the function
obmalloc.c also contains the macro
ARENA_SIZE, which is
#defined to be
(256 << 10) (that is
262144); similarly the
munmap matches the
obmalloc.c says that
Prior to Python 2.5, arenas were never
free()'ed. Starting with Python 2.5, we do try to
free()arenas, and use some mild heuristic strategies to increase the likelihood that arenas eventually can be freed.
Thus these heuristics and the fact that Python object allocator releases these free arenas as soon as they're emptied lead to
python3 -mtimeit 'tuple(range(2000)) == tuple(range(2000))' triggering pathological behaviour where one 256 kiB memory area is re-allocated and released repeatedly; and this allocation happens with
munmap, which is comparatively costly as they're system calls - furthermore,
MAP_ANONYMOUS requires that the newly mapped pages must be zeroed - even though Python wouldn't care.
The behaviour is not present in the code that uses intermediate variables, because it is using slightly more memory and no memory arena can be freed as some objects are still allocated in it. That is because
timeit will make it into a loop not unlike
for n in range(10000) a = tuple(range(2000)) b = tuple(range(2000)) a == b
Now the behaviour is that both
b will stay bound until they're *reassigned, so in the second iteration,
tuple(range(2000)) will allocate a 3rd tuple, and the assignment
a = tuple(...) will decrease the reference count of the old tuple, causing it to be released, and increase the reference count of the new tuple; then the same happens to
b. Therefore after the first iteration there are always at least 2 of these tuples, if not 3, so the thrashing doesn't occur.
Most notably it cannot be guaranteed that the code using intermediate variables is always faster - indeed in some setups it might be that using intermediate variables will result in extra
mmap calls, whereas the code that compares return values directly might be fine.
Someone asked that why this happens, when
timeit disables garbage collection. It is indeed true that
timeit does it:
timeit()temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. This disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string. For example:
However, the garbage collector of Python is only there to reclaim cyclic garbage , i.e. collections of objects whose references form cycles. It is not the case here; instead these objects are freed immediately when the reference count drops to zero.