Why is code using intermediate variables faster than code without?

I have encountered this weird behavior and failed to explain it. These are the benchmarks:

    py -3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
    10000 loops, best of 3: 97.7 usec per loop
    py -3 -m timeit "a = tuple(range(2000));  b = tuple(range(2000)); a==b"
    10000 loops, best of 3: 70.7 usec per loop

How come comparison with variable assignment is faster than using a one liner with temporary variables by more than 27%?

By the Python docs, garbage collection is disabled during timeit so it can't be that. Is it some sort of an optimization?

The results may also be reproduced in Python 2.x though to lesser extent.

Running Windows 7, CPython 3.5.1, Intel i7 3.40 GHz, 64 bit both OS and Python. Seems like a different machine I've tried running at Intel i7 3.60 GHz with Python 3.5.0 does not reproduce the results.

Running using the same Python process with timeit.timeit() @ 10000 loops produced 0.703 and 0.804 respectively. Still shows although to lesser extent. (~12.5%)

My results were similar to yours: the code using intermediate variables was pretty consistently at least 10-20 % faster in the Python 3.4 that I tired. However when I used IPython on the very same Python 3.4 interpreter, I got these results:

    In [1]: %timeit -n10000 -r20 tuple(range(2000)) == tuple(range(2000))
    10000 loops, best of 20: 74.2 µs per loop

    In [2]: %timeit -n10000 -r20 a = tuple(range(2000));  b = tuple(range(2000)); a==b
    10000 loops, best of 20: 75.7 µs per loop

Notably, I never managed to get even close to the 74.2 µs for the former when I used -mtimeit from the command line.

So this Heisenbug turned out to be something quite interesting. I decided to run the command with strace and indeed there is something fishy going on:

    % strace -o withoutvars python3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
    10000 loops, best of 3: 134 usec per loop
    % strace -o withvars python3 -mtimeit "a = tuple(range(2000));  b = tuple(range(2000)); a==b"
    10000 loops, best of 3: 75.8 usec per loop
    % grep mmap withvars|wc -l
    46
    % grep mmap withoutvars|wc -l
    41149

Now that is a good reason for the difference. The code that does not use variables causes the mmap system call be called almost 1000x more than the one that uses intermediate variables.

The withoutvars is full of mmap/munmap for a 256k region; these same lines are repeated over and over again:

    mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
    munmap(0x7f32e56de000, 262144)          = 0
    mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
    munmap(0x7f32e56de000, 262144)          = 0
    mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
    munmap(0x7f32e56de000, 262144)          = 0

The mmap call seems to be coming from the function _PyObject_ArenaMmap from Objects/obmalloc.c; the obmalloc.c also contains the macro ARENA_SIZE, which is #defined to be (256 << 10) (that is 262144); similarly the munmap matches the _PyObject_ArenaMunmap from obmalloc.c.

obmalloc.c says that

Prior to Python 2.5, arenas were never free()'ed. Starting with Python 2.5, we do try to free() arenas, and use some mild heuristic strategies to increase the likelihood that arenas eventually can be freed.

Thus these heuristics and the fact that Python object allocator releases these free arenas as soon as they're emptied lead to python3 -mtimeit 'tuple(range(2000)) == tuple(range(2000))' triggering pathological behaviour where one 256 kiB memory area is re-allocated and released repeatedly; and this allocation happens with mmap/munmap, which is comparatively costly as they're system calls - furthermore, mmap with MAP_ANONYMOUS requires that the newly mapped pages must be zeroed - even though Python wouldn't care.

The behaviour is not present in the code that uses intermediate variables, because it is using slightly more memory and no memory arena can be freed as some objects are still allocated in it. That is because timeit will make it into a loop not unlike

    for n in range(10000)
        a = tuple(range(2000))
        b = tuple(range(2000))
        a == b

Now the behaviour is that both a and b will stay bound until they're *reassigned, so in the second iteration, tuple(range(2000)) will allocate a 3rd tuple, and the assignment a = tuple(...) will decrease the reference count of the old tuple, causing it to be released, and increase the reference count of the new tuple; then the same happens to b. Therefore after the first iteration there are always at least 2 of these tuples, if not 3, so the thrashing doesn't occur.

Most notably it cannot be guaranteed that the code using intermediate variables is always faster - indeed in some setups it might be that using intermediate variables will result in extra mmap calls, whereas the code that compares return values directly might be fine.

Someone asked that why this happens, when timeit disables garbage collection. It is indeed true that timeit does it:

Note

By default, timeit() temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. This disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string. For example:

However, the garbage collector of Python is only there to reclaim cyclic garbage , i.e. collections of objects whose references form cycles. It is not the case here; instead these objects are freed immediately when the reference count drops to zero.

From: stackoverflow.com/q/36548518

Back to homepage or read more recommendations: