A snake which bites its tail: PyPy JITting itself

Originally published on the PyPy blog.

We have to admit: even if we have been writing for years about the fantastic speedups that the PyPy JIT gives, we, the PyPy developers, still don't use it for our daily routine. Until today :-).

Readers brave enough to run translate.py to translate PyPy by themselves surely know that the process takes quite a long time to complete, about a hour on super-fast hardware and even more on average computers. Unfortunately, it happened that translate.py was a bad match for our JIT and thus ran much slower on PyPy than on CPython.

One of the main reasons is that the PyPy translation toolchain makes heavy use of custom metaclasses, and until few weeks ago metaclasses disabled some of the central optimizations which make PyPy so fast. During the recent Düsseldorf sprint, Armin and Carl Friedrich fixed this problem and re-enabled all the optimizations even in presence of metaclasses.

So, today we decided that it was time to benchmark again PyPy against itself. First, we tried to translate PyPy using CPython as usual, with the following command line (on a machine with an "Intel(R) Xeon(R) CPU W3580 @ 3.33GHz" and 12 GB of RAM, running a 32-bit Ubuntu):

$ python ./translate.py -Ojit targetpypystandalone --no-allworkingmodules

... lots of output, fractals included ...

[Timer] Timings:
[Timer] annotate                       ---  252.0 s
[Timer] rtype_lltype                   ---  199.3 s
[Timer] pyjitpl_lltype                 ---  565.2 s
[Timer] backendopt_lltype              ---  217.4 s
[Timer] stackcheckinsertion_lltype     ---   26.8 s
[Timer] database_c                     ---  234.4 s
[Timer] source_c                       ---  480.7 s
[Timer] compile_c                      ---  258.4 s
[Timer] ===========================================
[Timer] Total:                         --- 2234.2 s

Then, we tried the same command line with PyPy (SVN revision 78903, x86-32 JIT backend, downloaded from the nightly build page):

$ pypy-c-78903 ./translate.py -Ojit targetpypystandalone --no-allworkingmodules

... lots of output, fractals included ...

[Timer] Timings:
[Timer] annotate                       ---  165.3 s
[Timer] rtype_lltype                   ---  121.9 s
[Timer] pyjitpl_lltype                 ---  224.0 s
[Timer] backendopt_lltype              ---   72.1 s
[Timer] stackcheckinsertion_lltype     ---    7.0 s
[Timer] database_c                     ---  104.4 s
[Timer] source_c                       ---  167.9 s
[Timer] compile_c                      ---  320.3 s
[Timer] ===========================================
[Timer] Total:                         --- 1182.8 s

Yes, it's not a typo: PyPy is almost two times faster than CPython! Moreover, we can see that PyPy is faster in each of the individual steps apart compile_c, which consists in just a call to make to invoke gcc. The slowdown comes from the fact that the Makefile also contains a lot of calls to the trackgcroot.py script, which happens to perform badly on PyPy but we did not investigate why yet.

However, there is also a drawback: on this specific benchmark, PyPy consumes much more memory than CPython. The reason why the command line above contains --no-allworkingmodules is that if we include all the modules the translation crashes when it's complete at 99% because it consumes all the 4GB of memory which is addressable by a 32-bit process.

A partial explanation if that so far the assembler generated by the PyPy JIT is immortal, and the memory allocated for it is never reclaimed. This is clearly bad for a program like translate.py which is divided into several independent steps, and for which most of the code generated in each step could be safely be thrown away when it's completed.

If we switch to 64-bit we can address the whole 12 GB of RAM that we have, and thus translating with all working modules is no longer an issue. This is the time taken with CPython (note that it does not make sense to compare with the 32-bit CPython translation above, because that one does not include all the modules):

$ python ./translate.py -Ojit

[Timer] Timings:
[Timer] annotate                       ---  782.7 s
[Timer] rtype_lltype                   ---  445.2 s
[Timer] pyjitpl_lltype                 ---  955.8 s
[Timer] backendopt_lltype              ---  457.0 s
[Timer] stackcheckinsertion_lltype     ---   63.0 s
[Timer] database_c                     ---  505.0 s
[Timer] source_c                       ---  939.4 s
[Timer] compile_c                      ---  465.1 s
[Timer] ===========================================
[Timer] Total:                         --- 4613.2 s

And this is for PyPy:

$ pypy-c-78924-64 ./translate.py -Ojit

[Timer] Timings:
[Timer] annotate                       ---  505.8 s
[Timer] rtype_lltype                   ---  279.4 s
[Timer] pyjitpl_lltype                 ---  338.2 s
[Timer] backendopt_lltype              ---  125.1 s
[Timer] stackcheckinsertion_lltype     ---   21.7 s
[Timer] database_c                     ---  187.9 s
[Timer] source_c                       ---  298.8 s
[Timer] compile_c                      ---  650.7 s
[Timer] ===========================================
[Timer] Total:                         --- 2407.6 s

The results are comparable with the 32-bit case: PyPy is still almost 2 times faster than CPython. And it also shows that our 64-bit JIT backend is as good as the 32-bit one. Again, the drawback is in the consumed memory: CPython used 2.3 GB while PyPy took 8.3 GB.

Overall, the results are impressive: we knew that PyPy can be good at optimizing small benchmarks and even middle-sized programs, but as far as we know this is the first example in which it heavily optimizes a huge, real world application. And, believe us, the PyPy translation toolchain is complex enough to contains all kinds of dirty tricks and black magic that make Python lovable and hard to optimize :-).

A snake which bites its tail: PyPy JITting itself

Comments