Tracing JITs in the real world @ CPython Core Dev Sprint¶
Last week I got to take part in the CPython Core Developer Sprint in Cambridge, hosted by ARM and brilliantly organized by Diego Russo -- around 50 core devs and guests were there, and I was excited to join as one of the guests.
I had three main areas of focus:
- C API: this was a follow-up of what we discussed at the C API summit at EuroPython. The current C API is problematic, so we are exploring ideas for the development of PyNI (Python Native Interface), whose design will likely be heavily inspired by HPy. It's important to underline that this is just the beginning and the entire process will require multiple PEPs.
- fancycompleter: this is a small PR which I started months ago to enable colorful tab completions within the Python REPL. I wrote the original version of fancycompleter 15 years ago, but colorful completions work only in combination with PyREPL. Now that PyREPL is part of the standard library and enabled by default, we can finally upstream it. I hope to see it merged soon.
- "JIT stuff": I spent a considerable amount of time talking to the people who are working on the CPython JIT (in particular Mark, Brandt, Savannah, Ken Jin and Diego). Knowledge transfer went both ways: I learned a lot about the internal details of CPython's JIT, and conversely I shared some of the experience, pain points and gut feelings which I accumulated over many years of working on PyPy.
In particular, on the first day I presented a talk titled Tracing JIT and real world Python (slides and source code).
What follows is an annotated version of the slides.
CPython's new JIT and PyPy's JIT share fundamental similarities, as they're both tracing JITs.
I spent ~7 years of my career optimizing existing code for PyPy at a high-frequency trading firm, and I realized that I'm probably one of the few people in the world with actual experience in optimizing real world Python code for a tracing JIT.
I expect that some of the challenges which I faced will also apply to CPython, and I wanted to share my experience to make sure that CPython core devs are aware of them.
One lesson I learned is that the benchmarks in pyperformance are a good starting point, but they are not entirely representative of what you find in the wild.
The main goal of the talk is not to present solutions to these problems, but to raise awareness that they exist.
Until now, CPython's performance has been quite predictable: there are well-established "performance tricks" to make code faster, and generally speaking you can mostly reason about the speed of a given piece of code "locally".
Adding a JIT completely changes how we reason about performance of a given program, for two reasons:
- JITted code can be very fast if your code conforms to the heuristics applied by the JIT compiler, but unexpectedly slow(-ish) otherwise;
- the speed of a given piece of code might depend heavily on what happens elsewhere in the program, making it much harder to reason about performance locally.
The end result is that modifying a line of code can significantly impact seemingly unrelated code. This effect becomes more pronounced as the JIT becomes more sophisticated.
The CPython JIT is still pretty new and doesn’t give huge speedups yet. I expect that as it gets faster, its performance will start looking more and more like PyPy’s.
I delivered this talk at the Core Dev Sprint: I expected my audience to be familiar with CPython's JIT, and wanted to draw parallels with PyPy's.
Since the audience of this blog is different, let me briefly explain CPython's JIT first.
The explanations of both JITs are necessarily short, incomplete and highly simplified.
CPython JIT 101¶
Python source code is turned into bytecode. Bytecode is a sequence of "opcodes" (LOAD_FAST, BINARY_OP, etc.), and the CPython VM is an interpreter for those opcodes. Historically the VM was written by hand, and the main loop consisted of a big switch statement which executed the code corresponding to each opcode.
Nowadays things are different: the opcodes are written in a special DSL and the main interpreter loop is generated from this DSL. Additionally, the DSL describes how each opcode can be decomposed into multiple "microops".
When the interpreter detects a "hot loop", it starts the JIT. The JIT retroactively looks at the opcodes which were executed in the last iteration of the loop, and creates a "linear trace" which contains the equivalent microops. This process is called trace projection and the result is an unoptimized trace of microops.
Then, the JIT can produce an optimized trace, by reordering and removing redundant microops. Finally, the optimized trace is turned into executable code using the "copy & patch" technique.
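To make this more concrete, here is an illustrative sketch (not actual CPython output; the micro-op names are approximate) of a hot loop and the kind of linear trace that trace projection might record for one iteration:

```python
def count_up(n):
    total = 0
    for i in range(n):
        total += i   # after enough iterations this loop becomes "hot"
    return total

# Hypothetical micro-op trace for one iteration of the loop body
# (illustrative only; real micro-op names and guards differ):
#   _GUARD_BOTH_INT        guard: `total` and `i` are still ints
#   _BINARY_OP_ADD_INT     fast-path integer addition
#   _STORE_FAST            store the result back into `total`
#   _JUMP_TO_TOP           jump back to the top of the trace
```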
PyPy JIT 101¶
CPython's Python interpreter is written in C, and then compiled into an executable by gcc (or any other C compiler). Similarly, PyPy's Python interpreter is written in RPython, and then compiled into an executable by rpython.
Under the hood, rpython applies two separate transformations to the source code:
- it turns each function into C code, which is then fed to gcc to get the final executable;
- it turns each function into "jitcodes", a way to represent RPython's IR (intermediate representation). For each RPython function, the final ./pypy executable contains its compiled representation (generated by gcc) and its jitcode representation (embedded as static data into the executable).
In a way, RPython's jitcodes are equivalent to CPython's microops, as they are a low-level representation of the logic of each opcode.
When the interpreter detects a hot loop, it enters trace recording mode, which is essentially an interpreter which executes the jitcodes: the result is a linear unoptimized trace of all the jitcodes which were actually executed.
Similarly to CPython, PyPy then produces an optimized trace, which is then sent to the JIT backend for actual native code generation.
Tracing JITs work by recording a trace of all microops which are executed. The optimizer can then reason about what happens in the trace and remove unneeded operations.
However, sometimes we encounter an operation which is a black box from the point of view of the tracer: we call these "trace blockers", because the tracing JIT cannot see through them. In the case of CPython, this happens, for example, whenever we call any function implemented in C (because it doesn't have any corresponding "microops").
This is a simple function that computes pi, generated by ChatGPT. Its precise content is not important: what matters is that it's a nice purely numerical loop that the PyPy JIT can optimize very well.
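A minimal sketch of what such a function might look like (the actual slide code was generated by ChatGPT and may differ): a Leibniz-series approximation of pi, i.e. a tight, purely numerical loop.

```python
def compute_pi(n_terms):
    # pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
    acc = 0.0
    sign = 1.0
    for k in range(n_terms):
        acc += sign / (2 * k + 1)
        sign = -sign
    return 4.0 * acc

print(compute_pi(10_000_000))
```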
Same function as above, with a call to hic_sunt_leones(). This is actually an empty function which does absolutely nothing, but it is annotated in a special way so that the PyPy JIT cannot "enter" it, so it effectively behaves as a trace blocker.
In this example we use the special pypyjit.residual_call to simulate a trace blocker, but in real life we get one whenever we call any non-traceable function, in particular C extensions.
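A sketch of the modified version, assuming pypyjit.residual_call is used to wrap the call to the empty function (the exact wrapping on the slide may differ):

```python
import pypyjit

def hic_sunt_leones():
    pass  # does absolutely nothing

def compute_pi_blocked(n_terms):
    acc = 0.0
    sign = 1.0
    for k in range(n_terms):
        # opaque call: the JIT cannot trace inside it, so it acts as a trace blocker
        pypyjit.residual_call(hic_sunt_leones)
        acc += sign / (2 * k + 1)
        sign = -sign
    return 4.0 * acc
```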
The clean version runs 42x faster on PyPy than CPython - that's the JIT working perfectly. But with just one untraceable function call added to the loop, PyPy slows down to only 1.8x faster than CPython. That single line destroyed most of the JIT's effectiveness!
This happens because after the call the optimizer no longer knows whether its assumptions about the world are still true, and thus must be much more conservative.
I fear that for CPython, this will turn out to be a much bigger problem than for PyPy, for two reasons:
- nowadays it's virtually impossible to run Python code without using any C extension, either directly or indirectly.
- by construction, PyPy's JIT can see much more than CPython's JIT. Remember the slide about "jitcodes": any RPython function gets a "jitcode" equivalent, which means that the JIT can automatically trace inside builtins and the internals of the interpreter, whereas CPython can trace only inside pure Python code.
For example, PyPy's JIT can trace through range(), zip() and enumerate() automatically. CPython's JIT currently cannot, because they are implemented in C. CPython could add special cases for these common functions, but this approach doesn't scale in general.
The second big problem is what I call "data-driven control flow". This example has been autogenerated by ChatGPT and it's completely silly, but it's a good representation of what happens in real-life code.
In this example, fn takes 9 arguments, each of which can be None or a number. The function starts with a sequence of if <var> is None: ... checks. The function is then called repeatedly in a loop.
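A sketch of the pattern (the actual ChatGPT-generated slide code may differ):

```python
import random

def fn(a, b, c, d, e, f, g, h, i):
    # a sequence of data-dependent branches at the top of the function
    if a is None: a = 0
    if b is None: b = 0
    if c is None: c = 0
    if d is None: d = 0
    if e is None: e = 0
    if f is None: f = 0
    if g is None: g = 0
    if h is None: h = 0
    if i is None: i = 0
    return a + b + c + d + e + f + g + h + i

def main(n):
    total = 0
    for _ in range(n):
        # each argument is None or a number at random, so no single
        # combination of branches is "hot"
        args = [random.choice([None, 1]) for _ in range(9)]
        total += fn(*args)
    return total
```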
One of the assumptions of tracing JITs is that control flow tends to stay on the "hot path", and that it's enough to optimize that path to get good performance.
But in a case like this, each combination of None-ness selects a different path, and if we assume the data is evenly distributed, we find out that there is no hot path.
Let's see what happens when we execute on CPython and PyPy:
PyPy without JIT is "only" 2.3x slower than CPython, but when we enable the JIT, it becomes much worse. This happens because of an exponential explosion of code paths seen by the JIT.
In a normal compiler, an if statement is compiled as a diamond, and the control flow merges after each if:
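            if a is None
              /      \
           a = 0    (pass)
              \      /
           merge point
            if b is None
              /      \
            ...      ...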
A tracing JIT by definition follows what's happening during a concrete execution, so it sees only a concrete path in the control flow, with "guards" to ensure correctness:
When guard(a is None) fails enough times, we create a "bridge" and record another linear trace, following again the concrete control flow that happens now:
guard(a is None) ----> FAIL (side exit)
       |                     |
     a = 0                  pass
       |                     |
guard(b not None)     guard(b not None)
       |                     |
     b = 0                 b = 0
       |                     |
      ...                   ...
Note how b = 0 is effectively duplicated now. By design, PyPy's JIT never merges execution flow.
Looking inside PYPYLOG confirms our theory: we get "exponential tracing". The JIT has to compile separate optimized code for every unique combination of which parameters are None and which aren't. With 9 parameters, that could be up to 512 different combinations!
One possible mitigation is to rewrite conditional code to be "branchless" - using arithmetic tricks instead of if statements. But this makes code ugly and unreadable, and it's not always possible.
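A minimal illustration (not from the slides) of what such a branchless rewrite can look like; whether it actually helps depends on the specific case:

```python
# branchy: which path is taken depends on the data
def clamp_branchy(x, limit):
    if x > limit:
        x = limit
    return x

# branchless: same result, but a single straight-line path
def clamp_branchless(x, limit):
    over = x > limit                      # bool: 0 or 1
    return x * (1 - over) + limit * over  # arithmetic instead of a branch
```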
Despite years of working on this, I never found a really good solution. There were cases in which we had to continue running some piece of code on CPython because I never managed to make the PyPy version faster.
This pattern happens quite a lot, although it is often more subtle: in this silly example all the ifs are nicely grouped together at the start, but in a long trace they can be scattered across multiple places, and any kind of control flow contributes to the problem, not only ifs. In Python, this includes any kind of dynamic dispatch, exceptions, etc.
One possible solution for CPython's JIT is to try to merge (some) traces to avoid or limit the exponential explosion. However, it is worth underlining that tracing JITs shine precisely when they can optimize a long linear trace: if you try to compile shorter traces, you might quickly end up in a situation which is equivalent to the "trace blocker" problem described earlier.
I suspect this might be a fundamental limitation of tracing JITs.
Compared to the other two problems, this is less serious, but it's worth mentioning because of the prevalence of async (and thus implicitly generators) in modern Python.
Here's another silly function that counts Pythagorean triples using nested loops. This is our baseline version using plain loops.
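A sketch of the baseline (the actual slide code may differ):

```python
def count_triples_loops(n):
    count = 0
    for a in range(1, n):
        for b in range(1, n):
            for c in range(1, n):
                if a * a + b * b == c * c:
                    count += 1
    return count
```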
Here's the same algorithm refactored to use a generator function for the nested iteration. The "state of iteration" is implicitly stored in the local variables of the frame object associated with the range_product generator.
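A sketch of the generator-based version; range_product is the name used in the post, but this particular body is an assumption:

```python
def range_product(n):
    # the loop indices live in this generator's frame
    for a in range(1, n):
        for b in range(1, n):
            for c in range(1, n):
                yield a, b, c

def count_triples_gen(n):
    count = 0
    for a, b, c in range_product(n):
        if a * a + b * b == c * c:
            count += 1
    return count
```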
Here's the same functionality implemented as a traditional iterator class. The "state of iteration" is explicitly stored as attributes of RangeProductIter.
On CPython, the generator version is ~29% slower than the explicit loops. The iterator class is much slower, as one would intuitively expect.
However, on PyPy we see different results: RangeProductIter is basically the same speed as the baseline, while the generator version is slower. This happens because in the case of RangeProductIter the JIT is able to see the whole lifetime of the object and optimize it away entirely: instance variables become local variables, the call to __next__ is inlined, and we get the equivalent of explicit nested loops.
Generators, however, are required to create a frame object, and they are a fundamental case which the JIT cannot effectively trace through. In more complex real-world scenarios, we saw much worse slowdowns than these examples show.
This is a collection of other miscellaneous problems that I had to deal with. Generally speaking, we lack good tooling and profiler support. CPython needs a good story for explaining to users how to understand what's happening when the JIT is enabled.
Warmup is another big problem: in PyPy, very short programs tend to be slower than on CPython because JITting has a cost. Moreover, warmup is not an easily definable phase, as the linked paper shows. This is an area where CPython currently shines, as its JIT warms up very quickly. I think it will become slightly slower when it tries to optimize more aggressively, but hopefully warmup will overall be a lesser problem than on PyPy.
Moreover, it's very easy to accidentally make your code 2x, 5x or even 10x slower by changing seemingly innocent pieces of code. This is another reason why good tooling is essential.
Finally, the "long tail of JITting": every loop and every guard gets a counter, and we start JITting when it reaches a threshold. Given a sufficiently long running program, all counters reach the threshold eventually and we end up JITting much more than necessary, using too much memory and/or thrashing the cache. In many cases I found beneficial to just disable the JIT "after a while", with manually tuned heuristics.
These are slides which I didn't show during the live presentation; they show a case where a tracing JIT can shine: since the JIT sees a complete trace of an entire loop (including nested calls), it can easily remove a lot of the temporary objects which usually penalize Python performance.
In many cases, we can get the famous "zero-cost abstractions".
Let's look at a concrete example. We need to compute the barycenter of triangles that are serialized in a binary format. Each triangle has three points, each point has x and y coordinates. This simulates real world protocols such as protobuf, capnproto, etc.
This is what we use as a baseline: a bare loop, using struct.unpack_from to read 6 floats at a time.
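A sketch of the baseline reader; the exact serialization layout (six consecutive float64 values per triangle: x0, y0, x1, y1, x2, y2) and the function body are assumptions:

```python
import struct

TRIANGLE_SIZE = 6 * 8  # six float64 values per triangle

def read_loop(buf, n_triangles):
    total_x = total_y = 0.0
    for i in range(n_triangles):
        x0, y0, x1, y1, x2, y2 = struct.unpack_from('dddddd', buf, i * TRIANGLE_SIZE)
        total_x += (x0 + x1 + x2) / 3.0  # barycenter, x coordinate
        total_y += (y0 + y1 + y2) / 3.0  # barycenter, y coordinate
    return total_x, total_y
```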
Here's the "proper" object-oriented approach, similar to how modern
serialization libraries work. We create Triangle
and Point
classes that
provide a nice API for accessing the binary data. Each property access creates
new objects and calls struct.unpack_from. This is much more readable and
reusable, but creates many temporary objects.
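A sketch of what such classes might look like; Triangle and Point are the names used in the post, while the property names and layout are assumptions:

```python
import struct

class Point:
    def __init__(self, buf, offset):
        self._buf = buf
        self._offset = offset

    @property
    def x(self):
        return struct.unpack_from('d', self._buf, self._offset)[0]

    @property
    def y(self):
        return struct.unpack_from('d', self._buf, self._offset + 8)[0]

class Triangle:
    def __init__(self, buf, offset):
        self._buf = buf
        self._offset = offset

    @property
    def a(self):
        return Point(self._buf, self._offset)       # first vertex

    @property
    def b(self):
        return Point(self._buf, self._offset + 16)  # second vertex

    @property
    def c(self):
        return Point(self._buf, self._offset + 32)  # third vertex
```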
Here's how you'd use the object-oriented API. The code is much cleaner and more readable than the bare loop version. But notice how many object creations are happening: one Triangle object, six Point objects, plus all the intermediate tuples from struct.unpack_from.
As expected, on CPython read_proto is much slower than the bare one, roughly 6x slower. However, PyPy can fully optimize away all the abstraction overhead introduced by Triangle and Point.
In PyPy jargon we call this form of allocation removal "virtuals" (because we create "virtual objects" whose fields are represented as local variables) and it's probably the single most important optimization that PyPy does.
During my week in Cambridge I talked extensively with the CPython JIT devs about this and I hope I convinced them that this is what they should aim for 😊.
Note also that read_proto is actually faster than read_loop. This happens because in read_loop we do a single struct.unpack_from('dddddd', ...), while in read_proto we do a succession of six individual struct.unpack_from('d', ...) calls. It turns out that the JIT is able to trace into the second form but not into the first, which means that in read_loop we actually need to allocate a pseudo-tuple at each iteration.
The funny part is that I did not expect to get this result. I had to take the time to analyze the JIT traces of both versions to understand why read_loop was slower. This is probably the best illustration of how counterintuitive it is to reason about performance in a JITted world.
Acknowledgments¶
Thanks to Carl Friedrich Bolz-Tereick and Hood Chatham for feedback on the slides and the post.