In Python, the `dis` module allows disassembly of Python code into the individual instructions executed by the interpreter (usually CPython) for each line. Passing a module, function or other piece of code to the `dis.dis` function prints a human-readable representation of the underlying, disassembled bytecode. This is useful for analyzing and hand-tuning tight loops or performing other kinds of necessary, fine-grained optimizations.
The main function you will interact with when disassembling Python code is `dis.dis`. It takes either a function, method, class, module, code string, generator or byte sequence of raw bytecode and prints the disassembly of that code object to stdout (if no explicit file argument is specified). In the case of a class, it disassembles each method (including static and class methods). For a module, it disassembles all functions in that module.
Let’s see this in practice. Take the following code:
```python
import dis

class Foo(object):
    def __init__(self):
        pass

    def foo(self, x):
        return x + 1

def bar():
    x = 5
    y = 7
    z = x + y
    return z

def main():
    dis.dis(bar)  # disassembles `bar`
    dis.dis(Foo)  # disassembles each method in `Foo`
```
This will print:
```
 14           0 LOAD_CONST               1 (5)
              3 STORE_FAST               0 (x)

 15           6 LOAD_CONST               2 (7)
              9 STORE_FAST               1 (y)

 16          12 LOAD_FAST                0 (x)
             15 LOAD_FAST                1 (y)
             18 BINARY_ADD
             19 STORE_FAST               2 (z)

 17          22 LOAD_FAST                2 (z)
             25 RETURN_VALUE

Disassembly of __init__:
  8           0 LOAD_CONST               0 (None)
              3 RETURN_VALUE

Disassembly of foo:
 11           0 LOAD_FAST                1 (x)
              3 LOAD_CONST               1 (1)
              6 BINARY_ADD
              7 RETURN_VALUE
```
Also, we can disassemble an entire module from the command line using `python -m dis module_file.py`. Either way, at this point, we should probably discuss the format of the disassembly output. The columns are the following:
1. The line number in the original source code that the following instructions correspond to.
2. The address (offset) of the bytecode instruction.
3. The name (mnemonic) of the instruction.
4. The argument of the instruction, typically an index into the code block's name or constant table.
5. The human-friendly resolution of the argument index in (4) to the actual value or name being referenced.
For (4), it is important to understand that all code objects in Python, that is, isolated code blocks like functions, have internal name and constant tables. These tables are simply lists: the constant table holds constants such as string literals, numbers or special values such as `None` that appear at least once in the code block, while the name table holds a list of variable names. These variable names are, in turn, keys into a dictionary mapping such symbols to actual values. The reason why instruction arguments are indices into tables, and not the values stored in those tables, is so that arguments can have uniform length (always two bytes). As you can imagine, storing variable-length strings directly in the bytecode would make advancing the program counter a great deal more complex.
I just mentioned the concept of code objects. But what are these? Simply put, code objects are the compiled representation of a piece of code. This representation can then be executed by the CPython interpreter, which is the most common and popular implementation of the Python programming language. It is written in C and uses these code objects to execute Python code instruction by instruction. Next to CPython, there exist a number of other Python implementations, including PyPy (which uses a JIT compiler, often making it much faster than CPython) and Jython.
There are a few ways to generate code objects. The easiest is to access the `__code__` member that all functions and methods have (remember that methods are just functions with an additional instance argument):
```python
def foo():
    pass

In : foo.__code__
Out: <code object foo at 0x109d419c0, file "<string>", line 1>
```
Alternatively, you can compile any piece of code yourself, using the `compile` builtin. This function returns a code object. It takes the following parameters:
- The source to compile, given as a string (or an AST object).
- The name of the file in which the code is written, or a descriptive string if no such file exists (e.g. `'<string>'`).
- The compilation mode, which can be one of the following three strings:
  - `'exec'`: Allows a whole block of statements to be parsed.
  - `'eval'`: Allows only a single expression to be parsed (no statements).
  - `'single'`: Interprets and evaluates code like an interactive shell. This means the code will be evaluated and any non-`None` result will be printed.
```python
code_object = compile('x = 5; x += 1', '<string>', 'exec')  # OK
code_object = compile('x = 5; x += 1', '<string>', 'eval')  # Error
```
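To see the difference between the modes in action, here is a small sketch (the `'<string>'` filename is just a descriptive placeholder, as described above):

```python
# 'eval' compiles a single expression whose value we can retrieve.
eval_code = compile('5 + 1', '<string>', 'eval')
print(eval(eval_code))  # 6

# 'single' behaves like the interactive shell: executing the code
# object prints any non-None expression result via sys.displayhook.
single_code = compile('5 + 1', '<string>', 'single')
exec(single_code)  # prints 6
```

The difference matters when embedding user-supplied snippets: `'eval'` gives you a value back, while `'single'` mimics REPL echoing.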
These code objects provide all the information a Python interpreter such as CPython needs to execute the code represented by the object. For this, the code object has several interesting attributes:
- `co_code`: The actual bytecode, as a byte string.
- `co_consts`: The constants available to the instructions ("the constants table").
- `co_names`: The global symbols available to the instructions.
- `co_nlocals`: The number of local symbols (functions and variables).
- `co_varnames`: The list of local symbols (functions and variables).
- `co_argcount`: If the code object is that of a function, the number of arguments it takes.
For example, we can see that when we create the following function:
```python
def foo():
    x = 5
    def bar():
        pass
    return len([x])
```
The code object of `foo` will have the constants (`None` and `5`), the local variable names (`bar` and `x`) and the available global symbols (`len`):

```python
In : foo.__code__.co_consts
Out: (None, 5)

In : foo.__code__.co_varnames
Out: ('bar', 'x')

In : foo.__code__.co_names
Out: ('len',)
```
Understanding Disassembled Code
Now that we know what code objects are and that they have name and constant tables, we can better understand the output of
dis.dis(). Let us take the disassembly of the following function:
```python
y = 5

def function():
    x = len([1, 2, 3])
    z = x + y
    return z
```
```
In : dis.dis(function)
  3           0 LOAD_GLOBAL              0 (len)
              3 LOAD_CONST               1 (1)
              6 LOAD_CONST               2 (2)
              9 LOAD_CONST               3 (3)
             12 BUILD_LIST               3
             15 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             18 STORE_FAST               0 (x)

  4          21 LOAD_FAST                0 (x)
             24 LOAD_GLOBAL              1 (y)
             27 BINARY_ADD
             28 STORE_FAST               1 (z)

  5          31 LOAD_FAST                1 (z)
             34 RETURN_VALUE
```
For example, the very first line shows the `LOAD_GLOBAL` instruction. It is the first bytecode instruction (offset 0) of the third line of the source, as indicated by the second and first columns, respectively. As we can also see, it loads the name at index 0. Recall that this index references the `co_names` list associated with the disassembled code object. The last column, `len`, is simply a friendly hint by `dis.dis()` as to what the value at that index actually refers to.
But what does `LOAD_GLOBAL` do? Well, it loads a global; in this case the global function (symbol) `len`. That was obvious, but what is more interesting is where it loads the global to. The answer lies in how the CPython interpreter is structured: it is entirely stack-based. This means that function or method symbols, constants and variable values are pushed onto an evaluation stack. Operations such as binary addition (the `BINARY_ADD` bytecode instruction you see) then operate on this stack, popping two values and pushing the result back onto it. Function call instructions (`CALL_FUNCTION`) similarly pop the function symbol and arguments off the stack, execute the function (which pushes a new frame onto the call stack) and ultimately push the return value of the called function back onto the evaluation stack.
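To make this stack-based model concrete, here is a toy, entirely hypothetical interpreter for a handful of instructions modeled on CPython's. The encoding and dispatch here are simplifications of my own, not CPython's real internals:

```python
def run(instructions, consts, names, env):
    # A toy value stack, modeled after CPython's evaluation stack.
    stack = []
    for op, arg in instructions:
        if op == 'LOAD_CONST':
            stack.append(consts[arg])       # push a constant
        elif op == 'LOAD_NAME':
            stack.append(env[names[arg]])   # push a variable's value
        elif op == 'BINARY_ADD':
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)      # pop two values, push the sum
        elif op == 'STORE_NAME':
            env[names[arg]] = stack.pop()   # pop the top into a variable
    return env

# Evaluate `z = x + 1` with x = 41:
env = run([('LOAD_NAME', 0), ('LOAD_CONST', 0), ('BINARY_ADD', None),
           ('STORE_NAME', 1)],
          consts=(1,), names=('x', 'z'), env={'x': 41})
print(env['z'])  # 42
```

Notice how the argument of each instruction is only ever an index into a table, exactly as in real CPython bytecode.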
A complete list of instructions understood by CPython can be found here. Some interesting ones include:
- `LOAD_FAST <index>`: Pushes the local variable at the given index onto the stack.
- `STORE_GLOBAL <index>`: Pops the top of the stack into the global variable whose name is stored at the given index.
- `BINARY_ADD`: Pops the top two values from the stack, adds them and pushes the result back onto the stack.
- `BUILD_LIST <size>`: Pops `size`-many elements off the stack and pushes a new list built from them.
- `CALL_FUNCTION <#arguments>`: Pops `#arguments`-many arguments off the stack. The low byte of this number specifies the number of positional arguments, the high byte the number of keyword arguments. Then pops the function symbol itself off the stack, makes the call and pushes the return value of the function back onto the stack.
- `MAKE_FUNCTION <#arguments>`: Creates a function object by consuming, from the stack:
  - all default arguments in positional order,
  - the code object of the function,
  - the qualified name of the function,

  and pushes the created function object onto the stack.
- `POP_TOP`: Simply pops (as in, "removes") the value at the top of the stack.
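If you want to work with instructions like these programmatically rather than reading printed output, the `dis.get_instructions` function (available since Python 3.4) yields them as structured objects:

```python
import dis

def f(a, b):
    return a + b

# Each Instruction carries the offset, opname, argument and the
# resolved argument representation, i.e. the columns dis.dis() prints.
for ins in dis.get_instructions(f):
    print(ins.offset, ins.opname, ins.argrepr)

opnames = [ins.opname for ins in dis.get_instructions(f)]
```

The exact opnames you see will vary by CPython version (newer releases merged `BINARY_ADD` into a generic `BINARY_OP`, for example), which is precisely why inspecting your own interpreter's output is instructive.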
Let's now look at practical examples of how inspecting the disassembly of our programs can help us optimize and better understand our code.
One simple case is investigating when the Python compiler performs certain optimizations and when it does not. For example, take the following code:
```python
i = 1 + 2
f = 3.4 * 5.6
s = 'Hello,' + ' World!'

I = i * 3 * 4
F = f / 2 / 3
S = s + ' Bye!'
```
Doing a disassembly shows us (comments mine):
```
### i
  2           0 LOAD_CONST              10 (3)
              3 STORE_FAST               0 (i)

### f
  3           6 LOAD_CONST              11 (19.04)
              9 STORE_FAST               1 (f)

### s
  4          12 LOAD_CONST              12 ('Hello, World!')
             15 STORE_FAST               2 (s)

### I
  6          18 LOAD_FAST                0 (i)
             21 LOAD_CONST               7 (3)
             24 BINARY_MULTIPLY
             25 LOAD_CONST               8 (4)
             28 BINARY_MULTIPLY
             29 STORE_FAST               3 (I)

### F
  7          32 LOAD_FAST                1 (f)
             35 LOAD_CONST               2 (2)
             38 BINARY_TRUE_DIVIDE
             39 LOAD_CONST               7 (3)
             42 BINARY_TRUE_DIVIDE
             43 STORE_FAST               4 (F)

### S
  8          46 LOAD_FAST                2 (s)
             49 LOAD_CONST               9 (' Bye!')
             52 BINARY_ADD
             53 STORE_FAST               5 (S)
             56 LOAD_CONST               0 (None)
```
As you can see, the compiler will readily perform simple arithmetic operations at "compile time", such as adding `1 + 2` or reducing the expression `3.4 * 5.6` to the constant `19.04`. The compiler will even perform string concatenation immediately. However, as soon as a variable is involved (see `I`, `F` and `S`), the compiler stops optimizing: it loads the variable name as well as all constants individually and performs the binary operations one by one, in the natural order of the expression.
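We can verify this folding behavior directly by inspecting the constants table, without reading any disassembly (the exact contents of `co_consts` vary slightly between CPython versions, so treat this as illustrative):

```python
# The peephole optimizer folds 1 + 2 at compile time, so the result 3
# ends up in the constants table.
code = compile('i = 1 + 2', '<string>', 'exec')
print(3 in code.co_consts)  # True

# With a variable involved, nothing can be folded ahead of time,
# so only the raw constant 2 appears in the table.
code = compile('i = x + 2', '<string>', 'exec')
print(3 in code.co_consts)  # False
```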
When Rolling-Your-Own is Bad
Now, let's look at a more realistic case where looking at the disassembly would have really given you insight. In Python, we usually use `range` and iterators to loop over a sequence of numbers. However, we might suspect that calling `range` and using an iterator is too costly compared to a roll-your-own C-like loop. That is, we want to compare:
```python
for i in range(x):
    pass
```
with:

```python
i = 0
while i < x:
    i += 1
```
Let’s do just that and profile them:
```python
In : timeit.timeit('for i in range(x): pass', globals=dict(x=100))
Out: 1.4965313940192573

In : timeit.timeit('i = 0\nwhile i < x: i += 1', globals=dict(x=100))
Out: 7.19753862300422
```
As we can see, the built-in version is much faster. Let's disassemble the code and try to understand why. First, the `range`-based version:
```
  1           0 SETUP_LOOP              20 (to 23)
              3 LOAD_NAME                0 (range)
              6 LOAD_NAME                1 (x)
              9 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             12 GET_ITER
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_NAME               2 (i)
             19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK
        >>   23 LOAD_CONST               0 (None)
             26 RETURN_VALUE
```
Then for the roll-your-own version:
```
  1           0 LOAD_CONST               0 (0)
              3 STORE_NAME               0 (i)

  2           6 SETUP_LOOP              26 (to 35)
        >>    9 LOAD_NAME                0 (i)
             12 LOAD_NAME                1 (x)
             15 COMPARE_OP               0 (<)
             18 POP_JUMP_IF_FALSE       34
             21 LOAD_NAME                0 (i)
             24 LOAD_CONST               1 (1)
             27 INPLACE_ADD
             28 STORE_NAME               0 (i)
             31 JUMP_ABSOLUTE            9
        >>   34 POP_BLOCK
        >>   35 LOAD_CONST               2 (None)
             38 RETURN_VALUE
```
As we can see, the version using the built-in `range` function has fewer instructions than the roll-your-own variant. Ultimately, what this boils down to is that the `range` version makes better use of the internal C implementation of iterators, while our custom version requires repeated interpretation of `POP_JUMP_IF_FALSE` and other bookkeeping instructions, which takes more time.
Lastly, inspecting the disassembly of code can give us quite interesting insight into how Python’s dynamic language properties affect its performance. Recall that a symbol in Python must be re-evaluated on each invocation at runtime. This means that even if we use the exact same global variable twice in the same line of code, it will have to be retrieved from the global namespace twice, individually, irrespective of previous or future loads, given that it may very well change value and/or type between evaluations. Ultimately, this means that caching “deeply qualified” names (with lots of dots) and storing them into local variables can improve the performance of our code quite a bit.
As a simple example, say we want to sum the sines of a range of values – lots of values. The naive way to do so would be the following:
```python
import math

def first():
    total = 0
    for x in range(1000):
        total += math.sin(x)
    return total
```
When we disassemble this using
dis.dis(first), we get the following:
```
  2           0 LOAD_CONST               1 (0)
              3 STORE_FAST               0 (total)

  3           6 SETUP_LOOP              39 (to 48)
              9 LOAD_GLOBAL              0 (range)
             12 LOAD_CONST               2 (1000)
             15 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             18 GET_ITER
        >>   19 FOR_ITER                25 (to 47)
             22 STORE_FAST               1 (x)

  4          25 LOAD_FAST                0 (total)
             28 LOAD_GLOBAL              1 (math)
             31 LOAD_ATTR                2 (sin)
             34 LOAD_FAST                1 (x)
             37 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             40 INPLACE_ADD
             41 STORE_FAST               0 (total)
             44 JUMP_ABSOLUTE           19
        >>   47 POP_BLOCK

  5     >>   48 LOAD_FAST                0 (total)
             51 RETURN_VALUE
```
The important bit is line 4. As you can see, since we fully qualify `math.sin`, we have to:

1. load the global symbol `math` (`LOAD_GLOBAL`), and
2. look up the `sin` attribute on it (`LOAD_ATTR`).

Where (2) ultimately equates to a dictionary (hash table) lookup. Understanding the dynamic nature of Python, we may hypothesize that caching the `math.sin` symbol in a local variable may greatly improve the performance of our code in the tight inner loop. Let's see:
```python
def second():
    sin = math.sin
    total = 0
    for x in range(1000):
        total += sin(x)
    return total
```
```
  2           0 LOAD_GLOBAL              0 (math)
              3 LOAD_ATTR                1 (sin)
              6 STORE_FAST               0 (sin)

  3           9 LOAD_CONST               1 (0)
             12 STORE_FAST               1 (total)

  4          15 SETUP_LOOP              36 (to 54)
             18 LOAD_GLOBAL              2 (range)
             21 LOAD_CONST               2 (1000)
             24 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             27 GET_ITER
        >>   28 FOR_ITER                22 (to 53)
             31 STORE_FAST               2 (x)

  5          34 LOAD_FAST                1 (total)
             37 LOAD_FAST                0 (sin)
             40 LOAD_FAST                2 (x)
             43 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             46 INPLACE_ADD
             47 STORE_FAST               1 (total)
             50 JUMP_ABSOLUTE           28
        >>   53 POP_BLOCK

  6     >>   54 LOAD_FAST                1 (total)
             57 RETURN_VALUE
```
If we now look at the inner loop (line 5), we can see that we've replaced the `LOAD_GLOBAL`/`LOAD_ATTR` sequence with `LOAD_FAST` – a single local variable lookup. As the name implies, this should be a lot faster. Let's check:
```python
In : timeit.timeit('f()', globals=dict(f=first), number=100000)
Out: 27.191193769976962

In : timeit.timeit('f()', globals=dict(f=second), number=100000)
Out: 15.568848820985295
```
Well, I don’t know about your definition of “a lot faster”, but a factor of (almost) two is quite a lot faster in my book.
Disassembled bytecode instructions are already quite low-level (a.k.a. cool). However, we can go even deeper and inspect the bytecode itself – i.e. the binary representation of the instructions in compiled and assembled form. For this, let's define a function and mess a little more with its code object:
```python
def function():
    x = 5
    l = [1, 2]
    return len(l) + x
```
Through `function.__code__` we can gain access to the code object associated with the function. Furthermore, `function.__code__.co_code` returns the actual bytecode:
```python
In : function.__code__.co_code
Out: b'd\x01\x00}\x00\x00d\x02\x00d\x03\x00g\x02\x00}\x01\x00t\x00\x00|\x01\x00\x83\x01\x00|\x00\x00\x17S'
```
Yes! Bytes! Just what I like for breakfast. But what can we actually make of these delicious bites of bytecode? Well, we know that these bytes specify instructions, some taking arguments and some not. Each instruction will occupy a single byte and arguments (such as the indices into the name and constants table) will occupy further bytes. Furthermore, fortunately enough, the
dis module (as well as the
opcode module) provides an
opname table and an
opmap map. The former is a simple list, laid out such that indexing it with the opcode of an instruction will return the name (mnemonic) of that instruction. The latter,
dis.opmap, maps instruction mnemonics to their bytecode numbers:
```python
In : dis.opname[69]
Out: 'GET_YIELD_FROM_ITER'

In : dis.opmap['LOAD_CONST']
Out: 100
```
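One caveat worth noting: since Python 3.6, CPython uses a fixed-width "wordcode" format in which every instruction occupies exactly two bytes (one opcode byte, one argument byte), rather than the one-or-three-byte encoding shown in this article. On such versions, mapping raw bytes back to mnemonics can be sketched like this (`first_opcodes` is my own helper name):

```python
import dis

def first_opcodes(func, n=4):
    # Step through the raw bytecode two bytes at a time (3.6+ wordcode)
    # and translate each opcode byte via the dis.opname table.
    raw = func.__code__.co_code
    return [dis.opname[raw[i]] for i in range(0, min(2 * n, len(raw)), 2)]

def f():
    return 1

print(first_opcodes(f))
```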
So, if we know the byte value describing a certain instruction, we now know how to get the instruction name. All that's left is interpreting the arguments of these instructions. For this, we need to know whether or not the instruction takes arguments in the first place. To get this information, we can make use of lists such as `dis.hasconst`, `dis.hasname`, `dis.hasjabs` and others. Each of these is a list in the `dis` module that contains the opcodes taking a constant argument, a name argument, a relative/absolute jump target or some other kind of parameter. For example, `dis.hasnargs` is also such a list, containing all opcodes related to function calls, such as `CALL_FUNCTION_VAR` (for functions taking `*args`) or `CALL_FUNCTION_KW` (for functions taking `**kwargs`). It is noteworthy that if an instruction takes arguments at all, it can only take a single argument occupying exactly 16 bits (two bytes).
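The idea can be sketched as a small classifier. I am only using the opcode tables that have remained stable across Python versions (`hasconst`, `hasname`, `haslocal`); some era-specific lists mentioned above, such as `hasnargs`, have since been removed from newer CPython releases:

```python
import dis

def arg_kind(opcode):
    # Classify what an instruction's argument refers to, based on the
    # opcode membership tables the dis module exposes.
    if opcode in dis.hasconst:
        return 'constants table index'
    if opcode in dis.hasname:
        return 'names table index'
    if opcode in dis.haslocal:
        return 'local variable index'
    return 'other (or no argument)'

print(arg_kind(dis.opmap['LOAD_CONST']))   # constants table index
print(arg_kind(dis.opmap['LOAD_GLOBAL']))  # names table index
```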
Writing Our Own Bytecode
For practice and fun, we could now actually write some bytecode ourselves. Say, for example, we wanted to write `x = 1 + 2` in bytecode. This equates to first loading the constants `1` and `2` onto the stack using `LOAD_CONST` (opcode `0x64`). Then, we perform a `BINARY_ADD` (opcode `0x17`), which will pop those two values off the stack and push the result (3) back on top. Then we'll do a `STORE_FAST` (opcode `0x7d`) to pop the result off the stack and store it in the only variable we'll have, i.e. at index 0. This equates to:
```python
0 LOAD_CONST  0    # 0x64
3 LOAD_CONST  1    # 0x64
6 BINARY_ADD       # 0x17
7 STORE_FAST  0    # 0x7d

consts = (1, 2)
varnames = ('x',)
```
or, in (decimal) bytes:
```python
100 0 0    # LOAD_CONST, index into consts
100 1 0    # LOAD_CONST, index into consts
23         # BINARY_ADD, no arguments
125 0 0    # STORE_FAST, index into varnames
```
Note that the compiler would actually optimize `1 + 2` to the constant `3`, as you'll see if you run `dis.dis('x = 1 + 2')`. Also, when you run `dis.dis`, you'll see that the code seems to return `None` at the end. This is simply an implementation detail of CPython, related to it being written in C and every code object always requiring a return value (even for code outside functions).
The disassembly also tells us that a single `LOAD_CONST` instruction consumes three bytes: one for the opcode and two for the 16-bit argument. Lastly, it is important to mention that arguments are stored in little-endian format. That's why `LOAD_CONST 1` is `100 1 0`, since the byte sequence `0b00000001 0b00000000` is `1` in little-endian representation.
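A quick sanity check of that byte order, using the standard `int.from_bytes`:

```python
# The two argument bytes after an opcode form a little-endian 16-bit
# integer: the low byte comes first.
print(int.from_bytes(bytes([1, 0]), 'little'))  # 1
print(int.from_bytes(bytes([0, 1]), 'little'))  # 256
```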
Writing a Disassembler
Given all this information, we're now more than set to write our own little bytecode disassembler. We'll simply take a byte string, such as returned by a code object's `co_code` attribute, and loop through its bytes. We'll have to read the bytes to decode instructions and pick out the right number of subsequent bytes for arguments. Also, we'll want to make use of `co_firstlineno` to cross-reference between the bytecode and the original, disassembled code. You can find a small implementation (~300 lines) following these ideas here.
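If you'd rather not decode the raw bytes by hand, a minimal sketch of such a disassembler can lean on `dis.get_instructions`, which handles the opcode and argument decoding across Python versions (`tiny_dis` and the row format here are my own simplified take on the `dis.dis` output):

```python
import dis

def tiny_dis(func):
    # Produce (offset, opname, argrepr) rows; dis.get_instructions
    # does the raw byte and argument decoding for us.
    return [(ins.offset, ins.opname, ins.argrepr)
            for ins in dis.get_instructions(func)]

def example():
    x = 5
    return x

for row in tiny_dis(example):
    print(row)
```

Writing the byte-level decoding yourself, as the linked implementation does, is the more instructive exercise; this version trades that insight for portability.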
The one thing many people love about Python is the speed of development you get from its high-level, high-abstraction programming interface. However, every high-level programming language hides a great depth of low-level implementation details that give you insight into the inner workings of the language and allow you to make more educated decisions about performance-critical optimizations. That said, don't forget that there's a plethora of other languages, such as modern C++, Rust, Go and even Java, that will (usually) be faster than even the most optimized Python code while providing reasonably intuitive interfaces. As such, see what you just learnt as useful knowledge for rare edge cases and fun trivia to talk about with your grandmother over Sunday tea, but not as an invitation to disassemble your entire code base and optimize individual instructions.
Here are some further resources and links I took inspiration and knowledge from: