Asked By – cs95
Note: This question is for informational purposes only. I am interested to see how deep into Python’s internals it is possible to go with this.
Not very long ago, a discussion began inside a certain question regarding whether the strings passed to print statements could be modified after or during the call to print. For example, consider the function:

    def print_something():
        print('This cat was scared.')

When it is called, the output on the terminal should be:

    This dog was scared.
Notice the word “cat” has been replaced by the word “dog”. Something somewhere somehow was able to modify those internal buffers to change what was printed. Assume this is done without the original code author’s explicit permission (hence, hacking/hijacking).
This comment from the wise @abarnert, in particular, got me thinking:
There are a couple of ways to do that, but they’re all very ugly, and should never be done. The least ugly way is to probably replace the code object inside the function with one with a different co_consts list. Next is probably reaching into the C API to access the str’s internal buffer. […]
So, it looks like this is actually possible.
Here’s my naive way of approaching this problem:
    >>> import inspect
    >>> exec(inspect.getsource(print_something).replace('cat', 'dog'))
    >>> print_something()
    This dog was scared.

exec is bad, but that doesn’t really answer the question, because it does not actually modify anything during or after the call to print.
How would it be done as @abarnert has explained it?
Now we will see the solution to the issue: Is it possible to “hack” Python’s print function?
First, there’s actually a much less hacky way. All we want to do is change what print prints, right?

    _print = print

    def print(*args, **kw):
        args = (arg.replace('cat', 'dog') if isinstance(arg, str) else arg
                for arg in args)
        _print(*args, **kw)

Or, similarly, you can monkeypatch sys.stdout instead of print.

Also, nothing wrong with the exec … getsource … idea. Well, of course there’s plenty wrong with it, but less than what follows here…
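As a rough sketch of the sys.stdout variant: the wrapper below rewrites text on its way to the real stream. The class name ReplacingStream and the demo helper are made up for illustration, and the sketch assumes the word to replace always arrives within a single write() call, which holds for a plain print of one string.

```python
import sys


class ReplacingStream:
    """Hypothetical wrapper that rewrites text on its way to a real stream."""

    def __init__(self, stream, old, new):
        self._stream = stream
        self._old = old
        self._new = new

    def write(self, text):
        # print() calls write() with str chunks (the text, then the end string).
        return self._stream.write(text.replace(self._old, self._new))

    def __getattr__(self, name):
        # Delegate flush(), encoding, isatty(), etc. to the wrapped stream.
        return getattr(self._stream, name)


def print_something():
    print('This cat was scared.')


def demo(stream=None):
    """Run print_something() with sys.stdout temporarily wrapped."""
    original = sys.stdout
    target = stream if stream is not None else original
    sys.stdout = ReplacingStream(target, 'cat', 'dog')
    try:
        print_something()
    finally:
        sys.stdout = original  # always restore the real stream


if __name__ == '__main__':
    demo()
```

Unlike shadowing print, this also catches output from code that looked up the builtin print before any patching, since print resolves sys.stdout at call time when no file argument is given.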
But if you do want to modify the function object’s code constants, we can do that.
If you really want to play around with code objects for real, you should use a library like bytecode (when it’s finished) or byteplay (until then, or for older Python versions) instead of doing it manually. Even for something this trivial, the CodeType initializer is a pain; if you actually need to do stuff like fixing up lnotab, only a lunatic would do that manually.
Also, it goes without saying that not all Python implementations use CPython-style code objects. This code will work in CPython 3.7, and probably all versions back to at least 2.2 with a few minor changes (and not the code-hacking stuff, but things like generator expressions), but it won’t work with any version of IronPython.
    import types

    def print_function():
        print("This cat was scared.")

    def main():
        # A function object is a wrapper around a code object, with
        # a bit of extra stuff like default values and closure cells.
        # See inspect module docs for more details.
        co = print_function.__code__
        # A code object is a wrapper around a string of bytecode, with a
        # whole bunch of extra stuff, including a list of constants used
        # by that bytecode. Again see inspect module docs. Anyway, inside
        # the bytecode for print_function (which you can read by typing
        # dis.dis(print_function) in your REPL), there's going to be an
        # instruction like LOAD_CONST 1 to load the string literal onto
        # the stack to pass to the print function, and that works by just
        # reading co.co_consts[1]. So, that's what we want to change.
        consts = tuple(c.replace("cat", "dog") if isinstance(c, str) else c
                       for c in co.co_consts)
        # Unfortunately, code objects are immutable, so we have to create
        # a new one, copying over everything except for co_consts, which
        # we'll replace. And the initializer has a zillion parameters.
        # Try help(types.CodeType) at the REPL to see the whole list.
        # (This positional signature matches CPython 3.7; 3.8 and later
        # take additional arguments.)
        co = types.CodeType(
            co.co_argcount, co.co_kwonlyargcount, co.co_nlocals,
            co.co_stacksize, co.co_flags, co.co_code,
            consts, co.co_names, co.co_varnames, co.co_filename,
            co.co_name, co.co_firstlineno, co.co_lnotab,
            co.co_freevars, co.co_cellvars)
        print_function.__code__ = co
        print_function()

    main()
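On CPython 3.8 and later, code objects have a replace() method that copies every field except the ones you name, which sidesteps the long constructor call. The following is a minimal sketch of the same constant swap under that assumption; the helper name swap_constants is made up for illustration, and it only looks at top-level string constants (not strings nested inside tuples or inner code objects).

```python
def print_function():
    print("This cat was scared.")


def swap_constants(func, old, new):
    """Rebind func.__code__ to a copy whose str constants have old -> new."""
    co = func.__code__
    consts = tuple(c.replace(old, new) if isinstance(c, str) else c
                   for c in co.co_consts)
    # code.replace() (Python 3.8+) returns a new code object; the original
    # code object is left untouched because it is immutable.
    func.__code__ = co.replace(co_consts=consts)


if __name__ == '__main__':
    swap_constants(print_function, "cat", "dog")
    print_function()
```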
What could go wrong with hacking up code objects? Mostly just segfaults, RuntimeErrors that eat up the whole stack, more normal RuntimeErrors that can be handled, or garbage values that will probably just raise an AttributeError when you try to use them. For example, try creating a code object with just a RETURN_VALUE with nothing on the stack (bytecode b'S\0' for 3.6+, b'S' before), or with an empty tuple for co_consts when there’s a LOAD_CONST 0 in the bytecode, or with varnames decremented by 1 so the highest LOAD_FAST actually loads a freevar/cellvar cell. For some real fun, if you get the lnotab wrong enough, your code will only segfault when run in the debugger.
Using bytecode or byteplay won’t protect you from all of those problems, but they do have some basic sanity checks, and nice helpers that let you do things like insert a chunk of code and let it worry about updating all offsets and labels so you can’t get it wrong, and so on. (Plus, they keep you from having to type in that ridiculous 6-line constructor, and having to debug the silly typos that come from doing so.)
Now on to #2.
I mentioned that code objects are immutable. And of course the consts are a tuple, so we can’t change that directly. And the thing in the const tuple is a string, which we also can’t change directly. That’s why I had to build a new string to build a new tuple to build a new code object.
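To see those three layers of immutability for yourself, here is a small sketch that attempts each in-place change and records which exception comes back. The helper name try_mutations is hypothetical, and the exact exception type for the code object attribute is caught loosely because it is an implementation detail.

```python
def print_function():
    print("This cat was scared.")


def try_mutations(func):
    """Attempt each in-place change and record which exception it raises."""
    co = func.__code__
    results = {}

    try:
        co.co_consts = ()        # code object attributes are read-only
    except (AttributeError, TypeError) as exc:
        results['code'] = type(exc).__name__

    try:
        co.co_consts[0] = None   # tuples reject item assignment
    except TypeError as exc:
        results['tuple'] = type(exc).__name__

    text = next(c for c in co.co_consts if isinstance(c, str))
    try:
        text[5:8] = 'dog'        # str rejects item assignment too
    except TypeError as exc:
        results['str'] = type(exc).__name__

    return results


if __name__ == '__main__':
    print(try_mutations(print_function))
```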
But what if you could change a string directly?
Well, deep enough under the covers, everything is just a pointer to some C data, right? If you’re using CPython, there’s a C API to access the objects, and you can use ctypes to access that API from within Python itself, which is such a terrible idea that they put a pythonapi right there in the stdlib’s ctypes module. 🙂 The most important trick you need to know is that id(x) is the actual pointer to x in memory (as an int).
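A quick way to convince yourself of the id trick, assuming CPython: cast the integer back to an object reference with ctypes and check that you get the very same object. The helper name object_at is made up for this sketch, and it is only safe while the object is still alive.

```python
import ctypes


def object_at(address):
    """Return the Python object living at a CPython memory address."""
    # A stale or bogus address here can crash the interpreter outright.
    return ctypes.cast(address, ctypes.py_object).value


message = "This cat was scared."
address = id(message)             # in CPython, the object's memory address
recovered = object_at(address)    # same object, not a copy
```

Here `recovered is message` should hold on CPython; other implementations make no such promise about id.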
Unfortunately, the C API for strings won’t let us safely get at the internal storage of an already-frozen string. So screw safely, let’s just read the header files and find that storage ourselves.
If you’re using CPython 3.4 – 3.7 (it’s different for older versions, and who knows for the future), a string literal from a module that’s made of pure ASCII is going to be stored using the compact ASCII format, which means the struct ends early and the buffer of ASCII bytes follows immediately in memory. This will break (as in probably segfault) if you put a non-ASCII character in the string, or certain kinds of non-literal strings, but you can read up on the other 4 ways to access the buffer for different kinds of strings.
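One indirect, crash-free way to observe the compact layout is to watch how the object size grows with length. This is only a sketch: it assumes CPython 3.3 or later, the helper name bytes_per_char is made up, and it infers the per-character width from sys.getsizeof rather than reading the struct itself.

```python
import sys


def bytes_per_char(sample, extra=4):
    """Measure how much sys.getsizeof grows per repeated character."""
    short = sample * 8
    longer = sample * (8 + extra)
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // extra


ascii_width = bytes_per_char('a')        # pure ASCII text
wide_width = bytes_per_char('\u20ac')    # euro sign, outside Latin-1

if __name__ == '__main__':
    print(ascii_width, wide_width)
```

On the CPython builds described above, ASCII text should come out at one byte per character stored inline with the object, while the euro sign forces a wider representation, which is why the offset arithmetic in the hack that follows only holds for pure-ASCII strings.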
To make things slightly easier, I’m using the superhackyinternals project off my GitHub. (It’s intentionally not pip-installable because you really shouldn’t be using this except to experiment with your local build of the interpreter and the like.)
    import ctypes
    import internals  # https://github.com/abarnert/superhackyinternals/blob/master/internals.py

    def print_function():
        print("This cat was scared.")

    def main():
        for c in print_function.__code__.co_consts:
            if isinstance(c, str):
                idx = c.find('cat')
                if idx != -1:
                    # Too much to explain here; just guess and learn to
                    # love the segfaults...
                    p = internals.PyUnicodeObject.from_address(id(c))
                    assert p.compact and p.ascii
                    addr = id(c) + internals.PyUnicodeObject.utf8_length.offset
                    buf = (ctypes.c_int8 * 3).from_address(addr + idx)
                    buf[:3] = b'dog'

        print_function()

    main()
If you want to play with this stuff, int is a whole lot simpler under the covers than str. And it’s a lot easier to guess what you can break by changing the value of 2 to 1, right? Actually, forget imagining, let’s just do it (using the types from superhackyinternals again):
    >>> from internals import PyLongObject
    >>> n = 2
    >>> pn = PyLongObject.from_address(id(n))
    >>> pn.ob_digit
    2
    >>> pn.ob_digit = 1
    >>> 2
    1
    >>> n * 3
    3
    >>> i = 10
    >>> while i < 40:
    ...     i *= 2
    ...     print(i)
    10
    10
    10
… pretend that code box has an infinite-length scrollbar.
I tried the same thing in IPython, and the first time I tried to evaluate 2 at the prompt, it went into some kind of uninterruptible infinite loop. Presumably it’s using the number 2 for something in its REPL loop, while the stock interpreter isn’t?