Updated in 2026 with Python 3 examples and current links.
As mentioned in the first part of this series, some functions in Python can be dangerous if you’re not aware of their risks. In this installment, we’ll cover deserializing data with pickle and yaml, SQL injection, and information leakage.
Pickle and friends
Why it’s useful
pickle enables you to store
state and Python objects to disk so that you can later restore them. Pickle can
be useful for storing something that doesn’t quite need a database or for data
that’s inherently temporary.
In the past, I’ve used pickle to support pause and resume functionality for large file transfers. I saved the progress to a pickle file and then, on resume, picked up where it left off and removed the pickle.
Why it’s dangerous
Pickle has the same weaknesses as exec and eval, which we covered in part 1.
It enables users to craft input that executes arbitrary code on
your machine. Sound familiar?
Other modules that rely on pickle inherit the same risks.
shelve, for example,
uses pickle under the hood for serialization.
Popular frameworks have learned this lesson over the years. Celery used pickle as its default serializer for worker communication before version 4.0, and Django used it for session storage before version 1.6. Both have since moved to safer defaults, but the underlying risk remains for any code that deserializes untrusted pickle data.
A dangerous example
I’m going to use an example from Lincoln Loop’s
Playing with Pickle Security
and expand upon it. In our example, we will serialize a command to call the
command-line utility ls and deserialize it with pickle.loads().
import os
import pickle
# Exploit that we want the target to unpickle
class Exploit:
    def __reduce__(self):
        # Note: this will only list files in your directory.
        # It is a proof of concept.
        return (os.system, ('ls',))
def serialize_exploit():
    shellcode = pickle.dumps(Exploit())
    return shellcode
def insecure_deserialize(exploit_code):
    pickle.loads(exploit_code)
if __name__ == '__main__':
    shellcode = serialize_exploit()
    print('Yar, here be yer files.')
    insecure_deserialize(shellcode)
In this case, we only wanted to list the files in the directory using the
ls command. We could have used almost any shell command.
What to use instead
You could use json to
serialize data or, if you must, yaml. If you use yaml, please read the
section below on why it has its own set of risks.
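As a minimal sketch of the json route, here is how the pause-and-resume state described earlier could be stored without pickle. The file name transfer_state.json and the keys in the state dictionary are hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile


def save_state(path, state):
    # json.dump writes plain data only: dicts, lists, strings, numbers, booleans, None
    with open(path, 'w') as state_file:
        json.dump(state, state_file)


def load_state(path):
    # json.load rebuilds plain data; it never imports modules or calls functions
    with open(path) as state_file:
        return json.load(state_file)


if __name__ == '__main__':
    # Hypothetical progress record for a paused file transfer
    state = {'filename': 'backup.tar', 'bytes_sent': 1048576, 'done': False}
    path = os.path.join(tempfile.gettempdir(), 'transfer_state.json')
    save_state(path, state)
    print(load_state(path))
    os.remove(path)
```

The trade-off is that json cannot round-trip arbitrary Python objects, so instances of your own classes have to be converted to and from dictionaries by hand.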
If you’re using Celery or Django, make sure you’re on a modern version that
does not use pickle for serialization by default.
If you must use it…
Be careful with your input! Never trust a pickle that has gone over the network or come from someone else. It’s too easy to exploit.
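One common mitigation, if pickle is unavoidable, is to sign the pickled bytes and check the signature before unpickling, so tampered data is rejected before pickle.loads() ever sees it. The sketch below uses the standard library hmac module. SECRET_KEY, sign_pickle() and verify_and_load() are hypothetical names, and the sketch assumes the key stays secret and is shared only between trusted processes.

```python
import hashlib
import hmac
import pickle

# Assumption: in real code this key comes from a secret store, not source code
SECRET_KEY = b'replace-with-a-real-secret'
DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def sign_pickle(obj):
    payload = pickle.dumps(obj)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # Prepend the signature so the receiver can check it first
    return signature + payload


def verify_and_load(data):
    signature, payload = data[:DIGEST_SIZE], data[DIGEST_SIZE:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
    # compare_digest avoids leaking timing information
    if not hmac.compare_digest(signature, expected):
        raise ValueError('Signature mismatch: refusing to unpickle')
    return pickle.loads(payload)


if __name__ == '__main__':
    signed = sign_pickle({'bytes_sent': 1024})
    print(verify_and_load(signed))

    # Flip one bit of the payload to simulate tampering
    tampered = signed[:-1] + bytes([signed[-1] ^ 1])
    try:
        verify_and_load(tampered)
    except ValueError as error:
        print(error)
```

A valid signature only proves the data came from someone holding the key. It does not make a malicious pickle from a key holder safe.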
Loading YAMLs
Why it’s useful
YAML files offer another option for serializing and deserializing data. They are useful for storing configuration or other immutable values. I have used YAMLs to store configuration values for web applications, where the configuration differs depending upon the environment we’re deploying to (production vs staging, for example).
PyYAML
does not live in the standard library but seems like the
most popular way to parse YAMLs in Python.
Why it’s dangerous
The simplest way to load a YAML file used to be yaml.load(). Unfortunately,
in older PyYAML releases, yaml.load() without a Loader argument was an unsafe operation that, you guessed it,
enabled maliciously crafted files to execute arbitrary code on the host machine. Loading with an unsafe loader such as yaml.UnsafeLoader still is.
A dangerous example
As with pickle, we’ll set up an example where we read the files in a directory on the host machine.
In exploit.yml:
your_files: !!python/object/apply:subprocess.check_output ['ls']
In a Python script (after perhaps running pip install pyyaml):
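import yaml
# PyYAML 6.0+ requires an explicit Loader. UnsafeLoader reproduces the old,
# dangerous default behavior. Never use it on untrusted input.
with open('exploit.yml') as exploit_file:
    contents = yaml.load(exploit_file, Loader=yaml.UnsafeLoader)
# check_output() returns bytes in Python 3, so decode before splitting
your_files = contents['your_files'].decode().splitlines()
for your_file in your_files:
    print(your_file)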
Again, we can provide many different commands to subprocess, including those that we discussed in part 1.
What to use instead
The yaml module has a safe way to load YAML files: yaml.safe_load().
When I originally wrote this post, I wished the package had the safe method
as the default. As of PyYAML 6.0, calling yaml.load() without an explicit
Loader raises an error, which is a good step.
As Ned Batchelder said at the time:
Why do serialization implementers do this? If you must extend the format with dangerous features, provide them in the non-obvious method. Provide a .load() method and a .dangerous_load() method instead. At least that way people would have to decide to do the dangerous thing.
PyYAML eventually took that advice to heart.
If you must use it…
Use yaml.safe_load(). If you must use yaml.load() directly, pass
Loader=yaml.SafeLoader explicitly so your intent is clear.
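To see the difference in practice, the sketch below feeds the exploit document from earlier to yaml.safe_load(). It assumes PyYAML is installed. load_config() and the CONFIG sample are hypothetical, used only for illustration.

```python
import yaml

# The same payload as exploit.yml above
EXPLOIT = "your_files: !!python/object/apply:subprocess.check_output ['ls']"
# Hypothetical, harmless configuration
CONFIG = "environment: staging\ndebug: true\n"


def load_config(text):
    # safe_load builds only plain types such as dict, list, str, int and bool
    return yaml.safe_load(text)


if __name__ == '__main__':
    print(load_config(CONFIG))
    try:
        load_config(EXPLOIT)
    except yaml.YAMLError as error:
        # SafeLoader has no constructor for python/object/apply, so nothing runs
        print('Rejected:', type(error).__name__)
```

Ordinary configuration loads as before, while the python/object/apply tag raises an error instead of running a command.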
A few more dangers
A few more things to keep in mind.
SQL Injection
SQL Injection is basically untrusted input meets your database. All the same risks that we talked about with untrusted input above also apply here.
As a quick example, here’s how someone could exploit this:
import sqlite3
def get_user_by_name(name, cursor):
    cursor.execute("SELECT * FROM users WHERE name = '%s'" % name) # unsafe!
if __name__ == '__main__':
    conn = sqlite3.connect('example.db')
    cursor = conn.cursor()
    malicious_name = "Joe'; DROP TABLE users; --"
    # Pass the cursor, matching the function signature
    get_user_by_name(malicious_name, cursor)
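If you ran this example through a database driver that accepts several statements in one call, the malicious name would drop the users table. Not great. Python’s sqlite3 refuses to run more than one statement per execute() call, but the injection still works there: a name such as ' OR '1'='1 turns the query into one that returns every row.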
Python provides a database binding for
sqlite3 in the standard
library, and there’s a section in the Python docs on how to
pass values through placeholders instead of string formatting (which we do not do in the example).
Otherwise, I’d recommend using an ORM, such as the
one in Django or
sqlalchemy.
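As a sketch of the safer pattern for the sqlite3 example above, the version below passes the name through a ? placeholder so the driver treats it strictly as data. The in-memory database and the single-column users table are assumptions made so the snippet runs on its own.

```python
import sqlite3


def get_user_by_name(name, cursor):
    # The ? placeholder binds name as a value; it is never spliced into the SQL text
    cursor.execute("SELECT * FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')  # throwaway in-memory database
    cursor = conn.cursor()
    cursor.execute("CREATE TABLE users (name TEXT)")
    cursor.execute("INSERT INTO users VALUES ('Joe')")

    print(get_user_by_name('Joe', cursor))
    # The malicious string is now just an unusual name that matches no row
    print(get_user_by_name("Joe'; DROP TABLE users; --", cursor))
    conn.close()
```

With the placeholder, the quote characters in the malicious name have no special meaning, so the query simply finds no matching user.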
Information Leakage
The print function and logging module are useful but potentially risky.
Ideally, any log files that we write have their permissions configured to allow
only sufficiently privileged users to read them.
If a log file is world-readable, anything sensitive written to it (passwords, API tokens, personal data) is exposed to every user and process on the machine. If you must log sensitive information (must you?), be sure to protect it through access controls.
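One way to apply this with the logging module is to create the log file with owner-only permissions before attaching a handler to it. This is a minimal sketch that assumes a POSIX system. make_private_log() and the logger name are hypothetical.

```python
import logging
import os
import stat
import tempfile


def make_private_log(path):
    # Create the file with owner-only read/write (0o600) before logging opens it.
    # Assumption: a POSIX system; Windows handles permissions differently.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    os.close(fd)
    # Tighten the mode even if the file already existed with looser permissions
    os.chmod(path, 0o600)

    logger = logging.getLogger('private_example')  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.FileHandler(path))
    return logger


if __name__ == '__main__':
    path = os.path.join(tempfile.mkdtemp(), 'app.log')
    logger = make_private_log(path)
    logger.info('user %s logged in', 'joe')  # log the event, not the password
    print(oct(stat.S_IMODE(os.stat(path).st_mode)))
```

File permissions only limit who can read the log. Keeping secrets out of the log message in the first place is still the stronger protection.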
Thanks to @goodwillbits for recommending that I add this section.
In Conclusion
This series covered several ways Python functions can bite you. Python’s documentation flags the risks, but you have to know where to look.
If you take one thing from these posts: never trust untrusted input.
Discussion on Hacker News.