As mentioned in the first part of this series, some functions in Python can be dangerous if you’re not aware of their risks. In this installment, we’ll cover deserializing data with pickle and yaml and information leakage.
Pickle and friends
Why it’s useful
pickle
enables you to store
state and Python objects to disk so that you can later restore them. Pickle can
be useful for storing something that doesn’t quite need a database or for data
that’s inherently temporary.
In the past, I’ve used pickle to support pause and resume functionality for large file transfers. I saved the progress to a pickle file and then, on resume, picked up where it left off and removed the pickle.
Why it’s dangerous
Pickle has the same weaknesses as exec
and eval
, which we covered in part 1.
It enables users to craft input that executes arbitrary code on
your machine. Sound familiar?
Other libraries and modules rely upon Pickle to do their thing as well, which
makes them prone to the same risks.
One of those is shelve
,
which is another module related to serializing Python objects.
Celery, a popular package used for sending messages to queues, used pickle by default for communication with its workers before version 3.0.18. If you’re using an older version of Celery, make sure you’re following the recommended security guidelines or upgrade.
Django, a popular Python web framework,
used pickle
before version 1.6 to store session information. There’s a
scary warning in the Django docs
about how that can go wrong.
A dangerous example
I’m going to use an example from Lincoln Loop’s
Playing with Pickle Security
and expand upon it. In our example, we will serialize a command to call the
command-line utility ls
and deserialize it with pickle.loads()
.
import os
import cPickle
# Exploit that we want the target to unpickle
class Exploit(object):
def __reduce__(self):
# Note: this will only list files in your directory.
# It is a proof of concept.
return (os.system, ('ls',))
def serialize_exploit():
shellcode = cPickle.dumps(Exploit())
return shellcode
def insecure_deserialize(exploit_code):
cPickle.loads(exploit_code)
if __name__ == '__main__':
shellcode = serialize_exploit()
print('Yar, here be yer files.')
insecure_deserialize(shellcode)
In this case, we only wanted to list the files in the directory using the
ls
command. We could have used almost any shell command.
What to use instead
You could use json
to
serialize data or, if you must, yaml
. If you use yaml
, please read the
section below on why it has its own set of risks.
If you’re using Celery or Django, you should upgrade to a version that
does not use pickle
for serialization.
If you must use it…
Be careful with your input! Never trust a pickle that has gone over the network or come from someone else. It’s too easy to exploit.
Additional references
Loading YAMLs
Why it’s useful
YAML files offer another option for serializing and deserializing data. They are useful for storing configuration or other immutable values. I have used YAMLs to store configuration values for web applications, where the configuration differs depending upon the environment we’re deploying to (production vs staging, for example).
PyYAML
does not live in the standard library but seems like the
most popular way to parse YAMLs in Python.
Why it’s dangerous
The simplest way to load a YAML file is with yaml.load()
. Unfortunately,
yaml.load()
is an unsafe operation that, you guessed it, enables maliciously
crafted files to execute arbitrary code on the host machine.
A dangerous example
As with pickle, we’ll setup an example where we read the files in a directory on the host machine.
In exploit.yml
:
your_files: !!python/object/apply:subprocess.check_output ['ls']
In a Python script (after perhaps running pip install pyyaml
):
import yaml
with open('exploit.yml') as exploit_file:
contents = yaml.load(exploit_file)
your_files = contents['your_files'].splitlines()
for your_file in your_files:
print(your_file)
Again, we can provide many different commands to subprocess, including those that we discussed in part 1.
What to use instead
The yaml
module has a safe way to load yaml files: yaml.safe_load()
.
I wish the package had the safe method as the default, rather than the
dangerous one.
As Ned Batchelder says:
Why do serialization implementers do this? If you must extend the format with dangerous features, provide them in the non-obvious method. Provide a .load() method and a .dangerous_load() method instead. At least that way people would have to decide to do the dangerous thing.
If you must use it…
Use yaml.safe_load()
. If you must use yaml.load()
directly, then you
should be careful about which files you load and trust.
Additional references
A few more dangers
I wanted to briefly touch on a few other things to keep in mind while writing Python code.
SQL Injection
SQL Injection is basically untrusted input meets your database. All the same risks that we talked about with untrusted input above also apply here.
As a quick example, here’s how someone could exploit this:
import sqlite3
def get_user_by_name(name, cursor):
cursor.execute("SELECT * FROM users WHERE name = '%s'" % name) # unsafe!
if __name__ == '__main__':
conn = sqlite3.connect('example.db')
cursor = conn.cursor()
malicious_name = "Joe'; DROP TABLE users; --"
get_user_by_name(malicious_name, conn)
If you ran this example against a real database, the malicious name would drop the user’s table. Not great.
Python provides a database binding for
sqlite3
in the standard
library and there’s a section in the Python docs where they talk about how to
properly escape variables (which we do not do in the example).
Otherwise, I’d recommend using an ORM, such as the
one in Django or
sqlalchemy.
Information Leakage
The print
function and logging
module are useful but potentially risky.
Ideally, any log files that we write have their permissions configured to allow
only sufficiently privileged users to read them. If anyone can read the log
file, it’s easier for someone to access the logs when they shouldn’t be able to
do so. If you must log sensitive information (must you?), be sure to protect
it through access controls.
Thanks to @goodwillbits for recommending that I add this section.
In Conclusion
In this series, we’ve covered a few different ways in which Python functions can be dangerous. Python’s documentation is good about letting you know when it’s risky to use something but you have to know when and where to look. If you take nothing else away from this post, please remember to be careful when it comes to accepting untrusted input.
Discussion on Hacker News.