In Python, Rose == 'Red', Violet is not 'Blue'
I came across a bit of Python code that checked if a color was in a list like this:
It felt wrong. In Python, we shouldn’t be able to check if a string `is` another string, since that’s checking their references. I opened a Python interpreter:
That feels like it shouldn’t work! After some research, I came across a Stack Overflow question about this, and it got me interested. I tweeted:
#Python trivia
— Kevin London (@kevin_london) June 7, 2017
>>> 'red' is 'red'
True
>>> color = ''.join(['r', 'e', 'd'])
>>> color
'red'
>>> color is 'red'
False
>>> color == 'red'
True
Why does it work this way? What’s Python doing under the hood?
**[Editor’s Note: This blog post is originally from 2017, and I didn’t finish it until now. It’s Python 2-centric, though I still think it’s interesting!]**
What’s Python Doing?
The simplest answer is that the `is` comparison checks identity: whether both operands refer to the same object. In CPython, each object’s identity is the unique ID returned by `id()`. In other words, `a is b` is checking if `id(a) == id(b)`.
Here’s what the above scenario looks like in an interpreter:
The `color` variable is the constructed result of joining a list. Even though the values are the same, the object that the variable points to is not the object created for the string literal `'red'` earlier.
That explains the scenario above. We’re comparing the IDs of each and, since the reference to `'red'` is different from the reference to `color`, we get `False` returned.
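As a rough sketch of that comparison (the variable name `literal` is mine, not from the original session, and the identity result is CPython behavior rather than a language guarantee), here’s how `==`, `is`, and `id()` relate:

```python
# Sketch only: the identity result reflects CPython implementation details.
literal = 'red'                      # string constant from the source code
color = ''.join(['r', 'e', 'd'])     # equal string, built at runtime

same_value = color == literal        # compares the characters
same_object = color is literal       # compares object identity
same_id = id(color) == id(literal)   # what `is` amounts to for live objects

print(same_value)    # True
print(same_object)   # False on CPython; not guaranteed by the language
print(same_id)       # always agrees with same_object
```

Binding the literal to a name first also avoids the `SyntaxWarning` that newer Python versions emit for `is` with a literal.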
String Interning and How it Works
String interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning is done to optimize memory usage and speed up comparisons (see more here).
In Python 2, `intern` is a built-in function. In Python 3, it lives in the `sys` module as `sys.intern`. Coming back to our example, using Python 3:
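Adrien Guillo wrote an excellent explanation of how Python handles string interning. The short version is that, in CPython, strings of length 0 and 1 are all interned, and other strings are interned at compile time when they appear as constants in the source (roughly, when they look like identifiers). Strings built at runtime are not interned automatically.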
As such, the IDs of the two items are different since one was created (interned) at compile time and the other at runtime.
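To make that concrete, here’s a minimal sketch using `sys.intern` in Python 3. The variable names are my own for illustration. Interning both strings should hand back the same canonical object, no matter how each one was created:

```python
import sys

literal = 'red'                      # string constant from the source code
color = ''.join(['r', 'e', 'd'])     # equal string, built at runtime

# sys.intern returns the canonical copy for a given string value
interned_color = sys.intern(color)
interned_literal = sys.intern(literal)

print(interned_color == color)               # True: the value is unchanged
print(interned_color is interned_literal)    # True: one shared object
```

This is only worth doing when you compare or look up the same strings very often. For everyday code, `==` is the right tool.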
More Complications
Why does comparing sometimes yield a different result in a script than in an interpreter? Here are a few interesting discussions:
- Why does comparing strings in Python using either ‘==’ or ‘is’ sometimes produce different results?
- Why does id() == id() and id() is id() produce different results in CPython?
Deeper Down the Rabbit Hole
Now here’s an example that really bends my brain.
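If the IDs are the same, then why does the `is` comparison of those IDs fail? What’s different about this? This discrepancy arises because the `id()` function returns a new integer object each time it is called, even if the integer values are the same. CPython only caches small integers, so two calls typically produce two distinct integer objects, and `is` compares those objects rather than their values.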
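A small sketch of that effect, with names I’ve picked for illustration. The `is` result depends on CPython internals, so treat it as typical rather than guaranteed:

```python
color = 'red'

first_id = id(color)
second_id = id(color)

# Same object, so the integer values match
print(first_id == second_id)   # True

# Each id() call builds its own int object, so identity usually differs
print(first_id is second_id)   # typically False on CPython
```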
It all leads to dis
To truly understand what’s happening under the hood, we need to look at the disassembled bytecode. The `dis` module can show us how Python translates our code into lower-level instructions.
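Here’s a minimal sketch of what that could look like. The function names and the `bytecode_ops` helper are hypothetical, made up for this example. The exact opcode names vary between Python versions, so I’m only checking that the two comparisons compile to different instructions:

```python
import dis


def compare_identity(a, b):
    return a is b


def compare_equality(a, b):
    return a == b


def bytecode_ops(func):
    """Return (opname, argrepr) pairs for a function's bytecode."""
    return [(ins.opname, ins.argrepr) for ins in dis.get_instructions(func)]


# Print the human-readable disassembly of each function
dis.dis(compare_identity)
dis.dis(compare_equality)

identity_ops = bytecode_ops(compare_identity)
equality_ops = bytecode_ops(compare_equality)

# The two operators do not compile to the same instruction sequence
print(identity_ops != equality_ops)   # True
```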
Here’s a good read on this topic: Introduction to the Python Interpreter.
Finally, here’s an additional resource on string comparison in Python: String comparison in Python: ‘==’ vs ‘is’.
Conclusion
Understanding the nuances of string comparison and interning in Python can be tricky, but it helps you write more efficient code and avoid subtle bugs.
The `is` operator should generally be avoided for string comparison because it checks for object identity, not equality. Instead, use `==` to compare string values.
Thanks for reading, and I hope this deep dive into Python string interning was enlightening!