I came across a bit of Python code that checked if a color was in a list like this:

if color is 'red' or color is 'blue':
    do_something()

It felt wrong. In Python, we shouldn’t be able to rely on is to check whether one string matches another, since is compares their references, not their values. I opened a Python interpreter:

>>> color = 'red'
>>> color is 'red'
True

That feels like it shouldn’t work! After research, I came across a StackOverflow question that talked about this and it got me interested. I tweeted:

Why does it work this way? What’s Python doing under the hood?

**[Editor’s Note: This blog post is originally from 2017, and I didn’t finish it until now. It’s Python 2-centric, though I still think it’s interesting!]**

What’s Python Doing?

The simplest answer to this is that the is comparison checks whether two references point to the same object, which it does by comparing the objects’ unique IDs. In other words,

color is 'red'

is checking if

id(color) == id('red')

Here’s what the above scenario looks like in an interpreter:

>>> id('red')
4397469120
>>> id(color)
4397468952
>>> id('red') == id('red')
True
>>> id(color) == id('red')
False

In this session, unlike the first one, color was built by joining a list at runtime rather than assigned from a literal. Even though the values are the same, the variable refers to a different object than the string literal 'red'.

That explains the result. We’re comparing the IDs of each and, since color and the literal 'red' refer to different objects, we get False.
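To make that concrete, here’s a minimal sketch that separates value equality from identity. I’m assuming color was built with ''.join(...), which is my reading of "joining a list"; the variable names are just for illustration.

```python
# A literal, created when the source is compiled.
literal = 'red'

# An equal string built at runtime (assumed construction, see above).
color = ''.join(['r', 'e', 'd'])

same_value = color == literal         # compares the characters
same_object = color is literal        # compares object identity
ids_match = id(color) == id(literal)  # the same identity check, spelled out

print(same_value, same_object, ids_match)
```

On CPython I’d expect the first value to be True and the other two to be False, but only the first is guaranteed by the language. The last two always agree with each other, since both objects are alive while they are compared.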

String Interning and How it Works

String interning is a method of storing only one copy of each distinct string value, which must be immutable. Interning is done to optimize memory usage and speed up comparisons (see more here).

In Python 2, intern() is a built-in function. In Python 3, it lives in the sys module as sys.intern(). Coming back to our example, using Python 3:

>>> color is 'red'
False
>>> import sys
>>> interned_color = sys.intern(color)
>>> interned_color
'red'
>>> interned_color is 'red'
True

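Adrien Guillo wrote an excellent explanation on how Python handles string interning. The short version, for CPython, is that strings of length 0 and 1 are all interned, and other strings are interned at compile time only when they are constants in the source, such as literals that look like identifiers. Strings built at runtime are not interned automatically.

As such, the IDs of the two items are different: the literal 'red' was created (and interned) at compile time, while color was created at runtime and never interned.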
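Here’s a small sketch of what interning buys you for strings built at runtime. The build() helper is hypothetical; it only exists to force a fresh string object on each call.

```python
import sys

def build(word):
    # Hypothetical helper: rebuilds the string at runtime so the result
    # is a new object rather than a compile-time constant.
    return ''.join(list(word))

a = build('hello world')
b = build('hello world')

# Equal values, but (on CPython) two separate objects.
print(a == b, a is b)

# sys.intern() returns the one canonical copy for a given value,
# so both calls hand back the same object.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)
```

After interning, an identity check is enough to tell that the two values match, which is where the comparison speed-up comes from.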

More Complications

Why does comparing sometimes yield a different result in a script than in an interpreter? Here are a few interesting discussions:

Deeper Down the Rabbit Hole

Now here’s an example that really bends my brain.

>>> id('red'), id('red')
(4397469008, 4397469008)
>>> id(id('red')), id(id('red'))
(4394823536, 4394823536)
>>> id('red') is id('red')
False

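If the IDs are the same, then why does the is comparison fail? What’s different about this? The discrepancy arises because id() returns a new integer object each time it is called, even when the integer values are equal. In id('red') is id('red'), both integer objects are alive at the same time, so they are two distinct objects and is returns False. In id(id('red')), id(id('red')), by contrast, the first temporary integer is discarded before the second is created, so CPython can reuse the same memory address, and the two outer IDs happen to match.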
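The same effect is easier to see if we hold on to both results. This is a sketch with made-up variable names, and the identity result is CPython behavior, not a language guarantee.

```python
s = 'red'

# Two calls to id() on the same object.
first = id(s)
second = id(s)

values_equal = first == second   # the two integers have the same value
same_int_object = first is second  # but are they the same int object?

print(values_equal, same_int_object)
```

The values are always equal because s stays alive between the two calls. On CPython the two integers are usually separate objects, since only small integers are cached and an ID is typically a large number.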

It all leads to dis

To truly understand what’s happening under the hood, we need to look at the disassembled bytecode. The dis module can show us how Python translates our code into lower-level instructions.
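As a rough illustration, the snippet below lists the opcode names for an identity check and an equality check. The two functions are my own examples, and the exact opcodes vary by CPython version (newer releases use a dedicated IS_OP, older ones fold is into COMPARE_OP), so treat the output as version-specific.

```python
import dis

def by_identity(color, target):
    # Uses `is`: compares object identity.
    return color is target

def by_equality(color, target):
    # Uses `==`: compares values.
    return color == target

# Collect just the opcode names for each function.
identity_ops = [ins.opname for ins in dis.get_instructions(by_identity)]
equality_ops = [ins.opname for ins in dis.get_instructions(by_equality)]

print(identity_ops)
print(equality_ops)
```

Calling dis.dis(by_identity) instead prints the full listing, with arguments and offsets.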

Here’s a good read on this topic: Introduction to the Python Interpreter.

Finally, here’s an additional resource on string comparison in Python: String comparison in Python: ‘==’ vs ‘is’.

Conclusion

Understanding the nuances of string comparison and interning in Python can be tricky, but it’s essential for writing efficient and bug-free code. The is operator should generally be avoided for string comparison because it checks for object identity, not equality. Instead, use == to compare string values.
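For the snippet that started all this, one way to write the check is with a membership test, which falls back to == for each candidate. The function name here is hypothetical.

```python
def is_known_color(color):
    # `in` matches on equality, so this works however the string
    # was created.
    return color in ('red', 'blue')

# A string built at runtime, the case where `is` let us down.
runtime_color = ''.join(['r', 'e', 'd'])

print(is_known_color(runtime_color))
print(is_known_color('green'))
```

This gives the same answer for literals and for strings built at runtime, which is the behavior the original code was presumably after.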

By learning how Python handles strings, you can write more efficient code and avoid subtle bugs.

Thanks for reading, and I hope this deep dive into Python string interning was enlightening!