Simple Counters in Python (with Benchmarks)
It's sometimes necessary to count the number of distinct occurrences in an collection. For example, counting how many times each letter occurs in a block of text. Or sorting a list by its most common member.
If I were to do this sort of counting with SQL, I would generally use something like this:
SELECT count(*)
FROM table
GROUP BY column
This could easily by combined this with an ORDER BY to get the most common items.
However, assuming you are working with some raw data, here are some strategies for counting distinct occurrences in Python. Skip to the end to see which method performs best.
dict and in
A plain dictionary works well as a counter. Though using it is verbose, it performs surprisingly well and works in any python version.
counter = dict()
foods = ['soy', 'dairy', 'gluten', 'soy']
for k in foods:
if not k in counter:
counter[k] = 1
else:
counter[k] += 1
..
>>> counter
{'soy': 2, 'cheese': 1, 'dairy': 1}
defaultdict
I've always loved the defaultdict. Used properly, it can cut out a lot of boilerplate from your code. It has many applications, one of which is a counter.
from collections import defaultdict
counter = defaultdict(int)
foods = ['soy', 'dairy', 'gluten', 'soy']
for k in foods:
counter[k] += 1
..
>>> counter
defaultdict(<type 'int'>, {'soy': 2, 'cheese': 1, 'dairy': 1})
By passing int to the class, all empty keys default to zero. This allows you to do += without setting the key first.
dict and setdefault
Dictionaries have a setdefault method that allows you to set the default value for a single key.
According to the python docs, running setdefault on every key is slower than using defaultdict. The benchmark below confirms this.
counter = dict()
foods = ['soy', 'dairy', 'gluten', 'soy']
for k in foods:
counter.setdefault(k, 0)
counter[k] += 1
...
>>> counter
{'soy': 2, 'cheese': 1, 'dairy': 1}
collections.Counter
Python 2.7 introduced collections.Counter which makes this trivial.
from collections import Counter
foods = ['soy', 'dairy', 'gluten', 'soy']
Counter(foods)
..
>>> counter
Counter({'soy': 2, 'gluten': 1, 'dairy': 1})
By passing a list to the Counter constructor, it does the grouping for us. It still behaves like a dictionary so we can still do stuff like
>>> counter['soy'] += 3
>>> counter['soy']
5
Benchmarks
Here are some quick and dirty benchmarks for these methods. I used this code to generate the data. I took some text by The Bard and counted the number of each letter and each word. There were a lot more unique words than letters which resulted in slower times to count them.
Keys | Counter | defaultdict | dict.setdefault | dict.in |
---|---|---|---|---|
6691 | 3.62 | 1.97 | 2.88 | 1.95 |
26727 | 13.13 | 4.31 | 9.58 | 7.17 |
These results show that while a plain dict and in checks performs best for a smaller number of keys, it's not significantly better than defaultdict. With a larger number of distinct members, defaultdict did substantially better than any other option.
Use defaultdict
The takeaway is to stick to the defaultdict when you need a counter. Not only is it performant, but it saves you from the boilerplate of operating on every key.
While Counter is shinny and convenient, it's slow. As an added bonus, defaultdict works in Python 2.5. If you are stuck with python 2.4 (upgrade!), running in on every key is your best option.
Edit: Updated in light of Philip's comment.