Wed 13 February 2013
In Programming.
tags: AWK data analysis
I learned about AWK when I first started using Linux. My exposure to the language
generally came in the form of one-liners that I would cut and paste from the web.
While it seemed like a powerful tool, I never saw it as a full-fledged programming
language and never took the time to learn to use it.
Why AWK
While I've seen some sophisticated applications of AWK in the wild, I mainly used it
for simple operations on log files. I wondered whether properly learning AWK
even made sense.
The book
Research on the topic led me to this Stack Overflow answer by Brandon Craig Rhodes. Mr. Rhodes is an avid speaker in the Python community and
I respect his opinion. He recommends learning AWK not only to increase mastery
at the command line, but as an excuse to read The AWK Programming Language
by the original authors of the language.
Convinced, I acquired the book. While I'm still working my way through it, I've found
it succinct and comprehensive. It's a lot more than a manual for the language; it's
also a discussion of many important programming concepts.
Staying power
What also struck me about AWK is its staying power. Though it's around 40 years old,
a search on usesthis.com
reveals a lot of smart people who explicitly mention AWK as an important part of
their toolset. Even though many of these people also mention a high-level language
like Python or Ruby, AWK stays relevant.
Chat transcripts
Since reading The Most Human Human,
I've been fascinated by chat transcripts. Since I don't have anyone recording and
transcribing my face-to-face conversations, my Gchat logs are the closest thing
I have to a record of a real-time interaction with other people.
With that in mind, I wondered what interesting questions I could answer by analyzing
a transcript. Some of my ideas were:
Duration of interaction
Total words/chars for each participant (who does all the talking?)
Total time when no one was speaking (are we distracted?)
Number of exchanges (how often does the active speaker change?)
Starting with this small set of data points, chat logs could reveal some interesting
dimensions of a relationship. By comparing interactions between different people, or
with the same person over time, trends might start to emerge. Brief, terse
interactions might suggest a casual acquaintance, whereas long (in duration and
words exchanged) and engaging (many exchanges) ones might suggest a close friendship.
To generate the source transcript for this post, I pulled up the chat log in Gmail,
pressed print and then simply cut and pasted it into a text file.
AWK: string processing made easy
AWK is a data extraction language. While it has a rich set of features that enable
a variety of applications, it really shines at manipulating text and freeing the data
within. In a few short lines, it can manage tasks that would take more work in
other languages. In Python, opening a text file and running a regular expression
against each line requires some boilerplate code to get started.
import re

with open('data.txt', 'r') as f:
    for line in f:
        if re.match(r'[0-9]+', line):
            print(line, end='')
AWK allows us to do this from the command line and get to the real work much more quickly.
This is a contrived example but it's meant to show that AWK makes some tasks very
easy.
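For comparison, here is a rough AWK counterpart to the Python snippet, wrapped in a small shell function. The name numeric_lines is made up for this sketch. The leading ^ in the pattern mirrors the fact that re.match only matches at the start of a line.

```shell
# Print every line that starts with one or more digits.
# numeric_lines is a hypothetical helper name used only for illustration.
numeric_lines() {
    # A pattern with no action prints the matching line by default.
    awk '/^[0-9]+/' "$@"
}

# Example: numeric_lines data.txt
```

With no file argument, the function reads standard input, so it can also sit at the end of a pipeline.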
Flocks of AWKs
Since its creation in the 1970s, AWK implementations have proliferated. They differ
in their licensing, speed and feature set. The original implementation, the one
described in the seminal volume on the language, is known as nawk. This is the version
available by default in BSD operating systems and OS X. FreeBSD calls it
"one true awk".
The GNU project provides an alternative implementation called gawk. It adds features
not included in the original language, including built-in date functions and true
multidimensional arrays. It's provided under the GPL, which may matter to some. For
me, the additional features justify the extra installation on OS X (brew install
gawk did the trick). Gawk is required to run the code for these examples.
Parsing the transcript
Basic AWK programs are structured in blocks like
condition { action }
AWK reads a target file line by line. For each line, it checks each condition in
turn and, where the condition holds, performs the corresponding action before moving
on to the next condition. When parsing this chat transcript, we have
four types of lines. Some indicate a speaker:
me: hi!
Others indicate the time:
3:27 PM
Or when a certain amount of time has elapsed between messages:
5 minutes
Some have no distinguishing features at all and are just lines of text. These need to
be attributed to the active speaker as indicated by the last speaker identifier line.
My strategy for handling different cases is to look first for the time-related lines.
If we find one, we stop processing the line using next. If we find a line
identifying a speaker, we store the active speaker then remove the speaker
designation (e.g. me:) from the line, leaving only the raw chat content. Then, for
all remaining lines we simply count the words and characters and attribute them to
the active speaker.
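The overall flow can be sketched in a few rules. This is a minimal illustration rather than the script behind this post: the function name attribute_lines, the exact regexes for the time lines, and the tab-separated output are all assumptions. It sticks to POSIX awk features and only prints each content line next to its speaker instead of counting anything.

```shell
# attribute_lines is a hypothetical name; it prints "speaker<TAB>content".
attribute_lines() {
    awk '
    # Time-related lines carry no chat content, so skip them early.
    /^[0-9]+:[0-9][0-9] (AM|PM)$/ { next }
    /^[0-9]+ minutes/             { next }

    # Speaker line: remember who is talking and strip the "name: " prefix.
    /^[A-Za-z]+: / {
        speaker = $1
        sub(/^[A-Za-z]+: /, "")
    }

    # Whatever is left is chat content for the active speaker.
    # Lines seen before any speaker line get an empty speaker.
    { print speaker "\t" $0 }
    ' "$@"
}
```

Because the speaker rule does not end with next, the stripped line falls through to the final rule, which is what lets one rule handle both speaker lines and plain continuation lines.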
Here we handle a line that indicates a speaker.
# Speaker change line
# e.g. me: I love cats
/^[A-Za-z]+: / {
    speaker = $1
    if (speaker !~ /^me/) {
        other_speaker = $1
    }
    changes++
    # Remove the speaker from the line
    sub($1 FS, "")
}
For some lines, the script is simpler if we use the three-argument form of the match
function provided by gawk, which stores each parenthesized group of the regex in an
array, ready for processing. For example, to calculate how much dead time elapsed
between messages, we do the following:
# Dead time line
# e.g. 5 minutes
match($0, /^([0-9]+) minutes/, out) {
    dead_time += out[1]
    # Don't count this as chat content
    next
}
This makes it easy to capture the number of minutes and add it to our total.
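The array argument to match is a gawk extension. For readers without gawk, a rough portable alternative is sketched below. It relies on the observation that the number is the first whitespace-separated field of the line, so $1 already holds it. The function name dead_time_total is invented for this example.

```shell
# dead_time_total is a hypothetical name; it prints the summed minutes.
dead_time_total() {
    awk '
    # On a line such as "5 minutes", $1 is the number of minutes.
    /^[0-9]+ minutes/ { dead_time += $1 }

    # Adding 0 forces a numeric result, so an empty total prints as 0.
    END { print dead_time + 0 }
    ' "$@"
}
```

This trades the generality of capture groups for portability: it only works because the value of interest happens to be a whole field.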
Issues
Regexes
This strategy presents a problem. If a user sends a chat message that reads:
10 minutes
it will be counted as dead time. Getting around this would require parsing the
HTML version of the chat log. Since we want to use AWK for its plain-text goodness,
we will ignore this issue.
Multiple speakers
The program only works for two-party conversations. It could be modified to allow for
chats involving any number of parties.
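One possible way to lift the two-party restriction is to key the counters on the speaker's name using AWK's associative arrays. The sketch below is an assumption about how that could look, not the code behind this post. The function name count_by_speaker, the time-line regexes, and the "name words chars" output format are all invented for illustration.

```shell
# count_by_speaker is a hypothetical name; it prints "name words chars".
count_by_speaker() {
    awk '
    # Skip time-related lines.
    /^[0-9]+:[0-9][0-9] (AM|PM)$/ { next }
    /^[0-9]+ minutes/             { next }

    # Speaker line: switch the active speaker and strip the prefix.
    /^[A-Za-z]+: / {
        speaker = $1
        sub(/^[A-Za-z]+: /, "")
    }

    # Tally content per speaker; array entries appear on first use.
    speaker != "" {
        words[speaker] += NF
        chars[speaker] += length($0)
    }

    END {
        for (s in words) print s, words[s], chars[s]
    }
    ' "$@" | sort
}
```

The order of a for-in loop over an AWK array is unspecified, which is why the output is piped through sort.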
Output
Using AWK's END block, we print our results:
me: 890 words (46%), 4415 characters (47%)
Jose: 1032 words (53%), 4896 characters (52%)
exchanges: 115
duration: 109 minutes
dead_time: 4 minutes
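The per-speaker lines of such a report could be produced along these lines. This is a hedged sketch: the function name report and its input format (one "name words chars" record per line) are invented for the example. The percentages in the output above are consistent with truncation rather than rounding, which is what %d does in this printf.

```shell
# report is a hypothetical name; input is "name words chars" per line.
report() {
    awk '
    # Store each record and keep running totals.
    { name[NR] = $1; w[NR] = $2; c[NR] = $3; tw += $2; tc += $3 }

    # END runs once, after the last input line has been read.
    END {
        if (tw == 0 || tc == 0) exit
        for (i = 1; i <= NR; i++)
            printf "%s %d words (%d%%), %d characters (%d%%)\n",
                   name[i], w[i], 100 * w[i] / tw, c[i], 100 * c[i] / tc
    }
    ' "$@"
}
```

The totals are only complete once all input has been read, so the percentages have to be computed in END rather than in the per-line rule.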
Impressions
AWK is a great tool and I think it's worth a programmer's time to learn it. That
said, it is not without its problems.
Readability
AWK is good at what it does but I don't find the code I wrote very readable. Perhaps
this is my own inexperience. With a more complex project, this could lead to
maintenance issues. I'd be interested to know how more experienced AWKers deal with
this.
Data structures
Lack of data structures (e.g. lists), as well as a limited set of built-in
functionality (esp. outside of gawk) can make things harder.
Overall, I've enjoyed my foray into AWK. While I still wouldn't use it for anything
too complex, it's always good to learn new tools. Plus, I've already found myself
using it in cases where I would normally have to paste data into a spreadsheet.
Having these tasks in scripts saves time and adds flexibility.
The code I wrote for this post is available on GitHub.