Analyze Gchat transcripts in AWK

I learned about AWK when I first started using Linux. My exposure to the language generally came in the form of one-liners that I would cut and paste from the web. While it seemed like a powerful tool, I never saw it as a full-fledged programming language and never took the time to learn to use it.

Why AWK

While I've seen some sophisticated applications of AWK in the wild, I mainly used it for simple operations on log files. I wondered whether properly learning AWK even made sense.

The book

Research on the topic led me to this Stack Overflow answer by Brandon Craig Rhodes. Mr. Rhodes is an avid speaker in the Python community and I respect his opinion. He recommends learning AWK not only to increase mastery at the command line, but as an excuse to read The AWK Programming Language by the original authors of the language.

Convinced, I acquired the book. While I'm still working my way through it, I've found it succinct and comprehensive. It's a lot more than a manual for the language; it's a discussion of many important programming concepts.

Staying power

What also struck me about AWK is its staying power. Though it's around 40 years old, a search on usesthis.com reveals a lot of smart people who explicitly mention AWK as an important part of their toolset. Even though many of these people also mention a high-level language like Python or Ruby, AWK stays relevant.

Chat transcripts

Since reading The Most Human Human, I've been fascinated by chat transcripts. Since I don't have anyone recording and transcribing my face-to-face conversations, my Gchat logs are the closest thing I have to a record of a real-time interaction with other people.

With that in mind, I wondered what interesting questions I could answer by analyzing a transcript. Some of my ideas were:

Duration of interaction

Total words/chars for each participant (who does all the talking?)

Total time when no one was speaking (are we distracted?)

Number of exchanges (how often does the active speaker change?)

Starting with this small set of data points, chat logs could reveal some interesting dimensions of a relationship. By comparing interactions between different people, or with the same person over time, trends might start to emerge. Very brief and terse interactions might suggest a casual acquaintance, whereas very long (in duration and words exchanged) and engaging (in number of exchanges) interactions might suggest a close friendship.

To generate the source transcript for this post, I pulled up the chat log in Gmail, pressed print, and then simply copied and pasted it into a text file.

AWK: string processing made easy

AWK is a data extraction language. It has a rich set of features that enables a variety of applications, but it really shines at manipulating text and freeing the data within. In a few short lines, it can manage tasks that would take more work in other languages. In Python, to open a text file and run a regular expression over each line, we need some boilerplate code to get started.

import re

with open('data.txt', 'r') as f:
    for line in f:
        # re.search matches anywhere in the line, like AWK's /[0-9]+/
        if re.search(r'[0-9]+', line):
            print(line, end='')

AWK allows us to do this from the command line and get to the real work much more quickly.

awk '/[0-9]+/' data.txt

This is a contrived example but it's meant to show that AWK makes some tasks very easy.

Flocks of AWKs

Since its creation in the 1970s, AWK implementations have proliferated. They differ in their licensing, speed and feature set. The original implementation, the one described in the seminal volume on the language, is known as nawk. This is the version available by default in BSD operating systems and OS X. FreeBSD calls it "one true awk".

The GNU project provides an alternative implementation called gawk. It adds features not included in the original language, including built-in date functions and true multidimensional arrays. It's provided under the GPL, which may matter to some. For me, the additional features justify the extra installation on OS X (brew install gawk did the trick). Gawk is required to run the code for these examples.

Parsing the transcript

Basic AWK programs are structured in blocks like

condition { action }

AWK reads a target file line by line. For each line it checks the blocks in order: if a block's condition holds, it performs the action, then moves on to the next block. When parsing this chat transcript, we have four types of lines. Some indicate a speaker:

me: hi!

Others indicate the time:

3:27 PM

Or when a certain amount of time has elapsed between messages:

5 minutes

Some have no distinguishing features at all and are just lines of text. These need to be attributed to the active speaker as indicated by the last speaker identifier line.

My strategy for handling different cases is to look first for the time-related lines. If we find one, we stop processing the line using next. If we find a line identifying a speaker, we store the active speaker then remove the speaker designation e.g. me: from the line, leaving only the raw chat content. Then, for all remaining lines we simply count the words and characters and attribute them to the active speaker.
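The dispatch order described above can be sketched as one rule per line type. This is a minimal illustration rather than the script from this post: the function name classify_lines and the exact regexes are my assumptions, and it sticks to plain POSIX awk so it runs without gawk.

```shell
# Hypothetical helper: label each transcript line with its type.
classify_lines() {
    awk '
    # Clock time, e.g. "3:27 PM". next skips the remaining rules.
    /^[0-9]+:[0-9][0-9] [AP]M$/ { print "time: " $0; next }

    # Elapsed time between messages, e.g. "5 minutes"
    /^[0-9]+ minutes$/ { print "gap: " $0; next }

    # Speaker change, e.g. "me: hi!"
    /^[A-Za-z]+: / { print "speaker: " $0; next }

    # Anything else is a continuation by the active speaker
    { print "text: " $0 }
    ' "$@"
}

printf '%s\n' '3:27 PM' 'me: hi!' 'how are you?' '5 minutes' | classify_lines
# time: 3:27 PM
# speaker: me: hi!
# text: how are you?
# gap: 5 minutes
```

Because each time-related rule ends in next, a line is only ever handled by the first rule that claims it. The function also accepts file names, e.g. classify_lines chat.txt.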

Here we handle a line that indicates a speaker.

# Speaker change line
# e.g. me: I love cats
/^[A-Za-z]+: / {
    speaker = $1
    # Anyone who is not exactly "me:" is the other party
    if (speaker !~ /^me:$/){
        other_speaker = $1
    }
    changes++

    # Remove the speaker from the line
    sub($1 FS, "");
}
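The last step of the strategy, counting what is left of each line, is not shown above. A minimal sketch of how it might look, assuming characters are counted with spaces included (the function name attribute_lines is hypothetical, and the time-related rules are left out for brevity):

```shell
# Hypothetical helper: show how each line is attributed to the active speaker.
attribute_lines() {
    awk '
    # Speaker change: remember who is talking, then strip the "name: " prefix
    /^[A-Za-z]+: / {
        speaker = $1
        sub(/^[A-Za-z]+: /, "")
    }

    # Every line that reaches this rule is chat content.
    # NF and length($0) reflect the line after sub() removed the prefix.
    speaker != "" {
        printf "%s %d words, %d characters\n", speaker, NF, length($0)
    }
    ' "$@"
}

printf '%s\n' 'me: I love cats' 'so much' 'Jose: me too' | attribute_lines
# me: 3 words, 11 characters
# me: 2 words, 7 characters
# Jose: 2 words, 6 characters
```

The second line has no speaker prefix, so it is credited to whoever spoke last. Modifying $0 with sub() makes AWK re-split the fields, which is why NF can be used directly as the word count.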

When parsing some of the lines, it simplifies the script to use gawk's extended match function, which accepts an array as an optional third argument and fills it with the captured groups. This makes it easier to capture segments of the string for processing. For example, when calculating how much dead time elapsed between messages, we do:

# Dead time line
# e.g. 5 minutes
match($0, /^([0-9]+) minutes/, out) {
    dead_time += out[1]
    # Don't count this as chat content
    next
}

This makes it easy to capture the number of minutes and add it to our total.
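The clock-time lines can feed the duration figure in a similar way. The following is only a sketch of one possible approach, assuming timestamps always look like 3:27 PM and the chat lasts less than 24 hours. The names chat_duration and to_minutes are made up for illustration, and it avoids gawk extensions so it runs in any POSIX awk.

```shell
# Hypothetical helper: minutes between the first and last timestamp lines.
chat_duration() {
    awk '
    # Convert a clock time such as "3:27" plus "PM" to minutes since midnight
    function to_minutes(clock, ampm,    parts, hour) {
        split(clock, parts, ":")
        hour = parts[1] % 12            # 12 AM -> 0, 12 PM -> 0 (+12 below)
        if (ampm == "PM")
            hour += 12
        return hour * 60 + parts[2]
    }

    # Timestamp line, e.g. "3:27 PM"
    /^[0-9]+:[0-9][0-9] [AP]M$/ {
        now = to_minutes($1, $2)
        if (!seen++)
            first = now
        last = now
        next
    }

    END {
        elapsed = last - first
        if (elapsed < 0)                # the chat ran past midnight
            elapsed += 24 * 60
        print "duration: " elapsed " minutes"
    }
    ' "$@"
}

printf '%s\n' '3:27 PM' 'me: hi!' '4:10 PM' 'Jose: hello' | chat_duration
# duration: 43 minutes
```

This measures the span between the first and last timestamps only, so any chatter after the final timestamp line is not included.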

Issues

Regexes

This strategy presents a problem. Suppose a user types a message, then a subsequent message which reads:

10 minutes

This will get counted as dead air time. Getting around this would require parsing the HTML version of the chat log. Since we want to use AWK for its plain-text goodness, we will ignore this issue.

Multiple speakers

The program only works for two-party conversations. It could be modified to allow for chats involving any number of parties.
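One way that modification might go is to key the totals by speaker name in an associative array instead of tracking "me" and one other party. A rough sketch under that assumption (count_by_speaker is a hypothetical name, and only words are counted to keep it short):

```shell
# Hypothetical helper: word totals for any number of speakers.
count_by_speaker() {
    awk '
    /^[A-Za-z]+: / {
        speaker = $1
        sub(/^[A-Za-z]+: /, "")
        # Remember first-seen order, since "for (s in words)" is unordered
        if (!(speaker in words)) {
            order[++n] = speaker
            words[speaker] = 0
        }
    }

    # Credit the remaining words to whoever spoke last
    speaker != "" { words[speaker] += NF }

    END {
        for (i = 1; i <= n; i++)
            print order[i], words[order[i]]
    }
    ' "$@"
}

printf '%s\n' 'me: hi all' 'Jose: hey' 'Ana: good morning everyone' 'me: shall we start' | count_by_speaker
# me: 5
# Jose: 1
# Ana: 3
```

The separate order array is there because AWK does not guarantee any iteration order over array keys.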

Output

Using AWK's END directive, we print our results:

me:  890 words  (46%),  4415 characters  (47%)
Jose: 1032 words  (53%),  4896 characters  (52%)
exchanges: 115
duration: 109 minutes
dead_time: 4 minutes
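For readers who have not used END before, here is a stripped-down sketch of how a report like this could be produced for a two-party chat. It is not the script from this post: chat_summary and pct are hypothetical names, only words are reported, and I'm assuming percentages are truncated rather than rounded, which would be consistent with the figures above summing to 99.

```shell
# Hypothetical helper: per-speaker word share for a two-party chat.
chat_summary() {
    awk '
    # Truncated integer percentage; guards against an empty transcript
    function pct(part, whole) { return whole ? int(100 * part / whole) : 0 }

    /^[A-Za-z]+: / {
        speaker = $1
        if (speaker != "me:")
            other_speaker = speaker
        exchanges++
        sub(/^[A-Za-z]+: /, "")
    }

    speaker != "" { words[speaker] += NF; total += NF }

    # END runs once, after the last input line has been processed
    END {
        printf "me: %d words (%d%%)\n", words["me:"], pct(words["me:"], total)
        printf "%s %d words (%d%%)\n", other_speaker, words[other_speaker], pct(words[other_speaker], total)
        print "exchanges: " exchanges
    }
    ' "$@"
}

printf '%s\n' 'me: I love cats' 'so much' 'Jose: me too' | chat_summary
# me: 5 words (71%)
# Jose: 2 words (28%)
# exchanges: 2
```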

Impressions

AWK is a great tool and I think it's worth a programmer's time to learn it. That said, it is not without its problems.

Readability

AWK is good at what it does but I don't find the code I wrote very readable. Perhaps this is my own inexperience. With a more complex project, this could lead to maintenance issues. I'd be interested to know how more experienced AWKers deal with this.

Data structures

The lack of data structures (e.g. lists), as well as the limited set of built-in functionality (especially outside of gawk), can make things harder.
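The usual workaround I've seen for the missing list type is an associative array indexed by a running integer. A small sketch of that idea (last_messages is a hypothetical name, and the choice of three lines is arbitrary):

```shell
# Hypothetical helper: print the last three lines of the input.
last_messages() {
    awk '
    # Append each line to a "list" kept in an integer-indexed array
    { lines[++n] = $0 }

    END {
        start = (n > 3) ? n - 2 : 1
        for (i = start; i <= n; i++)
            print lines[i]
    }
    ' "$@"
}

printf '%s\n' 'hi' 'hello' 'how are you?' 'fine, thanks' 'bye' | last_messages
# how are you?
# fine, thanks
# bye
```

It works, but the length has to be tracked by hand in n, which is the kind of bookkeeping a real list type would take care of.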

Overall, I've enjoyed my foray into AWK. While I still wouldn't use it for anything too complex, it's always good to learn new tools. Plus, I've already found myself using it in cases where I would normally have to paste data into a spreadsheet. Having these tasks in scripts saves time and adds flexibility.

The code I wrote for this post is available on GitHub.