Data mining local radio with Node.js

More harpsichord?!

Seattle is lucky to have KINGFM, a local radio station dedicated to 100% classical music. As one of the few classical music fans in their twenties, I listen often. Over the past few years, I've noticed that whenever I tune in to the station, I always seem to hear the plinky sound of a harpsichord.

Before I sent KINGFM an email admonishing them for playing so much of an instrument I dislike, I wanted to investigate whether my ears were deceiving me. Perhaps my own distaste for the harpsichord increased its impact in my memory.

This article outlines the details of this investigation and especially the process of collecting the data.

If it ain't baroque...

A harpsichord is in many ways similar to a piano. Pressing one of its keys triggers an internal mechanism that plucks a string, and the resulting vibration of the string produces the corresponding pitch. Because its strings are plucked rather than struck, the instrument has almost no dynamic range: each note sounds at roughly the same volume, however firmly or softly the player strikes the key. Some harpsichords have several choirs of strings that allow the player limited control of volume and timbre.

The harpsichord can sound tinny to modern ears. Thomas Beecham famously said, "The sound of a harpsichord - two skeletons copulating on a tin roof in a thunderstorm."

At the start of the 18th century, the newly invented fortepiano began to push both the harpsichord and its close relative, the clavichord, out of favor. The new instrument worked more like the modern piano in that its strings were struck with padded hammers. Compared to the other keyboard instruments of its day, it had a more resonant sound and allowed the player to control the dynamics of each note simply by altering the force with which he struck the keys.

The period before the fortepiano, during which the harpsichord had its heyday, is known as the Baroque Era. The history of classical music is often divided into several historical "eras" or "periods". The dates that separate them are somewhat arbitrary, with substantial overlap. I'll follow Wikipedia in defining these boundaries because it has the most comprehensive composer data under the most permissive license.

Wikipedia's dates differ little from those outlined by Naxos, a well respected music label with an extremely comprehensive catalog. Unfortunately, the Naxos ToS are extremely restrictive with respect to their composer data.

These eras are:

Medieval (476–1400)

Renaissance (1400–1600)

Baroque (1600–1760)

Classical era (1730–1820)

Romantic era (1815–1910)

20th century (1900–2000)

21st century (since 2000)

Since King seems to play very little music from before 1600, I ignored the Medieval and Renaissance eras in my analysis.

Because of the dominance of the piano and its predecessors starting in the Classical Era, one is less likely to find the sound of the harpsichord in modern recordings of anything but Baroque music. Even then, music originally written for harpsichord is often transcribed for the piano and recorded that way. Glenn Gould, perhaps the most famous modern Bach interpreter, is well known for recordings of such transcriptions.

One exception is opera. The harpsichord was used for accompanying recitative all the way into the late 18th century. For simplicity, we will blissfully ignore this fact.

Collecting the data

KINGFM posts its playlist daily on its website. By scraping these pages, I was able to build the dataset.

Scraping with Node

Web scraping is normally a network-constrained task: most of the execution time is spent waiting for the server to respond. Node.js encourages an asynchronous style that is well suited to such tasks. Using the request module, it's easy to send non-blocking HTTP requests and process each result in a callback as it's returned. Because nothing blocks, though, all of those requests go out almost at once, so rate limiting yourself is important when scraping more than a few pages. Otherwise, the flood of requests you unleash is likely to get you blocked or to interfere with the target site.

Another advantage of Node for this use case is that it brings existing client-side libraries to the server. Great scraping tools exist in other languages (e.g. scrapy). However, since many developers already have years of experience using jQuery client-side to access the DOM, they may prefer to use a familiar API instead of learning a new one.
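As a rough illustration of rate limiting, here is a minimal sketch in plain JavaScript. It is not the code from my scraper: the names scrapeSequentially, fetchPage and sleep are made up for this example, and it uses promises rather than the callback style shown later. It simply fetches one page at a time and pauses between requests.

```javascript
// Hypothetical helper: resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Fetch each URL in turn, waiting `delayMs` between requests.
// `fetchPage` is any function that takes a URL and returns a Promise
// for the page body (an assumption of this sketch, e.g. a thin
// wrapper around your HTTP client of choice).
async function scrapeSequentially(urls, fetchPage, delayMs) {
  const results = [];
  for (const url of urls) {
    // Only one request is in flight at any moment.
    results.push(await fetchPage(url));
    // Be polite: pause before hitting the server again.
    await sleep(delayMs);
  }
  return results;
}
```

Something like scrapeSequentially(urls, fetchPage, 1000) would then make roughly one request per second. Going one page at a time gives up some of Node's concurrency, but for a month of playlist pages the extra politeness costs very little.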

Cheerio

Several npm packages are available to help us use jQuery in Node. jsdom is a popular option that implements the full DOM in JS, allowing us to use jQuery or almost any other client-side library on the server.

However, cheerio better suits this simple task. The project provides a re-implementation of the most important parts of jQuery core. It's simpler to use, and the author claims it's much faster than jsdom. Since much of the official jQuery source provides functionality we don't need here, like AJAX and browser compatibility, a re-implementation that leaves this bloat behind is preferable.

An example scraper

Especially when used in concert with CoffeeScript, cheerio makes for readable scrapers. By leveraging the superpower that is the jQuery selector, we can often get at our data with minimal code. As an example, let's use it to scrape the target URLs from the front page of reddit using the selector #siteTable a.title.

request = require('request')
cheerio = require('cheerio')

parse_page = (error, response, body) ->
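  if error or response.statusCode != 200
    # error is null when the request succeeded but the status was not 200
    console.log(error or "Unexpected status: #{response.statusCode}")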
  else
    # Load the page into cheerio
    $ = cheerio.load(body)

    # Iterate over the links on the front page
    $('#siteTable a.title').each (k, v) =>
      console.log($(v).attr('href'))

request.get("http://www.reddit.com", parse_page)

Using this technique, I quickly pulled the last 30 days of playlist data from KINGFM and dumped it into a file.

The joys of data normalization

Then came the hard part: associating composer names in the playlist data with historical eras. This is more difficult than it seems because subtle differences in the datasets could result in mismatches. King is mostly consistent in defining its composers but it does so in a different format from Wikipedia. It lists "SCHUBERT" instead of "Franz Schubert". These cases are easy enough: simply convert to lowercase and lop off the first name.

There are several types of more difficult cases. In some, the database contains multiple composers who share a last name, e.g. J.S. Bach and all of his sons. Here we need first names or initials to differentiate. Unfortunately, the formats differ between the data sets: Wikipedia has "Carl Philipp Emanuel Bach" and KINGFM has "BACH, C.P.E". Other tough cases are those where a composer has a multi-word last name, e.g. "Vaughan Williams". Still other annoying cases occur where the diacritics do not match, e.g. "Dvořák" and "DVORÁK" (no accent on the r). Since my data set is fairly small (3210 playlist items), I was able to handle these unfortunate cases with regular expressions and frustration.
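To give a flavor of what that normalization can look like, here is a small sketch in plain JavaScript. It is not my actual script: the function names (foldName, keyFromFullName, keyFromPlaylistName, sameComposer) and the list of multi-word surnames are invented for this example. The idea is to reduce both formats to a lowercase, accent-free surname plus initials and compare those.

```javascript
// Lowercase a name and strip diacritics ("Dvořák" -> "dvorak").
function foldName(name) {
  return name
    .normalize('NFD')                 // split letters from combining marks
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
    .toLowerCase()
    .trim();
}

// Turn given names or dotted initials into bare initials ("c.p.e" -> "cpe").
function toInitials(given) {
  return given.split(/[\s.]+/).filter(Boolean).map((w) => w[0]).join('');
}

// Wikipedia-style full name, e.g. "Carl Philipp Emanuel Bach".
// `multiWordSurnames` is a hand-maintained list (an assumption of this sketch).
function keyFromFullName(fullName, multiWordSurnames = ['vaughan williams']) {
  const folded = foldName(fullName);
  const last =
    multiWordSurnames.find((s) => folded.endsWith(' ' + s)) ||
    folded.split(/\s+/).pop();
  const given = folded.slice(0, folded.length - last.length);
  return { last, initials: toInitials(given) };
}

// Playlist-style name, e.g. "BACH, C.P.E" or plain "SCHUBERT".
function keyFromPlaylistName(entry) {
  const [last, given = ''] = foldName(entry).split(',');
  return { last: last.trim(), initials: toInitials(given) };
}

// A bare surname matches anyone with that surname; initials narrow it down.
function sameComposer(playlistEntry, fullName) {
  const a = keyFromPlaylistName(playlistEntry);
  const b = keyFromFullName(fullName);
  return a.last === b.last && (a.initials === '' || a.initials === b.initials);
}

console.log(sameComposer('BACH, C.P.E', 'Carl Philipp Emanuel Bach')); // true
console.log(sameComposer('BACH, C.P.E', 'Johann Sebastian Bach'));     // false
```

A bare surname still matches every composer who shares it, so ambiguous entries need a manual decision either way.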

Handling overlap

In some cases, a composer belongs to multiple eras. For example, Beethoven's music is said to span both the Classical and the Romantic eras. One way to handle these cases would be to count every movement of Beethoven's as both Classical and Romantic. The downside is that this would result in double counting a lot of the most popular composers.

Instead, I chose to sort the database alphabetically by composer name rather than by era. In cases where there are two entries for the same name, the second one overwrites the first. This is not ideal either, but it should help to randomize the era into which transitional composers are placed. I did some editorializing here for the most prominent composers. For instance, Beethoven was counted as Classical and Schubert as Romantic.
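The overwrite rule is easy to express in code. The following JavaScript sketch is illustrative only: buildEraLookup, the entry shape ({ name, era }) and the sample rows are all made up for this example rather than taken from my scripts or from Wikipedia's data.

```javascript
// Build a name -> era lookup. When a composer appears under several eras,
// the entry that comes last after sorting by name wins. `overrides` holds
// hand-picked eras for prominent composers and is applied at the end.
function buildEraLookup(entries, overrides = {}) {
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...entries].sort((a, b) => a.name.localeCompare(b.name));
  const lookup = {};
  for (const { name, era } of sorted) {
    lookup[name] = era; // a later duplicate overwrites an earlier one
  }
  return Object.assign(lookup, overrides);
}

// Made-up sample rows with the same composer listed under two eras.
const sampleEntries = [
  { name: 'Beethoven', era: 'Classical' },
  { name: 'Beethoven', era: 'Romantic' },
  { name: 'Schubert', era: 'Classical' },
  { name: 'Schubert', era: 'Romantic' },
];

const eras = buildEraLookup(sampleEntries, { Beethoven: 'Classical' });
console.log(eras.Beethoven); // Classical (from the manual override)
console.log(eras.Schubert);  // Romantic (the later duplicate won)
```

Which duplicate wins depends on the order of the input, so the manual overrides are where the real decisions get made.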

Results

Play count

2691 of the 3208 playlist items had matches in the database, leaving 472 unidentified tracks. The results were distributed like this

Era distribution

Top composers

Composer	Play count
Mozart	191
Bach	188
Haydn	114
Beethoven	92
Schubert	83
Chopin	78
Debussy	66
Mendelssohn	63
Brahms	51
Tchaikovsky	46

Air time

Analyzing the total play count for each era is useful. But the more interesting question for a listener is not how often tracks from each era are played. Rather, it's what proportion of the airtime each era consumes. This is an important distinction because some movements last less than a minute while others can last 30 minutes or longer.

Top Composers by airtime (including only the top 16 composers or 48% of total):

This data highlights the importance of using airtime over play count. While King plays J.S. Bach almost as often as Mozart, Mozart gets 42% more airtime, which works out to more than 20 extra hours per month.

Async and ordering

When analyzing airtime, we have to make sure all of our data is properly sorted. Since we are scraping asynchronously and writing the results to disk as they are returned, it's likely that our data will come back in an order different from the order of the HTTP requests.

This can occur because some pages are larger and so take longer to transfer than others. A glance at the data shows that this did indeed happen:

Time	Composer
07/19/2012 11:51pm	PURCELL
07/19/2012 11:54pm	SCHUBERT
07/16/2012 12:02am	CHOPIN

Since we made non-blocking HTTP requests, the data from 7/19 arrived more quickly and so was written to disk before the data from the 16th.

Since we want to access this data as a JavaScript object anyway, we ought not to rely on the default ordering of an object's properties. Property enumeration order was long left unspecified by the ECMAScript spec, and in any case the order in which the keys were written is not chronological. To remedy this, we will use moment.js to parse each date string and convert it to a UNIX timestamp.

These timestamps will be stored in two separate data structures, a list and a hash. The list will be sorted and used for ordering. The hash maps timestamps to composer names. Iterating through the list, we look up the associated composer and use subtraction to work out the total seconds of airtime for each track.

fs = require('fs')
moment = require('moment')

composers_by_air_time = ->

  dates = []
  map = {}
  playlist = JSON.parse(fs.readFileSync('db/playlist.json'))

  for date, composer of playlist
    # Build an array of timestamps together with a mapping from
    # timestamp to composer
    unix_date = moment(date, 'MM/DD/YYYY hh:mma').unix()

    # Push it onto our list of timestamps
    dates.push(unix_date)

    # Map the composer whose work STARTED to this timestamp
    map[unix_date] = composer

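  # Sort the dates numerically so we can subtract adjacent members.
  # (A bare `dates.sort` never calls the function, and the default
  # comparison is lexicographic.)
  dates.sort (a, b) -> a - b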

  results = {}

  for date, idx in dates when idx > 0
    # Subtract each item from its predecessor, ignoring the first one
    prev_date = dates[idx - 1]
    difference = date - prev_date
    composer = map[prev_date]

    # Group by composer name
    if results[composer] > 0
      results[composer] += difference
    else
      results[composer] = difference

  return results

Airtime by era

We can combine the airtime data with the composer era data to get the total airtime by era.
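The combination step is just a group-and-sum. Here is a short JavaScript sketch of the idea; airtimeByEra, the 'Unknown' bucket and the numbers in the example are assumptions made for illustration, not my real data.

```javascript
// Sum per-composer airtime (in seconds) into per-era totals.
// Composers missing from the era lookup land in an 'Unknown' bucket.
function airtimeByEra(secondsByComposer, eraByComposer) {
  const totals = {};
  for (const [composer, seconds] of Object.entries(secondsByComposer)) {
    const era = eraByComposer[composer] || 'Unknown';
    totals[era] = (totals[era] || 0) + seconds;
  }
  return totals;
}

// Illustrative numbers only.
const exampleAirtime = { MOZART: 600, HAYDN: 300, BACH: 450, SMITH: 120 };
const exampleEras = { MOZART: 'Classical', HAYDN: 'Classical', BACH: 'Baroque' };

console.log(airtimeByEra(exampleAirtime, exampleEras));
// { Classical: 900, Baroque: 450, Unknown: 120 }
```

Keeping the unmatched composers in their own bucket makes it easy to see how much airtime the analysis could not classify.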

Conclusions

Blame the bias

The data shows that KINGFM is innocent of the charge of favoring Baroque music over other eras. Indeed, they play less Baroque than anything else: less than half as much as twentieth-century music. It looks like my own bias against the harpsichord has affected my statistical judgment. Good thing I actually checked before blaming the station.

Data mining in Node

Part of my motivation for this post was to get more familiar with using Node and CoffeeScript. This pair makes a convenient programming environment for tasks like web scraping and networking applications.

That said, JavaScript on its own is a poor candidate for data analysis. It has a limited set of built-in data structures and no default support for parsing data from common file formats. Gauss looks like it may help to fill this void, but it will likely be some time until the Node world has something as full-featured as pandas.

For those interested, the simple scripts that I wrote in CoffeeScript for the scraping and analysis are on GitHub.