I just finished reading China Mieville's novel Kraken. It was really cool, it did take me a while to get into Mieville's voice, but once I got into the swing of it I really enjoyed it. Here is the blurb in case you are interested:
"In the Darwin Centre at London’s Natural History Museum, Billy Harrow, a cephalopod specialist, is conducting a tour whose climax is meant to be the Centre’s prize specimen of a rare Architeuthis dux--better known as the Giant Squid. But Billy’s tour takes an unexpected turn when the squid suddenly and impossibly vanishes into thin air."
One of the lines in the book struck me as surprisingly familiar. Here is the quote from the book:
"She flicked through a pad by her bed, where she made notes of various summonings. A spaceape, all writhing tentacles, to stimulate her audio nerve directly? Too much attitude."
After thinking about this for a moment, it clicked that this is a reference to a Burial song called Spaceape (from his self titled album). The line goes:
"Living spaceapes, creatures, covered, smothered in writhing tentacles
Stimulating the audio nerve directly"
I couldn't find any reference online to the inclusion of Burial lyrics in Mieville's novels. Okay I thought, that's a cool a easter egg, but it got my thinking, are there any other song lyrics buried in Mieville's books? And if so, is there any way we can scan them automatically?
The first step was to get a database of song lyrics which we can use to scan the novels for, Unfortunately there is no easy place to find a database of song lyrics, so I was forced to scrape them from a lyrics site.
I used the following free chrome extension web scraper which is very easy to use, and in my experience very reliable:
After about 10 minutes of setting it up, and about an hour of leaving it to run. I had managed to scrape most of Burial's lyrics in to csv files.
I also scraped lyrics by Kode9 and Spaceape so I could see if they were referenced anywhere. It's hard to know which artist I should look for, but both of these have been mentioned by Mieville in interviews.
The web scraping add-in has an easy to use GUI. Here is a screenshot of what it looks like to set it up:
Ebooks in text format
The next step was then to get his ebooks into a format that I could easily analyse. I assumed that I would need them in a csv format, but I actually got away with using .txt in the end. In order to get them into .txt. I used the built-in bulk converter in the following free ebook management program:
Here is a screenshot of Calibre. It is also very easy to use, and freely available online.
Analysing the text
This is now the hardest part of the problem. We have electronic copies of China Mieville's novels in .txt format, and we have a collection of lyrics in .txt format which we would like to compare them against, how can we programmatically analyse whether Mieville references other Burial lyrics in one of his books?
If we simply attempt to match entire strings, then we have the issue that we might miss out on some references due to small changes in word ordering or punctuation. For example, in the example above using Burial's Spaceape, the wording is slightly different and the tenses of some of the words have been changed, therefore looking for an exact match between lyrics and text will probably not work. If on the other hand we don't match complete strings, but just try to match words, then we will be overwhelmed by small words like 'the' and 'a' which will be used multiple times in both Burial's song lyrics, and in China Mieville's novels.
There are two main approaches I came up with.to solve this problem. My first thought was to match individual words, generating a huge list of matches, and then to count the number of uses of each word in Mieville's novels, and then sorting by the words that match but which are also the most uncommon. For example I would imagine that Spaceape is only ever used once in all of Mieville's novels, giving us information about how unusual this word is. Combined with the fact that this word is also used in a Burial lyric, gives us enough information to assume that there is a high probability of a match, at this point we could investigate manually.
I ultimately didn't go down this road. Instead, I had the idea to try to adapt plagiarism detection software to this problem. When you think about it, the two problems are actually quite similar. Plagiarism detection is about trying to automatically check two documents for similar phrases, without relying on complete matches.
Open Source Plagiarism Detection
I found the following free-to-use program created by Lou Bloomfield of the University of Virginia which is perfect for what I was trying to do.
It compares two sets of files and then creates a side by side hyperlinked comparison, which can be viewed in chrome, highlighting the parts of the documents where a possible match has been detected. There are various settings you can tweak to specify how close of a match you are interested in.
I have included a screenshot below of the section where the Spaceape line is detected. There were about 500 matches detected when I ran this, but it only took about a minute to scroll through and check up on the ones that looked significant.
Ultimately, this analysis felt like a bit of a failure. These were the only lyrics I could find in all of his novels and while there is always the chance that I need to expand the number of artists I'm looking at, or refine my detection methods I imagine this is all there is. I still thought the process was quite cool so I thought I'd write up what I had done anyway.
If you have any thoughts, let me know by leaving a comment.