I just finished reading China Mieville's novel Kraken. It was really cool. It did take me a while to get into Mieville's voice, but once I got into the swing of it I really enjoyed it. Here is the blurb in case you are interested:
"In the Darwin Centre at London’s Natural History Museum, Billy Harrow, a cephalopod specialist, is conducting a tour whose climax is meant to be the Centre’s prize specimen of a rare Architeuthis dux--better known as the Giant Squid. But Billy’s tour takes an unexpected turn when the squid suddenly and impossibly vanishes into thin air."
One of the lines in the book struck me as surprisingly familiar. Here is the quote from the book:
"She flicked through a pad by her bed, where she made notes of various summonings. A spaceape, all writhing tentacles, to stimulate her audio nerve directly? Too much attitude."
After thinking about this for a moment, it clicked that this is a reference to a Burial song called Spaceape (from his self-titled album). The line goes:
"Living spaceapes, creatures, covered, smothered in writhing tentacles
Stimulating the audio nerve directly"
I couldn't find any reference online to the inclusion of Burial lyrics in Mieville's novels. Okay, I thought, that's a cool easter egg, but it got me thinking: are there any other song lyrics buried in Mieville's books? And if so, is there any way we can scan for them automatically?
The first step was to get a database of song lyrics to scan the novels against. Unfortunately there is no easy place to find such a database, so I was forced to scrape the lyrics from a lyrics site.
I used the following free Chrome extension web scraper, which is very easy to use and, in my experience, very reliable:
After about 10 minutes of setting it up and about an hour of leaving it to run, I had managed to scrape most of Burial's lyrics into csv files.
I also scraped lyrics by Kode9 and Spaceape so I could see if they were referenced anywhere. It's hard to know which artist I should look for, but both of these have been mentioned by Mieville in interviews.
The web scraping add-in has an easy to use GUI. Here is a screenshot of what it looks like to set it up:
Ebooks in text format
The next step was to get his ebooks into a format that I could easily analyse. I assumed that I would need them in a csv format, but I actually got away with using .txt in the end. To get them into .txt, I used the built-in bulk converter in the following free ebook management program:
Here is a screenshot of Calibre. It is also very easy to use, and freely available online.
Analysing the text
This is now the hardest part of the problem. We have electronic copies of China Mieville's novels in .txt format, and we have a collection of lyrics in .txt format which we would like to compare them against. How can we programmatically analyse whether Mieville references other Burial lyrics in one of his books?
If we simply attempt to match entire strings, then we might miss some references due to small changes in word ordering or punctuation. In the Spaceape example above, the wording is slightly different and the forms of some of the words have been changed ('stimulating' becomes 'stimulate', 'spaceapes' becomes 'spaceape'), so looking for an exact match between lyrics and text will probably not work. If on the other hand we don't match complete strings, but just try to match words, then we will be overwhelmed by small words like 'the' and 'a', which are used many times both in Burial's song lyrics and in China Mieville's novels.
There are two main approaches I came up with to solve this problem. My first thought was to match individual words, generating a huge list of matches, then count the number of uses of each word in Mieville's novels, and finally sort the matching words so that the most uncommon come first. For example, I would imagine that Spaceape is only ever used once in all of Mieville's novels, which tells us how unusual the word is. Combined with the fact that this word is also used in a Burial lyric, this gives us enough information to assume that there is a high probability of a match, at which point we could investigate manually.
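I didn't build this, but here is a minimal Python sketch of what the rare-word idea could look like. The function names and the `max_count` cut-off are my own inventions for illustration, and the two strings are just the quotes from earlier in this post rather than a whole novel.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def rare_shared_words(novel_text, lyrics_text, max_count=1):
    """Words found in the lyrics that occur at most max_count times in the novel.

    Returns (word, count_in_novel) pairs, rarest first.
    """
    novel_counts = Counter(tokenize(novel_text))
    lyric_words = set(tokenize(lyrics_text))
    shared = [(word, novel_counts[word])
              for word in lyric_words
              if 0 < novel_counts[word] <= max_count]
    return sorted(shared, key=lambda pair: (pair[1], pair[0]))

novel = ("She flicked through a pad by her bed, where she made notes of "
         "various summonings. A spaceape, all writhing tentacles, to "
         "stimulate her audio nerve directly? Too much attitude.")
lyrics = ("Living spaceapes, creatures, covered, smothered in writhing "
          "tentacles. Stimulating the audio nerve directly")

for word, count in rare_shared_words(novel, lyrics):
    print(word, count)
```

Note that 'spaceape' itself is not picked up here, because the lyric uses the plural 'spaceapes'. A real version would probably need some kind of stemming to cope with that.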
I ultimately didn't go down this road. Instead, I had the idea to try to adapt plagiarism detection software to this problem. When you think about it, the two problems are actually quite similar. Plagiarism detection is about trying to automatically check two documents for similar phrases, without relying on complete matches.
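To give a flavour of what matching phrases rather than whole strings means, here is a rough Python sketch that looks for short runs of words shared by two texts. This is my own simplified stand-in, not how any particular plagiarism checker works internally, and the function names and the choice of run length are assumptions.

```python
import re

def word_ngrams(text, n=3):
    """Return the set of all runs of n consecutive words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrases(doc_a, doc_b, n=3):
    """Return the word runs of length n that appear in both documents."""
    return sorted(word_ngrams(doc_a, n) & word_ngrams(doc_b, n))

novel = ("A spaceape, all writhing tentacles, to stimulate her audio "
         "nerve directly? Too much attitude.")
lyrics = ("Living spaceapes, creatures, covered, smothered in writhing "
          "tentacles. Stimulating the audio nerve directly")

for phrase in shared_phrases(novel, lyrics, n=3):
    print(" ".join(phrase))
```

Shorter runs catch more (with `n=2` this also finds 'writhing tentacles') but generate more false alarms from everyday phrases, so the run length is the main knob to play with.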
Open Source Plagiarism Detection
I found the following free-to-use program created by Lou Bloomfield of the University of Virginia which is perfect for what I was trying to do.
It compares two sets of files and then creates a side-by-side hyperlinked comparison, which can be viewed in Chrome, highlighting the parts of the documents where a possible match has been detected. There are various settings you can tweak to specify how close a match you are interested in.
I have included a screenshot below of the section where the Spaceape line is detected. There were about 500 matches detected when I ran this, but it only took about a minute to scroll through and check up on the ones that looked significant.
Ultimately, this analysis felt like a bit of a failure. These were the only lyrics I could find in all of his novels, and while there is always the chance that I need to expand the number of artists I'm looking at or refine my detection methods, I imagine this is all there is. I still thought the process was quite cool, so I thought I'd write up what I had done anyway.
If you have any thoughts, let me know by leaving a comment.
Who is the mysterious Satoshi Nakamoto?
Let me pitch you an idea for a movie - following the 2007 financial crisis, fed up with the corruption of the modern financial system, a lone genius creates a new virtual currency with which he aims to completely undermine the modern banking system. This new currency allows instantaneous online payments to be made with minimal transaction fees and with almost complete anonymity. Better yet, this system is completely decentralised, requiring no central bank or governing body. To further add to the mystique, our hero decides to eschew fame, remaining completely anonymous while netting himself a cool USD 1 billion in bitcoins. But our hero decides to walk away and leave the USD 1 billion in bitcoins untouched on a public ledger on the internet, proving to the world that he was never in it for the money.
All that he leaves behind is a name - ***Cue dramatic music*** - Satoshi Nakamoto.
Chuck in some bad guys and a love interest and we've got the making of a Hollywood blockbuster!
This is of course the true story of the origins of bitcoin.
Unsurprisingly, there have been many attempts to find the true identity of Satoshi Nakamoto. Every six months or so a new candidate is found and the media jumps on the bandwagon, but none of the candidates so far has been really convincing.
I thought I'd do a bit of digging myself and see what we have to work with, what we can know for certain, and what we can only speculate about.
So what info do we have to work with? Satoshi left behind the following:
The Forum Posts
Almost all the forum posts are highly technical, and there is very little to be gleaned about Satoshi's identity from the content of the posts. I did look through most of them just in case. But based on an idea in Satoshi's Wikipedia article, I have graphed the timestamps from the forum posts.
All the forum posts can be found on the following website, which I scraped using web scraper and then chucked into excel to extract the timestamps:
We can see that there is a clear trend for most posts to be made between 4pm and 11pm, with almost none being made between 5am and 1pm, suggesting that this is when Satoshi is asleep. Based on most people I know who don't have a 9-5 job but are still involved in IT, this is a pretty reasonable sleeping pattern for someone living in a GMT time zone. If we assume that Satoshi has a conventional sleeping pattern though, then we would expect him to be living somewhere on the US East Coast. Both of these seem plausible to me. It does get less plausible, though, to consider someone living much further east than Europe.
I then graphed the weekday of each forum post, which shows a fairly stable pattern of posts throughout the week. Nothing too surprising here.
It has also been noted that Satoshi's posts use British rather than US spellings. Let's test that too. I collected a list of words that are spelt differently in UK and US English, and cross-referenced it against the forum posts we scraped earlier.
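I did the counting in Excel, but for anyone who would rather script it, here is a rough Python equivalent of the hour-of-day and weekday tallies. The timestamp format and the sample timestamps are made up for illustration; the real forum export may well look different.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout; adjust to whatever the scraped data uses.
DEFAULT_FORMAT = "%Y-%m-%d %H:%M:%S"

def posts_per_hour(timestamps, fmt=DEFAULT_FORMAT):
    """Count posts in each hour of the day (index 0 = midnight to 1am)."""
    hours = Counter(datetime.strptime(ts, fmt).hour for ts in timestamps)
    return [hours[h] for h in range(24)]

def posts_per_weekday(timestamps, fmt=DEFAULT_FORMAT):
    """Count posts on each weekday (index 0 = Monday, 6 = Sunday)."""
    days = Counter(datetime.strptime(ts, fmt).weekday() for ts in timestamps)
    return [days[d] for d in range(7)]

# Hypothetical timestamps, not real forum data.
sample = [
    "2010-02-14 16:05:00",
    "2010-02-14 22:40:10",
    "2010-02-15 16:59:59",
    "2010-02-16 01:12:30",
]

by_hour = posts_per_hour(sample)
by_weekday = posts_per_weekday(sample)
print(by_hour)
print(by_weekday)
```

Both lists can be dropped straight into a bar chart. To test a different home time zone, you would shift each timestamp by the relevant offset before counting.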
The following words were all used by Satoshi but with the UK spelling.
This strongly suggests that the author is more familiar with British English than American English.
It's also been noted (and I concur) that Satoshi's posts read as though written by a native English speaker. He uses common idioms well, and his grammar and structuring all give compelling reasons to think that he is a native English speaker. This has been put forward by some as definitive proof that he is British. I'm not so sure though, having met Europeans who, through a lot of exposure to English speakers growing up, now sound like native English speakers when writing or texting.
Leaving the forum posts for the time being, let's look at the emails in the mailing list.
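For the curious, the cross-referencing can be sketched in a few lines of Python. The word pairs below are a tiny illustrative list of my own, not the full list I used, and the sample sentence is invented rather than taken from Satoshi's posts.

```python
import re

# Illustrative UK -> US spelling pairs (a real list would be much longer).
UK_TO_US = {
    "colour": "color",
    "favour": "favor",
    "grey": "gray",
    "realise": "realize",
    "defence": "defense",
}

def spelling_usage(text, uk_to_us=UK_TO_US):
    """Return (uk_words, us_words): which spellings from the list appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    uk_words = sorted(words & set(uk_to_us))
    us_words = sorted(words & set(uk_to_us.values()))
    return uk_words, us_words

# Invented sample text.
sample_post = "I favour a grey colour scheme, but I realize the color table is wrong."
uk_found, us_found = spelling_usage(sample_post)
print("UK spellings:", uk_found)
print("US spellings:", us_found)
```

Checking for the US spellings as well matters here, since a writer who mixes both would tell a rather different story from one who only ever uses the UK forms.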
Mailing List Emails
The mailing list was a Cryptography focused mailing list, established in 2000, and can be found through the following link:
The website gives the following introduction to the mailing list.
"Cryptography" is a low-noise moderated mailing list devoted to cryptographic technology and its political impact. Occasionally, the moderator allows the topic to veer more generally into security and privacy technology and its impact, but this is rare.
WHAT TOPICS ARE APPROPRIATE:
"On topic" discussion includes technical aspects of cryptosystems, social repercussions of cryptosystems, and the politics of cryptography such as export controls or laws restricting cryptography.
Satoshi began posting to the mailing list in late 2008. His first post was an introduction to his new bitcoin system: it gave a brief overview and then linked to the paper he had written, which contained the technical details. It therefore seems that the mailing list was a way of generating interest in his already fleshed-out system rather than something he had already been contributing to.
I was initially slightly suspicious of how well written the emails are compared to the forum posts. It's been suggested by a few people that Satoshi might actually be the name chosen by a group of collaborators rather than one single person. On consideration though, the mailing list describes itself as 'low-noise' and moderated, so it should perhaps not be surprising that Satoshi polished his grammar and writing when sending emails to it. Plus you'd expect quite a bit more care when replying to an email than when making a forum post.
To be honest I struggled to glean much more from the emails, other than a couple of interesting quotes which I've included at the end of this post.
The Genesis Block
Satoshi created the first block of the first blockchain. Since there were no preceding transactions, Satoshi was able to insert a message into the block.
The message he selected was:
"The Times 03/Jan/2009 Chancellor on brink of second bailout for banks"
This tells us a few things. Firstly, it's evidence that no Bitcoins were mined prior to this date. Secondly, it could be seen as a comment on the financial bailout that was ongoing at the time and which may have caused Satoshi to develop Bitcoin in the first place. And finally, it's another link to the UK, given that Satoshi selected a British newspaper to timestamp his first block.
Other Random Thoughts:
Here are some additional thoughts on the Satoshi question which I have included in the hopes that someone else might find them useful.
Since Satoshi stopped working on Bitcoin in 2011, perhaps we should be looking for someone who has made interesting contributions to a different project since then?
Would a better programmer than me be able to spot idiosyncrasies in Satoshi's coding style which could be traced in other places? What if someone trawled Github and looked for these quirks?
Some people have attempted a Stylometric Analysis. I haven't looked into this at all, but it's something I might look into at another point.
Satoshi is the Japanese name of the main character (Ash Ketchum) in Pokemon and also the name of the creator of Pokemon, Satoshi Tajiri.
Are there any other famous Satoshis? Or Famous Nakamotos? I did a quick google, but I couldn't find anyone who stood out to me.
Satoshi was familiar with Mises' regression theorem, which is a pretty niche economic concept from Ludwig von Mises, an economist from the Austrian School. The Austrian School are famously associated with libertarian or right wing anarchist views.
Satoshi seems pretty au fait with libertarian concepts generally.
Prior to Bitcoin's rise, cryptocurrencies were a very niche interest. Perhaps it would be worthwhile to look at who was going to conferences, writing papers, working in the industry, etc. prior to 2007. It should be a relatively small group of people, and you would imagine that Satoshi would have a footprint in there somewhere.
Some interesting quotes from Satoshi:
Yes, but we can win a major battle in the arms race and gain a new territory of
freedom for several years.
Governments are good at cutting off the heads of a centrally controlled
networks like Napster, but pure P2P networks like Gnutella and Tor seem to be
holding their own.
I appreciate your questions. I actually did this kind of backwards. I had to
write all the code before I could convince myself that I could solve every
problem, then I wrote the paper. I think I will be able to release the code
sooner than I could write a detailed spec. You're already right about most of
your assumptions where you filled in the blanks.
It's very attractive to the libertarian viewpoint if we can explain it
properly. I'm better with code than with words though.
I believe I've worked through all those little details over the
last year and a half while coding it, and there were a lot of them.
The functional details are not covered in the paper, but the
sourcecode is coming soon. I sent you the main files.
Banks must be trusted to hold our money and transfer it electronically, but they lend it out in waves of credit bubbles with barely a fraction in reserve.
To draw a few tentative conclusions, we seem to be looking at:
A native English speaker.
Who picked a Japanese pseudonym.
Who favours British English over US English.
Who selected a British newspaper to timestamp his genesis block.
Whose background is primarily coding-based.
Who seems to hold libertarian views and be motivated by libertarian beliefs.
Who has an interest in cryptography and cryptocurrencies which stretches back to at least 2007.
And who appears to be operating either on the US East Coast or in a Western European time zone.
Surely there can't be many people out there who meet all these criteria?
Which areas of actuarial work are the best paid?
Salary estimates for actuarial work are pretty hard to come by and, in my experience, not particularly accurate. In practice there are also large differences in salary depending on experience, practice area, function, and location. So I thought I'd try to put together my own analysis which accounts for all these variables!
As a source of data I decided to work from online Job Adverts, mainly because I can collect them with a minimum of effort, but also because the only other option would be to send out my own survey (and to be honest why would anyone complete a salary survey from a random guy online?).
Step 1. Collect Some Data
I started by web scraping about 800 job adverts from an actuarial job site. These were all the jobs listed on the website at the time. Since this might be 'slightly' (i.e. completely) against the terms of service of the site, I won't say which site the adverts are from or publish any of the raw data. I don't see any issues with publishing aggregated data from the website though.
Once I had scraped the adverts, I extracted the salary figures and discarded adverts that did not include a salary. This left about 350 job adverts to work with.
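As an illustration of the extraction step, here is a small Python sketch that pulls pound figures out of advert text with a regular expression. The pattern, the function names, and the choice to take the midpoint of a quoted range are all assumptions made for this example, and the adverts shown are invented.

```python
import re

# Matches figures like "£60,000", "£75000" or "£70k" (assumed advert formats).
SALARY_PATTERN = re.compile(r"£\s?(\d{1,3}(?:,\d{3})+|\d+)\s?(k\b)?", re.IGNORECASE)

def extract_salaries(advert):
    """Return every pound figure found in the advert as an integer."""
    values = []
    for number, thousands in SALARY_PATTERN.findall(advert):
        value = int(number.replace(",", ""))
        if thousands:  # "70k" means 70,000
            value *= 1000
        values.append(value)
    return values

def salary_midpoint(advert):
    """Average of the figures quoted, or None if the advert quotes no salary."""
    values = extract_salaries(advert)
    return sum(values) / len(values) if values else None

# Invented adverts, not scraped data.
adverts = [
    "Pricing actuary, London, £60,000 - £75,000 plus bonus",
    "Reserving analyst, up to £70k",
    "Pensions consultant, competitive salary",
]
with_salary = [a for a in adverts if salary_midpoint(a) is not None]
print([salary_midpoint(a) for a in with_salary])
```

Adverts that only say something like 'competitive salary' return None and drop out, which is the discarding step described above.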
Step 2. Analysis
My first step was to segment the adverts by practice area. Rather than going through each advert one at a time, I split them by looking at key words in the adverts, for example, adverts containing the terms 'pensions', 'employee benefits', or 'defined contribution' were all assigned to the category 'Pensions'.
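The keyword split is simple enough to sketch in Python. Only the 'Pensions' keywords below come from my actual analysis as described above; the keywords for the other categories are placeholders I've made up for the example.

```python
# Keyword lists per practice area. Only the Pensions terms are from the post;
# the others are hypothetical placeholders.
CATEGORY_KEYWORDS = {
    "Pensions": ["pensions", "employee benefits", "defined contribution"],
    "General Insurance": ["general insurance", "lloyd's"],  # hypothetical
    "Life": ["life insurance", "with-profits"],              # hypothetical
}

def categorise(advert, category_keywords=CATEGORY_KEYWORDS):
    """Return every category whose keywords appear anywhere in the advert text."""
    text = advert.lower()
    return [category
            for category, keywords in category_keywords.items()
            if any(keyword in text for keyword in keywords)]

print(categorise("Defined Contribution consultant, Leeds"))
print(categorise("Capital modelling role in General Insurance"))
print(categorise("Data analyst"))
```

An advert can land in more than one category, or in none, so a real run needs a rule for those cases, such as checking them by hand.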
These figures were then used to produce the following box and whiskers chart:
As expected, General Insurance comes out as the highest paid practice area with Life and Investments in joint second place and Pensions lagging quite a bit behind.
In case you can't remember how a box and whiskers chart works, the black dot is the mean, the three lines in the box represent the 1st quartile, median, and 3rd quartile, and the whiskers go up to the maximum and down to the minimum value.
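If you want the numbers behind a chart like this, the Python standard library will produce them. This is just a sketch with invented salary figures; also note that different tools calculate quartiles in slightly different ways, so the box edges may not match a chart drawn in another package exactly.

```python
import statistics

def box_plot_summary(salaries):
    """Return the figures a box and whiskers chart displays for one group."""
    q1, median, q3 = statistics.quantiles(salaries, n=4)
    return {
        "min": min(salaries),   # bottom of the lower whisker
        "q1": q1,               # bottom of the box
        "median": median,       # line inside the box
        "q3": q3,               # top of the box
        "max": max(salaries),   # top of the upper whisker
        "mean": statistics.mean(salaries),  # the black dot
    }

# Invented salaries in GBP, not the scraped data.
example = [40000, 50000, 60000, 70000, 80000]
summary = box_plot_summary(example)
for name, value in summary.items():
    print(name, value)
```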
Are there any other correlations that are affecting this ranking though? Since a lot of GI jobs are based in London, does London's higher average salary account for the difference?
First let's check that actuarial jobs in London do actually have higher salaries than actuarial jobs outside London.
As expected, London based jobs are clearly better paid on average than non-London based jobs.
In order to see whether location is driving the differences between salaries across practice areas we need to include location as another variable in our chart:
This chart gets a bit more messy, but we can see that even when we account for the effect of location, we still have the same ordering of practice area:
General Insurance > Life > Pensions
Now that we've looked at practice areas, I'm going to show my bias towards insurance and look at the effect of the type of actuarial work within an insurance company on salary.
Surprisingly, all the functions are relatively equal. There seems to be a weak ordering of:
Capital Modelling > Pricing > Reserving
Though the difference does not seem to be significant.
Finally, let's look at the effect of qualifying on salary. As expected, jobs for qualified actuaries have a mean salary £20,000 higher than part/nearly qualified positions.
All the more reason I should be studying right now rather than procrastinating on here!
"I’m not a businessman
I’m a business, man!
Let me handle my business, damn"
Kanye West ft. Jay Z - Diamonds from Sierra Leone
“It's easier to run
Replacing this pain with something numb
It's so much easier to go
Than face all this pain here all alone.”
Linkin Park - Easier to Run
Recently I've really been getting into hip hop. One thing that really struck me is that, contrary to popular perception, hip hop is on the whole surprisingly upbeat, especially when contrasted with the rock and metal which I listened to a lot growing up.
In the spirit of being scientific I thought I would try to quantify this difference. My plan was:
1) Get hold of a sample of lyrics from different artists of sufficient size.
2) Come up with a method for analysing how positive or negative the lyrics are.
3) Run the analysis on the lyrics and collate the results.
Step 1 - Collect data
So the first step was to obtain a sample of lyrics. This was actually harder than I thought it would be. There are plenty of websites which contain song lyrics, however I was looking for a large collection of songs and artists, and I didn't want to have to trawl through hundreds of webpages by hand.
The process of automating the collection of data from websites is called web scraping. I have played around with web scraping before using Python, but I found it to be very fiddly to set up and not very reliable. This was a couple of years ago though, and it turns out that since then there have been a number of new tools which make the whole process easier.
I ended up using a tool called 'Web Scraper', website below:
Web Scraper is an add-in to Chrome with an easy to use graphical interface, that can export extracts directly to .csv files.
In total I managed to collect lyrics from about 4,000 songs across 12 artists. Whilst I did manage to automate the process, it was still quite slow going as many websites have anti-scraping technology that blocks you out if you take too much data too quickly. I might try to expand this sample at a later date, but we can already see some clear trends emerging just from these artists.
Step 2 - Develop a method for analysing the lyrics.
Trying to program a computer to understand the semantic meaning of human-generated text is a big area of research in Computer Science. It is called 'Natural Language Processing', or NLP for short. There are many advanced methods within NLP being developed at the moment, some of which involve machine learning, statistical analysis or other complex methods. For a good introduction, the Wikipedia page gives a helpful overview.
However, I was just looking for a very basic method that would allow for a broad brush analysis. I therefore decided to just use the relative frequency of words with positive or negative connotations within the lyrics as a proxy for how positive or negative the lyrics were as a whole.
The obvious weakness of this approach is that the presence of positive words within a sentence does not necessarily mean that the sentence as a whole is positive. For example, take the sentence 'This is not fun'. An analysis based only on the words it contains would suggest that it is a positive sentence, given that it includes the word 'fun'. The obvious way to counter this would be to start looking at phrases instead, so that 'not fun' would be given a negative connotation. Looking at phrases rather than words adds a large degree of complexity to the analysis though. Given that all the artists should be exposed to roughly the same degree of false positives and false negatives, and that we still get interesting results using this very basic heuristic, I decided to stick with it in this case. I might come back to this at a later date though.
So given we are just interested in looking at words rather than phrases, I was looking for a dictionary of words with an indicator of whether each word is positive or negative. It turns out that people have already worked on this problem, and free dictionaries are available online. I found the website of Prof. Bill McDonald of the University of Notre Dame in Indiana, USA. He has developed word lists for use in automating the analysis of financial statements. All the lists can be found at the following link:
I took Prof. McDonald's list and added a number of additional words that appear frequently in song lyrics but were not included in the word list. (For example, ballin' came up a lot in Hip Hop, but unfortunately is underused in financial statements.)
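To make the word-counting heuristic concrete, here is a minimal Python sketch. The two tiny word sets are stand-ins I've made up for the example (the real lists are far longer), and the exact scoring formula, positive count minus negative count divided by total words, is an assumption about one reasonable way to turn the counts into a single number.

```python
import re

# Tiny hypothetical word lists for illustration only.
POSITIVE = {"fun", "love", "easier", "ballin"}
NEGATIVE = {"pain", "numb", "alone", "damn"}

def sentiment_score(lyrics, positive=POSITIVE, negative=NEGATIVE):
    """Net share of positive words: (positive - negative) / total words."""
    words = re.findall(r"[a-z]+", lyrics.lower())
    if not words:
        return 0.0
    positive_count = sum(word in positive for word in words)
    negative_count = sum(word in negative for word in words)
    return (positive_count - negative_count) / len(words)

print(sentiment_score("This is not fun"))
print(sentiment_score("Replacing this pain with something numb"))
```

The first example shows the weakness discussed above: 'This is not fun' gets a positive score because the method only sees the word 'fun' and ignores the 'not'.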
Step 3 - Run the analysis on the data and collate the results
Here are the results of my analysis:
As we can see, Jay Z and 50 Cent, our representative Hip Hop artists, are clearly above the Metal artists in terms of the relative frequency of positive over negative words in their music. The Metal artists are clustered at the bottom of the table.
One interesting feature of the results, which perhaps should not have been a surprise, was that pop music, i.e. Justin Bieber and the Beatles, were actually the clear winners on these terms.
Grouping the artists together into genres further emphasises the clear positioning of the genres.
So there we have it, Hip Hop is more positive than Rock or Metal. And Justin Bieber is better than Metallica, but the Beatles are the most positive band of all!
I work as a pricing actuary at a reinsurer in London.