“I’m a business, man!
Let me handle my business, damn”
Kanye West ft. Jay Z - Diamonds from Sierra Leone
“It's easier to run
Replacing this pain with something numb
It's so much easier to go
Than face all this pain here all alone.”
Linkin Park - Easier to Run
Recently I've really been getting into hip hop. One thing that struck me is that, contrary to popular perception, hip hop is on the whole surprisingly upbeat, especially when contrasted with the rock and metal that I listened to a lot growing up.
In the spirit of being scientific I thought I would try to quantify this difference. My plan was:
1) Get hold of a sample of lyrics from different artists of sufficient size.
2) Come up with a method for analysing how positive or negative the lyrics are.
3) Run the analysis on the lyrics and collate the results.
Step 1 - Collect data
So the first step was to obtain a sample of lyrics. This was actually harder than I thought it would be. There are plenty of websites which contain song lyrics, but I was looking for a large collection of songs and artists, and I didn't want to have to trawl through hundreds of webpages by hand.
The process of automating the collection of data from websites is called web scraping. I have played around with web scraping before using Python, but I found it very fiddly to set up and not very reliable. This was a couple of years ago though, and it turns out that since then a number of new tools have appeared which make the whole process easier.
I ended up using a tool called 'Web Scraper', website below:
Web Scraper is an extension for Chrome with an easy-to-use graphical interface that can export extracts directly to .csv files.
In total I managed to collect lyrics from about 4,000 songs across 12 artists. Whilst I did manage to automate the process, it was still quite slow going as many websites have anti-scraping technology that blocks you out if you take too much data too quickly. I might try to expand this sample at a later date, but we can already see some clear trends emerging just from these artists.
Step 2 - Develop a method for analysing the lyrics
Trying to program a computer to understand the semantic meaning of human-generated text is a big area of research in Computer Science. It is called 'Natural Language Processing', or NLP for short. Many advanced methods are being developed within NLP at the moment, some of which involve Machine Learning, statistical analysis or other complex techniques. For a good introduction, the Wikipedia page gives a helpful overview.
However, I was just looking for a very basic method that would allow for a broad brush analysis. I therefore decided to just use the relative frequency of words with positive or negative connotations within the lyrics as a proxy for how positive or negative the lyrics were as a whole.
The obvious weakness of this approach is that the presence of positive words within a sentence does not necessarily imply that the sentence as a whole is positive. For example, take the sentence 'This is not fun'. An analysis based only on the individual words would suggest that it is a positive sentence, because it contains the word 'fun'. The obvious way to counter this would be to start looking at phrases instead, so that 'not fun' would be given a negative connotation. Looking at phrases rather than words adds a large degree of complexity to the analysis though. Given that all the artists should be exposed to roughly the same degree of false positives and false negatives, and given that we still get interesting results using this very basic heuristic, I decided to stick with it in this case. I might come back to this at a later date though.
So given we are just interested in looking at words rather than phrases, I was looking for a dictionary of words with an indicator of whether each word is positive or negative. It turns out that people have already worked on this problem, and free dictionaries are available online. I found the website of Prof. Bill McDonald of the University of Notre Dame in Indiana, USA. He has developed word lists for use in automating the analysis of financial statements. All the lists can be found in the following link:
I took Prof. McDonald's list and added a number of additional words that appear frequently in song lyrics but were not included in the word list. (For example, ballin' came up a lot in Hip Hop, but unfortunately is underused in financial statements.)
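To make the 'not fun' problem concrete, here is a minimal Python sketch of both ways of counting. It is not the code used for the results in this post. The tiny word lists, the list of negating words and the function names are all made up for illustration, and the phrase handling only looks one word back.

```python
import re

# Tiny illustrative word lists (assumptions, not the lists used for the results).
POSITIVE = {"fun", "good", "love"}
NEGATIVE = {"pain", "numb", "alone"}
NEGATORS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def count_words(text):
    """Word-level count: every listed word counts, whatever surrounds it."""
    words = tokenize(text)
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos, neg


def count_with_negation(text):
    """Phrase-aware count: a listed word directly after a negator is flipped."""
    pos = neg = 0
    previous = ""
    for word in tokenize(text):
        negated = previous in NEGATORS
        if word in POSITIVE:
            # 'not fun' is counted as negative rather than positive
            if negated:
                neg += 1
            else:
                pos += 1
        elif word in NEGATIVE:
            # 'no pain' is counted as positive rather than negative
            if negated:
                pos += 1
            else:
                neg += 1
        previous = word
    return pos, neg


print(count_words("This is not fun"))          # (positive, negative) by words
print(count_with_negation("This is not fun"))  # (positive, negative) by phrases
```

The word-level count treats 'This is not fun' as one positive word, while the phrase-aware count treats it as one negative. Even this small change misses cases like 'not much fun', which gives a feel for how quickly the complexity grows.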
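As a rough sketch of how an extended word list could feed into a score, the Python below merges a base list with some extra words and then scores a lyric. Everything here is an assumption for illustration: the function names, the short word lists, treating ballin' as a positive word, and the exact formula (positive count minus negative count, divided by the total number of words).

```python
import re
from collections import Counter


def build_word_list(base_words, extra_words=()):
    """Merge a base word list with extra lyric-specific words, lowercased."""
    return {w.lower() for w in base_words} | {w.lower() for w in extra_words}


def net_positivity(lyrics, positive, negative):
    """Return (positive count - negative count) / total words, or 0.0 if empty."""
    words = re.findall(r"[a-z']+", lyrics.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    pos = sum(counts[w] for w in positive)
    neg = sum(counts[w] for w in negative)
    return (pos - neg) / len(words)


# Hypothetical stand-ins for the real lists, which are far longer.
positive = build_word_list(["good", "best"], extra_words=["ballin'"])
negative = build_word_list(["pain", "alone"])

# Made-up line, not a real lyric: 2 positive words, 1 negative, 5 words in total.
print(net_positivity("Ballin' feels good, no pain", positive, negative))
```

Dividing by the total word count means a long song is not scored as more positive simply because it has more words.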
Step 3 - Run the analysis on the data and collate the results
Here are the results of my analysis:
As we can see, Jay Z and 50 Cent, our representative Hip Hop artists, are clearly above the Metal artists in terms of the relative frequency of positive over negative words in their music. The Metal artists are clustered at the bottom of the table.
One interesting feature of the results, which perhaps should not have been a surprise, was that the pop acts, i.e. Justin Bieber and the Beatles, were actually the clear winners on these terms.
Grouping the artists together into genres further emphasises the clear positioning of the genres.
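One simple way to do this kind of grouping is to average the artist scores within each genre. The Python sketch below assumes exactly that. The function name, the placeholder artist names and the scores are all invented for illustration and are not the figures from my analysis.

```python
from collections import defaultdict
from statistics import mean


def average_by_genre(artist_scores, artist_genres):
    """Average the per-artist scores within each genre."""
    grouped = defaultdict(list)
    for artist, score in artist_scores.items():
        grouped[artist_genres[artist]].append(score)
    return {genre: mean(scores) for genre, scores in grouped.items()}


# Placeholder artists and scores, made up purely to show the mechanics.
artist_scores = {"Artist A": 0.5, "Artist B": 0.25, "Artist C": -0.5}
artist_genres = {"Artist A": "Pop", "Artist B": "Pop", "Artist C": "Metal"}

for genre, score in average_by_genre(artist_scores, artist_genres).items():
    print(genre, score)
```

A plain average gives every artist equal weight. Weighting by the number of songs per artist would be another reasonable choice if the sample sizes differ a lot.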
So there we have it: Hip Hop is more positive than Rock or Metal, and Justin Bieber is better than Metallica, but the Beatles are the most positive band of all!