I travel 4,000 miles across the Atlantic to get away from the humdrum shitshow that is recent British politics - and whaddayaknow: our first data science assignment is on the 2017 General Election.
This is the first in a series of posts exploring my coding assignments in my Computer Science with Applications year-long sequence, where I am learning most of my coding. The cool thing is that the assignments are almost always of the form: “go and analyze some data we have got from the real world”. The bad thing is that the assignments are every week - so you don’t actually get much time to reflect on your results. The other bad thing is that UChicago requests that we do not post the code we use online, because they want to recycle the assignments in future years (they are computer scientists - recycling is their MO). Therefore, in these posts, I will focus on pulling out cool results from the code that I built. You will just have to imagine what is going on under the hood. Believe me - it’s so smart.. so smart.. REALLY smart.
The assignment instructions are posted on our public course page here, so feel free to take a behind-the-scenes look at what a computer science course is like. You’ll notice that the infamous focus on design (Google: “focus on the user and all else will follow”) isn’t really there when they are communicating with their own…
But even he has an account now! As of course do each of the major parties. Our dataset consisted of every official party account’s tweets (and retweets) during the period of purdah, from April 22 to election day, June 8. Here’s how much the social media elves were twittering away in their respective camps:
Considering purdah lasted 49 days, the SNP were averaging a half-century of tweets every day. I’m sure it was great stuff. Labour are practically the Terrence Malick of the tweeting world - poking their head out only once an hour on average.
We built a function called 'find_top_k_entities' to compute a list of the k most frequently occurring entities in a set of tweets. Below is a call to the function with the following parameters: dataset = Conservatives, entities = hashtags, k = 5
As you can see from the output, the Tories’ fave hashtag was #bbcqt (#BBCqt etc. included) - referring to the British Thursday-night institution Question Time - which was used in 231/1761 tweets. I personally enjoy #chaos, a simple but effective shade-throw.
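Since I can’t post the assignment code itself, here is a minimal sketch of what a function like find_top_k_entities might look like - the function name is from the assignment, but the body is purely my own illustration, assuming tweets stored in the Twitter API’s dictionary format and counting hashtags case-insensitively:

```python
from collections import Counter

def find_top_k_entities(tweets, entity_key, k):
    """Return the k most common entities (e.g. hashtags) across a list
    of tweet dictionaries, counting case-insensitively."""
    counts = Counter()
    for tweet in tweets:
        for entity in tweet["entities"][entity_key]:
            # Hashtag entities store their text under the "text" key
            counts[entity["text"].lower()] += 1
    return counts.most_common(k)

# Two toy tweets standing in for the real dataset
tweets = [
    {"entities": {"hashtags": [{"text": "bbcqt"}, {"text": "chaos"}]}},
    {"entities": {"hashtags": [{"text": "BBCqt"}]}},
]
print(find_top_k_entities(tweets, "hashtags", 2))
# → [('bbcqt', 2), ('chaos', 1)]
```

Folding case before counting is what lets #bbcqt and #BBCqt land in the same bucket.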
What about the other parties? Let’s play Guess Who!
[gallery ids="310,309,308"]
I suspect the hashtag ‘votelibdem’ was used more times than the action was done. The other parties really don’t bother with Question Time!
Let’s take a look at the top 3-word phrases from each party. In addition, we created a function to divide the tweets up by month, so we can see dynamics over time! Again, let’s play Guess Who!
[gallery ids="311,312,313,314" type="slideshow"]
‘Strong stable leadership’ went from Top of the Tory Twitter Table in April and May to nowhere! Here is a wonderful example of the data singing out a real-life story: the Tories abandoned that slogan on May 30th! Notice how the Tories use slogans much more frequently than Labour. This links to a long history of a well-oiled Tory PR machine. I enjoy ‘up scottish fishing’.
If some of the phrases don't seem natural, it is because we excluded the following list of 'particles': ["a", "an", "the", "this", "that", "of", "for", "or", "and", "on", "to", "be", "if", "we", "you", "in", "is", "at", "it", "rt", "mt", "with"]
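Here is a rough sketch - my own illustration, not the assignment code - of how counting n-word phrases after stripping those particles might work (a real version would also strip punctuation and @-mentions):

```python
from collections import Counter

# The "particles" excluded before forming phrases (from the assignment)
STOP_WORDS = {"a", "an", "the", "this", "that", "of", "for", "or", "and",
              "on", "to", "be", "if", "we", "you", "in", "is", "at", "it",
              "rt", "mt", "with"}

def top_k_ngrams(texts, n, k):
    """Return the k most common n-word phrases, ignoring stop words."""
    counts = Counter()
    for text in texts:
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts.most_common(k)

texts = ["strong and stable leadership in the national interest",
         "strong and stable leadership for britain"]
print(top_k_ngrams(texts, 3, 1))
# → [(('strong', 'stable', 'leadership'), 2)]
```

Dropping the particles first is exactly why phrases like ‘strong stable leadership’ read a little unnaturally - the ‘and’ has been filtered out before the trigrams are formed.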
What does a single tweet look like in our dataset? Take this tweet for example, the Conservatives’ second-to-last tweet in our dataset, made at 2145 BST (ignore the 2:45PM - US time!), 15 minutes before the polls closed.
https://twitter.com/Conservatives/status/872917427381825538
In our dataset, it is represented as a Python dictionary, with 24 keys. Here are the first three keys.
Notice how ‘created_at’ is 2045 GMT.
Some of the keys have values which are embedded dictionaries. For example, take a look at the entities, including hashtags:
Unsurprisingly there is one hashtag, #VoteConservative, which takes up indices 24-41 of the 140 allotted characters (now 280 these days, apparently!).
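Walking that nested structure in Python is straightforward. Here is a trimmed-down, illustrative version of the tweet dictionary - the keys are real Twitter API fields, but the values are abridged stand-ins rather than the exact record:

```python
# Trimmed-down tweet dictionary (real Twitter API keys, abridged values)
tweet = {
    "created_at": "Thu Jun 08 20:45:00 +0000 2017",  # 2045 GMT, illustrative seconds
    "entities": {
        "hashtags": [{"text": "VoteConservative", "indices": [24, 41]}],
    },
}

hashtag = tweet["entities"]["hashtags"][0]
print(hashtag["text"])  # → VoteConservative
start, end = hashtag["indices"]
print(end - start)      # → 17: the length of "#VoteConservative", '#' included
```

Note that the indices count the leading ‘#’, which is why the span is 17 characters even though the stored "text" value drops it.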
In our code, we had two types of functions: (1) functions that iteratively extract information from these tweet dictionaries, and (2) general-purpose algorithms for finding the most frequently occurring items. Putting these together, we are able to plug in our data and spit out the top phrases, hashtags etc.