KSRI Services Summer School - Social Computing Theory and Hackathon

September 24, 2013

I was invited by Simon Caton to come to the KSRI Services Summer School, held at KIT in Germany, to help him run a workshop session on Social Computing. We decided to use the session as a crash course in retrieving and manipulating data from Social Media APIs - showing the students the basics, then running a mini ‘hackathon’ so they could gain some practical experience.

I think the session went really well; the students seemed to enjoy it and the feedback was very positive. We spent about 90 minutes talking about APIs, JSON, Twitter, Facebook and Foursquare, then set the students off forming teams and brainstorming ideas. Very quickly they were set up grabbing Twitter data from the streaming API and coming up with ways of analysing it for interesting facts and statistics. A number of the students were not coders and had never done anything like this before, so it was great to see them diving in, setting up servers and running PHP scripts to grab the data. It was also good to see the level of teamwork on display; everyone was communicating, dividing the work, and getting on well. Fuelled by a combination of pizza, beer, Red Bull and Haribo they coded into the night, until we drew things to a close at about 10pm and retired to the nearest bar for a pint of debrief.

Hackathon Students

It was a really good experience, and I think everyone got something useful out of it. I’m looking forward to the presentations later on today to see what everyone came up with.

Our slides from the talk are available on slideshare. As usual they’re information-light and picture-heavy, so their usefulness is probably limited!

Post-Processed Dinosaurs

August 28, 2013

Finding myself with a free afternoon this week, I strolled down to the local Odeon to see Jurassic Park: IMAX 3D. (It should be noted that the ‘IMAX’ bit doesn’t mean much - the screen at the Odeon is nowhere near as big as a true IMAX screen). I should say, I love this film a lot - hence my willingness to pay £12 (£12!!??!) to see it again on the big screen. I first saw it in the Shrewsbury Empire cinema when I was 10 years old, on one of my first (and possibly only) trips to the cinema with my Dad, and instantly loved it. This is not entirely surprising considering I was essentially the target audience at the time. Following that I wore through a pirate VHS copy obtained from a friend, then an actual legitimate VHS copy, followed by the inevitable, much hardier DVD purchase. When we finally embraced streaming media a couple of years ago and sold off all the DVDs it was one of only a few that I was desperate to keep. I like the movie so much that I can even forgive Jurassic Park 2 and 3.

It’s sad for me, then, to see the movie in this format now. From the first minute, it’s clear that the 3D conversion is very poor quality. It’s basically like watching a moving pop-up book, as flat characters and objects make their way across the screen at varying depths. At some points individual characters have been picked out of the background so poorly that it actually looks like they’ve been filmed with early green-screen effects, totally divorced from the background. It just doesn’t add anything to the movie, and is often actually distracting. It wastes the already impressive visuals of the movie, and it’s easy to see the conversion for what it is: a cheap gimmick to try and cash in on a successful property. The problem is that it’s totally unnecessary - all that’s needed to get a bunch of new film-goers interested in Jurassic Park (and to become the ready-made audience for the next ‘new’ JP movie) is to release the film again. I’m sure it would have done just as well as a 2D re-release, so this poor 3D affair is a waste of effort.

Of course the film itself is still amazing, and the sound quality (whether due to this new version or because of the IMAX standard speaker system) absolutely blew me away. I heard lines of dialogue that were previously just characters muttering under their breath, and the roar of the dinosaurs combined with that John Williams theme made me forgive the awful, awful 3D conversion and fall in love with the movie all over again.

Also the raptors are still bloody terrifying.

not another bloody wordle?!?!

August 20, 2013

(UPDATE: an earlier version of this was totally wrong. It’s better now.)

Inspired by a Facebook post from a colleague, I decided to waste ten minutes this week knocking together a word cloud from the text of my thesis. The process was pretty straightforward.

First up - extracting the text from the thesis. Like all good scienticians I wrote my thesis in LaTeX. I had assumed I could use one of a couple of different tools to extract the plain text from the raw .tex input files, but none of the tools a quick googling turned up seemed to work properly, so I went with extracting the text from the pdf file instead. Fortunately on Mac OS X this is pretty simple, as you can create a straightforward Automator application to extract the text from any pdf file, as documented in step 2 here.

Once I had the plain text contents of my thesis in a text file it was just a simple few lines of python (using the excellent NLTK) to get a frequency distribution of the words in my thesis:

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize, sent_tokenize

fdist = FreqDist()
with open("2012chorleymjphd.txt", "r") as inputfile:
    for sentence in sent_tokenize(inputfile.read()):
        for word in word_tokenize(sentence):
            fdist.inc(word)  # count each token (NLTK 2.x API)

for word, count in fdist.iteritems():
    if count > 10:
        print "%s: %d" % (word, count)

Then it was just a matter of copying and pasting the word frequency distribution into wordle:

And there we have it. A not particularly informative but quite nice looking representation of my thesis. As you can guess from the cloud, it’s not the most exciting thesis in the world. Interestingly, the word error doesn’t seem to be there 😉.
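As an aside, the copy-and-paste step could probably be skipped by writing the distribution straight out in the ‘word:weight’ format that Wordle’s advanced input page accepts. A quick sketch of that (note this one uses Python 3 syntax, unlike the snippets above, and the filename and toy frequencies are made up):

```python
# Sketch: dump a frequency distribution as "word:weight" lines,
# which Wordle's advanced input accepts (Python 3 syntax).
def wordle_lines(freqs, min_count=10):
    """Return 'word:count' lines, most frequent first,
    keeping only words seen more than min_count times."""
    return ["%s:%d" % (word, count)
            for word, count in sorted(freqs.items(), key=lambda kv: -kv[1])
            if count > min_count]

# Toy frequencies purely for illustration:
freqs = {"network": 120, "user": 95, "the": 400, "error": 3}

with open("wordle.txt", "w") as out:
    out.write("\n".join(wordle_lines(freqs)))
```

The threshold of 10 matches the cut-off used in the snippet above.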

SWN Festival 2013 plans – part 1: the data (2!)

August 18, 2013

In the previous post, I used python and BeautifulSoup to grab the list of artists appearing at SWN Festival 2013, and to scrape their associated soundcloud/twitter/facebook/youtube links (where available).

However, there are more places to find music online than just those listed on the festival site, and some of those extra sources include additional data that I want to collect, so now we need to search these other sources for the artists. Firstly, we need to load the artist data we previously extracted from the festival website, and iterate through the list of artists one by one:

artists = {}
with open("bands.json") as infile:
    artists = json.load(infile)

for artist, artist_data in artists.iteritems():

The first thing I want to do for each artist is to search Spotify to see if they have any music available there. Spotify has a simple web API for searching, which is pretty straightforward to use:

    params = {
        "q": "artist:" + artist.encode("utf-8")
    }

    spotify_root_url = "http://ws.spotify.com/search/1/artist.json"
    spotify_url = "%s?%s" % (spotify_root_url, urllib.urlencode(params))

    data = retrieve_json_data(spotify_url)

    if data.get("artists", None) is not None:
        if len(data["artists"]) > 0:
            # take the ID from the end of the 'spotify:artist:<id>' URI
            artist_id = data["artists"][0]["href"].split(":")[-1]
            artist_data["spotify_id"] = data["artists"][0]["href"]
            artist_data["spotify_url"] = "http://open.spotify.com/artist/" + artist_id
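A subtlety worth flagging in extracting the ID from the ‘spotify:artist:&lt;id&gt;’ URI: Python’s str.lstrip() removes any leading characters drawn from the given set, not a literal prefix, so using it here can eat the start of the ID itself. Splitting on ‘:’ avoids the problem. An illustration (Python 3 print syntax; the ID is made up):

```python
# lstrip() strips *characters*, not a prefix: every leading character
# that appears anywhere in "spotify:artist:" gets removed.
uri = "spotify:artist:strokes123"      # made-up ID for illustration
print(uri.lstrip("spotify:artist:"))   # 's', 't', 'r', 'o' get eaten too
print(uri.split(":")[-1])              # the full ID, safely
```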

The ‘retrieve_json_data’ function is just a wrapper to call a URL and parse the returned JSON data:

def retrieve_json_data(url):

    try:
        response = urllib2.urlopen(url)
    except urllib2.HTTPError, e:
        raise e
    except urllib2.URLError, e:
        raise e

    raw_data = response.read()
    data = json.loads(raw_data)

    return data

Once I’ve searched Spotify, I then want to see if the artist has a page on Last.FM. If they do, I also want to extract and store their top-tags from the site. Again, the Last.FM API makes this straightforward. Firstly, searching for the artist page:

    params = {
        "artist": artist.encode("utf-8"),
        "api_key": last_fm_api_key,
        "method": "artist.getinfo",
        "format": "json"
    }
    last_fm_url = "http://ws.audioscrobbler.com/2.0/?" + urllib.urlencode(params)

    data = retrieve_json_data(last_fm_url)

    if data.get("artist", None) is not None:
        if data["artist"].get("url", None) is not None:
            artist_data["last_fm_url"] = data["artist"]["url"]

Then, searching for the artist’s top tags:

    params = {
        "artist": artist.encode("utf-8"),
        "api_key": last_fm_api_key,
        "method": "artist.gettoptags",
        "format": "json"
    }
    last_fm_url = "http://ws.audioscrobbler.com/2.0/?" + urllib.urlencode(params)

    data = retrieve_json_data(last_fm_url)

    if data.get("toptags", None) is not None:

        artist_data["tags"] = {}

        if data["toptags"].get("tag", None) is not None:
            tags = data["toptags"]["tag"]
            if type(tags) == type([]):
                for tag in tags:
                    name = tag["name"].encode('utf-8')
                    count = 1 if int(tag["count"]) == 0 else int(tag["count"])
                    artist_data["tags"][name] = count
            else:
                # a single tag comes back as a dict rather than a list
                name = tags["name"].encode('utf-8')
                count = 1 if int(tags["count"]) == 0 else int(tags["count"])
                artist_data["tags"][name] = count

Again, once we’ve retrieved all the extra artist data, we can dump it to file:

with open("bands.json", "w") as outfile:
    json.dump(artists, outfile)

So, I now have 2 scripts that I can run regularly to capture any updates to the festival website (including lineup additions) and to search for artist data on Spotify and Last.FM. Now I’ve got all this data captured and stored, it’s time to start doing something interesting with it…

EPSRC Doctoral Award Fellowship

August 16, 2013

I’m really very pleased to be able to say that I have been awarded a 2013 EPSRC doctoral award fellowship. This means I’ve been given an opportunity to spend 12 months from October this year working independently on a research project of my own choosing. I’ll be looking at the connection between places and personality, analysing the large dataset collected through the Foursquare Personality app to try and build towards a recommendation system for places that uses personality as one of its key input signals.

I think this is a really interesting research project, and I’m hoping for some good results. The basic question I’m asking is: if we know where someone has been (i.e. from their Foursquare history) then can we predict what their personality is? If we can do that, then maybe we can do the reverse, and from someone’s personality, infer where they might like to go. This could lead to a shift in the way that place recommendation systems are built, utilising not just the knowledge of where someone has been, but also why someone has been there.

This is a great opportunity - while it has been really good to work on the last two EU projects I’ve been involved in, the overheads (especially the deliverables) have sometimes been a distraction, getting in the way of the research. With this project I’ll be able to plough on with the research without worrying about those kinds of administrative overheads. It’s also a great stepping stone on my academic career path, and should give me the opportunity to generate some high-quality outputs that will help with moving on to the next stage.

SWN Festival 2013 plans - part 1: the data

August 14, 2013

As I mentioned, I’m planning on doing a bit more development work this year connected to the SWN Festival. The first stage is to get hold of the data associated with the festival in an accessible and machine readable form so it can be used in other apps.

Unfortunately (but unsurprisingly), being a smallish local festival, there is no API for any of the data. So, getting a list of the bands and their info means we need to resort to web scraping. Fortunately, with a couple of lines of python and the BeautifulSoup library, getting the list of artists playing the festival is pretty straightforward:

import urllib2
import json

from bs4 import BeautifulSoup

root_page = "http://swnfest.com/"
lineup_page = root_page + "lineup/"

try:
    response = urllib2.urlopen(lineup_page)
except urllib2.HTTPError, e:
    raise e
except urllib2.URLError, e:
    raise e

raw_data = response.read()

soup = BeautifulSoup(raw_data)

links = soup.select(".artist-listing h5 a")

artists = {}

for link in links:
    url = link.attrs["href"]
    artist = link.contents[0]

    artists[artist] = {}
    artists[artist]["swn_url"] = url

All we’re doing here is loading the lineup page of the main festival website, using BeautifulSoup to find all the links to individual artist pages (which sit in a div with a class of “artist-listing”, each inside an h5 tag), then parsing those links to extract each artist’s name and the URL of their page on the festival website.

Each artist page on the website includes handy links to soundcloud, twitter, youtube etc (where these exist), and since I’m going to want to include these kinds of things in the apps I’m working on, I’ll grab those too:

for artist, data in artists.iteritems():
    try:
        response = urllib2.urlopen(data["swn_url"])
    except urllib2.HTTPError, e:
        raise e
    except urllib2.URLError, e:
        raise e

    raw_data = response.read()

    soup = BeautifulSoup(raw_data)

    links = soup.select(".outlinks li")

    for link in links:
        source_name = link.attrs["class"][0]
        source_url = link.findChild("a").attrs["href"]
        data[source_name] = source_url

This code iterates through the list of artists we just extracted from the lineup page, retrieves the relevant artist page, and parses it for the outgoing links, stored in list items in an unordered list with a class of ‘outlinks’. Fortunately each link in this list has a class describing what type of link it is (facebook/twitter/soundcloud etc) so we can use the class as a key in our dictionary, with the link itself as an item. Later on once schedule information is included in the artist page we can add some code to parse stage-times and venues, but at the moment that data isn’t present on the pages, so we can’t extract it yet.
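The class-as-key trick can be seen in miniature with a self-contained snippet (Python 3 and BeautifulSoup 4 here; the HTML is a made-up stand-in for one of the festival’s artist pages):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the 'outlinks' list on an artist page.
html = """
<ul class="outlinks">
  <li class="twitter"><a href="http://twitter.com/someband">t</a></li>
  <li class="soundcloud"><a href="http://soundcloud.com/someband">s</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

data = {}
for link in soup.select(".outlinks li"):
    source_name = link.attrs["class"][0]            # e.g. 'twitter'
    source_url = link.findChild("a").attrs["href"]  # the outgoing link
    data[source_name] = source_url

print(data)
```

Because each li’s class names the service, the resulting dictionary maps ‘twitter’, ‘soundcloud’ and so on straight to their URLs with no extra lookup table.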

Finally we can just dump our artist data to json, and we have the information we need in an easily accessible format:

with open("bands.json", "w") as outfile:
    json.dump(artists, outfile)

Now we have the basic data for each artist, we can go on to search for more information on other music sites. The nice thing about this script is that when the lineup gets updated, we can just re-run the code and capture all the new artists that have been added. I should also mention that all the code I’m using for this is available on github.
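One wrinkle with re-running the scraper: dumping straight to bands.json could overwrite any extra fields (Spotify IDs, Last.FM tags) added by the later enrichment script. A sketch of a merge step that avoids this (Python 3 syntax; the function name is mine, not from the scripts above):

```python
import json
import os

def merge_artists(scraped, path="bands.json"):
    """Fold freshly-scraped artists into the saved file, keeping any
    extra fields (e.g. Spotify/Last.FM data) already stored there."""
    existing = {}
    if os.path.exists(path):
        with open(path) as infile:
            existing = json.load(infile)
    # update() only touches the scraped keys, so enrichment survives
    for name, fields in scraped.items():
        existing.setdefault(name, {}).update(fields)
    with open(path, "w") as outfile:
        json.dump(existing, outfile)
    return existing
```

New lineup additions get fresh entries, existing artists keep whatever data the other scripts have already attached.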

SWN Festival 2013 - plans

August 11, 2013

Last year I had a go at creating a couple of web apps based around the bands playing the SWN Festival here in Cardiff. I love SWN with all my heart, it’s a permanent fixture in my calendar and even if (when) I leave Cardiff it’ll be the one thing I come back for every year. It’s a great way to see and discover new bands, but sometimes the sheer volume of music on offer can be overwhelming. So I wanted to see if I could create some web apps that would help to navigate your way through all the bands, and find the ones that you should go and see.

The first was a simple app that gathered artist tags from Last.FM, allowing you to see which artists playing the festival had similar tags - so if you knew you liked one artist you could find other artists tagged with the same terms. The second (which technically wasn’t ever really finished) would allow you to login with a last.fm account and find the artists whose tags best matched the tags for your top artists in your last.fm profile.

I liked both these apps and found them both useful - but I don’t think they went far enough. I only started development late in the year, about a month before the festival, so didn’t have a lot of time to really get into it. This year I’m starting a lot earlier, so I’ve got time to do a lot more.

Firstly I’d like to repeat the apps from last year, but perhaps combining them in some way. I’d like to include more links to the actual music, making it easy to get from an artist to their songs by including embeds from soundcloud, spotify, youtube etc. I’d also like to try making a mobile app guide to the festival (probably as an android app as the official app is iOS only). I’m hopeful that given enough free time I should be able to get some genuinely useful stuff done, and I’ll be blogging about it here as I work on it.

Summer Project update

July 25, 2013

We are storming along with summer projects now, and starting to see some really good results.

Liam Turner (who is starting a PhD in the school in October) has been working hard to create a mobile version of the 4SQPersonality app. His work is coming along really well, with a great mobile HTML version now up and running, a native android wrapper working, and an iOS wrapper on its way. With any luck we’ll have mobile apps for both major platforms ready to be released before the summer is over.

Max Chandler, who is now a second year undergraduate, has done some great work looking at the Foursquare venues within various cities around the UK, analysing them for similarity and spatial distribution. He’s just over halfway through the project now and is beginning to work on visualising the data he’s collected and analysed. He’s creating some interesting interactive visualisations using D3, so as soon as he’s done I’ll link to the website here.

It’s been a really good summer for student projects so far, with some very pleasing results. I’ll post more detail on the projects and share some of the results as they come to a close over the coming weeks.

Open Sauce Hackathon - Post Mortem

April 22, 2013

This weekend saw the second ‘Open Sauce Hackathon’ run by undergraduate students here in the school. Last year’s was pretty successful, and they improved upon it this year, pulling in many more sponsors and offering more prizes.

Unlike last year, when I turned up having already decided with Jon Quinn what we were doing, I went along this year with no real ideas. I had a desire to do something with a map, as I’m pretty sure building stuff connected to maps is going to play a big part in my work over the next couple of months. Other than that though, I was at a bit of a loss. After playing around with some ideas and APIs I finally came up with my app: dionysus.

It’s a mobile friendly mapping app that shows you two important things: Where the pubs are (using venue data from Foursquare) and where the gigs are at (using event data from last.fm). If you sign in to either last.fm or Foursquare it will also pull in recommended bars and recommended gigs and highlight these for you.

The mapping is done using leaflet.js, which I found to be nicer and easier to use than Google Maps. The map tiles are based on OpenStreetMap data and come from CloudMade, while the (devastatingly beautiful) icons were rushed together by me over the weekend. The entire app is just client side Javascript and HTML, with HTML5 persistent localStorage used to maintain login authentication between sessions. It’s a simple app, but I’m pretty pleased with it. In the end I even won a prize for it (£50), so it can’t be too bad.

The app is hosted here, and the source code is available here. Obviously though the code is not very pretty and quite hacky, but it does the job!

Social Media Lecture Experiment

April 22, 2013

I’ve very recently had the opportunity to be involved in a different kind of lecture here at the School of Computer Science and Informatics. For his final year project one of our third year students (Samuel Boyes) is assessing the use of modern technology, social networking and video conferencing as part of the traditional lecture. As I had already been asked a few weeks ago to give a guest lecture in Matt Williams’ module Fundamentals of Computing with Java, on the topic of code maintainability/readability, we were presented with a nice opportunity to test out Sam’s theories in an actual lecture.

So, rather than giving a dull, fifty-minutes-of-me-waffling-on-about-things-at-the-front-of-a-lecture-theatre lecture, we changed things around. Instead, I spoke for twenty minutes on the general theory aspects of the topic, and following that we were joined in the lecture by an external speaker, Carey Hiles of Box UK, who joined via a Google+ Hangout to deliver some material about his experiences of the topic in the real world. The students were encouraged to get involved during the lecture, tweeting with a particular hashtag and leaving messages in a dedicated facebook group. We could then wrap the session up by going over the questions posted by the students, putting them to Carey through the video conferencing.

This was very interesting, as it allowed us to include the experiences and knowledge of someone out in the real world within the lecture, adding some extra value to the course and introducing students to ideas that are used in practice. It’s the kind of thing that we should be doing more of, and that I’d love to include in more lectures in the future. There are people out there with relevant real world experience and a desire to talk to students and improve the quality of education. Given the availability of tools that allow people to get involved in lectures remotely, it seems like a no-brainer that this kind of thing could and should happen more often.

It’s also directly relevant as there seems to be a strong desire to increase the added value of attending lectures and to improve the quality of teaching. The Higher Education sector in the UK is undergoing something of a transformation at the moment. Massive unjustified increases in tuition fees have fundamentally altered the relationship between universities and students, increasing the feeling that students are customers of the institution. As such, students are now (rightly) demanding much better customer service and value for money from their institution (caution: PDF link!). (We’ll leave aside for now how universities are supposed to increase value for money and customer service without actually having any more money, as that’s another argument.)

Seen in a global context, this customer-institution relationship becomes even more important. Massive open online course (MOOC) providers such as Coursera, EdX, Udacity and the UK based FutureLearn (in which my employer, Cardiff University, is involved) are providing large numbers of people with free educational material and courses. It’s still not clear how universities will continue to make money from teaching while also giving away all their material for free. Some MOOC providers will charge other institutions licensing fees to use materials, while it seems fairly plausible that instead of charging to access material, universities will start to charge for providing credentials and certification once the courses have been completed.

If that really is a large part of the future of higher education, will the traditional lecture continue to exist? Will universities need lecture theatres full of hundreds of students, when they can put their material online and deliver teaching to thousands of students without the physical presence? Will the traditional university institution continue to exist? It’s entirely possible that many institutions could stop teaching and focus purely on research. It’s more than possible that some institutions will close entirely in the face of competition from ‘free’ teaching offered by higher ranked institutions. Is consolidation of teaching delivery between fewer, larger institutions a good thing?

Until we discover the answers to these questions over the next few years (decades?), there’s still a need to ensure that students feel they are getting value for money. We need to ensure that the institution continues to provide a reason for students to pay to attend lectures where material is delivered that could instead be had online for free. Value added is more important than ever and maybe sessions like the one we had last week could play a part in that.