Extended Mind Crowdsourcing

Monday, Dec 15, 2014

Update 13/01/15: the paper containing the research described below is currently available from the HICSS website

This post is one I’m cross-posting both here and on the MobiSoc blog. Here, because it’s my personal translation of one of our latest research papers, and there because it’s a very good paper mostly written and driven by Roger Whitaker, so deserves an ‘official’ blog post!

Crowdsourcing is widely used in both business and academia. Business likes it because it allows simple tasks to be outsourced for a small cost. Researchers like it because it allows the gathering of large amounts of data from participants, again for minimal cost. (For an example of this, see our TweetCues work (paper here), where we paid Twitter users to take a simple survey and massively increased our sample size for a few dollars). As technology develops, we can apply crowdsourcing to new problems, particularly those concerned with collective human behaviour and culture.

Crowdsourcing

The traditional definition of crowdsourcing involves several things:

  1. a clearly defined crowd

  2. a task with a clear goal

  3. clear recompense received by the crowd

  4. an identified owner of the task

  5. an online process

The combination of all these things allows us to complete a large set of simple tasks in a short time and often for a reduced cost. It also provides access to global labour markets for users who may not previously have been able to access these resources.

Participatory Computing

Participatory computing is a related concept to crowdsourcing, based around the idea that the resources and data of computing devices can be shared and used to complete tasks. As with crowdsourcing, these tasks are often large, complex and data-driven, but capable of being broken down into smaller chunks that can be distributed to separate computing devices in order to complete the larger task. BOINC is a clear example of this class of participatory computing.

Extended Mind Crowdsourcing

The extended mind hypothesis describes the way that humans extend their thinking beyond the internal mind, to use external objects. For instance, a person using a notebook to record a memory uses the ‘extended mind’ to record the memory; the internal mind simply recalls that the memory is located in the notebook, an object that is external to the individual.

Extended mind crowdsourcing takes crowdsourcing and participatory computing a step further by including the extended mind hypothesis, to allow us to describe systems that use the extended mind of participants, as represented by their devices and objects, in order to add implicit as well as explicit human computation for collective discovery.

What this means is that we can crowdsource the collection of data and completion of tasks using individual users, their devices, and the extended mind that the two together represent. By accessing the information stored within a smartphone or similar personal device, and the wider internet services that the device can connect to, we can access the extended mind of a participant and so learn more about his or her behaviour and individual characteristics. In essence, extended mind crowdsourcing captures the way in which humans undertake and respond to daily activity. In this sense it supports observation of human life and our interpretation of and response to the environment. Once social networks and social media communication are included within the extended mind, an extended mind need not represent only a single individual: extended mind crowdsourcing can also represent a group, such as a network or a collective.

By combining the ideas of social computing, crowdsourcing, and the extended mind, we are able to access and aggregate the data that is created through our use of technology. This allows us to extend ideas of human cognition into the physical world, in a less formal and structured way than when using other forms of human computational systems. The reduced focus on task-driven systems allows extended mind crowdsourcing (EMC) to be directed at the solving of loosely defined problems, and those problems where we have no initial expectations of solutions or findings.

This is a new way of thinking about the systems we create in order to solve problems using computational systems focused on humans, but it has the potential to be a powerful tool in our research toolbox. We are presenting this new Extended Mind Crowdsourcing idea this week at HICSS.

Quick and Dirty Twitter API in Python

Wednesday, Nov 19, 2014

QUICK DISCLAIMER: this is a quick and dirty solution to a problem, so may not represent best coding practice, and has absolutely no error checking or handling. Use with caution…

A recent project has needed me to scrape some data from Twitter. I considered using Tweepy, but as it was a project for the MSc in Computational Journalism, I thought it would be more interesting to write our own simple Twitter API wrapper in Python.

The code presented here will allow you to make any API request to Twitter that uses a GET request, so is really only useful for getting data from Twitter, not sending it to Twitter. It is also only for use with the REST API, not the streaming API, so if you’re looking for realtime monitoring, this is not the API wrapper you’re looking for. This API wrapper also uses a single user’s authentication (yours), so is not set up to allow other users to use Twitter through your application.

The first step is to get some access credentials from Twitter. Head over to https://apps.twitter.com/ and register a new application. Once the application is created, you’ll be able to access its details. Under ‘Keys and Access Tokens’ are four values we’re going to need for the API - the Consumer Key and Consumer Secret, and the Access Token and Access Token Secret. Copy all four values into a new Python file, and save it as ‘_credentials.py’. The code below refers to them as client_id, client_secret, access_token and access_secret respectively. Once we have the credentials, we can write some code to make some API requests!

First, we define a Twitter API object that will carry out our API requests. We need to store the API url, and some details to allow us to throttle our requests to Twitter to fit inside their rate limiting.

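import base64
import hmac
import json
import threading
import time
import urllib.error
import urllib.parse
import urllib.request
import uuid
from hashlib import sha1

# assumes _credentials.py defines these four names,
# which are the ones the methods below refer to
from _credentials import client_id, client_secret, access_token, access_secret


class Twitter_API:

    def __init__(self):

        # URL for accessing API
        scheme = "https://"
        api_url = "api.twitter.com"
        version = "1.1"

        self.api_base = scheme + api_url + "/" + version

        # seconds between queries to each endpoint
        # queries in this project are limited to 180 per 15 minutes,
        # so we aim for 175 to leave a small safety margin
        query_interval = float(15 * 60) / 175

        # rate limiting timer
        self.__monitor = {'wait': query_interval,
                          'earliest': None,
                          'timer': None}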
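To make the throttling arithmetic concrete, here is the same calculation on its own. This is just a worked example; the constant names are mine and are not part of the wrapper.

```python
# Worked example of the throttling interval (names are illustrative only).
WINDOW_SECONDS = 15 * 60   # length of the rate-limit window
ALLOWED = 180              # requests allowed per window, as quoted above
BUDGET = 175               # what we actually aim for, leaving a small margin

# seconds to wait between consecutive requests
query_interval = WINDOW_SECONDS / BUDGET

# how many requests fit in one window at that pace
requests_per_window = WINDOW_SECONDS / query_interval

print(round(query_interval, 2))  # 5.14
```

So the wrapper waits a little over five seconds between requests, which keeps it just under the limit.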

We add a rate limiting method that will make our API sleep if we are requesting things from Twitter too fast:

    #
    # rate_controller puts the thread to sleep
    # if we're hitting the API too fast
    def __rate_controller(self, monitor_dict):

        # join the timer thread
        if monitor_dict['timer'] is not None:
            monitor_dict['timer'].join()

        # sleep if necessary ('earliest' is None before the first request)
        if monitor_dict['earliest'] is not None:
            while time.time() < monitor_dict['earliest']:
                time.sleep(max(0, monitor_dict['earliest'] - time.time()))

        # work out when the next API call can be made
        earliest = time.time() + monitor_dict['wait']
        timer = threading.Timer(earliest - time.time(), lambda: None)
        monitor_dict['earliest'] = earliest
        monitor_dict['timer'] = timer
        monitor_dict['timer'].start()
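If the timer thread feels like more machinery than you need, the same idea can be sketched without threading. This is an alternative, not the code used in the wrapper: the class name `MinIntervalLimiter` and its `clock` and `sleep` arguments are my own inventions, added so the behaviour can be checked without really waiting.

```python
import time


class MinIntervalLimiter:
    """Block until at least `interval` seconds have passed since the last call."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock        # injectable so it can be faked in tests
        self._sleep = sleep
        self._earliest = None      # no request has been made yet

    def wait(self):
        now = self._clock()
        # only sleep if a previous call set an 'earliest' time still in the future
        if self._earliest is not None and now < self._earliest:
            self._sleep(self._earliest - now)
        # record when the next call is allowed
        self._earliest = self._clock() + self.interval
```

Calling `wait()` before each request gives the same pacing: the first call returns immediately, and later calls sleep only for whatever is left of the interval.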

The Twitter API requires us to supply authentication headers in the request. One of these headers is a signature, created by encoding details of the request. We can write a function that will take in all the details of the request (method, url, parameters) and create the signature:

    #
    # make the signature for the API request
    def get_signature(self, method, url, params):

        # percent-encode all parameter keys and values; OAuth needs
        # spaces encoded as %20, so use quote() rather than quote_plus()
        encoded_params = {}
        for k, v in params.items():
            encoded_k = urllib.parse.quote(str(k), safe="")
            encoded_v = urllib.parse.quote(str(v), safe="")
            encoded_params[encoded_k] = encoded_v

        # sort the parameters alphabetically by key
        sorted_keys = sorted(encoded_params.keys())

        # create a string from the parameters: key=value pairs joined by &
        signing_string = "&".join(key + "=" + encoded_params[key] for key in sorted_keys)

        # construct the base string
        base_string = method.upper()
        base_string += "&"
        base_string += urllib.parse.quote(url, safe="")
        base_string += "&"
        base_string += urllib.parse.quote(signing_string, safe="")

        # construct the key
        signing_key = urllib.parse.quote(client_secret, safe="") + "&" + urllib.parse.quote(access_secret, safe="")

        # sign the base string with the key (HMAC-SHA1), and base64 encode the result
        hashed = hmac.new(signing_key.encode(), base_string.encode(), sha1)
        signature = base64.b64encode(hashed.digest())
        return signature.decode("utf-8")
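To see the signing steps in isolation, here is a standalone sketch that can be run without any real credentials. The helper names `pct` and `sign` and all of the secrets and parameter values are made up for illustration. It percent-encodes with `urllib.parse.quote(..., safe="")`, so that a space becomes %20 as OAuth 1.0a expects.

```python
import base64
import hmac
import urllib.parse
from hashlib import sha1


def pct(value):
    # OAuth percent-encoding: escape everything except unreserved characters
    return urllib.parse.quote(str(value), safe="")


def sign(method, url, params, consumer_secret, token_secret):
    # encode the parameters, sort them by key, and join them as k=v pairs
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    param_string = "&".join(k + "=" + v for k, v in pairs)

    # base string: METHOD & encoded URL & encoded parameter string
    base_string = "&".join([method.upper(), pct(url), pct(param_string)])
    signing_key = pct(consumer_secret) + "&" + pct(token_secret)

    # HMAC-SHA1 over the base string, then base64 encode the digest
    digest = hmac.new(signing_key.encode(), base_string.encode(), sha1).digest()
    return base64.b64encode(digest).decode("utf-8")


# made-up values, purely for illustration
demo_params = {"screen_name": "martinjc", "count": 5, "oauth_nonce": "abc123"}
demo_signature = sign(
    "GET",
    "https://api.twitter.com/1.1/statuses/user_timeline.json",
    demo_params,
    "dummy-consumer-secret",
    "dummy-token-secret",
)
print(demo_signature)
```

The same inputs always give the same signature, and changing any parameter, the URL, or either secret changes it, which is what lets the server verify that the request has not been altered.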

Finally, we can write a method to actually make the API request:

    def query_get(self, endpoint, aspect, get_params=None):

        # rate limiting
        self.__rate_controller(self.__monitor)

        # ensure we're dealing with strings as parameters
        str_param_data = {}
        for k, v in (get_params or {}).items():
            str_param_data[str(k)] = str(v)

        # construct the query url
        url = self.api_base + "/" + endpoint + "/" + aspect + ".json"

        # add the header parameters for authorisation
        header_parameters = {
            "oauth_consumer_key": client_id,
            "oauth_nonce": uuid.uuid4().hex,
            "oauth_signature_method": "HMAC-SHA1",
            "oauth_timestamp": int(time.time()),  # whole seconds since the epoch
            "oauth_token": access_token,
            "oauth_version": "1.0"
        }

        # collect all the parameters together for creating the signature
        signing_parameters = dict(header_parameters)
        signing_parameters.update(str_param_data)

        # create the signature and add it to the header parameters
        header_parameters["oauth_signature"] = self.get_signature("GET", url, signing_parameters)

        # build the OAuth header: comma-separated key="value" pairs
        header_string = "OAuth " + ", ".join(
            urllib.parse.quote_plus(str(k)) + "=\"" + urllib.parse.quote_plus(str(v)) + "\""
            for k, v in header_parameters.items())

        headers = {
            "Authorization": header_string
        }

        # create the full url including parameters (spaces encoded as %20)
        url = url + "?" + urllib.parse.urlencode(str_param_data, quote_via=urllib.parse.quote)
        request = urllib.request.Request(url, headers=headers)

        # make the API request
        try:
            response = urllib.request.urlopen(request)
        except urllib.error.HTTPError as e:
            print(e)
            raise e
        except urllib.error.URLError as e:
            print(e)
            raise e

        # read the response and return the json
        raw_data = response.read().decode("utf-8")
        return json.loads(raw_data)

Putting this all together, we have a simple Python class that acts as an API wrapper for GET requests to the Twitter REST API, including the signing and authentication of those requests. Using it is as simple as:

 ta = Twitter_API()

 # retrieve tweets for a user
 params = {
    "screen_name": "martinjc",
 }

 user_tweets = ta.query_get("statuses", "user_timeline", params)
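A single call only returns one page of results. As a rough sketch of how you might walk further back through a timeline, the helper below pages using the `count` and `max_id` parameters of the `statuses/user_timeline` endpoint. The function name `fetch_timeline` and its arguments are my own, and it works with any object that has a `query_get` method like the one above.

```python
def fetch_timeline(api, screen_name, pages=3, per_page=200):
    """Collect up to `pages` pages of a user's tweets, newest first."""
    tweets = []
    max_id = None
    for _ in range(pages):
        params = {"screen_name": screen_name, "count": per_page}
        if max_id is not None:
            params["max_id"] = max_id
        batch = api.query_get("statuses", "user_timeline", params)
        if not batch:
            break  # nothing older left to fetch
        tweets.extend(batch)
        # max_id is inclusive, so step just below the oldest tweet seen so far
        max_id = min(tweet["id"] for tweet in batch) - 1
    return tweets
```

With the wrapper above this would be called as `fetch_timeline(ta, "martinjc")`, and each page still passes through the rate controller.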

As always, the full code is online on Github, in both my personal account and the account for the MSc Computational Journalism.

How do people decide whether or not to read a tweet?

Tuesday, Nov 4, 2014

It turns out that an existing relationship with the author of the tweet is one of the main factors influencing how someone decides whether or not to read a tweet. At the same time, a high number associated with a tweet, such as its retweet count, can also make the tweet more attractive to readers.

Our latest Open Access research has discovered how much effect the information about a tweet has on whether people decide to read it or not.

By showing hundreds of Twitter users the information about two tweets but not the tweets themselves, and then asking the users which tweet they would like to read, we have been able to look at which information is more important when users are deciding to read a tweet.

We looked at two different types of information:

  1. Simple numbers that describe the tweet, such as the number of retweets it has, or numbers that describe the author, such as how many followers they have, or how many tweets they’ve written.

  2. Whether a relationship between the reader and the author is important, and whether that relationship was best shown through subtle hints, or direct information.

When readers can see only one piece of information, the case is clear: they’d rather read the tweet written by someone they are following. Readers can easily recognise the usernames, names, and profile images of people they already follow, and are likely to choose to read content written by someone they follow (instead of content written by a stranger) around 75% of the time. If all they can see is a piece of numerical information, they would rather read the tweet with the highest number, no matter what that number is. The effect is strongest with the number of retweets, followed by the number of followers, but even for the number of accounts the author follows and the number of tweets they have written the effect is significant.

When readers can see two pieces of information, one about their relationship with the author, and one numerical, there are two cases to look at. When the author they follow also has a high numerical value, readers will choose that tweet in around 80% of the cases. When the author they already follow has a lower numerical value, it is still the existing relationship that is more of a draw. Readers would rather read a tweet from someone they know that has a low number of retweets, than one from a stranger with a high number of retweets.

This work offers an understanding of how the decision-making process works on Twitter when users are skimming their timelines for something to read, and has particular implications for the display and promotion of non-timeline content within content streams. For instance, readers may pay more attention to adverts and promoted content if the link between themselves and the author is highlighted.

Previous results  from an early experiment were published at SocialCom. The results in this new paper are from a modified and expanded version of this earlier experiment.

Beards, 'Taches and Testicles

Saturday, Nov 1, 2014

This is me:

mildly hungover morning selfie

Obviously the first thing you notice, after my devilishly handsome good looks, is that I have around the lower half of my face what has the potential to be described as, if one is kind: a ‘beard’. It is patchy, it is more often than not unkempt, and it is quite ginger, but it is somewhat beard-like. I can no longer remember when I grew this beard, but I like it. I like it so much that I refused to shave it off when I graduated in 2013, and again when I got married earlier this year.

However, ominous things have happened. Recently, a mate and colleague done a tweet:

“Good on Pete” I thought. Good cause. I did Movember back in 2011, and it was hard, because quite frankly with a moustache I look like a complete tit. At the time I was doing it, I think Pete and I were sharing an office, so he knows how much of a tit you can look like during Movember, yet he’s chosen to do it anyway. Well done.

Of course, you won’t catch me doing it. I have a beard now, and I won’t shave that off. Also, as I mentioned, I look like a complete tit when I grow a moustache. It was fine in 2011, I was only an RA, so I could just hide in the office and work. The only person affected was my wife, who sadly had to be seen in public with me. I’m a lecturer now. I can’t just hide in my office. I have to teach. I have to stand up in front of students. I can’t do that looking like a person who belongs on some sort of list.

Then Vince Knight joined Pete’s team:

“Well done Vince” I thought. Good cause. At least Pete won’t look so daft walking around campus with a ‘tache now. There’ll be two of you at least. Not me of course. No way.

Then Pete done another tweet:

Oh.

Pete’s called me out. He wants me to join in. Maybe we’ll just all ignore him and it’ll go away.

Then I done a tweet:

WTF? What did I just do? Did I agree to do Movember again? Why? I have no idea. Perhaps I enjoy looking like a tit?

So. I joined. As did many others that Pete called out. And now we’re all going to grow moustaches and demand money from our friends, relatives and colleagues. It’s a good cause. You can donate to us, our team page is here.

First though, there’s business to take care of. The beard had to go. I had to locate my shaving equipment, which has not been used in many years, and attempt to remove the lovely facial hair to which I have become so attached, without slicing my face apart in the process:

WHAT HAVE I DONE?

So that’s it. The beard is off and I am clean-shaven for the first time in I don’t know how long. This, I think, is quite the sacrifice. But there is more to come. The ‘tache is on its way - slowly working its way out of my upper lip. I am going to look terrible. If you in any way feel inclined, please make it worth it. Donate to me or the team. Don’t let my beard have fallen in vain.  After all (I came up with this last night while very drunk and I LOVE IT):  beards grow back. Balls don’t.

GeoJSON and topoJSON for UK boundaries

Wednesday, Sep 17, 2014

I’ve just put an archive online containing GeoJSON and topoJSON for UK boundary data. It’s all stored on Github, with a viewer and download site hosted on Github pages.

Browser for the UK topoJSON stored in the Github repository

The data is all created from shapefiles released by the Office for National Statistics, Ordnance Survey and National Records of Scotland, all under the Open Government and OS OpenData licences.

In later posts I’ll detail how I created the files, and how to use them to create interactive choropleth maps.
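In the meantime, here is a minimal sketch of what reading a GeoJSON file looks like in Python. The tiny `sample` collection, the `name` property and the `polygon_bbox` helper are all made up for illustration; the files in the archive will have their own property names, so check those before relying on any of this.

```python
import json

# a tiny hand-written FeatureCollection standing in for a downloaded file
sample = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"name": "Example area A"},
     "geometry": {"type": "Polygon",
       "coordinates": [[[-3.2, 51.4], [-3.1, 51.4], [-3.1, 51.5], [-3.2, 51.4]]]}},
    {"type": "Feature",
     "properties": {"name": "Example area B"},
     "geometry": {"type": "Polygon",
       "coordinates": [[[-3.0, 51.5], [-2.9, 51.5], [-2.9, 51.6], [-3.0, 51.5]]]}}
  ]
}
"""


def polygon_bbox(geometry):
    # outer ring only; GeoJSON positions are [longitude, latitude]
    ring = geometry["coordinates"][0]
    lons = [p[0] for p in ring]
    lats = [p[1] for p in ring]
    return (min(lons), min(lats), max(lons), max(lats))


collection = json.loads(sample)
names = [f["properties"]["name"] for f in collection["features"]]
boxes = [polygon_bbox(f["geometry"]) for f in collection["features"]]

for name, box in zip(names, boxes):
    print(name, box)
```

Real boundary files also contain MultiPolygon geometries, which this helper does not handle.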

CCGs and WPCs via the medium of OAs

Monday, Sep 15, 2014

As I was eating lunch this afternoon, I spotted a conversation between @JoeReddington and @MySociety whizz past in Tweetdeck. I traced the conversation back to the beginning and found this request for data:

I’ve been doing a lot of playing with geographic data recently while preparing to release a site making it easier to get GeoJSON boundaries of various areas in the UK. As a result, I’ve become pretty familiar with the Office for National Statistics Geography portal, and the data available there. I figured it must be pretty simple to hack something together to provide the data Joseph was looking for, so I took a few minutes out of lunch to see if I could help.

Checking the lookup tables at the ONS, it was clear that unfortunately there was no simple ‘NHS Trust to Parliamentary Constituency’ lookup table. However, there were two separate lookups involving Output Areas (OAs). One allows you to look up which Parliamentary Constituency (WPC) an OA belongs to. The other allows you to look up which NHS Clinical Commissioning Group (CCG) an OA belongs to. Clearly, all that’s required to link the two is a bit of quick scripting to join them via the Output Areas.
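Before working with the real files, the idea can be sketched on a toy example. The codes below are hypothetical placeholders, not real ONS identifiers: each OA appears in both lookups, so walking the OAs lets us collect the constituencies that each CCG touches.

```python
from collections import defaultdict

# hypothetical codes, not real ONS identifiers
oa_to_ccg = {"OA1": "CCG_A", "OA2": "CCG_A", "OA3": "CCG_B"}
oa_to_wpc = {"OA1": "WPC_X", "OA2": "WPC_Y", "OA3": "WPC_Y"}

# for every OA, record the constituency it sits in against its CCG
ccg_to_wpcs = defaultdict(set)
for oa, ccg in oa_to_ccg.items():
    ccg_to_wpcs[ccg].add(oa_to_wpc[oa])

for ccg in sorted(ccg_to_wpcs):
    print(ccg, sorted(ccg_to_wpcs[ccg]))
```

Here CCG_A ends up linked to both WPC_X and WPC_Y, while CCG_B is linked only to WPC_Y.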

First, let’s create a dictionary with an entry for each CCG. For each CCG we’ll store its ID, name, and a set of OAs contained within. We’ll also add an empty set for the WPCs contained within the CCG:

import csv

data = {}

# extract information about clinical commissioning groups
with open('OA11_CCG13_NHSAT_NHSCR_EN_LU.csv', 'r') as oa_to_ccg_file:
  reader = csv.DictReader(oa_to_ccg_file)
  for row in reader:
    # first time we see this CCG: create its entry with empty sets
    if row['CCG13CD'] not in data:
      data[row['CCG13CD']] = {
        'CCG13CD': row['CCG13CD'],
        'CCG13NM': row['CCG13NM'],
        'PCON11CD list': set(),
        'PCON11NM list': set(),
        'OA11CD list': set(),
      }
    data[row['CCG13CD']]['OA11CD list'].add(row['OA11CD'])

Next we create a lookup table that allows us to convert from OA to WPC:

# extract information for output area to constituency lookup
oas = {}
pcon_nm = {}

with open('OA11_PCON11_EER11_EW_LU.csv', 'r') as oa_to_pcon_file:
  reader = csv.DictReader(oa_to_pcon_file)
  for row in reader:
    oas[row['OA11CD']] = row['PCON11CD']
    pcon_nm[row['PCON11CD']] = row['PCON11NM']

As the penultimate step we go through the CCGs, and for each one we go through the list of OAs it covers, and look up the WPC each OA belongs to:

# go through all the ccgs and look up pcons from oas
for ccg, d in data.items():

  for oa in d['OA11CD list']:
    d['PCON11CD list'].add(oas[oa])
    d['PCON11NM list'].add(pcon_nm[oas[oa]])

  # the OA list isn't part of the output, so drop it from every CCG
  del d['OA11CD list']

Finally we just need to output the data:

for d in data.values():
  d['PCON11CD list'] = ';'.join(d['PCON11CD list'])
  d['PCON11NM list'] = ';'.join(d['PCON11NM list'])

with open('output.csv', 'w') as out_file:
  writer = csv.DictWriter(out_file, ['CCG13CD', 'CCG13NM', 'PCON11CD list', 'PCON11NM list'])
  writer.writeheader()
  writer.writerows(data.values())

Run the script, and we get a nice CSV with one row for each CCG, each row containing a list of the WPC ids and names the CCG covers.

Of course, this data only covers England (as CCGs are a division in NHS England). Although there don’t seem to be lookups for OAs to Health Boards in Scotland, or from OAs to Local Health Boards in Wales, it should still be possible to do something similar for these countries using Parliamentary Wards as the intermediate geography, as lookups for Wards to Health Boards and Local Health Boards are available. It’s also not immediately clear how well the boundaries for CCGs and WPCs match up; that would require further investigation, depending on what the lookup is to be used for.
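One rough way to start that investigation would be to count how many of a CCG’s Output Areas fall in each constituency. The sketch below does this on toy data; the function name `overlap_shares` and the codes are hypothetical, and counting OAs is only a proxy, since it ignores how many people or how much area each OA represents.

```python
from collections import Counter


def overlap_shares(oa_to_ccg, oa_to_wpc):
    """Share of each CCG's OAs that fall in each constituency."""
    pair_counts = Counter()
    ccg_totals = Counter()
    for oa, ccg in oa_to_ccg.items():
        wpc = oa_to_wpc.get(oa)
        if wpc is None:
            continue  # OA missing from the constituency lookup
        pair_counts[(ccg, wpc)] += 1
        ccg_totals[ccg] += 1
    # divide each pair count by the total number of OAs in that CCG
    return {pair: n / ccg_totals[pair[0]] for pair, n in pair_counts.items()}


# hypothetical codes, not real ONS identifiers
toy_ccg = {"OA1": "CCG_A", "OA2": "CCG_A", "OA3": "CCG_A", "OA4": "CCG_A"}
toy_wpc = {"OA1": "WPC_X", "OA2": "WPC_X", "OA3": "WPC_X", "OA4": "WPC_Y"}
shares = overlap_shares(toy_ccg, toy_wpc)
print(shares)
```

A CCG whose OAs almost all sit in one constituency lines up well with it; a CCG with its OAs spread thinly over several constituencies does not.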

All the code, input and output for this task is available on my github page.

sitting on the dock of the bay

Monday, Sep 8, 2014

While we negotiate the transition from the old house, which we’ve sold, to the new house we’ve just bought, we’ve been renting a lovely flat up on Penarth head. One of the main benefits of this flat is the glorious view over Cardiff Bay and to the city centre beyond. No matter what time it is, whenever I pass by the living room window I end up staring out across the city. During the day, there are boats coming and going through the barrage locks, or into the docks proper. At night the city is lit up with a terrible orange urban glow that somehow looks both peaceful and exciting. I’ve spent a lot of time just stood on the balcony watching, and it’s been quite relaxing. Not only that, but I’ve had the opportunity to see some fairly interesting occurrences, especially when there’s been an unusual visitor to Cardiff docks, such as this tall ship we had visiting earlier in the year:

The Stavros S Niarchos leaving Cardiff Docks

This was the case again this evening, when we were able to stand and watch the warships of various flags and types leaving Cardiff docks after the conclusion of the NATO summit in Newport. Leaving aside any particular feelings about militarisation, it is still genuinely interesting to see these things in your home city, even more so when you’ve got a good view.

Unfortunately, despite still being up on the hill overlooking the city, the new house does not have such a commanding view of the docks, bay, or Cardiff. Losing that is one of the worst things about having to move. I guess I’ll just have to get used to putting my shoes on and leaving the house whenever I want to stare out over the bay…

View over Cardiff Bay from Northcliffe

The Graphical Web 2014

Thursday, Sep 4, 2014

photo of the author outside Winchester cathedral

Last week I had a lovely time down in Winchester with m’colleague, attending The Graphical Web 2014. This year the theme was ‘Visual Storytelling’, so I’d gone along to see what new things we could learn about visualisation to include in the MSc in Computational Journalism. We’d also already had a few conversations about the course with people who were going to be at the conference, so we were planning to take the opportunity to chat in person about their involvement.

There were many excellent, informative and entertaining talks, ranging from the process behind the redesign of Google Maps, through how Twitter does data visualisation, and on to what happens when your data visualisation becomes immensely popular. I’d highly recommend anyone with an interest in any of this to take some time to look through the schedule and watch the videos of some of the talks - I’ll certainly be forcing the MScCompJ students to watch a few.

Scott Murray educates us on the best design process

There were some interesting messages from people at the conference that I’ll be taking forward with my own work and trying to impart to the students. One that is key, I think, is to strike the right balance between detail and simplicity when presenting data. This was mentioned several times throughout the conference, but it really is important. Too much information in your visualisation and you can alienate the reader and confuse or hide your message. Not enough information and the context is lost, and the use of the design to the more advanced reader is reduced. It’s one of those balancing acts that we find so often when trying to mix both people and computers. Attempting to solve this problem and find this balance is challenging and interesting, and I look forward to seeing how the students next year cope with it.

Overall, it was a really good conference. I met a number of interesting people,  found a whole set of new people to follow on Twitter, and returned to Cardiff excited about the year ahead.

Foursquare icon downloading (yet again)

Friday, Aug 1, 2014

Previously I’ve written about a little script to download all the category icons from Foursquare and to create many different coloured versions of them. I’ve recently had to do this again for a project, and found my previous script did not work with a recent API version. I’ve updated the script to fix it and put it up on github.