How to access and download your Google Web History with wget

Sunday, Jun 5, 2011

Google Web History has been recording all of the searches I’ve made in Google since about 2005. Obviously six years of search queries and results is a phenomenal amount of data, and it would be nice to get hold of it all to see what I could make of it. Fortunately Google make the data available as an RSS feed, although it’s not particularly well documented.

(caution - many ‘ifs’ coming up next)

If you’re logged into your Google account, the RSS feed can be accessed at:

https://www.google.com/history/?q=&output=rss&num=NUM&start=START

If you’re using a *nix based operating system (Linux, Mac OS X etc.) you can then use wget on the command line to get the data. The example below retrieves the 1000 most recent searches in your history:

wget --user=GOOGLE_USERNAME  \
--password=PASSWORD --no-check-certificate \
"https://www.google.com/history/?q=&output=rss&num=1000&start=0"

If you’ve enabled two-factor authentication on your Google account you’ll need to create an app-specific password for wget so it can access your account - the password in the example above should be this app-specific password, not your main account password. If you haven’t enabled two-factor authentication then you might be able to use your normal account password, but I haven’t tested this.

A simple bash script will then allow you to download the entire search history:

for START in $(seq 0 1000 50000)
do
  wget --user=GOOGLE_USERNAME \
    --password=WGET_APP_SPECIFIC_PASSWORD --no-check-certificate \
    "https://www.google.com/history/?q=&output=rss&num=1000&start=$START"
done

You may need to adjust the numbers in the first line - I had to go up to 50000 to get my entire search history back to 2005; you may need to make fewer calls if your history is shorter, or more if it’s longer.
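Once you’ve got the files, the RSS is easy to pick apart. Here’s a minimal Python sketch that pulls the title and date out of each item; I’m assuming the feed follows standard RSS 2.0 conventions with the query text in each item’s title, so check one of your downloaded files first:

```python
import xml.etree.ElementTree as ET

def extract_searches(rss_text):
    """Pull (title, pubDate) pairs out of an RSS 2.0 feed string."""
    root = ET.fromstring(rss_text)
    searches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        date = item.findtext("pubDate", default="")
        searches.append((title, date))
    return searches
```

Run it over each downloaded file in turn and concatenate the results, and you’ve got your whole history as one list ready for whatever analysis takes your fancy.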

Losing weight in 2011

Sunday, Jun 5, 2011

Since the beginning of the year I’ve been living under what we’ve been calling ‘the new regime’. This ‘new regime’ basically involves not living like a fat useless slob, so I’ve been getting fit, eating healthily and losing weight. So far I’ve lost over 10kg and can now run around the park a few times without collapsing to the floor clutching at my chest and screaming about ambulances, so I’d say it’s going pretty well. The basic concept behind the new regime is:

Eat less + do more = lose weight.

This will be followed once some weight has been lost by:

Eat a normal amount + do more = stay the same.

About a month ago, I came across someone somewhere on the internet recommending “The Hacker’s Diet” as a guide for weight loss. Not having read such a guide before starting ‘the new regime’, I skimmed it a bit; the tl;dr version is:

Eat less + do more = lose weight.

This doesn’t exactly seem like rocket science to me, but lots of people seem to have a problem grasping this concept. The Hacker’s Diet does a pretty decent job of describing the human body as a simple system with inputs and outputs and manages to explain that if you limit your input and increase your output, you get a deficit and lose weight. So if you find anyone that says ‘Oh, I really struggle to lose weight’, slap them round the back of the head and point them in that direction.
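The arithmetic behind that deficit is just as unglamorous. A minimal sketch, using the common rule-of-thumb figure of roughly 7,700 kcal per kilogram of body fat (my assumption here, not a number taken from the book):

```python
KCAL_PER_KG_FAT = 7700  # rough rule-of-thumb figure (about 3500 kcal per lb)

def weeks_to_lose(kg_to_lose, daily_deficit_kcal):
    """How many weeks a steady daily calorie deficit takes to shift the weight."""
    total_kcal = kg_to_lose * KCAL_PER_KG_FAT
    return total_kcal / (daily_deficit_kcal * 7)
```

So a steady 500 kcal/day deficit shifts 10kg in about 22 weeks, which lines up pretty well with how the first five months of the new regime have actually gone.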

The whole point of this post is that the last couple of chapters of the book contain a lot of information about tracking the calories you eat, the calories you burn, analysing trends from daily weight figures and so on. There’s a lot of detail on how to create spreadsheets to calculate weight trends, how to keep a daily log of calorie intake, and pages and pages of calorific information for food. The thing is, it’s 2011 now, so none of those chapters are necessary: as with anything that’s a pain in the rear end, there are now loads of apps available to make life easier. As I’ve been using a number of them for 5 months, I figure I’ll share the knowledge and review some of them over the next few days.
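For anyone who’d rather roll their own anyway, the trend analysis the book describes is an exponentially smoothed moving average of your daily weigh-ins, with each day’s trend value moving 10% of the way towards that day’s scale reading. A minimal sketch:

```python
def weight_trend(daily_weights, smoothing=0.1):
    """Exponentially smoothed moving average of daily weigh-ins: each day
    the trend moves a fraction (10% by default) of the way towards that
    day's scale reading, smoothing out day-to-day water-weight noise."""
    trend = []
    current = daily_weights[0]
    for weight in daily_weights:
        current += smoothing * (weight - current)
        trend.append(round(current, 2))
    return trend
```

The point of the smoothing is that the trend line barely twitches at a single bad weigh-in but steadily follows any real change, which is exactly what you want when the scale jumps a kilo overnight for no good reason.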

Summer project

Friday, Jun 3, 2011

Over the summer, Ian and I are going to be supervising a summer project. We’re going to get a 2nd year undergraduate student for an 8 week project. I’ve stuck up a page about it here, and will post whenever there’s something interesting to see.

Paper published

Thursday, Jun 2, 2011

Our latest paper (“Opportunistic social dissemination of micro-blogs”) on some of the last work we did for the Socialnets project has finally been published, and can be viewed online here, or in preprint form from my publications page.

Unexpected

Thursday, Jun 2, 2011

So I mentioned something about keeping busy? Yeah, well….

A while back I spent a couple of months playing around with an idea that we thought could make an interesting bit of research. It started with a simple modification to a protocol proposed by someone else (Gavidia et al., “A gossip-based distributed news service for wireless mesh networks”), where we added some elements of self-adaptation and cooperation. As things sometimes go, the results were quite good but not outstanding and we had more pressing things to look at, so we dropped it and moved onto something else.

It’s never nice to just drop work and not get anything from it though, so we wrote a technical report about the work we’d completed that we could stick in a deliverable somewhere. At the same time, we noticed a conference workshop where a paper on the work might fit, and decided it might be an idea to trim the report down for submission. Unfortunately, at that point things kicked off with a couple of journal papers that we’d been working on at the same time so we didn’t have time to do the submission.

Fast forward to the SocialNets/Recognition meeting in mid February and we learn that the deadline for the workshop was extended. On the spur of the moment we decided to have a bash at a paper for it. We cut the tech report down, gave it an edit, and submitted it. Fast forward again to this week and we get the notification through that the paper has been accepted. Previously we had a bit of work that would never see the light of day, buried at the back of an EU deliverable. Now for very little effort we have a published bit of work, I’ve got another publication to add to the list, and a trip to a conference as well. In Italy. In June. Sometimes life is just too cruel :-)

Keeping Busy

Thursday, Apr 7, 2011

I have found that the secret to forgetting that it’s my viva in 13 days is to keep busy. Extraordinarily busy. Luckily, work is conspiring right along with me: coding on our first experiment is in full go-mode, there’s a workshop + social dinner next week to help organise, reading and planning for our main ‘project’ to do, a paper to edit and revise for a workshop, and a side project looking at 4sq to move from prototype ‘buggy as hell, falls over if you look at it funny’ mode to ‘production, solid as a rock, leave it running and forget about it’ mode. That’s all before you remember that there are also the final deliverables to work on as well! Luckily it’s just what I need to keep my mind off the 20th.

Just hoping I can get it all done before I do actually need to stop and remember what it was I did for my PhD of course….

Django + Tweepy + OAuth

Monday, Apr 4, 2011

There is a lot of information out there about putting Django and Tweepy together to make a Twitter web app. I read a lot of it recently, and although much of it is helpful, some of it is quite complicated or out of date.

To make things simple, I thought I’d create an example Django project that just contains the basics needed to get up and running.

The ‘DjangoTweepy’ project, hosted on GitHub here, contains all the code needed to get up and running with a Twitter web app using Django and Tweepy. Download it and follow the instructions to get going quickly, or with a bit of hacking add the ‘twitter_auth’ app to your existing projects.

Twitter Wordle

Monday, Apr 4, 2011

Next week we have the project partners coming over to Cardiff for a workshop. Stu and I were discussing how we should have some way to display information throughout the day, something that makes it easy to pick out the main themes of talks etc. I’m a big fan of the Wordle as a way of displaying text, so we thought it would be nice to have a dynamic Wordle displayed that people could add to throughout the workshop.

I found a python library called pyTagCloud that will turn text into wordle-like tag clouds, either as images or as HTML. As I mentioned previously I already have a django based project on the go which interfaces with Twitter, so I already had code written that uses Tweepy to OAuth with Twitter and do a search for keywords. Combining the two, I get an app that will continually search Twitter for a given keyword, extract the text from all the tweets and display a wordle along with the latest tweets. People can contribute to the display by tweeting with a given hash tag, and as long as we search for that hash tag, their opinions and notes will be displayed.

For instance, here’s the display running with one of today’s trending topics: ‘Alan Titchmarsh’:

It’s not entirely perfect; I need to do some more filtering on the text to remove words that aren’t caught by pyTagCloud’s stop word filtering, such as ‘rt’, ‘http’, ‘bit’, ‘ly’ and so on. It would also be good to remove things that aren’t words, like ‘xxxxxx’. I’ve been looking at the Natural Language Toolkit for a couple of days for the other project, so I’ll probably re-use some of that code here too.
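Something along these lines is what I have in mind; the extra stop words and the crude ‘looks like a word’ checks are my own rough heuristics, not anything pyTagCloud provides:

```python
import re
from collections import Counter

# twitter debris that slips past generic stop word lists
EXTRA_STOPWORDS = {"rt", "http", "https", "bit", "ly", "via"}

def tweet_words(tweets):
    """Tokenise tweets into a word-frequency count, dropping twitter
    debris and things that don't look like real words."""
    counts = Counter()
    for tweet in tweets:
        for word in re.findall(r"[a-z']+", tweet.lower()):
            if word in EXTRA_STOPWORDS:
                continue
            # drop short fragments and vowel-less strings like 'xxxxxx'
            if len(word) < 3 or not re.search(r"[aeiou]", word):
                continue
            if len(set(word)) == 1:  # 'aaaa' and friends
                continue
            counts[word] += 1
    return counts
```

The resulting Counter can be fed straight into a tag cloud, since all pyTagCloud really needs is word frequencies.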

The only problem now is that I’m probably going to be the only person at the workshop tweeting…

fun with django...

Monday, Mar 28, 2011

One of the latest ideas we are working on for the new project is looking at how providing different amounts of information affects our perception of content. Even in pure implementation terms this is interestingly unlike any of the work I’ve done in the last few years, most of which was simulation based. “Code a simulation, run the simulation, analyse the results” has been a pattern of work since the beginning of my PhD and beyond. This recent idea however needs us to essentially survey a large number of people, so rather than just coding some command-line simulation in C/C++ (or, God forbid, Fortran) that only one or two people might use, I’m building a web based system that (hopefully) will have many hundreds of people interact with it.

From the fact that this website is just a basic WordPress install with very few changes to the default theme you can probably tell that I am not a web developer. In the recent past I’ve very rarely had any call to do anything web-based beyond just keeping a personal (static HTML + CSS) home page up to date here at the school. To suddenly move into building a dynamic website that does a lot of complicated database stuff and interfaces with other services via OAuth is quite a jump. Fortunately, it’s been a relatively easy jump thanks to django.

I’m not going to ramble on about what django is (basically a python web framework), if you haven’t heard of it go check it out. What I am going to do is recommend it highly. It is so simple, so quick and so powerful that implementing a web project with it is dreamy. We first discussed this research idea sometime at the beginning of the month, and I started coding it sometime around the 16th. Since then I’ve actually spent a week working on another project (and the ever present SocialNets deliverables for the EU), but I’m still massively into implementation to the point where I’m confident of having a site ready to beta test this week. Which I think is pretty good going for someone with as little web programming skill as I have!

So, there. Django. Try it out, it’s aces.

p.s - the project will hopefully be open to the public at some point in the future, once it is, I’ll post here.