Gatekrash 2 Beta
Automated UK event listings aggregator
- Case study by John
- February 21st, 2011
Gatekrash is a fully automated UK event listings aggregator. It crawls various websites for event information, processes it, and presents it in a clear, concise way. The best way to think about it is as the event listings version of your TV's Now and Next programme guide.
It's worth pointing out that this version of Gatekrash (version 2) is an almost complete rewrite of Gatekrash version 1.
Aims of the project
The main goals of this project were to add a few more features to Gatekrash, speed up the response time of the site, increase the number of events it lists, and introduce a new (more attractive) design. Additionally, the new site had to present information to visitors much more clearly.
The vast majority of the site was rewritten, although some elements were still fit for purpose and so were integrated into the new site. The site itself resembles its predecessor in the navigation and basic layout, but the similarities end there.
The way Gatekrash collects event listings information is fairly simple. At a regular interval, Gatekrash starts a crawl of an event source. Pages (or feeds) are downloaded to permanent storage (i.e. non-volatile memory). Which pages are fetched is determined by a whitelist of URLs or URL formats, and a set of parameters governs the rate of requests, how many pages to download, and so on. Once all pages have been downloaded, the processing begins.
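To make the crawl step a little more concrete, here is a minimal sketch in Python of a whitelist-driven, rate-limited download loop. It is an illustration rather than Gatekrash's actual code: the function names, the use of regular expressions for the URL formats, the file naming and the parameter names (max_pages, delay_seconds) are all assumptions.

```python
import re
import time
from pathlib import Path
from urllib.request import urlopen


def is_whitelisted(url, patterns):
    """Return True if the URL fully matches any whitelisted URL format."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)


def fetch_over_http(url):
    """Default downloader: return the raw bytes of a page or feed."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def crawl(urls, patterns, out_dir, max_pages=100, delay_seconds=1.0,
          fetch=fetch_over_http):
    """Download whitelisted URLs to disk, pausing between requests."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        if len(saved) >= max_pages:
            break  # page limit for this crawl reached
        if not is_whitelisted(url, patterns):
            continue  # not on the whitelist, so never requested
        if saved:
            time.sleep(delay_seconds)  # throttle the rate of requests
        path = out_dir / f"page_{len(saved):05d}.html"
        path.write_bytes(fetch(url))
        saved.append((url, path))
    return saved
```

The fetch argument is only there so the loop can be tried out without touching the network. Processing would start once crawl() has returned the list of saved pages.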
Several scrapers, designed for different sites and types of crawl, use regular expressions to extract information from the downloaded pages. Sometimes data is missing from these pages, so the scraper has to attempt to 'guess' that information. For example, if there is no end date on the page, but there is a start date, a start time and an end time, the scraper uses the two times to estimate the end date. If the end time is later than the start time, the event probably ends on the same day; if it's earlier, chances are it ends the next day. There are obviously exceptions to this rule, but in general it seems to hold.
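The end date guess described above can be sketched in a few lines of Python. The function name is hypothetical, and the handling of identical start and end times (treated here as the same day) is an assumption, since that case isn't covered by the rule.

```python
from datetime import date, time, timedelta


def guess_end_date(start_date, start_time, end_time):
    """Guess a missing end date from the start date and the two times."""
    if end_time >= start_time:
        # End time is not earlier than the start: probably the same day.
        # (Equal times count as the same day, which is an assumption.)
        return start_date
    # End time is earlier than the start time: probably the next day.
    return start_date + timedelta(days=1)
```

So an event starting at 22:00 on 21 February with an end time of 03:00 would be given an end date of 22 February, while one running from 20:00 to 23:00 would keep 21 February.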
After the page has been 'scraped', and information has been extracted from it, it needs to be cleaned up. The 'distillery' is essentially that - it distills the data into the 'purest' (cleanest, most structured) form possible. It flags up missing required elements, converts unusual characters to their character reference counterparts, trims whitespace and so on. It also uses the Python Natural Language Toolkit to extract named entities from the description if there is one.
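As a rough idea of what this clean-up might involve, here is a small Python sketch using only the standard library. The field names and the list of required fields are hypothetical, collapsing runs of internal whitespace goes slightly beyond plain trimming, and the named entity step with the Natural Language Toolkit is left out entirely.

```python
# Hypothetical set of fields an event must have to be usable.
REQUIRED_FIELDS = ("title", "venue", "place", "start_date")


def distill(raw_event):
    """Return (clean_event, missing_fields) for one scraped event."""
    clean = {}
    for key, value in raw_event.items():
        if isinstance(value, str):
            # Trim the ends and collapse internal runs of whitespace.
            value = " ".join(value.split())
            # Replace non-ASCII characters with numeric character references.
            value = value.encode("ascii", "xmlcharrefreplace").decode("ascii")
        clean[key] = value
    # Flag required fields that are absent or empty.
    missing = [field for field in REQUIRED_FIELDS if not clean.get(field)]
    return clean, missing
```

An event whose missing list is not empty could then be held back or logged instead of being passed on to the merge step.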
Finally, this new data has to be merged with the listings database. This is a multi-part process. The merge handler first attempts to find the event's venue and place in the database. If it can't find the place, the event will not be added. If it can't find the venue, it looks for alternative spellings of that venue, and if that fails too, it adds a new venue to the database. After this, it attempts to ascertain whether or not this event has been added before, by checking the source websites against the source URL of the crawled page. If there's a match, the updated information replaces the current information. Finally, the event information is added to the database, and if the event already exists, a new source is added to the database for that event.
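The merge rules above could be sketched roughly as follows, using plain dictionaries in place of a real database. Everything about the data layout is an assumption (the key names, the alias table keyed by lower-cased spelling, the return values), and the case where an existing event gains an additional source is not modelled.

```python
def merge_event(db, event):
    """Merge one distilled event into db and report what happened."""
    if event["place"] not in db["places"]:
        return "skipped"  # unknown place: the event is not added

    venue = event["venue"]
    if venue not in db["venues"]:
        # Try known alternative spellings before creating a new venue.
        venue = db["venue_aliases"].get(venue.lower())
        if venue is None:
            venue = event["venue"]
            db["venues"][venue] = {"place": event["place"]}

    record = dict(event, venue=venue)
    url = event["source_url"]
    # A matching source URL means the event has been seen before,
    # so the stored information is replaced with the new version.
    action = "updated" if url in db["events_by_source"] else "added"
    db["events_by_source"][url] = record
    return action
```

Here db is expected to hold a set of known places, a dictionary of venues, a dictionary mapping alternative spellings to venue names, and a dictionary of events keyed by source URL.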
There is slightly more to it than that, but that is the general principle. If you want to know more, don't hesitate to ask.
The new frontend
I opted to use the HTML5 Boilerplate as the basis for the new layout and aesthetic design of Gatekrash. It's very flexible, works across browsers, and looks a lot tidier.
The layout itself is not dissimilar to the old Gatekrash. The same sections still apply: snapshot, now, soon, popular and search. New (or heavily improved) pages include the place pages, which now list events instead of just venues, and the venue pages. Event pages have also been improved, with clearer information and highlighted performers.
The navigation header at the top of all pages has now been made persistent, with content scrolling underneath it (like new Twitter). This allows access to the core functionality of the site at all times, instead of having to scroll to the top of the page to get there. Gatekrash has a lot of very long pages, and this has been an issue.
Counters are also used across the site to display the number of events in a certain place or venue. This gives the user a quick indication of the busyness of a location. Additionally, counters have been placed at the top of place and venue pages displaying the number of events happening there right now and the number of events happening within the next 7 days, for the best at-a-glance information.
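As an illustration of the two figures shown at the top of place and venue pages, here is a minimal Python sketch. It assumes each event is a (start, end) pair of datetimes, and that the 7 day counter covers events that have not started yet but will start within the next 7 days; both are assumptions, as is the function name.

```python
from datetime import datetime, timedelta


def count_events(events, now):
    """Return (events on right now, events starting in the next 7 days)."""
    week_ahead = now + timedelta(days=7)
    happening_now = sum(1 for start, end in events if start <= now < end)
    coming_up = sum(1 for start, end in events if now < start <= week_ahead)
    return happening_now, coming_up
```

The list passed in would already be limited to the events for a single place or venue.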
Also new are the busyness counters on the front page, now page and soon page. They give a top 10 list (when available) of the busiest places at that point in time, or in the next 24 hours.
The rest of the changes are purely aesthetic in nature - the site has become less utilitarian and minimalist in its design. Event listings also feature a list of 'Go if you like...' performers (when available), based on the event description.
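A top 10 list like this can be built by counting the events per place that overlap a time window. The sketch below assumes each event is a dictionary with "place", "start" and "end" keys, which is a made-up structure for illustration and not the real Gatekrash schema.

```python
from collections import Counter
from datetime import datetime, timedelta


def busiest_places(events, window_start, window_end, limit=10):
    """Return up to `limit` (place, event count) pairs, busiest first."""
    counts = Counter(
        event["place"]
        for event in events
        # Keep events that overlap the window at any point.
        if event["start"] < window_end and event["end"] > window_start
    )
    return counts.most_common(limit)
```

For the 'next 24 hours' version, the window would run from the current time to the current time plus timedelta(hours=24). When fewer than ten places have events in the window, the list is simply shorter.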
Another development I made during the implementation of Gatekrash 2 was the inclusion of a new, documented API. Previously, I had not felt it necessary to include an API to access data, and I regretted that decision. The new API provides the same level of access to Gatekrash as I had when developing the UI.
In fact, I use the API to power the in-page loading of more listings on most pages, and it works reliably. I've also integrated it into some of my other projects - Instant Playlist and Really Very - to provide relevant event listings to those sites.
If you want more detailed information on the API, take a look at the documentation - you can even use it if you want. (If you do use it, I'd find it really interesting to see what you're doing with it!)
I'll be continually adding new features to Gatekrash. The next major feature is a rewrite of Gatekrash Mobile, which no longer functions since the rewrite of the main site.
If you'd like more information on the new version of Gatekrash and how it works, don't hesitate to contact me!