Automated local event listings aggregator
- Case study by John
- October 13th, 2010
- Updated on March 21st, 2011
UPDATE: This version of Gatekrash has been discontinued. Why not read about the latest version of Gatekrash?
Gatekrash was born from the idea that event listings for the places around you should be more easily accessible. The main focus of the site is the presentation of 'now and next' information for events happening in your local area (in a similar way to the 'now and next' on television guides). Events currently in progress where the user is would be listed, and events starting next would be listed alongside this information in order to give the user a snapshot overview of what's going on.
A distilled version of the site's main functionality is as follows:
- Collect event information from various sources
- Aggregate this event information
- Attach meaningful data to this event information, including venue and location data
- Present these event listings in a now and next form to users based on where they physically are in the country
At the moment, the site collects event information from several websites, but I plan to extend it to cover a larger number of sites. In the future, it will also accept data in other forms (including RSS feeds from venues, the vCal format, and so on).
How it all works
Before the site can do anything, it needs event listings. It gets them from several websites by 'crawling' them and extracting the information from their pages. Crawling is the process of visiting a site and downloading its pages (or, in this case, only the pages which contain event information). The site uses a crawler written in Python, which uses a whitelist of URL patterns to retrieve only relevant pages (skipping others, such as contact pages, about pages and so on). The crawler is also configurable in other ways, such as limiting the amount of data downloaded from a site, setting the rate of crawling and setting the maximum number of pages to crawl.
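As a rough illustration of the whitelist idea (this is not the actual Gatekrash crawler), the sketch below filters candidate URLs against a list of regular expressions and stops at a page limit. The class name `CrawlConfig`, the function `select_urls`, the field names and all default values are assumptions made up for this example.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrawlConfig:
    # Hypothetical settings; the names and defaults are illustrative only.
    allowed_patterns: List[str] = field(default_factory=list)
    max_pages: int = 500            # maximum number of pages to crawl
    max_bytes: int = 50_000_000     # cap on data downloaded from one site
    delay_seconds: float = 2.0      # pause between requests (crawl rate)

    def __post_init__(self):
        # Compile the whitelist once so each URL check is cheap.
        self._compiled = [re.compile(p) for p in self.allowed_patterns]

    def is_allowed(self, url: str) -> bool:
        """Return True if the URL matches at least one whitelist pattern."""
        return any(p.search(url) for p in self._compiled)


def select_urls(urls, config):
    """Return whitelisted URLs in order, skipping repeats, up to max_pages."""
    seen = set()
    selected = []
    for url in urls:
        if url in seen or not config.is_allowed(url):
            continue
        seen.add(url)
        selected.append(url)
        if len(selected) >= config.max_pages:
            break
    return selected
```

A real crawler would also enforce `max_bytes` and sleep for `delay_seconds` between downloads; those are left out here to keep the filtering logic visible.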
Once the crawler has downloaded all these pages, the information is extracted from them. Currently this is done by a custom parser for each website that is crawled, but in the future I hope to have some form of 'intelligent' system in place to recognise certain pieces of information and extract them on its own (the idea being that this will allow me to crawl many more websites without having to write a custom parser for each). Each parser extracts certain key information from the pages, such as the event title, date, start and end times, venue, town/city, etc. All of this information is required before a listing is allowed into the database - without any single part of it, the listing is essentially useless. Only complete listings will be added to the database.
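The 'only complete listings' rule can be sketched as a simple check on each parsed listing. The field names in `REQUIRED_FIELDS` and the dictionary-based listing format are assumptions for the sake of the example, not the site's real schema.

```python
# Hypothetical field names, based on the key information listed above.
REQUIRED_FIELDS = ("title", "date", "start_time", "end_time", "venue", "town")


def is_complete(listing: dict) -> bool:
    """Return True only if every required field is present and non-blank."""
    for name in REQUIRED_FIELDS:
        value = listing.get(name)
        if value is None or not str(value).strip():
            return False
    return True


def keep_complete(listings):
    """Drop any listing that is missing part of the required information."""
    return [listing for listing in listings if is_complete(listing)]
```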
Once all the information has been collected from the pages, sanitised and stored in a temporary database, it is 'distilled'. This distillation process does several things to each event stored in the temporary database:
- It trims whitespace from the data where required
- It attempts to find spelling errors in placenames (using the UK Gazetteer database), and correct them
- It attempts to detect duplicate listings, and combines them
- It attaches location data to the listings (longitude and latitude)
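The four steps above might look something like the following sketch. It is only one plausible approach: the tiny `GAZETTEER` dictionary (with approximate coordinates) stands in for the UK Gazetteer database, fuzzy matching with the standard library's `difflib` is my stand-in for whatever spelling correction the site really uses, and the duplicate rule (same title, date, venue and town) is an assumption.

```python
import difflib

# Illustrative stand-in for the UK Gazetteer: place -> (latitude, longitude).
# Coordinates are approximate.
GAZETTEER = {
    "Manchester": (53.48, -2.24),
    "Leeds": (53.80, -1.55),
    "London": (51.51, -0.13),
}


def correct_placename(name, gazetteer=GAZETTEER, cutoff=0.8):
    """Return the closest known placename, or None if nothing is close enough."""
    lowered = {place.lower(): place for place in gazetteer}
    matches = difflib.get_close_matches(
        name.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


def distill(events, gazetteer=GAZETTEER):
    """Trim whitespace, fix placenames, attach coordinates, merge duplicates."""
    distilled = {}
    for event in events:
        # 1. Trim whitespace from every string field.
        cleaned = {k: v.strip() if isinstance(v, str) else v
                   for k, v in event.items()}
        # 2. Correct the placename if a close gazetteer match exists.
        corrected = correct_placename(cleaned["town"], gazetteer)
        if corrected is not None:
            cleaned["town"] = corrected
        # 4. Attach location data (None when the place is unknown).
        cleaned["latitude"], cleaned["longitude"] = gazetteer.get(
            cleaned["town"], (None, None)
        )
        # 3. Assumed duplicate rule: same title, date, venue and town.
        key = (cleaned["title"].lower(), cleaned["date"],
               cleaned["venue"].lower(), cleaned["town"].lower())
        if key in distilled:
            # Combine: keep the first listing, fill in fields it lacks.
            for name, value in cleaned.items():
                distilled[key].setdefault(name, value)
        else:
            distilled[key] = cleaned
    return list(distilled.values())
```

With this rule, a listing for 'Manchestr' and a second listing for 'Manchester' with the same title, date and venue end up as a single event.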
Once this process has been completed, the process of merging this data with the 'live' database can begin. This is the process of moving listings from the temporary database into the database used to serve the website and mobile site. During this process, listings are compared to detect duplicates (if a listing being added duplicates an existing one, the two are combined into one listing). After this process has been completed, the files from the crawl are deleted, and the temporary database is emptied ready for the next crawl. The whole process is automated and runs during the night, usually once or twice a week.
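To give a feel for the merge step, here is a minimal sketch using SQLite. It makes several simplifying assumptions: the temporary and live listings are two tables (`temp_events` and `live_events`, both hypothetical names) in one database rather than separate databases, a duplicate means an identical title, date, venue and town, and 'combining' is reduced to keeping the existing live row.

```python
import sqlite3

# Hypothetical schema; the UNIQUE constraint defines what counts as a duplicate.
SCHEMA = """
CREATE TABLE IF NOT EXISTS {name} (
    title TEXT NOT NULL,
    date  TEXT NOT NULL,
    venue TEXT NOT NULL,
    town  TEXT NOT NULL,
    UNIQUE (title, date, venue, town)
)
"""


def create_tables(conn):
    for name in ("temp_events", "live_events"):
        conn.execute(SCHEMA.format(name=name))


def merge_into_live(conn):
    """Copy new listings into live_events, empty temp_events, return count added."""
    with conn:  # commit on success, roll back if anything fails
        before = conn.execute("SELECT COUNT(*) FROM live_events").fetchone()[0]
        # INSERT OR IGNORE skips rows that would violate the UNIQUE constraint.
        conn.execute(
            "INSERT OR IGNORE INTO live_events (title, date, venue, town) "
            "SELECT title, date, venue, town FROM temp_events"
        )
        # Empty the temporary table ready for the next crawl.
        conn.execute("DELETE FROM temp_events")
        after = conn.execute("SELECT COUNT(*) FROM live_events").fetchone()[0]
    return after - before
```

Running the copy and the clean-up in a single transaction means a failed merge leaves the temporary listings in place for another attempt.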
When you visit the site
When you visit the site, several things happen (depending on which page you land on). Firstly, the site attempts to detect where you are in the UK based on your IP address and the MaxMind GeoLiteCity database. This gives a rough guess of where you are (if you're not in the UK, it defaults to London). What happens next depends on which page you are on. For this example, we'll assume you're on the front page (a.k.a. the 'snapshot'). The site attempts to retrieve events from where it thinks you are, both happening right now and starting soon. If it can't find any events happening right now or starting soon where you are, it'll expand the parameters of the event listings to include nearby towns and cities (based on how close they are to your location, using the UK Gazetteer database).
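The 'now and next' lookup with its fallback to nearby places could be sketched as below. This is an approximation under stated assumptions: events are plain dictionaries with `location`, `start` and `end` keys, 'starting soon' means within three hours, and the search widens through a fixed series of radii (5, 25 and 75 km). None of those names or values come from the real site, which works from towns and cities in the gazetteer rather than raw radii.

```python
import math
from datetime import datetime, timedelta


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def snapshot(events, user_location, now,
             soon=timedelta(hours=3), radii_km=(5, 25, 75)):
    """Return (happening_now, starting_soon, radius_used).

    Starts with the smallest radius and only widens the search when
    nothing is happening now or starting soon inside it.
    """
    for radius in radii_km:
        nearby = [e for e in events
                  if haversine_km(user_location, e["location"]) <= radius]
        happening_now = [e for e in nearby if e["start"] <= now < e["end"]]
        starting_soon = [e for e in nearby if now < e["start"] <= now + soon]
        if happening_now or starting_soon:
            return happening_now, starting_soon, radius
    return [], [], radii_km[-1]
```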
Another example based on the same method of guessing the user's location is the event pages. When a user visits an event page, and the event is not in the same town or city that the site thinks they're in, it'll estimate travel time between cities for the user based on the distance between them geographically.
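A travel-time estimate of this kind can be as simple as dividing the straight-line distance by an assumed average speed. In the sketch below, the function names and the 60 km/h default are illustrative guesses rather than the values the site uses, and the result ignores roads, traffic and mode of transport.

```python
import math


def distance_km(origin, destination):
    """Great-circle (haversine) distance in km between (lat, lon) points."""
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, destination)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def estimate_travel_minutes(origin, destination, average_speed_kmh=60.0):
    """Rough travel time in whole minutes at an assumed average speed."""
    hours = distance_km(origin, destination) / average_speed_kmh
    return round(hours * 60)
```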
The design, layout and aesthetics
The site's design is very minimalist and purely functional in nature. I have found that when people are looking for information, they want the site to be quick to load, easy to read, and have as few clicks as possible between the page they're on and the page they want. Adding lots of flashy graphics and non-standard navigation choices would impede this. To this end, all text is of reasonable size (some may argue too large), and is black on white. Colour is used to highlight certain key areas of the page that may need the user's attention (such as prompts, or essential information).
Pages are mostly split into two sections: the left hand side of the page holds the list of events, and the right hand side holds ancillary information. On the snapshot, now and soon pages, event listings are displayed on the left hand side of the page, with a row of buttons at the top of the page for narrowing or expanding the list of events based on location. The right hand side of the page is devoted to several elements designed to describe the purpose of the page, highlight events or venues, display advertisements, and promote the mobile site and the Facebook page. Search, popular, place and venue pages display similar information (albeit more specific to those pages' functions).
If you'd like to learn more about Gatekrash, feel free to contact me at any time!