Completely rebuilding a website
After a year of maintaining the current version of Gatekrash sporadically, I've decided that now's the time to completely revamp it.
Update 7/2/11: the main UI is done, finishing up the listings aggregator
I first designed and developed Gatekrash this time last year (well, actually a bit earlier). I was learning Python as I went along, mostly for the spider and page scraper. The end result is a less-than-desirable codebase which is becoming a pain to maintain and add new features to, and I'm fairly certain it wouldn't scale well.
So, with all that in mind, I've decided to rebuild the whole site from the ground up. Admittedly, there isn't much to rebuild - the site's user-facing element is very simple. All it has to do is retrieve a list of events for a given place (or places) which are happening in a certain timeframe.
There are a variety of other aspects to the site though, one of the foremost being the event retriever - the spider and page scrapers. I'll be improving the spider to make it more stable (and to obey robots.txt properly). At the moment, the spider and page scrapers do not run at the same time, but this may change. In addition to making the spider better, I'll also be giving it the ability to read RSS feeds properly, so that it can retrieve many more event listings in a well-formatted and accurate form.
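As a rough illustration of those two spider improvements, here is a minimal sketch using only the Python standard library. It is not the actual Gatekrash code: the user-agent name, the sample robots.txt and the sample feed are all made up for the example, and a real spider would fetch both documents over the network rather than take them as strings.

```python
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

USER_AGENT = "GatekrashSpider"  # hypothetical user-agent name


def build_robots(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, url):
    """Return True if the rules allow our user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def parse_rss_items(rss_xml):
    """Extract title, link and publication date from each RSS 2.0 <item>."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return items


# Made-up sample data, purely for illustration.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /admin/
"""

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Example venue listings</title>
<item>
<title>Open mic night</title>
<link>http://example.com/events/1</link>
<pubDate>Fri, 14 Jan 2011 20:00:00 GMT</pubDate>
</item>
</channel></rss>"""

if __name__ == "__main__":
    robots = build_robots(SAMPLE_ROBOTS)
    for entry in parse_rss_items(SAMPLE_RSS):
        if may_fetch(robots, entry["link"]):
            print("would crawl:", entry["title"], entry["link"])
```

Checking every link against the robots.txt rules before crawling keeps the politeness logic in one place, whether the link came from a feed or from following pages.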
The page scrapers will also become more sophisticated over time. One of the new features I'm aiming to implement over the next few weeks and months (depending on my other workloads) is detecting the type of event, and whether the event features any notable performers (or similar). Depending on the pages being scraped, this data may not be available through page scraping alone, which would imply that I need some method of recognising and extracting named entities.
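To give a flavour of what the simplest version of this could look like, here is a sketch that guesses an event type from keywords and looks for performers from a known list. This is a plain lookup-based stand-in, not real named-entity recognition, and the keyword table and performer names are invented for the example.

```python
import re

# Hypothetical keyword table: event type -> words that suggest it.
EVENT_TYPE_KEYWORDS = {
    "music": ("gig", "concert", "live music", "dj"),
    "comedy": ("comedy", "stand-up"),
    "theatre": ("play", "theatre", "musical"),
}

# Hypothetical list of performers we already know about.
KNOWN_PERFORMERS = ("Example Band", "Jane Doe")


def _contains_phrase(lowered_text, phrase):
    """Whole-word, case-insensitive match of phrase in already-lowered text."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return re.search(pattern, lowered_text) is not None


def guess_event_type(text):
    """Return the first event type whose keywords appear, or None."""
    lowered = text.lower()
    for event_type, keywords in EVENT_TYPE_KEYWORDS.items():
        if any(_contains_phrase(lowered, keyword) for keyword in keywords):
            return event_type
    return None


def find_performers(text):
    """Return the known performers mentioned in the text."""
    lowered = text.lower()
    return [name for name in KNOWN_PERFORMERS if _contains_phrase(lowered, name)]


if __name__ == "__main__":
    listing = "Stand-up comedy night with Jane Doe"
    print(guess_event_type(listing), find_performers(listing))
```

A lookup like this only finds names it already knows, which is exactly the limitation that would push towards a proper entity extraction method.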
Another major improvement is to the way that the event data is stored. This is mainly a reorganisation of the database and its contents: defining a better schema, decoupling tables where I can, and adding more information for venues, places, and events. It'll also allow me to keep track of the last time an event listing was updated, so that events can be recrawled more frequently where appropriate and the listings stay accurate. I've also been using a custom data access class (i.e. I rolled my own) to retrieve data from the database in PHP for the user-facing element of the site. Without further development, this is becoming increasingly pointless to maintain, especially when there are dozens of excellent, full-featured and maintained PHP classes which do the same job!
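To make the idea concrete, here is a minimal sketch of what a decoupled schema with a last-updated timestamp might look like. It uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the table names, column names and query helpers are my assumptions for the example, not the real Gatekrash schema.

```python
import sqlite3

# Hypothetical schema: places, venues and events in separate tables.
SCHEMA = """
CREATE TABLE places (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE venues (
    id       INTEGER PRIMARY KEY,
    place_id INTEGER NOT NULL REFERENCES places(id),
    name     TEXT NOT NULL
);
CREATE TABLE events (
    id           INTEGER PRIMARY KEY,
    venue_id     INTEGER NOT NULL REFERENCES venues(id),
    title        TEXT NOT NULL,
    starts_at    TEXT NOT NULL,  -- ISO 8601, so text comparison sorts by time
    source_url   TEXT,
    last_updated TEXT NOT NULL   -- ISO 8601 time of the last crawl
);
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def events_for_places(conn, place_names, start, end):
    """Events in any of the given places with start <= starts_at < end."""
    if not place_names:
        return []
    placeholders = ",".join("?" for _ in place_names)
    sql = (
        "SELECT e.title, v.name, p.name, e.starts_at "
        "FROM events e "
        "JOIN venues v ON v.id = e.venue_id "
        "JOIN places p ON p.id = v.place_id "
        f"WHERE p.name IN ({placeholders}) "
        "AND e.starts_at >= ? AND e.starts_at < ? "
        "ORDER BY e.starts_at"
    )
    return conn.execute(sql, [*place_names, start, end]).fetchall()


def stale_events(conn, cutoff):
    """Events whose listing was last updated before the cutoff time."""
    sql = (
        "SELECT id, source_url FROM events "
        "WHERE last_updated < ? ORDER BY last_updated"
    )
    return conn.execute(sql, (cutoff,)).fetchall()
```

The first helper covers the user-facing lookup (events for one or more places within a timeframe), and the second is the sort of query a recrawl scheduler could use to pick listings that have not been refreshed recently.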
Finally, and most noticeably to everybody who uses the site, I'll be giving it a facelift. It is fairly dull at the moment. I'm in the process of redesigning several of my websites to give them a fresher, cleaner look. I'm redesigning Gatekrash in HTML5 (using HTML5 Boilerplate). I'm midway through the redesign, and it already looks a lot prettier than the current design.
This reimagining of the site should not take as long to implement as the current version did (the current version took approximately 2 months, but I had to learn a lot of Python for that!). I'd estimate this one to be completed at some point in January 2011 (my original estimate was before the new year).