Web Scraping Adventures

When I moved from Vancouver to Montreal I was disappointed to find that there was nothing like livemusicvancouver.com, a great database-driven site for concert listings powered by user submissions. There was just montrealshows.com, which is mildly inconvenient to access and very inconvenient to use. It functions wonderfully as a forum about local goings-on, but for a quick look at what’s playing tonight it didn’t meet my requirements.

I set out to create an engine that would allow people to sign up and post their own shows, thinking that the Montreal Shows people and the Montreal community would embrace it. I offered the Montreal Shows folks use of the engine free of charge, but critical mass (and other social factors) are powerful things and I found it a bit of a hard sell. That was despite the fact that my site is free and lets users get their shows posted instantly, with lots of great extras like automatic Google Maps links to the venue, automatic linking to the band website, and so on.

After growing tired of posting by hand the shows I had picked up from various sources like the Mirror and Montreal Shows, it occurred to me that I might be able to use web scraping (I wasn’t even familiar with the term at the time) to grab listings from other sites via a cron job (an automated process that could run, say, once every night to look for newly posted shows).

The problem was that the other sites didn’t offer any kind of standardized listings, so making this work reliably was close to impossible, though I did initially get it basically working. Glitches in the process proved too frustrating and I gave up, thinking I might pick the idea up again if the other sites ever implemented any kind of standards in the way they handled their listings.

About a year after I began my scraping project, Montreal Shows finally implemented something of a database. I was initially optimistic because they introduced an RSS feed, but it turned out that the feed is still not particularly useful, as it doesn’t list what venue a band is playing at. Scraping hopes were dashed again!

Then I noticed that you can actually bring up a somewhat standardized detail page for a show that contains a database primary key ID, venue, bands and date. I realized that with some regular expression work I could finally make my scraping idea work.

The way I finally accomplished this was to add a ‘source_id’ field to my shows database table. I then have a cron job that runs nightly. It first looks up the last entered source_id in my table and uses that as the start of its range, then loops through the external pages one by one. If it finds a show, it begins the process of scraping the data. If not (in other words, we’ve reached the end of their database), the process dies.
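The nightly loop can be sketched roughly like this. My real job is written in PHP; this is a Python illustration only, and the names scrape_new_shows, fetch_show, save_show and max_pages are made up for the example. Starting at the id right after the last stored one is also an assumption about how the range is handled.

```python
from typing import Callable, Optional


def scrape_new_shows(
    last_source_id: int,
    fetch_show: Callable[[int], Optional[dict]],
    save_show: Callable[[dict], None],
    max_pages: int = 500,
) -> int:
    """Walk source ids upward from the last stored one; stop at the first gap.

    fetch_show and save_show are hypothetical callbacks: the first returns
    the scraped fields for an id (or None when there is no show), the second
    stores a finished record.
    """
    source_id = last_source_id + 1  # assumption: resume just after the last id
    saved = 0
    for _ in range(max_pages):  # hard cap so a bad night cannot loop forever
        show = fetch_show(source_id)
        if show is None:
            break  # no show on this page: treat it as the end of their data
        save_show(dict(show, source_id=source_id))
        saved += 1
        source_id += 1
    return saved


# Usage with stand-in data instead of real HTTP requests and a real table.
fake_pages = {101: {"bands": "The Foo Bars"}, 102: {"bands": "DJ Example"}}
stored = []
count = scrape_new_shows(100, fake_pages.get, stored.append)
```

Passing the fetch and save steps in as functions keeps the loop testable without touching the network or the database.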

If we do find a show, we start by using regular expressions to target certain pieces of information on the page: the id (we actually already have this from our loop), the bands, the venue and the date. I found the rest of their information still non-standardized. Even the fields we’re grabbing are not perfectly standardized (date formats change randomly, so on their end the date must be a varchar field rather than a proper datetime field), but I can work around this for the most part to produce a satisfactory result.
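Here is a small sketch of the regular expression step, in Python rather than the PHP I actually use. The HTML in SAMPLE_PAGE is invented for illustration, since the real detail page markup isn’t reproduced in this post, so the patterns are assumptions too.

```python
import re
from typing import Optional

# Hypothetical markup: a stand-in for the real detail page.
SAMPLE_PAGE = """
<h1 class="bands">The Foo Bars / DJ Example</h1>
<span class="venue"> Example Hall </span>
<span class="date">June 20, 2008</span>
"""

# One pattern per field; each captures the text between the tags.
FIELD_PATTERNS = {
    "bands": re.compile(r'<h1 class="bands">(.*?)</h1>', re.S),
    "venue": re.compile(r'<span class="venue">(.*?)</span>', re.S),
    "date": re.compile(r'<span class="date">(.*?)</span>', re.S),
}


def extract_fields(html: str) -> Optional[dict]:
    """Return the raw bands, venue and date strings, or None if any is missing."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(html)
        if match is None:
            return None  # the page does not look like a show detail page
        fields[name] = match.group(1)
    return fields


raw_fields = extract_fields(SAMPLE_PAGE)
```

The values come back untrimmed on purpose, since cleanup is a separate step.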

I then run the variables we’ve grabbed via regex through various PHP processing functions like trim and split to clean everything up as much as possible, and then process whatever we’ve received as a date into something that will hopefully be recognizable to an automated string-to-date conversion.
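The cleanup and date handling might look something like the following in Python (again, not my actual PHP). The list of formats in DATE_FORMATS and the "/" band separator are guesses for the sake of the example, and the month names assume an English locale.

```python
from datetime import datetime
from typing import List, Optional

# Assumed formats; the real listings vary more than this.
DATE_FORMATS = ("%Y-%m-%d", "%B %d %Y", "%b %d %Y", "%d %B %Y", "%m/%d/%Y")


def clean_text(raw: str) -> str:
    """Trim the ends and collapse runs of whitespace to single spaces."""
    return " ".join(raw.split())


def split_bands(raw: str, separator: str = "/") -> List[str]:
    """Split a bands string on the separator and drop empty pieces."""
    return [clean_text(part) for part in raw.split(separator) if part.strip()]


def parse_show_date(raw: str) -> Optional[datetime]:
    """Try each known format in turn; return None if none of them fit."""
    text = clean_text(raw).replace(",", "")
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None


bands = split_bands("The Foo Bars /  DJ Example ")
when = parse_show_date("  June 20,  2008 ")
```

Returning None for an unreadable date, instead of guessing, lets a later step decide whether to skip the show.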

Next we search the venue variable against our existing venues. If the venue does not exist in our table, we create a venue record, then do a lookup to get the venue id for use as a foreign key in our show record.
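As a rough sketch, the venue lookup could be written like this in Python with SQLite, rather than my actual PHP and database. The venues table layout and the function name get_or_create_venue are assumptions, and so is the case-insensitive match. This version also takes the new id straight from the insert instead of doing a second lookup.

```python
import sqlite3


def get_or_create_venue(conn: sqlite3.Connection, name: str) -> int:
    """Return the id of the venue with this name, creating the row if needed."""
    name = name.strip()
    row = conn.execute(
        # Assumption: match names case-insensitively to avoid near-duplicates.
        "SELECT id FROM venues WHERE name = ? COLLATE NOCASE",
        (name,),
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute("INSERT INTO venues (name) VALUES (?)", (name,))
    conn.commit()
    return cursor.lastrowid  # id to use as the foreign key on the show record


# Usage against a throwaway in-memory database with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE venues (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
venue_id = get_or_create_venue(conn, "Example Hall")
```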

Now we check to make sure that the date is in the future, and that the bands variable is not ‘Error’. Conveniently enough, this is what we get if we’ve reached the end of their database. If the band is called ‘Error’ we let the process die; otherwise we insert the show, then move on to the next page on their site, i.e. the next database record.
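The final checks boil down to a three-way decision, sketched here in Python rather than PHP. The function name next_action is made up, and skipping a past-dated or unparseable show (rather than stopping the whole run) is my assumption for the example.

```python
from datetime import datetime
from typing import Optional

END_OF_LISTINGS = "Error"  # what the bands field holds past the last record


def next_action(bands: str, show_date: Optional[datetime], now: datetime) -> str:
    """Decide what to do with one scraped page: 'stop', 'skip' or 'insert'."""
    if bands.strip() == END_OF_LISTINGS:
        return "stop"  # end of their database: let the process die
    if show_date is None or show_date < now:
        return "skip"  # assumption: ignore shows already past or undated
    return "insert"


now = datetime(2008, 6, 17)
decision = next_action("The Foo Bars", datetime(2008, 6, 20), now)
```

Passing the current time in as an argument keeps the check easy to test.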

This has been working like a charm so far and I have some ideas to enhance my algorithms to grab a little bit more information to insert into our database, like venue addresses for new venues, and just generally reduce the possibility of errors.

It was nice to finally complete the project, even if it was a year later than I thought. Knowing that this kind of thing can be both possible and effective was satisfying, and it makes me curious to explore the concept further for other applications.

Jun 17, 2008
