Darren Mothersele

Software Developer

Warning: You are viewing old, legacy content. Kept for posterity. Information is out of date. Code samples probably don't work. My opinions have probably changed. Browse at your own risk.

Webbots - Saving Time, and more

Apr 11, 2008

web-dev

cURL is amazing. If you're not familiar with this particular technology then let me introduce it now: cURL (and more specifically the PHP interface for cURL) lets the programmer communicate with servers over a wide range of protocols. I've been using it recently to build webbots to automate some tedious tasks, and to do things that no normal human user of the internet would ever want to do. Here's an example use of this technology, and how I went about building it...
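As a taste of how simple it is, here's a minimal sketch of a single cURL request in PHP - it just fetches a page into a string. The URL is a placeholder, not one of the real sites mentioned in this post:

```php
<?php
// Minimal cURL sketch: fetch a page and return it as a string
// instead of printing it straight to output.
$ch = curl_init('https://example.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return response as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects

$html = curl_exec($ch);
if ($html === false) {
    die('cURL error: ' . curl_error($ch));
}
curl_close($ch);

echo strlen($html) . " bytes downloaded\n";
```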

I've been working on an accounting interface for a client. I've built a front end that allows my client's clients to log in and view their recent sales activity. They can view summaries of best-selling lines broken down by artist and geographic region. They can also plot sales for any specified time period onto a Google map. It's pretty cool, and it's had some great feedback from users. Behind the scenes there are a few tables to store the sales information and manage access control to the data. The tricky bit was getting the sales information into the database. Here's where the webbots come in useful!

The sales information is downloaded from third party sites where the actual sales take place. This can be done regularly, but requires a human to actually log into a secure site, select which reports to download, unpack them from their archive, and then convert and upload them to the sales database.

I programmed a webbot to automate this process, but it took a bit of trial and error, as the remote site is extremely well secured! If you've tried to automate downloads from secure sites (over HTTPS with an HTML form login) then this might help:

  1. The first step is to make the initial cURL connection and download the login page, then pass the HTML through a regex parser to extract the all-important form ID. (There's a sketch of the whole flow after this list.)
  2. Then create your next cURL request to the server using the correct form variables, filling in your username and password as applicable. Pausing for a few seconds and setting the correct referrer is good practice - you want your webbot to act as much like a real user of the site as possible. In building a webbot you're just trying to make your life easier - you're not stress testing the remote server - so take it easy on the requests and make sure they all look like natural requests that the server would expect.
  3. The server I was connecting to required that I save cookie variables, so that it knew I was the same login as before. This is made easy by the built-in cookie functionality of cURL.
  4. Now I get a series of forms back from the server that I have to parse with regexes to find the available reports. I keep a record of previously downloaded reports, so I know which ones I'm after.
  5. Create a cURL request for each file to download. Extract the files from the responses, and remember to wait a reasonable amount of time between requests.
  6. When you're done, do all the necessary tidying up. You should send a logout request and delete any stored cookies. You may also have temporary files to remove.
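Here's a rough sketch of that whole flow in PHP. All the URLs, form field names, and regexes below are hypothetical stand-ins for the real (private) site, so treat this as a template rather than working code for any particular service:

```php
<?php
// Hypothetical sketch of the login-and-download flow described above.
// URLs, field names, and regexes are placeholders for the real site.
$cookieFile = '/tmp/webbot-cookies.txt';

$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);   // save session cookies here
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);  // send them back on later requests
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0'); // look like a normal browser

// Step 1: fetch the login page and pull out the form ID.
curl_setopt($ch, CURLOPT_URL, 'https://example.com/login');
$loginPage = curl_exec($ch);
preg_match('/name="form_id" value="([^"]+)"/', $loginPage, $m);
$formId = $m[1];

// Step 2: post the login form, with a pause and a referrer so the
// request looks like a real user submitting the page.
sleep(3);
curl_setopt($ch, CURLOPT_REFERER, 'https://example.com/login');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'form_id'  => $formId,
    'username' => 'me@example.com',
    'password' => 'secret',
]));
curl_exec($ch); // step 3: cURL keeps the session cookies for later requests

// Step 4: fetch the reports page and find the report links.
sleep(3);
curl_setopt($ch, CURLOPT_HTTPGET, true); // back to GET requests
curl_setopt($ch, CURLOPT_URL, 'https://example.com/reports');
$reportsPage = curl_exec($ch);
preg_match_all('/href="(\/reports\/[^"]+\.zip)"/', $reportsPage, $m);

// Step 5: download each report, pausing between requests.
@mkdir('reports');
foreach ($m[1] as $path) {
    sleep(5); // be gentle with the remote server
    curl_setopt($ch, CURLOPT_URL, 'https://example.com' . $path);
    file_put_contents('reports/' . basename($path), curl_exec($ch));
}

// Step 6: tidy up - log out and delete the stored cookies.
curl_setopt($ch, CURLOPT_URL, 'https://example.com/logout');
curl_exec($ch);
curl_close($ch);
unlink($cookieFile);
```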

Now I've got a PHP program that uses cURL to log in to the remote sites and download any new sales reports. There's another program to unpack them and import the records into the database. I call both of these from the crontab and schedule downloads every night.
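For what it's worth, the crontab entries look something like this (the paths and script names here are made up for illustration):

```
# Download new sales reports at 2am, import them an hour later
0 2 * * * /usr/bin/php /home/me/webbot/download_reports.php
0 3 * * * /usr/bin/php /home/me/webbot/import_reports.php
```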

I'm now looking for more uses of webbots, as I'm sure they can automate lots more of the routine things that need doing.

If you're a PHP programmer, start with the PHP documentation for cURL that I linked to above. Michael Schrenk's website has some useful wrapper functions that make cURL even easier to use.