Darren Mothersele

London based Drupal Web Developer

Webbots - Saving Time, and more

cURL is amazing. If you're not familiar with it, let me introduce it now: cURL (and more specifically the PHP interface for cURL) lets a program communicate with servers over a range of protocols, including HTTP, HTTPS and FTP. I've been using it recently to build webbots that automate tedious tasks and do things that no normal human user of the internet would ever want to do. Here's one example, and how I went about building it...

I've been working on an accounting interface for a client. I've built a front end that allows my client's clients to log in and view their recent sales activity. They can view summaries of best-selling lines broken down by artist and geographic region. They can also plot sales for any specified time period on a Google map. It's pretty cool, and it got some great feedback from users. Behind the scenes there are a few tables to store the sales information and manage access control to the data. The tricky bit was getting the sales information into the database. Here's where the webbots come in useful!

The sales information is downloaded from third party sites where the actual sales take place. This can be done regularly, but requires a human to actually log into a secure site, select which reports to download, unpack them from their archive, and then convert and upload them to the sales database.

I programmed a webbot to automate this process, but it took a bit of trial and error, as the remote site was extremely well secured! If you've tried to automate downloads from secure sites (over HTTPS with an HTML form login) then this might help:

  1. The first step was to make the initial cURL connection and download the login page. I then pass this through a regex parser to extract the all-important form ID.
  2. Then create your next cURL request to the server using the correct form variables, filling in your username and password as applicable. Pausing for a few seconds and setting the correct referrer is good practice - you want your webbot to act as much like a real user of the site as possible. In building a webbot, you're just trying to make your life easier - you're not stress testing the remote server - so take it easy on the requests and make sure they all look like natural requests that the server would expect.
  3. The server I was connecting to required that I save cookie variables, so that it knew I was the same login as before. This is made easy by the built-in cookie functionality of cURL.
  4. Now I get a series of forms returned from the server that I have to parse through regex to find the available reports. I've kept a record of previously downloaded reports, so I know which ones I'm after.
  5. Create cURL requests for each file to download. Extract the files out of the responses and remember to wait a reasonable amount of time between each request.
  6. When you're done, do all the necessary tidying up. You should send a logout request and delete any stored cookies. You may also have temporary files to remove.
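As a rough illustration of steps 4 and 5, here is a minimal sketch in Python (the language used in the comments below), not the PHP bot described in this post. Everything named here is an assumption made for the example: the `REPORT_LINK` pattern supposes that reports are offered as plain links ending in `.zip`, and `find_report_links`, `select_new_reports` and `download_all` are hypothetical helpers. The `fetch` argument stands in for whatever performs the real request, such as a cURL handle that reuses the logged-in cookie jar.

```python
import re
import time

# Assumption: each report appears as a link such as
# <a href="/reports/sales-01.zip">. A real site's markup will differ.
REPORT_LINK = re.compile(r'href="([^"]+\.zip)"', re.IGNORECASE)


def find_report_links(html):
    """Return the report URLs found in the page, in page order, without repeats."""
    links = []
    for url in REPORT_LINK.findall(html):
        if url not in links:
            links.append(url)
    return links


def select_new_reports(links, already_downloaded):
    """Keep only the reports missing from the record of earlier downloads."""
    return [url for url in links if url not in already_downloaded]


def download_all(urls, fetch, pause_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to go easy on the server.

    `fetch` is a caller-supplied function that takes a URL and returns the
    response body; `sleep` can be replaced when testing.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause_seconds)  # no pause is needed before the first request
        results[url] = fetch(url)
    return results
```

Passing `fetch` and `sleep` in as arguments keeps the selection and pacing logic separate from the network code, so it can be checked against a saved copy of the page before the bot is pointed at the live site.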

Now I've got a PHP program that uses cURL to log in to the remote sites and download any new sales reports. There's another program to unpack them and import the records into the database. I call both of these from the crontab and schedule downloads every night.
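The unpack-and-import half might look something like the sketch below. It is written in Python rather than PHP and rests on several assumptions that the post does not confirm: that each report arrives as a zip archive containing CSV files, that the CSV columns are named `artist`, `region`, `sold_on` and `quantity`, and that the records go into a table called `sales`. SQLite is used only so the example is self-contained.

```python
import csv
import io
import sqlite3
import zipfile


def rows_from_archive(archive_bytes):
    """Yield one dict per CSV row from every .csv file inside a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip readme files and anything else in the archive
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text)


def import_sales(conn, rows):
    """Insert rows into the (hypothetical) sales table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales ("
        "artist TEXT, region TEXT, sold_on TEXT, quantity INTEGER)"
    )
    params = [
        (row["artist"], row["region"], row["sold_on"], int(row["quantity"]))
        for row in rows
    ]
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", params)
    return len(params)
```

Wrapping the inserts in a single transaction means a malformed report leaves the table untouched instead of half-imported, which matters when the job runs unattended overnight.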

I'm now looking for more uses of webbots, as I'm sure they can automate lots more of the routine things that need doing.

If you're a PHP programmer, start with the PHP documentation for cURL that I linked to above. Michael Schrenk's website has some useful wrapper functions that make cURL even easier to use.

Thanks Darren - I've been looking for ways to implement cURL in a web bot, and I appreciate your insights. -hb

Hi Darren, I am trying to log in to a Drupal site. If you could give me some pointers, it would be a big help.
http://groups.google.com/group/comp.lang.python/browse_thread/thread/08c...

thanks,
-david

I don't know Python too well, but it doesn't look like you are providing a form ID or a form build ID. These are hidden values in the login form. The form_id's value is user_login. I think you will need to provide this and a valid build ID. You may need to use curl to fetch the login page and parse the build ID from it. It's the hidden form element called "form_build_id".
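A minimal sketch of that parsing step, using only the Python standard library, is shown here. The field names `name`, `pass`, `op`, `form_id` and `form_build_id` come from this thread; the class `FormFieldParser` and the functions `hidden_fields_for` and `build_login_body` are hypothetical names made up for the example. It reads hidden inputs form by form, because a page may carry more than one form and each has its own build ID.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Collect the hidden inputs of each <form> on a page, one dict per form."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # hidden fields of the form being read, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self._current = {}
        elif tag == "input" and self._current is not None:
            is_hidden = (attributes.get("type") or "").lower() == "hidden"
            if is_hidden and attributes.get("name"):
                self._current[attributes["name"]] = attributes.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def hidden_fields_for(html, form_id):
    """Return the hidden fields of the form whose form_id matches."""
    parser = FormFieldParser()
    parser.feed(html)
    parser.close()
    for fields in parser.forms:
        if fields.get("form_id") == form_id:
            return fields
    raise ValueError("no form with form_id %r found" % form_id)


def build_login_body(login_page_html, username, password):
    """Return a URL-encoded POST body for the login form on the given page."""
    fields = {"name": username, "pass": password, "op": "Log in"}
    fields.update(hidden_fields_for(login_page_html, "user_login"))
    return urlencode(fields)
```

The returned string is the kind of value that would be handed to cURL as the POST fields, after the login page itself has been fetched with the same cookie jar.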

Hi Darren,
Yes that was it :) This is working from the main page, no need to go to /user/login

#!/usr/bin/env python3
# Log in to a Drupal site with pycurl, then request pages as that user.
from urllib.parse import urlencode

import pycurl

user_agent = 'Mozilla/4.0 (compatible; MSIE 6.0)'
# Fields of the Drupal login form, including its hidden values.
login = [('name', 'username'),
        ('pass', 'password'),
        ('form_id', 'user_login'),
        ('form_build_id', ''),
        ('op', 'Log in')]
login_data = urlencode(login)

crl = pycurl.Curl()
crl.setopt(pycurl.POSTFIELDS, login_data)  # setting POSTFIELDS makes this a POST
crl.setopt(pycurl.URL, 'http://yoursite.com')
crl.setopt(pycurl.HEADER, 1)  # include response headers in the output
crl.setopt(pycurl.USERAGENT, user_agent)
crl.setopt(pycurl.FOLLOWLOCATION, 1)  # follow the redirect after login
# Read and write the session cookie so later requests stay logged in.
crl.setopt(pycurl.COOKIEFILE, '/tmp/cookie.txt')
crl.setopt(pycurl.COOKIEJAR, '/tmp/cookie.txt')
crl.perform()

# Options persist on the handle, so switch back to GET for the page requests.
crl.setopt(pycurl.HTTPGET, 1)
crl.setopt(pycurl.URL, 'http://yoursite.com/blog/username')
crl.perform()

crl.setopt(pycurl.URL, 'http://yoursite.com/node/add/blog')
crl.setopt(pycurl.VERBOSE, 1)
crl.perform()
crl.close()

As you can see, I am now at the stage of adding a blog entry. The goal of this Drupal community site is for users to add their configuration files and the system information collected by a report script. Here is the report script:

http://code.google.com/p/greport/source/browse/trunk/greport.py

And here are the files it generates:
http://dwabbott.com/downloads/comprookie2000/

Users can compare systems, .conf files etc.

OK, now I could use a little help with this part:

http://dwabbott.com/downloads/add_blog.txt

Thanks,
-david

Either:

- parse the form_build_id and the form_token from the node/add/blog page and use that to construct the response, or,

- use the services module and send the data via XML - http://drupal.org/project/services

OK, moving along: I can now post to a new node, which we called gentoo report, with 3 fields plus a title. The title works fine:
('title', 'Pycurl Test'),
but the fields only return "f" (without the quotes):
('field_cbuild', 'x86_64-pc-linux-gnu'),
Any ideas?
thanks again,
-david

OK, I got it working. I will post it here in case it helps someone:
http://linuxcrazy.pastebin.com/f32587f10

I'm closing comments on this one post because it only attracts spam. If you want to comment on this please contact me and I will open them again, or post for you. Thanks!
