Darren Mothersele

London based Drupal Web Developer

SimpleHTMLDOM Parser for Drupal

Import content from other webpages using Feeds and some HTML DOM magic. Importing HTML content is a task that comes up now and again, and I wanted a more generic way of doing it, so I've created a module called SimpleHTMLDOM Parser and published it on Drupal.org. It is useful for many things, including monitoring sites that don't support RSS, importing legacy content to a Drupal site, and screen scraping. Read on for documentation...

Installation

  • Download the module and extract it into your site's modules folder. You will also need the Feeds module and its dependencies.
  • Enable the module.

Example Configuration

  • Navigate to admin/structure/feeds and add a new importer.
  • Click "Change" next to Parser to select the SimpleHTMLDOM parser.

  • Edit the settings for the parser and enter the root node to extract. In this case we are importing a recipe list from a food website. On the page these items are HTML list items within a list with class "recipeList". We therefore use the selector pattern ".recipeList li" to select each of our root nodes.

  • We add an extractor for every item we wish to extract from each root node that is found. To do this, expand the "Add new extractor" section of the form.

  • Each extraction requires a name which is used to refer to this item within the mapping interface. The following options are available for each extracted item:
    • Name: Used in the mapping interface of the processor
    • Pattern: The SimpleHTMLDOM pattern to match for this item. The syntax of these expressions is similar to jQuery selectors.
    • Attribute: The actual content to return from this item. You can use "innertext" to return the HTML within the item. "plaintext" returns just the text without any tags. Other HTML attributes can be returned, such as "src" for images or "href" for links.
    • Multiple values: If multiple items are matched by the pattern you can return an array of values, rather than a single item. This can be mapped to a multivalue field, for example.
    • Offset: If not returning multiple values, the first matched item is returned unless you specify an offset. For example, if you use a pattern of "p" and it matches three paragraphs, an offset of 0 (the default) returns the first item, an offset of 1 returns the second item, and so on.
    • Strip whitespace: Some HTML documents contain excess whitespace within and around elements. Select this option to strip this extra space.
    • Rewrite result: If you wish to add a prefix or suffix, or rewrite the value completely, enter a string here. Any occurrence of the token string [value] will be replaced with the extracted value.
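
To make the options above more concrete, here is a rough sketch in Python of how pattern, attribute, multiple values, offset and strip whitespace could combine. It is not the module's code: the module is PHP and uses SimpleHTMLDOM selectors, whereas this sketch uses Python's standard xml.etree.ElementTree on a made-up, well-formed page fragment, with ElementTree path expressions standing in for the selector patterns. The extract function, its parameter names and the sample markup are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, kept well-formed so the stdlib XML parser accepts it.
PAGE = """
<div>
  <ul class="recipeList">
    <li>
      <a href="/recipes/soup"><img src="/img/soup.jpg"/></a>
      <h3>  Tomato   soup </h3>
      <p>Quick</p>
      <p>Vegan</p>
    </li>
    <li>
      <a href="/recipes/pie"><img src="/img/pie.jpg"/></a>
      <h3>Apple pie</h3>
      <p>Dessert</p>
    </li>
  </ul>
</div>
"""


def extract(root, pattern, attribute="plaintext", multiple=False, offset=0, strip=False):
    """Illustrative stand-in for one extractor applied to one root node."""
    values = []
    for node in root.findall(pattern):
        if attribute == "plaintext":
            # Text of the node and its descendants, without tags.
            value = "".join(node.itertext())
        else:
            # Any other name is read as an HTML attribute ("src", "href", ...).
            value = node.get(attribute, "")
        if strip:
            # This sketch's reading of "strip whitespace": trim and collapse runs.
            value = " ".join(value.split())
        values.append(value)
    if multiple:
        return values
    # Single value: pick the match at the given offset, if there is one.
    return values[offset] if offset < len(values) else None


doc = ET.fromstring(PAGE)

# Root nodes: a path roughly equivalent to the ".recipeList li" selector.
# (Unlike a CSS class selector, [@class='...'] needs an exact attribute match.)
items = doc.findall(".//ul[@class='recipeList']/li")

titles = [extract(li, "h3", strip=True) for li in items]
print(titles)                                    # ['Tomato soup', 'Apple pie']
print(extract(items[0], "p", multiple=True))     # ['Quick', 'Vegan']
print(extract(items[0], "p", offset=1))          # Vegan
print(extract(items[0], "a", attribute="href"))  # /recipes/soup
```

Each root node is processed independently, so one imported item is produced per matched list item, and every extractor contributes one mappable value (or an array of values) to that item.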

Example extracting a title from each matched list item:

Example extracting an image from each matched item. Notice that in this case a prefix is added to make a full URL out of the returned value:
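
The prefix rewrite mentioned above amounts to a simple token substitution. A minimal sketch in Python, assuming a hypothetical site address and image path (the module itself is PHP, and rewrite_value is a name made up for this illustration):

```python
def rewrite_value(template, value):
    """Replace every [value] token in the template with the extracted value."""
    return template.replace("[value]", value)


# Hypothetical example: the page uses a site-relative "src", so the rewrite
# string supplies the scheme and host to turn it into a full URL.
full_url = rewrite_value("http://www.example.com[value]", "/images/soup.jpg")
print(full_url)  # http://www.example.com/images/soup.jpg
```

The same mechanism covers suffixes or a complete rewrite, since the token can appear anywhere in the string, or more than once.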

This is only basic documentation for now, but let me know if you have any questions or interesting usage ideas for this module. If you have any problems, please raise them on the issue queue via the project page on drupal.org.

Thanks for the article and module.
I think it's more helpful than Dapper (a page parser service), which isn't native.

If this works, you are the man. I'll check into it later.

Thanks for the little documentation and the module. If I am to extract from a website, where do I specify the URL of the website to extract from?

Hi,

You specify the content that will be parsed in the Fetcher configuration. To use a webpage you will use the HTTPFetcher plugin for Feeds.

Thank you for the reply.

I have successfully fetched my HTML content and mapped it to my content type. But in my content type I have a taxonomy vocabulary with different terms. When I try to map to the taxonomy field, I expected it to list my terms so that I could map to a particular term, but unfortunately I don't get that. Is there a way to do that? I would be very glad if you could help.

Is there any way to access the href link attribute?
I tried a href, a[href] and so on, with no luck.

2. Is it possible to parse multiple pages automatically, as described here? http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-w...

href solved, I was missing the attribute href ....

Great article. I've used PHP Simple HTML DOM Parser outside of Drupal with success. How hard would it be for me to port your work to 6.x? Do you have some advice on the steps I could take?

It's actually quite easy to migrate this to D6. There is some good documentation on writing Feeds plugins for the Drupal 6 version. Have a look at the Feeds developer documents. I have a very early version of the module that was working in D6. If I can find it I will post it up here.

Hi. Is it possible to send me a link to download the module for Drupal 6? This is exactly what I need.

Thanks

Hi. Is it possible to send me a link to download this module for Drupal 6.22? Thank you.

I'm trying to get Title content from a webpage but am not getting anything created. The import log lists there are no new nodes, so I'm thinking maybe I don't have it configured correctly. I have the root node pattern as ".articleMain div" as all the data I want to extract is contained in a div with class articleMain. Then I created a new extractor Title with pattern ".articleTitle h1" as the title is in an h1 with class articleTitle.

I enter the desired URL in the URL Feed field and it looks like it is importing but I get no new nodes. Any ideas?

Hi, is it possible to send me a link to download your module for Drupal version 6.22? This module is exactly what I need. Thanks
