Import content from other webpages using Feeds and some HTML DOM magic. Importing HTML content is a task that comes up now and again, and I wanted a more generic way of doing it, so I've created a module called SimpleHTMLDOM Parser and published it on Drupal.org. It is useful for many things, including monitoring sites that don't support RSS, importing legacy content into a Drupal site, and screen scraping. Read on for documentation...
- Download the module and extract it into your site's modules folder. You will also need the Feeds module and its dependencies.
- Enable the module.
- Navigate to admin/structure/feeds and add a new importer.
- Click "Change" next to Parser to select the SimpleHTMLDOM parser.
- Edit the settings for the parser and enter the root node to extract. In this case we are importing a recipe list from a food website. On the page these items are HTML list items within a list with class "recipeList". We therefore use the selector pattern ".recipeList li" to select each of our root nodes.
- We add an extractor for every item we wish to extract from each root node that is found. To do this, expand the "Add new extractor" section of the form.
- Each extraction requires a name which is used to refer to this item within the mapping interface. The following options are available for each extracted item:
- Name: Used in the mapping interface of the processor
- Pattern: The SimpleHTMLDOM pattern to match for this item. The syntax of these expressions is similar to jQuery selectors.
- Attribute: The actual content to return from this item. Use "innertext" to return the HTML within the item, or "plaintext" to return just the text without any tags. Other HTML attributes can also be returned, such as "src" for images or "href" for links.
- Multiple values: If multiple items are matched by the pattern you can return an array of values, rather than a single item. This can be mapped to a multivalue field, for example.
- Offset: If not returning multiple values, the first matched item is returned unless you specify an offset. For example, if you use a pattern of "p" and it matches three paragraphs, an offset of 0 (the default) returns the first item, an offset of 1 returns the second item, and so on.
- Strip whitespace: Some HTML documents contain excess whitespace within and around elements. Select this option to strip this extra space.
- Rewrite result: If you wish to add a prefix or suffix, or rewrite the value completely, enter a string here. Any occurrence of the token string [value] will be replaced with the extracted value.
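To make the interaction between these options concrete, here is a minimal Python sketch of how they could combine. It is not the module's code (the module is written in PHP); the function name, its parameters, the order of the steps, and the choice to collapse internal whitespace are all assumptions made for illustration.

```python
def apply_extractor_options(matches, multiple=False, offset=0,
                            strip_whitespace=False, rewrite=""):
    """Hypothetical model of the extractor options; not the module's real API.

    `matches` is the list of values already found by the pattern, in
    document order.
    """
    # Multiple values: keep everything. Otherwise keep only the item at `offset`.
    values = list(matches) if multiple else list(matches[offset:offset + 1])

    # Strip whitespace (assumed behaviour): trim the ends, collapse inner runs.
    if strip_whitespace:
        values = [" ".join(value.split()) for value in values]

    # Rewrite result: substitute each value for the [value] token.
    if rewrite:
        values = [rewrite.replace("[value]", value) for value in values]

    if multiple:
        return values
    return values[0] if values else None


paragraphs = ["  First   paragraph ", "Second", "Third"]

first = apply_extractor_options(paragraphs)                  # offset 0, the default
second = apply_extractor_options(paragraphs, offset=1)       # second match
cleaned = apply_extractor_options(paragraphs, multiple=True,
                                  strip_whitespace=True)     # list of all matches
prefixed = apply_extractor_options(["/img/a.png"],
                                   rewrite="http://example.com[value]")
```

In this sketch an offset past the last match yields `None` rather than an error; how the module itself behaves in that case is not covered here.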
Example extracting a title from each matched list item:
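For this case the extractor settings might be a pattern of "a" with the attribute "plaintext"; these values and the markup below are assumed for illustration. The following Python sketch uses only the standard library to mimic what such an extraction returns for each root node. The class and function names are hypothetical, and the parsing is deliberately simplified.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; skipped so the depth counter stays right.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class FirstTagText(HTMLParser):
    """Gathers the text inside the first occurrence of a tag, dropping nested markup."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # nesting depth inside the target element
        self.done = False   # True once the first match has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif tag == self.tag:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def plaintext_of_first(fragment, tag):
    """Rough equivalent of pattern=<tag>, attribute=plaintext, offset=0."""
    parser = FirstTagText(tag)
    parser.feed(fragment)
    return "".join(parser.parts)


# Two root nodes, as the ".recipeList li" selector might return them (made-up markup).
ROOT_NODES = [
    '<li><a href="/recipes/soup"><strong>Tomato</strong> soup</a> <span>30 min</span></li>',
    '<li><a href="/recipes/bread">Soda bread</a> <span>45 min</span></li>',
]

TITLES = [plaintext_of_first(node, "a") for node in ROOT_NODES]
```

With "innertext" instead of "plaintext", the first title would keep its nested `<strong>` tags.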
Example extracting an image from each matched item. Notice that in this case a prefix is added to make a full URL out of the returned value:
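As a rough illustration, assume a pattern of "img", an attribute of "src", and a rewrite string of "http://www.example.com[value]". The domain, the markup, and the helper names are all made up for this sketch, which again uses only the Python standard library and is not the module's implementation.

```python
from html.parser import HTMLParser


class FirstImageSrc(HTMLParser):
    """Remembers the src attribute of the first <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.src is None:
            self.src = dict(attrs).get("src")


def extract_image_url(root_node_html, rewrite):
    """Return the first image's src with the rewrite string applied, or None."""
    parser = FirstImageSrc()
    parser.feed(root_node_html)
    if parser.src is None:
        return None
    # The [value] token is replaced by the extracted attribute value.
    return rewrite.replace("[value]", parser.src)


# A single root node with a site-relative image path (made-up markup).
ITEM = ('<li><img src="/images/soup.jpg" alt="Soup">'
        '<a href="/recipes/soup">Tomato soup</a></li>')

IMAGE_URL = extract_image_url(ITEM, "http://www.example.com[value]")
```

Because the page uses a site-relative path, the raw "src" value alone would not be a usable URL once imported; the prefix in the rewrite string supplies the missing scheme and host.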
This is only basic documentation for now, but let me know if you have any questions or interesting usage ideas for this module. If you have any problems, please raise them in the issue queue via the project page on drupal.org.