“import external HTML and transform it with XSLT?” – Forum Thread – Discuss

- Nils
- 22 Apr 08, 4:58 am
- Comment #1

For a current project I need to aggregate content from different webpages of the university I'm working at. Unfortunately there are no RSS feeds available that I can use as external source.

Is there a possibility to import external HTML pages, treat these as XML and transform them with XSLT?

Thanks for your help and ideas!
Nils

- philippo
- 22 Apr 08, 5:45 am
- Comment #2

Hi,

If the page is 100% XHTML, I think you can use the "Dynamic XML" source when you create a new Data Source. But, it's a risk. Most of the time pages are not 100% XHTML and you will end up with errors.

I think it's better to have a cron job do a wget on the external page and then let a PHP script transform it into real XML or XHTML. Then you can safely use the Dynamic XML source.
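A rough sketch of what the conversion half of that setup might look like, assuming wget has already saved the page to disk. The function name convertHtmlFileToXml and both file paths are made up for illustration, and DOMDocument is only one possible way to do the repair:

```php
<?php
// Hypothetical converter, meant to be run from cron after wget has saved the page.
// Both paths are placeholders; adjust them to your own setup.
function convertHtmlFileToXml(string $htmlPath, string $xmlPath): bool
{
    $html = file_get_contents($htmlPath);
    if ($html === false || $html === '') {
        return false; // nothing downloaded, so leave any previous XML file alone
    }

    // Collect warnings about bad markup internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return false;
    }

    // save() writes the repaired tree as XML and returns the byte count or false.
    return $dom->save($xmlPath) !== false;
}

// Example invocation: php convert.php /tmp/page.html /tmp/page.xml
if (PHP_SAPI === 'cli' && isset($argv[2])) {
    exit(convertHtmlFileToXml($argv[1], $argv[2]) ? 0 : 1);
}
```

The Dynamic XML source would then point at wherever the generated XML file is served from, so Symphony never sees the original markup.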

- Nils
- 22 Apr 08, 12:33 pm
- Comment #3

Sad but true: my university does not care about web standards and I can be sure that none of the needed pages will validate.

Do you know a good PHP script that transforms HTML to XML?

- ibolmo
- 22 Apr 08, 2:25 pm
- Comment #4

Try:

http://symphony21.com/forum/discussions/117/1/

I guess: html -> markdown -> xml

- Nils
- 22 Apr 08, 3:00 pm
- Comment #5

Hi ibolmo,

I think this does not work: If I could transform my HTML with XSLT (as bauhouse did for his markdown solution) my problem would not exist. I need to grab the HTML dynamically and convert it into a structure that can be processed by XSL.

For example: I've got a web page containing a div with id="content". I need to find a way to grab this element and all its children and import it as dynamic XML into the system. Unfortunately I don't have a valid XHTML source but an HTML 4.01 Transitional document that does not necessarily validate.

Nils

- Lewis
- 23 Apr 08, 1:59 am
- Comment #6

I think your only option is to use PHP, grab the page, and parse it for the bits you want. Traditionally, there can be a lot of problems with this method since webpages change. You would do this using an event and output the bits to the XML.

- Nils
- 21 Oct 09, 10:15 pm
- Comment #7

I’d like to come back to this topic:

My university changed its content management system over the summer, and it now outputs valid XHTML Transitional. I used some hackish PHP code to parse the old website (which was using invalid HTML).

Does anybody know an easy way to parse a valid XHTML document and fetch certain parts by using XPath?

- nickdunn
- 21 Oct 09, 10:24 pm
- Comment #8

Through Symphony or with a PHP script?

- nickdunn
- 21 Oct 09, 10:30 pm
- Comment #9

In PHP the DomDocument class has loadHTML() and loadHTMLFile() methods. So you could do one of these:

$dom = new DomDocument();
$dom->loadHTML(file_get_contents('http://your-url-here.com'));

Or you might be able to pass a URL directly:

$dom = new DomDocument();
$dom->loadHTMLFile('http://your-url-here.com');

DomDocument will load a broken HTML tree and attempt to fix it, so even if your XHTML isn’t perfectly well-formed you should be alright. By creating a DomXPath object you can apply XPath queries to the document. For example parse out all option elements in the document:

$xpath = new DomXPath($dom);
foreach($xpath->query("//*[name()='option']") as $option) {
    ...
}
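To make the XPath step more concrete, here is a minimal sketch that pulls a div with id="content" (the case Nils described earlier) out of deliberately broken markup. The sample HTML string and the function name extractContentDiv are invented for illustration; in practice the markup would come from file_get_contents() on the real URL:

```php
<?php
// Hypothetical helper: returns the div with id="content" serialized as XML,
// or null if the page has no such element.
function extractContentDiv(string $html): ?string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query("//div[@id='content']");
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // saveXML() with a node argument serializes just that subtree.
    return $dom->saveXML($nodes->item(0));
}

// Sample input with an unclosed <p> and a bare <br>, standing in for a real page.
$html = '<html><body><div id="nav">Menu</div>'
      . '<div id="content"><h2>News</h2><p>First item<br>Second item</div>'
      . '</body></html>';

echo extractContentDiv($html);
```

Because the parser closes the dangling tags itself, the serialized fragment should be well-formed and safe to hand on to XSLT.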

- Nils
- 21 Oct 09, 10:32 pm
- Comment #10

> Through Symphony or with a PHP script?

Ideally through Symphony - using a simple interface similar to the dynamic XML import.

- Nils
- 21 Oct 09, 10:34 pm
- Comment #11

> DomDocument will load a broken HTML tree and attempt to fix it,

Ah, I didn’t know that. So this seems to be quite easy and straightforward. Now I just have to figure out how to hook into the data source manager to add this XHTML import feature.

- nickdunn
- 21 Oct 09, 11:19 pm
- Comment #12

Can a Dynamic XML Data Source pull in valid XHTML?

- nickdunn
- 21 Oct 09, 11:31 pm
- Comment #13

> Can a Dynamic XML Data Source pull in valid XHTML?

And if it can not, you could write a simple PHP script which fetches the HTML and outputs it as valid XML:

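<?php
header("Content-Type: text/xml");
// Collect parser warnings internally so they are not printed into the XML output
libxml_use_internal_errors(true);
$dom = new DomDocument();
$dom->loadHTML(file_get_contents('http://your-url-here.com'));
echo $dom->saveXML(); die;
?>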

Then add the URL of this script for your Dynamic XML Data Source.

- Nils
- 21 Oct 09, 11:50 pm
- Comment #14

> Can a Dynamic XML Data Source pull in valid XHTML?

No, this does not seem to work – but maybe this is a good feature request.

> And if it can not, you could write a simple PHP script which fetches the HTML and outputs it as valid XML:

This is a neat idea! Smart and simple.

- Nils
- 22 Oct 09, 12:25 am
- Comment #15

It seems like this works great:

header("Content-Type: text/xml");

// Tidy up external HTML document
$tidy = new tidy();
$body = $tidy->repairfile('your-path-here', array(
    'hide-comments' => true,
    'numeric-entities' => true,
    'output-xml' => true,
    'show-body-only' => true,
    'indent' => true
));

// Return document
echo $body;
die;

- phoque
- 22 Oct 09, 1:15 am
- Comment #16

> It seems like this works great […]

Self-explanatory, compact… tiny even! Perfect solution!

- bauhouse
- 22 Oct 09, 1:32 am
- Comment #17

I was going to say, I’ve had pretty good success with simple document calls to valid XML and XHTML documents:

 <xsl:param name="valid-xhtml" select="document('http://example.com/path/to/resource.html')"/>
 <xsl:for-each select="$valid-xhtml//div[@id = 'content']"><xsl:value-of select="h2"/></xsl:for-each>

I was experimenting with importing data from valid sites by using the same sort of process I was experimenting with for importing XML from WordPress.

But, of course, the document needs to be valid or you’ll get some nasty errors. Sounds like you’ve figured out that part. Nice work, Nils!

- MrBlank
- 22 Oct 09, 1:58 am
- Comment #18

I’ve been importing an XHTML page that I hacked Delicious Library to publish for a while now. You can see it here.

I’m just using the dynamic XML data source.

But, yes, most of the time pages are not valid XHTML and any error will halt Symphony, so the extra step with PHP is a good option.

- Makenosound
- 22 Oct 09, 7:40 am
- Comment #19

Perhaps YQL is your friend: ch-ch-check it.

- Neither
- 01 Dec 09, 8:29 pm
- Comment #20

Hi Nils,

This seems to be exactly what I’m looking for but I’m not clear on how to implement the code you posted above.

I’m hoping to be able to pull some text and an image url out of some pages on mixcloud.com but their HTML is pretty invalid (missing closing divs, poor flash embedding etc). How do I go about placing that ‘cleaning’ code between this source and my XSLT for manipulation?

I gather that the PHP needs to sit inside an Event but have little idea about how to do this. I’ve looked through the code of some other Events but have been slightly flummoxed by it. I’m far from being a programmer but am comfortable with basic PHP. Just not sure how this plugs into the Symphony DS/Page relationship.

Hope all these questions aren’t beyond the scope of this thread!

Any help with this would be invaluable.
