Asked about this a few years back but have needed to come back to it recently.

A client wants to import data from a variety of other sources, most of which are poorly-written (non-x)html pages. I'm aware that XML Importer refuses to import invalid code, which is fine.

I'm pretty crap at PHP, but in my recent hackings I've been able to get DOMDocument to load and parse some of these broken client pages pretty well without it throwing any errors about malformed input.

So to any PHP geniuses out there, is there some way to hack up a different version of XML Importer which allows for a poorly-formed source? (I'm pretty sure it uses DOMDocument to parse input, but I could be wrong, as I'm often lost looking at even well-written PHP!)

Any help would be great!

If you already have some code that is retrieving the data and transforming it from ugly HTML into well formed HTML, you could potentially interact with the XML Importer class directly.

The API for the XML Importer is quite basic. Here's some pseudo code:

    // create a new importer instance
    $xmlImporter = new XMLImporter;

    // false tells XML Importer that '$yourXmlString' is the data.
    $xmlImporter->validate($yourXmlString, false);

    // this will actually import the data
    $xmlImporter->commit();

Now, surely it's not that easy, I hear you say. True, it's not. Before you do the above, create your XML Importer through the GUI as you normally would. This is important because it sets up the field mappings, the import section and so on. You can do this programmatically if you like, but I wouldn't recommend it.

If you go down this route, the pseudo code changes a little:

    // create a new Manager
    $xmlImporterManager = new XMLImporterManager;

    // If your Importer is called Big Cats..
    $xmlImporter = $xmlImporterManager->create('big-cats');

    // false tells XML Importer that '$yourXmlString' is the data.
    $xmlImporter->validate($yourXmlString, false);

    // this will actually import the data
    $xmlImporter->commit();

I hope that helps :)

Thanks so much Brendan. The fact that it should be possible is some excellent news.

I'm pretty sure I get that code - just not sure where to work it into the code generated by my importer set up in the GUI.

Does it have something to do with the source value in the options array?

Actually, what you mention is dealing with the XMLImporterManager so I should look into that...?

(Please excuse my noobing)

After you create an XML importer through the interface, you will find it in the workspace folder (workspace/xml-importers). There you can modify the importer as Brendan describes (I think :-)).

@plenaforma Yep that's it. Got my newly-generated importer file, looks exactly as expected, just unclear as to how to override the default data import with API interaction as Brendan describes.

Will continue hacking away all the same.

As far as I can see (though my PHP skills are even worse), an XML importer is an object of the XMLImporter class, so it can use the methods of that class.

There is a validate function, so I guess you can call that method in your XML importer file, as Brendan describes above:

    // false tells XML Importer that '$yourXmlString' is the data.
    $xmlImporter->validate($yourXmlString, false);

The only thing I can't figure out is where to get $yourXmlString from.
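One possible source for $yourXmlString, going by the DOMDocument approach mentioned at the top of the thread, is to let PHP's forgiving HTML parser load the broken page and then serialise the repaired tree as XML. This is only a rough sketch: htmlToXmlString is a made-up helper name, the sample markup is invented, and whether the XML Importer accepts the result depends on how your importer's mappings are set up.

```php
<?php
// Hypothetical helper (not part of XML Importer): turn sloppy HTML
// into a well-formed XML string using PHP's DOM extension.
function htmlToXmlString($html)
{
    // Collect libxml parse errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument('1.0', 'UTF-8');
    // loadHTML() uses libxml's lenient HTML parser, which repairs
    // unclosed tags and unquoted attributes where it can.
    $doc->loadHTML($html);

    // Discard the collected errors and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialise the repaired tree (from the root element down) as XML.
    return $doc->saveXML($doc->documentElement);
}

// Invented example of malformed markup: unquoted attribute, unclosed <li> tags.
$yourXmlString = htmlToXmlString('<ul id=features><li>Fast<li>Flexible</ul>');
```

The resulting string could then be handed to validate() as in Brendan's pseudo code.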

Ok, I'm trying a different (maybe too convoluted) way around this using a remote DS to feed an XML Importer.

Ended up using PHP's DOMDocument to build a custom XML document. Pretty sure that it's perfectly valid, throws no errors, loads fine when attached to a regular page and is visible in the page's XML debug with all data intact.
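For anyone curious what building a document that way can look like, here is a minimal sketch. The element names (entries, entry, title, url) and the $rows data are invented for illustration and are not the structure I actually used.

```php
<?php
// Hypothetical data, standing in for values scraped from a client page.
$rows = array(
    array('title' => 'Lions & Tigers', 'url' => 'http://example.com/cats/1'),
    array('title' => 'Leopards', 'url' => 'http://example.com/cats/2'),
);

$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true; // indent the output so it is easier to read

// appendChild() returns the appended node, so calls can be chained.
$root = $doc->appendChild($doc->createElement('entries'));

foreach ($rows as $row) {
    $entry = $root->appendChild($doc->createElement('entry'));
    foreach ($row as $name => $value) {
        $field = $entry->appendChild($doc->createElement($name));
        // A text node gets characters such as & and < escaped on output.
        $field->appendChild($doc->createTextNode($value));
    }
}

// Full document, including the XML declaration.
$xml = $doc->saveXML();
```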

Which is great. I can build pages from that. Only we really need to be able to suck the data into the DB in order for it to be useful.

Trying to add this to an XML Importer fails though. I get the error:
Failed to retrieve data from source: Status code 0 was returned. Content-type:

Not sure what that means. The data shows in the debug page...

For anyone who comes across this post in the future, just thought I'd add how I managed to get this working (without hacking into any PHP)...

As suggested waaay back in the day, I fell back on using YQL to extract the relevant part of the remote source page. That is then rendered via an XML page in Symphony, which in turn is targeted by an XML Importer. Pretty simple and works like a charm (I only wish I didn't have to rely on a third-party API for this).

The YQL API just requires a valid source URL and XPath like this:

http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20html%20where%20url=%22{$url-yql-url}%22%20and%20xpath=%27{$url-yql-xpath}%27&format=xml

You can set up a Remote Data Source with that as the URL, where $url-yql-url and $url-yql-xpath are URL params you can pass into the page the DS is attached to, just to make the whole thing more flexible. (I didn't use regular page params because the URL and XPath are full of confusing slashes.)
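If you ever want to build that YQL URL outside of a Data Source, something like the following sketch shows how the pieces fit together. buildYqlUrl is a hypothetical helper name, and it assumes the XPath contains no single quotes, since the query wraps it in them.

```php
<?php
// Hypothetical helper: assemble the YQL request URL for a source page and XPath.
function buildYqlUrl($sourceUrl, $xpath)
{
    // Same query as in the URL above, written out before encoding.
    $query = sprintf(
        "select * from html where url=\"%s\" and xpath='%s'",
        $sourceUrl,
        $xpath
    );

    // rawurlencode() encodes spaces as %20 and quotes as %22 / %27.
    return 'http://query.yahooapis.com/v1/public/yql?q='
        . rawurlencode($query)
        . '&format=xml';
}

$yqlUrl = buildYqlUrl('http://www.getsymphony.com/', '//ul[@id="features"]');
```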

Attach that DS to a page with type set to XML and spit out the returned data however you like. Could just be via an xsl:copy-of or more complex manipulations if needed.

You can then access the resulting page like this:

http://yoursite.com/yql-page?yql-url=http://www.getsymphony.com/&yql-xpath=//ul[@id="features"]
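The URL above is written unencoded for readability. If you are generating it from a script, a sketch like this takes care of encoding the slashes, brackets and quotes in the two params (yoursite.com and yql-page are the same placeholders as above):

```php
<?php
// The two URL params the page expects.
$params = array(
    'yql-url'   => 'http://www.getsymphony.com/',
    'yql-xpath' => '//ul[@id="features"]',
);

// http_build_query() URL-encodes each value; '&' is passed explicitly
// so the separator does not depend on the server's ini settings.
$pageUrl = 'http://yoursite.com/yql-page?' . http_build_query($params, '', '&');
```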
