I am in need of very simple HTML-2-XML scraping. Basically a client needs to get some HTML from a certain (related) site (shop opening times) so that the editors are not required to copy-paste that info themselves.

So I'd like to have a super simple Dynamic DS from a scraper.

I am not experienced in this, but loading an HTML document seems very simple with DOMDocument etc. This would, however, require me to create a custom DS.
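For what it's worth, here is a minimal sketch of the DOMDocument idea. It assumes the opening times sit in a list inside an element with the made-up id `opening-times`; the sample markup and the function name `extract_opening_times` are placeholders, not taken from the real site.

```php
<?php
// Hypothetical sample markup, deliberately not valid XHTML
// (unclosed <li> elements and a bare <br>).
$html = '<html><body><div id="opening-times"><ul>'
      . '<li>Mon-Fri: 9-18<li>Sat: 10-16</ul><br></div></body></html>';

// Hypothetical helper: returns the text of each <li> inside #opening-times.
function extract_opening_times($html) {
    // Collect parser warnings internally instead of printing them;
    // real-world HTML usually triggers a few.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html); // tolerant HTML parser, not the strict XML one

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $times = array();
    foreach ($xpath->query('//div[@id="opening-times"]//li') as $li) {
        $times[] = trim($li->textContent);
    }
    return $times;
}

print_r(extract_opening_times($html));
```

The XPath expression is the only part that should need changing for a different page, though the real markup may well call for something more defensive.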

Currently the simplest way I could think of is to use YQL and use their API call as a dynamic DS.

I guess this works, but is there a better way? Are there benefits (besides the dependency issue) to creating a custom DS?

Is the HTML you're trying to access valid XHTML, so you could use it in your DS directly?

Hi Nils, and thanks. But nope: not valid XHTML (and I'd not want to bet on that).

Don't be afraid of custom DSs. They're not that difficult once you understand what you can and can't do.

Thanks John. I am not too afraid but I guess I'm just looking for the simplest robust way.

I am on 2.2.5. Any pointers on how I should go about parsing an HTML page with a custom DS?

Do you have my email, David? If you send me some details, I can have a look. Off the top of my head, I don't remember some simple pointers, but I have done this before to scrape a site using DOMDocument.

Happy to help you out...

Off the top of my head, I don't remember some simple pointers, but I have done this before to scrape a site using DOMDocument.

I too would love to know more about using DOMDocument to scrape a site.

Seconded! Would love to know how to do this too without having to rely on YQL.

I will try to get something simple up soon. This kind of thing can be very specific to the use case, so I will do something generic to explain it.

Right then.

I was expecting a method in XMLElement that would allow me to pass it a string of HTML to convert into an XMLElement object.

This doesn't exist, which means that doing this with DOMDocument is a very lengthy process, and unique to each project.

You can just return a string from a DS; it doesn't have to be an XMLElement. At least, it worked pre-2.3.
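A rough sketch of what that could look like: parse the HTML with DOMDocument, pick one node, and serialise it to a well-formed XML string. The helper name `html_fragment_to_xml`, the sample markup and the `<error>` fallback are all made up for illustration. In a custom DS the returned string would presumably be what the DS's `grab()` method hands back (that is my understanding of 2.2.x, so treat it as an assumption).

```php
<?php
// Hypothetical helper: returns the first node matching $query as an XML
// string, or an <error> element when nothing matches.
function html_fragment_to_xml($html, $query) {
    // Keep parser warnings about sloppy markup out of the output.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return '<error>No matching element found</error>';
    }

    // saveXML() with a node argument serialises just that node (no XML
    // declaration), closing tags that the HTML left open.
    return $doc->saveXML($nodes->item(0));
}

// Made-up, not-quite-valid markup standing in for the scraped page.
$html = '<div id="opening-times"><p>Mon-Fri: 9-18<br>Sat: 10-16</div>';
$xml = html_fragment_to_xml($html, '//div[@id="opening-times"]');
echo $xml;
```

Because the string goes straight into the page XML, it has to be well-formed, which is why the sketch uses `saveXML()` rather than `saveHTML()`.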

Ouch. Now that I know that, I shall carry on! I always assumed (from following others' work) that it had to be an XMLElement.

Well then, it isn't as hard as I first thought!
