“XSLT and HTML5” – Forum Thread – Discuss

- Allen
- 26 Aug 10, 8:45 pm
- Comment #61

What you’re asking for is mentioned by Nick already (comment #56). You can see his code example here.

- davidhund
- 26 Aug 10, 9:24 pm
- Comment #62

Allen, thanks. I did not realize Nick was doing the same thing as a workaround.

Could you explain how I should use that extension.driver.php file? It seems to be an ‘incomplete extension’: can I just create a folder “html5_doctype” with this file and use it as a regular extension, or should I use it another way?

So, to summarize:

Using the <xsl:text disable-output-escaping="yes">&lt;</xsl:text>!DOCTYPE html<xsl:text disable-output-escaping="yes">&gt;</xsl:text> HTML5 doctype ‘hack’ in combination with output method ‘XML’ causes the XSLT processor to collapse all empty elements such as <textarea></textarea> and <script src=...></script> into self-closing tags, breaking the page.

At the moment, the only way to get ‘properly indented’ HTML5 markup, written in XHTML syntax (i.e. using the ‘xml’ output method), that makes use of some of the niceties of HTML5 (shorter meta definition etc.) is a string parsing ‘hack’ that replaces the default HTML doctype with the HTML5 short enable-strict-mode variant.

- nickdunn
- 26 Aug 10, 9:35 pm
- Comment #63

can I just create a folder “html5_doctype” with this file and use it as a regular extension or should I use it another way

It is a normal extension, so just download it (or clone it) from GitHub with the folder named html5_doctype. I guess you’ll then need some more complex regular expressions to perform additional markup manipulation.

- davidhund
- 26 Aug 10, 10:26 pm
- Comment #64

Yup, works fine. Thanks a lot.

I added a regex to remove the XML declaration (although it’s not there when omit-xml-declaration="yes") and a regex to remove the XHTML namespace stuff from the <html> tag.

Taken from here…

Now I need to figure out how to change <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> to <meta charset="UTF-8" /> (I’m not a regex hero).

I guess this is all the markup manipulation that’s needed. Do you think I need more?
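
The three manipulations mentioned in this comment could look roughly like the sketch below. It is not the code from the extension (the linked example is not reproduced in this thread); the function name html5_cleanup_sketch is hypothetical, and the patterns assume the attribute order and spacing that the XSLT processor typically emits.

```php
<?php
// Hypothetical helper; the name is illustrative and not part of the extension.
function html5_cleanup_sketch(string $html): string
{
    // Drop a leading XML declaration, if the stylesheet emitted one.
    $html = preg_replace('/^<\?xml[^>]*\?>\s*/', '', $html, 1);

    // Remove the XHTML namespace and xml:lang attributes from the <html> tag only.
    $html = preg_replace_callback('/<html\b[^>]*>/i', function (array $m): string {
        return preg_replace('/\s+(?:xmlns|xml:lang)="[^"]*"/i', '', $m[0]);
    }, $html, 1);

    // Shorten the Content-Type meta element to the HTML5 charset form.
    $html = preg_replace(
        '/<meta http-equiv="Content-Type" content="text\/html; charset=([a-z0-9-]+)" \/>/i',
        '<meta charset="\1" />',
        $html,
        1
    );

    return $html;
}
```

Note that the forward slashes inside the last pattern are escaped because "/" is also the pattern delimiter.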

- bauhouse
- 11 Nov 10, 7:30 am
- Comment #65

I just ran into the same problem as davidhund. Thanks for the HTML5 Doctype hack, Nick.

I’ve updated the hack to include the html element and the meta element for the charset (which assumes English).

Edit: As far as the regex goes, this was a total guess. If anyone knows better, feel free to fix this. :-)

- bauhouse
- 11 Nov 10, 9:46 am
- Comment #66

I was able to get a colleague to help me with the regex. My fork of Nick’s HTML5 Doctype extension now preserves the language and charset settings of the original HTML document.

- davidhund
- 11 Nov 10, 8:27 pm
- Comment #67

@bauhouse that definitely looks a bit nicer than what I came up with.

The only thing is that, in your code, the XML declaration still remains, correct? Another thing is the namespace decl. on the HTML element, is it stripped in your RegEx?

It’s a bit of a nasty hack really, and I wonder if there’s a performance overhead because of parsing the complete HTML document.

Ideally we would have more control over the output method: I’d like to be able to define a syntax preference (XHTML- or HTML style), indenting (tabs, spaces, how many?), etc. It might be nice to have a base ‘template’ that would be used for this XSLT output method.

Anyway, this is all dreaming… Apart from the indentation bug (it is a bug, right?) I think not much will change in the near future.

- bauhouse
- 12 Nov 10, 3:09 am
- Comment #68

@davidhund: Yes, it is a bit of a nasty hack. Hopefully there isn’t much of a performance hit, because it’s the solution I like the best so far.

I omit the XML declaration with omit-xml-declaration="yes" in the xsl:output instruction.

Thanks for posting your code example. It helped me figure out where I was going wrong with the case insensitive regex match. I neglected to escape forward slashes. Version 1.2.1 is available on GitHub.

- michael-e
- 12 Nov 10, 3:55 am
- Comment #69

Hopefully there isn’t much of a performance hit, because it’s the solution I like the best so far.

Well, parsing the complete output definitely is a problem. rainerborene’s Datestamp Helper extension worked in a similar way, which once caused severe issues for me. He then changed the inner workings, no more parsing the complete output.

- davidhund
- 12 Nov 10, 6:10 am
- Comment #70

michael-e I understand this can become an issue with many nodes such as your issue… However, I don’t see another way to strip/change those elements/attributes. Anybody?

- Nils
- 12 Nov 10, 6:21 am
- Comment #71

As it’s all about changing the beginning of the HTML document, would splitting the output source help? The regular expression could be applied to the first 100 characters of the document, and the result could then be re-joined with the rest of the document.
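
A minimal sketch of this split-and-rejoin idea follows. The function name rewrite_head_sketch and the default of 512 characters are assumptions made for illustration. One thing worth noting: the XHTML 1.0 Strict doctype quoted later in this thread is itself longer than 100 characters, so a 100-character slice would cut it off before the closing ">" and the pattern would not match.

```php
<?php
// Hypothetical helper illustrating "split, rewrite the start, re-join".
// $length is the number of leading characters handed to the regex.
function rewrite_head_sketch(string $html, int $length = 512): string
{
    $head = substr($html, 0, $length);
    // substr() can return false on older PHP versions when $length exceeds
    // the string length, hence the cast.
    $rest = (string) substr($html, $length);

    // Only the leading slice is scanned; the rest is never touched.
    $head = preg_replace('/^<!DOCTYPE [^>]+>/', '<!DOCTYPE html>', $head, 1);

    return $head . $rest;
}
```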

- davidhund
- 12 Nov 10, 7:13 am
- Comment #72

Nils, I was thinking about exactly that (but I am no PHP guru and do not know the inner workings of Symphony).

Would it simply be a matter of limiting the $context['output']; in the extension to 100 chars? This seems dangerous, people might include a lot of other stuff (comments, meta etc.) in the ‘head’.

Maybe we would need to give another argument (instead of the output context) to the callback? Or do we need a whole new/other delegate instead of FrontendOutputPostGenerate altogether?

- bauhouse
- 12 Nov 10, 9:51 am
- Comment #73

michael-e, you’re right. Parsing the complete output is a problem. But, I still need to be able to process the HTML output.

What if we assume that the first four lines of the XSLT generated XHTML document will be similar:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

The intended result should be something like this:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8" />

So, I should only grab the first four lines of the document, then use regex to parse only these lines rather than the complete output.

public function parse_html($context) {
    $html = $context['output'];

    // Split the HTML output into two variables:
    // $html_doctype contains the first four lines of the HTML document
    // $html_doc contains the rest of the HTML document
    $html_array = explode("\n", $html, 5);
    $html_doc = array_pop($html_array);
    $html_doctype = implode("\n", $html_array);

    // Parse the doctype to convert XHTML syntax to HTML5
    $html_doctype = preg_replace("/<!DOCTYPE [^>]+>/", "<!DOCTYPE html>", $html_doctype);
    $html_doctype = preg_replace('/(<html ).*(lang="[a-z]+").*>/i', '\1\2>', $html_doctype);
    // Forward slashes are escaped because "/" is the pattern delimiter
    $html_doctype = preg_replace('/<meta http-equiv="Content-Type" content="text\/html; charset=(.*[a-z0-9-])" \/>/i', '<meta charset="\1" />', $html_doctype);

    // Concatenate the fragments into a complete HTML5 document
    $html = $html_doctype . "\n" . $html_doc;

    $context['output'] = $html;
}

That way, the script will not parse any code examples contained within the document and the regex processing will be confined to the string fragment that needs to be modified. Version 1.2.2 of the HTML5 Doctype extension implements this code.
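
For experimenting with this approach outside Symphony, a stand-alone variant might look like the sketch below. The function name html5_head_sketch and the sample document are assumptions made for illustration; the patterns are the ones from the method above, written with "#" as the delimiter so the forward slashes need no escaping.

```php
<?php
// Stand-alone variant for experimenting outside Symphony. The function name
// is hypothetical.
function html5_head_sketch(string $html): string
{
    // Split off the first four lines; a fifth element holds the remainder.
    $lines = explode("\n", $html, 5);
    $rest = count($lines) === 5 ? array_pop($lines) : null;
    $head = implode("\n", $lines);

    $head = preg_replace('#<!DOCTYPE [^>]+>#', '<!DOCTYPE html>', $head);
    $head = preg_replace('#(<html ).*(lang="[a-z]+").*>#i', '\1\2>', $head);
    $head = preg_replace(
        '#<meta http-equiv="Content-Type" content="text/html; charset=(.*[a-z0-9-])" />#i',
        '<meta charset="\1" />',
        $head
    );

    return $rest === null ? $head : $head . "\n" . $rest;
}

$xhtml = implode("\n", [
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
    '<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">',
    '  <head>',
    '    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />',
    '    <title>Example</title>',
    '  </head>',
    '</html>',
]);

echo html5_head_sketch($xhtml), "\n";
```

One limitation of the lang pattern: [a-z]+ only matches simple codes such as "en". A value such as "en-GB" is not matched, so that html tag would be left unchanged; widening the character class to [a-z-]+ is one possible adjustment.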

- michael-e
- 12 Nov 10, 7:05 pm
- Comment #74

Cool idea, bauhouse!

- davidhund
- 12 Nov 10, 8:35 pm
- Comment #75

Cool, does this take into account any (possible) XSL comments in these first couple of lines? We should probably document/mention this if these could cause a problem.

Also: has anyone already noticed/tested a performance improvement? Theoretically you should notice an improvement in the Profile of the page rendering/generation.

- bauhouse
- 13 Nov 10, 3:24 am
- Comment #76

I’ll have to add some documentation to the extension. Something like this:

XSL Comments

If XSL comments are added to the beginning of the document, it would be necessary to increase the number of lines of text being processed by the regex. In this case, because the limit argument is set to a value of 5, the explode function returns an array of five strings: the first four elements of the array contain each of the first four lines of the HTML output, and the last element contains the rest of the HTML output.

If you wanted to add a couple lines of comments at the beginning of the document, you could accommodate this by increasing the value of the limit argument for the explode function. For example, increase the limit to 10.

 $html_array = explode("\n", $html, 10);

XML Namespace

To preserve the XML namespace declaration on the HTML element, comment out this line:

 // $html_doctype = preg_replace('/(<html ).*(lang="[a-z]+").*>/i', '\1\2>', $html_doctype);

Possible Feature: Preferences Page

I’m not sure if it’s worth creating a preferences page to manage options like this. I think it would be best to keep the code for this extension as sparse as possible.

However, because the extension hijacks the output of every page, it might be good to have a multiple select box to configure which pages to apply this hack to. For example, if you have a page that is meant to output only XML, it would be a waste for this script to run on that page.
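
Short of a preferences page, one lightweight alternative would be to inspect the output itself and skip anything that does not begin with an XHTML doctype. The sketch below is only an illustration of that idea; the function name looks_like_xhtml_sketch is hypothetical and is not part of the extension.

```php
<?php
// Hypothetical guard: report whether the output starts with an XHTML-style
// doctype (one that carries a PUBLIC identifier).
function looks_like_xhtml_sketch(string $output): bool
{
    $prefix = '<!DOCTYPE html PUBLIC';

    // Tolerate leading whitespace, then compare only the first few bytes.
    return strncmp(ltrim($output), $prefix, strlen($prefix)) === 0;
}

// Inside the delegate callback one might then write:
// if (!looks_like_xhtml_sketch($context['output'])) return;
```

With a check like this, a page that outputs plain XML would pass through untouched without needing any per-page configuration.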

- eKoeS
- 13 Nov 10, 3:39 am
- Comment #77

If XSL comments are added to the beginning of the document, it would be necessary to increase the number of lines of text being processed by the regex.

Some thoughts:

Even if there are $n xsl comments, the DOCTYPE will still appear in the first line of the output HTML;
What about looking for the string \n<html? This way you can avoid the whole limit thing by building a regex that performs the replacement until a > character is encountered.

Does it sound right? I’m open to insults! ;D
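
One way to read this suggestion is sketched below: locate the opening html tag, cut the document right after its closing ">", and run the replacements only on that leading part. The function name rewrite_through_html_tag_sketch is hypothetical. As written, the sketch does not reach the meta element, which sits further down in the head.

```php
<?php
// Hypothetical helper: rewrite everything up to and including the opening
// <html ...> tag, and leave the remainder of the document untouched.
function rewrite_through_html_tag_sketch(string $html): string
{
    $start = stripos($html, '<html');
    if ($start === false) {
        return $html; // no html element, nothing to rewrite
    }

    $end = strpos($html, '>', $start);
    if ($end === false) {
        return $html; // malformed tag, leave the output alone
    }

    $head = substr($html, 0, $end + 1);
    $rest = (string) substr($html, $end + 1);

    $head = preg_replace('/<!DOCTYPE [^>]+>/', '<!DOCTYPE html>', $head, 1);
    $head = preg_replace('/(<html ).*(lang="[a-z]+").*>/i', '\1\2>', $head, 1);

    return $head . $rest;
}
```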

- bauhouse
- 13 Nov 10, 4:23 am
- Comment #78

Even if there are $n xsl comments, the DOCTYPE will still appear in the first line of the output HTML;

That was my first thought: who adds comments in front of the doctype? Would it even be possible with an XSLT stylesheet that was configured to output XHTML? Sounds like an edge case to me.

What about looking for the string \n<html? This way you can avoid the whole limit thing by building a regex that performs the replacement until a > character is encountered.

I’m not sure that I follow. The limit thing is a way to avoid processing the entire output.

- eKoeS
- 13 Nov 10, 5:24 am
- Comment #79

That was my first thought: who adds comments in front of the doctype? Would it even be possible with an XSLT stylesheet that was configured to output XHTML? Sounds like an edge case to me.

As far as I know that shouldn’t be possible as the DOCTYPE is automagically added by the parser at the top of the file.

I’m not sure that I follow. The limit thing is a way to avoid processing the entire output.

Yes, sorry I missed that part!

- bauhouse
- 19 Nov 10, 12:16 pm
- Comment #80

Now I’m not liking the HTML5 Doctype extension hack. Unfortunately, it turns Symphony errors into blank pages. While you’re developing, disable the extension.
