Re: [xml] Error on parsing HTML with libxml
I have looked into the libxml code and found the function htmlParseScript()
in HTMLparser.c:

https://gitlab.gnome.org/GNOME/libxml2/blob/master/HTMLparser.c

It describes the problem with the "<" character within scripts, but it also
offers the possibility to use the recover mode to ignore the tags. I have used

xmllint --html -htmlout --recover mypage.html

and it returns the last tag. The PHP equivalent does not work (there is a
"recover" flag on the DOMDocument class, but the output is always the same).
So I will look into the DOMDocument code (if it is available).

~André

On 18.08.2018 00:33, Eric S Eberhard wrote:
> I could be way off base -- don't you have to encode the portions in the js?
> Otherwise I can see it being confused. The js looks like data and it can't
> have < or > in it.
>
> https://stackoverflow.com/questions/1398571/html-inside-xml-should-i-use-cdata-or-encode-the-html
>
> Eric
>
>
> Eric S Eberhard
> VICS (Vertical Integrated Computer Systems)
> Voice: 928 567 3529
> Cell: 928 301 7537 (not reliable except for text or if not home)
> 2933 W Middle Verde Rd
> Camp Verde, AZ 86322
>
>
> -----Original Message-----
> From: xml [mailto:xml-boun...@gnome.org] On Behalf Of André Rothe
> Sent: Friday, August 17, 2018 5:43 AM
> To: xml@gnome.org
> Subject: [xml] Error on parsing HTML with libxml
>
> Hi,
>
> I run into an HTML parser problem during PHP development. There is a class
> DOMDocument, which uses libxml2 to parse HTML and XML documents. I found out,
> that there is a problem with HTML documents, which have inline Javascript
> code, which uses HTML tags within Javascript String variables.
>
> There is a little code example, which shows the problem:
>
> https://3v4l.org/O0iEf
>
> As you can see there, the last tag is lost within the output.
> Exactly the same error I will get with xmllint:
>
> xmllint --html --htmlout /tmp/page.html
>
> where page.html contains the HTML part of the example code above. The output
> is
>
> page.html:11: HTML parser error : Unexpected end tag : td
> printwin.document.writeln('</td>');
>
> and within the output, the String will be empty:
>
> printwin.document.writeln('');
>
> So I think, that the PHP error comes from the error within libxml2. I use
> libxml2 version 2.9.1.
>
> Is it possible to fix that or is it already fixed within a newer version?
>
> Best regards
> André

___
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
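To make the recover mode mentioned at the top of this message more concrete, here is a simplified model in JavaScript. It is not libxml2 code. The function scriptContentEnd is hypothetical, and its two rules are an assumption based on a reading of the comment in htmlParseScript(): without recovery the script content stops at the first "</" followed by a letter, with recovery it stops only at an end tag that matches the current element.

```javascript
// Simplified model, NOT libxml2 code: find where a <script> body ends.
// `body` is the text that follows the opening <script> tag.
function scriptContentEnd(body, recover) {
  for (let i = 0; i + 1 < body.length; i++) {
    if (body[i] !== '<' || body[i + 1] !== '/') continue;
    const rest = body.slice(i + 2);
    if (recover) {
      // Assumed recovery rule: only an end tag that matches the current
      // element (here: script) stops the scan.
      if (rest.toLowerCase().startsWith('script')) return i;
    } else if (/^[A-Za-z]/.test(rest)) {
      // Assumed strict rule: any "</" followed by a letter ends the content.
      return i;
    }
  }
  return body.length;
}

const body = "printwin.document.writeln('</td>');</script>";
console.log(body.slice(0, scriptContentEnd(body, false)));
console.log(body.slice(0, scriptContentEnd(body, true)));
```

In this model the first call cuts the script off right before the "</td>" inside the string, which would match the "Unexpected end tag : td" message, while the second call keeps the whole statement up to "</script>".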
Re: [xml] Error on parsing HTML with libxml
I can't change the source of the HTML page, because the page is generated by
another system to which I have no access. I only get the pages from there,
and our Apache module runs a post-processing step just before the pages are
sent to the user's browser. That is where I need a parser to change something
within the page.

So I think libxml should not parse the content of inline scripts at all.

There is also a comment on

https://stackoverflow.com/questions/51892455/php-5-4-16-domdocument-removes-parts-of-javascript

which describes your idea with CDATA, but it didn't work.

~André

On 18.08.2018 04:13, Liam R. E. Quin wrote:
> On Fri, 2018-08-17 at 14:42 +0200, André Rothe wrote:
>>
>> https://3v4l.org/O0iEf
>
> Try changing
> ...writeln('</td>');
> to
> ...writeln('<' + '/td>');
> and see if that helps; or use a CDATA section,
> <![CDATA[
> //..
> ]]> to escape the markup from the HTML parser.
> Although it may depend on what the missing //... lines look like,
> assuming this is not the complete source.
>
> Better yet, don't use document.write at all, and switch to more modern
> practices :)
>
> I'm not sure there's actually a bug here; if you feed the parser tag
> soup, expect a mess. Keep PHP, JavaScript, HTML, CSS in separate files
> and life will probably be simpler.
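Since the page source cannot be changed, the string trick Liam suggests could in principle be applied in the post-processing step itself, before the page is handed to the parser. The sketch below shows the idea in JavaScript purely for illustration (the Apache module in question is not written in it). escapeScriptEndTags is a hypothetical helper, and it assumes that every "</" inside an inline script sits in a string literal, where "<\/" means the same thing to JavaScript. It has not been tried against libxml2 2.9.1.

```javascript
// Hypothetical helper: rewrite "</" as "<\/" inside inline <script> bodies
// so that an HTML parser no longer sees end tags there.
function escapeScriptEndTags(html) {
  // Match an opening <script ...> tag, its body (non-greedy), and the
  // first </script> closing tag that follows.
  const scriptBlock = /(<script\b[^>]*>)([\s\S]*?)(<\/script\s*>)/gi;
  return html.replace(scriptBlock, (match, open, body, close) =>
    open + body.replace(/<\//g, '<\\/') + close
  );
}

// Made-up page fragment in the spirit of the example under discussion.
const page =
  '<table><tr><td>' +
  '<script>printwin.document.writeln(\'</td>\');</script>' +
  '</td></tr></table>';

console.log(escapeScriptEndTags(page));
```

Only the text between the script tags is touched, so the real closing tags of the table stay as they are. A regular expression is a blunt tool for HTML, so this would need checking against the real pages.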
[xml] Error on parsing HTML with libxml
Hi,

I ran into an HTML parser problem during PHP development. The DOMDocument
class uses libxml2 to parse HTML and XML documents. I found that there is a
problem with HTML documents that contain inline JavaScript code which uses
HTML tags within JavaScript string variables.

Here is a little code example which shows the problem:

https://3v4l.org/O0iEf

As you can see there, the last tag is lost in the output. I get exactly the
same error with xmllint:

xmllint --html --htmlout /tmp/page.html

where page.html contains the HTML part of the example code above. The output
is

page.html:11: HTML parser error : Unexpected end tag : td
printwin.document.writeln('</td>');

and within the output, the string is empty:

printwin.document.writeln('');

So I think the PHP error comes from the error within libxml2. I use libxml2
version 2.9.1.

Is it possible to fix that, or is it already fixed in a newer version?

Best regards
André