Re: [xml] Error on parsing HTML with libxml

2018-08-21 Thread Eric S Eberhard
That would be incorrect behavior for libxml2 -- as Liam and I both said -- you 
have to encode it somehow.  CDATA is one way; entity encoding (e.g. &lt;, &gt;, 
etc.) is another.

I sent you a link.  
https://stackoverflow.com/questions/1398571/html-inside-xml-should-i-use-cdata-or-encode-the-html

Which I believe is the correct answer.  If someone else is making the XML then 
they should fix it.  I also like the "soup" answer and agree.

We have people send invalid XML to our customers all the time ... my customers 
have chosen to make me fix it :-) .  That is what I get paid for so ...

We pre-process all XML files and fix every mistake we know (and the program 
slowly grows) before parsing it.  Examples include attributes without a space 
between the quote and start of next attribute.  It would be wrong for me to ask 
libxml2 to do this -- not on spec.  So I do it.
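
As an illustration of the kind of pre-processing fix described above, here is a 
minimal sketch in JavaScript that inserts the missing space between adjacent 
attributes. The function name fixAttributeSpacing and the regular expressions 
are assumptions made for this example, not the actual pre-processor, and the 
sketch does not skip comments or CDATA sections.

```javascript
// Hypothetical helper: insert a missing space between adjacent attributes,
// e.g. <a href="x"class="y"> becomes <a href="x" class="y">.
// Limitation: tag-like text inside comments or CDATA sections is also
// rewritten, and a quoted value that itself contains ="..." may be changed.
function fixAttributeSpacing(xml) {
  // A start tag: "<", a name character, then any mix of unquoted text and
  // quoted attribute values, up to the closing ">".
  const startTag = /<[A-Za-z_][^<>"']*(?:(?:"[^"]*"|'[^']*')[^<>"']*)*>/g;
  return xml.replace(startTag, (tag) =>
    // A quoted value that is followed directly by the start of a name.
    tag.replace(/(=\s*(?:"[^"]*"|'[^']*'))(?=[A-Za-z_:])/g, "$1 ")
  );
}

console.log(fixAttributeSpacing('<a href="x"class="y">text</a>'));
```

Input that already has the space is returned unchanged, so the step can run 
on every file before it is handed to the parser.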

So if I were you and you have to take the files like this -- then pre-process 
them and fix them with either CDATA or encoding, because I don't think anyone 
else would support the kind of change you are asking for ...
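
One possible shape for such a pre-processing step, sketched in JavaScript 
under stated assumptions. Note that it uses a JavaScript backslash escape 
("<\/") rather than CDATA or entity encoding, and that the function name 
escapeEndTagsInScripts is hypothetical. It assumes the inline scripts are 
JavaScript and that "</" only occurs inside string literals.

```javascript
// Hypothetical pre-processing step: inside each inline <script> element,
// rewrite "</" as "<\/" so the script body no longer contains end-tag-like
// text. In a JavaScript string literal "<\/" and "</" are the same characters.
function escapeEndTagsInScripts(html) {
  // Group 1: the <script ...> start tag, group 2: the script body,
  // group 3: the real </script> end tag (left untouched).
  const scriptElement = /(<script\b[^>]*>)([\s\S]*?)(<\/script\s*>)/gi;
  return html.replace(scriptElement, (match, open, body, close) =>
    open + body.replace(/<\//g, "<\\/") + close
  );
}

const page =
  "<table><tr><td><script>\n" +
  "printwin.document.writeln('</td>');\n" +
  "</script></td></tr></table>";

console.log(escapeEndTagsInScripts(page));
```

Markup outside the script elements is left as it is; only the script bodies 
are rewritten.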

Eric




Eric S Eberhard
VICS (Vertical Integrated Computer Systems)
Voice: 928 567 3529
Cell: 928 301 7537  (not reliable except for text or if not home)
2933 W Middle Verde Rd
Camp Verde, AZ  86322


-Original Message-
From: xml [mailto:xml-boun...@gnome.org] On Behalf Of André Rothe
Sent: Monday, August 20, 2018 12:48 AM
To: xml@gnome.org; Liam R. E. Quin 
Subject: Re: [xml] Error on parsing HTML with libxml

I can't change the source of the HTML page, because the page is generated 
by another system to which I have no access. I only receive the pages from 
there, and our Apache module runs a post-processing step just before the pages 
are sent to the user's browser. That is where I need a parser to change 
something within the page.

So I think libxml should handle this by not parsing the content of inline 
scripts.

There is also a comment on

https://stackoverflow.com/questions/51892455/php-5-4-16-domdocument-removes-parts-of-javascript

which describes your idea with CDATA, but it didn't work.

~André

On 18.08.2018 04:13, Liam R. E. Quin wrote:
> On Fri, 2018-08-17 at 14:42 +0200, André Rothe wrote:
>>
>> https://3v4l.org/O0iEf
> 
> Try changing
> ...writeln('</td>');
> to
> ...writeln('<' + '/td>');
> and see if that helps; or use a CDATA section, <![CDATA[
>   //..
> ]]> to escape the markup from the HTML parser.
> Although it may depend on what the missing //... lines look like, 
> assuming this is not the complete source.
> 
> Better yet, don't use document.write at all, and switch to more modern 
> practices :)
> 
> I'm not sure there's actually a bug here; if you feed the parser tag 
> soup, expect a mess. Keep PHP, JavaScript, HTML, CSS in separate 
> files and life will probably be simpler.

___
xml mailing list, project page  http://xmlsoft.org/ xml@gnome.org 
https://mail.gnome.org/mailman/listinfo/xml




Re: [xml] Error on parsing HTML with libxml

2018-08-20 Thread André Rothe
I have looked into the libxml code and found the function
htmlParseScript() in HTMLparser.c.

https://gitlab.gnome.org/GNOME/libxml2/blob/master/HTMLparser.c

It describes the problem with the "<" character within scripts.
But it offers the possibility to use the recover mode to ignore
the tags.

I have used

xmllint --html --htmlout --recover mypage.html

and the output keeps the last </td> tag. The PHP equivalent does not work
(there is a "recover" flag on the DOMDocument class, but the output is
always the same). So I will look into the DOMDocument code (if it is
available).

~André

On 18.08.2018 00:33, Eric S Eberhard wrote:
> I could be way off base -- don't you have to encode the portions in the js?  
> Otherwise I can see it being confused.  The js looks like data and it can't 
> have < or > in it.
> 
> https://stackoverflow.com/questions/1398571/html-inside-xml-should-i-use-cdata-or-encode-the-html
> 
> Eric
> 
> 
> Eric S Eberhard
> VICS (Vertical Integrated Computer Systems)
> Voice: 928 567 3529
> Cell: 928 301 7537  (not reliable except for text or if not home)
> 2933 W Middle Verde Rd
> Camp Verde, AZ  86322
> 
> 
> -Original Message-
> From: xml [mailto:xml-boun...@gnome.org] On Behalf Of André Rothe
> Sent: Friday, August 17, 2018 5:43 AM
> To: xml@gnome.org
> Subject: [xml] Error on parsing HTML with libxml
> 
> Hi,
> 
> I ran into an HTML parser problem during PHP development. The DOMDocument 
> class uses libxml2 to parse HTML and XML documents. I found a problem with 
> HTML documents that contain inline JavaScript code with HTML tags inside 
> JavaScript string variables.
> 
> There is a little code example, which shows the problem:
> 
> https://3v4l.org/O0iEf
> 
> As you can see there, the last </td> tag is lost in the output.
> I get exactly the same error with xmllint:
> 
> xmllint --html --htmlout /tmp/page.html
> 
> where page.html contains the HTML part of the example code above. The output 
> is
> 
> page.html:11: HTML parser error : Unexpected end tag : td
> printwin.document.writeln('</td>');
> 
> and within the output, the String will be empty:
> 
> printwin.document.writeln('');
> 
> So I think, that the PHP error comes from the error within libxml2. I use 
> libxml2 version 2.9.1.
> 
> Is it possible to fix that or is it already fixed within a newer version?
> 
> Best regards
> André
> 



Re: [xml] Error on parsing HTML with libxml

2018-08-20 Thread André Rothe
I can't change the source of the HTML page, because the page is
generated by another system to which I have no access. I only receive the
pages from there, and our Apache module runs a post-processing step just
before the pages are sent to the user's browser. That is where I need a
parser to change something within the page.

So I think libxml should handle this by not parsing the content of inline
scripts.

There is also a comment on

https://stackoverflow.com/questions/51892455/php-5-4-16-domdocument-removes-parts-of-javascript

which describes your idea with CDATA, but it didn't work.

~André

On 18.08.2018 04:13, Liam R. E. Quin wrote:
> On Fri, 2018-08-17 at 14:42 +0200, André Rothe wrote:
>>
>> https://3v4l.org/O0iEf
> 
> Try changing
> ...writeln('</td>');
> to
> ...writeln('<' + '/td>');
> and see if that helps; or use a CDATA section, <![CDATA[
>   //..
> ]]> to escape the markup from the HTML parser.
> Although it may depend on what the missing //... lines look like,
> assuming this is not the complete source.
> 
> Better yet, don't use document.write at all, and switch to more modern
> practices :)
> 
> I'm not sure there's actually a bug here; if you feed the parser tag
> soup, expect a mess. Keep PHP, JavaScript, HTML, CSS in separate files
> and life will probably be simpler.



Re: [xml] Error on parsing HTML with libxml

2018-08-17 Thread Liam R E Quin
On Fri, 2018-08-17 at 14:42 +0200, André Rothe wrote:
> 
> https://3v4l.org/O0iEf

Try changing
...writeln('</td>');
to
...writeln('<' + '/td>');
and see if that helps; or use a CDATA section, <![CDATA[
  //..
]]> to escape the markup from the HTML parser.
Although it may depend on what the missing //... lines look like,
assuming this is not the complete source.
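
The split-string suggestion can be checked in isolation. The sketch below is 
only an illustration: the stand-in writeln() and the variants table are 
assumptions for the example, and the third variant (a backslash escape) is a 
common alternative that was not part of the original suggestion.

```javascript
// Script source variants as they would appear inside <script>...</script>.
const variants = {
  literal: "writeln('</td>');",
  split: "writeln('<' + '/td>');",
  escaped: "writeln('<\\/td>');", // the source text is: writeln('<\/td>');
};

// Run a snippet with a stand-in writeln() and return what it wrote.
function run(source) {
  const written = [];
  new Function("writeln", source)((s) => written.push(s));
  return written.join("");
}

for (const [name, source] of Object.entries(variants)) {
  console.log(name, "| source contains '</':", source.includes("</"),
              "| writes:", run(source));
}
```

All three variants write the same string at run time, but only the first one 
contains the character sequence "</" in its source text, which is what an 
HTML parser reading the script body reacts to.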

Better yet, don't use document.write at all, and switch to more modern
practices :)

I'm not sure there's actually a bug here; if you feed the parser tag
soup, expect a mess. Keep PHP, JavaScript, HTML, CSS in separate files
and life will probably be simpler.

Liam



-- 
Liam Quin - web slave for https://www.fromoldbooks.org/
with fabulous vintage art and fascinating texts to read.




Re: [xml] Error on parsing HTML with libxml

2018-08-17 Thread Eric S Eberhard
I could be way off base -- don't you have to encode the portions in the js?  
Otherwise I can see it being confused.  The js looks like data and it can't 
have < or > in it.

https://stackoverflow.com/questions/1398571/html-inside-xml-should-i-use-cdata-or-encode-the-html

Eric


Eric S Eberhard
VICS (Vertical Integrated Computer Systems)
Voice: 928 567 3529
Cell: 928 301 7537  (not reliable except for text or if not home)
2933 W Middle Verde Rd
Camp Verde, AZ  86322


-Original Message-
From: xml [mailto:xml-boun...@gnome.org] On Behalf Of André Rothe
Sent: Friday, August 17, 2018 5:43 AM
To: xml@gnome.org
Subject: [xml] Error on parsing HTML with libxml

Hi,

I ran into an HTML parser problem during PHP development. The DOMDocument 
class uses libxml2 to parse HTML and XML documents. I found a problem with 
HTML documents that contain inline JavaScript code with HTML tags inside 
JavaScript string variables.

There is a little code example, which shows the problem:

https://3v4l.org/O0iEf

As you can see there, the last </td> tag is lost in the output.
I get exactly the same error with xmllint:

xmllint --html --htmlout /tmp/page.html

where page.html contains the HTML part of the example code above. The output is

page.html:11: HTML parser error : Unexpected end tag : td
printwin.document.writeln('</td>');

and within the output, the String will be empty:

printwin.document.writeln('');

So I think, that the PHP error comes from the error within libxml2. I use 
libxml2 version 2.9.1.

Is it possible to fix that or is it already fixed within a newer version?

Best regards
André
