Re: [xml] Error on parsing HTML with libxml

Eric S Eberhard Tue, 21 Aug 2018 12:56:35 -0700

That would be incorrect behavior for libxml2 -- as Liam and I both said -- you 
have to encode somehow.  CDATA is one way; entity encoding (e.g. &lt;, &gt;, 
etc.) is another.

I sent you a link.  
https://stackoverflow.com/questions/1398571/html-inside-xml-should-i-use-cdata-or-encode-the-html

Which I believe is the correct answer.  If someone else is making the XML then 
they should fix it.  I also like the "soup" answer and agree.

We have people send invalid XML to our customers all the time ... my customers 
have chosen to make me fix it :-) .  That is what I get paid for so ...

We pre-process all XML files and fix every mistake we know (and the program 
slowly grows) before parsing it.  Examples include attributes without a space 
between the quote and start of next attribute.  It would be wrong for me to ask 
libxml2 to do this -- not on spec.  So I do it.
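
A rough illustration of that kind of pre-processing step, as a Python sketch. 
The function name and the regular expressions are assumptions made for this 
example, not the actual program described above, and it ignores corner cases 
such as a ">" inside an attribute value.

```python
import re

# A start tag (or XML declaration) up to the next '>'. Assumption: no
# attribute value contains '<' or '>'.
TAG_RE = re.compile(r"<[A-Za-z_:?][^<>]*>")
# One complete attribute: name = "value" or name = 'value'.
ATTR_RE = re.compile(r"""[A-Za-z_:][\w:.\-]*\s*=\s*(?:"[^"]*"|'[^']*')""")
NAME_START_RE = re.compile(r"[A-Za-z_:]")


def _space_after_attr(m):
    # If the closing quote is followed directly by the start of another
    # attribute name, insert the space the XML spec requires.
    nxt = m.string[m.end():m.end() + 1]
    return m.group(0) + (" " if NAME_START_RE.match(nxt) else "")


def fix_missing_attr_spaces(text):
    """Insert a space between attributes written as a="1"b="2"."""
    return TAG_RE.sub(lambda t: ATTR_RE.sub(_space_after_attr, t.group(0)), text)


if __name__ == "__main__":
    print(fix_missing_attr_spaces('<a href="x=1"id="y">t</a>'))
```

Matching each attribute as a whole (name plus quoted value) keeps the fix from 
firing on an "=" that sits inside a value, as in href="x=1".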

So if I were you and you have to take the files like this -- then pre-process 
them and fix them with either CDATA or encoding, because I don't think anyone 
else would support the kind of change you are asking for ...
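
The two options can be sketched in Python for the case of an HTML fragment 
carried inside an XML element. The helper names and the <item> wrapper are 
made up for this example; the check uses the standard-library XML parser, not 
libxml2, only to show that both forms give back the same text.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_escaped(fragment):
    # Entity encoding: & becomes &amp;, < becomes &lt;, > becomes &gt;.
    return escape(fragment)


def as_cdata(fragment):
    # A CDATA section cannot contain "]]>", so split it across two sections.
    return "<![CDATA[" + fragment.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def roundtrip(payload):
    # Wrap the payload in a hypothetical <item> element and read the text back.
    return ET.fromstring("<item>" + payload + "</item>").text


if __name__ == "__main__":
    html = '<td class="x">a & b</td>'
    print(roundtrip(as_escaped(html)) == html)
    print(roundtrip(as_cdata(html)) == html)
```

Either form is well-formed XML; which one to use is mostly a matter of how 
readable the embedded markup needs to stay.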

Eric

Eric S Eberhard
VICS (Vertical Integrated Computer Systems)
Voice: 928 567 3529
Cell    : 928 301 7537  (not reliable except for text or if not home)
2933 W Middle Verde Rd
Camp Verde, AZ  86322

-----Original Message-----
From: xml [mailto:[email protected]] On Behalf Of André Rothe
Sent: Monday, August 20, 2018 12:48 AM
To: [email protected]; Liam R. E. Quin <[email protected]>
Subject: Re: [xml] Error on parsing HTML with libxml

I can't change the source of the HTML page, because the page will be generated 
by another system, where I don't have access. I get only the pages from there 
and our Apache module makes a post-processing step just before the pages will 
be sent to the user's browser. And there I need a parser to change something 
within the page.

So I think libxml should not parse the content of inline scripts; that would 
handle this case.

There is also a comment on

https://stackoverflow.com/questions/51892455/php-5-4-16-domdocument-removes-parts-of-javascript

which describes your idea with CDATA, but it didn't work.

~André

On 18.08.2018 04:13, Liam R. E. Quin wrote:
> On Fri, 2018-08-17 at 14:42 +0200, André Rothe wrote:
>>
>> https://3v4l.org/O0iEf
> 
> Try changing
>     ...writeln('</td>');
> to
>     ...writeln('<' + '/td>');
> and see if that helps; or use a CDATA section, <script><![CDATA[
>   //..
> ]]></script> to escape the </td> markup from the HTML parser.
> Although it may depend on what the missing //... lines look like, 
> assuming this is not the complete source.
> 
> Better yet, don't use document.write at all, and switch to more modern 
> practices :)
> 
> I'm not sure there's actually a bug here; if you feed the parser tag 
> soup, expect a mess. Keep PHP, JavaScript, HTML, CSS in separate 
> files and life will probably be simpler.
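
When the page source cannot be edited, a variant of the first suggestion above 
could be applied as a pre-processing step before the page reaches the parser. 
The Python sketch below is an assumption, not something proposed in the 
thread: it rewrites "</" inside inline script bodies to "<\/", which a 
JavaScript string literal reads the same way as "</". The function name is 
hypothetical, and scripts that use "</" outside string or regex literals would 
need more care.

```python
import re

# Opening tag, body, closing tag of an inline script. The non-greedy body
# stops at the first </script>, which is also where an HTML parser stops.
SCRIPT_RE = re.compile(r"(<script\b[^>]*>)(.*?)(</script\s*>)",
                       re.IGNORECASE | re.DOTALL)


def escape_script_close_tags(page):
    """Rewrite '</' inside script bodies so the HTML parser sees no end tags."""
    def fix(m):
        body = m.group(2).replace("</", "<\\/")
        return m.group(1) + body + m.group(3)
    return SCRIPT_RE.sub(fix, page)


if __name__ == "__main__":
    src = "<td><script>document.writeln('</td>');</script></td>"
    print(escape_script_close_tags(src))
```

Markup outside the script element is left untouched, so the real </td> end 
tags in the page still reach the parser unchanged.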

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/ [email protected] 
https://mail.gnome.org/mailman/listinfo/xml
