Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-23 Thread Nick Wellnhofer

On 23/01/2019 16:14, Tomi Belan wrote:
> I don't know too much about Python's C API, but [2] [3] suggest lxml is using
> a deprecated macro and giving libxml2 a multibyte buffer even though the input
> would fit into pure ASCII. This explains why it behaved differently than
> xmllint.


Right, if Python passes ASCII codes as, say, 16-bit integers, this will be 
detected as UTF-16 by libxml2 and encoding conversion will happen behind the 
scenes. I'm not sure what would happen with an encoding that isn't Unicode 
compatible. Maybe there's a bug lurking in lxml.


> It would be good to add some tests to decrease the likelihood that
> this issue or something similar happens again.


Yes, that would be nice. But it was only a short-lived regression that I 
personally don't want to spend more time on. A UTF-16 test case derived from 
either your or the Chromium bug report would probably make most sense.


Nick
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-23 Thread Tomi Belan via xml
On Wed, Jan 23, 2019 at 12:55 PM Nick Wellnhofer wrote:

> The commit obviously also affected documents that didn't need encoding
> conversion. I didn't realize that.


Aha! I noticed that the chromium link you sent mentions a >32KB string
which gets converted to a >64KB string, which sounded suspiciously similar.
Looks like lxml's feed() function [1] is doing the same thing. I don't know
too much about Python's C API, but [2] [3] suggest lxml is using a
deprecated macro and giving libxml2 a multibyte buffer even though the
input would fit into pure ASCII. This explains why it behaved differently
than xmllint.

[1] https://github.com/lxml/lxml/blob/master/src/lxml/parser.pxi#L1242
[2]
https://stackoverflow.com/questions/26079392/how-is-unicode-represented-internally-in-python
[3] https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AS_DATA

I also noticed that feed() is doing something special with the first 4
bytes, giving them to _htmlCtxtResetPush() instead of htmlParseChunk(). So
the discussion about buffer boundaries might be slightly incorrect.
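
A minimal sketch of the size doubling and of what the first 4 bytes look like, using only the Python standard library. It assumes the buffer handed to libxml2 is 2 bytes per character (modelled here as UTF-16LE without a BOM); the real width depends on the Python build. `sample` is a made-up ASCII document, not bad.html.

```python
# Hypothetical ASCII-only input, a bit over 32 KiB like the string
# mentioned in the Chromium report.
sample = "<html><script>" + "x" * 32768 + "</script></html>"

ascii_bytes = sample.encode("ascii")
# Assumption: a 2-byte-per-character buffer, modelled as UTF-16LE.
wide_bytes = sample.encode("utf-16-le")

# The wide buffer is exactly twice as long as the ASCII one.
print(len(ascii_bytes), len(wide_bytes))

# The first 4 bytes differ: in the wide buffer every ASCII code is
# followed by a zero byte, which is the kind of pattern an encoding
# detector can pick up on.
print(ascii_bytes[:4])
print(wide_bytes[:4])
```

So even a pure-ASCII document can end up on the encoding-conversion path if it arrives as a wide buffer.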

> At least we know that the issue is isolated
> to 2.9.8. Thanks for your efforts!

Yes, thank you. Now it's clear that my immediate issue is solved and
version 2.9.9 works. So I probably won't look into this much further.

I guess it's up to you to decide what to do next, and if any libxml2
changes are needed. It would be good to add some tests to decrease the
likelihood that this issue or something similar happens again. For that,
you might still need to isolate the root cause further, and create a pure C
test case. (Maybe based on a test case from chromium instead of mine.) But
of course it's up to you to determine the priority of that. Thanks again
for your help, and good luck if you decide to continue.

Tomi


Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-23 Thread Nick Wellnhofer

On 23/01/2019 01:47, Tomi Belan wrote:
> But even so I still wasn't able to reproduce it in pure C. Could it be
> because xmllint reads ctxt->myDoc, and lxml uses SAX2 event handlers
> (according to parsertarget.pxi)? AFAICT xmllint's --push and --sax options are
> incompatible.


ctxt->myDoc is also built via internal SAX2 handlers, so I'm not sure what's 
going on exactly.


> I had more luck with git bisect. Using a dynamically linked build of lxml, and
> pointing LD_LIBRARY_PATH to libxml2/.libs/, I successfully found out that the
> bug was:
> - introduced by
>   https://github.com/GNOME/libxml2/commit/6e6ae5daa6cd9640c9a83c1070896273e9b30d14
> - fixed(?) by
>   https://github.com/GNOME/libxml2/commit/7a1bd7f6497ac33a9023d556f6f47a48f01deac0


The first commit was an attempt to fix an (ICU-related?) issue but it turned 
out to be buggy. It's unfortunate that the commit made it into 2.9.8.


https://mail.gnome.org/archives/xml/2018-January/msg3.html
https://bugs.chromium.org/p/chromium/issues/detail?id=820163

> I hope that's meaningful to you, because I have no idea what those commits
> are doing or how they could be related to this bug... The commits sound
> related to character encoding, but bad.html is plain ASCII...


The commit obviously also affected documents that didn't need encoding 
conversion. I didn't realize that. At least we know that the issue is isolated 
to 2.9.8. Thanks for your efforts!


Nick



Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Tomi Belan via xml
Thanks, that's very useful!

With a dynamically linked build of lxml, I used "ltrace" to see the calls
to libxml2. Looks like you're correct: there is only one call to
htmlParseChunk with the whole content (followed by a zero-length call to
terminate the input). But even so I still wasn't able to reproduce it in
pure C. Could it be because xmllint reads ctxt->myDoc, and lxml uses SAX2
event handlers (according to parsertarget.pxi)? AFAICT xmllint's --push and
--sax options are incompatible.

I had more luck with git bisect. Using a dynamically linked build of lxml,
and pointing LD_LIBRARY_PATH to libxml2/.libs/, I successfully found out
that the bug was:
- introduced by
https://github.com/GNOME/libxml2/commit/6e6ae5daa6cd9640c9a83c1070896273e9b30d14
- fixed(?) by
https://github.com/GNOME/libxml2/commit/7a1bd7f6497ac33a9023d556f6f47a48f01deac0

I hope that's meaningful to you, because I have no idea what those
commits are doing or how they could be related to this bug... The commits
sound related to character encoding, but bad.html is plain ASCII...

Tomi

On Tue, Jan 22, 2019 at 7:56 PM Nick Wellnhofer wrote:

> On 22/01/2019 19:11, Tomi Belan wrote:
> > I tried to reproduce it with only xmllint as you suggest, but I'm not
> > having much luck. It produces correct results with "--html --debug
> > bad.html", "--html --debug --stream bad.html", "--html --debug --push
> > bad.html", and "--html --debug --sax bad.html".
> >
> > Maybe I'm just not using the right flags - I don't know if lxml uses SAX
> > mode, or streaming, etc. But at this point I wouldn't be too surprised if
> > it depended on the size of some internal input buffer that's different in
> > lxml vs xmllint. I'd welcome any advice about what else I should try, or
> > how can I find out what calls are being made from lxml to libxml2.
>
> From a quick look at the lxml source, it seems that the `feed` method of
> HTMLParser calls htmlParseChunk, so you should pass `--html --push` to
> xmllint. But if it's a buffer boundary issue, you might have to recreate
> the exact chunk sizes to reproduce the problem. lxml seems to split into
> chunks of size INT_MAX, meaning a single chunk in most cases. xmllint first
> passes a chunk of 4 bytes, then splits the remaining data into chunks of
> 4096 bytes. But maybe I'm missing something. To be sure, you could run your
> Python code under a debugger like gdb and set a break point on
> htmlParseChunk. Also break on htmlCtxtUseOptions to see which parser
> options are used exactly.
>
> You could also start experimenting with feeding chunks of different sizes
> in your Python script or with a small C program that calls htmlParseChunk
> in the same way as lxml, presumably writing a single chunk. You could also
> try to add 4 bytes somewhere at the beginning of `bad.html` and see if it
> helps with reproducing the issue using xmllint.
>
> > Other than that: It's not ideal, but could you please check if you can
> > also reproduce the bug with the first set of commands I posted? Just to
> > verify it's not just me.
>
> Yes, I can try.
>
> Nick


Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Nick Wellnhofer

On 22/01/2019 19:11, Tomi Belan wrote:
> I tried to reproduce it with only xmllint as you suggest, but I'm not having
> much luck. It produces correct results with "--html --debug bad.html", "--html
> --debug --stream bad.html", "--html --debug --push bad.html", and "--html
> --debug --sax bad.html".
>
> Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode,
> or streaming, etc. But at this point I wouldn't be too surprised if it
> depended on the size of some internal input buffer that's different in lxml vs
> xmllint. I'd welcome any advice about what else I should try, or how can I
> find out what calls are being made from lxml to libxml2.


From a quick look at the lxml source, it seems that the `feed` method of 
HTMLParser calls htmlParseChunk, so you should pass `--html --push` to 
xmllint. But if it's a buffer boundary issue, you might have to recreate the 
exact chunk sizes to reproduce the problem. lxml seems to split into chunks of 
size INT_MAX, meaning a single chunk in most cases. xmllint first passes a 
chunk of 4 bytes, then splits the remaining data into chunks of 4096 bytes. 
But maybe I'm missing something. To be sure, you could run your Python code 
under a debugger like gdb and set a break point on htmlParseChunk. Also break 
on htmlCtxtUseOptions to see which parser options are used exactly.


You could also start experimenting with feeding chunks of different sizes in 
your Python script or with a small C program that calls htmlParseChunk in the 
same way as lxml, presumably writing a single chunk. You could also try to add 
4 bytes somewhere at the beginning of `bad.html` and see if it helps with 
reproducing the issue using xmllint.
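
As a rough illustration of the chunk-size experiment, the following sketch splits a byte string the way xmllint's push mode is described above (4 bytes first, then 4096-byte chunks) and lists where the chunk boundaries fall. The names `push_chunks` and `boundaries` and the sample `data` are hypothetical, made up for this example; only the 4 and 4096 sizes come from the description above.

```python
def push_chunks(data: bytes, first: int = 4, size: int = 4096):
    """Yield one initial chunk of `first` bytes, then `size`-byte chunks."""
    if data[:first]:
        yield data[:first]
    for start in range(first, len(data), size):
        yield data[start:start + size]


def boundaries(chunks):
    """Return the byte offsets at which one chunk ends and the next begins."""
    offsets, pos = [], 0
    for chunk in chunks:
        pos += len(chunk)
        offsets.append(pos)
    return offsets[:-1]  # the last offset is the end of input, not a boundary


# Hypothetical stand-in for bad.html; the real file is not reproduced here.
data = b"<html><script>" + b"x" * 10000 + b"</script></html>"

xmllint_like = list(push_chunks(data))
single_chunk = [data]  # roughly what lxml seems to do for typical inputs

print(boundaries(xmllint_like))
print(boundaries(single_chunk))
```

Each chunk could then be passed to a push parser in turn, varying `first` and `size` to check whether a particular boundary position makes the problem appear or disappear.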


> Other than that: It's not ideal, but could you please check if you can also
> reproduce the bug with the first set of commands I posted? Just to verify it's
> not just me.


Yes, I can try.

Nick


Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Tomi Belan via xml
I also built lxml 4.2.5 with pristine libxml2 2.9.8 (using a variation of
the above command), and got the same results. So I don't think it's a
distro specific problem.

I tried to reproduce it with only xmllint as you suggest, but I'm not
having much luck. It produces correct results with "--html --debug
bad.html", "--html --debug --stream bad.html", "--html --debug --push
bad.html", and "--html --debug --sax bad.html".

Maybe I'm just not using the right flags - I don't know if lxml uses SAX
mode, or streaming, etc. But at this point I wouldn't be too surprised if
it depended on the size of some internal input buffer that's different in
lxml vs xmllint. I'd welcome any advice about what else I should try, or
how can I find out what calls are being made from lxml to libxml2.

Other than that: It's not ideal, but could you please check if you can also
reproduce the bug with the first set of commands I posted? Just to verify
it's not just me.

Tomi

On Tue, Jan 22, 2019 at 5:11 PM Nick Wellnhofer wrote:

> On 22/01/2019 15:43, Tomi Belan via xml wrote:
> > After a lot of debugging, I determined the problem is in libxml2 and not
> > the other libraries in my stack, and that it only seems to happen on
> > version 2.9.8. But I don't see any related changes in news.html for 2.9.9,
> > nor in the diff between them, so I am still worried: I don't know if the
> > bug is really fixed, or just dormant. I hope you can find the root cause,
> > and maybe add a regression test if you do.
>
> I also don't see any directly related changes in either 2.9.8 or 2.9.9.
>
> > This will download the manylinux binary build of lxml 4.2.5, which is
> > statically linked to libxml2 2.9.8.
>
> Are you sure that a pristine 2.9.8 build was used? Maybe there are
> additional patches added by a distro?
>
> > I couldn't shorten the file very much, because if I delete even a single
> > character, the bug stops triggering. (Could it be some buffer boundary
> > issue?)
>
> Yes, a buffer boundary issue seems likely.
>
> > I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not
> > affected. So I believe this is a bug in libxml2 2.9.8 specifically, and
> > not in a particular version of lxml.
>
> Did you also try your own build with the official libxml2 2.9.8 sources?
>
> > I hope you can solve the mystery. Please let me know if I can be of any
> > help.
>
> It would help if you could reproduce the issue with xmllint and no Python
> code involved. git-bisect might also be useful.
>
> Nick


Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8

2019-01-22 Thread Nick Wellnhofer

On 22/01/2019 15:43, Tomi Belan via xml wrote:
> After a lot of debugging, I determined the problem is in libxml2 and not the
> other libraries in my stack, and that it only seems to happen on version
> 2.9.8. But I don't see any related changes in news.html for 2.9.9, nor in the
> diff between them, so I am still worried: I don't know if the bug is really
> fixed, or just dormant. I hope you can find the root cause, and maybe add a
> regression test if you do.


I also don't see any directly related changes in either 2.9.8 or 2.9.9.

> This will download the manylinux binary build of lxml 4.2.5, which is
> statically linked to libxml2 2.9.8.


Are you sure that a pristine 2.9.8 build was used? Maybe there are additional 
patches added by a distro?


> I couldn't shorten the file very much, because if I delete even a single
> character, the bug stops triggering. (Could it be some buffer boundary issue?)


Yes, a buffer boundary issue seems likely.

> I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected. So
> I believe this is a bug in libxml2 2.9.8 specifically, and not in a particular
> version of lxml.


Did you also try your own build with the official libxml2 2.9.8 sources?

> I hope you can solve the mystery. Please let me know if I can be of any help.


It would help if you could reproduce the issue with xmllint and no Python code 
involved. git-bisect might also be useful.


Nick