Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
On 23/01/2019 16:14, Tomi Belan wrote:

> I don't know too much about Python's C API, but [2] [3] suggests lxml is using a deprecated macro and giving libxml2 a multibyte buffer even though the input would fit into pure ASCII. This explains why it behaved differently than xmllint.

Right, if Python passes ASCII codes as, say, 16-bit integers, this will be detected as UTF-16 by libxml2 and encoding conversion will happen behind the scenes. I'm not sure what would happen with an encoding that isn't Unicode compatible. Maybe there's a bug lurking in lxml.

> It would be good to add some tests to decrease the likelihood that this issue or something similar happens again.

Yes, that would be nice. But it was only a short-lived regression that I personally don't want to spend more time on. A UTF-16 test case derived from either your or the Chromium bug report would probably make most sense.

Nick

___
xml mailing list, project page http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml
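To make the point about 16-bit integers concrete, here is a minimal sketch. It is not taken from lxml or libxml2, and the helper name `widen_to_16bit` is made up for illustration. It only shows that ASCII text stored as 16-bit little-endian integers is byte-for-byte identical to its UTF-16LE encoding and twice as long, which would be consistent with the >32KB to >64KB growth mentioned elsewhere in this thread.

```python
def widen_to_16bit(ascii_text: str) -> bytes:
    """Store each ASCII code as a 16-bit little-endian integer (illustrative helper)."""
    return b"".join(ord(ch).to_bytes(2, "little") for ch in ascii_text)


sample = "<script>var x = 1;</script>"
wide = widen_to_16bit(sample)

# The widened buffer is indistinguishable from UTF-16LE text.
print(wide == sample.encode("utf-16-le"))
# Every second byte is zero, so the buffer is no longer plain ASCII.
print(wide[:8])
# A 40,000-character ASCII document exceeds 64 KiB once widened.
print(len(widen_to_16bit("a" * 40000)))
```

How lxml actually lays out its buffer depends on the Python build and version, so treat the little-endian, 2-byte layout here as an assumption.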
Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
On Wed, Jan 23, 2019 at 12:55 PM Nick Wellnhofer wrote:

> The commit obviously also affected documents that didn't need encoding conversion. I didn't realize that.

Aha! I noticed that the Chromium link you sent mentions a >32KB string which gets converted to a >64KB string, which sounded suspiciously similar. Looks like lxml's feed() function [1] is doing the same thing. I don't know too much about Python's C API, but [2] [3] suggests lxml is using a deprecated macro and giving libxml2 a multibyte buffer even though the input would fit into pure ASCII. This explains why it behaved differently than xmllint.

[1] https://github.com/lxml/lxml/blob/master/src/lxml/parser.pxi#L1242
[2] https://stackoverflow.com/questions/26079392/how-is-unicode-represented-internally-in-python
[3] https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AS_DATA

I also noticed that feed() is doing something special with the first 4 bytes, giving them to _htmlCtxtResetPush() instead of htmlParseChunk(). So the discussion about buffer boundaries might be slightly incorrect.

> At least we know that the issue is isolated to 2.9.8. Thanks for your efforts!

Yes, thank you. Now it's clear that my immediate issue is solved and version 2.9.9 works. So I probably won't look into this much further.

I guess it's up to you to decide what to do next, and if any libxml2 changes are needed. It would be good to add some tests to decrease the likelihood that this issue or something similar happens again. For that, you might still need to isolate the root cause further, and create a pure C test case. (Maybe based on a test case from Chromium instead of mine.) But of course it's up to you to determine the priority of that.

Thanks again for your help, and good luck if you decide to continue.

Tomi
Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
On 23/01/2019 01:47, Tomi Belan wrote:

> But even so I still wasn't able to reproduce it in pure C. Could it be because xmllint reads ctxt->myDoc, and lxml uses SAX2 event handlers (according to parsertarget.pxi)? AFAICT xmllint's --push and --sax options are incompatible.

ctxt->myDoc is also built via internal SAX2 handlers, so I'm not sure what's going on exactly.

> I had more luck with git bisect. Using a dynamically linked build of lxml, and pointing LD_LIBRARY_PATH to libxml2/.libs/, I successfully found out that the bug was:
>
> - introduced by https://github.com/GNOME/libxml2/commit/6e6ae5daa6cd9640c9a83c1070896273e9b30d14
> - fixed(?) by https://github.com/GNOME/libxml2/commit/7a1bd7f6497ac33a9023d556f6f47a48f01deac0

The first commit was an attempt to fix an (ICU-related?) issue but it turned out to be buggy. It's unfortunate that the commit made it into 2.9.8.

https://mail.gnome.org/archives/xml/2018-January/msg3.html
https://bugs.chromium.org/p/chromium/issues/detail?id=820163

> I hope that's meaningful to you, because I have no idea what those commits are doing or how they could be related to this bug... The commits sound related to character encoding, but bad.html is plain ASCII...

The commit obviously also affected documents that didn't need encoding conversion. I didn't realize that. At least we know that the issue is isolated to 2.9.8. Thanks for your efforts!

Nick
Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
Thanks, that's very useful!

With a dynamically linked build of lxml, I used "ltrace" to see the calls to libxml2. Looks like you're correct: there is only one call to htmlParseChunk with the whole content (followed by a zero-length call to terminate the input).

But even so I still wasn't able to reproduce it in pure C. Could it be because xmllint reads ctxt->myDoc, and lxml uses SAX2 event handlers (according to parsertarget.pxi)? AFAICT xmllint's --push and --sax options are incompatible.

I had more luck with git bisect. Using a dynamically linked build of lxml, and pointing LD_LIBRARY_PATH to libxml2/.libs/, I successfully found out that the bug was:

- introduced by https://github.com/GNOME/libxml2/commit/6e6ae5daa6cd9640c9a83c1070896273e9b30d14
- fixed(?) by https://github.com/GNOME/libxml2/commit/7a1bd7f6497ac33a9023d556f6f47a48f01deac0

I hope that's meaningful to you, because I have no idea what those commits are doing or how they could be related to this bug... The commits sound related to character encoding, but bad.html is plain ASCII...

Tomi

On Tue, Jan 22, 2019 at 7:56 PM Nick Wellnhofer wrote:

> On 22/01/2019 19:11, Tomi Belan wrote:
> > I tried to reproduce it with only xmllint as you suggest, but I'm not having much luck. It produces correct results with "--html --debug bad.html", "--html --debug --stream bad.html", "--html --debug --push bad.html", and "--html --debug --sax bad.html".
> >
> > Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode, or streaming, etc. But at this point I wouldn't be too surprised if it depended on the size of some internal input buffer that's different in lxml vs xmllint. I'd welcome any advice about what else I should try, or how can I find out what calls are being made from lxml to libxml2.
>
> From a quick look at the lxml source, it seems that the `feed` method of HTMLParser calls htmlParseChunk, so you should pass `--html --push` to xmllint. But if it's a buffer boundary issue, you might have to recreate the exact chunk sizes to reproduce the problem. lxml seems to split into chunks of size INT_MAX, meaning a single chunk in most cases. xmllint first passes a chunk of 4 bytes, then splits the remaining data into chunks of 4096 bytes. But maybe I'm missing something. To be sure, you could run your Python code under a debugger like gdb and set a break point on htmlParseChunk. Also break on htmlCtxtUseOptions to see which parser options are used exactly.
>
> You could also start experimenting with feeding chunks of different sizes in your Python script or with a small C program that calls htmlParseChunk in the same way as lxml, presumably writing a single chunk. You could also try to add 4 bytes somewhere at the beginning of `bad.html` and see if it helps with reproducing the issue using xmllint.
>
> > Other than that: It's not ideal, but could you please check if you can also reproduce the bug with the first set of commands I posted? Just to verify it's not just me.
>
> Yes, I can try.
>
> Nick
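The thread does not say whether the bisect above was done by hand or automated. If one wanted to automate a similar search with `git bisect run`, the per-revision logic could look roughly like the sketch below. The helper names `env_with_libs` and `bisect_status` are hypothetical, and the build and reproduction steps are left out because they depend on the local setup.

```python
import os
from typing import Dict, Mapping, Optional

# Exit codes understood by `git bisect run`:
# 0 = good revision, 1..127 (except 125) = bad revision, 125 = skip this revision.
GOOD, BAD, SKIP = 0, 1, 125


def env_with_libs(libs_dir: str, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with libs_dir first in LD_LIBRARY_PATH."""
    env = dict(os.environ if base is None else base)
    old = env.get("LD_LIBRARY_PATH")
    env["LD_LIBRARY_PATH"] = libs_dir if not old else libs_dir + ":" + old
    return env


def bisect_status(build_ok: bool, bug_reproduced: bool) -> int:
    """Map the outcome of one bisect step to an exit code for `git bisect run`."""
    if not build_ok:
        return SKIP  # this revision cannot be tested
    return BAD if bug_reproduced else GOOD
```

A driver script would rebuild libxml2 at the current revision, run the lxml-based reproduction with the environment from `env_with_libs("libxml2/.libs")`, and exit with the value returned by `bisect_status`.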
Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
On 22/01/2019 19:11, Tomi Belan wrote:

> I tried to reproduce it with only xmllint as you suggest, but I'm not having much luck. It produces correct results with "--html --debug bad.html", "--html --debug --stream bad.html", "--html --debug --push bad.html", and "--html --debug --sax bad.html".
>
> Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode, or streaming, etc. But at this point I wouldn't be too surprised if it depended on the size of some internal input buffer that's different in lxml vs xmllint. I'd welcome any advice about what else I should try, or how can I find out what calls are being made from lxml to libxml2.

From a quick look at the lxml source, it seems that the `feed` method of HTMLParser calls htmlParseChunk, so you should pass `--html --push` to xmllint. But if it's a buffer boundary issue, you might have to recreate the exact chunk sizes to reproduce the problem. lxml seems to split into chunks of size INT_MAX, meaning a single chunk in most cases. xmllint first passes a chunk of 4 bytes, then splits the remaining data into chunks of 4096 bytes. But maybe I'm missing something. To be sure, you could run your Python code under a debugger like gdb and set a break point on htmlParseChunk. Also break on htmlCtxtUseOptions to see which parser options are used exactly.

You could also start experimenting with feeding chunks of different sizes in your Python script or with a small C program that calls htmlParseChunk in the same way as lxml, presumably writing a single chunk. You could also try to add 4 bytes somewhere at the beginning of `bad.html` and see if it helps with reproducing the issue using xmllint.

> Other than that: It's not ideal, but could you please check if you can also reproduce the bug with the first set of commands I posted? Just to verify it's not just me.

Yes, I can try.

Nick
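The two chunking schemes described above can be sketched in a few lines of Python. This is only an illustration of the chunk boundaries, not code from xmllint or lxml; the helper name `split_chunks` and the 10,000-byte stand-in for `bad.html` are assumptions.

```python
from typing import Iterator, Optional


def split_chunks(data: bytes, first: int, size: Optional[int]) -> Iterator[bytes]:
    """Yield the first `first` bytes, then the rest in pieces of `size` bytes.

    With size=None the remainder is yielded as a single piece.
    """
    head, rest = data[:first], data[first:]
    if head:
        yield head
    if size is None:
        if rest:
            yield rest
        return
    for start in range(0, len(rest), size):
        yield rest[start:start + size]


data = b"x" * 10000  # stand-in for the contents of bad.html

# xmllint --push, as described above: 4 bytes, then 4096-byte chunks.
xmllint_like = [len(c) for c in split_chunks(data, first=4, size=4096)]
# lxml, as described above: effectively one chunk for the whole input.
lxml_like = [len(c) for c in split_chunks(data, first=0, size=None)]

print(xmllint_like)
print(lxml_like)
```

Each chunk could then be passed to a push parser, for example lxml's `HTMLParser.feed()` followed by `close()`, to experiment with different boundaries. Whether any particular split triggers the bug is not established in this thread.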
Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
I also built lxml 4.2.5 with pristine libxml2 2.9.8 (using a variation of the above command), and got the same results. So I don't think it's a distro-specific problem.

I tried to reproduce it with only xmllint as you suggest, but I'm not having much luck. It produces correct results with "--html --debug bad.html", "--html --debug --stream bad.html", "--html --debug --push bad.html", and "--html --debug --sax bad.html".

Maybe I'm just not using the right flags - I don't know if lxml uses SAX mode, or streaming, etc. But at this point I wouldn't be too surprised if it depended on the size of some internal input buffer that's different in lxml vs xmllint. I'd welcome any advice about what else I should try, or how can I find out what calls are being made from lxml to libxml2.

Other than that: It's not ideal, but could you please check if you can also reproduce the bug with the first set of commands I posted? Just to verify it's not just me.

Tomi

On Tue, Jan 22, 2019 at 5:11 PM Nick Wellnhofer wrote:

> On 22/01/2019 15:43, Tomi Belan via xml wrote:
> > After a lot of debugging, I determined the problem is in libxml2 and not the other libraries in my stack, and that it only seems to happen on version 2.9.8. But I don't see any related changes in news.html for 2.9.9, nor in the diff between them, so I am still worried: I don't know if the bug is really fixed, or just dormant. I hope you can find the root cause, and maybe add a regression test if you do.
>
> I also don't see any directly related changes in either 2.9.8 or 2.9.9.
>
> > This will download the manylinux binary build of lxml 4.2.5, which is statically linked to libxml2 2.9.8.
>
> Are you sure that a pristine 2.9.8 build was used? Maybe there are additional patches added by a distro?
>
> > I couldn't shorten the file very much, because if I delete even a single character, the bug stops triggering. (Could it be some buffer boundary issue?)
>
> Yes, a buffer boundary issue seems likely.
>
> > I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected. So I believe this is a bug in libxml2 2.9.8 specifically, and not in a particular version of lxml.
>
> Did you also try your own build with the official libxml2 2.9.8 sources?
>
> > I hope you can solve the mystery. Please let me know if I can be of any help.
>
> It would help if you could reproduce the issue with xmllint and no Python code involved. git-bisect might also be useful.
>
> Nick
Re: [xml] HTML parser sometimes doesn't close script tags in libxml2 2.9.8
On 22/01/2019 15:43, Tomi Belan via xml wrote:

> After a lot of debugging, I determined the problem is in libxml2 and not the other libraries in my stack, and that it only seems to happen on version 2.9.8. But I don't see any related changes in news.html for 2.9.9, nor in the diff between them, so I am still worried: I don't know if the bug is really fixed, or just dormant. I hope you can find the root cause, and maybe add a regression test if you do.

I also don't see any directly related changes in either 2.9.8 or 2.9.9.

> This will download the manylinux binary build of lxml 4.2.5, which is statically linked to libxml2 2.9.8.

Are you sure that a pristine 2.9.8 build was used? Maybe there are additional patches added by a distro?

> I couldn't shorten the file very much, because if I delete even a single character, the bug stops triggering. (Could it be some buffer boundary issue?)

Yes, a buffer boundary issue seems likely.

> I also built my own lxml 4.2.5 with libxml2 2.9.9 and it was not affected. So I believe this is a bug in libxml2 2.9.8 specifically, and not in a particular version of lxml.

Did you also try your own build with the official libxml2 2.9.8 sources?

> I hope you can solve the mystery. Please let me know if I can be of any help.

It would help if you could reproduce the issue with xmllint and no Python code involved. git-bisect might also be useful.

Nick