Re: [xml] libxml2 2.9.13 download

2022-03-16 Thread Stefan Behnel

Hi,

Jeffrey Walton via xml wrote on 16.03.22 at 05:45:

libxml2 2.9.13 seems to be missing from ftp://xmlsoft.org/libxml2/.


As mentioned in the release announcement:

https://mail.gnome.org/archives/xml/2022-February/msg9.html

the releases have moved to

https://download.gnome.org/sources/libxml2/2.9/

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Release of libxml2 2.9.13

2022-02-23 Thread Stefan Behnel

Nick Wellnhofer wrote on 23.02.22 at 11:36:
I asked on GNOME infra if it is possible to offer .tar.gz downloads, but 
this would require changes to the upload script.


Thanks for asking.

Stefan


Re: [xml] Release of libxml2 2.9.13

2022-02-22 Thread Stefan Behnel

Nick Wellnhofer via xml wrote on 20.02.22 at 13:53:

Version 2.9.13 of libxml2 is available at:

     https://download.gnome.org/sources/libxml2/2.9/


Thank you for the release, Nick!


Note that starting with this release, libxml2 tarballs are published on 
download.gnome.org instead of ftp.xmlsoft.org.


I noticed that they now use xz compression, whereas they were simply gzip 
compressed before. libxslt also changed the compression. That makes it more 
difficult to download them automatically, because scripts that want to list 
the available files now have to search for different file names. Also, 
Python 2.7 does not have built-in lzma compression support and needs an 
external module in order to handle it. (Both gz and bz2 have been supported 
essentially forever, OTOH.)
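
To illustrate the file-name problem, here is a minimal sketch of how a download script might pick the newest tarball regardless of the compression suffix. The helper name `latest_tarball`, the regular expression and the preference order are assumptions made for this example; only the gz, bz2 and xz formats come from the discussion above.

```python
import re

# Matches e.g. "libxml2-2.9.13.tar.xz". The accepted suffixes are an
# assumption based on the formats mentioned in this thread.
_TARBALL = re.compile(r"^libxml2-(\d+(?:\.\d+)*)\.tar\.(gz|bz2|xz)$")

# Preference order when one version is offered in several formats.
_PREFERRED = ("gz", "bz2", "xz")


def latest_tarball(filenames):
    """Return the file name of the newest libxml2 tarball, or None."""
    candidates = []
    for name in filenames:
        match = _TARBALL.match(name)
        if match is None:
            continue  # not a release tarball, e.g. a checksum file
        version = tuple(int(part) for part in match.group(1).split("."))
        rank = _PREFERRED.index(match.group(2))
        # Negate the rank so that max() favours the preferred format.
        candidates.append((version, -rank, name))
    if not candidates:
        return None
    return max(candidates)[2]
```

Comparing versions as integer tuples avoids the classic mistake of sorting "2.9.9" after "2.9.10" as strings. A script that only ever looked for ".tar.gz" would need exactly this kind of change to keep working.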


And it seems that xz is not considered safe for long-term storage by everyone:

https://www.nongnu.org/lzip/xz_inadequate.html

Could you make the archives available in a (second) format that matches all 
(previous) releases? Apparently, both libxml2 and libxslt were made 
available with gz and bz2 compression before. Either of them would probably 
be fine. bz2 seems to compress about as well as xz here. (And compression 
speed, where bz2 suffers a bit, was never an issue for downloads anyway, 
just decompression speed, where all three are fine.)


Thanks,
Stefan


Re: [xml] Resuming maintenance

2022-01-10 Thread Stefan Behnel

Nick Wellnhofer via xml wrote on 10.01.22 at 15:20:
Thanks to a donation from Google, I'm able to resume maintenance of libxml2 
(and libxslt) for the remainder of 2022.


I'm very happy to read this, Nick. All the best for 2022.

Stefan


Re: [xml] Release of libxml2 2.9.11

2021-05-14 Thread Stefan Behnel
Stefan Behnel wrote on 13.05.21 at 23:13:
> I haven't looked into them in detail yet but will do so as soon as I find
> the time (probably during the next days). It's not possible that lxml is
> doing something here that libxml2 doesn't expect, but we'll see.

Sorry, I meant to write "it's possible" instead of "it's not possible".

Stefan


Re: [xml] Release of libxml2 2.9.11

2021-05-13 Thread Stefan Behnel
Jan Tojnar wrote on 13.05.21 at 21:44:
>> I fail to build libxslt 1.1.34 against it. The "configure" script of
>> libxslt has this line:
> 
> libxml2 now behaves more correctly by rejecting invalid arguments like
> `print`. This is fixed in libxslt master so it no longer passes it the
> extra print argument.

Thanks. I fixed it by discarding the erroneous "print" before building libxslt.

Past that obstacle, I tested the release with lxml in Python and found a
bunch of tests (>40) from lxml's test suite failing due to changes in the
serialiser. Most of them are due to a line break that was apparently added
to the end of the output, e.g.

"""
AssertionError:
b'testtest\n'
!=
b'testtest'
"""

Difficult to say if this is an improvement or deliberate breakage.
Technically, it's not a semantic change in the XML output, rather a byte
level change in ignorable whitespace. But I'll need to look into it further
to understand what the best adaptation to this change is.
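
One conceivable test-side adaptation, purely as a sketch and not necessarily what lxml ended up doing, is to compare serialised output while ignoring trailing line breaks. The helper name below is hypothetical.

```python
def same_ignoring_trailing_newlines(actual, expected):
    """Compare two serialised documents, ignoring trailing line breaks."""
    return actual.rstrip(b"\r\n") == expected.rstrip(b"\r\n")
```

This only hides a difference at the very end of the output. Line breaks inside the document still count, so genuine serialiser changes would continue to show up.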

More importantly, there also seem to be issues where additional closing
tags or duplicated PIs and comments are being written, e.g.

"""
AssertionError:
'Cyan\n'
!=
'Cyan'
"""

or

"""
AssertionError:
b'Hello world!\n'
!=
b'Hello world!'
"""

I haven't looked into them in detail yet but will do so as soon as I find
the time (probably during the next days). It's not possible that lxml is
doing something here that libxml2 doesn't expect, but we'll see.

Stefan


Re: [xml] Release of libxml2 2.9.11

2021-05-13 Thread Stefan Behnel
Salut Daniel,

Daniel Veillard via xml wrote on 13.05.21 at 15:54:
>   P, I am way way behind, but now that CVE-2021-3541 is out I just pushed
> that long awaited release. libxml2 2.9.11 is tagged in git and a signed
> tarball is available at the usual place:
> 
> ftp://xmlsoft.org/libxml2/

I fail to build libxslt 1.1.34 against it. The "configure" script of
libxslt has this line:

"""
if test "x$LIBXML_LIBS" = "x" && ${XML_CONFIG} --libs print > /dev/null 2>&1
"""

which, I guess, is incorrect and shouldn't have the "print". However, it
seems that in previous versions of libxml2, the xml2-config script printed
the libs and then failed, whereas in 2.9.11/12 it does *not* print the libs
any more and fails immediately. This breaks the "configure" script of
libxslt, which then reports that it could not find libxml2 anywhere.

I guess the work-around is to set "LIBXML_LIBS" externally to the result of
"xml2-config --libs" until there is a fix in libxslt. But maybe libxml2
could just go back to being nice towards the faults of libxslt?
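
The work-around can be sketched in a build script as follows. The helper name `env_with_libxml_libs` is hypothetical, and the sketch assumes that libxslt's "configure" honours a preset LIBXML_LIBS variable, as the quoted test line suggests.

```python
import os
import subprocess


def env_with_libxml_libs(xml_config=("xml2-config", "--libs"), base_env=None):
    """Return a copy of the environment with LIBXML_LIBS preset.

    The command is passed in so that a different xml2-config script
    (or a stand-in) can be used.
    """
    env = dict(os.environ if base_env is None else base_env)
    result = subprocess.run(list(xml_config), check=True,
                            stdout=subprocess.PIPE, universal_newlines=True)
    env["LIBXML_LIBS"] = result.stdout.strip()
    return env


# Usage, assuming xml2-config is on PATH and the current directory is the
# libxslt source tree:
#   subprocess.run(["./configure"], env=env_with_libxml_libs(), check=True)
```

Because LIBXML_LIBS is then non-empty, the faulty "--libs print" test in the configure script is never evaluated.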

Stefan


[xml] Could we have a new release?

2021-04-14 Thread Stefan Behnel
Hi,

libxml2 2.9.10 has been around for almost 18 months now. There have been
lots of fixes during that time, so, may I kindly ask what's hindering a new
release?

Stefan


Re: [xml] Entering freeze for libxml2-2.9.10

2019-10-31 Thread Stefan Behnel
Hi,

sorry to be late to the party. Let me note that the release tests fine with
lxml, just with two test failures due to changed (and apparently more
accurate) error texts/IDs. I'll adapt the tests in lxml.

Thank you for the release, Daniel!

Stefan



Daniel Veillard wrote on 28.10.19 at 21:26:
> On Tue, Oct 29, 2019 at 07:40:34AM +1300, David Warring wrote:
>> The tests for newish Raku (Perl 6) bindings are fine with libxml2-2.9.10-rc1
>> https://github.com/p6-xml/LibXML-p6
> 
>   Okay seems we have a good one ! Thanks David
> 
> BTW is lxml current maintainer around, that may be one good way to validate
> that one and the libxslt 1.1.34 rc2 too, Stefan or Martijn are you around ?
> 
> Daniel
> 
>> - David
>>
>> On Thu, Oct 24, 2019 at 5:53 AM Daniel Veillard via xml 
>> wrote:
>>
>>>   Took a while but it's time to assemble a new release,
>>> I tagged it in git and pushed signed tarball and rpms to the
>>> usual place:
>>>
>>>ftp://xmlsoft.org/libxml2/
>>>
>>> I will try to make an rc2 during the week-end, and then we can
>>> roll up the release by mid next week.
>>>
>>>  In the meantime please give it some testing,
>>>
>>>thanks,
>>>
>>> Daniel
>>>
>>> --
>>> Daniel Veillard  | Red Hat Developers Tools
>>> http://developer.redhat.com/
>>> veill...@redhat.com  | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
>>> http://veillard.com/ | virtualization library  http://libvirt.org/
>>>
>>>
> 



Re: [xml] Entering freeze for release of libxml2-2.9.9

2018-12-24 Thread Stefan Behnel
Nikolai Weibull wrote on 24.12.18 at 12:00:
> Stefan Behnel, 2018-12-24 11:43:
>> Nick Wellnhofer schrieb am 19.12.18 um 17:02:
>>> On 30/11/2018 11:41, Nikolai Weibull via xml wrote:
>>>> OK, now I understand why it was working in my copy of the repository and
>>>> not yours. Something went wrong when you applied the patch, Daniel, as a
>>>> line was elided. Here’s a fix. We want to include XML_RELAXNG_TEXT here
>>>> as well, otherwise it won’t work. The second part of the patch below was
>>>> just to reorder the types to be listed in alphabetical order, so you may
>>>> certainly skip that.
>>>
>>> Stefan, can you confirm that Nikolai's patch fixes the lxml issue?
>>
>> Sorry for the silence, I wasn't aware that I had to do something. Problem
>> is, the patch that Nikolai sent doesn't apply for me.
> 
>> Nikolai, could you create a patch against the latest master that makes
>> relaxng.c the way you think it should be? (Or should have been in the first
>> place?) Please attach it rather than pasting it into a mail, to make sure
>> it passes without whitespace issues.
> 
> I just applied the patch (with patch < a.patch) without issue against
> master.  I’m attaching it as well so that you can try that.

With that patch applied, all tests in lxml pass again, and so does the
stripped-down test case. I also double-checked by unapplying the patch:
things still fail on the master branch, so it's really just this change
that makes them work again.

Thanks, Nikolai!

Stefan


Re: [xml] Entering freeze for release of libxml2-2.9.9

2018-12-24 Thread Stefan Behnel
Nick Wellnhofer wrote on 19.12.18 at 17:02:
> On 30/11/2018 11:41, Nikolai Weibull via xml wrote:
>> OK, now I understand why it was working in my copy of the repository and
>> not yours.  Something went wrong when you applied the patch, Daniel, as a
>> line was elided.  Here’s a fix.  We want to include XML_RELAXNG_TEXT here
>> as well, otherwise it won’t work. The second part of the patch below was
>> just to reorder the types to be listed in alphabetical order, so you may
>> certainly skip that.
> 
> Stefan, can you confirm that Nikolai's patch fixes the lxml issue?

Sorry for the silence, I wasn't aware that I had to do something. Problem
is, the patch that Nikolai sent doesn't apply for me.

Nikolai, could you create a patch against the latest master that makes
relaxng.c the way you think it should be? (Or should have been in the first
place?) Please attach it rather than pasting it into a mail, to make sure
it passes without whitespace issues.

Thanks!

Stefan


Re: [xml] Entering freeze for release of libxml2-2.9.9

2018-11-29 Thread Stefan Behnel
Daniel Veillard wrote on 29.11.18 at 21:20:
> On Mon, Nov 26, 2018 at 11:48:37AM +0100, Nikolai Weibull via xml wrote:
>> Stefan Behnel, 2018-11-25 15:37:
>>> Nikolai Weibull wrote on 24.11.18 at 00:12:
>>>> Yes, it seems that my patch for data in interleaves was added and
>>>> this may be the cause of these issues. The regression tests didn’t
>>>> display them, so this is something different. Could we perhaps get a
>>>> minimal test that breaks?
>>
>>> Here is what I could come up with so far. Since it's heavily stripped
>>> down,
>>> it probably isn't very reasonable anymore. The original schema is here:

https://raw.githubusercontent.com/lxml/lxml/82601a09d015bc3e7a4090223fcbb9a5d5d4590d/src/lxml/isoschematron/resources/rng/iso-schematron.rng

This is the direct file link now. I had attached the shortened test files here:

https://mail.gnome.org/archives/xml/2018-November/msg00023.html


>> Thank you!  As far as my tests go, with the patches that I’ve provided, this
>> validates without any issues.  I really hope we can get my patches from the
>> merge request into master so that this issue can be fixed.
> 
>   TBH it's weird it fails to validate for me with 2.9.8, with 2.9.9-rc1 and
> with 2.9.9-rc1 with the data interleave patch reverted ...

I tried both lxml's test suite and my stripped down test files with 2.9.8
and the two RCs now, and all of them pass with 2.9.8, but fail with both
2.9.9-rc1 and 2.9.9-rc2.

I figured out how to build libxml2 from a git checkout now so that I could
bisect it. The bug was definitely introduced in c8e5f9588, which is
Nikolai's change from November 22nd.

I used

git bisect run bash -c "make clean && make &&
./xmllint --relaxng ../iso-schematron.rng ../fail_schema.sch"
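
As a side note, the one-liner above reports a revision as bad both when the build fails and when validation fails. A minimal sketch of a helper script that separates the two cases is shown below. It relies on the exit codes that "git bisect run" interprets (0 for good, 125 for skip, other values from 1 to 127 for bad); the function names are hypothetical, and the make and xmllint commands are taken from the command above.

```python
import subprocess

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by "git bisect run"


def classify(build_ok, validates):
    """Map build and validation results to a "git bisect run" exit code."""
    if not build_ok:
        return SKIP  # revision cannot be tested, let git try another one
    return GOOD if validates else BAD


def succeeds(command):
    """Run a shell command and report whether it exited with status 0."""
    return subprocess.run(command, shell=True).returncode == 0


def main():
    """Build the checkout, run the failing validation, return the verdict."""
    build_ok = succeeds("make clean && make")
    validates = build_ok and succeeds(
        "./xmllint --relaxng ../iso-schematron.rng ../fail_schema.sch")
    return classify(build_ok, validates)
```

Saved to a file that ends with a call such as `sys.exit(main())` (after `import sys`), this could be passed to "git bisect run" in place of the bash command. Skipping unbuildable revisions keeps a broken intermediate commit from being blamed for the regression.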

The change looks simple, but also a bit opaque to me. It could be that it's
related to the interleaving of optional tags/attributes and text somehow.
At least, that's what this part of the change might suggest:

-groups[nbgroups]->defs = xmlRelaxNGGetElements(ctxt, cur, 0);
+groups[nbgroups]->defs = xmlRelaxNGGetElements(ctxt, cur, 2);

And, in fact, changing that line in the latest master branch back to the
original "0" argument makes the validation pass for me. It probably also
reverts most of the intended behaviour that Nikolai wanted to achieve. :(

Stefan


Re: [xml] Entering freeze for release of libxml2-2.9.9

2018-11-25 Thread Stefan Behnel
Nikolai Weibull wrote on 24.11.18 at 00:12:
> Yes, it seems that my patch for data in interleaves was added and this may
> be the cause of these issues.  The regression tests didn’t display them, so
> this is something different.  Could we perhaps get a minimal test that breaks?

Here is what I could come up with so far. Since it's heavily stripped down,
it probably isn't very reasonable anymore. The original schema is here:

https://github.com/lxml/lxml/blob/82601a09d015bc3e7a4090223fcbb9a5d5d4590d/src/lxml/isoschematron/resources/rng/iso-schematron.rng

If any of the "interleave" tags is removed or otherwise modified (as far as
I tried), it either validates (i.e. stops failing) or fails with a
different error than the one I was chasing. The validation succeeds in
xmllint 2.9.4 (and probably later versions), and fails in 2.9.9-rc1.

Does this help?

Stefan


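Attachment: iso-schematron.rng
Description: XML document

[Schematron test document, namespace http://purl.oclc.org/dsdl/schematron;
its XML markup was lost in the archive. Surviving text:]
Open Model
 BBB element is not present
 CCC element is not present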



Re: [xml] Entering freeze for release of libxml2-2.9.9

2018-11-25 Thread Stefan Behnel
Nikolai Weibull wrote on 24.11.18 at 00:12:
> Yes, it seems that my patch for data in interleaves was added and this may
> be the cause of these issues.  The regression tests didn’t display them, so
> this is something different.  Could we perhaps get a minimal test that breaks?

It's a bit tricky to cut it down, but I'll try.

In any case, you can already reproduce it with the files I sent, maybe
there's something obvious that goes wrong.

Stefan


Re: [xml] Entering freeze for release of libxml2-2.9.9

2018-11-23 Thread Stefan Behnel
Salut Daniel!

Daniel Veillard via xml wrote on 22.11.18 at 18:32:
>   I have just tagged the Release Candidate 1 in git and pushed a signed
> tarball and signed rpms to the usual place:
> 
>   ftp://xmlsoft.org/libxml2/

I think something changed in the RelaxNG code. When I try to validate a
simple Schematron schema file (attached) against the schematron RNG spec here:

https://github.com/lxml/lxml/blob/82601a09d015bc3e7a4090223fcbb9a5d5d4590d/src/lxml/isoschematron/resources/rng/iso-schematron.rng

it validates in libxml2 2.9.4 (sorry, that's what I have installed, but I'm
pretty sure it also worked with at least 2.9.7 and probably also 2.9.8) but
fails in 2.9.9-rc1. I attached the output of both attempts.

Specifically, I get the following error, which seems nonsense, given that
"title" is the first child:

"""
-:3: element title: Relax-NG validity error : Expecting element p, got title
Relax-NG validity error : Extra element rule in interleave
"""

Something seems to be wrong with the interleave/group combination in the
schema, lines 56/61.

Any idea?

Stefan

[Attachment: xmllint output with libxml2 2.9.4. The input is a Schematron
document in the http://purl.oclc.org/dsdl/schematron namespace; its XML
markup was lost in the archive and only its text survives.]

$ xmllint --relaxng src/lxml/isoschematron/resources/rng/iso-schematron.rng - <<EOF
Open Model
 BBB element is not present
 CCC element is not present
Closed model"
 BBB element is not present
 CCC element is not present
There is an extra 
element
EOF
[xmllint prints the document back]
- validates

[Attachment: xmllint output with libxml2 2.9.9-rc1, same input document.]

$ build/tmp/libxml2-2.9.9/xmllint --relaxng src/lxml/isoschematron/resources/rng/iso-schematron.rng - <<EOF
[same document as above]
EOF
[xmllint prints the document back]
-:3: element title: Relax-NG validity error : Expecting element p, got title
Relax-NG validity error : Extra element rule in interleave
-:4: element rule: Relax-NG validity error : Element pattern failed to validate content
-:10: element title: Relax-NG validity error : Expecting element p, got title
Relax-NG validity error : Extra element rule in interleave
-:11: element rule: Relax-NG validity error : Element pattern failed to validate content
-:9: element pattern: Relax-NG validity error : Expecting element p, got pattern
Relax-NG validity error : Extra element pattern in interleave
-:9: element pattern: Relax-NG validity error : Element schema failed to validate content
- fails to validate

[Attachment: the Schematron schema file, the same document as above.]


Re: [xml] Release of libxml2-2.9.5

2017-09-04 Thread Stefan Behnel
Daniel Veillard wrote on 04.09.2017 at 15:56:
>   It's out ! I tagged the release in git and pushed the signed tarball
> and rpms to the usual place:
> 
> ftp://xmlsoft.org/libxml2/
> 
> This is mostly a a security and bug fixes, most of the credit goes to Nick
> who wrote or reviewed most of the patches. There is a significant set of
> changes, but users are invited to upgrade if only to get the security fixes !
> There is also portability fixes for those ever special OSes :-)
> [...]
>  Thanks everybody and especially Nick who contributed to this release
> with bug reports, patches, docs, etc ...
> 
>Enjoy the release!

Thank you Daniel and Nick, and congrats on this release!

I hope that the new release process will make it easier for you to get
things out. If there's always a next release, then it's never too late to
get bugs fixed.

Stefan


Re: [xml] Python 3.5 issue - SystemError: returned a result with an error set

2017-08-31 Thread Stefan Behnel
Petr Sumbera wrote on 30.08.2017 at 14:00:
> anyone seen following error when running Python regression tests? This is
> just with Python 3.5. Pythons 2.7 and 3.4 are ok (I haven't tested Python
> 3.6).
> 
> ## running Python regression tests
> TypeError: 'NoneType' object is not callable
> 
> During handling of the above exception, another exception occurred:
> 
> Traceback (most recent call last):
>   File
> "/builds/psumbera/userland-libxml2-2.9.5/components/libxml2/libxml2-2.9.5/python/tests/tstLastError.py",
> line 80, in 
> test.test1()
>   File
> "/builds/psumbera/userland-libxml2-2.9.5/components/libxml2/libxml2-2.9.5/python/tests/tstLastError.py",
> line 62, in test1
> line=0)
>   File
> "/builds/psumbera/userland-libxml2-2.9.5/components/libxml2/libxml2-2.9.5/python/tests/tstLastError.py",
> line 30, in failUnlessXmlError
> f(*args)
>   File
> "/builds/psumbera/userland-libxml2-2.9.5/components/libxml2/libxml2-2.9.5/python/libxml2.py",
> line 1374, in readFile
> ret = libxml2mod.xmlReadFile(filename, encoding, options)
> SystemError:  returned a result with an
> error set
> -- tstLastError.py

What this error means is that a Python exception was raised and not
handled, and when returning from the Python function call to the
C-implemented function, it returned an actual result value instead of
returning NULL in order to propagate the exception. This is a bug in the C
extension. It should either silence and clear the exception explicitly with
PyErr_Clear(), or propagate it.

Stefan


Re: [xml] Support of HTML v5 parsing

2015-06-29 Thread Stefan Behnel
Bruce Miller wrote on 28.05.2015 at 18:37:
 On 05/28/2015 12:29 PM, Noam Postavsky wrote:
 On Thu, May 28, 2015 at 12:13 PM, Frank Gross wrote:
   Are there any plans to support parsing of HTML V5 in libxml ? I tried
 function htmlCtxtReadMemory(), but it raises an error for HTML document
 containing tags introduced in HTML V5 such as : Tag header invalid.
 
 I'd love to see this happen!  I'm so used to the libxml2 tools,
 and the tools built upon them, it would SO simplify my life.
 
 I think the same question has already been asked, and answered at
 https://mail.gnome.org/archives/xml/2013-April/msg6.html
 
 Sorta, yes. But HTML5 is essentially _defined_ by it's parser rather than
 by it's spec. In particular the (annoying) way that it rewrites the DOM
 to turn what you wrote into what it wants.  That being the case, there's
 more to adapting libxml's HTML parser than just being more forgiving about
 the unrecognized tags --- the resulting DOM might not be quite what HTML5
 specifies!

I think most people would be happy if the new tags were recognised
correctly, e.g. the self-closing ones. Does it really matter that much
whether the resulting DOM tree strictly conforms to the HTML5 parsing
rules?


 Which is all to say that it's not quite trivial; would probably amount to
 importing the official parser and modifying it to create libxml's internal
 structure.  Sadly, Daniel doesn't have the time.   Nor, alas, do I.

There's a long list of tag metadata in the HTMLparser.c file. I'm sure a
patch that adds just a couple of the new tags would be warmly appreciated.
As long as everyone just goes *I* don't have time ATM, not even to add one
little tag, nothing's going to change here.

Stefan



Re: [xml] Memory usage 32 bit vs. 64 bit Linux

2015-05-12 Thread Stefan Behnel
Daniel Veillard wrote on 12.05.2015 at 10:41:
 On Tue, May 12, 2015 at 10:28:34AM +0200, Robert Grasböck wrote:
 Hello Stefan!

 Memory consumption has nearly decreased by 50%, that's the good thing.
 But the bad thing is that the documentation says:

 no modification of the tree allowed afterwards (will possibly crash if you
 try to modify the tree)

 Should I be worry about it?
 
   The Compact option stores the small text string in the unused pointers
 when possible, but it makes changes to the tree content way harder.
 As long as you generate the tree to scan it and not change it that's safe
 if you dynamically change the tree from your application, there is risks

To clarify this a bit for the OP: The API functions should generally work
nicely also on the compact tree, but if you manipulate tree nodes manually,
you have to take care yourself that you don't try to free strings that were
not allocated separately, for example.

Stefan



Re: [xml] Memory usage 32 bit vs. 64 bit Linux

2015-05-11 Thread Stefan Behnel
Robert Grasböck wrote on 05.05.2015 at 15:52:
 I have a question about memory usage of libxml2.
 I'm using libxml2 on two different systems, once a 32 bit linux other one a
 64 bit linux.
 On both I run the same application which use libxml2 to parse xml files.
 The application opens many small xml files (~200) with xmlParseFile.
 The document (xmlDocPtr) stays open in memory.
 I now noticed that the application running on 64 bit Linux uses more the 4
 times the heap memory as on the 32 bit version. I could understand a double
 up of used memory due to all the pointers are now double the size. But 4
 time???
 
 I did a check with valgrinds massif tool and it tells me that most of the
 heap allocation (in my case 2/3 of the total heap consumption) comes from
 libxml2 xmlParseFile.
 
 I don't know if it's a issue of libxml2 directly, but it seems it is.
 I would appreciate any suggestions to get the memory usage down!

Have you enabled the COMPACT parsing option (XML_PARSE_COMPACT)? It avoids
separate string allocations for short text content and benefits a lot from
a 64 bit architecture.

Stefan



Re: [xml] [PATCH] Add methods for python3 iterator

2014-09-23 Thread Stefan Behnel
Ron Angeles wrote on 18.09.2014 at 09:14:
 xmlCoreDepthFirstItertor and xmlCoreBreadthFirstItertor only
 implement a python2-compatible iterator interface. The necessary
 method names (__next__) have been added. They just passthrough
 to the python2 method (next).
 ---
  python/libxml.py | 8 
  1 file changed, 8 insertions(+)
 
 diff --git a/python/libxml.py b/python/libxml.py
 index e507e0f..abf0cd4 100644
 --- a/python/libxml.py
 +++ b/python/libxml.py
 @@ -530,6 +530,10 @@ class xmlCoreDepthFirstItertor:
  self.parents = []
  def __iter__(self):
  return self
 +# python3 iterator
 +def __next__(self):
 +return self.next()
 +# python2 iterator
  def next(self):
  while 1:
  if self.node:

You can write

__next__ = next

below the next method in both cases. No need to go through an indirection
here. I'd even reverse it: change the method definition to __next__(self)
and add a next alias instead of the above.
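
A minimal sketch of the reversed variant follows. The class is a hypothetical stand-in that iterates over a plain list; it is not the tree iterator from the patch.

```python
class DepthFirstIterator:
    """Hypothetical stand-in for the iterator classes in the patch."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        # Python 3 iterator protocol
        if self._pos >= len(self._items):
            raise StopIteration
        item = self._items[self._pos]
        self._pos += 1
        return item

    # Python 2 spelling: the same function under a second name,
    # so there is no extra call per item.
    next = __next__
```

Both names refer to one function object, so `for` loops work on either Python version and existing code that calls `it.next()` directly keeps working.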

Also note that there is a Python package called lxml (developed by me)
that wraps libxml2 and libxslt. People tend to prefer it over the bare
Python bindings that come with both libraries.

Stefan



[xml] [BUG+FIX] valid.c erroneously ignores a validation error if no error callback set

2014-02-21 Thread Stefan Behnel
Hi,

valid.c contains this code:

    if ((ctxt != NULL) && (ctxt->error != NULL)) {
        xmlErrValidNode(ctxt, attr->parent, XML_DTD_ID_REDEFINED,
                        "ID %s already defined\n",
                        value, NULL, NULL);
    }

It prevents the error from being reported if ctxt->error is not set,
although simply calling xmlErrValidNode() would properly report the error
to the global error callback if the NULL checks above didn't exist.

The fix is to remove the surrounding if test.

https://bugzilla.gnome.org/show_bug.cgi?id=724903

Stefan


Re: [xml] [bug] external subset ignored by 2.9.0 when parsing in incremental mode

2012-10-20 Thread Stefan Behnel
Noam Postavsky, 11.10.2012 08:41:
 This patch fixes my test case as well.

Rebumping this again then. Thanks for testing.

Stefan



Re: [xml] [bug] external subset ignored by 2.9.0 when parsing in incremental mode

2012-10-09 Thread Stefan Behnel
[bump]

See patch in original e-mail.

Stefan Behnel, 28.09.2012 13:44:
 Hi,
 
 there is an unfortunate interaction between the progressive parsing mode
 and the loading of an external DTD, e.g. to inject defaulted attribute
 values. I see this in lxml's iterparse() implementation that started
 failing to inject them in libxml2 2.9.0. It uses incremental push parsing.
 
 The problem results from the fact that xmlSAX2ExternalSubset() in SAX2.c
 reuses the existing parser context, which, in this case, is in progressive
 mode. When it calls into xmlParseExternalSubset(), that starts by running
 the GROW macro, which is a no-op in progressive mode. Thus, no data is
 available and xmlParseExternalSubset() terminates without doing anything.
 
 I'm not currently sure why it worked in older releases. I suspect that one
 of the many additional places that now set the ctxt->progressive field to 1
 might have triggered it.
 
 I'm not entirely sure about the right way to fix this. Maybe
 xmlSAX2ExternalSubset() should also back up and restore the progressive
 field of the context and then set it to 0 before calling
 xmlParseExternalSubset()? I attached a patch that does that and that fixes
 the problem for me.
 
 BTW, is it correct that ctxt->progressive is sometimes set to 1 and
 sometimes to things like XML_PARSER_COMMENT or XML_PARSER_PI in
 parser.c? Those values are more commonly assigned to the instate field.
 
 Stefan



[xml] [bug] external subset ignored by 2.9.0 when parsing in incremental mode

2012-09-28 Thread Stefan Behnel
Hi,

there is an unfortunate interaction between the progressive parsing mode
and the loading of an external DTD, e.g. to inject defaulted attribute
values. I see this in lxml's iterparse() implementation that started
failing to inject them in libxml2 2.9.0. It uses incremental push parsing.

The problem results from the fact that xmlSAX2ExternalSubset() in SAX2.c
reuses the existing parser context, which, in this case, is in progressive
mode. When it calls into xmlParseExternalSubset(), that starts by running
the GROW macro, which is a no-op in progressive mode. Thus, no data is
available and xmlParseExternalSubset() terminates without doing anything.

I'm not currently sure why it worked in older releases. I suspect that one
of the many additional places that now set the ctxt->progressive field to 1
might have triggered it.

I'm not entirely sure about the right way to fix this. Maybe
xmlSAX2ExternalSubset() should also back up and restore the progressive
field of the context and then set it to 0 before calling
xmlParseExternalSubset()? I attached a patch that does that and that fixes
the problem for me.

BTW, is it correct that ctxt->progressive is sometimes set to 1 and
sometimes to things like XML_PARSER_COMMENT or XML_PARSER_PI in
parser.c? Those values are more commonly assigned to the instate field.

Stefan
diff -r 58415f6342ee SAX2.c
--- a/SAX2.c	Wed Sep 26 10:21:06 2012 +0800
+++ b/SAX2.c	Fri Sep 28 13:40:08 2012 +0200
@@ -411,6 +411,7 @@
 	xmlParserInputPtr input = NULL;
 	xmlCharEncoding enc;
 	int oldcharset;
+	int oldprogressive;
 
 	/*
 	 * Ask the Entity resolver to load the damn thing
@@ -432,6 +433,7 @@
 	oldinputMax = ctxt-inputMax;
 	oldinputTab = ctxt-inputTab;
 	oldcharset = ctxt-charset;
+	oldprogressive = ctxt-progressive;
 
 	ctxt-inputTab = (xmlParserInputPtr *)
 	 xmlMalloc(5 * sizeof(xmlParserInputPtr));
@@ -442,11 +444,13 @@
 	ctxt->inputMax = oldinputMax;
 	ctxt->inputTab = oldinputTab;
 	ctxt->charset = oldcharset;
+	ctxt->progressive = oldprogressive;
 	return;
 	}
 	ctxt->inputNr = 0;
 	ctxt->inputMax = 5;
 	ctxt->input = NULL;
+	ctxt->progressive = 0;
 	xmlPushInput(ctxt, input);
 
 	/*
@@ -487,6 +491,7 @@
 	ctxt->inputMax = oldinputMax;
 	ctxt->inputTab = oldinputTab;
 	ctxt->charset = oldcharset;
+	ctxt->progressive = oldprogressive;
 	/* ctxt->wellFormed = oldwellFormed; */
 }
 }


Re: [xml] Availability of libxm2-2.9.0 release candidate 1

2012-08-10 Thread Stefan Behnel
Daniel Veillard, 10.08.2012 07:21:
   BTW do you have a git commit for 2.9.0 preparation in lxml now ? I may
 forward this to the packager for Fedora.

Hmm, I'm fixing it up only for lxml 3.0. Due to various changes in the code
base, that won't apply directly to the latest 2.3.x, and I'm not sure I
want to add support in the 2.3.x series. I might ...

The fixes aren't all that complex, though. If it's just to get it working,
the attached patch should show the necessary changes, but it won't apply to
2.3.x as is.

BTW, with my latest changes, I get lots of XSLT test failures like this
when I run it with libxslt 1.1.26:


Failed example:
str(result)
Expected:
'<?xml version="1.0"?>\n<foo><child>NEW TEXT</child></foo>\n'
Got:
'<?xml version=""?>\n<foo><child>NEW TEXT</child></foo>\n'


You might have seen these before.

Stefan

# HG changeset patch
# Parent a071dfc78c525bb6fda60746bbe694ff1b257200
adapt to upcoming buffer changes in libxml2 2.9

diff -r a071dfc78c52 -r e5da17790fc2 src/lxml/includes/etree_defs.h
--- a/src/lxml/includes/etree_defs.h	Thu Aug 09 17:06:50 2012 +0200
+++ b/src/lxml/includes/etree_defs.h	Thu Aug 09 18:24:42 2012 +0200
@@ -152,6 +152,13 @@
 #  define xmlSchematronSetValidStructuredErrors(ctxt, errorfunc, data)
 #endif
 
+#include <libxml/tree.h>
+#ifndef LIBXML2_NEW_BUFFER
+   typedef xmlBuffer xmlBuf;
+#  define xmlBufContent(buf) xmlBufferContent(buf)
+#  define xmlBufLength(buf) xmlBufferLength(buf)
+#endif
+
 /* libexslt 1.1.25+ support EXSLT functions in XPath */
 #if LIBXSLT_VERSION < 10125
 #define exsltDateXpathCtxtRegister(ctxt, prefix)
diff -r a071dfc78c52 -r e5da17790fc2 src/lxml/includes/tree.pxd
--- a/src/lxml/includes/tree.pxd	Thu Aug 09 17:06:50 2012 +0200
+++ b/src/lxml/includes/tree.pxd	Thu Aug 09 18:24:42 2012 +0200
@@ -285,9 +285,11 @@
 
 ctypedef struct xmlBuffer
 
+ctypedef struct xmlBuf   # new in libxml2 2.9
+
 ctypedef struct xmlOutputBuffer:
-xmlBuffer* buffer
-xmlBuffer* conv
+xmlBuf* buffer
+xmlBuf* conv
 int error
 
 const_xmlChar* XML_XML_NAMESPACE
@@ -359,6 +361,8 @@
 cdef void xmlBufferFree(xmlBuffer* buf) nogil
 cdef const_xmlChar* xmlBufferContent(xmlBuffer* buf) nogil
 cdef int xmlBufferLength(xmlBuffer* buf) nogil
+cdef const_xmlChar* xmlBufContent(xmlBuf* buf) nogil # new in libxml2 2.9
+cdef size_t xmlBufLength(xmlBuf* buf) nogil # new in libxml2 2.9
 cdef int xmlKeepBlanksDefault(int val) nogil
 cdef xmlChar* xmlNodeGetBase(xmlDoc* doc, xmlNode* node) nogil
 cdef void xmlNodeSetBase(xmlNode* node, const_xmlChar* uri) nogil
diff -r a071dfc78c52 -r e5da17790fc2 src/lxml/serializer.pxi
--- a/src/lxml/serializer.pxi	Thu Aug 09 17:06:50 2012 +0200
+++ b/src/lxml/serializer.pxi	Thu Aug 09 18:24:42 2012 +0200
@@ -81,7 +81,7 @@
 tree.
 
 cdef tree.xmlOutputBuffer* c_buffer
-cdef tree.xmlBuffer* c_result_buffer
+cdef tree.xmlBuf* c_result_buffer
 cdef tree.xmlCharEncodingHandler* enchandler
 cdef const_char* c_enc
 cdef const_xmlChar* c_version
@@ -133,11 +133,11 @@
 
 try:
 if encoding is _unicode:
-result = (<unsigned char*>tree.xmlBufferContent(
-c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)].decode('UTF-8')
+result = (<unsigned char*>tree.xmlBufContent(
+c_result_buffer))[:tree.xmlBufLength(c_result_buffer)].decode('UTF-8')
 else:
-result = <bytes>(<unsigned char*>tree.xmlBufferContent(
-c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)]
+result = <bytes>(<unsigned char*>tree.xmlBufContent(
+c_result_buffer))[:tree.xmlBufLength(c_result_buffer)]
 finally:
 error_result = tree.xmlOutputBufferClose(c_buffer)
 if error_result < 0:
@@ -287,6 +288,9 @@
 tree.xmlOutputBufferWrite(c_buffer, 3, ' [\n')
 if c_dtd.notations != NULL:
-tree.xmlDumpNotationTable(c_buffer.buffer,
-  <tree.xmlNotationTable*>c_dtd.notations)
+c_buf = tree.xmlBufferCreate()
+tree.xmlDumpNotationTable(c_buf, <tree.xmlNotationTable*>c_dtd.notations)
+tree.xmlOutputBufferWrite(
+c_buffer, tree.xmlBufferLength(c_buf), <const_char*>tree.xmlBufferContent(c_buf))
+tree.xmlBufferFree(c_buf)
 c_node = c_dtd.children
 while c_node is not NULL:


Re: [xml] Availability of libxm2-2.9.0 release candidate 1

2012-08-09 Thread Stefan Behnel
Daniel Veillard, 10.08.2012 04:42:
   Following the first rc0 snapshot from last week and after much cleanup
 and testing, the first release candidate for the next libxml2 release is
 available at the usual place:
 [...]
   As stated previously, I target a final release beginning of September,
 and will probably make an rc2 release around a week from now fixing
 what have been reported in the meantime.

Could you say something about the likeliness of the (second) timsort patch
being integrated for 2.9? It would provide a very serious improvement to
the XPath performance, including various cases where the current
performance is totally unacceptable.

Stefan




Re: [xml] xmlXPathNodeSetSort performance

2012-08-08 Thread Stefan Behnel
Vojtech Fried, 08.08.2012 12:18:
 I had to do some changes to the original code to make it compile with msvc.

Did you report them back to the original author?

(note that github allows you to create pull requests through the web
interface, you can edit single files right in place)

Stefan



Re: [xml] Important: possible incompatible changes ahead for 2.9.0 !

2012-08-07 Thread Stefan Behnel
Daniel Veillard, 07.08.2012 10:16:
 On Mon, Aug 06, 2012 at 11:39:23PM +0200, Stefan Behnel wrote:
 thanks for the heads-up. I don't care all that much about the global dict
 size - 10M entries should be hard enough to reach for normal use cases.
 Most users only deal with a very small number of XML formats.
 
   Okay, the point too is that the dictionary may be used to intern small
 strings and while this is unlikely to break for a single document
 reusing the same dictionary over and over for many documents may lead to
 problem.

Ah, right - I remember one user complaining once that DTD IDs were stored
there and led to a memory leak. He generated them on the fly, which
meant that each document had a completely distinct set of IDs. That can add
up pretty quickly.

Maybe I should add a parser option that would use a subdict instead of the
global per-thread dict.
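
The growth pattern and the sub-dictionary idea can be illustrated with a toy interning table. This is a sketch only: `InternDict` and `parse_document()` are hypothetical names and do not reflect how libxml2's dictionary is implemented; the point is merely that per-document IDs pile up in a shared table but not in a short-lived child table.

```python
class InternDict:
    """Toy string-interning table with an optional parent to look up first."""

    def __init__(self, parent=None):
        self.parent = parent
        self.entries = {}

    def intern(self, text):
        # Reuse a string the shared parent already knows about.
        if self.parent is not None and text in self.parent.entries:
            return self.parent.entries[text]
        # Otherwise store it in this table only.
        return self.entries.setdefault(text, text)


shared = InternDict()
for name in ("item", "id"):
    shared.intern(name)  # tag and attribute names that every document reuses


def parse_document(doc_number, use_subdict):
    # A sub-dictionary is dropped with the document; the shared one is not.
    target = InternDict(parent=shared) if use_subdict else shared
    target.intern("item")
    target.intern(f"id-{doc_number}")  # ID generated on the fly, unique per document


for number in range(100):
    parse_document(number, use_subdict=True)
size_with_subdict = len(shared.entries)

for number in range(100):
    parse_document(number, use_subdict=False)
size_without_subdict = len(shared.entries)
```

With the sub-dictionary the shared table stays at its two entries; without it, one entry per document is added and never released.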


 In any case, the dictionary is only limited as part of the
 parsing process; if you allocate it on your own and override the parser
 context's one, you won't be affected.

Ok, then lxml won't run into this anyway.


 https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L123
 
   c_buffer = tree.xmlAllocOutputBuffer(enchandler)
   ...
   tree.xmlOutputBufferFlush(c_buffer)
  
   if c_buffer.conv is not NULL:
   c_result_buffer = c_buffer.conv
   else:
   c_result_buffer = c_buffer.buffer
 
 I think you can still keep the code initializing c_result_buffer
  the pointer names are kept the same and you're just testing for NULL
  so that should be fine
 
   if encoding is _unicode:
   result = (<unsigned char*>tree.xmlBufferContent(
 c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)].decode('UTF-8')
   else:
   result = <bytes>(<unsigned char*>tree.xmlBufferContent(
 c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)]
 
 That will have to be changed. I don't know how you express #ifdef in pxi,
 but if LIBXML2_NEW_BUFFER is defined then use
   tree.xmlBufContent/tree.xmlBufLength
 if not defined keep using
   tree.xmlBufferContent/tree.xmlBufferLength
 on the pointers.

Ok. I can just #define the names conditionally in an external header file.


 Another issue I found: xmlDumpNotationTable() still wants an xmlBuffer
 instead of the xmlBuf that outbuffer.buffer returns. Is the right fix here
 to include buf.h and call xmlBufBackToBuffer()?
 
   yeah those routines are unlikely to generate more than 2GBytes of
   output which is why using xmlBuffer is still okay. Actually I did not
   change any of the APIs, all the old APIs using xmlBuffer will still
   work as before, it's just that internally I changed
   xmlParserInputBuffer and xmlOutputBuffer to use the new xmlBuf
   internally.

 https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L293
 
  if c_dtd.notations != NULL:
 tree.xmlDumpNotationTable(c_buffer.buffer,
   <tree.xmlNotationTable*>c_dtd.notations)
 
   that will need some fixing, right ... You can't include buf.h because
 it is private ...

 What I would do is create a new xmlBuffer, dump to it
 get the resulting string and do tree.xmlOutputBufferWrite() to append it
 and then free the buffer. That sounds the simplest and most portable to
 old and new versions.

Ok. Let's assume that internal subsets tend to be small and that this is
good enough.


 It seems to me that redefining xmlBufferLength and xmlBufferContent to call
 the new xmlBuf functions and using a size_t (or ssize_t?) to store the
 result of xmlBufLength would do the trick.
 
   I can't do that really it would break the API, both data structures
 will coexist ... forever.

Sorry - I meant that it would do the trick for me, i.e. doing it on user
side will work. I wasn't suggesting to do it in libxml2's header files.


 BTW, is there a reason why there's both an xmlBufLength() and an
 xmlBufUse() that do the same thing? Since this is a new API that doesn't
 suffer from legacy junk yet, wouldn't one be enough? (And wouldn't
 xmlBufLength() be the perfect name?)
 
   well I wanted the conversion to be as automatic as possible and
 since the field which was used was buf->use, I prefer to keep that
 'alias' for simpler conversions where needed. But in theory you're
 perfectly right it is pure code duplication ...

Your choice - I would have decided against having two ways to do it.


   I don't plan to make an official release with the changes before
 September, so there is a bit of time to get this all cleaned up, and
 possibly refine the migration strategy for the few apps affected.

 There'll be a new release (3.0) of lxml quite soon, within a few weeks. It
 should be doable to get this fixed up by then.
 
   Okay, tell me if you have problems. I haven't fully finished, as you
 can see I'm still committing fixes and improvements on that part. Let's
 synchronize so that people making new build of lxml do it on top of
 the new libxml2, otherwise they will have to rely

Re: [xml] Important: possible incompatible changes ahead for 2.9.0 !

2012-08-07 Thread Stefan Behnel
Daniel Veillard, 06.08.2012 09:00:
I have put a snapshot tarball libxml2-2.9.0-rc0.tar.gz (and rpms)
 for people to have a try and raise issues with this change

One minor issue: I think you forgot to regenerate the documentation for the
above tar ball. I just noticed it because I routinely parse the HTML files
of new releases to generate the list of error constants in lxml, and the
archive above didn't add any new ones, despite having a new category
XML_FROM_BUFFER in the header files.

Stefan



Re: [xml] Important: possible incompatible changes ahead for 2.9.0 !

2012-08-06 Thread Stefan Behnel
Hi Daniel,

thanks for the heads-up. I don't care all that much about the global dict
size - 10M entries should be hard enough to reach for normal use cases.
Most users only deal with a very small number of XML formats.

But I did run into issues with the buffer changes.

Daniel Veillard, 06.08.2012 09:00:
   The new buffer structure will be ABI compatible with the old ones,
 i.e. the old code as compiled will be able to work with the new one, as
 the fields with the same values are in the same place in the new
 structures. But the structures are now opaque and the few places where
 the code was using them directly will need fixing.
   What I see from the usage there are for example access to xmlOutputBuffers:
 
   buf = xmlAllocOutputBuffer (NULL);
   dump stuff to the buffer...
   use data at buf->buffer->content, of size buf->buffer->use
 
 First okay, that was allowed by the API, but such buffers were supposed
 to be used for I/O and encoding conversion, in general accessing
 buf->buffer->content and buf->buffer->use directly was not really the
 expected way to do things. The fact that xmlOutputBuffer was not
 supposed to be used that way is the reason why there are no accessors for
 getting the output data, this is now fixed as of commit
 
   
 http://git.gnome.org/browse/libxml2/commit/?id=e258adecd0e19a6cfe6afa232b89aa416368820e
 
  So where there is such use of direct access, check the LIBXML2_NEW_BUFFER
 macro and if present then
- replace buf->buffer->content with xmlOutputBufferGetContent(buf)
- replace buf->buffer->use with xmlOutputBufferGetSize(buf)

I tested it and found that lxml is affected by this. lxml currently takes
the xmlBuffer* from either the conv or buffer field of the output
buffer and then calls xmlBufferContent() and xmlBufferLength() to get at
the result. I take it that this isn't how it'll work in the future, because
xmlBufferLength() returns an int and buffers are supposed to be larger than
that, right?

However, xmlOutputBufferGetContent() only reads the buffer field, not the
conv field. How should I use the conv field now? Can't the new
xmlOutputBufferGetContent() do the right thing for me?

Code that uses xmlBuffer directly is here:

https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L31

https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L123

Another issue I found: xmlDumpNotationTable() still wants an xmlBuffer
instead of the xmlBuf that outbuffer.buffer returns. Is the right fix here
to include buf.h and call xmlBufBackToBuffer()?

https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L293

(BTW, the reason why the serialisation code is doing so much stuff manually
is IIRC that lxml still supports a couple of libxml2 versions that lack the
newer features of the serialisation/xmlSave API. And also to avoid slight
changes to the serialised XML if it switched to native libxml2 functions
abruptly.)


   if in some place the xmlBufferPtr was passed independently of the
 OutputBuffer, it's possible to use xmlBufGetContent(buffer) and
 xmlBufUse(buffer) to achieve the same.

I assume you meant xmlBufContent() ?

It seems to me that redefining xmlBufferLength and xmlBufferContent to call
the new xmlBuf functions and using a size_t (or ssize_t?) to store the
result of xmlBufLength would do the trick.

BTW, is there a reason why there's both an xmlBufLength() and an
xmlBufUse() that do the same thing? Since this is a new API that doesn't
suffer from legacy junk yet, wouldn't one be enough? (And wouldn't
xmlBufLength() be the perfect name?)


   I don't plan to make an official release with the changes before
 September, so there is a bit of time to get this all cleaned up, and
 possibly refine the migration strategy for the few apps affected.

There'll be a new release (3.0) of lxml quite soon, within a few weeks. It
should be doable to get this fixed up by then.

Stefan



Re: [xml] xmlXPathNodeSetSort performance

2012-07-31 Thread Stefan Behnel
Stefan Behnel, 29.07.2012 06:55:
 Vojtech Fried, 26.07.2012 18:17:
 Third version of the timsort patch. Unfortunately, I was not able to
 finish it. It does not link on windows and I didn't test it in any way.
 But if anyone wants to try it, it is probably not far away... I moved
 the code to .c file and had to do some other shuffling.
 
 Hmm, it doesn't apply cleanly for me against 2.8.0 (so I guess you took it
 from the latest git master, which is the right thing to do). The only
 problem seems to be with the win32 setup, though, so I think I can safely
 ignore it.
 
 But once applied, it also doesn't build. The definition of the
 xmlNodeTimSort() function is missing.
 
 I tried your second patch and it works for me. Seeing how much more
 involved the third version of the patch is, I wonder if it's really all
 that bad to leave the timsort implementation in a header file. After all,
 it's supposed to be an externally maintained piece of code, and in the
 external repo, it lives in a header file. So it would reduce the
 maintenance overhead if it was just copied over unchanged.

Any comments? I think the second patch is ok as it stands.

Stefan



Re: [xml] xmlXPathNodeSetSort performance

2012-07-28 Thread Stefan Behnel
Vojtech Fried, 26.07.2012 18:17:
 Third version of the timsort patch. Unfortunately, I was not able to
 finish it. It does not link on windows and I didn't test it in any way.
 But if anyone wants to try it, it is probably not far away... I moved
 the code to .c file and had to do some other shuffling.

Hmm, it doesn't apply cleanly for me against 2.8.0 (so I guess you took it
from the latest git master, which is the right thing to do). The only
problem seems to be with the win32 setup, though, so I think I can safely
ignore it.

But once applied, it also doesn't build. The definition of the
xmlNodeTimSort() function is missing.

I tried your second patch and it works for me. Seeing how much more
involved the third version of the patch is, I wonder if it's really all
that bad to leave the timsort implementation in a header file. After all,
it's supposed to be an externally maintained piece of code, and in the
external repo, it lives in a header file. So it would reduce the
maintenance overhead if it was just copied over unchanged.

Stefan


Re: [xml] xmlXPathNodeSetSort performance

2012-07-26 Thread Stefan Behnel
Vojtech Fried, 26.07.2012 15:45:
 Keeping it in header has the advantage that it remains generic and can
 be used from anywhere and with any type of parameters (e.g. not only for
 sorting xmlNodePtrs). If in .c file, there can only be one sort
 function. Although since the sort is used from only one place, it does
 not matter :-) Another thing would be the need to move
 XP_OPTIMIZED_NON_ELEM_COMPARISON to a header included both from the
 sort.c and xpath.c. But that would probably be for better.

Absolutely. If we ever need it for sorting other kinds of data, we can
simply add another entry point to the source file. Everything else can just
be static and hidden in the module.


 I have done more performance tests. Timsort behaves better than I
 thought (or rather Shellsort worse than I thought). For sorted nodesets
 of size n like '/item[true()]' Timsort does only n-1 comparisons, unlike
 current Shellsort. ('/item' does not need sorting.)

Well, /item[true()] doesn't need sorting either, if you know that the
underlying set on which you evaluate the condition is always sorted
already. O(n) is way worse than O(1), for most n. That's why I mentioned
the need for a flag "sorted" in the node set structure.


 It does not matter
 whether the data are small or big, Timsort wins. For partially unsorted
 sequences like '/item[(position() mod 10) = 0] | /item[(position() mod
 10) = 1] | ..' it wins too. It wins both at number of comparisons or
 valgrind instructions (of the whole sort).

Obviously. Tim Peters specifically designed his algorithm to minimise the
number of comparisons, and as I said, node comparisons can be very costly.
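
The comparison counts are easy to observe from Python, whose built-in list sort is a Timsort variant. The sketch below is not a measurement of libxml2; `count_comparisons()` is a hypothetical helper that sorts plain integers and counts how often the comparison callback runs.

```python
import functools
import random


def count_comparisons(items):
    """Sort a copy of 'items' and return the number of comparisons made."""
    calls = 0

    def compare(a, b):
        nonlocal calls
        calls += 1
        return (a > b) - (a < b)

    # sorted() works on a copy, so the caller's list is left untouched.
    sorted(items, key=functools.cmp_to_key(compare))
    return calls


n = 1000
in_order = list(range(n))
shuffled = in_order[:]
random.Random(0).shuffle(shuffled)

sorted_cost = count_comparisons(in_order)
shuffled_cost = count_comparisons(shuffled)
```

On CPython, the already ordered input is recognised as a single run after n-1 comparisons, which matches the figure quoted above; the shuffled input needs considerably more.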

Stefan


Re: [xml] xmlXPathNodeSetSort performance

2012-07-26 Thread Stefan Behnel
Vojtech Fried, 26.07.2012 16:30:
 What I meant is that libxml currently sorts '/item[true()]', but it does
 not sort '/item'. I agree, it does not need to sort any of them. I agree
 with the flag "sorted" too, but it is another optimization, independent
 on what I am trying to do now.

Ok. I'm just mentioning it because it obviously needs to be done after this
change.

Stefan


Re: [xml] xmlXPathNodeSetSort performance

2012-07-25 Thread Stefan Behnel
Vojtech Fried, 25.07.2012 17:45:
 Second version of Timsort patch, slightly more polished. It builds on my
 gcc, I have fixed some warnings and merged the two headers into one. I
 did not move the code to .c file though, because the sort implementation
 uses some macro magic, i.e. the functions you see in the code are really
 function templates and they are instantiated with the name and type
 you choose with the macros (basically it is a poor man's C++ template
 system :-). I could remove the macros and specialize the functions for
 libxml xmlNodePtr, but that seems quite ugly to me.

What about moving it all into a .c file, then adding a new entry point at
the end that specialises the preceding code for the exact use case in xpath.c?

Thanks for doing this, BTW, the node set sorting performance is a huge
problem. Not only for very large lists of nodes, but also for multi-step
XPath expressions, which currently result in multiple sorting and
re-sorting steps.

Note that node comparison can be very costly, so another thing to
investigate would be to add a flag to the node set struct that remembers if
it's sorted already. That would allow the sort algorithm to skip to the
merging step directly, instead of traversing the whole node list first.
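
A minimal sketch of such a flag, assuming nodes can be represented by their document-order position: `NodeSet`, `is_sorted` and `sort_calls` are hypothetical names for illustration and do not correspond to fields of libxml2's node set structure.

```python
class NodeSet:
    """Hypothetical node set that remembers whether it is in document order."""

    def __init__(self):
        self.positions = []    # document-order positions standing in for nodes
        self.is_sorted = True  # an empty set is trivially sorted
        self.sort_calls = 0    # counts how often a real sort was needed

    def add(self, position):
        # Appending keeps the order only if the new node comes after the last one.
        if self.positions and position < self.positions[-1]:
            self.is_sorted = False
        self.positions.append(position)

    def ensure_sorted(self):
        # Skip the sort (and all its node comparisons) when the flag is set.
        if self.is_sorted:
            return
        self.positions.sort()
        self.sort_calls += 1
        self.is_sorted = True


nodes = NodeSet()
for position in (1, 2, 5):
    nodes.add(position)
nodes.ensure_sorted()           # no sort needed, the flag is still set
calls_before = nodes.sort_calls

nodes.add(3)                    # out of order: clears the flag
nodes.ensure_sorted()           # this time a real sort runs
```

The flag costs one comparison per insertion, which is cheap next to re-checking or re-sorting the whole set at every evaluation step.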

Stefan


[xml] open libxml2 crash bugs in lxml's bug tracker

2012-07-02 Thread Stefan Behnel
Hi,

lxml's bug tracker currently holds two user code triggered crash bugs for
libxml2:

https://bugs.launchpad.net/lxml/+bug/1009118

- segfault with XPath expression with unknown namespace and nested
function calls

https://bugs.launchpad.net/lxml/+bug/502959

- segfault when parsing docbook XML with several external entities

I haven't managed so far to take a closer look (except for reproducing them
with a stock lxml), so I'm dumping them here for now (sorry).

There's also a libxslt bug that I just posted on the ML over there.

Stefan


Re: [xml] Support for really large XML documents

2012-06-02 Thread Stefan Behnel
Hi,

note that your top-posting makes it harder to follow the discussion and to
reply to it.

Vit Zikmund, 25.05.2012 13:10:
 Well, you are right with the buffer writing to memory and the author of 
 the XMLSec library confirmed that he has to have the whole document there 
 due to c14n. Also it seems that it is a fundamental part of the process, 
 so there is no easy fix on his side.
 http://www.aleksey.com/pipermail/xmlsec/2012/009411.html

I don't see the link here. C14N can be output to a (temp) file just as
well, and the input can be streamed back afterwards in order to encrypt or
sign it (or whatever). It may really not be an easy fix, as the author
said, but I don't see it being impossible.
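
To illustrate the idea (and only the idea, this says nothing about how XMLSec works internally), here is a sketch using the C14N 2.0 writer from Python's standard library. The function name `digest_canonical_xml()` and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import io
import tempfile
from xml.etree.ElementTree import canonicalize


def digest_canonical_xml(source, chunk_size=64 * 1024):
    """Canonicalise 'source' (path or file object) into a temp file, then hash it in chunks."""
    with tempfile.TemporaryFile(mode="w+", encoding="utf-8", newline="") as tmp:
        # The canonical form goes to the temp file instead of one big string.
        canonicalize(from_file=source, out=tmp)
        tmp.seek(0)
        sha = hashlib.sha256()
        while True:
            chunk = tmp.read(chunk_size)
            if not chunk:
                break
            sha.update(chunk.encode("utf-8"))
        return sha.hexdigest()


sample = '<doc b="2"   a="1"><empty/></doc>'
digest = digest_canonical_xml(io.StringIO(sample))
```

The canonical output never has to exist as a single in-memory string; the digest step only ever holds one chunk at a time.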

Stefan


Re: [xml] Release candidate 1 of libxml2 2.8.0

2012-05-15 Thread Stefan Behnel
Daniel Veillard, 15.05.2012 14:48:
   I finally managed to go through all the patches which accumulated in
 Gnome bugzilla and do the various necessary cleanups to try to get a
 release. There are however *many* changes, especially on the portability
 side which I just can't test myself (including many changes on various
 Windows toolchains). So I just made a release candidate of 2.8.0
 
ftp://xmlsoft.org/libxml2/
 
 look for libxml2-2.8.0-rc1.tar.gz, there is also rpms generated on my
 Fedora 16 box for those interested.
 The code seems to work okay, I have been running with the changes as
 I added them and my system didn't blow up, it passes valgrind, there
 don't seem to be leaks at least on the main paths, so I think it's
 time for more general testing.
  Please give it a try, I will make another rc release if needed, and
 will try to shoot for a final 2.8.0 release a week from now ! I know a
 number of people waited for an official release, please help me finish
 by testing it and reporting problems if you find some :-)

Didn't do any extensive testing, but it builds and tests fine for me with lxml.


   thanks in advance,

Thanks Daniel!

Stefan


[xml] XPath performance issues

2011-11-04 Thread Stefan Behnel

Hi,

almost exactly two years ago, I brought up the topic of the surprisingly 
unpredictable XPath performance on this list (thread titled "confusing 
xpath performance characteristics", without response at the time). The 
problem is not the actual search, but the merging of node sets after the 
fact. The effect is that a detailed expression like


 /xs:schema/xs:complexType//xs:element[@name="equity"]/@type

is several orders of magnitude slower than the less constrained expression

//xs:element[@name="equity"]/@type

The problem here is that the evaluator finds several matches for 
xs:complexType, searches each subtree, and then goes into merging the 
subresults and removing duplicates along the way, using a quadratic 
algorithm. The runtime of this approach quickly explodes with the number of 
nodes in the node set, especially since it can get applied several times 
while going up the expression tree.


There are several surprising expressions where this shows, e.g.

descendant-or-self::dl/descendant::dt

is very fast, whereas the semantically only slightly different

descendant-or-self::dl/descendant-or-self::*/dt

is already quite a bit slower, but the equivalent

descendant-or-self::dl//dt

is orders of magnitude slower. You can test them against the 4.7MB HTML5 
spec page at


http://www.w3.org/TR/html5/Overview.html

The last approach takes literally hours, whereas the first two finish 
within the order of seconds. I ran this within callgrind, and the top 
function that takes 99.4% of the overall runtime is 
xmlXPathNodeSetMergeAndClear(), specifically the inner loop starting with 
the comment "skip duplicates".


There are two issues here. One is that the duplicate removal uses the 
easiest to implement, and unluckily also slowest algorithm. This is 
understandable because doing better is not trivial at all in C. However, 
the algorithm could be improved quite substantially, e.g. by using merge 
sort (even based on an arbitrary sort criterion like the node address). If 
eventual sorting in document order is required, a merge sort is the best 
thing to do anyway, as it could be applied recursively along the XPath 
evaluation tree, thus always yielding sorted node sets at each stage.
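
The difference between the two strategies can be sketched as follows. Nodes are represented by integers standing in for their sort key (document position or address); `merge_quadratic()` only mimics the scan-for-duplicates approach described above and is not a transcription of xmlXPathNodeSetMergeAndClear().

```python
def merge_quadratic(left, right):
    """Append each new node after scanning the first set for a duplicate."""
    merged = list(left)
    for node in right:
        if node not in left:  # linear scan for every candidate
            merged.append(node)
    return merged


def merge_sorted_unique(left, right):
    """Merge two sorted, duplicate-free lists in one linear pass."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            merged.append(left[i])
            i += 1
        elif right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            # Same node in both inputs: keep one copy, advance both sides.
            merged.append(left[i])
            i += 1
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


first = [1, 3, 5, 8]
second = [2, 3, 6, 8, 9]
slow_result = merge_quadratic(first, second)
fast_result = merge_sorted_unique(first, second)
```

Both find the same set of nodes, but the linear merge also leaves the result sorted, so no separate sorting pass is needed afterwards.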


The second issue is that the duplicate removal is not necessary at all in a 
wide variety of important cases. As long as the subexpression on the right 
does not access any parents (which likely applies to 95% of the real world 
XPath expressions), i.e. the subresults originate from distinct subtrees, 
there can be no duplicates, and the subresults are automatically in 
document order. Thus, the merging will collapse into a simple 
concatenation. I admit that this case will sometimes be hard to detect 
because the left part of the expression may already have matched 
overlapping parts of the tree. But I think it is worth more than just a try.


Stefan


Re: [xml] libxml2 messed up MonoTouch and Interface Builder

2010-11-09 Thread Stefan Behnel

James Wright, 08.11.2010 17:40:

I tried to install libxml2 yesterday for a Ruby side project of mine.
First I tried it with MacPorts but my MacPorts wouldn't work so I tried to
download the source for libxml2 and make the install which ran with some
errors but nothing that stopped the install.
That still didn't work for my Ruby coding (was trying to use the nokogiri
library) so I got MacPorts working and installed libxml2 via that which
eventually worked fine.

After that I went back to my MonoTouch iPhone development and noticed that
MonoTouch would no longer work.
Now I don't actually get any error and am working with the MonoTouch people
to figure that out, but if I try to use Interface Builder in OS X it errors
when I try to create a new .xib file with this
message: http://pastebin.com/n7zkKau5 (pastebin link).
The most important part of the message is this:



Thread 0 Crashed:  Dispatch queue: com.apple.main-thread
0   libxml2.2.dylib       0x0001004f9b88 __xmlRaiseError + 888 (error.c:614)
1   libxml2.2.dylib       0x0001004fa5b7 xmlErrEncodingInt + 151 (parserInternals.c:206)
2   libxml2.2.dylib       0x0001004fc326 xmlCurrentChar + 550 (parserInternals.c:707)
3   libxml2.2.dylib       0x000100507168 xmlParseCharData + 904 (parser.c:4232)
4   libxml2.2.dylib       0x00010051264e xmlParseTryOrFinish + 2014 (parser.c:10990)
5   libxml2.2.dylib       0x0001005139bb xmlParseChunk + 411 (parser.c:11611)
6   com.apple.Foundation  0x7fff8759bca8 -[NSXMLParser parse] + 294

As you see, it errors on libxml2.

I REALLY need to figure out what is going on here.  My knowledge in this
area is not strong and I have no clue what is going wrong.
I have tried reinstalling libxml2 and MacPorts and have tried uninstalling
libxml2 via MacPorts and no luck.


This looks more like a general MacOS problem than anything related 
specifically to libxml2. You may get better responses on a MacOS mailing 
list than what you'd get here.


Stefan


Re: [xml] libxml2/libxslt: global variables considered harmful

2010-07-28 Thread Stefan Behnel

Daniel Veillard, 28.07.2010 11:53:

On Wed, Jul 14, 2010 at 02:07:42PM +0200, Michael Stahl wrote:

IMHO such a design would also be possible for libxml2/libxslt, but of
course this would be an incompatible interface change.
usually there isn't much enthusiasm for that kind of thing  :)


   Actually there is a bit of this already, see
 http://xmlsoft.org/html/libxml-globals.html#xmlGlobalState
it's stored in thread local variables, and access is redirected
via macros when compiling, it solves the problem in most cases but
not all cases. Plus that doesn't fix libxslt which is hit very hard
with the issue.


Same problem here. There were a couple of bug reports in the past regarding 
lxml's interaction with the original libxml2 Python bindings which often 
leads to crashes. The currently recommended way of working around this is 
to link lxml statically against its dependencies. Not a perfect solution, 
but a rather safe one.


Stefan


Re: [xml] Walking tree without recursion

2010-06-25 Thread Stefan Behnel

Michael Ludwig, 23.06.2010 23:29:

Oliver Kindernay schrieb am 23.06.2010 um 18:39 (+0200):

I am using libxml2 HTML 4.0 parser to parse HTML and XHTML web pages.
I want to find specific tags (e.g. <a>), so I have to walk through the
tree of parsed document. And I don't want to use recursion like in
this example http://xmlsoft.org/examples/tree1.c. Is there some
mechanism in libxml which provides parsed nodes in some queue?


Sounds like you should be using a high-level approach such as XPath
or XSLT. Forgoing the benefits provided by these technologies is like
deliberately using flintstone to make fire.


Not necessarily. lxml.etree (Pythonic Python bindings for libxml2) has a 
pair of macros for an iterative tree traversal loop. When I introduced it, 
it gave me a 30% speed-up compared to my original recursive traversal code, 
and it was almost 10% faster than plain XPath at the time. See the 
bench_lxml_xpath() and bench_lxml_getiterator() functions here:


http://codespeak.net/lxml/performance.html#a-longer-example

The code is near the end of this file (look for a long comment starting 
with "depth first tree walker"):


http://codespeak.net/svn/lxml/trunk/src/lxml/etree_defs.h

These macros are the main reason why tree iteration is so blazingly fast in 
lxml.etree. Just look at these numbers:


http://codespeak.net/lxml/performance.html#tree-traversal

When searching for a specific tag (and when XML-ID is not an option), a 
well forged loop can be a lot faster than a generic XPath implementation.
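
A minimal sketch of a non-recursive tag search, assuming Python's standard xml.etree.ElementTree rather than lxml's C macros. It uses an explicit stack instead of the parent/sibling pointer walk that libxml2 trees allow, and the helper name iter_tags is made up for this illustration:

```python
import xml.etree.ElementTree as ET


def iter_tags(root, tag):
    """Yield elements named `tag` in document order, without recursion."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.tag == tag:
            yield node
        # Push children in reverse so the first child is popped first.
        stack.extend(reversed(list(node)))


doc = ET.fromstring(
    "<html><body><a href='1'/><p><a href='2'/></p><a href='3'/></body></html>"
)
hrefs = [a.get("href") for a in iter_tags(doc, "a")]
print(hrefs)
```

The stack never grows beyond the number of pending siblings along the current path, so deep documents do not hit the interpreter's recursion limit.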


Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] HTMLparser

2010-04-29 Thread Stefan Behnel

Sergio Monteiro Basto, 28.04.2010 20:08:

Who is the maintainer of HTMLparser? I reported a bug, and no one
has replied.
What can I do about that?
Should HTMLparser parse badly broken HTML or not?


IIRC, the last thing I read was that the HTML parser should basically 
follow HTML5 where possible.


Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] XPath issue

2010-03-18 Thread Stefan Behnel

Joshua Kwan, 17.03.2010 18:21:

I've got an interesting problem about libxml2's XPath support posted on 
stackoverflow:
http://stackoverflow.com/questions/2459428/weird-xpath-behavior-in-libxml2
Please read about it there.

There haven't been any answers, so I thought I would consult the real libxml2 
mailing list before trying something else. I'm not subscribed to the list, so 
please CC me on replies.

Thanks in advance for any help you could provide!
BTW, I'm using libxml2 2.6.27 and a brief scan of the changelog didn't reveal 
anything that might describe my issue as a bug that's been fixed since.


Version 2.6.27 has known XPath issues. I'd certainly try upgrading before 
wasting time on anything else.


Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] confusing xpath performance characteristics

2010-01-27 Thread Stefan Behnel
[bump]

Any comments?

Stefan Behnel, 09.11.2009 19:23:
 Stefan Behnel, 09.11.2009 09:57:
 It's the last operation, merging and sorting large sets of results, that
 makes this extremely slow - it takes 92% of the evaluation time in my tests
 (using libxml2 2.7.5). It's much faster to traverse the document in a
 single step, and just select single attributes from it, that can quickly be
 appended to the node set.

 I imagine that this step could actually be optimised away in many cases
 (like the case above, where results are guaranteed to be found in doc
 order), so I guess it's just in there to avoid too much special casing. But
 it seriously kills the performance here.
 
 Would it be ok to add a new int isInDocumentOrder field to the xmlNodeSet
 struct that would be true for node sets that are known to be in document 
 order?
 
 That would make it easy to skip the sorting step in all cases where
 building the node set follows document order anyway. Given that node
 comparison is horribly expensive (more than 90% of the sort time in my
 tests), I think it's absolutely worth avoiding the sorting step whenever
 possible.
 
 Also, for sorted node sets, xmlXPathNodeSetAdd() could compare the new node
 to the last node in the node-set and clear the flag if the new node breaks
 the document order. That way, only N-1 comparisons would be required for a
 sorted set of N nodes, instead of something like O(N^2) currently.
 
 Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] XML validation using Schematron using LibXml2?

2009-12-16 Thread Stefan Behnel

Andrew Hartley, 16.12.2009 12:50:
 Is it possible yet with the latest libxml2 build to validate an XML document
 using a Schematron?  If so can you update the LibXml2 web site to show code
 examples of how you go about doing this please?  If this is not yet
 possible, do you know when this is likely to be fully implemented?

There are still limitations in the latest schematron support. However, you
can use libxslt and run the ISO Schematron stylesheets with it:

http://www.schematron.com/

We will actually bundle them in future versions of lxml (the Python XML
lib), as they are currently the most spec compliant implementation
available that runs with libxml2/libxslt.

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Line number value limit

2009-11-12 Thread Stefan Behnel
Hi,

Csaba Raduly, 12.11.2009 10:29:
 Why is the line number in xmlNode limited to an unsigned short ?

Because it's a trade-off between space and usefulness. Note that the parser
reports line numbers without that limitation. Only the xmlNode struct
restricts it.

This is a FAQ, BTW. You can look up the details in the mailing list archives.


 Doesn't libxml2 handle XML files with more than 65535 lines?

The reported line number in no way implies any limitation of the parser.
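
A small sketch of the effect, assuming the node field clamps at the 16-bit maximum instead of wrapping around (which matches reports of the counter getting stuck at 65535). The function name stored_line is hypothetical:

```python
USHORT_MAX = 0xFFFF  # largest value an unsigned 16-bit field can hold


def stored_line(parser_line: int) -> int:
    """Line number as a 16-bit node field would keep it (clamped, not wrapped)."""
    return min(parser_line, USHORT_MAX)


for line in (1, 65534, 65535, 65536, 1_000_000):
    print(line, "->", stored_line(line))
```

The parser's own counter is not affected; only the value copied into each node is limited.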

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] confusing xpath performance characteristics

2009-11-09 Thread Stefan Behnel

Stefan Behnel, 09.11.2009 09:57:
 It's the last operation, merging and sorting large sets of results, that
 makes this extremely slow - it takes 92% of the evaluation time in my tests
 (using libxml2 2.7.5). It's much faster to traverse the document in a
 single step, and just select single attributes from it, that can quickly be
 appended to the node set.
 
 I imagine that this step could actually be optimised away in many cases
 (like the case above, where results are guaranteed to be found in doc
 order), so I guess it's just in there to avoid too much special casing. But
 it seriously kills the performance here.

Would it be ok to add a new int isInDocumentOrder field to the xmlNodeSet
struct that would be true for node sets that are known to be in document order?

That would make it easy to skip the sorting step in all cases where
building the node set follows document order anyway. Given that node
comparison is horribly expensive (more than 90% of the sort time in my
tests), I think it's absolutely worth avoiding the sorting step whenever
possible.

Also, for sorted node sets, xmlXPathNodeSetAdd() could compare the new node
to the last node in the node-set and clear the flag if the new node breaks
the document order. That way, only N-1 comparisons would be required for a
sorted set of N nodes, instead of something like O(N^2) currently.
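
To make the proposal concrete, here is a toy model in Python, not libxml2 code. Nodes are plain integers standing for their position in the document, and the class and attribute names are invented for the illustration:

```python
class NodeSet:
    """Toy node set; nodes are ints giving their position in document order."""

    def __init__(self):
        self.nodes = []
        self.is_in_document_order = True  # hypothetical flag from the proposal
        self.comparisons = 0

    def add(self, node):
        # One comparison against the last node keeps the flag up to date.
        if self.nodes and self.is_in_document_order:
            self.comparisons += 1
            if node < self.nodes[-1]:
                self.is_in_document_order = False
        self.nodes.append(node)

    def sort(self):
        """Sort only when needed; return True if a sort was actually done."""
        if self.is_in_document_order:
            return False
        self.nodes.sort()
        self.is_in_document_order = True
        return True


ordered = NodeSet()
for n in [1, 2, 5, 8, 13]:
    ordered.add(n)

shuffled = NodeSet()
for n in [3, 1, 2]:
    shuffled.add(n)

print(ordered.comparisons, ordered.sort())
print(shuffled.sort(), shuffled.nodes)
```

For a set that is built in document order, the only cost is one comparison per added node after the first, and the sort step is skipped entirely.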

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] html parsing incomplete - bug?

2009-10-13 Thread Stefan Behnel

Lydia Patrovic wrote:
 Note the main&amp;20090924_2 attribute value, which can be interpreted as an
 unterminated entity.

:) Nice little Freudian copy&paste quoting error. Here's the line from the
real 'HTML' file:

<script type="text/javascript" src="merge.php?f=main&20090924_2"></script>

Note the unescaped '&' character in the URL.

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] html parsing incomplete - bug?

2009-10-13 Thread Stefan Behnel

Martin (gzlist) wrote:
 On 13/10/2009, Stefan Behnel stefan...@behnel.de wrote:
 Lydia Patrovic wrote:
 Note the main&amp;20090924_2 attribute value, which can be interpreted
 as an
 unterminated entity.
 :) Nice little Freudian copy&paste quoting error. Here's the line from the
 real 'HTML' file:

 <script type="text/javascript" src="merge.php?f=main&20090924_2"></script>

 Note the unescaped '&' character in the URL.
 
 I'd have thought the embedded null at byte 532 would be the cause. Try
 bytes.replace("\x00", "") before treating it as a C string. Seems to
 get the document parsed pretty much as expected for me.

Interesting. Sounds totally like the right solution.
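
For illustration, a short Python sketch of what a NUL-terminated C string reader would see, using ctypes from the standard library. The helper names are made up, and this only models the C string convention, not the parser itself:

```python
import ctypes


def c_string_view(data: bytes) -> bytes:
    """Return the part of `data` a NUL-terminated C string reader would see."""
    return ctypes.create_string_buffer(data).value


def strip_nulls(data: bytes) -> bytes:
    """Drop embedded NUL bytes before handing the data to C code."""
    return data.replace(b"\x00", b"")


html = b"<p>before</p>\x00<p>after</p>"
print(c_string_view(html))
print(c_string_view(strip_nulls(html)))
```

Everything after the first zero byte is invisible to code that relies on the terminator, which would explain input that appears to end early.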

I wonder why the parser stops parsing here, though. Is '\0' explicitly
considered an invalid character in (broken) HTML, or is it really just the
usual C EOS slip?

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] html parsing incomplete - bug?

2009-10-13 Thread Stefan Behnel

Daniel Veillard wrote:
 On Tue, Oct 13, 2009 at 01:22:12PM +0100, Martin (gzlist) wrote:
 On 13/10/2009, Stefan Behnel wrote:
 I wonder why the parser stops parsing here, though. Is '\0' explicitly
 considered an invalid character in (broken) HTML, or is it really just the
 usual C EOS slip?
 It's certainly invalid, though could be recoverable.

 In the various html versions: HTML 4 defers to the SGML spec which I'm
 not rich enough to consult, XHTML 1 defers to XML which we all know
 says nulls are verboten, and the current HTML 5 draft is pretty clear:

 http://www.w3.org/TR/2009/WD-html5-20090825/syntax.html#preprocessing-the-input-stream

 All U+0000 NULL characters in the input must be replaced by U+FFFD
 REPLACEMENT CHARACTERs. Any occurrences of such characters is a parse
 error.

 (this is all in the context of a decoded-to-Unicode stream, not raw
 UTF-16 etc.)
 
   When HTML5 will become a Last Call draft or something then I think it
 will make sense to try to update the parser to use the same recovery
 tricks.

In any case, the parser should either apply the above replacement rule or
report an error when encountering a '\0' byte in the input stream.
Currently, it just silently terminates.
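
The replacement rule quoted above is easy to sketch on an already decoded string. This is an illustration in Python with a made-up function name, not what libxml2 does:

```python
def preprocess(text: str):
    """Replace U+0000 with U+FFFD and count each occurrence as a parse error."""
    errors = text.count("\x00")
    return text.replace("\x00", "\ufffd"), errors


cleaned, errors = preprocess("<p>a\x00b</p>\x00")
print(repr(cleaned), errors)
```

Either half of the result covers one of the two options: the cleaned text for recovery, or a non-zero error count for reporting.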


 Note that the 0 in content may have cut the input at the Python-C
 interface layer. But sure libxml2 internals don't like 0 in content.

We also pass UCS4 encoded data through the same code, so, no, that's not an
issue here.

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


[xml] parsing UCS4 in chunks fails with 2.7.4/5

2009-09-28 Thread Stefan Behnel
Hi,

there seems to be a change in libxml2 2.7.4 that prevents it from parsing a
Python unicode string buffer, which is UCS4-LE encoded on my system. The
first call to xmlCtxtResetPush() works and parses the first chunk as
expected, but subsequent calls to xmlParseChunk() then fail with an error:
input conversion failed due to input error, bytes 0x22 0x00 0x00 0x00
(the latter being '"', which was the first character in the second chunk).

So, when passing '<?xml version=' to xmlCtxtResetPush() and '"1.0"?><ro' to
xmlParseChunk(), I get the error above. I only noticed this by accident, as
a few badly written test cases in lxml happened to parse from Unicode
strings when run under Python 3.
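
As a cross-check of the byte sequence in the error message, this is how a double quote looks in UCS4-LE (UTF-32 little endian). The helper name is hypothetical:

```python
def ucs4le_bytes(text: str) -> str:
    """Format the UTF-32-LE encoding of `text` as a list of hex bytes."""
    return " ".join(f"0x{b:02x}" for b in text.encode("utf-32-le"))


print(ucs4le_bytes('"'))
```

So the bytes the converter rejects are a complete, valid code unit, not a character split across two chunks.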

Any ideas where this might originate from?

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Release of libxml2-2.7.4

2009-09-11 Thread Stefan Behnel

Daniel Veillard wrote:
   Better late than never, but an awful lot of pending bugs got fixed.
 Still no major improvement except adding symbol versioning to libxml2
 shared libs, which is fairly important for long term maintenance, but
 not worth jumping to 2.8.0
 
Tarball and signed rpms available at
  ftp://xmlsoft.org/libxml2/
 
 There are still a few things which I would have loved to put in the
 release like per context error handling and the like but I prefer a
 (nearly) bug fix only release that people can upgrade to without
 troubles and then work on changing more stuff

Daniel, thanks a lot for this release. That's clearly the longest list of
bug fixes in a libxml2 release, ever.

Stefan

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] c14n 1.1 support (patch)

2009-08-20 Thread Stefan Behnel
Hi,

Aleksey Sanin wrote:
 Please find attached a patch that adds support for the new
 version of c14n (http://www.w3.org/TR/xml-c14n11/). I am
 getting questions about it in the xmlsec mailing list and
 I finally decided to implement it. I would greatly appreciate
 if you can accept this patch and push it into the gnome git
 repository (note, that there are some new files/folders added
 for the new test cases).

it's usually a good idea to put patches into the bug tracker so that they
do not get lost in the e-mail backscroll buffer.

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Inserting XML Schema default attributes

2009-05-31 Thread Stefan Behnel
Hi,

as a quick follow-up: injecting default attributes works when applying the
schema *after* the parsing step, it does *not* work when validating inside
the parser using the SAX plug.

Stefan


Stefan Behnel wrote:
 I'm trying to inject default attributes into a document from an XML Schema
 during parsing. I set up a validation context and set the
 XML_SCHEMA_VAL_VC_I_CREATE option on it, which, if I understand the docs
 correctly, tells the validator to create defaulted/fixed attributes if they
 do not exist already. Then I inject the validation context into the parser
 using xmlSchemaSAXPlug().
 
 The schema document I use is
 
 '''
 <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:element name="a" type="AType"/>
   <xsd:complexType name="AType">
     <xsd:sequence minOccurs="4" maxOccurs="4">
       <xsd:element name="b" type="BType" />
     </xsd:sequence>
   </xsd:complexType>
   <xsd:complexType name="BType">
     <xsd:attribute name="hardy" type="xsd:string" default="hey" />
   </xsd:complexType>
 </xsd:schema>
 '''
 
 The document I parse is
 
 '''
 <a><b hardy="ho"/><b/><b hardy="ho"/><b/></a>
 '''
 
 The document validates. However, no default attributes are inserted,
 neither with my code nor with xmllint (which doesn't seem to support that
 anyway). When I debug into the validator code in xmlschemas.c, I get to
 line 25351 (libxml2 2.7.3):
 
 '''
 /*
 * Get the owner element; needed for creation of default attributes.
 * This fixes bug #341337, reported by David Grohmann.
 */
 if (vctxt->options & XML_SCHEMA_VAL_VC_I_CREATE) {
     xmlSchemaNodeInfoPtr ielem = vctxt->elemInfos[vctxt->depth];
 ==> if (ielem && ielem->node && ielem->node->doc)
         defAttrOwnerElem = ielem->node;
 }
 '''
 
 but ielem->node is NULL every time it gets there, so this doesn't fly.
 
 Is there anything else I have to do to make this work?
 
 Thanks,
 
 Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Possible bug, libxml segfault

2009-05-30 Thread Stefan Behnel
Hi,

this looks more like a problem in lxml, so I'll answer on the lxml mailing
list.

Stefan


Avleen Vig wrote:
 Background:
 We use libxml and libxslt in one of our applications (specifically, in
 Python via lxml).
 
 Recently we've seen our application dying at strange times for no
 apparent reason.
 We managed to get a core file out of one crash, and the results of
 some of our debugging are here:
 http://xml.pastebin.com/m70c259d6
 (I'd be happy to poke more in a particular direction on there, I'm a
 bit new to gdb :)
 
 To me, it seems the parser is complaining while trying to parse the
 namespaces in the stylesheet node in transforms/_base.xslt
 The node for that opens like this:
 <xsl:stylesheet  version="1.0"
 xmlns="http://www.w3.org/1999/xhtml"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:tfxslt="http://tfnet.co.uk/ns/tfxslt"
 xmlns:fb="http://www.facebook.com/2008/fbml"
 extension-element-prefixes="tfxslt str exsl"
 exclude-result-prefixes="str exsl tfxslt fb"
 xmlns:error="http://www.woome.com/error/"
 
 I dug a little deeper and found a bunch of the address out of bounds
 errors and thought I should ask here as I'm drawing a blank on where
 to go next.
 
 The problem happens intermittently, but usually several times a day. I
 could probably reproduce it.
 I also see the 'exclude-result-prefixes' mentioned in the backtrace,
 could that be involved here?
 
 Any suggestions you have would be much appreciated!
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


[xml] parse-time validation against a user provided DTD

2009-05-09 Thread Stefan Behnel
Hi,

looking through the API docs, I can't really figure out a way to stick an
external DTD into the parser, so that it validates against that rather than
trying to load a DTD for the DOCTYPE (or also to do DTD validation if the
document does not define a DOCTYPE at all).

I can see that xmllint can validate against an externally provided DTD, but
only after parsing, so that doesn't help.

Has anyone done this before? Is there maybe even a preferred/obvious/RTFM
way of doing that? I'm interested in a way to do this a) by providing a
readily parsed xmlDtd, and maybe even b) by providing a public ID (in case
there is a shortcut other than looking up the DTD manually).

Thanks,

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


[xml] Inserting XML Schema default attributes

2009-05-08 Thread Stefan Behnel
Hi,

I'm trying to inject default attributes into a document from an XML Schema
during parsing. I set up a validation context and set the
XML_SCHEMA_VAL_VC_I_CREATE option on it, which, if I understand the docs
correctly, tells the validator to create defaulted/fixed attributes if they
do not exist already. Then I inject the validation context into the parser
using xmlSchemaSAXPlug().

The schema document I use is

'''
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="AType"/>
  <xsd:complexType name="AType">
    <xsd:sequence minOccurs="4" maxOccurs="4">
      <xsd:element name="b" type="BType" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="BType">
    <xsd:attribute name="hardy" type="xsd:string" default="hey" />
  </xsd:complexType>
</xsd:schema>
'''

The document I parse is

'''
<a><b hardy="ho"/><b/><b hardy="ho"/><b/></a>
'''

The document validates. However, no default attributes are inserted,
neither with my code nor with xmllint (which doesn't seem to support that
anyway). When I debug into the validator code in xmlschemas.c, I get to
line 25351 (libxml2 2.7.3):

'''
/*
* Get the owner element; needed for creation of default attributes.
* This fixes bug #341337, reported by David Grohmann.
*/
if (vctxt->options & XML_SCHEMA_VAL_VC_I_CREATE) {
    xmlSchemaNodeInfoPtr ielem = vctxt->elemInfos[vctxt->depth];
==> if (ielem && ielem->node && ielem->node->doc)
        defAttrOwnerElem = ielem->node;
}
'''

but ielem->node is NULL every time it gets there, so this doesn't fly.

Is there anything else I have to do to make this work?

Thanks,

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] xmllint working with 213MB large xml files

2009-02-27 Thread Stefan Behnel
Hi,

Janis Rocans wrote:
 Yesterday I used xmllint (for Windows, xmllint.exe: using libxml
 version 20703) to validate an XML file against an XSD, but it reported a lot
 of errors on line 65535 (binary ). I noticed that
 there are no errors, but the line counter got stuck. I believe there's a
 16-bit int for counting rows. Maybe it should use a larger int?

Yes, that's done to save space in the internal data structures. Most XML
data files are still way shorter than 64K lines.

AFAIR, there's a patch, though, that supports higher line numbers.

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] lxml binary for Python 2.6+

2009-01-13 Thread Stefan Behnel
Casey Schroeder wrote:
 I am searching for an easy way to get lxml for v. 2.6 Python on windows.
 Can someone tell me if there is a comparable exe to those listed here for
 2.6?

 http://users.skynet.be/sbi/libxml-python/

In case you really meant lxml (and not libxml2, for which this is the
mailing list), you might be happy with these:

http://pypi.python.org/pypi/lxml/2.1.4

Newer builds are not currently available.

Stefan

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] libxml2 and Python 2.6 on WindowsXP

2009-01-13 Thread Stefan Behnel
Bernd Blacha wrote:
 I want to use libxml / libxslt in Windows XP.

If all you want is to use the libraries and not the Python modules that
have the same API, you might be better off using lxml as it provides a
much more pythonic interface.

The latest stable MS-Windows builds are here:

http://pypi.python.org/pypi/lxml/2.1.4

Stefan

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] libxml2 very slow on big data dump

2008-12-16 Thread Stefan Behnel
Hi,

Alexandre Macard wrote:
 I am trying to dump a node from a big XML file (nearly 7 MB), and libxml2
 is very slow to respond.

 I tried to trace the problem and it seems to spend all its time in the
 function xmlOutputBufferWriteEscape.
 I do not need to escape data because I use a base64 encoding.

You didn't write which version of libxml2 you are using, but there was a
bug in an older version that could lead to horrible performance when
serialising character entities.

Try upgrading your library.

Stefan

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] libxml2 very slow on big data dump

2008-12-16 Thread Stefan Behnel
Alexandre Macard wrote:
 Stefan Behnel wrote:
 Alexandre Macard wrote:
 I am trying to dump a node from a big XML file (nearly 7 MB), and libxml2
 is very slow to respond.

 I tried to trace the problem and it seems to spend all its time in the
 function xmlOutputBufferWriteEscape.
 I do not need to escape data because I use a base64 encoding.


 You didn't write which version of libxml2 you are using, but there was a
 bug in an older version that could lead to horrible performance when
 serialising character entities.

 Try upgrading your library.

 Sorry, I forgot to specify this. I am using the latest version,
 2.7.2.

So maybe it's a similar bug, but for a different encoding (I think it was
related to the ASCII encoding at the time).

Could you provide the code snippet that you use for serialisation? I.e.
what parameters you pass into what function?

Stefan

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


[xml] Fwd: [lxml-dev] lxml RelaxNG validation on hand-built documents

2008-11-07 Thread Stefan Behnel
Hi,

any idea what might trigger this?

The main API calls we use are:

ctx = xmlRelaxNGNewParserCtxt(filename)
schema = xmlRelaxNGParse(ctx)
xmlRelaxNGFreeParserCtxt(ctx)
...
// create doc
...
vc = xmlRelaxNGNewValidCtxt(schema)
xmlRelaxNGValidateDoc(vc, doc)
xmlRelaxNGFreeValidCtxt(vc)
...
// change doc
// repeat validation with schema

Thanks,

Stefan

 original message 
Subject: [lxml-dev] lxml RelaxNG validation on hand-built documents
Date: Thu, 30 Oct 2008 16:53:33 +0100
From: Atilla [EMAIL PROTECTED]
To: [EMAIL PROTECTED]

I've had a very curious issue that I'm trying to find the cause of.
Basically - if I try to validate a document tree that was dynamically
created by lxml with a RelaxNG schema, the validation step passes
even if there are invalid elements. If I serialize that same tree to a
string and parse it once again, the newly created XML document fails
the validation. Given that I expect to process fairly large trees, I'd
rather not have to copy so much information in memory on every attempt
to validate a document.

Is there any reason why lxml wouldn't validate items that have been
newly created and inserted into the tree, or this is a bug? How would
I make sure a tree is valid, according to a schema, before I
serialized and saved it ?

Basically what i do is:

schema = etree.RelaxNG(file="schema.rng")

doc = etree.fromstring("<valid></valid>")

schema(doc)
True

doc[0].append(etree.Element("invalid"))

schema(doc)
True

schema(etree.fromstring(etree.tostring(doc)))
False

It's really making me think I don't get some point in the whole
validation process. In hindsight - I had the same issues with the Perl
LibXML bindings at some point in the past. Is it maybe libxml-related?

Cheers,

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


[xml] News from the RNC front?

2008-09-11 Thread Stefan Behnel
Hi,

is there actually any news from the RelaxNG compact syntax parser front?
Last I heard, it was considered for inclusion way back when libxml2 2.6
was still young.

I tried to find the original patch and found several mails that mentioned
it, but none that contains it. The bug tracker doesn't seem to have it,
either. I hope it wasn't burned by the Spanish Inquisition?

Stefan

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] libxml2 2.7.1 breaks XML serialisation of HTML trees

2008-09-10 Thread Stefan Behnel
Hi,

Martin (gzlist) wrote:
 On 08/09/2008, Stefan Behnel [EMAIL PROTECTED] wrote:
  there was a change in 2.7.1 (xmlsave.c, ~760) that prevents HTML documents
  from being serialised in XML style...
  ...
  If the current behaviour is wanted, what's the future way of achieving
  this *without* temporarily modifying the document? (i.e. without breaking
  thread concurrency)
 
 I have been eyeing the other 28 bits of xmlSaveOption recently, mostly
 to add a XML_SAVE_XHTML to go counter to the current XML_SAVE_NO_XHTML
 that would unconditionally turn *on* the Appendix C rules without
 needing one of the XHTML 1.0 doctypes.

Sounds fine.


 Some other tweaks like
 XML_SAVE_XHTML_NO_META_CHARSET would perhaps also be good.

Why only for XHTML? The meta entry is either wanted or not, and it changes
the document on output, which is not always desirable. The libxml2 options
should say: "I want it added if it's not there" (which is the current
behaviour anyway) and "I do not want my document modified on output".


 Would an
 XML_SAVE_TEXT_HTML option to do the old sgmlish serialisation answer
 your use case?

Doesn't sound like it. The problem is that I need to distinguish between a
serialisation as well-formed XML and a serialisation in HTML style
*independent* of the type of document. And I also need to do so in a way that
produces the same output across libxml2 versions. I wouldn't mind switching to
a different API based on an #if LIBXML_VERSION ..., but I would still want
to get comparable output. lxml never used the xmlSave* API for exactly that
reason: the output changed heavily across the supported versions.

The change in 2.7.1 broke a whole bunch of doctests for lxml. I fixed some of
those, but users will run into the same problem.

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] libxml2 2.7.1 breaks XML serialisation of HTML trees

2008-09-10 Thread Stefan Behnel
Hi,

Daniel Veillard wrote:
 On Mon, Sep 08, 2008 at 03:01:29PM +0200, Stefan Behnel wrote:
 I now wonder why there are two serialisation methods (xmlNodeDump* and
 htmlNodeDump*) that ultimately do the same thing, instead of serialising
 to what they are named after.
 
   Well the goal is more to get people to use xmlSave* than the old
 xmlNodeDump and htmlNodeDump ones.

lxml uses those two because they (used to) produce the same output across
libxml2 versions. We do most of the output around the actual tree
serialisation by hand (e.g. doctype and XML decl), as there isn't an API that
generates reproducible output across libxml2 versions (we currently support
libxml2 2.6.20 and later). xmlSave*() is particularly bad in that regard, as
the early versions lack a lot of important options, so getting predictable
output across versions is extremely cumbersome.

One of the problems we face is that we try to be compatible with the
ElementTree library as far as possible, so if you do the same operations on
the same input, the output SHOULD look the same, too.


 Options are set at context creation,
 we can add more options and trying to keep the old functions to support the
 same would require way too many entry points.

I agree. Making xmlSave* more usable is perfectly fine with me. However,
breaking functions that do very specific parts of the work is a pretty
negative side-effect.


 If the current behaviour is wanted, what's the future way of achieving
 this *without* temporarily modifying the document? (i.e. without breaking
 thread concurrency)
 
   Hum, sorry, clearly an oversight, I wanted to make xmlsave routines
 HTML aware, which in itself  sounds a good idea, no ?

Absolutely.


 I guess we can use an xmlSave option to force the output to use the
 HTML parser or the XML one and then make sure xmlNodeDump* and
 htmlNodeDump* use them appropriately.

That would fix it, yes. In any case, they should do what their name implies,
without being smarter than necessary.


 Sorry for the breakage, I forgot the old xmlSave* had been remapped to
 the new ones.

That's ok, libxml2 2.7 is young, that happens. lxml's history isn't free of
mistakes either.

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


[xml] libxml2 2.7.1 breaks XML serialisation of HTML trees

2008-09-08 Thread Stefan Behnel
Hi,

there was a change in 2.7.1 (xmlsave.c, ~760) that prevents HTML documents
from being serialised in XML style. That was actually a very convenient
feature in lxml, where you could select between XML and HTML serialisation
of an HTML tree based on a keyword argument.
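
As an illustration of that kind of switch, here is a sketch using the standard library's xml.etree.ElementTree as a stand-in for lxml. It only shows the idea of picking the output style per call, independent of how the tree was built:

```python
import xml.etree.ElementTree as ET

# Build a small HTML-like tree once, then serialise it two ways.
root = ET.fromstring("<p>line one<br/>line two</p>")

xml_out = ET.tostring(root, method="xml")    # well-formed XML output
html_out = ET.tostring(root, method="html")  # HTML-style output

print(xml_out)
print(html_out)
```

The tree itself is not modified by either call, which is the property that matters when several threads serialise the same document.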

I now wonder why there are two serialisation methods (xmlNodeDump* and
htmlNodeDump*) that ultimately do the same thing, instead of serialising
to what they are named after.

If the current behaviour is wanted, what's the future way of achieving
this *without* temporarily modifying the document? (i.e. without breaking
thread concurrency)

Stefan

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Processing information in a buffer to XML-document conversion

2008-09-04 Thread Stefan Behnel
Hi,

Goran Hasse wrote:
 When xmlParseMemory( ... ) is called, a !   xml processing tag
 is inserted in the document.

Definitely not. I think you are mixing this up with the serialisation function
you are using.

Stefan

___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Character reference encoding is slow

2008-08-31 Thread Stefan Behnel
Salut Daniel!

Daniel Veillard wrote:
 [loads of interesting results from the analysis of a pathological case]
   Anyway to make a long story short, I spent a few hours today
 fixing the problem by adding support for a new kind of buffers avoiding
 most of the memmoves needed when handling an encoding conversion
 exception like those. Bill reported that your test now parses and
 save in more or less the same time i.e. 7secs on one of his boxes.

Very cool, that sounds like a very good solution indeed!


 Grab 2.7.0 !

I definitely will, and I'll make sure lxml supports it as well as 2.6.

Thank you for the work you put into libxml2!

Stefan



[xml] Character reference encoding is slow

2008-08-29 Thread Stefan Behnel
Hi,

we got a report on the lxml list where someone tried to parse and
serialise a file that contains 8,000,000 non-ASCII character references
(&#135;), as in

"<text>" + "&#135;" * 8000000 + "</text>"

Parsing this is pretty fast, so that's not the problem, but serialising
this document back to a US-ASCII encoding, i.e. re-encoding the
non-ASCII characters as character references, is slow as hell. The user
stopped the run after 12 hours at 100% CPU load. I tried this with xmllint
and you can literally wait for each byte that arrives in the target file.

Is there any reason why this is so, or does anyone have any insights what
the problem may be here? This definitely sounds like a bug to me.
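
For readers unfamiliar with the operation in question: re-encoding to US-ASCII means that every character outside the ASCII range is written back as a numeric character reference. A small Python sketch of that step, using the standard "xmlcharrefreplace" error handler rather than libxml2 (the helper name is made up):

```python
def to_ascii_xml(text):
    # Characters that US-ASCII cannot represent become &#NNN; references.
    return text.encode("ascii", "xmlcharrefreplace")


# A tiny version of the reported document: five copies of character 135.
payload = "<text>" + chr(135) * 5 + "</text>"
encoded = to_ascii_xml(payload)
print(encoded)
```

This is a linear-time operation per character, which is why the observed slowdown looks like a bug rather than an inherent cost.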

Stefan



Re: [xml] Cleaning the Web - Implementing HTML 5 parsing in libxml2

2008-08-18 Thread Stefan Behnel
Hi,

Karl Dubost wrote:
Nick Kew weighed in and proposed that we should target [6]libxml
which includes an HTML parser and is already supported by Apache
server and many other tools.
 
   [6] http://xmlsoft.org/html/libxml-HTMLparser.html
 
From here it would be interesting to implement HTML 5 parsing
algorithm into libxml2. It would benefit the community as large.

Have you tried joining forces with the people who started the C implementation
of html5lib? Maybe they have ideas to contribute or (partially) working code
that you can look at. It may even happen that you get them convinced of the
project.

In any case, having working implementations in Python and Java should get you
a lot closer to your goal by looking under the hood.

Stefan




Re: [xml] Cleaning the Web - Implementing HTML 5 parsing in libxml2

2008-08-08 Thread Stefan Behnel
Karl Dubost karl at w3.org writes:
 I have written a short document to explain the project [Cleaning the  
 Web][1].
 It describes what is html5 and what would be the benefits of  
 implementing the html 5 parsing algorithm in libxml2 html parser.

There's already an HTML5 implementation in Python (html5lib) which you can use 
together with lxml (so you can benefit from both HTML5 *and* libxml2 already). 
IIRC, there was also a push towards a C implementation, but I'm not sure that
really led anywhere. What's in SVN doesn't look very complete:

http://html5lib.googlecode.com/svn/trunk/c/chtml5lib/

IMHO, it's better to stick with higher level implementations during the 
specification phase, and to push the work on an optimised, low-level C 
implementation back until the target is a bit more focussed. But then, maybe 
that's just me...

I didn't read your proposal, so I'll just assume you meant to extend the 
existing HTML parser instead of writing a new one. That would sound more 
promising than a start from scratch.

Stefan




Re: [xml] Better hash function for dict.c

2008-08-06 Thread Stefan Behnel
Salut Daniel,

Daniel Veillard wrote:
   - the second one is unfortunately not fixeable as is it comes from the
 key hash definitions themselves:
 
 -#define xmlDictComputeKey(dict, name, len) \
 -(((dict)->size == MIN_DICT_SIZE) ?  \
 - xmlDictComputeFastKey(name, len) : \
 - xmlDictComputeBigKey(name, len, len))  \
 -
 -#define xmlDictComputeQKey(dict, prefix, name, len)   \
 -(((prefix) == NULL) ? \
 -  (xmlDictComputeKey(dict, name, len)) :  \
 -  (((dict)->size == MIN_DICT_SIZE) ?  \
 -   xmlDictComputeFastQKey(prefix, name, len) :\
 - xmlDictComputeBigKey(name, len, len)))

Hmm, was that in my patch? Off the top of my head, shouldn't the last line
read

xmlDictComputeBigKey(prefix, -1, xmlDictComputeBigKey(name, len, len))))

or something along those lines? This looks like a copy&paste error to me...

Anyway:

 the problem is that basically if you compute the key for a QName
 as a:b you can get 2 different answers, one if you accessed it using
 a:b directly and hence xmlDictComputeKey() or if using a prefix and
 b name, given the algorithm the key are not the same, and it is a key
 property of the dictionary to always return the same exact pointer for
 the same string. This breaks that property.

True, I didn't know about this property. And the 4-byte-at-once property will
really make this very hard to achieve.

A way I see to fix this is to search the string for the first ':' and always
calculate the hash separately for the part before and after the ':', not
including the ':' itself. That should not break hashing namespace URIs either
(AFAIR, at least the XML namespace gets hashed at some point). Something
like

    int len = strlen(s);
    char* prefix_end = strchr(s, ':');
    if (prefix_end) {
        int prefix_len = prefix_end - s;
        int name_len = len - prefix_len - 1;  /* the ':' itself is skipped */
        h = xmlDictComputeBigKey(s, prefix_len,
                xmlDictComputeBigKey(prefix_end+1, name_len, name_len));
    } else
        h = xmlDictComputeBigKey(s, len, len);

(expect an off-by-1 error somewhere above ;)
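
The same idea as a runnable Python sketch. Here zlib.crc32 merely stands in for a seedable hash such as xmlDictComputeBigKey(data, len, seed), and the two helper names are invented for illustration. The point is only that a QName given as one string and the same QName given as a (prefix, name) pair end up with the same key.

```python
import zlib


def seeded_hash(data, seed=0):
    # Stand-in for a seedable hash; crc32 accepts a starting value.
    return zlib.crc32(data, seed) & 0xFFFFFFFF


def key_for_pair(prefix, name):
    """Hash the local name first, then chain the prefix onto that result."""
    h = seeded_hash(name, len(name))
    if prefix is None:
        return h
    return seeded_hash(prefix, h)


def key_for_string(s):
    """Split at the first ':' so that b"a:b" hashes like (b"a", b"b")."""
    prefix, sep, name = s.partition(b":")
    if not sep:
        return key_for_pair(None, s)
    return key_for_pair(prefix, name)


print(key_for_string(b"a:b") == key_for_pair(b"a", b"b"))
```

Slicing at the separator sidesteps the length arithmetic, which is where an off-by-one would otherwise creep in.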

Stefan


Re: [xml] Better hash function for dict.c

2008-08-06 Thread Stefan Behnel
Daniel Veillard wrote:
   Another option I looked at is the 'One-at-a-Time Hash' from
 http://burtleburtle.net/bob/hash/doobs.html , looking at the criterias
 and the results it looks like a good hash too, not too expensive and
 should work well.

The page says it's pretty good when inlined, which should be doable in
libxml2's case. Plus, you can pass a previous hash as initial hash value,
so incremental hashing will work. And you can avoid strlen() by changing
the for loop into a while (*c != '\0') loop (no idea if that's really
faster, C compilers have weird optimisations these days, but I find it
worth mentioning).
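
A minimal Python sketch of the one-at-a-time hash as described on Bob Jenkins' page, assuming 32-bit arithmetic. It is split into an update step and a final mixing step so that a previous state can be passed in as the initial value; the function names are made up, and this is not the libxml2 code.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic


def oaat_update(data, state=0):
    """Mix the bytes of `data` into `state` (the per-byte inner loop)."""
    h = state
    for byte in data:
        h = (h + byte) & MASK
        h = (h + (h << 10)) & MASK
        h ^= h >> 6
    return h


def oaat_final(state):
    """Final mixing step, applied once after all bytes were mixed in."""
    h = state
    h = (h + (h << 3)) & MASK
    h ^= h >> 11
    h = (h + (h << 15)) & MASK
    return h


def oaat(data):
    return oaat_final(oaat_update(data))


# Hashing "a:b" in one go or in two chained pieces gives the same key.
whole = oaat(b"a:b")
parts = oaat_final(oaat_update(b"b", oaat_update(b"a:")))
print(whole == parts)
```

Because the state is carried from byte to byte, chaining over pieces gives the same result as hashing the concatenated string, as long as the final mixing is applied only once at the end.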


 I will try to make a patch using this this morning,
 if you have a bit of time then, maybe you can rerun your initial tests
 with that one, is that possible ?

I can try, sure. Just send me a patch that removes the current hash
function from SVN and adds the new one, and I will find a way to compare
the two.

Stefan



Re: [xml] Better hash function for dict.c

2008-08-06 Thread Stefan Behnel

Stefan Behnel wrote:
 Daniel Veillard wrote:
 if you have a bit of time then, maybe you can rerun your initial tests
 with that one, is that possible ?
 
 I can try, sure. Just send me a patch that removes the current hash
 function from SVN and adds the new one, and I will find a way to compare
 the two.

Here's a little test script that runs xmllint --noout on a generated XML
file with varying numbers of distinct tag names, together with the numbers I
get. It looks like the new hash is a little slower than the one from my
original patch. At least, I get slightly lower throughput, but it's less than
10% difference throughout, so I guess it's within the usual margin. This is
likely due to the 4-byte reads of the other hash.

The distribution seems to be about comparable, and the timings stay more or
less constant over the range I tested (up to 1000 entries). Even with 2000
entries in the dict, the timings are only 15% lower than with 8, so I would
say this hash works just as well as the other one.

I did a quick check with lxml's benchmarks and they give me comparable
results: slightly slower, but about the same behavioural improvement.

Given that the new hash gives correct results, which the other one didn't, I'm
fine with the change. The price is definitely low enough.

Stefan

import sys, os
from time import time

def gen_xml(tag_count):
    data = ["<root>"]

    append = data.append
    for i in range(10):
        # (tag markup of this template is missing from the archived message)
        append( "a%04dbc /" % (i%tag_count) )

    data.append("</root>")
    return '\n '.join(data)

for i in range(8, 2001, 8):
    xml = gen_xml(i)
    f = open("file.xml", 'w')
    f.write(xml)
    f.close()

    # time one xmllint run over the generated file
    t = time()
    os.system("xmllint --noout file.xml")
    t = time() - t

    print("%4d %5.2f %8.1f" % (i, t, len(xml)/(t*1000.0)))

benchmarking libxml2 2.6.32: xmllint --noout file.xml

Columns:

 #distinct tags | seconds | kbytes/sec

  original hash             one-at-a-time hash

   8  0.10  13292.6    8  0.10  13647.7
  16  0.10  12943.7   16  0.10  13352.1
  24  0.10  12748.5   24  0.10  12856.2
  32  0.11  12300.0   32  0.10  12623.1
  40  0.11  12344.8   40  0.10  12461.8
  48  0.11  12060.3   48  0.11  12246.8
  56  0.12  10967.4   56  0.10  12956.2
  64  0.12  10807.2   64  0.10  13011.0
  72  0.12  10446.1   72  0.10  12779.8
  80  0.12  10824.6   80  0.10  12727.4
  88  0.12  10601.9   88  0.10  12812.4
  96  0.12  10632.7   96  0.10  12926.0
 104  0.13  10398.9  104  0.10  12909.6
 112  0.12  10528.6  112  0.10  13047.2
 120  0.13  10389.8  120  0.10  12824.4
 128  0.13  10249.2  128  0.10  12976.6
 136  0.13  10191.4  136  0.11  12141.6
 144  0.13   9958.5  144  0.10  12692.1
 152  0.13   9672.7  152  0.10  12806.0
 160  0.13   9740.6  160  0.10  12735.6
 168  0.13   9680.3  168  0.10  12705.4
 176  0.13   9664.8  176  0.11  12009.0
 184  0.14   9061.4  184  0.10  12624.2
 192  0.14   9558.7  192  0.10  12779.9
 200  0.14   9223.1  200  0.10  12772.7
 208  0.14   9409.3  208  0.10  12506.9
 216  0.14   9238.8  216  0.10  12461.3
 224  0.14   9239.2  224  0.10  12691.9
 232  0.14   8991.2  232  0.10  12540.5
 240  0.14   8981.5  240  0.10  12658.3
 248  0.15   8875.0  248  0.10  12621.1
 256  0.15   8813.3  256  0.10  12754.4
 264  0.15   8719.4  264  0.10  12575.4
 272  0.15   8638.8  272  0.10  12598.2
 280  0.15   8587.4  280  0.10  12493.8
 288  0.15   8594.9  288  0.10  12773.3
 296  0.15   8531.2  296  0.10  12630.3
 304  0.15   8559.2  304  0.10  12668.1
 312  0.15   8439.1  312  0.10  12836.0
 320  0.16   8354.5  320  0.10  12563.9
 328  0.16   8177.6  328  0.10  12526.3
 336  0.16   8154.6  336  0.10  12559.2
 344  0.16   8135.8  344  0.10  12629.0
 352  0.16   8014.6  352  0.10  12608.3
 360  0.17   7791.0  360  0.10  12762.5
 368  0.16   7905.4  368  0.10  12736.4
 376  0.17   7817.3  376  0.10  12472.8
 384  0.17   7752.3  384  0.10  12526.0
 392  0.17   7739.4  392  0.10  12611.3
 400  0.17   7762.5  400  0.10  12619.7
 408  0.17   7688.5  408  0.10  12530.9
 416  0.17   7625.4  416  0.10  12606.7
 424  0.17   7526.3  424  0.10  12463.5
 432  0.17   7443.6  432  0.10  12702.9
 440  0.18   7151.1  440  0.10  12605.0
 448  0.18   7321.2  448  0.10  12436.2
 456  0.18   7213.5  456  0.11  12277.0
 464  0.18   7141.5  464  0.11  11654.0
 472  0.18   7109.2  472  0.10  12552.4
 480  0.18   7118.3  480  0.10  12437.0
 488  0.18   7078.0  488  0.10  12632.3
 496  0.18   7095.4  496  0.10  12657.4
 504  0.18   7039.9  504  0.10  12609.9
 512  0.19   6983.3  512  0.10  12565.8
 520  0.19   6897.1  520  0.11  11936.7
 528  0.20   6585.0  528  0.10  12668.5
 536  0.19   6765.7  536  0.10  12682.3
 544  0.19   6781.3  544  0.10  12434.1
 552  0.20   6470.6  552

Re: [xml] enabling zlib support in Stephane Bidoul's Python binding? (win32)

2008-07-28 Thread Stefan Behnel
Meunier, Jean-Luc wrote:
 On win32, I'm interested in having the zlib support in libxml2 from
 Python.

If zlib support refers to parsing from zlib compressed XML files, lxml
will let you do that.

http://codespeak.net/lxml/

If you really want it enabled in a binary build of the original libxml2
Python bindings where it is not currently enabled, you'd best bug the
provider of the binary build himself.

Stefan



Re: [xml] Problems with schema validation

2008-07-18 Thread Stefan Behnel
Hi,

Robert Schweikert wrote:
 Hi I am trying to validate xsd files and am running into a problem. I
 have a negative test, i.e. a file I know is invalid, yet it passes
 validation.
 
 I used the example code from
 http://wiki.njh.eu/XML-Schema_validation_with_libxml2 and wrote a short
 main routine for the test.
 
 I downloaded the http://www.w3.org/2001/XMLSchema.xsd file to my machine
 to avoid any network issues.
 
 So here is the file I am validating and I expect this to fail:
 
 <?xml version="1.0" encoding="utf-8"?>
 <!-- Invalid Schema definition used in the XML tools tests (no flip
 type)-->
 <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
   <xsd:element name="person" type="person"/>
   <xsd:complexType name="person">
     <xsd:sequence>
       <xsd:element name="first" type="xsd:flip"/>
       <xsd:element name="last" type="xsd:string"/>
       <xsd:element name="dob" type="xsd:date"/>
       <xsd:element name="phone" type="xsd:string"/>
     </xsd:sequence>
   </xsd:complexType>
 </xsd:schema>
 
 XSD defines no flip type, thus the validator should complain. I have
 libxml2.so.2.6.27 on my Debian 4 box.

It passes for me in 2.6.32 as well. However, the W3C schema validator only
complains when I tick the option "check as complete schema"; otherwise, it
passes without complaint. Still, libxml2 might be better off treating every
schema it parses as a complete one.

Does it complain when you try to validate XML documents with this schema?

Stefan



Re: [xml] Parsing from a compressed string

2008-07-15 Thread Stefan Behnel
Daniel Veillard wrote:
 On Sun, Jul 13, 2008 at 07:44:12AM +0200, Stefan Behnel wrote:
 Hi,

 it seems that libxml2 can parse zlib compressed data from files. What
 would be
 the right way to parse compressed data from a string in memory? And,
 yes, I
 want to avoid unpacking it before I parse it.

   The simplest I can see is use
http://xmlsoft.org/html/libxml-parser.html#xmlReadIO
 and providing ioread/ioclose/ioctxt to be compressed read/close and
 buffer arguments.

 Same question for serialisation? Is there anything like a compressing
 OutputBuffer?

   No, but similarly you can use xmlSaveToIO
http://xmlsoft.org/html/libxml-xmlsave.html#xmlSaveToIO
 with a compressed write and close.

   The hardest part is probably debugging the provided compressed handlers
 around the edge cases, but it really should not be much code if using
 existing compression APIs.

Thanks, that's what I thought, too. It would be nice if libxml2 provided
an API function that just set everything up correctly. Because these
things tend to get pretty hairy when you get into the details.
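
As an illustration of the callback approach (not libxml2 code), here is a Python sketch using only the standard library: a decompressing reader is handed to the parser, which pulls data through read() calls, and a compressing writer is handed to the serialiser. The gzip format and the sample document are assumptions made for the example.

```python
import gzip
import io
import xml.etree.ElementTree as ET

compressed = gzip.compress(b"<root><item>1</item></root>")

# Parsing: the parser pulls decompressed bytes through read() calls,
# so the document is never unpacked into one big string first.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as reader:
    tree = ET.parse(reader)

# Serialising: the writer compresses whatever the serialiser hands over.
out = io.BytesIO()
with gzip.GzipFile(fileobj=out, mode="wb") as writer:
    tree.write(writer, encoding="utf-8")

round_trip = gzip.decompress(out.getvalue())
print(tree.getroot().tag, round_trip)
```

In C, the read and close callbacks passed to xmlReadIO would play the role of the reader object, and the write and close callbacks passed to xmlSaveToIO the role of the writer.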

I was hoping that someone would post some existing code, but since there
were no other responses so far...

Stefan



[xml] Parsing from a compressed string

2008-07-12 Thread Stefan Behnel
Hi,

it seems that libxml2 can parse zlib compressed data from files. What would be
the right way to parse compressed data from a string in memory? And, yes, I
want to avoid unpacking it before I parse it.

Same question for serialisation? Is there anything like a compressing
OutputBuffer?

Thanks,
Stefan



[xml] docs and dicts in xmlSetTreeDoc()

2008-05-02 Thread Stefan Behnel
Hi,

I've just fixed a long-standing problem in lxml, now I'm wondering if it isn't
actually a problem in libxml2. The function xmlSetTreeDoc() in tree.c is
called to update the xmlDoc* pointers of each node in a subtree when it gets
appended to a new parent in a different document. The question is: should this
function also re-assign the name pointers of the nodes if both documents use
a dict and the dictionary of the target document is different from the
dictionary of the source tree?

The decision is easy to take (compare the dict pointers of both documents) and
the code to re-assign the name is simple: call xmlDictLookup() and re-assign
the name of the node to the result. In addition, a call to
xmlDictOwns(old_dict, old_name) might be necessary to see if the old name must
be freed.
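
A toy Python model of the proposed behaviour, purely for illustration. NameDict, Node and set_tree_dict are invented names that only mimic the roles of the dictionary, the tree node and xmlSetTreeDoc(); memory ownership, which is the actual concern in C, is reduced here to an identity check.

```python
class NameDict:
    """Toy interning table: equal names map to one shared object."""

    def __init__(self):
        self._names = {}

    def lookup(self, name):
        # Return the stored object for `name`, adding it on first use.
        return self._names.setdefault(name, name)

    def owns(self, name):
        # True only if this exact object is the one stored in the table.
        return self._names.get(name) is name


class Node:
    def __init__(self, name, names):
        self.names = names              # dictionary of the owning document
        self.name = names.lookup(name)
        self.children = []


def set_tree_dict(node, target):
    """Re-intern the names of a whole subtree in the target dictionary."""
    if node.names is not target:
        node.name = target.lookup(node.name)
        node.names = target
    for child in node.children:
        set_tree_dict(child, target)


source, target = NameDict(), NameDict()
root = Node("item", source)
root.children.append(Node("entry", source))

set_tree_dict(root, target)
print(target.owns(root.name), target.owns(root.children[0].name))
```

After the move, every name in the subtree is owned by the target dictionary, so the source document and its dictionary can go away without leaving dangling names behind.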

Would this be considered worth changing? Or are there any reasons not to do
this? There obviously is a performance impact, but I consider it the correct
thing to do if both documents are meant to be independent afterwards.

Stefan


Re: [xml] Better hash function for dict.c

2008-04-20 Thread Stefan Behnel
Hi again (and sorry for all the noise),

Stefan Behnel wrote:
 Whether an application benefits from a different hash function depends on the
 vocabulary it uses in its XML files. A slow but well distributing hash
 function performs much better for large vocabularies (or many different
 vocabularies), while small vocabularies will not fill the dict enough to make
 a difference, in which case the faster hash function wins.

So the obvious solution is to combine the two. Here is a patch that uses the
original hash function to start with (but lowers the bucket fill limit a
little from 4 down to 3) and when it reaches the grow barrier for the first
time, switches to the new hash function. You will find a performance
comparison below, based on xmllint.

I decreased the bucket fill barrier for two reasons: to trigger an early
switch between the two functions, and because the second function has much
better load balancing, so a high bucket size in one place really means that
most buckets are at least close to that fill rate. As you can see from the
numbers, it works pretty well over a wide range of vocabulary sizes from small
to medium, and as I've shown before, it performs much (much!) better for
larger sizes.
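
The switching logic can be sketched in Python as follows. This is a simplified model under stated assumptions, not the patch: the additive fast_key and the crc32-based big_key are stand-ins for the two real hash functions, and doubling the table on growth is an arbitrary choice made for the example.

```python
import zlib

MIN_DICT_SIZE = 128
MAX_HASH_LEN = 3  # longest bucket chain tolerated before the table grows


def fast_key(name):
    # Cheap additive hash, standing in for the original function.
    return sum(name) + len(name)


def big_key(name):
    # Better distributing hash; crc32 is only a stand-in here.
    return zlib.crc32(name)


class SwitchingDict:
    def __init__(self):
        self.size = MIN_DICT_SIZE
        self.buckets = [[] for _ in range(self.size)]

    def _key(self, name):
        # The cheap hash is used only while the table has its initial size.
        hash_fn = fast_key if self.size == MIN_DICT_SIZE else big_key
        return hash_fn(name) % self.size

    def _grow(self):
        names = [n for bucket in self.buckets for n in bucket]
        self.size *= 2
        self.buckets = [[] for _ in range(self.size)]
        for n in names:  # rehash everything with the hash now in effect
            self.buckets[self._key(n)].append(n)

    def lookup(self, name):
        bucket = self.buckets[self._key(name)]
        for known in bucket:
            if known == name:
                return known
        bucket.append(name)
        if len(bucket) > MAX_HASH_LEN:
            self._grow()
        return name


d = SwitchingDict()
first = [d.lookup(b"tag%04d" % i) for i in range(1000)]
print(d.size, all(d.lookup(b"tag%04d" % i) is first[i] for i in range(1000)))
```

Small vocabularies never leave the initial table and keep paying only for the cheap hash, while larger ones switch once and are rehashed with the better distributing function.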

BTW, Bob Jenkins did a comparison of a couple of hash functions, including the
additive hash (a variant of which is currently used) and the hash function
used in the patch.

http://burtleburtle.net/bob/hash/doobs.html

The hash function itself was written by Paul Hsieh and published on his web
site. According to Bob Jenkins, it's public domain (although I didn't ask
directly).

http://www.azillionmonkeys.com/qed/hash.html

Any objections to getting this patch merged?

Stefan


# original hash function, increasing number of tag names
 5: 100 iterations took 720 ms
10: 100 iterations took 720 ms
15: 100 iterations took 718 ms
20: 100 iterations took 724 ms
25: 100 iterations took 739 ms
30: 100 iterations took 735 ms
35: 100 iterations took 750 ms
40: 100 iterations took 748 ms
45: 100 iterations took 760 ms
50: 100 iterations took 766 ms
55: 100 iterations took 775 ms
60: 100 iterations took 778 ms
65: 100 iterations took 789 ms
70: 100 iterations took 782 ms
75: 100 iterations took 840 ms
80: 100 iterations took 813 ms
85: 100 iterations took 829 ms
90: 100 iterations took 821 ms
95: 100 iterations took 822 ms

# combined hash functions, increasing number of tag names
 5: 100 iterations took 725 ms
10: 100 iterations took 718 ms
15: 100 iterations took 723 ms
20: 100 iterations took 717 ms
25: 100 iterations took 742 ms
30: 100 iterations took 764 ms
35: 100 iterations took 773 ms
40: 100 iterations took 743 ms
45: 100 iterations took 762 ms
50: 100 iterations took 766 ms
55: 100 iterations took 768 ms
60: 100 iterations took 778 ms
65: 100 iterations took 741 ms
70: 100 iterations took 757 ms
75: 100 iterations took 742 ms
80: 100 iterations took 743 ms
85: 100 iterations took 757 ms
90: 100 iterations took 741 ms
95: 100 iterations took 743 ms

--- libxml2-2.6.32-orig/dict.c	2008-02-08 10:52:13.0 +0100
+++ libxml2-2.6.32/dict.c	2008-04-20 10:30:00.0 +0200
@@ -20,16 +20,30 @@
 #include "libxml.h"
 
 #include <string.h>
+#include <stdint.h>
 #include <libxml/tree.h>
 #include <libxml/dict.h>
 #include <libxml/xmlmemory.h>
 #include <libxml/xmlerror.h>
 #include <libxml/globals.h>
 
-#define MAX_HASH_LEN 4
+#define MAX_HASH_LEN 3
 #define MIN_DICT_SIZE 128
 #define MAX_DICT_HASH 8 * 2048
 
+#define xmlDictComputeKey(dict, name, len)                      \
+(((dict)->size == MIN_DICT_SIZE) ?                              \
+ xmlDictComputeFastKey(name, len) :                             \
+ xmlDictComputeBigKey(name, len, len))                          \
+
+#define xmlDictComputeQKey(dict, prefix, name, len)             \
+(((prefix) == NULL) ?                                           \
+  (xmlDictComputeKey(dict, name, len)) :                        \
+  (((dict)->size == MIN_DICT_SIZE) ?                            \
+   xmlDictComputeFastQKey(prefix, name, len) :                  \
+   xmlDictComputeBigKey(prefix, xmlStrlen(prefix),              \
+                        xmlDictComputeBigKey(name, len, len))))
+
 /* #define ALLOW_REMOVAL */
 /* #define DEBUG_GROW */
 
@@ -223,11 +237,80 @@
 }
 
 /*
- * xmlDictComputeKey:
- * Calculate the hash key
+ * xmlDictComputeBigKey:
+ *
+ * Calculate a hash key using a good hash function that works well for
+ * larger hash table sizes.
+ *
+ * Hash function by Paul Hsieh, see
+ * http://www.azillionmonkeys.com/qed/hash.html
+ * http://burtleburtle.net/bob/hash/doobs.html
+ */
+#undef get16bits
+#if (defined(__GNUC__) && defined(__i386__)) || defined(__WATCOMC__) \
+  || defined(_MSC_VER) || defined (__BORLANDC__) || defined (__TURBOC__)
+#define get16bits(d) (*((const uint16_t *) (d)))
+#endif
+
+#if !defined (get16bits)
+#define get16bits(d) ((((uint32_t)(((const uint8_t *)(d))[1])) << 8)\
+                       +(uint32_t)(((const uint8_t *)(d))[0]) )
+#endif
+
+static uint32_t
+xmlDictComputeBigKey(const xmlChar* data, int len, uint32_t hash) {
+    uint32_t tmp;
+    int rem;
+
+    if (len <= 0 || data == NULL

Re: [xml] Better hash function for dict.c

2008-04-19 Thread Stefan Behnel
Hi,

Daniel Veillard wrote:
 On Thu, Apr 17, 2008 at 10:05:03AM -0400, Daniel Veillard wrote:
   Since you seems to be interested in the performances of the hash 
 algorithm, I tried to drop the string comparisons on lookup when possible
 I have an old patch for this which I'm enclosing, but I never applied it
 since I had problems at the time (can't remember why/where, it's just 
 a FYI patch ;-)
 
   Of course I forgot the patch in the first mail ...

Hmm, wait, this patch is against hash.c, which already contains it (more or
less). I was talking about the hash function in dict.c (see subject line).

Stefan


Re: [xml] Better hash function for dict.c

2008-04-19 Thread Stefan Behnel
Hi,

Daniel Veillard wrote:
 On Wed, Apr 16, 2008 at 10:53:04PM +0200, Stefan Behnel wrote:
   I would prefer to see benchmarks done with xmllint directly, to avoid
 side effect of more string interning than libxml2.

Ok, I did some testing with xmllint. I noticed that things can easily get
slower with the new hash function because the hash table doesn't always grow
enough to make a difference, but the now more expensive hash function still
takes its bite.

Whether an application benefits from a different hash function depends on the
vocabulary it uses in its XML files. A slow but well distributing hash
function performs much better for large vocabularies (or many different
vocabularies), while small vocabularies will not fill the dict enough to make
a difference, in which case the faster hash function wins.

I ran a very simple test that parses generated XML with an increasing number
of different tag names:

-
def gen_xml(tag_count):
    data = ["<root>"]

    append = data.append
    for i in range(10):
        # (tag markup of this template is missing from the archived message)
        append( "a%04dbc /" % (i%tag_count) )

    data.append("</root>")
    return '\n '.join(data)
-

The result is attached. It shows that even for relatively small vocabularies
of 16/24/32 different names (tags/attributes in real life), the performance of
the current hash function degrades visibly, while the new hash function shows
more or less constant performance. Since the new hash function is generally a
bit slower, the impact depends on the input. In the case generated above, the
break-even seems to be already at about 10 tag names for me, while other input
(like a%dbc / tags) appears to need 40 distinct tag names before the new
hash runs faster, although it reaches parity around 20 names.

The thing is, I have no idea what a common size for an XML vocabulary is. If
it's smaller than, say, 40 names, it may not be worth the effort. Especially
trivial XML files (like log files) will definitely suffer a couple of percent
in performance. If we expect to parse larger vocabularies, possibly with many
attributes, we will see a quickly growing gain. Whether it's worth it in that
case depends on the size of the input file.

What is clear, though, is that the way lxml uses libxml2's dict becomes
absolutely viable using the patch, even with a lot of different XML languages.
Even 1000 different names do not make a big difference in performance, and
there might still be space left for fine tuning, e.g. for the grow trigger.

Other opinions?

Stefan

Columns:
#tags   secs   kbytes/sec

libxml2 2.6.32           libxml2 patched
--------------           ---------------
   8: 0.09 15055.5          8: 0.09 14606.8
  16: 0.09 14757.8         16: 0.09 14771.0
  24: 0.09 14476.5         24: 0.09 14660.8
  32: 0.09 14265.2         32: 0.09 14538.9
  40: 0.09 14062.0         40: 0.09 14397.4
  48: 0.09 13719.2         48: 0.09 14285.6
  56: 0.10 13634.1         56: 0.09 14264.9
  64: 0.10 13592.1         64: 0.09 14231.8
  72: 0.10 13515.5         72: 0.09 14127.5
  80: 0.10 13410.1         80: 0.09 14236.3
  88: 0.10 13149.5         88: 0.09 13967.7
  96: 0.10 13103.8         96: 0.09 14149.5
 104: 0.10 12961.0        104: 0.09 14156.0
 112: 0.10 12985.6        112: 0.09 14257.7
 120: 0.10 12658.8        120: 0.09 14036.8
 128: 0.10 12583.0        128: 0.09 14058.3
 136: 0.10 12397.3        136: 0.09 14097.3
 144: 0.11 12026.4        144: 0.09 14007.1
 152: 0.11 12086.8        152: 0.09 14130.3
 160: 0.11 11948.8        160: 0.09 14166.2
 168: 0.11 11810.5        168: 0.09 14113.8
 176: 0.11 11648.2        176: 0.09 14034.2
 184: 0.11 11598.2        184: 0.09 13940.9
 192: 0.11 11509.1        192: 0.09 14002.9
 200: 0.11 11424.7        200: 0.09 13889.6
 208: 0.11 11432.6        208: 0.09 13991.7
 216: 0.12 11230.0        216: 0.09 13784.6
 224: 0.12 11090.5        224: 0.10 13621.6
 232: 0.12 10968.4        232: 0.09 13765.9
 240: 0.12 10803.5        240: 0.09 13864.9
 248: 0.12 10690.2        248: 0.09 13702.5
 256: 0.12 10551.1        256: 0.10 13532.4
 264: 0.13 10338.8        264: 0.09 13753.4
 272: 0.13 10276.1        272: 0.09 13810.1
 280: 0.13 10143.5        280: 0.09 14427.2
 288: 0.13 10263.5        288: 0.09 14405.1
 296: 0.13  9775.8        296: 0.09 14554.4
 304: 0.13 10202.5        304: 0.09 14456.3
 312: 0.13  9976.2        312: 0.09 14572.2
 320: 0.13  9833.8        320: 0.09 14635.0
 328: 0.13  9788.4        328: 0.09 14359.9
 336: 0.13  9659.6        336: 0.09 14559.8
 344: 0.14  9252.6        344: 0.09 14346.6
 352: 0.14  9408.0        352: 0.09 14596.4
 360: 0.14  9351.7        360: 0.09 14542.5
 368: 0.14  9120.5        368: 0.09 14587.1
 376: 0.14  9174.9        376: 0.09 14572.1
 384: 0.14  9223.4        384: 0.09 14560.8
 392: 0.14  9208.3        392: 0.09 14376.9
 400: 0.14  9211.1        400: 0.09 14529.2
 408: 0.14  9008.5        408: 0.09 14527.3
 416: 0.14  9029.0        416: 0.09 14525.9
 424: 0.15  8796.6        424: 0.09 14504.9
 432: 0.15  8680.1        432: 0.09 14471.8
 440: 0.15  8666.3

Re: [xml] Better hash function for dict.c

2008-04-18 Thread Stefan Behnel
Hi Daniel,

Daniel Veillard wrote:
 On Thu, Apr 17, 2008 at 10:05:03AM -0400, Daniel Veillard wrote:
   Since you seems to be interested in the performances of the hash 
 algorithm, I tried to drop the string comparisons on lookup when possible
 I have an old patch for this which I'm enclosing, but I never applied it
 since I had problems at the time (can't remember why/where, it's just 
 a FYI patch ;-)

only 6 out of 17 hunks of this patch apply against 2.6.32. Also, it does not
seem to know about the GCC optimisation that uses memcmp() instead.

A bit of work to get that working again ...

Stefan



[xml] Better hash function for dict.c

2008-04-17 Thread Stefan Behnel
Hi,

long mail, bottom line being: 30% to multiple times faster parsing with a
different hash function in dict.c. Keep reading.

I did some profiling on lxml using callgrind. lxml configures the parser to
use a global per-thread dictionary, so everything it parses ends up in one
dictionary instead of tons of calls to malloc(). That by itself is quite an
impressive performance improvement.

What I found in callgrind was that a huge part of the overall time is spent in
xmlDictLookup. And that this time is almost completely wasted on calls to
memcmp(), i.e. in walking through the content of one hash bucket and comparing
strings.

Next, I did some testing on the current hash function (against
/usr/share/dict/words). For small dict sizes (64,128,256), the distribution is
fine. However, it seems that the distribution degrades with increasing size of
the hash table. Given a dict size of 8192, for example, the words file leads
to 80% of the buckets being larger than 4, but with an almost linearly
increasing distribution. To make things worse, a larger dict size is
encouraged by the size maintenance algorithm, which increases the size when it
has to look through more than 4 values sequentially (if I understood that
correctly). According to the profiling data, there appears to be a 50% chance
of having to grow the dictionary when a new entry gets added to the dict in
xmlDictLookup().
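
A distribution check of this kind can be sketched in a few lines of Python. The additive hash below is only a rough stand-in for the 2.6.32 function, crc32 is a stand-in for a better distributing hash, and the generated word list replaces /usr/share/dict/words so that the example runs anywhere; all names are made up for illustration.

```python
from collections import Counter
import zlib


def additive_key(word):
    # Rough stand-in for an additive hash over the bytes of the key.
    return sum(word) + len(word)


def bucket_fill(words, table_size, hash_fn):
    """Return a Counter mapping chain length -> number of buckets."""
    chains = Counter(hash_fn(w) % table_size for w in words)
    fill = Counter(chains.values())
    fill[0] = table_size - len(chains)  # buckets that stayed empty
    return fill


def share_over(fill, limit):
    """Fraction of non-empty buckets whose chain is longer than `limit`."""
    used = sum(n for length, n in fill.items() if length > 0)
    long_chains = sum(n for length, n in fill.items() if length > limit)
    return long_chains / used if used else 0.0


words = [b"word%d" % i for i in range(20000)]
for label, fn in (("additive", additive_key), ("crc32", zlib.crc32)):
    fill = bucket_fill(words, 8192, fn)
    print(label, round(share_over(fill, 4), 2))
```

The second number printed per hash is the share of used buckets that exceed the chain limit of 4, i.e. the buckets that would trigger a grow check on insertion.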

A little web search offers a couple of completely unreadable hash functions
that come with some obviously biased benchmarks. :) However, this one

http://www.azillionmonkeys.com/qed/hash.html

is quite short and readable and seems to do what I was looking for. In a (non
representative) benchmark, I get a speedup of a factor of 7 in lxml when
parsing from memory (bench_XML benchmark in lxml's bench_etree.py suite).
That's from over 300 msecs down to 40!

However, that benchmark uses generated tag names, so it's not necessarily
representative of an average XML file. More realistic benchmarks, like
parsing and XPath searching an XML file containing the Old Testament, run
between 2% and 30% faster. Also, the number of cases where the dictionary size
is checked for an increase (i.e. where there were more than 4 entries in a
bucket) drops quite dramatically according to callgrind, from 49% to below 2%.
Hmmm, I have difficulties in believing those numbers myself...

I attached a patch against libxml2 2.6.32 that replaces the current
xmlDictComputeKey() implementation with the SuperFastHash() function from the
web page.

Note that this is not a "production ready" patch, not even a "ready for
inclusion" patch. It is just meant to let others give it a try to see if they
get similar results. The problem with hash functions is that they may work
great for some input but worse for others. This hash function is slower than
the current one, actually a lot slower. But the achieved hash distribution
seems to be so much better that it wins the contest by a long way. And there
may still be space left for improvements before the final inclusion.

So I would really like to get some feedback from others.

Stefan

--- dict.c.ORIG	2008-04-16 21:59:19.0 +0200
+++ dict.c	2008-04-16 20:56:16.0 +0200
@@ -222,34 +222,72 @@
 return(ret);
 }
 
+
+#include <stdint.h>
+#undef get16bits
+#if (defined(__GNUC__) && defined(__i386__)) || defined(__WATCOMC__) \
+  || defined(_MSC_VER) || defined (__BORLANDC__) || defined (__TURBOC__)
+#define get16bits(d) (*((const uint16_t *) (d)))
+#endif
+
+#if !defined (get16bits)
+#define get16bits(d) ((((uint32_t)(((const uint8_t *)(d))[1])) << 8)\
+                       +(uint32_t)(((const uint8_t *)(d))[0]) )
+#endif
+
+static uint32_t
+SuperFastHash (const xmlChar* data, int len, uint32_t hash) {
+    uint32_t tmp;
+    int rem;
+
+    if (len <= 0 || data == NULL) return hash;
+
+    rem = len & 3;
+    len >>= 2;
+
+    /* Main loop */
+    for (;len > 0; len--) {
+        hash  += get16bits (data);
+        tmp    = (get16bits (data+2) << 11) ^ hash;
+        hash   = (hash << 16) ^ tmp;
+        data  += 2*sizeof (uint16_t);
+        hash  += hash >> 11;
+    }
+
+    /* Handle end cases */
+    switch (rem) {
+        case 3: hash += get16bits (data);
+                hash ^= hash << 16;
+                hash ^= data[sizeof (uint16_t)] << 18;
+                hash += hash >> 11;
+                break;
+        case 2: hash += get16bits (data);
+                hash ^= hash << 11;
+                hash += hash >> 17;
+                break;
+        case 1: hash += *data;
+                hash ^= hash << 10;
+                hash += hash >> 1;
+    }
+
+    /* Force "avalanching" of final 127 bits */
+    hash ^= hash << 3;
+    hash += hash >> 5;
+    hash ^= hash << 4;
+    hash += hash >> 17;
+    hash ^= hash << 25;
+    hash += hash >> 6;
+
+    return hash;
+}
+
 /*
  * xmlDictComputeKey:
  * Calculate the hash key
  */
 static unsigned long
 xmlDictComputeKey(const xmlChar *name, int namelen) {
-unsigned long value = 0L;
-
-if (name 

Re: [xml] Better hash function for dict.c

2008-04-17 Thread Stefan Behnel
Hi,

Stefan Behnel wrote:
 http://www.azillionmonkeys.com/qed/hash.html
 is quite short and readable and seems to do what I was looking for.

Some more real-world numbers. I used lxml to parse - using xmlCtxtParseFile()
- all .xml and .xsd files that locate found on my hard disc, some 58000
files, 218MB in total (I should really run a cleanup session when I find the
time...).

This took 14.62 seconds with the unchanged libxml2 2.6.32. The patched version
did this in 14.19 seconds, so that's 3% faster. Not much, but it's
reproducible, so it's not just noise. And note that there is a lot of OS
interaction involved, even though it parses from the disk cache. I also did a
test parsing the same files from memory and that takes the unpatched version
9.87 seconds (22MB/s) and the patched one 9.19 seconds (23.8MB/s), so that's
7% better. (BTW, libxml2 is lightning fast already, thanks to everyone who
made this possible - Daniel above all, I assume).
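
For anyone who wants to check the arithmetic, here is a small Python sketch
that recomputes the percentages and the throughput from the timings quoted
above. It assumes the rounded total of 218MB, so the throughput may differ
from the quoted figures in the last digit.

```python
def speedup_percent(before_s, after_s):
    """Relative reduction in run time, in percent."""
    return (before_s - after_s) / before_s * 100.0


def throughput_mb_per_s(total_mb, seconds):
    """Average parse throughput in MB per second."""
    return total_mb / seconds


TOTAL_MB = 218  # rounded total size of the parsed files

# (label, seconds unpatched, seconds patched), taken from the text above
RUNS = [
    ("from files", 14.62, 14.19),
    ("from memory", 9.87, 9.19),
]

for label, before, after in RUNS:
    print("%-12s %.1f%% faster, %.1f -> %.1f MB/s" % (
        label,
        speedup_percent(before, after),
        throughput_mb_per_s(TOTAL_MB, before),
        throughput_mb_per_s(TOTAL_MB, after),
    ))
```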

I ran callgrind on this benchmark (only measuring xmlDictLookup, which took 18
minutes already). The xmlDictGrow() function is called twice in both cases, so
the dictionary has the same size, but the timing for calls to memcmp() goes
down by 60%.

The total time used for looping linearly over the target buffer is only 20% (a
fifth) of the original version, where it took 54% of the overall runtime.
That's down to 15% now, which indicates that the hash table distribution is
much better in the patched version.

In total, xmlDictLookup() runs 25% faster on average according to callgrind.

I attached the two result logs. I recommend KCachegrind for investigating them.

So, I would like to get this integrated. Should I try fixing up the patch or
is anyone else interested?

Stefan


callgrind.out.patched-dict.gz
Description: GNU Zip compressed data


callgrind.out.unpatched-dict.gz
Description: GNU Zip compressed data


Re: [xml] parsing soap..

2008-03-09 Thread Stefan Behnel
Hi,

Subramanian S wrote:
   I am not able to parse the soap messages. I am using libxml2. The
 simple method of node traversing is not working.
   How can it be done???

This might help:

http://catb.org/~esr/faqs/smart-questions.html

Stefan


Re: [xml] indentation after adding new nodes

2008-03-07 Thread Stefan Behnel
Hi,

Senthil Nathan *top-posted*:
 On Thu, Mar 6, 2008 at 11:33 PM, Stefan Behnel [EMAIL PROTECTED] wrote:
 Senthil Nathan wrote:
 I tried using the xmlCopyDocNode( ) and xmlCopyNode( ). It copies the
 node
 but the indentation is not proper.
 There is no indentation in an XML tree, but there may be text nodes that
 contain whitespace. Maybe you didn't copy those?


 How can we set the indentation in libxml2?
 What are you trying to do?

 That's true. But when I dump the tree to a file, the indentation is not
 proper.
 All the nodes that I copied are just continuous.

 The xml file generated after dumping from libxml2 with few nodes added or
 copied,
 <root>
    <level1>
       <level2></level2><level21></level21><level22></level22>
    </level1>
 </root>

Ok, so what are you doing to serialise this result? Are you using the format
option? That will not always give you the expected result, as libxml2 cannot
easily distinguish between ignorable and meaningful whitespace.

Two options:

- parse with the XML_PARSE_NOBLANKS option to remove whitespace

- copy the whitespace text nodes when you copy the element nodes
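
Neither option is specific to libxml2: the indentation is just whitespace text
between elements. The sketch below shows the effect with Python's
standard-library xml.etree.ElementTree. That choice is an assumption made
purely to keep the example self-contained; it is not libxml2, and ET.indent()
needs Python 3.9 or later. The helper strip_blank_text() is a hypothetical
stand-in for what XML_PARSE_NOBLANKS does at parse time.

```python
import xml.etree.ElementTree as ET

SOURCE = "<root>\n  <level1>\n    <level2/>\n  </level1>\n</root>"


def strip_blank_text(elem):
    """Drop whitespace-only text below elem (hypothetical helper)."""
    for node in elem.iter():
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None


root = ET.fromstring(SOURCE)
level1 = root[0]

# The "indentation" in front of <level2/> is an ordinary whitespace string.
print(repr(level1.text))

# Add two elements without any whitespace text around them:
# they end up on one line when the tree is serialised.
ET.SubElement(level1, "level21")
ET.SubElement(level1, "level22")
flat = ET.tostring(root, encoding="unicode")
print(flat)

# Remove the old whitespace, then let the library re-indent the whole tree.
strip_blank_text(root)
ET.indent(root, space="  ")
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Once the stray whitespace nodes are gone, the serialiser is free to format the
whole tree consistently, which is why the first option tends to be the simpler
one.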

Stefan


Re: [xml] indentation after adding new nodes

2008-03-07 Thread Stefan Behnel


Senthil Nathan *top-posted* again:
 I used xmlSaveFile("file.xml", docTree); This just dumps the xmlDocPtr
 docTree to file.xml
 without any indentation. Are there any options in libxml2 to set it properly?

Care to read the manual?

Go to

http://xmlsoft.org/html/libxml-tree.html

and look for format.

Stefan


Re: [xml] indentation after adding new nodes

2008-03-06 Thread Stefan Behnel
Hi,

Senthil Nathan wrote:
 I tried using the xmlCopyDocNode( ) and xmlCopyNode( ). It copies the node
 but the indentation is not proper.

There is no indentation in an XML tree, but there may be text nodes that
contain whitespace. Maybe you didn't copy those?


 How can we set the indentation in libxml2?

What are you trying to do?

Stefan


[xml] XML Schema crash in W3C test suite

2008-02-28 Thread Stefan Behnel
Hi,

I just ran xmllint of a vanilla libxml2 2.6.31 over the SUN part of the W3C
XML Schema test suite.

I get a couple of failures, but also a crash in one case, so I thought I'd
send in the results.

BTW, does anyone have a script to run the whole suite? For example, I have no
idea how to figure out which of the M$ tests are supposed to be valid or
invalid. <rant>I bet they have a truly platform independent Active-X control
somewhere on their page that knows the expected results (platform
independent == runs on Windows XP *and* on Windows Vista)</rant>

Anyway, these are the test failures I get:

xmlschema2006-11-06/sunData/AttrUse/AU_valConstr/AU_valConstr00101m/AU_valConstr00101m1_n.xml
xmlschema2006-11-06/sunData/CType/attributeUses/attributeUses00101m/attributeUses00101m1_p.xml
xmlschema2006-11-06/sunData/CType/derivationMethod/derivationMethod00102m/derivationMethod00102m1_p.xml
xmlschema2006-11-06/sunData/CType/derivationMethod/derivationMethod00102m/derivationMethod00102m2_p.xml
xmlschema2006-11-06/sunData/ElemDecl/valueConstraint/valueConstraint00701m/valueConstraint00701m1_n.xml
xmlschema2006-11-06/sunData/Notation/name/name00101m/name00101m1_p.xml

I attached the test output. Seems to be mainly one problem with whitespace
around integers not being stripped.

For the following test, however, I get a crash:

xmlschema2006-11-06/sunData/Notation/targetNS/targetNS00101m/targetNS00101m2_p.xml

Valgrind gives me this:

==15628== Invalid free() / delete / delete[]
==15628==    at 0x402237F: free (vg_replace_malloc.c:233)
==15628==    by 0x4187985: xmlSchemaFreeValue (xmlschemastypes.c:1047)
==15628==    by 0x416A6CC: xmlSchemaFreeFacet (xmlschemas.c:3927)
==15628==    by 0x416A742: xmlSchemaFreeType (xmlschemas.c:3954)
==15628==    by 0x416A9A9: xmlSchemaComponentListFree (xmlschemas.c:4022)
==15628==    by 0x416: xmlSchemaBucketFree (xmlschemas.c:3504)
==15628==    by 0x410D7E8: xmlHashFree (hash.c:307)
==15628==    by 0x416AC49: xmlSchemaFree (xmlschemas.c:4119)
==15628==    by 0x804F853: main (xmllint.c:3534)
==15628==  Address 0x4389800 is 0 bytes inside a block of size 4 free'd
==15628==    at 0x402237F: free (vg_replace_malloc.c:233)
==15628==    by 0x4174BF8: xmlSchemaValidateNotation (xmlschemas.c:21820)
==15628==    by 0x417634B: xmlSchemaVCheckCVCSimpleType (xmlschemas.c:24469)
==15628==    by 0x417D606: xmlSchemaCheckFacet (xmlschemas.c:18599)
==15628==    by 0x417DC88: xmlSchemaFixupSimpleTypeStageTwo (xmlschemas.c:18756)
==15628==    by 0x4183E33: xmlSchemaFixupComponents (xmlschemas.c:20988)
==15628==    by 0x418694E: xmlSchemaParse (xmlschemas.c:21263)
==15628==    by 0x804F457: main (xmllint.c:3384)


Stefan
FAILED: 
xmlschema2006-11-06/sunData/AttrUse/AU_valConstr/AU_valConstr00101m/AU_valConstr00101m1_n.xml
xmlschema2006-11-06/sunData/AttrUse/AU_valConstr/AU_valConstr00101m/AU_valConstr00101m1_n.xml
 validates

FAILED: 
xmlschema2006-11-06/sunData/CType/attributeUses/attributeUses00101m/attributeUses00101m1_p.xml
xmlschema2006-11-06/sunData/CType/attributeUses/attributeUses00101m/attributeUses00101m1_p.xml:13:
 element a: Schemas validity error : Element '{attributeUses}a': '
123
' is not a valid value of the atomic type 'xs:int'.
xmlschema2006-11-06/sunData/CType/attributeUses/attributeUses00101m/attributeUses00101m1_p.xml
 fails to validate

FAILED: 
xmlschema2006-11-06/sunData/CType/derivationMethod/derivationMethod00102m/derivationMethod00102m1_p.xml
xmlschema2006-11-06/sunData/CType/derivationMethod/derivationMethod00102m/derivationMethod00102m1_p.xml:13:
 element a: Schemas validity error : Element '{derivationMethod}a': '
123
' is not a valid value of the atomic type 'xs:int'.
xmlschema2006-11-06/sunData/CType/derivationMethod/derivationMethod00102m/derivationMethod00102m1_p.xml
 fails to validate

FAILED: 
xmlschema2006-11-06/sunData/CType/derivationMethod/derivationMethod00102m/derivationMethod00102m2_p.xml
xmlschema2006-11-06/sunData/CType/derivationMethod/derivationMethod00102m/derivationMethod00102m2_p.xml:13:
 element a: Schemas validity error : Element '{derivationMethod}a': '
123
' is not a valid value of the atomic type 'xs:int'.
xmlschema2006-11-06/sunData/CType/derivationMethod/derivationMethod00102m/derivationMethod00102m2_p.xml
 fails to validate

FAILED: 
xmlschema2006-11-06/sunData/ElemDecl/valueConstraint/valueConstraint00701m/valueConstraint00701m1_n.xml
xmlschema2006-11-06/sunData/ElemDecl/valueConstraint/valueConstraint00701m/valueConstraint00701m1_n.xml
 validates

FAILED: xmlschema2006-11-06/sunData/Notation/name/name00101m/name00101m1_p.xml
xmlschema2006-11-06/sunData/Notation/name/name00101m/name00101m1.xsd:21: 
element enumeration: Schemas parser error : Element 
'{http://www.w3.org/2001/XMLSchema}enumeration': 'png' is not a valid value of 
the atomic type 'xs:NOTATION'.
xmlschema2006-11-06/sunData/Notation/name/name00101m/name00101m1.xsd:21: 
element enumeration: Schemas parser error : Element 
'{http://www.w3.org/2001/XMLSchema}enumeration': The 

Re: [xml] Useless function calls in xmlSetProp()?

2008-02-22 Thread Stefan Behnel
Hi,

Julien Charbon wrote:
 - With old xmlSetProp():
 
 $ ./test-setprop-big
 Size:   8   Time:   000:14397
 Size:   16  Time:   000:03429
 Size:   32  Time:   000:03164
[...]
 - With new [now current] xmlSetProp():
 
 $ ./test-setprop-big
 Size:   8   Time:   000:04981
 Size:   16  Time:   000:01847
 Size:   32  Time:   000:00906
[...]
  [Yes, attributes with value size of 1 MB are unrealistic, it is just to
 show how xmlSetProp() scaled before setprop.patch]

There is a huge difference for small strings, though. Any idea why the (most
common) really short string values take three times as long as the somewhat
longer ones? Or is it just the usual benchmark uncertainty?

What is the time scale you used above anyway?

Stefan


Re: [xml] Return value of xmlCharEncodingInputFunc

2008-02-01 Thread Stefan Behnel
Hi,

Ralf Junker wrote:
 I am in doubt about the -1 return value of these function prototypes:
 
   * xmlCharEncodingInputFunc
   * xmlCharEncodingOutputFunc
 
 The documentation says that -1 means lack of space. However, in various 
 implementations of these function prototype I see this:
 
   if ((out == NULL) || (outlen == NULL) || (inlen == NULL)) return(-1);
 
 So I wonder if the -1 result value means
 
   1. lack of space in the output buffer
   2. illegal arguments passed
 
 The difference is that with 1 I would need to provide more output, but with 
 2. I would issue a parameter error.
 
 Does anyone know which one of the two (or yet another) is the correct
 understanding of -1?

I think that's easy: you control what goes in and out, so the NULL checks will
not apply to you. All that can really go wrong is malloc problems. So the docs
are (mainly) correct.

Stefan


Re: [xml] ATTRIBUTE NAME validation problem

2008-01-28 Thread Stefan Behnel
Hi,

murali wrote:
 <!ATTLIST doc : CDATA #IMPLIED>
 
 is a valid declaration of attribute : for element doc.
 
 But , currently LIBXML2 generates a error when it encounters this.

well, that's just because ':' isn't really a well-formed attribute name. So
you're actually lucky libxml2 tells you that, otherwise you'd generate XML
that no parser can parse.

 Java xml parser also has the same behavior.

you see?

Stefan


Re: [xml] handling xpath error - libxml

2008-01-28 Thread Stefan Behnel
Hi,

Senthil Nathan wrote:
 I would like to know how to handle the xpath error gracefully when I use the
 libxml api,
 xmlXPathEvalExpression(path, xpathCtx).
 
 If I pass an invalid path string to evaluate on the xpathCtx, it throws the
 error as below and stops there.
 But I would like to handle that error gracefully and log it accordingly and
 proceed with my application.

What is the reason why you cannot just continue after this error? Just call
xmlXPathEvalExpression() again (with a working expression) and everything
should be fine.


 So, could anyone give me ideas on doing the same? Or is it possible to check
 the xpath expr is valid or not even before
 calling the libxml api.

You can compile the expression.

Stefan


Re: [xml] handling xpath error - libxml

2008-01-28 Thread Stefan Behnel
Hi,

Senthil Nathan wrote:
 On 1/28/08, Stefan Behnel [EMAIL PROTECTED] wrote:
 Hi,

 Senthil Nathan wrote:
 I would like to know how to handle the xpath error gracefully when I use the
 libxml api,
 xmlXPathEvalExpression(path, xpathCtx).

 If I pass an invalid path string to evaluate on the xpathCtx, it throws
 the
 error as below and stops there.
 But I would like to handle that error gracefully and log it accordingly
 and
 proceed with my application.
 What is the reason why you cannot just continue after this error? Just
 call
 xmlXPathEvalExpression() again (with a working expression) and everything
 should be fine.
 
 In my application, when an invalid xpath string is given, during the
 xmlXPathEvalExpression( ),
 it fails with the error and just stops there. it's not continuing further
 and just hangs or stops there.
 So, I only need to break the application. Is there a better way to handle,
 in case of these xpath errors.

Ah, so it hangs *in* the eval call and does not return? I've never seen that
before. And it definitely works for me in lxml (libxml2 2.6.31):

  >>> import lxml.etree as et
  >>> root = et.XML("<root/>")
  >>> root.xpath("/roottag///[EMAIL PROTECTED]'1']")
  Traceback (most recent call last):
  [...]
  lxml.etree.XPathEvalError: Invalid expression

That's basically using this code:

  xpathCtxt.node = some_node;
  xpathObj = xmlXPathEvalExpression(c_path, xpathCtxt);


Could you supply the libxml2 version you are using and some example code that
shows the problem? Does your machine show high CPU load while it hangs? (i.e.
does it do something?)

Stefan


Re: [xml] ATTRIBUTE NAME validation problem

2008-01-28 Thread Stefan Behnel
Hi,

Mike Hommey wrote:
 But that's still true that the XML declaration spec tells : is a valid
 attribute name in an ATTLIST. ::: is, too.
 
 http://www.w3.org/TR/xml/#NT-Name
 
 That's puzzling.

Hmm, interesting. Even the errata list this as a valid production for a Name.

http://www.w3.org/XML/xml-V10-4e-errata#E09

There, it even says:


Document authors are encouraged to use names which are meaningful words or
combinations of words in natural languages, and to avoid symbolic or white
space characters in names. Note that COLON, HYPHEN-MINUS, FULL STOP (period),
LOW LINE (underscore), and MIDDLE DOT are explicitly permitted.


However, I assume that the reason is simply the support for XML namespaces,
which explicitly exclude the ':' in their NCName production:

http://www.w3.org/TR/REC-xml-names/#ns-decl
http://www.w3.org/TR/REC-xml-names/#ns-qualnames

So, a namespace aware parser is absolutely allowed to treat :: names as
an error.
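
A simplified sketch of the difference, in Python. It only covers the ASCII
part of the Name and NCName productions (the real ones allow many more Unicode
characters), and the function names are made up for illustration.

```python
import re

# ASCII-only approximations of the productions, for illustration
NAME = re.compile(r"[:A-Za-z_][:A-Za-z_.0-9-]*")
NCNAME = re.compile(r"[A-Za-z_][A-Za-z_.0-9-]*")


def is_name(text):
    """XML 1.0 Name: the colon is an ordinary name character."""
    return NAME.fullmatch(text) is not None


def is_ncname(text):
    """Namespaces in XML NCName: a Name without any colon."""
    return NCNAME.fullmatch(text) is not None


def is_qname(text):
    """QName: either an NCName or two NCNames joined by a single colon."""
    prefix, colon, local = text.partition(":")
    if not colon:
        return is_ncname(text)
    return is_ncname(prefix) and is_ncname(local)


for candidate in (":", ":::", "doc", "xs:int"):
    print(repr(candidate), "Name:", is_name(candidate),
          "QName:", is_qname(candidate))
```

Under these approximations a bare colon passes the plain XML 1.0 check but
fails the namespace-aware one, which matches the behaviour discussed in this
thread.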

Stefan
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml


Re: [xml] XHTML 1.0 to HTML 4.01

2008-01-24 Thread Stefan Behnel

Florent Guiliani wrote:
 I wonder if there is a way, within libxml2, to convert an XML document
 (xmlDocPtr) that contains valid XHTML 1.0 into HTML 4.01? A way with only
 libxml2 function calls would be perfect.

If you parse it in, you can use the HTML parser, which should also handle
XHTML without problems. But as Daniel suggested, you might want to add a
doctype declaration.

Stefan

