Re: [xml] Important: possible incompatible changes ahead for 2.9.0 !

Stefan Behnel Tue, 07 Aug 2012 02:07:32 -0700

Daniel Veillard, 07.08.2012 10:16:
> On Mon, Aug 06, 2012 at 11:39:23PM +0200, Stefan Behnel wrote:
>> thanks for the heads-up. I don't care all that much about the global dict
>> size - 10M entries should be hard enough to reach for normal use cases.
>> Most users only deal with a very small number of XML formats.
> 
>   Okay, the point too is that the dictionary may be used to intern small
> strings and while this is unlikely to break for a single document
> reusing the same dictionary over and over for many documents may lead to
> problem.


Ah, right - I remember one user complaining once that DTD IDs were stored
there and lead to a "memory leak". He generated them on the fly, which
meant that each document had a completely distinct set of IDs. That can add
up pretty quickly.

Maybe I should add a parser option that would use a subdict instead of the
global per-thread dict.


> In any case the dictionary is only limited as part of the
> parsing process if you allocate it on your own and override the parser
> context one you won't be affected.

Ok, then lxml won't run into this anyway.


>> https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L123
> 
>   c_buffer = tree.xmlAllocOutputBuffer(enchandler)
>   ...
>   tree.xmlOutputBufferFlush(c_buffer)
>  
>   if c_buffer.conv is not NULL:
>       c_result_buffer = c_buffer.conv
>   else:
>       c_result_buffer = c_buffer.buffer
> 
>     I think you can still keep the code initializing c_result_buffer
>  the pointer names are kept the same and you're just testing for NULL
>  so that should be fine
> 
>   if encoding is _unicode:
>       result = (<unsigned char*>tree.xmlBufferContent(
>             
> c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)].decode('UTF-8')
>   else:
>       result = <bytes>(<unsigned char*>tree.xmlBufferContent( 
> c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)]
> 
>     That will have to be changed. I don't know how you express #ifdef in pxi
> but if LIBXML2_NEW_BUFFER is defined then use
>       tree.xmlBufContent/tree.xmlBufLenght
> if not defined keep using
>       tree.xmlBufferContent/tree.xmlBufferLenght
> on the pointers.

Ok. I can just #define the names conditionally in an external header file.


>> Another issue I found: xmlDumpNotationTable() still wants an xmlBuffer
>> instead of the xmlBuf that outbuffer.buffer returns. Is the right fix here
>> to include buf.h and call xmlBufBackToBuffer()?
> 
>   yeah those routines are unlikely to generate more than 2GBytes of
>   output which is why using xmlBuffer is still okay. Actually I did not
>   change any of the APIs, all the old APIs using xmlBuffer will still
>   work as before, it's just that internally I changed
>   xmlParserInputBuffer and xmlOutputBuffer to use the new xmlBuf
>   internally.
>
>> https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L293
> 
>      if c_dtd.notations != NULL:
>         tree.xmlDumpNotationTable(c_buffer.buffer,
>                                   <tree.xmlNotationTable*>c_dtd.notations)
> 
>   that will need some fixing, right ... You can't include buf.h because
> it is private ...
>
> What I would do is create a new xmlBuffer, dump to it
> get the resulting string and do tree.xmlOutputBufferWrite() to append it
> and then free the buffer. That sounds the simplest and most portable to
> old and new versions.

Ok. Let's assume that internal subsets tend to be small and that this is
good enough.


>> It seems to me that redefining xmlBufferLength and xmlBufferContent to call
>> the new xmlBuf functions and using a size_t (or ssize_t?) to store the
>> result of xmlBufLength would do the trick.
> 
>   I can't do that really it would break the API, both data structures
> will coexist ... forever.

Sorry - I meant that it would do the trick for me, i.e. doing it on user
side will work. I wasn't suggesting to do it in libxml2's header files.


>> BTW, is there a reason why there's both an xmlBufLength() and an
>> xmlBufUse() that do the same thing? Since this is a new API that doesn't
>> suffer from legacy junk yet, wouldn't one be enough? (And wouldn't
>> xmlBufLength() be the perfect name?)
> 
>   well I wanted the conversion to be as automatic as possible and
> since the field which was used was buf->use, I perfer to keep that
> 'alias' for simpler conversions where needed. But in theory you're
> perfectly right it is pure code duplication ...

Your choice - I would have decided against having two ways to do it.


>>>   I don't plan to make an official release with the changes before
>>> September, so there is a bit of time to get this all cleaned up, and
>>> possibly refine the migration stategy for the few apps affected.
>>
>> There'll be a new release (3.0) of lxml quite soon, within a few weeks. It
>> should be doable to get this fixed up by then.
> 
>   Okay, tell me if you have problems. I didn't fully finished, as you
> can see I'm still commiting fixes and improvements on that part. Let's
> synchronize so that people making new build of lxml do it on top of
> the new libxml2, otherwise they will have to rely on the ABI
> compatibility ! I plan beginning of September, I hope it's not too late
> for you :)

I'm not all too worried here. Linux users either use the lxml that comes
with their system or freshly build it themselves. In the first case, it's
unlikely that a distribution would switch to libxml2 2.9 behind their back
(distributors tend to be rather conservative), in the latter case, they'll
often take care of the dependencies themselves and just rebuild when they
upgrade something.

MacOS users as well as Windows users typically use builds that link libxml2
and libxslt in statically. On MacOS, because the system libraries tend to
interfere with newer installations in uncontrollable ways and on Windows
because users can't build software themselves and happily click on
redundantly bloated installers they find on the Internet.

So it's actually rare that libxml2 gets updated without users being able to
update lxml as well. And building Python extensions is pretty automatic for
most users who tend to build their own software.

Thanks again!

Stefan

_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
https://mail.gnome.org/mailman/listinfo/xml

Re: [xml] Important: possible incompatible changes ahead for 2.9.0 !

Reply via email to