On Mon, Aug 06, 2012 at 11:39:23PM +0200, Stefan Behnel wrote: > Hi Daniel,
Hi Stefan, > thanks for the heads-up. I don't care all that much about the global dict > size - 10M entries should be hard enough to reach for normal use cases. > Most users only deal with a very small number of XML formats. Okay, the point too is that the dictionary may be used to intern small strings and while this is unlikely to break for a single document reusing the same dictionary over and over for many documents may lead to problem. In any case the dictionary is only limited as part of the parsing process if you allocate it on your own and override the parser context one you won't be affected. > But I did run into issues with the buffer changes. > > Daniel Veillard, 06.08.2012 09:00: > > The new buffer structure will be ABI compatible with the old ones, > > i.e. the old code as compiled wil be able to work with the new one, as > > the fields with the same values are in the same place in the new > > structures. But the structure are now opaque and the few places where > > the code was using it directly will need fixing. > > What I see from the usage there are for example access to > > xmlOutputBuffers: > > > > buf = xmlAllocOutputBuffer (NULL); > > ....dump stuff to the buffer... > > use data at buf->buffer->content, of size buf->buffer->use > > > > First okay, that was allowed by the API, but such buffers were supposed > > to be used for I/O and encoding conversion, in general accessing > > buf->buffer->content and buf->buffer->use directly was not really the > > expected way to do things. The fact that xmlOutputBuffer were not > > supposed to be used that way is the reason why there is no accessors for > > getting the output data, this is now fixed as of commit > > > > > > http://git.gnome.org/browse/libxml2/commit/?id=e258adecd0e19a6cfe6afa232b89aa416368820e > > > > So where there is such use of direct access, check the LIBXML2_NEW_BUFFER > > macro and if present then > > - replace buf->buffer->content with xmlOutputBufferGetContent(buf) > > - replace buf->buffer->use with xmlOutputBufferGetSize(buf) > > I tested it and found that lxml is affected by this. lxml currently takes > the xmlBuffer* from either the "conv" or "buffer" field of the output > buffer and then calls xmlBufferContent() and xmlBufferLength() to get at > the result. I take it that this isn't how it'll work in the future, because > xmlBufferLength() returns an int and buffers are supposed to be larger than > that, right? yes basically it's the problem, the two fields have been replaced by xmlBufPtr instead of xmlBufferPtr . But you should just have to call xmlBufContent() and xmlBufLength() conditionally to the availability of LIBXML2_NEW_BUFFER to compile in the new setup > However, xmlOutputBufferGetContent() only reads the "buffer" field, not the > "conv" field. How should I use the "conv" field now? Can't the new > xmlOutputBufferGetContent() do "the right thing" for me? Just don't use xmlOutputBufferGetContent() , use xmlBufContent() and xmlBufLength() on the new structure instead and keep you code as-is > Code that uses xmlBuffer directly is here: > > https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L31 Actually that looks just fine. xmlBufferPtr still exists and the associated routines and entry points are maintained. Calling c_buffer = tree.xmlBufferCreate() c_text = tree.xmlBufferContent(c_buffer) is part of the API, it's not changed in any way. > https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L123 c_buffer = tree.xmlAllocOutputBuffer(enchandler) ... tree.xmlOutputBufferFlush(c_buffer) if c_buffer.conv is not NULL: c_result_buffer = c_buffer.conv else: c_result_buffer = c_buffer.buffer I think you can still keep the code initializing c_result_buffer the pointer names are kept the same and you're just testing for NULL so that should be fine if encoding is _unicode: result = (<unsigned char*>tree.xmlBufferContent( c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)].decode('UTF-8') else: result = <bytes>(<unsigned char*>tree.xmlBufferContent( c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)] That will have to be changed. I don't know how you express #ifdef in pxi but if LIBXML2_NEW_BUFFER is defined then use tree.xmlBufContent/tree.xmlBufLenght if not defined keep using tree.xmlBufferContent/tree.xmlBufferLenght on the pointers. > Another issue I found: xmlDumpNotationTable() still wants an xmlBuffer > instead of the xmlBuf that outbuffer.buffer returns. Is the right fix here > to include buf.h and call xmlBufBackToBuffer()? yeah those routines are unlikely to generate more than 2GBytes of output which is why using xmlBuffer is still okay. Actually I did not change any of the APIs, all the old APIs using xmlBuffer will still work as before, it's just that internally I changed xmlParserInputBuffer and xmlOutputBuffer to use the new xmlBuf internally. > https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L293 if c_dtd.notations != NULL: tree.xmlDumpNotationTable(c_buffer.buffer, <tree.xmlNotationTable*>c_dtd.notations) that will need some fixing, right ... You can't include buf.h because it is private ... What I would do is create a new xmlBuffer, dump to it get the resulting string and do tree.xmlOutputBufferWrite() to append it and then free the buffer. That sounds the simplest and most portable to old and new versions. > (BTW, the reason why the serialisation code is doing so much stuff manually > is IIRC that lxml still supports a couple of libxml2 versions that lack the > newer features of the serialisation/xmlSave API. And also to avoid slight > changes to the serialised XML if it switched to native libxml2 functions > abruptly.) yeah, understood... libxslt does a bit of mess around this too :-) > > > if in some place the xmlBufferPtr was passed independantly of the > > OutputBuffer, it's possible to use xmlBufGetContent(buffer) and > > xmlBufUse(buffer) to achieve the same. > > I assume you meant xmlBufContent() ? yup, right > It seems to me that redefining xmlBufferLength and xmlBufferContent to call > the new xmlBuf functions and using a size_t (or ssize_t?) to store the > result of xmlBufLength would do the trick. I can't do that really it would break the API, both data structures will coexist ... forever. > BTW, is there a reason why there's both an xmlBufLength() and an > xmlBufUse() that do the same thing? Since this is a new API that doesn't > suffer from legacy junk yet, wouldn't one be enough? (And wouldn't > xmlBufLength() be the perfect name?) well I wanted the conversion to be as automatic as possible and since the field which was used was buf->use, I perfer to keep that 'alias' for simpler conversions where needed. But in theory you're perfectly right it is pure code duplication ... > > > I don't plan to make an official release with the changes before > > September, so there is a bit of time to get this all cleaned up, and > > possibly refine the migration stategy for the few apps affected. > > There'll be a new release (3.0) of lxml quite soon, within a few weeks. It > should be doable to get this fixed up by then. Okay, tell me if you have problems. I didn't fully finished, as you can see I'm still commiting fixes and improvements on that part. Let's synchronize so that people making new build of lxml do it on top of the new libxml2, otherwise they will have to rely on the ABI compatibility ! I plan beginning of September, I hope it's not too late for you :) Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ [email protected] | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/ _______________________________________________ xml mailing list, project page http://xmlsoft.org/ [email protected] https://mail.gnome.org/mailman/listinfo/xml
