On Mon, Aug 06, 2012 at 11:39:23PM +0200, Stefan Behnel wrote:
> Hi Daniel,

  Hi Stefan,

> thanks for the heads-up. I don't care all that much about the global dict
> size - 10M entries should be hard enough to reach for normal use cases.
> Most users only deal with a very small number of XML formats.

  Okay, the point too is that the dictionary may be used to intern small
strings and while this is unlikely to break for a single document
reusing the same dictionary over and over for many documents may lead to
problem. In any case the dictionary is only limited as part of the
parsing process if you allocate it on your own and override the parser
context one you won't be affected.

> But I did run into issues with the buffer changes.
> 
> Daniel Veillard, 06.08.2012 09:00:
> >   The new buffer structure will be ABI compatible with the old ones,
> > i.e. the old code as compiled wil be able to work with the new one, as
> > the fields with the same values are in the same place in the new
> > structures. But the structure are now opaque and the few places where
> > the code was using it directly will need fixing.
> >   What I see from the usage there are for example access to 
> > xmlOutputBuffers:
> > 
> >   buf = xmlAllocOutputBuffer (NULL);
> >   ....dump stuff to the buffer...
> >   use data at buf->buffer->content, of size buf->buffer->use
> > 
> > First okay, that was allowed by the API, but such buffers were supposed
> > to be used for I/O and encoding conversion, in general accessing
> > buf->buffer->content and buf->buffer->use directly was not really the
> > expected way to do things. The fact that xmlOutputBuffer were not
> > supposed to be used that way is the reason why there is no accessors for
> > getting the output data, this is now fixed as of commit
> > 
> >   
> > http://git.gnome.org/browse/libxml2/commit/?id=e258adecd0e19a6cfe6afa232b89aa416368820e
> > 
> >  So where there is such use of direct access, check the LIBXML2_NEW_BUFFER
> > macro and if present then
> >    - replace buf->buffer->content with xmlOutputBufferGetContent(buf)
> >    - replace buf->buffer->use with xmlOutputBufferGetSize(buf)
> 
> I tested it and found that lxml is affected by this. lxml currently takes
> the xmlBuffer* from either the "conv" or "buffer" field of the output
> buffer and then calls xmlBufferContent() and xmlBufferLength() to get at
> the result. I take it that this isn't how it'll work in the future, because
> xmlBufferLength() returns an int and buffers are supposed to be larger than
> that, right?

  yes basically it's the problem, the two fields have been replaced by
  xmlBufPtr instead of xmlBufferPtr . But you should just have to
  call xmlBufContent() and xmlBufLength() conditionally to the
  availability of LIBXML2_NEW_BUFFER to compile in the new setup

> However, xmlOutputBufferGetContent() only reads the "buffer" field, not the
> "conv" field. How should I use the "conv" field now? Can't the new
> xmlOutputBufferGetContent() do "the right thing" for me?

  Just don't use xmlOutputBufferGetContent() , use xmlBufContent() and
  xmlBufLength() on the new structure instead and keep you code as-is

> Code that uses xmlBuffer directly is here:
> 
> https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L31

  Actually that looks just fine. xmlBufferPtr still exists and the
  associated routines and entry points are maintained. Calling

    c_buffer = tree.xmlBufferCreate()
    c_text = tree.xmlBufferContent(c_buffer)

is part of the API, it's not changed in any way.

> https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L123

  c_buffer = tree.xmlAllocOutputBuffer(enchandler)
  ...
  tree.xmlOutputBufferFlush(c_buffer)
 
  if c_buffer.conv is not NULL:
      c_result_buffer = c_buffer.conv
  else:
      c_result_buffer = c_buffer.buffer

    I think you can still keep the code initializing c_result_buffer
 the pointer names are kept the same and you're just testing for NULL
 so that should be fine

  if encoding is _unicode:
      result = (<unsigned char*>tree.xmlBufferContent(
              
c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)].decode('UTF-8')
  else:
      result = <bytes>(<unsigned char*>tree.xmlBufferContent( 
c_result_buffer))[:tree.xmlBufferLength(c_result_buffer)]

    That will have to be changed. I don't know how you express #ifdef in pxi
but if LIBXML2_NEW_BUFFER is defined then use
      tree.xmlBufContent/tree.xmlBufLenght
if not defined keep using
      tree.xmlBufferContent/tree.xmlBufferLenght
on the pointers.

> Another issue I found: xmlDumpNotationTable() still wants an xmlBuffer
> instead of the xmlBuf that outbuffer.buffer returns. Is the right fix here
> to include buf.h and call xmlBufBackToBuffer()?

  yeah those routines are unlikely to generate more than 2GBytes of
  output which is why using xmlBuffer is still okay. Actually I did not
  change any of the APIs, all the old APIs using xmlBuffer will still
  work as before, it's just that internally I changed
  xmlParserInputBuffer and xmlOutputBuffer to use the new xmlBuf
  internally.

> https://github.com/lxml/lxml/blob/master/src/lxml/serializer.pxi#L293

     if c_dtd.notations != NULL:
        tree.xmlDumpNotationTable(c_buffer.buffer,
                                  <tree.xmlNotationTable*>c_dtd.notations)

  that will need some fixing, right ... You can't include buf.h because
it is private ... What I would do is create a new xmlBuffer, dump to it
get the resulting string and do tree.xmlOutputBufferWrite() to append it
and then free the buffer. That sounds the simplest and most portable to
old and new versions.
 
> (BTW, the reason why the serialisation code is doing so much stuff manually
> is IIRC that lxml still supports a couple of libxml2 versions that lack the
> newer features of the serialisation/xmlSave API. And also to avoid slight
> changes to the serialised XML if it switched to native libxml2 functions
> abruptly.)

  yeah, understood... libxslt does a bit of mess around this too :-)

> 
> >   if in some place the xmlBufferPtr was passed independantly of the
> > OutputBuffer, it's possible to use xmlBufGetContent(buffer) and
> > xmlBufUse(buffer) to achieve the same.
> 
> I assume you meant xmlBufContent() ?

  yup, right

> It seems to me that redefining xmlBufferLength and xmlBufferContent to call
> the new xmlBuf functions and using a size_t (or ssize_t?) to store the
> result of xmlBufLength would do the trick.

  I can't do that really it would break the API, both data structures
will coexist ... forever.

> BTW, is there a reason why there's both an xmlBufLength() and an
> xmlBufUse() that do the same thing? Since this is a new API that doesn't
> suffer from legacy junk yet, wouldn't one be enough? (And wouldn't
> xmlBufLength() be the perfect name?)

  well I wanted the conversion to be as automatic as possible and
since the field which was used was buf->use, I perfer to keep that
'alias' for simpler conversions where needed. But in theory you're
perfectly right it is pure code duplication ...

> 
> >   I don't plan to make an official release with the changes before
> > September, so there is a bit of time to get this all cleaned up, and
> > possibly refine the migration stategy for the few apps affected.
> 
> There'll be a new release (3.0) of lxml quite soon, within a few weeks. It
> should be doable to get this fixed up by then.

  Okay, tell me if you have problems. I didn't fully finished, as you
can see I'm still commiting fixes and improvements on that part. Let's
synchronize so that people making new build of lxml do it on top of
the new libxml2, otherwise they will have to rely on the ABI
compatibility ! I plan beginning of September, I hope it's not too late
for you :)

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
[email protected]  | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
https://mail.gnome.org/mailman/listinfo/xml

Reply via email to