Tres Seaver wrote:
Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump.  The API requires that you pass it strings encoded as UTF-8.
You can in lxml. :) libxml2 as a C API doesn't even support any unicode string type as far as I am aware.

It *requires* UTF-8-encoded strings.  See

  12. So what is this funky "xmlChar" used all the time?

      It is a null terminated sequence of utf-8 characters. And only
      utf-8! You need to convert strings encoded in different ways to
      utf-8 before passing them to the API. This can be accomplished
      with the iconv library for instance.

Um, Tres, no need to tell me about the libxml2 API.

There is also the libxml2 *python* API, which I believe has a knob to turn on the ability to pass in unicode strings, though I haven't tried that myself. Then there's of course lxml, a Python layer that requires unicode or plain-ASCII strings in its DOM-ish (ElementTree) API, and encoded data for the parser.

We should distinguish between the behavior of libxml2 as a tree API (UTF-8 all the way) and as a parser/serializer (all sorts of encodings). Generally XML libraries make a distinction between the two.
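
To make the tree versus parser/serializer distinction concrete, here is a rough sketch. It assumes a current Python and uses the standard library's xml.etree.ElementTree as a stand-in for an ElementTree-style API; it is not lxml or libxml2 itself, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Parser side: the input is encoded bytes, and the encoding is taken
# from the XML declaration (ISO-8859-1 in this made-up document).
latin1_doc = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    "<greeting>café</greeting>"
).encode("iso-8859-1")
root = ET.fromstring(latin1_doc)

# Tree side: text comes back as an ordinary (unicode) string,
# regardless of how the input bytes were encoded.
tree_text = root.text

# Serializer side: pick an output encoding when turning the tree
# back into bytes.
utf8_doc = ET.tostring(root, encoding="utf-8")
```

The point is only that the encoding matters at the edges (parsing and serializing), while the tree API in between deals in one string type.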

Frankly, I don't get the desire to *store* a complete XML document (as
opposed to the extracted contents of attributes or nodes) as unicode:
it isn't as though it can be easily processed in that form without
re-encoding (even if lxml is the one doing the re-encoding).  It isn't
"discourse", in the Zope3 sense of "text intended for human
consumption", and the tools people use with it are all going to expect
some kind of validly-encoded string.

There are objects that allow you to edit XML; the ZPT page is an example. I do not know whether it stores as unicode right now, but you can argue it's text intended for human consumption, as humans are supposed to be editing it. :)

From an efficiency point of view, however, it may indeed make more sense to store this information as UTF-8. This would probably still require decoding the data into unicode for the purposes of inspecting and editing it.
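
A minimal sketch of that round trip, assuming a current Python where str is the unicode type and bytes is the encoded form. The variable names and the sample markup are hypothetical and do not reflect how the ZPT page actually stores its source.

```python
# Hypothetical stored form: the document source kept as UTF-8 bytes.
stored = "<p>Grüße</p>".encode("utf-8")

# Decode to unicode so the text can be inspected and edited by
# character rather than by byte.
text = stored.decode("utf-8")
edited = text.replace("Grüße", "Grüße, Zope")

# Re-encode to UTF-8 before storing it again.
stored = edited.encode("utf-8")
```

Note that len(stored) and len(edited) differ once non-ASCII characters are involved, which is one reason editing on the encoded bytes directly is awkward.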



Zope3-dev mailing list
