Tres Seaver wrote:
You can in lxml. :) libxml2 as a C API doesn't even support any unicode
string type as far as I am aware.
Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump. The API *requires* UTF-8-encoded strings. See
http://xmlsoft.org/xml.html
12. So what is this funky "xmlChar" used all the time?
It is a null terminated sequence of utf-8 characters. And only
utf-8! You need to convert strings encoded in different ways to
utf-8 before passing them to the API. This can be accomplished
with the iconv library for instance.
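
As a rough illustration of the conversion step the FAQ entry describes, here is a minimal Python sketch that re-encodes a byte string from some other encoding into UTF-8. The helper name `to_utf8` is made up for this example, and it uses Python's built-in codecs rather than iconv itself.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from source_encoding into UTF-8.

    Hypothetical helper: decodes to a unicode string first, then
    encodes as UTF-8, which is roughly what an iconv call would do.
    """
    return data.decode(source_encoding).encode("utf-8")


# "caf\u00e9" in Latin-1 is a single byte for the accented letter.
latin1_bytes = "caf\u00e9".encode("latin-1")

# After conversion the accented letter takes two bytes in UTF-8.
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
```

Only the UTF-8 result would be safe to hand to an API that expects UTF-8 bytes.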
Um, Tres, no need to tell me about the libxml2 API.
There is also the libxml2 *python* API, which I believe has a knob to
turn on the ability to pass in unicode strings, though I haven't tried
that myself. Then there's of course lxml, which is a Python layer that
requires unicode or plain-ASCII strings in its DOM-ish (ElementTree)
API, and encoded data for the parser.
We should distinguish the behavior of libxml2 as a tree API (utf-8 all
the way) and as a parser/serializer (all sorts of encodings). Generally
XML libraries make a distinction between the two.
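
To make the tree versus parser/serializer distinction concrete, here is a small sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in (an assumption made so the example runs without lxml or libxml2 installed); it only illustrates the general split and does not show libxml2's internals.

```python
import xml.etree.ElementTree as ET

# Parser side: the input is encoded bytes that declare their own encoding.
raw = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    "<doc>caf\u00e9</doc>"
).encode("iso-8859-1")
root = ET.fromstring(raw)

# Tree side: text comes back as an ordinary (unicode) string,
# whatever encoding the document arrived in.
text = root.text

# Serializer side: the same tree can be written out in several encodings.
utf8_out = ET.tostring(root, encoding="utf-8")
latin1_out = ET.tostring(root, encoding="iso-8859-1")
```

The encoding only matters at the two edges; code that works on the tree never sees it.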
Frankly, I don't get the desire to *store* a complete XML document (as
opposed to the extracted contents of attributes or nodes) as unicode:
it isn't as though it can be easily processed in that form without
re-encoding (even if lxml is the one doing the re-encoding). It isn't
"discourse", in the Zope3 sense of "text intended for human
consumption", and the tools people use with it are all going to expect
some kind of validly-encoded string.
There are objects that allow you to edit XML; the ZPT page is an
example. I do not know whether it stores as unicode right now, but you
can argue it's text intended for human consumption, as humans are
supposed to be editing it. :)
From an efficiency point of view, however, it may indeed make more
sense to store this information as UTF-8. This would probably still require
recoding the data into unicode for the purposes of inspecting it and
Zope3-dev mailing list