Tres Seaver wrote:

Andreas Jung wrote:
--On 14. Januar 2007 18:14:45 +0000 Chris Withers <[EMAIL PROTECTED]> wrote:

Dieter Maurer wrote:
A halfway intelligent parser would accept Unicode when it gets it
and concentrate on the remaining part of its task: either reporting
structural events or building a parse tree.
The trivial fix I use in Twiddler is as follows:

if isinstance(source, unicode):
    source = source.encode('utf-8')

Of course, this assumes an XML declaration of either <?xml version="1.0"
encoding="utf-8"?> or one with no encoding attribute, in which case the XML
spec states that, absent a byte order mark, the document must be UTF-8 encoded.
The encoding of the XML preamble should not matter when parsing an XML
document stored as a unicode string.

That encoding is a *lie*, which is the real problem.  Parsers expect it
to be *correct*, and if missing, expect the text to be encoded as UTF-8,
per the spec (if the document comes from an HTTP request, then the
application may supply the encoding from the request headers).
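To make the point about a "lying" declaration concrete, here is a minimal sketch, not Twiddler's actual code. It assumes Python 3, where `str` plays the role of the `unicode` type discussed in this thread, and the helper name `parse_root_text` is made up for illustration. It applies the encode-to-UTF-8 fix and then hands the bytes to the standard library parser, which trusts whatever the declaration says:

```python
import xml.etree.ElementTree as ET


def parse_root_text(source):
    """Apply the 'encode to UTF-8' fix, then return the root element's text."""
    if isinstance(source, str):  # `unicode` in the Python 2 of this thread
        source = source.encode('utf-8')
    # Given bytes, the parser decodes them using the declared encoding,
    # or UTF-8 when no encoding is declared.
    return ET.fromstring(source).text


# U+00E9 is a Latin small letter e with acute accent.
honest = '<?xml version="1.0" encoding="utf-8"?><p>\u00e9</p>'
undeclared = '<p>\u00e9</p>'
lying = '<?xml version="1.0" encoding="iso-8859-1"?><p>\u00e9</p>'

for doc in (honest, undeclared, lying):
    print(repr(parse_root_text(doc)))
```

The first two documents round-trip cleanly. The third comes back as two wrong characters, because the UTF-8 bytes are decoded as ISO-8859-1, exactly as the declaration instructs.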

Nothing in the XML specs allows or specifies any behavior for XML
documents serialized as unicode, because such serializations are
*programming language specific*.

While I agree that the encoding declaration is ambiguous at best and should be rejected, you can find a bit in the spec which supports XML as Python unicode strings. A Python unicode string can be seen as a string with "external character encoding information": it's the native encoding of Python. Therefore we can make sense of it in an XML parser. For my previous analysis of the spec see here:

What however is bad and evil is to just ignore conflicting encoding declarations in an XML document itself. I'd choose either one of:

* bail with a clear error when unicode is supplied at all

* bail with a clear error when unicode is supplied with any explicit encoding declaration in the XML.
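Either policy above could be sketched roughly as follows. This is an illustration under assumptions, not code from any of the parsers discussed: it is written for Python 3 (where `str` stands in for `unicode`), the names `UnicodeXMLError` and `check_source` are hypothetical, and the regular expression is a simplification of the XML declaration grammar.

```python
import re

# Simplified match for an encoding declaration at the very start of the text;
# not a full implementation of the XMLDecl production.
_ENCODING_DECL = re.compile(
    r'''^<\?xml\s[^>]*?encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1''')


class UnicodeXMLError(ValueError):
    """Hypothetical error type raised when text input is rejected."""


def check_source(source, allow_undeclared=True):
    """Return source unchanged, or raise if it is text we refuse to parse.

    allow_undeclared=False implements the first policy (reject all text);
    allow_undeclared=True implements the second (reject text only when it
    carries an explicit encoding declaration).
    """
    if not isinstance(source, str):
        return source  # encoded bytes: the declaration is meaningful
    if not allow_undeclared:
        raise UnicodeXMLError('text input is not accepted; pass encoded bytes')
    match = _ENCODING_DECL.match(source)
    if match:
        raise UnicodeXMLError(
            'text input carries an encoding declaration (%s)' % match.group(2))
    return source
```

With the default setting, `check_source('<doc/>')` passes through, while `check_source('<?xml version="1.0" encoding="utf-8"?><doc/>')` raises, even though the declared encoding looks harmless: once the document is already decoded text, any declaration is at best redundant.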

It is of importance as soon as you convert the document back to a stream, e.g. when we deliver the content back to a browser or an FTP client. The ZPublisher (for Zope 2) deals with that by changing the encoding parameter of the preamble for XML documents based on the desired output encoding. utf-8 is always a good choice; however, other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2 publisher "avoids" this problem by converting the unicode result using errors='replace' (which is likely something we might discuss :-))
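The output-side behaviour described here can be sketched as below. This is not the ZPublisher code; it is a simplified illustration assuming Python 3 (`str` in place of `unicode`), a hypothetical `serialize` helper, and a document that is XML 1.0 with no `standalone` attribute, since the whole declaration is replaced rather than edited.

```python
import re

# Matches an XML declaration at the very start of the text.
_XML_DECL = re.compile(r'^<\?xml[^>]*\?>')


def serialize(text, encoding='utf-8', errors='strict'):
    """Rewrite the declaration to name the output encoding, then encode."""
    body = _XML_DECL.sub('', text, count=1)
    decl = '<?xml version="1.0" encoding="%s"?>' % encoding
    return (decl + body).encode(encoding, errors)


# U+20AC (euro sign) exists in iso-8859-15; U+0142 (l with stroke) does not.
doc = '<?xml version="1.0" encoding="utf-8"?><p>\u20ac \u0142</p>'

print(serialize(doc))                            # UTF-8 can encode everything
print(serialize(doc, 'iso-8859-15', 'replace'))  # unencodable char becomes '?'
try:
    serialize(doc, 'iso-8859-15')                # strict mode refuses
except UnicodeEncodeError as exc:
    print('strict encoding failed:', exc.reason)
```

The `errors='replace'` variant never fails but silently loses the character. Python's built-in `errors='xmlcharrefreplace'` handler is one alternative worth weighing, since it emits a numeric character reference instead of a question mark.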

Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump.  The API requires that you pass it strings encoded as UTF-8.

You can in lxml. :) libxml2 as a C API doesn't even support any unicode string type as far as I am aware.



Zope3-dev mailing list
