-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Martijn Faassen wrote:
> Tres Seaver wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Andreas Jung wrote:
>>> --On 14. Januar 2007 18:14:45 +0000 Chris Withers <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Dieter Maurer wrote:
>>>>> A halfway intelligent parser would accept Unicode when it gets it
>>>>> and concentrate on the remaining part of its task: either reporting
>>>>> structural events or building a parse tree.
>>>> The trivial fix I use in Twiddler is as follows:
>>>>
>>>> if isinstance(source,unicode):
>>>>     source = source.encode('utf-8')
>>>>
>>>> Of course, this assumes a heading of either <?xml version="1.0"
>>>> encoding="utf-8"?> or a missing encoding attribute, in which case the xml
>>>> spec states that the string must be utf-8 encoded.
>>> The encoding of the XML preamble should not matter when parsing a XML
>>> document stored as unicode string.
>> That encoding is a *lie*, which is the real problem. Parsers expect it
>> to be *correct*, and if missing, expect the text to be encoded as UTF-8,
>> per the spec (if the document comes from an HTTP request, then the
>> application may supply the encoding from the request headers).
>>
>> Nothing in the XML specs allows or specifies any behavior for XML
>> documents serialized as unicode, because such serializations are
>> *programming language specific*.
>
> While I agree that the encoding declaration is ambiguous at best and
> should be rejected, you can find a bit in the spec which supports XML as
> Python unicode strings. A Python unicode string can be seen as a string
> with "external character encoding information": it's the native encoding
> of Python. Therefore we can make sense of it in an XML parser. For my
> previous analysis of the spec see here:
>
> http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html
>
> What however is bad and evil is to just ignore conflicting encoding
> declarations in an XML document itself. I'd choose either one of:
>
> * bail with a clear error when unicode is supplied at all
>
> * bail with a clear error when unicode is supplied with any explicit
>   encoding declaration in the XML.
>
>>> It is of importance as soon as you
>>> convert the document back to a stream e.g. when we deliver the content
>>> back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with
>>> that by changing the encoding parameter of the preamble for XML documents
>>> based on the desired output encoding. utf-8 is always a good choice however
>>> other encodings like iso-8859-15 might raise UnicodeDecodeErrors. The Zope 2
>>> publisher "avoids" this problem converting the unicode result using
>>> errors='replace' (which is likely something we might discuss :-))
>> Unicode XML is not only problematic for streaming. For instance, you
>> *can't* pass a Unicode string to the libxml2 *at all*, unless you want
>> a core dump. The API requires that you pass it strings encoded as UTF8.
>
> You can in lxml. :) libxml2 as a C API doesn't even support any unicode
> string type as far as I am aware.

It *requires* UTF-8-encoded strings.  See http://xmlsoft.org/xml.html

  12. So what is this funky "xmlChar" used all the time?

      It is a null terminated sequence of utf-8 characters. And only
      utf-8! You need to convert strings encoded in different ways to
      utf-8 before passing them to the API. This can be accomplished
      with the iconv library for instance.

Frankly, I don't get the desire to *store* a complete XML document (as
opposed to the extracted contents of attributes or nodes) as unicode:
it isn't as though it can be easily processed in that form without
re-encoding (even if lxml is the one doing the re-encoding).  It isn't
"discourse", in the Zope3 sense of "text intended for human
consumption", and the tools people use with it are all going to expect
some kind of validly-encoded string.


Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          [EMAIL PROTECTED]
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v18.104.22.168 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFFq/ix+gerLs4ltQ4RAmkTAJ9ifMH37TNyfZXo+v5zvXCsrFXIXQCfZFow
GBTndXG+0Gw9OnAZeNCxADs=
=Yr7F
-----END PGP SIGNATURE-----
_______________________________________________
Zope3-dev mailing list
Zope3email@example.com
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
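
For readers on a current Python, the second option Martijn lists in this
thread (reject text input that carries its own encoding declaration,
otherwise encode to UTF-8 before handing it to the parser) might be
sketched roughly as below. This is only an illustration, not code from
Twiddler, lxml or Zope: parse_xml and ConflictingEncodingError are
hypothetical names, the standard-library xml.etree.ElementTree stands in
for whatever parser is actually used, and Python 3's str plays the role
of the unicode type discussed above.

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration at the start of the document that carries an
# explicit encoding pseudo-attribute, e.g. <?xml version="1.0" encoding="x"?>.
_ENCODING_DECL = re.compile(r"<\?xml[^>]*\bencoding\s*=")


class ConflictingEncodingError(ValueError):
    """Hypothetical error: text input that also declares a byte encoding."""


def parse_xml(source):
    """Parse XML given either as text (str) or as encoded bytes.

    Assumption for this sketch: text input must not declare an encoding,
    because the declaration describes bytes and could contradict the
    already-decoded text. Bytes input is passed through unchanged, so the
    parser honours its declaration (or falls back to UTF-8).
    """
    if isinstance(source, str):
        if _ENCODING_DECL.match(source):
            raise ConflictingEncodingError(
                "text input must not carry an explicit encoding declaration"
            )
        # No declaration: the parser will assume UTF-8, so supply UTF-8.
        source = source.encode("utf-8")
    return ET.fromstring(source)
```

Bytes input keeps working as before, so a caller that already holds a
validly encoded document with a truthful declaration is unaffected; only
the ambiguous "text plus declaration" case is refused.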