[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Andreas Jung wrote:
> I am replying to the three proposals. First I have to kick the proposal of Tres (UTF-8 storage). We want unicode as internal representation for any kind of ZPT (both text/html and text/xml).

I'm not sure I understand this. Wouldn't the internal representation be unicode after the parse, no matter what the representation of the text itself is? I may be missing something about the way ZPT templates are stored, though.

Regards,

Martijn
___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
--On 18. Januar 2007 08:29:57 -0500 Fred Drake <[EMAIL PROTECTED]> wrote:
> On 1/18/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
>> We're faster with new Zope versions than the W3C with any standard.
>
> So? The recommendation for XML 1.1 is already a done deal (a "second edition" was published last September), so there are already multiple specified versions. Since other version strings are allowed, whether there's a published specification or not, we don't want to make assumptions about what's there.

Are the underlying frameworks (TAL, xml.parsers.pyexpat) ready for XML 1.1?

-aj
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
On 1/18/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
> We're faster with new Zope versions than the W3C with any standard.

So? The recommendation for XML 1.1 is already a done deal (a "second edition" was published last September), so there are already multiple specified versions. Since other version strings are allowed, whether there's a published specification or not, we don't want to make assumptions about what's there.

How the information should be stored is another matter; my point is only that we can't make any assumptions about it beyond that it's "1.0" if the XML declaration is omitted.

-Fred

--
Fred L. Drake, Jr.
"Every sin is the result of a collaboration." --Lucius Annaeus Seneca
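The rule stated above (assume "1.0" only when the declaration is missing, otherwise pass the declared string through untouched) can be sketched as follows. This is a minimal illustration in Python 3, not code from zope.tal; the helper name `declared_xml_version` and the regular expression are assumptions made for this example.

```python
import re

# Hypothetical helper for illustration; not part of zope.tal or pyexpat.
# Matches an XML declaration at the very start of the text and captures
# the value of its version pseudo-attribute.
_XML_DECL = re.compile(
    r'^<\?xml\s+version\s*=\s*(["\'])(?P<version>[^"\']+)\1[^>]*\?>'
)


def declared_xml_version(text):
    """Return the declared version string, or "1.0" if there is no declaration."""
    match = _XML_DECL.match(text)
    if match is None:
        # The only assumption that is safe when the declaration is omitted.
        return "1.0"
    # Returned as written; no attempt is made to validate the version string.
    return match.group("version")


print(declared_xml_version('<?xml version="1.1" encoding="utf-8"?><doc/>'))
print(declared_xml_version('<doc/>'))
```

The helper deliberately does not reject unknown version strings, since the point is that nothing can be assumed about them.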
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
--On 17. Januar 2007 22:49:11 +0100 Dieter Maurer <[EMAIL PROTECTED]> wrote:
> Andreas Jung wrote at 2007-1-17 17:48 +0100:
>> ...
>> So Martijn's and my proposal remain. They are not very different. In the end the behavior is almost identical. But I will adopt your suggestion to remove the preamble when storing the data internally (basically to avoid a possible encoding ambiguity).
>
> In future times, the preamble might contain information which should not be dropped, e.g. when there is an XML version different from "1.0".

We're faster with new Zope versions than the W3C with any standard.

> For PageTemplates, we know that the encoding information is probably not relevant after the parsing -- unless we want to use it as a default for the "Content-Type" charset but I doubt that this is a good thing.
>
> If the "Content-Type"'s charset is given explicitly, then the "encoding" of the XML declaration needs to be adapted to this value during the serialization anyway -- thus overriding any "encoding" present there.

?

-aj
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Andreas Jung wrote at 2007-1-17 17:48 +0100:
> ...
>So Martijn's and my proposal remain. They are not very different. In the end the behavior is almost identical. But I will adopt your suggestion to remove the preamble when storing the data internally (basically to avoid a possible encoding ambiguity).

In future times, the preamble might contain information which should not be dropped, e.g. when there is an XML version different from "1.0".

For PageTemplates, we know that the encoding information is probably not relevant after the parsing -- unless we want to use it as a default for the "Content-Type" charset but I doubt that this is a good thing.

If the "Content-Type"'s charset is given explicitly, then the "encoding" of the XML declaration needs to be adapted to this value during the serialization anyway -- thus overriding any "encoding" present there.

--
Dieter
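Adapting the declaration to the output charset during serialization could look roughly like the sketch below. It is written in Python 3 and is not the ZPublisher or PageTemplate code; the function `serialize` and its regular expressions are hypothetical, and `charset` is assumed to be a name that is both a valid Python codec and a valid XML encoding name (such as "iso-8859-1").

```python
import re

# Hypothetical patterns for illustration only.
_DECL = re.compile(r'^<\?xml(?P<body>\s[^>]*?)\?>')
_ENC = re.compile(r'\s+encoding\s*=\s*(["\'])[^"\']*\1')
_VER = re.compile(r'\s+version\s*=\s*(["\'])[^"\']*\1')


def serialize(text, charset):
    """Encode unicode XML text so that its declaration agrees with `charset`."""
    new_enc = ' encoding="%s"' % charset
    match = _DECL.match(text)
    if match is None:
        # No declaration at all: add one that states the output encoding.
        decl, rest = '<?xml version="1.0"%s?>\n' % new_enc, text
    else:
        body = match.group('body')
        if _ENC.search(body):
            # Override whatever encoding the stored declaration claimed.
            body = _ENC.sub(lambda m: new_enc, body, count=1)
        else:
            # The encoding must follow the version pseudo-attribute.
            body = _VER.sub(lambda m: m.group(0) + new_enc, body, count=1)
        decl, rest = '<?xml%s?>' % body, text[match.end():]
    # Raises UnicodeEncodeError if the text cannot be represented in charset.
    return (decl + rest).encode(charset)


print(serialize('<?xml version="1.0" encoding="utf-8"?><a>\xe9</a>', 'iso-8859-1'))
```

Everything else in the declaration, including a non-"1.0" version or a standalone attribute, is left as it was.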
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Martijn Faassen wrote at 2007-1-16 23:19 +0100:
>Dieter Maurer wrote:
>> Martijn Faassen wrote at 2007-1-15 15:44 +0100:
>>>
>>> I would say refusing to guess and bailing out with an error message is better in this case.
>>
>> I disagree with you.
>>
>> Logically, parsing an encoded XML document consists of two passes: decode the encoded string into unicode and reconstruct the XML info elements from the serialization.
>>
>> Traditionally, these two passes are not performed one after the other but folded together in a single pass.
>>
>> But that tradition should not prevent us from separating out the (Unicode) decoding phase. And after this phase is done, there is no ambiguity left with the "XML declaration". Its encoding attribute is simply irrelevant for the second phase (apart from generating the PI info element).
>
>That's nice as far as it goes. What if after the second phase you need to parse the XML again?
>What do you do with your encoding header then?

After the second phase, I no longer have an XML string but instead either a sequence of events (SAX style) or a tree of XML info elements (syntax tree style).

But, whatever I have, the second stage does not magically change my unicode string. It could be parsed over and over again.

>If it's irrelevant, you better strip it out before you put it into the parser.

I lose information then. The event stream or info element tree lacks the XML declaration PI then, or at least its "encoding" attribute.

The parsing process is allowed to lose some information. For example, it can lose whitespace details or the order of attributes. I don't know whether the loss or modification of "PI"s is considered acceptable. In general, this would definitely be wrong.

I have read some article in "comp.text.xml" that complained about the loss of the encoding information -- as it may be a good hint about the default encoding to be used on encoding/serialization. This means that some XML processing systems lose the information and not everyone is happy.

--
Dieter
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Martijn Faassen wrote: That's what I do too... See my post elsewhere in the thread for an example of why this is Not Good. Luckily Twiddler is still less than version 1.0 ;-) When someone reports it as a bug, I'll fix it. cheers, Chris -- Simplistix - Content Management, Zope & Python Consulting - http://www.simplistix.co.uk
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
--On 16. Januar 2007 14:12:46 +0100 Martijn Faassen <[EMAIL PROTECTED]> wrote:

I am replying to the three proposals.

First I have to kick the proposal of Tres (UTF-8 storage). We want unicode as internal representation for any kind of ZPT (both text/html and text/xml). Supporting unicode for text/html and utf-8 for text/xml would make code more complicated and lead to further unicode encoding conflicts. We're trying to solve this issue right now and I don't want to introduce a new construction site.

So Martijn's and my proposal remain. They are not very different. In the end the behavior is almost identical. But I will adopt your suggestion to remove the preamble when storing the data internally (basically to avoid a possible encoding ambiguity).

Andreas
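The storage step described above (decode on the way in, then drop the encoding information so the stored unicode text cannot contradict itself) might be sketched like this. It is an illustration in Python 3, not the ZPT implementation: `to_internal` and both patterns are hypothetical, only the encoding pseudo-attribute is removed rather than the whole preamble, and the input is assumed to be in an ASCII-compatible encoding without a byte order mark.

```python
import re

# Hypothetical patterns for illustration only. The bytes pattern can read the
# declaration before decoding because it assumes an ASCII-compatible encoding.
_ENC_BYTES = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
_ENC_TEXT = re.compile(r'^(<\?xml\s[^>]*?)\s+encoding\s*=\s*(["\'])[^"\']*\2')


def to_internal(raw):
    """Return (unicode text without encoding declaration, source encoding)."""
    match = _ENC_BYTES.match(raw)
    # XML without an encoding declaration is treated as UTF-8 here.
    encoding = match.group(1).decode('ascii') if match else 'utf-8'
    text = raw.decode(encoding)
    # Drop the now meaningless encoding pseudo-attribute, keep the rest.
    text = _ENC_TEXT.sub(r'\1', text, count=1)
    return text, encoding


uploaded = '<?xml version="1.0" encoding="iso-8859-1"?><p>\xe9</p>'.encode('iso-8859-1')
print(to_internal(uploaded))
```

Returning the source encoding alongside the text leaves room for keeping it as a separate property, which is where the two proposals in this thread differ.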
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Tres Seaver wrote:
[snip]
>> The "just store the XML" scenario is surprisingly nice. It only needs attention to encoding and decoding in the always complicated ZPublisher direct output scenario, and in the edit form scenario.
>
> As you speculated, this is actually my preference, except that I don't see the need in scenario D to recode the data and strip the prolog encoding attribute. Why wouldn't we just use the XML template's own declared encoding to encode any data substituted into the template? I mean, if the user has marked up the document to indicate a "preferred" encoding, why should we bother storing such an encoding in another location?

Yes, I was thinking along those lines too.

> Then the only time we would need to munge the document would be at inclusion time, which is the only time we actually *need* to have unicode in hand. We might even elide the decode-recode stage if the target document uses the same encoding! Such an optimization might not be worth the complexity, however.

Yes, one complexity is that trying to do this would break the assumption that ZPT templates always return unicode or pure-ascii strings, not anything else (such as encoded data). Only at the last phase of the publisher will it be encoded into something else. I really appreciate keeping this assumption in place. :)

> Note that in the inclusion case (scenario E), we almost certainly *should* be stripping the *entire* prolog, which is only valid at the start of the merged document.

If you are including it as a document, yes. If you are including it quoted, as for instance the contents of a text area allowing you to edit the XML text directly, then no. This suggests we actually have two scenarios here.

> I guess there is a subscenario, which is that the "included" document is actually the 'main_template' supplying the prolog: METAL should probably leave the prolog alone, while 'tal:replace' and 'tal:content' (with 'structure') would strip it?

Yay, another scenario. :)

Regards,

Martijn
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Dieter Maurer wrote:
> Martijn Faassen wrote at 2007-1-15 15:44 +0100:
>> Hey,
>>
>> On 1/15/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
>> [snip]
>>> ok, got it. But this problem can be solved easily by changing the encoding within the preamble.
>>
>> I would say refusing to guess and bailing out with an error message is better in this case.
>
> I disagree with you.
>
> Logically, parsing an encoded XML document consists of two passes: decode the encoded string into unicode and reconstruct the XML info elements from the serialization.
>
> Traditionally, these two passes are not performed one after the other but folded together in a single pass.
>
> But that tradition should not prevent us from separating out the (Unicode) decoding phase. And after this phase is done, there is no ambiguity left with the "XML declaration". Its encoding attribute is simply irrelevant for the second phase (apart from generating the PI info element).

That's nice as far as it goes. What if after the second phase you need to parse the XML again? What do you do with your encoding header then?

If it's irrelevant, you better strip it out before you put it into the parser.

Regards,

Martijn
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Tres Seaver wrote at 2007-1-16 10:39 -0500:
> ...
>As you speculated, this is actually my preference, except that I don't see the need in scenario D to recode the data and strip the prolog encoding attribute. Why wouldn't we just use the XML template's own declared encoding to encode any data substituted into the template?

Maybe, because an XML template "T1" using encoding "e1" uses a macro from template "T2" encoded with "e2"?

Or maybe, because in such a case some values passed into the macro (e.g. the slots) cannot be encoded in "e2"?

--
Dieter
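The second objection is easy to reproduce outside of Zope. The sketch below (Python 3, with a made-up helper `can_encode`) uses iso-8859-1 as a stand-in for the macro template's encoding "e2" and shows a value that the calling template could hold but that "e2" cannot represent.

```python
def can_encode(text, encoding):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# A slot value containing the euro sign fits a UTF-8 template ("e1") but not
# a macro template declared as iso-8859-1 (the stand-in for "e2").
slot_value = 'price: 10 \u20ac'
print(can_encode(slot_value, 'utf-8'))
print(can_encode(slot_value, 'iso-8859-1'))
```

Keeping everything as unicode until the final output step avoids having to make this check at every macro boundary.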
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Tres Seaver wrote at 2007-1-15 16:57 -0500:
> ...
>Frankly, I don't get the desire to *store* a complete XML document (as opposed to the extracted contents of attributes or nodes) as unicode

My desire comes from a simple principle: all text should be unicode. Decoding/encoding happens only at the system boundaries and no longer internally.

--
Dieter
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Martijn Faassen wrote at 2007-1-15 15:44 +0100:
>
>Hey,
>
>On 1/15/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
>[snip]
>> ok, got it. But this problem can be solved easily by changing the encoding within the preamble.
>
>I would say refusing to guess and bailing out with an error message is better in this case.

I disagree with you.

Logically, parsing an encoded XML document consists of two passes: decode the encoded string into unicode and reconstruct the XML info elements from the serialization.

Traditionally, these two passes are not performed one after the other but folded together in a single pass.

But that tradition should not prevent us from separating out the (Unicode) decoding phase. And after this phase is done, there is no ambiguity left with the "XML declaration". Its encoding attribute is simply irrelevant for the second phase (apart from generating the PI info element).

Thus, there is no guessing; someone else has just performed the first phase of the complete process -- maybe using the "encoding" attribute or some overriding information.

--
Dieter
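The two phases can be kept apart with the standard library's expat binding, as in the sketch below. It is written for Python 3 and is not zope.tal code; `parse_events` is a hypothetical name. The parser is created with an explicit encoding, which the `xml.parsers.expat` documentation describes as overriding the document's own declaration, so the declaration is still reported as an event but no longer influences decoding.

```python
from xml.parsers import expat


def parse_events(text):
    """Phase 2: turn already-decoded XML text into a list of events."""
    events = []
    # An explicit encoding overrides whatever the XML declaration says.
    parser = expat.ParserCreate(encoding="utf-8")
    parser.XmlDeclHandler = (
        lambda version, encoding, standalone: events.append(("decl", version, encoding))
    )
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name))
    parser.CharacterDataHandler = lambda data: events.append(("text", data))
    parser.Parse(text.encode("utf-8"), True)
    return events


# Phase 1: somebody decodes the bytes, here using the declared encoding.
raw = b'<?xml version="1.0" encoding="iso-8859-15"?><p>\xa4</p>'
text = raw.decode("iso-8859-15")

# Phase 2: the declaration survives as an event but is otherwise ignored.
for event in parse_events(text):
    print(event)
```

Because `text` itself is never modified, it can be parsed again as often as needed, which is the answer given in this thread to the "what if you need to parse it again" question.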
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Chris Withers wrote at 2007-1-14 18:14 +:
> ...
>The problem comes when someone sends you something like:
>
>u''
>
>What should be done then?

We parse the declaration and generate an info element for it but otherwise ignore it, as it has lost its meaning after the XML has been converted to Unicode.

--
Dieter
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Andreas Jung wrote:
> --On 15. Januar 2007 22:15:46 +0100 Martijn Faassen
[snip]
>>> I still don't see what should be ambiguous with this approach.
>>
>> Ambiguous in that the string seems to say it's in two encodings at once. You're then "guessing": you're letting the Python string type trump the declaration. Then, since we've shown that leads to bugs, you propose to actually change the encoding declaration of the XML document. I wonder what people then expect to happen upon serialization. In effect, your proposal would, I think, serialize to UTF-8 only, right? (In which case the encoding declaration can be dropped as it's the default.)
>
> When you download a ZPT through FTP/WebDAV then the unicode representation of the XML will be converted using the 'output_encoding' property of the corresponding ZPT which is set when uploading a new XML document (and taken from the preamble). So when you upload a latin1 XML file you should get it back as valid latin1 through FTP/WebDAV.

Okay, understood, this makes sense in the case of the FTP/WebDAV support, though recoding to UTF-8 and ripping off the encoding declaration would also be pretty safe in case of XML.

> When you download text/xml content through the ZPublisher then the ZPublisher will convert unicode textual content to some encoding which is either taken from an already set 'content-type: text/...; charset=X' HTTP Header or as fallback from the zpublisher-default-encoding property as defined in the zope.conf file.

And the same behavior actually applies to HTML content, right?

> So the application can specify in both cases the encoding of the serialized XML content. Where is the problem?

What I'm trying to express here is that this stuff should not be treated as "where is the problem?" but should be thought through carefully as this is extremely easy to do wrong. I'll think it through carefully here. Let's list some cases:

A) FTP download: stored XML gets downloaded through FTP/WebDAV support.

B) FTP upload: external XML gets uploaded through FTP/WebDAV

C) parse: stored XML is parsed inside of Zope by the page template engine.

D) publisher download: stored XML is downloaded as text/xml directly through the publisher

E) ZPT inclusion: stored XML is included in another page template, for instance to present it in a text area.

F) form submit: Text area is then saved and needs to be stored again.

Andreas Jung proposal (speculation)
===================================

As far as I understand it you're proposing:

* store XML as unicode text

* separately store the encoding on the page template object

* also keep the encoding="" bit in the XML preamble when storing.

Let's go through the cases

A) FTP download: encode this to whatever encoding is stored on the ZPT object using Python unicode support. No encoding mangling necessary.

B) FTP upload: read encoding="" bit and store this on ZPT. Then decode to unicode using that encoding. Could not be implemented by a parse/serialization step without extra encoding="" manipulation afterwards (after decoding to unicode).

C) parse: Rip out the 'encoding=""' bit before you send it in the parser. encode to UTF-8 just before entering the parser.

D) publisher download: Rip out the 'encoding=""' bit. Then encode according to response header (or zope.conf). Then add back encoding="" bit stating if output is non-UTF-8 (not Python names like 'latin1' but encoding identifiers XML is aware of).

E) ZPT inclusion: Send the unicode text to the page template. encoding="" bit will be presented in the editor.

F) form submit: decode to unicode according to encoding of page that displayed edit form and store it. Read 'encoding=' bit and store it in ZPT object. Don't manipulate 'encoding=""' bit in XML.

encoding="" removal: C, D
encoding="" adding: D
encoding="" reading: B, F
encode from unicode: A, C, D
decode to unicode: B, F

no encoding="" manipulation required: A, E
no recoding required: E
straightforward: E

The forms editor scenario (E and F) is potentially confusing as the user may be tempted by the ability to use encoding="" to paste latin-1 XML text. Editor could say it only wants it in whatever encoding the page is in, though.

Martijn Faassen proposal
========================

If you rip out the encoding before data is stored in the page template and then store as unicode, then we have the following cases:

A) FTP download: Encode to UTF-8, output in UTF-8 only. No encoding mangling necessary.

B) FTP upload: read encoding="" bit and decode to unicode accordingly. Rip out encoding="". Could be done by a parse/serialization step, then decode result to unicode.

C) parse: encode to UTF-8 just before entering the parser.

D) publisher download: Encode according to response header or zope.conf. Add in encoding="" if output is non-UTF-8 using XML names for encoding.

E) ZPT inclusion: send unicode text to the page template. No encoding="" bit will be in the XML presented in th
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Tres Seaver wrote:
[snip]
>>> Unicode XML is not only problematic for streaming. For instance, you *can't* pass a Unicode string to the libxml2 *at all*, unless you want a core dump. The API requires that you pass it strings encoded as UTF8.
>>
>> You can in lxml. :) libxml2 as a C API doesn't even support any unicode string type as far as I am aware.
>
> It *requires* UTF-8-encoded strings. See http://xmlsoft.org/xml.html
>
>   12. So what is this funky "xmlChar" used all the time?
>
>   It is a null terminated sequence of utf-8 characters. And only utf-8! You need to convert strings encoded in different ways to utf-8 before passing them to the API. This can be accomplished with the iconv library for instance.

Um, Tres, no need to tell me about the libxml2 API.. There is also the libxml2 *python* API, which I believe has a knob to turn on the ability to pass in unicode strings, though I haven't tried that myself.

Then there's of course lxml, which is a Python layer that requires unicode or plain-ascii strings in its DOM-ish (elementtree) API, and encoded data for the parser.

We should distinguish the behavior of libxml2 as a tree API (utf-8 all the way) and as a parser/serializer (all sorts of encodings). Generally XML libraries make a distinction between the two.

> Frankly, I don't get the desire to *store* a complete XML document (as opposed to the extracted contents of attributes or nodes) as unicode: it isn't as though it can be easily processed in that form without re-encoding (even if lxml is the one doing the re-encoding). It isn't "discourse", in the Zope3 sense of "text intended for human consumption", and the tools people use with it are all going to expect some kind of validly-encoded string.

There are objects that allow you to edit XML; the ZPT page is an example. I do not know whether it stores as unicode right now, but you can argue it's text intended for human consumption, as humans are supposed to be editing it. :)

It may indeed make more sense to store this information as UTF-8, however, from an efficiency point of view. This would probably still require recoding the data into unicode for the purposes of inspecting it and editing it.

Regards,

Martijn
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
--On 15. Januar 2007 22:15:46 +0100 Martijn Faassen <[EMAIL PROTECTED]> wrote:
> My point is that: u"Some non-ascii text" is confusing at best. One part of this says it's a unicode string, the other part says it's in encoding latin-1.

The string above would be used for internal storage but *not* for processing. Btw. this is not different from storing HTML files as unicode strings. An application must convert the unicode string back to a serialized string, either to the encoding specified inside the preamble or to a 'general' encoding that covers the whole unicode database, like utf-8, while changing the encoding inside the preamble. Both are legitimate approaches. There is no ambiguity. A smart XML parser will represent an XML document *independent* of the source encoding in the most general way (storing textual content as unicode, or at least as utf-8).

>> I still don't see what should be ambiguous with this approach.
>
> Ambiguous in that the string seems to say it's in two encodings at once. You're then "guessing": you're letting the Python string type trump the declaration. Then, since we've shown that leads to bugs, you propose to actually change the encoding declaration of the XML document. I wonder what people then expect to happen upon serialization. In effect, your proposal would, I think, serialize to UTF-8 only, right? (In which case the encoding declaration can be dropped as it's the default.)

When you download a ZPT through FTP/WebDAV then the unicode representation of the XML will be converted using the 'output_encoding' property of the corresponding ZPT which is set when uploading a new XML document (and taken from the preamble). So when you upload a latin1 XML file you should get it back as valid latin1 through FTP/WebDAV.

When you download text/xml content through the ZPublisher then the ZPublisher will convert unicode textual content to some encoding which is either taken from an already set 'content-type: text/...; charset=X' HTTP Header or as fallback from the zpublisher-default-encoding property as defined in the zope.conf file.

So the application can specify in both cases the encoding of the serialized XML content. Where is the problem?

Andreas
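The charset selection described for the publisher (use the charset of an already set Content-Type header, otherwise fall back to a configured default) can be sketched as follows. This is not the ZPublisher code; `response_charset`, `encode_body` and `DEFAULT_ENCODING` are names invented for this Python 3 example, with `DEFAULT_ENCODING` standing in for the zpublisher-default-encoding setting.

```python
from email.message import Message

# Hypothetical stand-in for the zpublisher-default-encoding setting.
DEFAULT_ENCODING = "utf-8"


def response_charset(content_type, default=DEFAULT_ENCODING):
    """Pick the charset from a Content-Type header value, else the default."""
    if not content_type:
        return default
    header = Message()
    header["Content-Type"] = content_type
    return header.get_param("charset", default)


def encode_body(text, content_type=None, default=DEFAULT_ENCODING):
    """Encode unicode output at the very last step, as the publisher would."""
    return text.encode(response_charset(content_type, default))


print(response_charset("text/xml; charset=iso-8859-15"))
print(response_charset("text/xml"))
print(encode_body("\xe9", "text/xml; charset=iso-8859-1"))
```

Note that nothing here touches the XML declaration, so a stored `encoding` attribute can end up disagreeing with the bytes actually sent, which is the concern raised elsewhere in the thread.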
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Martijn Faassen wrote: > Tres Seaver wrote: >> -BEGIN PGP SIGNED MESSAGE- >> Hash: SHA1 >> >> Andreas Jung wrote: >>> --On 14. Januar 2007 18:14:45 + Chris Withers <[EMAIL PROTECTED]> >>> wrote: >>> Dieter Maurer wrote: > A halfway intelligent parser would accept Unicode when it gets it > and concentrate on the remaining part of its task: either reporting > structural events or building a parse tree. The trivial fix I use in Twiddler is as follows: if isinstance(source,unicode): source = source.encode('utf-8') Of course, this assumes a heading of either >>> encoding="utf-8"?> or a missing encoding attribute, in which case the xml spec states that the string must be utf-8 encoded. >>> The encoding of the XML preamble should not matter when parsing a XML >>> document stored as unicode string. >> That encoding is a *lie*, which is the real problem. Parsers expect it >> to be *correct*, and if missing, expect the text to be encoded as UTF-8, >> per the spec (if the document comes from an HTTP request, then the >> application may supply the encoding from the request headers). >> >> Nothing in the XML specs allows or specifies and behavior for XML >> documents serialized as unicode, becuase such serializations are >> *programming language specific*. > > While I agree that the encoding declaration is ambiguous at best and > should be rejected, you can find a bit in the spec which supports XML as > Python unicode strings. A Python unicode string can be seen as a string > with "external character encoding information": it's the native encoding > of Python. Therefore we can make sense of it in an XML parser. For my > previous analysis of the spec see here: > > http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html > > What however is bad and evil is to just ignore conflicting encoding > declarations in an XML document itself. 
I'd choose either one of: > > * bail with a clear error when unicode is supplied at all > > * bail with a clear error when unicode is supplied with any explicit > encoding declaration in the XML. > >>> It is of importance as soon as you >>> convert the document back to a stream e.g. when we deliver the content >>> back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with >>> that by changing the encoding parameter of the preamble for XML documents >>> based on the desired output encoding. utf-8 is always a good choice however >>> other encodings like iso-8859-15 might raise UnicodeDecodeErrors. The Zope 2 >>> publisher "avoids" this problem converting the unicode result using >>> errors='replace' (which is likely something we might discuss :-)) >> Unicode XML is not only problematic for streaming. For instance, you >> *can't* pass a Unicode string to the libxml2 *at all* , unless you want >> a core dump. The API requires that you pass it strings encoded as UTF8. > > You can in lxml. :) libxml2 as a C API doesn't even support any unicode > string type as far as I am aware. It *requires* UTF-8-encoded strings. See http://xmlsoft.org/xml.html 12. So what is this funky "xmlChar" used all the time? It is a null terminated sequence of utf-8 characters. And only utf-8! You need to convert strings encoded in different ways to utf-8 before passing them to the API. This can be accomplished with the iconv library for instance. Frankly, I don't get the desire to *store* a complete XML document (as opposed to the extracted contents of attributes or nodes) as unicode: it isn't as though it can be easily processed in that form without re-encoding (even if lxml is the one doing the re-encoding). It isn't "discourse", in the Zope3 sense of "text intended for human consumption", and the tools people use with it are all going to expect some kind of validly-encoded string. Tres. 
-- 
Tres Seaver
Palladion Software "Excellence by Design" http://palladion.com
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Andreas Jung wrote: --On 15. Januar 2007 15:44:01 +0100 Martijn Faassen <[EMAIL PROTECTED]> wrote: On 1/15/07, Andreas Jung <[EMAIL PROTECTED]> wrote: [snip] ok, got it. But this problem can be solved easily by changing the encoding within the preamble. I would say refusing to guess and bailing out with an error message is better in this case. The Zen of Python: In the face of ambiguity, refuse the temptation to guess. Sorry but I don't get your point. What's happening with a XML inside a ZPT? My point is that: u"Some non-ascii text" is confusing at best. One part of this says it's a unicode string, the other part says it's in encoding latin-1. What is it? What happens to this if you recode this to, say, UTF-8? What happens to this if you parse and *then* serialize it? What does the developer expect will happen? What do users expect when they enter XML in a form and include an encoding declaration? I proposed we make nobody worry about this by simply not accepting this. - XML data encoded as XXX comes in (either by editing the XML file through the ZMI or FTP/WebDAV upload) - ZPT converts the encoded string to unicode based on the encoding in the preamble - for parsing it is up to the application to decide what to do with the data. It is not up to the editor to decide how the ZPT engine should deal with XML internally. The ZPT engine decides to serializes the unicode string as utf-8 and to fix the XML preamble (which will result in a valid XML file which should identical with the original file - except the encoding might be different). I still don't see what should ambiguous with this approach. Ambiguous in that the string seems to say it's in two encodings at once. You're then "guessing": you're letting the Python string type trump the declaration. Then, since we've shown that leads to bugs, you propose actually change the encoding declaration of the XML document. I wonder what people then expect to happen upon serialization. 
In effect, your proposal would, I think, serialize to UTF-8 only, right? (in which case the encoding declaration can be dropped as it's the default) Regards, Martijn ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Tres Seaver wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Andreas Jung wrote: --On 14. Januar 2007 18:14:45 + Chris Withers <[EMAIL PROTECTED]> wrote: Dieter Maurer wrote: A halfway intelligent parser would accept Unicode when it gets it and concentrate on the remaining part of its task: either reporting structural events or building a parse tree. The trivial fix I use in Twiddler is as follows: if isinstance(source,unicode): source = source.encode('utf-8') Of course, this assumes a heading of either or a missing encoding attribute, in which case the xml spec states that the string must be utf-8 encoded. The encoding of the XML preamble should not matter when parsing a XML document stored as unicode string. That encoding is a *lie*, which is the real problem. Parsers expect it to be *correct*, and if missing, expect the text to be encoded as UTF-8, per the spec (if the document comes from an HTTP request, then the application may supply the encoding from the request headers). Nothing in the XML specs allows or specifies and behavior for XML documents serialized as unicode, becuase such serializations are *programming language specific*. While I agree that the encoding declaration is ambiguous at best and should be rejected, you can find a bit in the spec which supports XML as Python unicode strings. A Python unicode string can be seen as a string with "external character encoding information": it's the native encoding of Python. Therefore we can make sense of it in an XML parser. For my previous analysis of the spec see here: http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html What however is bad and evil is to just ignore conflicting encoding declarations in an XML document itself. I'd choose either one of: * bail with a clear error when unicode is supplied at all * bail with a clear error when unicode is supplied with any explicit encoding declaration in the XML. It is of importance as soon as you convert the document back to a stream e.g. 
when we deliver the content back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with that by changing the encoding parameter of the preamble for XML documents based on the desired output encoding. utf-8 is always a good choice however other encodings like iso-8859-15 might raise UnicodeDecodeErrors. The Zope 2 publisher "avoids" this problem converting the unicode result using errors='replace' (which is likely something we might discuss :-)) Unicode XML is not only problematic for streaming. For instance, you *can't* pass a Unicode string to the libxml2 *at all* , unless you want a core dump. The API requires that you pass it strings encoded as UTF8. You can in lxml. :) libxml2 as a C API doesn't even support any unicode string type as far as I am aware. Regards, Martijn ___ Zope3-dev mailing list Zope3-dev@zope.org Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
--On 15. January 2007 15:44:01 +0100 Martijn Faassen <[EMAIL PROTECTED]> wrote:

> Hey,
>
> On 1/15/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
> [snip]
>> ok, got it. But this problem can be solved easily by changing the encoding within the preamble.
>
> I would say refusing to guess and bailing out with an error message is better in this case. The Zen of Python:
>
>     In the face of ambiguity, refuse the temptation to guess.

Sorry but I don't get your point. What's happening with a XML inside a ZPT?

- XML data encoded as XXX comes in (either by editing the XML file through the ZMI or FTP/WebDAV upload)

- ZPT converts the encoded string to unicode based on the encoding in the preamble

- for parsing it is up to the application to decide what to do with the data. It is not up to the editor to decide how the ZPT engine should deal with XML internally. The ZPT engine decides to serialize the unicode string as utf-8 and to fix the XML preamble (which will result in a valid XML file which should be identical with the original file - except the encoding might be different).

I still don't see what should be ambiguous with this approach.

Andreas
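The first two steps of the list above (encoded XML comes in, then gets converted to unicode based on the preamble) might be sketched as follows. This is only an illustration in Python 3, where the `unicode` type of the thread is called `str`; the function name `decode_xml` and the regular expression are assumptions made for this sketch and are not taken from ZPT.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of an ASCII-compatible byte string. UTF-16 input and byte order
# marks are deliberately not handled in this sketch.
_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def decode_xml(data, default="utf-8"):
    """Decode encoded XML bytes to text using the declared encoding.

    Falls back to UTF-8 when no encoding is declared, which is what the
    XML spec prescribes in the absence of other information.
    """
    match = _ENCODING.match(data)
    encoding = match.group(1).decode("ascii") if match else default
    return data.decode(encoding)


# The euro sign is byte 0xA4 in iso-8859-15.
incoming = b'<?xml version="1.0" encoding="iso-8859-15"?><doc>\xa4</doc>'
text = decode_xml(incoming)
```

Note that `text` still carries the original `encoding="iso-8859-15"` declaration even though it is now a unicode string, which is exactly the situation the rest of the thread argues about.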
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Chris Withers wrote:
> Philipp von Weitershausen wrote:
>>> u''
>>>
>>> What should be done then?
>>
>> Not sure. We could ignore it or raise an error. I'm inclined to ignore it.
>
> That's what I do too...

See my post elsewhere in the thread for an example of why this is Not Good.

Regards,

Martijn
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Andreas Jung wrote: > > --On 14. Januar 2007 18:14:45 + Chris Withers <[EMAIL PROTECTED]> > wrote: > >> Dieter Maurer wrote: >>> A halfway intelligent parser would accept Unicode when it gets it >>> and concentrate on the remaining part of its task: either reporting >>> structural events or building a parse tree. >> The trivial fix I use in Twiddler is as follows: >> >> if isinstance(source,unicode): >>source = source.encode('utf-8') >> >> Of course, this assumes a heading of either > encoding="utf-8"?> or a missing encoding attribute, in which case the xml >> spec states that the string must be utf-8 encoded. > > The encoding of the XML preamble should not matter when parsing a XML > document stored as unicode string. That encoding is a *lie*, which is the real problem. Parsers expect it to be *correct*, and if missing, expect the text to be encoded as UTF-8, per the spec (if the document comes from an HTTP request, then the application may supply the encoding from the request headers). Nothing in the XML specs allows or specifies and behavior for XML documents serialized as unicode, becuase such serializations are *programming language specific*. > It is of importance as soon as you > convert the document back to a stream e.g. when we deliver the content > back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with > that by changing the encoding parameter of the preamble for XML documents > based on the desired output encoding. utf-8 is always a good choice however > other encodings like iso-8859-15 might raise UnicodeDecodeErrors. The Zope 2 > publisher "avoids" this problem converting the unicode result using > errors='replace' (which is likely something we might discuss :-)) Unicode XML is not only problematic for streaming. For instance, you *can't* pass a Unicode string to the libxml2 *at all* , unless you want a core dump. The API requires that you pass it strings encoded as UTF8. Tres. 
-- 
Tres Seaver
Palladion Software "Excellence by Design" http://palladion.com
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Hey,

On 1/15/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
[snip]
> ok, got it. But this problem can be solved easily by changing the encoding within the preamble.

I would say refusing to guess and bailing out with an error message is better in this case. The Zen of Python:

    In the face of ambiguity, refuse the temptation to guess.

applies very much in this case in my opinion. Changing the preamble is too much like "do what I mean" to me - do we really know the developer actually had any clue what they were doing when they somehow created this unicode string with an encoding declaration? I'm not even sure I know what it *means* to have a unicode serialized XML string with an encoding declaration.

I think we already have code in lxml that we can base the refusal to guess on.

Regards,

Martijn
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
--On 15. January 2007 14:52:42 +0100 Martijn Faassen <[EMAIL PROTECTED]> wrote:

> Hey,
>
> Gmane isn't updating so I can't really reply to the message (not visible in gmane) that I want to, but I saw the following solution proposed:
>
>     def ourparse(text):
>         if isinstance(text, unicode):
>             text = text.encode('UTF-8')
>         xml_parser.parse(text)
>
> now consider what will happen if you do the following:
>
>     text = u"Some non-ascii characters here"
>     ourparse(text)
>
> what will happen is that text is converted to a UTF-8 string (8-bit ascii). It's then passed to a hopefully compliant XML parser. This XML parser sees an 8-bit ascii string, and checks the encoding header for more information on the encoding of the string. It will therefore assume the string is in latin-1. The parse will break with an obscure error and the developer doing this is probably very confused.

ok, got it. But this problem can be solved easily by changing the encoding within the preamble.

-aj
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Hey,

Gmane isn't updating so I can't really reply to the message (not visible in gmane) that I want to, but I saw the following solution proposed:

    def ourparse(text):
        if isinstance(text, unicode):
            text = text.encode('UTF-8')
        xml_parser.parse(text)

now consider what will happen if you do the following:

    text = u"Some non-ascii characters here"
    ourparse(text)

what will happen is that text is converted to a UTF-8 string (8-bit ascii). It's then passed to a hopefully compliant XML parser. This XML parser sees an 8-bit ascii string, and checks the encoding header for more information on the encoding of the string. It will therefore assume the string is in latin-1. The parse will break with an obscure error and the developer doing this is probably very confused.

This is why it's better to refuse to guess.

Regards,

Martijn
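The mismatch described above can be reproduced with a small sketch. It is written for Python 3 (where `unicode` is `str`) and uses `xml.parsers.expat` directly; the helper `character_data` and the sample document are hypothetical, since the archived literal has lost its markup. The sample declares ISO-8859-1 but is handed to the parser as UTF-8 bytes, which is what the proposed `ourparse()` would do.

```python
from xml.parsers import expat


def character_data(data):
    """Parse encoded XML bytes and return the concatenated character data."""
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


# Hypothetical unicode document whose declaration claims latin-1.
source = '<?xml version="1.0" encoding="iso-8859-1"?><doc>\u00fc</doc>'

# What the proposed fix does: encode as UTF-8, leave the declaration alone.
# The parser trusts the declaration and decodes the UTF-8 bytes as latin-1.
naive = character_data(source.encode("utf-8"))

# Encoding consistently with the declaration round-trips correctly.
consistent = character_data(source.encode("iso-8859-1"))
```

With a latin-1 declaration this sketch does not fail at all: `naive` silently comes back as two wrong characters instead of one `ü`, because every byte is valid latin-1. Other declared encodings can instead produce a parse error, so the outcome depends on the declaration, which arguably makes the problem harder to spot.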
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
--On 15. January 2007 13:26:16 +0100 Martijn Faassen <[EMAIL PROTECTED]> wrote:

> How would you propose to parse the following unicode string?
>
> u""

If your parser is unicode-aware then the encoding of the preamble does not matter, since you already have unicode internally and can process your file totally on XML. If your parser isn't unicode-aware then you will likely convert it to utf-8 and work internally with utf-8 encoded strings. In fact xml.parsers.expat seems to support unicode (it can return unicode strings to the handlers, see the 'returns_unicode' property). However, you need to reconstruct the XML preamble when you reconstruct your XML from the parsed data.

Or am I missing something?

Andreas
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Philipp von Weitershausen wrote:
[snip]
>> A workaround inside parseString() would be to check for unicode and convert the string on-the-fly to a Python string with utf-8 encoding. This is possibly a limitation of the underlying Expat parser...any recommendation how to deal with this issue?
>
> Fixed it in 3.3 and trunk. If you had given me a bit more time, this could even have been in 2.10.2b :). Oh well, I guess that's what 2.10.2 will be for ;)

What did you fix? Please see my posting for a dangerous ambiguity:

u""

Regards,

Martijn
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Andreas Jung wrote:
[snip]
>> [Bernd Dorn]
>> IMHO it should only accept strings, because in the value should be a xml string and therefore always has to be encoded in 'utf-8' or in the encoding specified in the processing instruction.
>
> I disagree with that. Since Zope 3 is supposed to use unicode internally (at least that's the legend) it should support unicode also at the parser level. Other languages like Java store XML also as unicode strings and support parsing it.

Bernd Dorn raises a good point though, and it's one you need to think about carefully.

To say "languages like Java store XML also as unicode" is rather ambiguous. While I'm not aware of the details of Java, serialized XML is typically stored in some encoded form, most commonly UTF-8 (the default 8 bit encoding), but latin 1 is also supported, and there are also multi-byte encodings. *Parsed* XML exposed through a DOM is exposed as unicode strings. I'm sure Java supports this usage pattern, as naturally files on disk need to be parsable.

Here you are talking about parsing XML, so maintaining the position that this should be encoded is a reasonable one. This is how for instance the Python ElementTree operates (parse encoded, expose API as unicode (or pure ascii)), and this has been designed by Fredrik Lundh, who, as you may know, was instrumental in developing Python's unicode support.

How would you propose to parse the following unicode string?

u""

If you are going to allow the parsing of unicode strings, I would strongly recommend *rejecting* any unicode string that itself declares an encoding as ambiguous: refuse to guess.

With lxml (which is an extension of the ElementTree API) we've taken the latter option: it's possible to pass a unicode string into the parser, but if that contains an encoding declaration, there will be an error. Underneath we actually re-encode this string back to UTF-8, as that's what the libxml2 parser expects.
We made this change with the objections of Fredrik Lundh by the way - we felt user errors would be mostly prevented because it refuses to guess.

Regards,

Martijn
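A "refuse to guess" wrapper in the spirit of the behaviour described above could look roughly like this. It is a sketch for Python 3 built on `xml.parsers.expat`, not lxml's actual code; the function name, the regular expression and the error message are all assumptions made for illustration.

```python
import re
from xml.parsers import expat

# True when text starts with an XML declaration that names an encoding.
_DECLARES_ENCODING = re.compile(r'<\?xml[^>]*?\sencoding\s*=')


def parse_refusing_to_guess(source):
    """Parse XML and return its character data.

    Bytes are passed through untouched, so the parser can honour their
    declaration. Text is accepted only when it declares no encoding; it is
    then encoded as UTF-8, the default the parser assumes.
    """
    if isinstance(source, str):
        if _DECLARES_ENCODING.match(source):
            raise ValueError(
                "text input must not carry an encoding declaration")
        source = source.encode("utf-8")
    chunks = []
    parser = expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(source, True)
    return "".join(chunks)


ok_text = parse_refusing_to_guess('<doc>\u00fc</doc>')
ok_bytes = parse_refusing_to_guess(
    b'<?xml version="1.0" encoding="iso-8859-1"?><doc>\xfc</doc>')
```

Passing `'<?xml version="1.0" encoding="iso-8859-1"?><doc/>'` as text raises `ValueError` here, so the caller has to decide explicitly whether the declaration or the string type should win.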
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Philipp von Weitershausen wrote:
>> u''
>>
>> What should be done then?
>
> Not sure. We could ignore it or raise an error. I'm inclined to ignore it.

That's what I do too...

Chris

-- 
Simplistix - Content Management, Zope & Python Consulting
           - http://www.simplistix.co.uk
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
On 14 Jan 2007, at 19:14 , Chris Withers wrote:
> Dieter Maurer wrote:
>> A halfway intelligent parser would accept Unicode when it gets it and concentrate on the remaining part of its task: either reporting structural events or building a parse tree.
>
> The trivial fix I use in Twiddler is as follows:
>
>     if isinstance(source,unicode):
>         source = source.encode('utf-8')

It's the same fix I used.

> Of course, this assumes a heading of either <?xml version="1.0" encoding="utf-8"?> or a missing encoding attribute, in which case the xml spec states that the string must be utf-8 encoded.
>
> The problem comes when someone sends you something like:
>
> u''
>
> What should be done then?

Not sure. We could ignore it or raise an error. I'm inclined to ignore it.
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
--On 14. January 2007 18:14:45 +0000 Chris Withers <[EMAIL PROTECTED]> wrote:

> Dieter Maurer wrote:
>> A halfway intelligent parser would accept Unicode when it gets it and concentrate on the remaining part of its task: either reporting structural events or building a parse tree.
>
> The trivial fix I use in Twiddler is as follows:
>
>     if isinstance(source,unicode):
>         source = source.encode('utf-8')
>
> Of course, this assumes a heading of either <?xml version="1.0" encoding="utf-8"?> or a missing encoding attribute, in which case the xml spec states that the string must be utf-8 encoded.

The encoding of the XML preamble should not matter when parsing a XML document stored as unicode string. It is of importance as soon as you convert the document back to a stream, e.g. when we deliver the content back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with that by changing the encoding parameter of the preamble for XML documents based on the desired output encoding. utf-8 is always a good choice; however, other encodings like iso-8859-15 might raise UnicodeDecodeErrors. The Zope 2 publisher "avoids" this problem by converting the unicode result using errors='replace' (which is likely something we might discuss :-))

Andreas
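Rewriting the preamble for the desired output encoding, as described above, might be sketched like this. It is not ZPublisher's code: the function `serialize`, its regular expression and the choice to drop any `standalone` attribute are assumptions for this Python 3 illustration. In Python 3 the failure for unencodable characters is a `UnicodeEncodeError`.

```python
import re

# An XML declaration at the start of the text, capturing its version.
_DECLARATION = re.compile(
    r'^<\?xml\s+version\s*=\s*(["\'])(?P<version>[^"\']+)\1[^>]*\?>'
)


def serialize(text, encoding="utf-8", errors="strict"):
    """Encode unicode XML, replacing the declaration to match the output.

    The declared version is preserved (defaulting to 1.0 when there is no
    declaration); any other pseudo-attributes are dropped in this sketch.
    """
    match = _DECLARATION.match(text)
    version = match.group("version") if match else "1.0"
    body = text[match.end():] if match else text
    declaration = '<?xml version="%s" encoding="%s"?>' % (version, encoding)
    return (declaration + body).encode(encoding, errors)


euro = serialize(
    '<?xml version="1.0" encoding="utf-8"?><doc>\u20ac</doc>', "iso-8859-15")

# A character outside the target charset can be kept as a numeric
# character reference instead of being replaced by "?".
cjk = serialize('<doc>\u4e2d</doc>', "iso-8859-15", errors="xmlcharrefreplace")
```

With the default `errors="strict"`, the second call would raise instead. `errors="replace"` would silently turn the character into `?`, whereas `xmlcharrefreplace` keeps the information at the cost of a longer document; which of these is right for the publisher is the open question raised in the message.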
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
On 14 Jan 2007, at 18:37 , Dieter Maurer wrote:
> Philipp von Weitershausen wrote at 2007-1-14 14:59 +0100:
>> ...
>> Traditionally, you parse an 8bit string, figure out its encoding (e.g. from the XML declaration) and return some representation of that XML with unicode data. That's why it's actually quite ok for XML parsers to only accept string data.
>
> Parsing usually means rebuilding the structure from a text string and *NOT* encoding guessing or Unicode decoding.
>
> Therefore, it is actually quite stupid for a parser to try to encode an already decoded string (i.e. a Unicode string) only that it can guess the encoding ;-)
>
> A halfway intelligent parser would accept Unicode when it gets it and concentrate on the remaining part of its task: either reporting structural events or building a parse tree.

Yes, I agree. Unfortunately, expat isn't smart enough, which caused this whole discussion.
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Dieter Maurer wrote:
> A halfway intelligent parser would accept Unicode when it gets it and concentrate on the remaining part of its task: either reporting structural events or building a parse tree.

The trivial fix I use in Twiddler is as follows:

    if isinstance(source,unicode):
        source = source.encode('utf-8')

Of course, this assumes a heading of either <?xml version="1.0" encoding="utf-8"?> or a missing encoding attribute, in which case the xml spec states that the string must be utf-8 encoded.

The problem comes when someone sends you something like:

u''

What should be done then?

Chris

-- 
Simplistix - Content Management, Zope & Python Consulting
           - http://www.simplistix.co.uk
Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Philipp von Weitershausen wrote at 2007-1-14 14:59 +0100:
> ...
> Traditionally, you parse an 8bit string, figure out its encoding (e.g. from the XML declaration) and return some representation of that XML with unicode data. That's why it's actually quite ok for XML parsers to only accept string data.

Parsing usually means rebuilding the structure from a text string and *NOT* encoding guessing or Unicode decoding.

Therefore, it is actually quite stupid for a parser to try to encode an already decoded string (i.e. a Unicode string) only so that it can guess the encoding ;-)

A halfway intelligent parser would accept Unicode when it gets it and concentrate on the remaining part of its task: either reporting structural events or building a parse tree.

-- 
Dieter
[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode
Andreas Jung wrote:
> Hi,
>
> the XMLParser.parseString() method raises an exception
>
>   File "/opt/python-2.4.4/lib/python2.4/unittest.py", line 260, in run
>     testMethod()
>   File "/Users/ajung_data/sandboxes/Zope/Zope/lib/python/zope/tal/tests/test_xmlparser.py", line 127, in test_xx
>     self._run_check(xml, ())
>   File "/Users/ajung_data/sandboxes/Zope/Zope/lib/python/zope/tal/tests/test_xmlparser.py", line 106, in _run_check
>     parser.parseString(source)
>   File "/Users/ajung_data/sandboxes/Zope/Zope/lib/python/zope/tal/xmlparser.py", line 77, in parseString
>     self.parser.Parse(s, 1)
> UnicodeEncodeError: 'ascii' codec can't encode characters in position 43-48: ordinal not in range(128)
>
> if the string to be parsed is a unicode string and contains some non-ascii chars. The following snippet from a private unittest (test_xmlparsers.py) shows the error.
>
>     def test_xx(self):
>         xml = unicode('encoding="utf-8"?>üöä', 'iso-8859-15')
>         self._run_check(xml, ())
>
> I am not sure if this behavior is intentional?! Is the XMLParser supposed to deal with unicode strings or will it only accept a standard Python string?

Traditionally, you parse an 8bit string, figure out its encoding (e.g. from the XML declaration) and return some representation of that XML with unicode data. That's why it's actually quite ok for XML parsers to only accept string data.

With ZPTs it's a bit different: When editing ZPTs TTW for example, we like to store its source in unicode. So it makes sense for us to be able to parse unicode input as XML.

> A workaround inside parseString() would be to check for unicode and convert the string on-the-fly to a Python string with utf-8 encoding. This is possibly a limitation of the underlying Expat parser...any recommendation how to deal with this issue?

Fixed it in 3.3 and trunk. If you had given me a bit more time, this could even have been in 2.10.2b :).
Oh well, I guess that's what 2.10.2 will be for ;)

-- 
http://worldcookery.com -- Professional Zope documentation and training
2nd edition of Web Component Development with Zope 3 is now shipping!