Subject: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode


Tres Seaver wrote:
[snip]
The "just store the XML" scenario is surprisingly nice. It only needs
attention to encoding and decoding in the always complicated ZPublisher
direct output scenario, and in the edit form scenario.


As you speculated, this is actually my preference, except that I don't
see the need in scenario D to recode the data and strip the prolog
encoding attribute.  Why wouldn't we just use the XML template's own
declared encoding to encode any data substituted into the template?  I
mean, if the user has marked up the document to indicate a "preferred"
encoding, why should we bother storing such an encoding in another location?


Yes, I was thinking along those lines too.


Then the only time we would need to munge the document would be at
inclusion time, which is the only time we actually *need* to have
unicode in hand.  We might even elide the decode-recode stage if the
target document uses the same encoding!  Such an optimization might
not be worth the complexity, however.


Yes, one complexity is that trying to do this would break the assumption 
that ZPT templates always return unicode or pure-ascii strings, not 
anything else (such as encoded data). Only at the last phase of the 
publisher will it be encoded into something else. I really appreciate 
keeping this assumption in place. :)



Note that in the inclusion case (scenario E), we almost certainly
*should* be stripping the *entire* prolog, which is only valid at the
start of the merged document. 


If you are including it as a document, yes. If you are including it
quoted, as for instance the contents of a text area allowing you to edit 
the XML text directly, then no. This suggests we actually have two 
scenarios here.



I guess there is a subscenario, which is
that the "included" document is actually the 'main_template' supplying
the prolog:  METAL should perhaps leave the prolog alone, while
'tal:replace' and 'tal:content' (with 'structure') would strip it?


Yay, another scenario. :)

Regards,

Martijn

___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com

[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode


Dieter Maurer wrote:

Martijn Faassen wrote at 2007-1-15 15:44 +0100:


Hey,

On 1/15/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
[snip]

ok, got it. But this problem can be solved easily by changing the encoding
within the preamble.

I would say refusing to guess and bailing out with an error message is
better in this case.


I disagree with you.

  Logically, parsing an encoded XML document consists of two
  passes: decode the encoded string into unicode and reconstruct
  the XML info elements from the serialization.

  Traditionally, these two passes are not performed one after
  the other but folded together in a single pass.
  
  But that tradition should not prevent us from separating out the
  (Unicode) decoding phase. And after this phase is done,
  there is no ambiguity left with the "XML declaration".
  Its encoding attribute is simply irrelevant for the second phase
  (apart from generating the PI info element).


That's nice as far as it goes. What if after the second phase you need 
to parse the XML again? What do you do with your encoding header then? 
If it's irrelevant, you better strip it out before you put it into the 
parser.
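Stripping the encoding pseudo-attribute before re-parsing could look roughly like the sketch below. The helper name `strip_declared_encoding` and the regular expression are illustrative assumptions, not code from zope.tal or lxml, and the sketch only handles a declaration at the very start of the text.

```python
import re

# Hypothetical helper, for illustration only. Captures the part of the XML
# declaration that precedes its encoding pseudo-attribute.
_ENCODING_ATTR = re.compile(
    r'''^(<\?xml\b[^>]*?)\s+encoding\s*=\s*(["'])[^"']*\2'''
)


def strip_declared_encoding(text):
    """Remove the encoding pseudo-attribute from a leading XML declaration.

    Text without such a declaration is returned unchanged.
    """
    return _ENCODING_ATTR.sub(r'\1', text, count=1)
```

With the pseudo-attribute gone, the text can be encoded to UTF-8 and parsed again without the declaration contradicting the actual encoding.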


Regards,

Martijn


Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Tres Seaver wrote at 2007-1-16 10:39 -0500:
> ...
>As you speculated, this is actually my preference, except that I don't
>see the need in scenario D to recode the data and strip the prolog
>encoding attribute.  Why wouldn't we just use the XML template's own
>declared encoding to encode any data substituted into the template?

Maybe because an XML template "T1" using encoding "e1"
uses a macro from template "T2" encoded with "e2"?

Or maybe because in such a case some values passed into the macro
(e.g. the slots) cannot be encoded in "e2"?



-- 
Dieter

Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Tres Seaver wrote at 2007-1-15 16:57 -0500:
> ...
>Frankly, I don't get the desire to *store* a complete XML document (as
>opposed to the extracted contents of attributes or nodes) as unicode

My desire comes from a simple principle: all text should be unicode.

Decoding/encoding happens only at the system boundaries
and no longer internally.



-- 
Dieter

Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Martijn Faassen wrote at 2007-1-15 15:44 +0100:
> 
>Hey,
>
>On 1/15/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
>[snip]
>> ok, got it. But this problem can be solved easily by changing the encoding
>> within the preamble.
>
>I would say refusing to guess and bailing out with an error message is
>better in this case.

I disagree with you.

  Logically, parsing an encoded XML document consists of two
  passes: decode the encoded string into unicode and reconstruct
  the XML info elements from the serialization.

  Traditionally, these two passes are not performed one after
  the other but folded together in a single pass.

  But that tradition should not prevent us from separating out the
  (Unicode) decoding phase. And after this phase is done,
  there is no ambiguity left with the "XML declaration".
  Its encoding attribute is simply irrelevant for the second phase
  (apart from generating the PI info element).

  Thus, there is no guessing; someone else has just performed
  the first phase of the complete process -- maybe using the
  "encoding" attribute or some overriding information.
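The first of the two phases described above can be sketched on its own. This is a minimal illustration under stated assumptions: the name `decode_xml`, the `override` parameter and the sniffing regular expression are hypothetical, and the sketch handles only ASCII-compatible encodings (no BOM or UTF-16 detection).

```python
import re

# Hypothetical helper: find the encoding pseudo-attribute of a leading
# XML declaration in a byte string.
_DECLARED = re.compile(
    rb'''^<\?xml\b[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)


def decode_xml(data, override=None):
    """Phase 1: turn encoded XML bytes into text.

    `override` stands for the "overriding information" (for example an
    HTTP charset). Without it the declared encoding is used, falling
    back to UTF-8 when nothing is declared.
    """
    if override is not None:
        return data.decode(override)
    match = _DECLARED.match(data)
    encoding = match.group(1).decode('ascii') if match else 'utf-8'
    return data.decode(encoding)
```

Once this phase has run, the result is plain text, and in the view argued here the encoding named in the declaration no longer carries information for the second phase.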

-- 
Dieter

Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Chris Withers wrote at 2007-1-14 18:14 +:
> ...
>The problem comes when someone sends you something like:
>
>u''
>
>What should be done then?

We parse the declaration  and generate an info element
for it but otherwise ignore it as it has lost its meaning after
the XML has been converted to Unicode.



-- 
Dieter


[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode


Andreas Jung wrote:



--On 15. Januar 2007 22:15:46 +0100 Martijn Faassen 

[snip]

I still don't see what should be ambiguous with this approach.


Ambiguous in that the string seems to say it's in two encodings at once.
You're then "guessing": you're letting the Python string type trump the
declaration. Then, since we've shown that leads to bugs, you propose
to actually change the encoding declaration of the XML document. I wonder
what people then expect to happen upon serialization. In effect, your
proposal would, I think, serialize to UTF-8 only, right? (in which case
the encoding declaration can be dropped as it's the default.)


When you download a ZPT through FTP/WebDAV then the unicode representation
of the XML will be converted using the 'output_encoding' property of the
corresponding ZPT which is set when uploading a new XML document (and taken
from the preamble). So when you upload a latin1 XML file you should get
it back as valid latin1 through FTP/WebDAV.


Okay, understood, this makes sense in the case of the FTP/WebDAV 
support, though recoding to UTF-8 and ripping off the encoding 
declaration would also be pretty safe in case of XML.


When you download text/xml content through the ZPublisher then the 
ZPublisher will convert unicode textual content to some encoding which is
either taken from an already set 'content-type: text/...; charset=X'
HTTP Header or as fallback from the zpublisher-default-encoding property
as defined in the zope.conf file.


And the same behavior actually applies to HTML content, right?


So the application can specify in both cases the encoding of the serialized
XML content. Where is the problem?


What I'm trying to express here is that this stuff should not be treated 
as "where is the problem?" but should be thought through carefully as 
this is extremely easy to do wrong. I'll think it through carefully 
here. Let's list some cases:


A) FTP download: stored XML gets downloaded through FTP/WebDAV support.

B) FTP upload: external XML gets uploaded through FTP/WebDAV

C) parse: stored XML is parsed inside of Zope by the page template engine.

D) publisher download: stored XML is downloaded as text/xml directly 
through the publisher


E) ZPT inclusion: stored XML is included in another page template, for 
instance to present it in a text area.


F) form submit: Text area is then saved and needs to be stored again.

Andreas Jung proposal (speculation)
===================================

As far as I understand it you're proposing:

* store XML as unicode text

* separately store the encoding on the page template object

* also keep the encoding="" bit in the XML preamble when storing.

Let's go through the cases

A) FTP download: encode this to whatever encoding is stored on the ZPT 
object using Python unicode support. No encoding mangling necessary.


B) FTP upload: read encoding="" bit and store this on ZPT. Then decode 
to unicode using that encoding. Could not be implemented by a 
parse/serialization step without extra encoding="" manipulation 
afterwards (after decoding to unicode).


C) parse: Rip out the 'encoding=""' bit before you send it into the
parser. Encode to UTF-8 just before entering the parser.


D) publisher download: Rip out the 'encoding=""' bit. Then encode 
according to response header (or zope.conf). Then add back encoding="" 
bit stating if output is non-UTF-8 (not Python names like 'latin1' but 
encoding identifiers XML is aware of).


E) ZPT inclusion: Send the unicode text to the page template. 
encoding="" bit will be presented in the editor.


F) form submit: decode to unicode according to encoding of page that 
displayed edit form and store it. Read 'encoding=' bit and store it in 
ZPT object. Don't manipulate 'encoding=""' bit in XML.


encoding="" removal: C, D
encoding="" adding: D
encoding="" reading: B, F
encode from unicode: A, C, D
decode to unicode: B, F

no encoding="" manipulation required: A, E
no recoding required: E
straightforward: E

The forms editor scenario (E and F) is potentially confusing as the user 
may be tempted by the ability to use encoding="" to paste latin-1 XML 
text. Editor could say it only wants it in whatever encoding the page is 
in, though.
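Cases A and B of this proposal can be sketched as follows. `StoredXMLTemplate` is a hypothetical stand-in, not the real Zope page template class, and the declaration-sniffing regular expression is an assumption made for the example.

```python
import re

_DECLARED = re.compile(
    r'''^<\?xml\b[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)


class StoredXMLTemplate:
    """Hypothetical stand-in for a ZPT object holding XML as text."""

    def __init__(self):
        self.text = ''                   # unicode text, declaration kept as-is
        self.output_encoding = 'utf-8'   # stored separately on the object

    def upload(self, data):
        """Case B: read the declared encoding, remember it, then decode."""
        head = data[:200].decode('ascii', 'replace')
        match = _DECLARED.match(head)
        self.output_encoding = match.group(1) if match else 'utf-8'
        self.text = data.decode(self.output_encoding)

    def download(self):
        """Case A: encode with the remembered encoding, declaration untouched."""
        return self.text.encode(self.output_encoding)
```

In this sketch a latin-1 upload comes back byte-for-byte identical on download, which is the round trip the proposal aims for; cases C and D would additionally need to rewrite the declaration.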


Martijn Faassen proposal
========================

If you rip out the encoding before data is stored in the page template 
and then store as unicode, then we have the following cases:


A) FTP download: Encode to UTF-8, output in UTF-8 only. No encoding 
mangling necessary.


B) FTP upload: read encoding="" bit and decode to unicode accordingly. 
Rip out encoding="". Could be done by a parse/serialization step, then 
decode result to unicode.


C) parse: encode to UTF-8 just before entering the parser.

D) publisher download: Encode according to response header or zope.conf. 
Add in encoding="" if output is non-UTF-8 using XML names for encoding.


E) ZPT inclusion: send unicode text to the page template. No encoding="" 
bit will be in the XML presented in th

[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode


Tres Seaver wrote:
[snip]

Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump.  The API requires that you pass it strings encoded as UTF8.

You can in lxml. :) libxml2 as a C API doesn't even support any unicode
string type as far as I am aware.


It *requires* UTF-8-encoded strings.  See http://xmlsoft.org/xml.html



  12. So what is this funky "xmlChar" used all the time?

  It is a null terminated sequence of utf-8 characters. And only
  utf-8! You need to convert strings encoded in different ways to
  utf-8 before passing them to the API. This can be accomplished
  with the iconv library for instance.


Um, Tres, no need to tell me about the libxml2 API..

There is also the libxml2 *python* API, which I believe has a knob to
turn on the ability to pass in unicode strings, though I haven't tried
that myself. Then there's of course lxml, which is a Python layer that
requires unicode or plain-ascii strings in its DOM-ish (elementtree)
API, and encoded data for the parser.


We should distinguish the behavior of libxml2 as a tree API (utf-8 all 
the way) and as a parser/serializer (all sorts of encodings). Generally 
XML libraries make a distinction between the two.
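The same split between tree API and parser/serializer can be seen in a small example. It uses Python's standard-library `xml.etree.ElementTree` as a stand-in (an assumption made for illustration; it is not libxml2 or lxml, and it is written for Python 3 rather than the Python 2 of this thread).

```python
import xml.etree.ElementTree as ET

latin1_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\u00e9</doc>'
).encode('iso-8859-1')

# Parser: consumes encoded bytes and honours the declared encoding.
root = ET.fromstring(latin1_bytes)

# Tree API: element text is a plain Python string with no encoding attached.
tree_text = root.text

# Serializer: the caller chooses the output encoding.
as_utf8 = ET.tostring(root, encoding='utf-8')
as_latin1 = ET.tostring(root, encoding='iso-8859-1')
```

Only the two ends deal with byte encodings; everything in between works on decoded text.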



Frankly, I don't get the desire to *store* a complete XML document (as
opposed to the extracted contents of attributes or nodes) as unicode:
it isn't as though it can be easily processed in that form without
re-encoding (even if lxml is the one doing the re-encoding).  It isn't
"discourse", in the Zope3 sense of "text intended for human
consumption", and the tools people use with it are all going to expect
some kind of validly-encoded string.


There are objects that allow you to edit XML; the ZPT page is an 
example. I do not know whether it stores as unicode right now, but you 
can argue it's text intended for human consumption, as humans are 
supposed to be editing it. :)


From an efficiency point of view, however, it may indeed make more sense
to store this information as UTF-8. This would probably still require
recoding the data into unicode for the purposes of inspecting it and
editing it.


Regards,

Martijn


Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode




--On 15. Januar 2007 22:15:46 +0100 Martijn Faassen 
<[EMAIL PROTECTED]> wrote:


My point is that:

u'<?xml version="1.0" encoding="latin-1"?>...Some non-ascii text...'

is confusing at best. One part of this says it's a unicode string, the
other part says it's in encoding latin-1.


The string above would be used for internal storage but *not* for
processing. Btw. this is no different from storing HTML files as unicode
strings. An application must convert the unicode string back to a serialized
string - either to the encoding specified inside the preamble or to a
'general' encoding (one that covers the whole unicode database) like utf-8,
changing the encoding inside the preamble accordingly - both are legitimate
approaches.

There is no ambiguity. A smart XML parser will represent an XML document
*independent* of the source encoding in the most general way (storing textual
content as unicode, or utf-8 at least).


I still don't see what should be ambiguous with this approach.


Ambiguous in that the string seems to say it's in two encodings at once.
You're then "guessing": you're letting the Python string type trump the
declaration. Then, since we've shown that leads to bugs, you propose
to actually change the encoding declaration of the XML document. I wonder
what people then expect to happen upon serialization. In effect, your
proposal would, I think, serialize to UTF-8 only, right? (in which case
the encoding declaration can be dropped as it's the default.)


When you download a ZPT through FTP/WebDAV then the unicode representation
of the XML will be converted using the 'output_encoding' property of the
corresponding ZPT which is set when uploading a new XML document (and taken
from the preamble). So when you upload a latin1 XML file you should get it
back as valid latin1 through FTP/WebDAV.


When you download text/xml content through the ZPublisher then the 
ZPublisher will convert unicode textual content to some encoding which is
either taken from an already set 'content-type: text/...; charset=X'
HTTP Header or as fallback from the zpublisher-default-encoding property
as defined in the zope.conf file.

So the application can specify in both cases the encoding of the serialized
XML content. Where is the problem?
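The selection rule described for the ZPublisher (charset from an already set Content-Type header, otherwise a configured default) might be sketched like this. The function name `response_encoding` and its `default_encoding` parameter are hypothetical; the parameter stands in for the zpublisher-default-encoding setting, and this is not ZPublisher's actual code.

```python
from email.message import Message


def response_encoding(content_type, default_encoding='utf-8'):
    """Pick the charset from a Content-Type header value, else the default.

    `default_encoding` plays the role of the zpublisher-default-encoding
    setting from zope.conf in this illustration.
    """
    if not content_type:
        return default_encoding
    message = Message()
    message['Content-Type'] = content_type
    # get_content_charset() returns the lower-cased charset parameter,
    # or the fallback when the header carries no charset.
    return message.get_content_charset(default_encoding)
```

For example, `response_encoding('text/xml; charset=ISO-8859-15')` yields `'iso-8859-15'`, while a bare `'text/xml'` falls back to the default.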

Andreas



[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Tres Seaver

Martijn Faassen wrote:
> Tres Seaver wrote:
>> Andreas Jung wrote:
>>> --On 14. Januar 2007 18:14:45 + Chris Withers <[EMAIL PROTECTED]> 
>>> wrote:
>>>
>>>> Dieter Maurer wrote:
>>>>> A halfway intelligent parser would accept Unicode when it gets it
>>>>> and concentrate on the remaining part of its task: either reporting
>>>>> structural events or building a parse tree.
>>>> The trivial fix I use in Twiddler is as follows:
>>>>
>>>> if isinstance(source,unicode):
>>>>     source = source.encode('utf-8')
>>>>
>>>> Of course, this assumes a heading of either <?xml version="1.0"
>>>> encoding="utf-8"?> or a missing encoding attribute, in which case the xml
>>>> spec states that the string must be utf-8 encoded.
>>> The encoding of the XML preamble should not matter when parsing a XML
>>> document stored as unicode string.
>> That encoding is a *lie*, which is the real problem.  Parsers expect it
>> to be *correct*, and if missing, expect the text to be encoded as UTF-8,
>> per the spec (if the document comes from an HTTP request, then the
>> application may supply the encoding from the request headers).
>>
>> Nothing in the XML specs allows or specifies any behavior for XML
>> documents serialized as unicode, because such serializations are
>> *programming language specific*.
> 
> While I agree that the encoding declaration is ambiguous at best and 
> should be rejected, you can find a bit in the spec which supports XML as 
> Python unicode strings. A Python unicode string can be seen as a string 
> with "external character encoding information": it's the native encoding 
> of Python. Therefore we can make sense of it in an XML parser. For my 
> previous analysis of the spec see here:
> 
> http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html
> 
> What however is bad and evil is to just ignore conflicting encoding 
> declarations in an XML document itself. I'd choose either one of:
> 
> * bail with a clear error when unicode is supplied at all
> 
> * bail with a clear error when unicode is supplied with any explicit 
> encoding declaration in the XML.
> 
>>> It is of importance as soon as you 
>>> convert the document back to a stream e.g. when we deliver the content
>>> back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with 
>>> that by changing the encoding parameter of the preamble for XML documents 
>>> based on the desired output encoding. utf-8 is always a good choice; however,
>>> other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2
>>> publisher "avoids" this problem converting the unicode result using 
>>> errors='replace' (which is likely something we might discuss :-))
>> Unicode XML is not only problematic for streaming. For instance, you
>> *can't* pass a Unicode string to libxml2 *at all*, unless you want
>> a core dump.  The API requires that you pass it strings encoded as UTF8.
> 
> You can in lxml. :) libxml2 as a C API doesn't even support any unicode 
> string type as far as I am aware.

It *requires* UTF-8-encoded strings.  See http://xmlsoft.org/xml.html

  12. So what is this funky "xmlChar" used all the time?

  It is a null terminated sequence of utf-8 characters. And only
  utf-8! You need to convert strings encoded in different ways to
  utf-8 before passing them to the API. This can be accomplished
  with the iconv library for instance.

Frankly, I don't get the desire to *store* a complete XML document (as
opposed to the extracted contents of attributes or nodes) as unicode:
it isn't as though it can be easily processed in that form without
re-encoding (even if lxml is the one doing the re-encoding).  It isn't
"discourse", in the Zope3 sense of "text intended for human
consumption", and the tools people use with it are all going to expect
some kind of validly-encoded string.


Tres.
-- 
Tres Seaver  +1 540-429-0999  [EMAIL PROTECTED]
Palladion Software  "Excellence by Design"  http://palladion.com


[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode


Andreas Jung wrote:

--On 15. Januar 2007 15:44:01 +0100 Martijn Faassen 
<[EMAIL PROTECTED]> wrote:

On 1/15/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
[snip]

ok, got it. But this problem can be solved easily by changing the
encoding within the preamble.


I would say refusing to guess and bailing out with an error message is
better in this case. The Zen of Python:

In the face of ambiguity, refuse the temptation to guess.



Sorry but I don't get your point. What's happening with a XML inside a ZPT?


My point is that:

u'<?xml version="1.0" encoding="latin-1"?>...Some non-ascii text...'

is confusing at best. One part of this says it's a unicode string, the 
other part says it's in encoding latin-1. What is it? What happens to 
this if you recode this to, say, UTF-8? What happens to this if you 
parse and *then* serialize it? What does the developer expect will 
happen? What do users expect when they enter XML in a form and include 
an encoding declaration?
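One concrete answer to the recoding question can be shown with a parser that trusts the declaration. The example uses Python's standard-library `xml.etree.ElementTree` as a stand-in (an assumption for illustration; it is not the zope.tal parser) and Python 3 string semantics.

```python
import xml.etree.ElementTree as ET

text = '<?xml version="1.0" encoding="iso-8859-1"?><doc>caf\u00e9</doc>'

# Recode the text to UTF-8 but leave the declaration claiming iso-8859-1.
mislabelled = text.encode('utf-8')

# The parser trusts the declaration, so the two UTF-8 bytes of the e-acute
# are read as two separate Latin-1 characters.
parsed_text = ET.fromstring(mislabelled).text
```

Here `parsed_text` ends in the two characters U+00C3 and U+00A9 instead of the single e-acute: the document parses without error but the content is silently corrupted.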


I proposed we make nobody worry about this by simply not accepting this.


- XML data encoded as XXX comes in (either by editing the XML file through
  the ZMI or FTP/WebDAV upload)

- ZPT converts the encoded string to unicode based on the encoding in
  the preamble

- for parsing it is up to the application to decide what to do with the
  data. It is not up to the editor to decide how the ZPT engine should
  deal with XML internally. The ZPT engine decides to serialize the
  unicode string as utf-8 and to fix the XML preamble (which will result
  in a valid XML file which should be identical to the original file,
  except that the encoding might be different).



I still don't see what should be ambiguous with this approach.


Ambiguous in that the string seems to say it's in two encodings at once. 
You're then "guessing": you're letting the Python string type trump the 
declaration. Then, since we've shown that leads to bugs, you propose
to actually change the encoding declaration of the XML document. I wonder
what people then expect to happen upon serialization. In effect, your 
proposal would, I think, serialize to UTF-8 only, right? (in which case 
the encoding declaration can be dropped as it's the default)


Regards,

Martijn


[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

Tres Seaver wrote:

Andreas Jung wrote:
--On 14. Januar 2007 18:14:45 + Chris Withers <[EMAIL PROTECTED]>
wrote:

Dieter Maurer wrote:

A halfway intelligent parser would accept Unicode when it gets it
and concentrate on the remaining part of its task: either reporting
structural events or building a parse tree.

The trivial fix I use in Twiddler is as follows:

if isinstance(source,unicode):
    source = source.encode('utf-8')

Of course, this assumes a heading of either <?xml version="1.0"
encoding="utf-8"?> or a missing encoding attribute, in which case the xml
spec states that the string must be utf-8 encoded.

The encoding of the XML preamble should not matter when parsing a XML
document stored as unicode string.

That encoding is a *lie*, which is the real problem. Parsers expect it
to be *correct*, and if missing, expect the text to be encoded as UTF-8,
per the spec (if the document comes from an HTTP request, then the
application may supply the encoding from the request headers).

Nothing in the XML specs allows or specifies any behavior for XML
documents serialized as unicode, because such serializations are
*programming language specific*.

While I agree that the encoding declaration is ambiguous at best and
should be rejected, you can find a bit in the spec which supports XML as
Python unicode strings. A Python unicode string can be seen as a string
with "external character encoding information": it's the native encoding
of Python. Therefore we can make sense of it in an XML parser. For my
previous analysis of the spec see here:

http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html

What however is bad and evil is to just ignore conflicting encoding
declarations in an XML document itself. I'd choose either one of:

* bail with a clear error when unicode is supplied at all

* bail with a clear error when unicode is supplied with any explicit
encoding declaration in the XML.
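The second option could be sketched as a small guard in front of the parser. `prepare_for_parser` is a hypothetical helper written for Python 3 (`str` and `bytes` rather than the Python 2 `unicode` and `str` of this thread); it is not part of zope.tal.

```python
import re

# Matches a leading XML declaration that carries an encoding pseudo-attribute.
_DECLARED = re.compile(r'''^<\?xml\b[^>]*?\sencoding\s*=''')


def prepare_for_parser(source):
    """Return bytes for the parser, refusing ambiguous text input.

    Bytes are passed through unchanged, since their declaration applies.
    Text is accepted only if its XML declaration names no encoding, and
    is then encoded as UTF-8.
    """
    if isinstance(source, bytes):
        return source
    if _DECLARED.match(source):
        raise ValueError(
            'text input must not carry an encoding declaration')
    return source.encode('utf-8')
```

The first option would be even simpler: raise as soon as the input is text at all.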

It is of importance as soon as you
convert the document back to a stream e.g. when we deliver the content
back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with
that by changing the encoding parameter of the preamble for XML documents
based on the desired output encoding. utf-8 is always a good choice; however,
other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2
publisher "avoids" this problem converting the unicode result using
errors='replace' (which is likely something we might discuss :-))
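The effect of the error handler can be checked directly with Python's built-in codecs (Python 3 syntax). The `xmlcharrefreplace` handler shown last is not something proposed in this thread; it is included only as a point of comparison.

```python
text = 'price: 5 \u4e2d'   # U+4E2D cannot be represented in iso-8859-15

# Strict encoding (the default) raises for unrepresentable characters.
try:
    text.encode('iso-8859-15')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors='replace' silently substitutes a question mark (data loss).
replaced = text.encode('iso-8859-15', errors='replace')

# errors='xmlcharrefreplace' emits a numeric character reference instead,
# which an XML or HTML consumer can still decode.
char_ref = text.encode('iso-8859-15', errors='xmlcharrefreplace')
```

With `errors='replace'` the output is well-formed but the character is lost for good, which is presumably why the behaviour is flagged as worth discussing.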

Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump. The API requires that you pass it strings encoded as UTF8.

You can in lxml. :) libxml2 as a C API doesn't even support any unicode
string type as far as I am aware.

Regards,

Martijn


Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode




--On 15. Januar 2007 15:44:01 +0100 Martijn Faassen 
<[EMAIL PROTECTED]> wrote:



Hey,

On 1/15/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
[snip]

ok, got it. But this problem can be solved easily by changing the
encoding within the preamble.


I would say refusing to guess and bailing out with an error message is
better in this case. The Zen of Python:

In the face of ambiguity, refuse the temptation to guess.



Sorry but I don't get your point. What's happening with a XML inside a ZPT?

- XML data encoded as XXX comes in (either by editing the XML file through
  the ZMI or FTP/WebDAV upload)

- ZPT converts the encoded string to unicode based on the encoding in the
  preamble

- for parsing it is up to the application to decide what to do with the
  data. It is not up to the editor to decide how the ZPT engine should deal
  with XML internally. The ZPT engine decides to serialize the unicode
  string as utf-8 and to fix the XML preamble (which will result in a valid
  XML file that should be identical to the original file, except that the
  encoding might be different).
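
The steps above can be sketched roughly as follows. This is not the ZPT implementation: `decode_upload` and `serialize` are hypothetical helpers, written for Python 3, and the regular expressions only handle ASCII-compatible encodings with the declaration at the very start of the data.

```python
import re

# Encoding pseudo-attribute of an XML declaration, in bytes and in text form.
_DECL_BYTES = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
_DECL_TEXT = re.compile(r'^(<\?xml[^>]*?\bencoding\s*=\s*)(["\'])[^"\']*\2')


def decode_upload(raw):
    """Decode uploaded bytes using the declared encoding (default utf-8)."""
    m = _DECL_BYTES.match(raw)
    encoding = m.group(1).decode("ascii") if m else "utf-8"
    return raw.decode(encoding)


def serialize(text, encoding="utf-8"):
    """Rewrite the declared encoding, then encode the text to match it."""
    fixed = _DECL_TEXT.sub(
        lambda m: m.group(1) + m.group(2) + encoding + m.group(2),
        text,
        count=1,
    )
    return fixed.encode(encoding)
```

Note that between the two calls the text still carries its original, now stale, declaration, which is exactly the state the rest of the thread argues about.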


I still don't see what should be ambiguous with this approach.

Andreas


[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode


Chris Withers wrote:

Philipp von Weitershausen wrote:

u''

What should be done then?


Not sure. We could ignore it or raise an error. I'm inclined to ignore 
it.


That's what I do too...


See my post elsewhere in the thread for an example of why this is Not Good.

Regards,

Martijn


[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Tres Seaver


Andreas Jung wrote:
> 
> --On 14 January 2007 18:14:45 + Chris Withers <[EMAIL PROTECTED]>
> wrote:
> 
>> Dieter Maurer wrote:
>>> A halfway intelligent parser would accept Unicode when it gets it
>>> and concentrate on the remaining part of its task: either reporting
>>> structural events or building a parse tree.
>> The trivial fix I use in Twiddler is as follows:
>>
>> if isinstance(source, unicode):
>>     source = source.encode('utf-8')
>>
>> Of course, this assumes a heading of either <?xml version="1.0"
>> encoding="utf-8"?> or a missing encoding attribute, in which case the xml
>> spec states that the string must be utf-8 encoded.
> 
> The encoding of the XML preamble should not matter when parsing an XML
> document stored as a unicode string.

That encoding is a *lie*, which is the real problem.  Parsers expect it
to be *correct*, and if missing, expect the text to be encoded as UTF-8,
per the spec (if the document comes from an HTTP request, then the
application may supply the encoding from the request headers).

Nothing in the XML specs allows or specifies any behavior for XML
documents serialized as unicode, because such serializations are
*programming language specific*.

> It is of importance as soon as you
> convert the document back to a stream, e.g. when we deliver the content
> back to a browser or an FTP client. The ZPublisher (for Zope 2) deals with
> that by changing the encoding parameter of the preamble for XML documents
> based on the desired output encoding. utf-8 is always a good choice; however,
> other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2
> publisher "avoids" this problem by converting the unicode result using
> errors='replace' (which is likely something we might discuss :-))

Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump.  The API requires that you pass it strings encoded as UTF-8.


Tres.
- --
===
Tres Seaver  +1 540-429-0999  [EMAIL PROTECTED]
Palladion Software   "Excellence by Design"http://palladion.com

Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode


Hey,

On 1/15/07, Andreas Jung <[EMAIL PROTECTED]> wrote:
[snip]

ok, got it. But this problem can be solved easily by changing the encoding
within the preamble.


I would say refusing to guess and bailing out with an error message is
better in this case. The Zen of Python:

In the face of ambiguity, refuse the temptation to guess.

applies very much in this case in my opinion. Changing the preamble is
too much like "do what I mean" to me - do we really know the developer
actually had any clue what they were doing when they somehow created
this unicode string with an encoding declaration? I'm not even sure I
know what it *means* to have a unicode serialized XML string with an
encoding declaration.

I think we already have code in lxml that we can base the refusal to guess on.

Regards,

Martijn

Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode




--On 15 January 2007 14:52:42 +0100 Martijn Faassen
<[EMAIL PROTECTED]> wrote:



Hey,

Gmane isn't updating so I can't really reply to the message (not visible
in gmane) that I want to, but I saw the following solution proposed:

def ourparse(text):
    if isinstance(text, unicode):
        text = text.encode('UTF-8')
    xml_parser.parse(text)

now consider what will happen if you do the following:

text = u"Some non-ascii characters here"
ourparse(text)

what will happen is that text is converted to a UTF-8 encoded string (an
8-bit byte string). It's then passed to a hopefully compliant XML parser.
This XML parser sees an 8-bit string, and checks the encoding header for
more information on the encoding of the string. It will therefore assume
the string is in latin-1. The parse will either break with an obscure error
or silently produce mangled characters, and the developer doing this is
probably very confused.



ok, got it. But this problem can be solved easily by changing the encoding
within the preamble.

-aj


[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode


Hey,

Gmane isn't updating so I can't really reply to the message (not visible 
in gmane) that I want to, but I saw the following solution proposed:


def ourparse(text):
    if isinstance(text, unicode):
        text = text.encode('UTF-8')
    xml_parser.parse(text)

now consider what will happen if you do the following:

text = u"Some non-ascii characters here"

ourparse(text)

what will happen is that text is converted to a UTF-8 encoded string (an
8-bit byte string). It's then passed to a hopefully compliant XML parser.
This XML parser sees an 8-bit string, and checks the encoding header for
more information on the encoding of the string. It will therefore assume
the string is in latin-1. The parse will either break with an obscure error
or silently produce mangled characters, and the developer doing this is
probably very confused.


This is why it's better to refuse to guess.
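
A small Python 3 reproduction of the scenario, using the standard library's ElementTree (which sits on top of expat) as a stand-in for `xml_parser`. The document below is a made-up example, and `str` takes the place of Python 2's `unicode`. With an ISO-8859-1 declaration every byte is a valid character, so in this sketch nothing is raised and the text simply comes back mangled; other declared encodings may fail with an exception instead.

```python
import xml.etree.ElementTree as ET


def ourparse(text):
    # Same shape as the proposed fix: encode text input as UTF-8 and parse.
    if isinstance(text, str):
        text = text.encode("utf-8")
    return ET.fromstring(text)


# Hypothetical document: the declaration says ISO-8859-1, but after
# ourparse() the bytes handed to the parser are really UTF-8.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><doc>caf\u00e9</doc>'
root = ourparse(doc)
print(ascii(root.text))  # the two UTF-8 bytes of U+00E9 come back as two characters
```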

Regards,

Martijn


Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode




--On 15 January 2007 13:26:16 +0100 Martijn Faassen
<[EMAIL PROTECTED]> wrote:




How would you propose to parse the following unicode string?

u""


If your parser is unicode-aware then the encoding of the preamble
does not matter, since you already have unicode internally and can process
your file purely as XML.


If your parser isn't unicode-aware then you will likely convert it to
utf-8 and work internally with utf-8 encoded strings. In fact
xml.parsers.expat seems to support unicode (it can return unicode strings
to the handlers, see the 'returns_unicode' property). However you need to
reconstruct the XML preamble when you reconstruct your XML from the
parsed data.
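
As a rough illustration of expat handing text to its handlers, here is a Python 3 sketch. The `returns_unicode` property mentioned above belongs to the Python 2 API and, as far as I know, no longer exists on Python 3, where handlers always receive `str`. `collect_text` is a hypothetical helper name.

```python
from xml.parsers import expat


def collect_text(data):
    """Parse encoded XML bytes and return all character data as text."""
    chunks = []
    parser = expat.ParserCreate()
    # expat may deliver one text node in several pieces, so collect them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


text = collect_text("<doc>caf\u00e9</doc>".encode("utf-8"))
```

The handlers never see the original declaration, which is why the preamble has to be rebuilt when the document is written out again.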

Or am I missing something?

Andreas


[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode


Philipp von Weitershausen wrote:
[snip]

A workaround inside parseString() would be to check for unicode
and convert the string on-the-fly to a Python string with utf-8 encoding.
This is possibly a limitation of the underlying Expat parser...any
recommendation how to deal with this issue?


Fixed it in 3.3 and trunk. If you had given me a bit more time, this 
could even have been in 2.10.2b :). Oh well, I guess that's what 2.10.2 
will be for ;)


What did you fix? Please see my posting for a dangerous ambiguity:

u""

Regards,

Martijn


[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode