[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-19 Thread Martijn Faassen

Andreas Jung wrote:

I am replying to the three proposals. First I have to kick the proposal 
of Tres (UTF-8 storage). We want unicode as internal representation for 
any kind of ZPT (both text/html and text/xml).


I'm not sure I understand this. Wouldn't the internal representation be 
unicode after the parse, no matter what the representation of the text 
itself is? I may be missing something about the way ZPT templates are 
stored, though.


Regards,

Martijn

___
Zope3-dev mailing list
Zope3-dev@zope.org
Unsub: http://mail.zope.org/mailman/options/zope3-dev/archive%40mail-archive.com



Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-18 Thread Fred Drake

On 1/18/07, Andreas Jung [EMAIL PROTECTED] wrote:

We're faster with new Zope versions than the W3C with any standard.


So?  The recommendation for XML 1.1 is already a done deal (a second
edition was published last September), so there are already multiple
specified versions.  Since other version strings are allowed, whether
there's a published specification or not, we don't want to make
assumptions about what's there.

How the information should be stored is another matter; my point is
only that we can't make any assumptions about it beyond that it's
1.0 if the XML declaration is omitted.
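
A minimal sketch of that rule in Python, for illustration only. The helper name `declared_version` is hypothetical (it is not part of zope.tal or expat), and the regular expression assumes the declaration sits at the very start of already-decoded text.

```python
import re

# Matches the version pseudo-attribute of an XML declaration at the very
# start of the document, e.g. <?xml version="1.1" encoding="utf-8"?>
_VERSION_RE = re.compile(r'^<\?xml\s+version\s*=\s*(["\'])(?P<version>[^"\']+)\1')


def declared_version(text):
    """Return the declared version string, or "1.0" if no declaration is present.

    No assumption is made about which version strings may appear.
    """
    match = _VERSION_RE.match(text)
    return match.group("version") if match else "1.0"
```

Whatever string is declared is handed back unchanged; only the "declaration omitted" case is mapped to 1.0.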


 -Fred

--
Fred L. Drake, Jr. <fdrake at gmail.com>
Every sin is the result of a collaboration. --Lucius Annaeus Seneca



Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-18 Thread Andreas Jung



--On 18. Januar 2007 08:29:57 -0500 Fred Drake [EMAIL PROTECTED] wrote:


On 1/18/07, Andreas Jung [EMAIL PROTECTED] wrote:

We're faster with new Zope versions than the W3C with any standard.


So?  The recommendation for XML 1.1 is already a done deal (a second
edition was published last September), so there are already multiple
specified versions.  Since other version strings are allowed, whether
there's a published specification or not, we don't want to make
assumptions about what's there.


Are the underlying frameworks (TAL, xml.parsers.pyexpat) ready for XML 1.1?

-aj






Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-17 Thread Andreas Jung



--On 16. Januar 2007 14:12:46 +0100 Martijn Faassen 
[EMAIL PROTECTED] wrote:



I am replying to the three proposals. First I have to kick the proposal of 
Tres (UTF-8 storage). We want unicode as internal representation for any 
kind of ZPT (both text/html and text/xml). Supporting unicode for text/html 
and utf-8 for text/xml would make code more complicated and lead to further
unicode encoding conflicts. We're trying to solve this issue right now and 
I don't want to introduce a new construction site.


So Martijn's and my proposal remain. They are not very different. In the 
end the behavior is almost identical. But I will adopt your suggestion to 
remove
the preamble when storing the data internally (basically to avoid a 
possible encoding ambiguity).


Andreas




Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-17 Thread Chris Withers

Martijn Faassen wrote:

That's what I do too...


See my post elsewhere in the thread for an example of why this is Not Good.


Luckily Twiddler is still less than version 1.0 ;-)

When someone reports it as a bug, I'll fix it.

cheers,

Chris

--
Simplistix - Content Management, Zope & Python Consulting
   - http://www.simplistix.co.uk




Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-17 Thread Dieter Maurer
Martijn Faassen wrote at 2007-1-16 23:19 +0100:
Dieter Maurer wrote:
 Martijn Faassen wrote at 2007-1-15 15:44 +0100:
 
 I would say refusing to guess and bailing out with an error message is
 better in this case.
 
 I disagree with you.
 
   Logically, parsing an encoded XML document consists of two
   passes: decode the encoded string into unicode and reconstruct
   the XML info elements from the serialization.
 
   Traditionally, these two passes are not performed one after
   the other but folded together in a single pass.
   
   But that tradition should not prevent to separate out the
   (Unicode) decoding phase. And after this phase is done,
   there is not ambiguity left with the XML declaration.
   Its encoding attribute is simply irrelevant for the second phase
   (apart from generating the PI info element).

That's nice as far as it goes. What if after the second phase you need 
to parse the XML again?
What do you do with your encoding header then? 

After the second phase, I no longer have an XML string but
instead either a sequence of events (SAX style) or a tree of
XML info elements (syntax tree style).

But, whatever I have, the second stage does not magically change
my unicode string. It could be parsed over and over again.

If it's irrelevant, you better strip it out before you put it into the 
parser.

I lose information then. The event stream or info element tree
lacks the XML declaration PI then, or at least its encoding attribute.

The parsing process is allowed to lose some information.
For example it can lose whitespace details or the order
of attributes. I don't know whether the loss or modification
of PIs is considered acceptable. In general, this would
definitely be wrong.

I have read some article in comp.text.xml that complained
about the loss of the encoding information -- as it may be a good hint
about the default encoding to be used on encoding/serialization.
This means that some XML processing systems lose the information
and not everyone is happy.



-- 
Dieter



Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-17 Thread Dieter Maurer
Andreas Jung wrote at 2007-1-17 17:48 +0100:
 ...
So Martijn's and my proposal remain. They are not very different. In the 
end the behavior is almost identical. But I will adopt your suggestion to 
remove
the preamble when storing the data internally (basically to avoid a 
possible encoding ambiguity).

In future times, the preamble might contain information which
should not be dropped, e.g. when there is an XML version
different from 1.0.

For PageTemplates, we know that the encoding information is probably
not relevant after the parsing -- unless we want to use it
as a default for the Content-Type charset but I doubt that this
is a good thing. If the Content-Type's charset is given explicitly,
then the encoding of the XML declaration needs to be
adapted to this value during the serialization anyway -- thus
overriding any encoding present there.
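
One way to keep the rest of the preamble (including the version) while dropping only the encoding could look like the sketch below. `drop_declared_encoding` is a hypothetical helper, not existing PageTemplates code, and it assumes the text has already been decoded to unicode.

```python
import re

# Captures everything in the XML declaration before the encoding
# pseudo-attribute, then matches the pseudo-attribute itself.
_ENCODING_RE = re.compile(
    r'^(<\?xml\b[^?]*?)\s+encoding\s*=\s*(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def drop_declared_encoding(text):
    """Remove encoding="..." from the declaration; keep version and the rest."""
    return _ENCODING_RE.sub(r'\1', text, count=1)
```

A document without a declaration, or with a declaration that has no encoding, passes through unchanged.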



-- 
Dieter



Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-17 Thread Andreas Jung



--On 17. Januar 2007 22:49:11 +0100 Dieter Maurer [EMAIL PROTECTED] 
wrote:



Andreas Jung wrote at 2007-1-17 17:48 +0100:

...
So Martijn's and my proposal remain. They are not very different. In the
end the behavior is almost identical. But I will adopt your suggestion
to  remove
the preamble when storing the data internally (basically to avoid a
possible encoding ambiguity).


In future times, the preamble might contain information which
should not be dropped, e.g. when there is an XML version
different from 1.0.


We're faster with new Zope versions than the W3C with any standard.



For PageTemplates, we know that the encoding information is probably
not relevant after the parsing -- unless we want to use it
as a default for the Content-Type charset but I doubt that this
is a good thing. If the Content-Type's charset is given explicitly,
then the encoding of the XML declaration needs to be
adapted to this value during the serialization anyway -- thus
overriding any encoding present there.


?

-aj






[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-16 Thread Martijn Faassen

Tres Seaver wrote:
[snip]

Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to the libxml2 *at all* , unless you want
a core dump.  The API requires that you pass it strings encoded as UTF8.
You can in lxml. :) libxml2 as a C API doesn't even support any unicode 
string type as far as I am aware.


It *requires* UTF-8-encoded strings.  See http://xmlsoft.org/xml.html



  12. So what is this funky xmlChar used all the time?

  It is a null terminated sequence of utf-8 characters. And only
  utf-8! You need to convert strings encoded in different ways to
  utf-8 before passing them to the API. This can be accomplished
  with the iconv library for instance.


Um, Tres, no need to tell me about the libxml2 API..

There is also the libxml2 *python* API, which I believe has a knob to 
turn on the ability to pass in unicode strings, though I haven't tried 
that myself. Then there's of course lxml, which is a Python-layer which 
requires unicode or plain-ascii strings in its DOM-ish (elementtree 
API), and encoded data for the parser.


We should distinguish the behavior of libxml2 as a tree API (utf-8 all 
the way) and as a parser/serializer (all sorts of encodings). Generally 
XML libraries make a distinction between the two.
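
The same split can be illustrated with the standard library's ElementTree in Python 3, used here only as a stand-in for lxml/libxml2 (an assumption made for the sake of a self-contained example). Encoded bytes go into the parser, the tree exposes unicode, and the serializer can emit whichever encoding is asked for.

```python
import xml.etree.ElementTree as ET

# Parser side: encoded bytes in, with the encoding named in the declaration.
latin1_bytes = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'
).encode("iso-8859-1")
root = ET.fromstring(latin1_bytes)

# Tree side: text is exposed as a unicode string, whatever the input encoding.
text = root.text

# Serializer side: the caller picks the output encoding.
as_utf8 = ET.tostring(root, encoding="utf-8")
as_latin1 = ET.tostring(root, encoding="iso-8859-1")
```

In this sketch the latin-1 output carries an encoding declaration, while the UTF-8 output does not need one.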



Frankly, I don't get the desire to *store* a complete XML document (as
opposed to the extracted contents of attributes or nodes) as unicode:
it isn't as though it can be easily processed in that form without
re-encoding (even if lxml is the one doing the re-encoding).  It isn't
discourse, in the Zope3 sense of text intended for human
consumption, and the tools people use with it are all going to expect
some kind of validly-encoded string.


There are objects that allow you to edit XML; the ZPT page is an 
example. I do not know whether it stores as unicode right now, but you 
can argue it's text intended for human consumption, as humans are 
supposed to be editing it. :)


It may indeed make more sense to store this information as UTF-8 however 
from an efficiency point of view. This would probably still require 
recoding the data into unicode for the purposes of inspecting it and 
editing it.


Regards,

Martijn




[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-16 Thread Martijn Faassen

Andreas Jung wrote:



--On 15. Januar 2007 22:15:46 +0100 Martijn Faassen 

[snip]

I still don't see what should be ambiguous with this approach.


Ambiguous in that the string seems to say it's in two encodings at once.
You're then guessing: you're letting the Python string type trump the
declaration. Then, since we've shown that leads to bugs, you propose
actually changing the encoding declaration of the XML document. I wonder
what people then expect to happen upon serialization. In effect, your
proposal would, I think, serialize to UTF-8 only, right? (in which case
the encoding declaration can be dropped as it's the default.)


When you download a ZPT through FTP/WebDAV then the unicode representation
of the XML will be converted using the 'output_encoding' property of the
corresponding ZPT which is set when uploading a new XML document (and taken
from the preamble). So when you upload a latin1 XML file you should get 
it back as valid latin1 through FTP/WebDAV.


Okay, understood, this makes sense in the case of the FTP/WebDAV 
support, though recoding to UTF-8 and ripping off the encoding 
declaration would also be pretty safe in case of XML.


When you download text/xml content through the ZPublisher then the 
ZPublisher will convert unicode textual content to some encoding which is

either taken from an already set 'content-type: text/...; charset=X'
HTTP Header or as fallback from the zpublisher-default-encoding property
as defined in the zope.conf file.
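
The fallback order described here could be sketched as follows. This is not ZPublisher code: `response_charset` is a hypothetical helper, and its `default_encoding` argument merely stands in for the zpublisher-default-encoding setting.

```python
from email.message import Message


def response_charset(content_type, default_encoding="utf-8"):
    """Pick the charset from a Content-Type header value, else the default."""
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lowercased, or None if absent
        if charset:
            return charset
    return default_encoding
```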


And the same behavior actually applies to HTML content, right?


So the application can specify in both case the encoding of the serialized
XML content. Where is the problem?


What I'm trying to express here is that this stuff should not be treated 
as "where is the problem?" but should be thought through carefully as 
this is extremely easy to do wrong. I'll think it through carefully 
here. Let's list some cases:


A) FTP download: stored XML gets downloaded through FTP/WebDAV support.

B) FTP upload: external XML gets uploaded through FTP/WebDAV

C) parse: stored XML is parsed inside of Zope by the page template engine.

D) publisher download: stored XML is downloaded as text/xml directly 
through the publisher


E) ZPT inclusion: stored XML is included in another page template, for 
instance to present it in a text area.


F) form submit: Text area is then saved and needs to be stored again.

Andreas Jung proposal (speculation)
===================================

As far as I understand it you're proposing:

* store XML as unicode text

* separately store the encoding on the page template object

* also keep the encoding= bit in the XML preamble when storing.

Let's go through the cases

A) FTP download: encode this to whatever encoding is stored on the ZPT 
object using Python unicode support. No encoding mangling necessary.


B) FTP upload: read encoding= bit and store this on ZPT. Then decode 
to unicode using that encoding. Could not be implemented by a 
parse/serialization step without extra encoding= manipulation 
afterwards (after decoding to unicode).


C) parse: Rip out the 'encoding=' bit before you send it in the 
parser. encode to UTF-8 just before entering the parser.


D) publisher download: Rip out the 'encoding=' bit. Then encode 
according to response header (or zope.conf). Then add back encoding= 
bit stating if output is non-UTF-8 (not Python names like 'latin1' but 
encoding identifiers XML is aware of).


E) ZPT inclusion: Send the unicode text to the page template. 
encoding= bit will be presented in the editor.


F) form submit: decode to unicode according to encoding of page that 
displayed edit form and store it. Read 'encoding=' bit and store it in 
ZPT object. Don't manipulate 'encoding=' bit in XML.


encoding= removal: C, D
encoding= adding: D
encoding= reading: B, F
encode from unicode: A, C, D
decode to unicode: B, F

no encoding= manipulation required: A, E
no recoding required: E
straightforward: E

The forms editor scenario (E and F) is potentially confusing as the user 
may be tempted by the ability to use encoding= to paste latin-1 XML 
text. Editor could say it only wants it in whatever encoding the page is 
in, though.
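
Cases A and B under this proposal can be sketched roughly as below. `StoredTemplate` is a hypothetical stand-in, not the real ZPT class; only the `output_encoding` name is taken from the discussion. The byte-level regular expression assumes an ASCII-compatible encoding and does no BOM or UTF-16 detection.

```python
import re

# Finds the encoding pseudo-attribute in the declaration of raw (undecoded) XML.
_DECL_ENCODING_RE = re.compile(
    rb'^<\?xml\b[^?]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


class StoredTemplate:
    """Stores XML as unicode text plus the encoding it was uploaded in."""

    def __init__(self):
        self.text = ""
        self.output_encoding = "utf-8"

    def upload(self, raw):
        # Case B: read encoding=, remember it, decode; the preamble is kept.
        match = _DECL_ENCODING_RE.match(raw)
        self.output_encoding = match.group(1).decode("ascii") if match else "utf-8"
        self.text = raw.decode(self.output_encoding)

    def download(self):
        # Case A: encode with the stored encoding; no preamble mangling needed.
        return self.text.encode(self.output_encoding)
```

Because the stored preamble still names the upload encoding, a latin-1 upload comes back byte for byte on download.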


Martijn Faassen proposal
========================

If you rip out the encoding before data is stored in the page template 
and then store as unicode, then we have the following cases:


A) FTP download: Encode to UTF-8, output in UTF-8 only. No encoding 
mangling necessary.


B) FTP upload: read encoding= bit and decode to unicode accordingly. 
Rip out encoding=. Could be done by a parse/serialization step, then 
decode result to unicode.


C) parse: encode to UTF-8 just before entering the parser.

D) publisher download: Encode according to response header or zope.conf. 
Add in encoding= if output is non-UTF-8 using XML names for encoding.


E) ZPT inclusion: send unicode text to the page template. No encoding= 
bit will be in the XML presented in the editor.


F) form submit: Rip out 

[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-16 Thread Tres Seaver

Martijn Faassen wrote:
 Andreas Jung wrote:

 --On 15. Januar 2007 22:15:46 +0100 Martijn Faassen 
 [snip]
 I still don't see what should be ambiguous with this approach.
 Ambiguous in that the string seems to say it's in two encodings at once.
 You're then guessing: you're letting the Python string type trump the
 declaration. Then, since we've shown that leads to bugs, you propose
 actually changing the encoding declaration of the XML document. I wonder
 what people then expect to happen upon serialization. In effect, your
 proposal would, I think, serialize to UTF-8 only, right? (in which case
 the encoding declaration can be dropped as it's the default.)
 When you download a ZPT through FTP/WebDAV then the unicode representation
 of the XML will be converted using the 'output_encoding' property of the
 corresponding ZPT which is set when uploading a new XML document (and taken
 from the preamble). So when you upload a latin1 XML file you should get 
 it back as valid latin1 through FTP/WebDAV.
 
 Okay, understood, this makes sense in the case of the FTP/WebDAV 
 support, though recoding to UTF-8 and ripping off the encoding 
 declaration would also be pretty safe in case of XML.
 
 When you download text/xml content through the ZPublisher then the 
 ZPublisher will convert unicode textual content to some encoding which is
 either taken from an already set 'content-type: text/...; charset=X'
 HTTP Header or as fallback from the zpublisher-default-encoding property
 as defined in the zope.conf file.
 
 And the same behavior actually applies to HTML content, right?
 
 So the application can specify in both case the encoding of the serialized
 XML content. Where is the problem?
 
 What I'm trying to express here is that this stuff should not be treated 
 as "where is the problem?" but should be thought through carefully as 
 this is extremely easy to do wrong. I'll think it through carefully 
 here. Let's list some cases:
 
 A) FTP download: stored XML gets downloaded through FTP/WebDAV support.
 
 B) FTP upload: external XML gets uploaded through FTP/WebDAV
 
 C) parse: stored XML is parsed inside of Zope by the page template engine.
 
 D) publisher download: stored XML is downloaded as text/xml directly 
 through the publisher
 
 E) ZPT inclusion: stored XML is included in another page template, for 
 instance to present it in a text area.
 
 F) form submit: Text area is then saved and needs to be stored again.
 
 Andreas Jung proposal (speculation)
 ===================================
 
 As far as I understand it you're proposing:
 
 * store XML as unicode text
 
 * separately store the encoding on the page template object
 
 * also keep the encoding= bit in the XML preamble when storing.
 
 Let's go through the cases
 
 A) FTP download: encode this to whatever encoding is stored on the ZPT 
 object using Python unicode support. No encoding mangling necessary.
 
 B) FTP upload: read encoding= bit and store this on ZPT. Then decode 
 to unicode using that encoding. Could not be implemented by a 
 parse/serialization step without extra encoding= manipulation 
 afterwards (after decoding to unicode).
 
 C) parse: Rip out the 'encoding=' bit before you send it in the 
 parser. encode to UTF-8 just before entering the parser.
 
 D) publisher download: Rip out the 'encoding=' bit. Then encode 
 according to response header (or zope.conf). Then add back encoding= 
 bit stating if output is non-UTF-8 (not Python names like 'latin1' but 
 encoding identifiers XML is aware of).
 
 E) ZPT inclusion: Send the unicode text to the page template. 
 encoding= bit will be presented in the editor.
 
 F) form submit: decode to unicode according to encoding of page that 
 displayed edit form and store it. Read 'encoding=' bit and store it in 
 ZPT object. Don't manipulate 'encoding=' bit in XML.
 
 encoding= removal: C, D
 encoding= adding: D
 encoding= reading: B, F
 encode from unicode: A, C, D
 decode to unicode: B, F
 
 no encoding= manipulation required: A, E
 no recoding required: E
 straightforward: E
 
 The forms editor scenario (E and F) is potentially confusing as the user 
 may be tempted by the ability to use encoding= to paste latin-1 XML 
 text. Editor could say it only wants it in whatever encoding the page is 
 in, though.
 
 Martijn Faassen proposal
 ========================
 
 If you rip out the encoding before data is stored in the page template 
 and then store as unicode, then we have the following cases:
 
 A) FTP download: Encode to UTF-8, output in UTF-8 only. No encoding 
 mangling necessary.
 
 B) FTP upload: read encoding= bit and decode to unicode accordingly. 
 Rip out encoding=. Could be done by a parse/serialization step, then 
 decode result to unicode.
 
 C) parse: encode to UTF-8 just before entering the parser.
 
 D) publisher download: Encode according to response header or zope.conf. 
 Add in encoding= if output is non-UTF-8 using XML names 

Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-16 Thread Dieter Maurer
Chris Withers wrote at 2007-1-14 18:14 +:
 ...
The problem comes when someone sends you something like:

u'<?xml version="1.0" encoding="something-else"?><node />'

What should be done then?

We parse the declaration  and generate an info element
for it but otherwise ignore it as it has lost its meaning after
the XML has been converted to Unicode.



-- 
Dieter



Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-16 Thread Dieter Maurer
Martijn Faassen wrote at 2007-1-15 15:44 +0100:
 
Hey,

On 1/15/07, Andreas Jung [EMAIL PROTECTED] wrote:
[snip]
 ok, got it. But this problem can be solved easily by changing the encoding
 within the preamble.

I would say refusing to guess and bailing out with an error message is
better in this case.

I disagree with you.

  Logically, parsing an encoded XML document consists of two
  passes: decode the encoded string into unicode and reconstruct
  the XML info elements from the serialization.

  Traditionally, these two passes are not performed one after
  the other but folded together in a single pass.
  
  But that tradition should not prevent to separate out the
  (Unicode) decoding phase. And after this phase is done,
  there is not ambiguity left with the XML declaration.
  Its encoding attribute is simply irrelevant for the second phase
  (apart from generating the PI info element).

  Thus, there is no guessing; someone else has just performed
  the first phase of the complete process -- maybe using the
  encoding attribute or some overriding information.
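
A small Python 3 illustration of the two phases, using the standard library's ElementTree as the second-phase parser (an assumption; nothing here is zope.tal code). As far as I know, ElementTree treats `str` input as already decoded and does not re-apply the declared encoding, which matches the argument above.

```python
import xml.etree.ElementTree as ET

raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'
).encode("iso-8859-1")

# Phase 1: decoding, done by "someone else" who knows the right encoding.
text = raw.decode("iso-8859-1")

# Phase 2: reconstruct the XML structure from the unicode string.
# The declaration still says ISO-8859-1, but it plays no role here.
root = ET.fromstring(text)

# The unicode string is unchanged, so it can be parsed again at will.
root_again = ET.fromstring(text)
```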

-- 
Dieter



Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-16 Thread Dieter Maurer
Tres Seaver wrote at 2007-1-15 16:57 -0500:
 ...
Frankly, I don't get the desire to *store* a complete XML document (as
opposed to the extracted contents of attributes or nodes) as unicode

My desire comes from the easy principle: all text should be unicode.

Decoding/encoding happens only at the system boundaries
and no longer internally.



-- 
Dieter



[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-16 Thread Martijn Faassen

Dieter Maurer wrote:

Martijn Faassen wrote at 2007-1-15 15:44 +0100:


Hey,

On 1/15/07, Andreas Jung [EMAIL PROTECTED] wrote:
[snip]

ok, got it. But this problem can be solved easily by changing the encoding
within the preamble.

I would say refusing to guess and bailing out with an error message is
better in this case.


I disagree with you.

  Logically, parsing an encoded XML document consists of two
  passes: decode the encoded string into unicode and reconstruct
  the XML info elements from the serialization.

  Traditionally, these two passes are not performed one after
  the other but folded together in a single pass.
  
  But that tradition should not prevent to separate out the

  (Unicode) decoding phase. And after this phase is done,
  there is not ambiguity left with the XML declaration.
  Its encoding attribute is simply irrelevant for the second phase
  (apart from generating the PI info element).


That's nice as far as it goes. What if after the second phase you need 
to parse the XML again? What do you do with your encoding header then? 
If it's irrelevant, you better strip it out before you put it into the 
parser.


Regards,

Martijn




[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-16 Thread Martijn Faassen

Tres Seaver wrote:
[snip]
The "just store the XML" scenario is in fact surprisingly nice. It only needs 
attention to encoding and decoding in the always complicated ZPublisher 
direct output scenario, and in the edit form scenario.


As you speculated, this is actually my preference, except that I don't
see the need in scenario D to recode the data and strip the prolog
encoding attribute.  Why wouldn't we just use the XML template's own
declared encoding to encode any data substituted into the template?  I
mean, if the user has marked up the document to indicate a preferred
encoding, why should we bother storing such an encoding in another location?


Yes, I was thinking along those lines too.


Then the only time we would need to munge the document would be at
inclusion time, which is the only time we actually *need* to have
unicode in hand.  We might even elide the decode-recode stage if the
target document uses the same encoding!  Such an optimization might
not be worth the complexity, however.


Yes, one complexity is that trying to do this would break the assumption 
that ZPT templates always return unicode or pure-ascii strings, not 
anything else (such as encoded data). Only at the last phase of the 
publisher will it be encoded into something else. I really appreciate 
keeping this assumption in place. :)



Note that in the inclusion case (scenario E), we almost certainly
*should* be stripping the *entire* prolog, which is only valid at the
start of the merged document. 


If you are including it as a document, yes. If you are including it 
quoted, as for instance the contents of a text area allowing you to edit 
the XML text directly, then no. This suggests we actually have two 
scenarios here.
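
The two inclusion scenarios might be sketched like this. Both function names are hypothetical, and the first one only strips the XML declaration, not a full prolog with DOCTYPE, which would need more care.

```python
import html
import re

# XML declaration at the very start of the text, plus trailing whitespace.
_XML_DECL_RE = re.compile(r'^<\?xml\b[^?]*\?>\s*')


def for_structural_inclusion(text):
    """Included as markup: the declaration is only valid at document start."""
    return _XML_DECL_RE.sub("", text, count=1)


def for_textarea(text):
    """Included quoted for editing: keep everything, escape the markup."""
    return html.escape(text)
```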



I guess there is a subscenario, which is
that the included document is actually the 'main_template' supplying
the prolog:  METAL should perhaps leave the prolog alone, while
'tal:replace' and 'tal:content' (with 'structure') would strip it?


Yay, another scenario. :)

Regards,

Martijn




[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Martijn Faassen

Andreas Jung wrote:
[snip]

[Bernd Dorn]

IMHO it should only accept strings, because the value should be an XML
string and therefore always has to be encoded in 'utf-8' or in the
encoding specified in the processing instruction.



I disagree with that. Since Zope 3 is supposed to use unicode internally
(at least that's the legend) it should support unicode also at the 
parser level. Other languages like Java store XML also as unicode 
strings and support parsing it.


Bernd Dorn raises a good point though, and it's one you need to think 
about carefully. To say "languages like Java store XML also as unicode" 
is rather ambiguous. While I'm not aware of the details of Java, 
serialized XML is typically stored in some encoded form, most commonly 
UTF-8 (the default 8 bit encoding), but latin 1 is also supported, and 
there are also multi-byte encodings. *Parsed* XML exposed through a DOM 
is exposed as unicode strings. I'm sure Java supports these usage 
patterns, as naturally files on disk need to be parsable.


Here you are talking about parsing XML, so maintaining the position that 
this should be encoded is a reasonable one. This is how for instance the 
Python ElementTree operates (parse encoded, expose API as unicode (or 
pure ascii)), and this has been designed by Fredrik Lundh, who, as you 
may know, was instrumental in developing Python's unicode support.


How would you propose to parse the following unicode string?

u'<?xml version="1.0" encoding="ISO-8859-1"?><foo />'

If you are going to allow the parsing of unicode strings, I would 
strongly recommend *rejecting* any unicode string that itself declares 
an encoding as ambiguous: refuse to guess.


With lxml (which is an extension of the ElementTree API) we've taken the 
latter option: it's possible to pass a unicode string into the parser, 
but if that contains an encoding declaration, there will be an error. 
Underneath we actually re-encode this string back to UTF-8, as that's 
what the libxml2 parser expects. We made this change with the objections 
of Fredrik Lundh by the way - we felt user errors would be mostly 
prevented because it refuses to guess.
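
The "refuse to guess" policy can be approximated on top of the standard library in Python 3. This is a sketch of the idea only, not lxml's implementation; `parse_refusing_to_guess` is a hypothetical wrapper around ElementTree.

```python
import re
import xml.etree.ElementTree as ET

# An XML declaration at the start of the text that carries encoding=...
_STR_DECL_ENCODING_RE = re.compile(r'^<\?xml\b[^?]*?\sencoding\s*=')


def parse_refusing_to_guess(source):
    """Parse bytes normally; reject unicode text that also declares an encoding."""
    if isinstance(source, str) and _STR_DECL_ENCODING_RE.match(source):
        raise ValueError(
            "unicode input with an encoding declaration is ambiguous; "
            "pass encoded bytes or remove the declaration"
        )
    return ET.fromstring(source)
```

Encoded bytes with a declaration and unicode text without one both parse as usual; only the ambiguous combination is rejected.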


Regards,

Martijn




[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Martijn Faassen

Philipp von Weitershausen wrote:
[snip]

A workaround inside parseString() would be to check for unicode
and convert the string on-the-fly to a Python string with utf-8 encoding.
This is possibly a limitation of the underlying Expat parser...any 
recommendation how to deal with this issue?


Fixed it in 3.3 and trunk. If you had given me a bit more time, this 
could even have been in 2.10.2b :). Oh well, I guess that's what 2.10.2 
will be for ;)


What did you fix? Please see my posting for a dangerous ambiguity:

u'<?xml version="1.0" encoding="ISO 8859-1" ?>'

Regards,

Martijn




Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Andreas Jung



--On 15. Januar 2007 13:26:16 +0100 Martijn Faassen 
[EMAIL PROTECTED] wrote:




How would you propose to parse the following unicode string?

u'<?xml version="1.0" encoding="ISO-8859-1"?><foo />'


If your parser is unicode-aware then the encoding of the preamble
does not matter since you have already unicode internally and can process 
your file totally on XML.


If your parser isn't unicode-aware then you will likely convert it to
utf-8 and work internally with utf-8 encoded strings. In fact 
xml.parsers.expat seems to support unicode (it can return unicode strings
to the handlers, see 'returns_unicode' property). However you need to
reconstruct the XML preamble when you reconstruct your XML from the
parsed data.
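
For illustration, this is how expat hands unicode to its handlers in Python 3, where (as far as I know) handlers always receive `str` and the 'returns_unicode' switch of the Python 2 API is no longer needed. The input bytes and the list used to collect text are made up for the example.

```python
import xml.parsers.expat

chunks = []

parser = xml.parsers.expat.ParserCreate()
# Character data may arrive in several pieces, so collect and join them.
parser.CharacterDataHandler = chunks.append

raw = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'
).encode("iso-8859-1")
parser.Parse(raw, True)  # True marks this as the final chunk of input

text = "".join(chunks)
```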

Or am I missing something?

Andreas




[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Martijn Faassen

Hey,

Gmane isn't updating so I can't really reply to the message (not visible 
in gmane) that I want to, but I saw the following solution proposed:


def ourparse(text):
    if isinstance(text, unicode):
        text = text.encode('UTF-8')
    xml_parser.parse(text)

now consider what will happen if you do the following:

text = u'<?xml version="1.0" encoding="ISO-8859-1" ?><foo>Some non-ascii characters here</foo>'

ourparse(text)

what will happen is that text is converted to a UTF-8 encoded 8-bit
string. It's then passed to a hopefully compliant XML parser. This XML
parser sees an 8-bit string, and checks the encoding header for
more information on the encoding of the string. It will therefore assume
the string is in latin-1. The parse will break with an obscure error and
the developer doing this is probably very confused.


This is why it's better to refuse to guess.
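
To make the mismatch concrete, here is a minimal sketch in present-day Python 3 (where str plays the role of the old unicode type and bytes the role of the 8-bit string). The helper name parse_text and the sample word are made up for illustration. As far as I can tell, with a Latin-1 declaration the typical outcome is silently garbled text rather than an exception, because every byte is a valid Latin-1 character; other declared encodings may well fail outright.

```python
import xml.parsers.expat


def parse_text(data):
    """Parse XML bytes with expat and return the concatenated character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    parser.CharacterDataHandler = chunks.append
    parser.Parse(data, True)
    return "".join(chunks)


# A decoded string that still carries a Latin-1 encoding declaration.
text = '<?xml version="1.0" encoding="ISO-8859-1" ?><foo>caf\u00e9</foo>'

# The "encode to UTF-8 and hope" workaround: the bytes are now UTF-8,
# but the declaration still claims ISO-8859-1, so the parser trusts the lie.
mismatched = parse_text(text.encode("utf-8"))

# Encoding with the declared encoding keeps bytes and declaration in agreement.
consistent = parse_text(text.encode("iso-8859-1"))

print(repr(mismatched))
print(repr(consistent))
```

The two results differ even though both started from the same decoded string, which is exactly the kind of surprise the developer would have to debug.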

Regards,

Martijn




Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Andreas Jung



--On 15. Januar 2007 14:52:42 +0100 Martijn Faassen 
[EMAIL PROTECTED] wrote:



Hey,

Gmane isn't updating so I can't really reply to the message (not visible
in gmane) that I want to, but I saw the following solution proposed:

def ourparse(text):
    if isinstance(text, unicode):
        text = text.encode('UTF-8')
    xml_parser.parse(text)

now consider what will happen if you do the following:

text = u'<?xml version="1.0" encoding="ISO-8859-1" ?><foo>Some non-ascii characters here</foo>'
ourparse(text)

what will happen is that text is converted to a UTF-8 encoded 8-bit
string. It's then passed to a hopefully compliant XML parser. This XML
parser sees an 8-bit string, and checks the encoding header for
more information on the encoding of the string. It will therefore assume
the string is in latin-1. The parse will break with an obscure error and
the developer doing this is probably very confused.



ok, got it. But this problem can be solved easily by changing the encoding
within the preamble.
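
A rough sketch of what "changing the encoding within the preamble" could look like, written for present-day Python 3 (str instead of unicode). The function name encode_with_fixed_preamble and the regular expression are assumptions made for illustration; this is not the code used in zope.tal or the ZPublisher.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_ENCODING_RE = re.compile(
    r'^(<\?xml\s[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def encode_with_fixed_preamble(text, encoding="utf-8"):
    """Rewrite the declared encoding to match the target, then encode."""
    fixed = _ENCODING_RE.sub(
        lambda m: m.group(1) + m.group(2) + encoding + m.group(2),
        text,
        count=1,
    )
    return fixed.encode(encoding)


source = '<?xml version="1.0" encoding="ISO-8859-1" ?><foo>caf\u00e9</foo>'
data = encode_with_fixed_preamble(source)
print(data)
```

A string without any declaration passes through unchanged apart from being encoded.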

-aj




Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Martijn Faassen

Hey,

On 1/15/07, Andreas Jung [EMAIL PROTECTED] wrote:
[snip]

ok, got it. But this problem can be solved easily by changing the encoding
within the preamble.


I would say refusing to guess and bailing out with an error message is
better in this case. The Zen of Python:

In the face of ambiguity, refuse the temptation to guess.

applies very much in this case in my opinion. Changing the preamble is
too much like "do what I mean" to me - do we really know the developer
actually had any clue what they were doing when they somehow created
this unicode string with an encoding declaration? I'm not even sure I
know what it *means* to have a unicode serialized XML string with an
encoding declaration.

I think we already have code in lxml that we can use as a basis for refusing to guess.
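
A minimal sketch of the "refuse to guess" policy, written for present-day Python 3 (str instead of unicode). The names to_parser_input and _DECLARED_ENCODING_RE are hypothetical; this is not the lxml code, only an illustration of the second option: accept decoded strings, but bail out when they carry an explicit encoding declaration.

```python
import re

# True when a leading XML declaration contains an encoding pseudo-attribute.
_DECLARED_ENCODING_RE = re.compile(r'^<\?xml\s[^>]*?\bencoding\s*=')


def to_parser_input(source):
    """Return bytes for the parser, refusing ambiguous decoded input."""
    if isinstance(source, str):
        if _DECLARED_ENCODING_RE.match(source):
            raise ValueError(
                "decoded string carries an encoding declaration; "
                "pass bytes or drop the declaration"
            )
        # No declaration: UTF-8 is the default the parser will assume.
        return source.encode("utf-8")
    # Bytes are passed through; the declaration (if any) describes them.
    return source
```

The error message tells the caller what to change, which is the point of bailing out instead of silently picking one interpretation.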

Regards,

Martijn



[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Tres Seaver

Andreas Jung wrote:
 
 --On 14. Januar 2007 18:14:45 + Chris Withers [EMAIL PROTECTED] 
 wrote:
 
 Dieter Maurer wrote:
 A halfway intelligent parser would accept Unicode when it gets it
 and concentrate on the remaining part of its task: either reporting
 structural events or building a parse tree.
 The trivial fix I use in Twiddler is as follows:

 if isinstance(source, unicode):
     source = source.encode('utf-8')

 Of course, this assumes a heading of either <?xml version="1.0"
 encoding="utf-8"?> or a missing encoding attribute, in which case the xml
 spec states that the string must be utf-8 encoded.
 
 The encoding of the XML preamble should not matter when parsing a XML
 document stored as unicode string.

That encoding is a *lie*, which is the real problem.  Parsers expect it
to be *correct*, and if missing, expect the text to be encoded as UTF-8,
per the spec (if the document comes from an HTTP request, then the
application may supply the encoding from the request headers).

Nothing in the XML specs allows or specifies any behavior for XML
documents serialized as unicode, because such serializations are
*programming language specific*.

 It is of importance as soon as you 
 convert the document back to a stream e.g. when we deliver the content
 back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with 
 that by changing the encoding parameter of the preamble for XML documents 
 based on the desired output encoding. utf-8 is always a good choice; however,
 other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2
 publisher avoids this problem converting the unicode result using 
 errors='replace' (which is likely something we might discuss :-))
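
The trade-off described in the quoted paragraph can be sketched with plain str.encode in present-day Python 3. The sample text is made up; nothing here is ZPublisher code.

```python
# Three characters that iso-8859-15 can represent and one that it cannot.
text = "\u00fc\u00f6\u00e4 \u4e2d"

# utf-8 can represent every character, so it never fails.
utf8 = text.encode("utf-8")

# A strict encode to a narrower charset raises UnicodeEncodeError.
try:
    text.encode("iso-8859-15")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors='replace' avoids the exception by substituting '?',
# at the price of silently losing data.
lossy = text.encode("iso-8859-15", errors="replace")

print(strict_failed, lossy)
```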

Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump.  The API requires that you pass it strings encoded as UTF8.


Tres.
--
Tres Seaver          +1 540-429-0999
Palladion Software   "Excellence by Design"    http://palladion.com



[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Martijn Faassen

Chris Withers wrote:

Philipp von Weitershausen wrote:

u'<?xml version="1.0" encoding="something-else"?><node />'

What should be done then?


Not sure. We could ignore it or raise an error. I'm inclined to ignore 
it.


That's what I do too...


See my post elsewhere in the thread for an example of why this is Not Good.

Regards,

Martijn




Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Andreas Jung



--On 15. Januar 2007 15:44:01 +0100 Martijn Faassen 
[EMAIL PROTECTED] wrote:



Hey,

On 1/15/07, Andreas Jung [EMAIL PROTECTED] wrote:
[snip]

ok, got it. But this problem can be solved easily by changing the
encoding within the preamble.


I would say refusing to guess and bailing out with an error message is
better in this case. The Zen of Python:

In the face of ambiguity, refuse the temptation to guess.



Sorry but I don't get your point. What's happening with a XML inside a ZPT?

- XML data encoded as XXX comes in (either by editing the XML file through
  the ZMI or FTP/WebDAV upload)

- ZPT converts the encoded string to unicode based on the encoding in the
  preamble

- for parsing it is up to the application to decide what to do with the
  data. It is not up to the editor to decide how the ZPT engine should deal
  with XML internally. The ZPT engine decides to serialize the unicode
  string as utf-8 and to fix the XML preamble (which will result in a valid
  XML file which should be identical to the original file - except that the
  encoding might be different).


I still don't see what should be ambiguous with this approach.

Andreas




[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Martijn Faassen

Tres Seaver wrote:


Andreas Jung wrote:
--On 14. Januar 2007 18:14:45 + Chris Withers [EMAIL PROTECTED] 
wrote:



Dieter Maurer wrote:

A halfway intelligent parser would accept Unicode when it gets it
and concentrate on the remaining part of its task: either reporting
structural events or building a parse tree.

The trivial fix I use in Twiddler is as follows:

if isinstance(source,unicode):
   source = source.encode('utf-8')

Of course, this assumes a heading of either <?xml version="1.0"
encoding="utf-8"?> or a missing encoding attribute, in which case the xml
spec states that the string must be utf-8 encoded.

The encoding of the XML preamble should not matter when parsing a XML
document stored as unicode string.


That encoding is a *lie*, which is the real problem.  Parsers expect it
to be *correct*, and if missing, expect the text to be encoded as UTF-8,
per the spec (if the document comes from an HTTP request, then the
application may supply the encoding from the request headers).

Nothing in the XML specs allows or specifies any behavior for XML
documents serialized as unicode, because such serializations are
*programming language specific*.


While I agree that the encoding declaration is ambiguous at best and 
should be rejected, you can find a bit in the spec which supports XML as 
Python unicode strings. A Python unicode string can be seen as a string 
with external character encoding information: it's the native encoding 
of Python. Therefore we can make sense of it in an XML parser. For my 
previous analysis of the spec see here:


http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html

What however is bad and evil is to just ignore conflicting encoding 
declarations in an XML document itself. I'd choose either one of:


* bail with a clear error when unicode is supplied at all

* bail with a clear error when unicode is supplied with any explicit 
encoding declaration in the XML.


It is of importance as soon as you 
convert the document back to a stream e.g. when we deliver the content
back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with 
that by changing the encoding parameter of the preamble for XML documents 
based on the desired output encoding. utf-8 is always a good choice; however,
other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2
publisher avoids this problem converting the unicode result using 
errors='replace' (which is likely something we might discuss :-))


Unicode XML is not only problematic for streaming. For instance, you
*can't* pass a Unicode string to libxml2 *at all*, unless you want
a core dump.  The API requires that you pass it strings encoded as UTF8.


You can in lxml. :) libxml2 as a C API doesn't even support any unicode 
string type as far as I am aware.


Regards,

Martijn




[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Martijn Faassen

Andreas Jung wrote:

--On 15. Januar 2007 15:44:01 +0100 Martijn Faassen 
[EMAIL PROTECTED] wrote:

On 1/15/07, Andreas Jung [EMAIL PROTECTED] wrote:
[snip]

ok, got it. But this problem can be solved easily by changing the
encoding within the preamble.


I would say refusing to guess and bailing out with an error message is
better in this case. The Zen of Python:

In the face of ambiguity, refuse the temptation to guess.



Sorry but I don't get your point. What's happening with a XML inside a ZPT?


My point is that:

u'<?xml version="1.0" encoding="ISO-8859-1"?><foo>Some non-ascii text</foo>'

is confusing at best. One part of this says it's a unicode string, the 
other part says it's in encoding latin-1. What is it? What happens to 
this if you recode this to, say, UTF-8? What happens to this if you 
parse and *then* serialize it? What does the developer expect will 
happen? What do users expect when they enter XML in a form and include 
an encoding declaration?


I proposed we make nobody worry about this by simply not accepting this.


- XML data encoded as XXX comes in (either by editing the XML file through
  the ZMI or FTP/WebDAV upload)

- ZPT converts the encoded string to unicode based on the encoding in 
the preamble


- for parsing it is up to the application to decide what to do with the
  data. It is not up to the editor to decide how the ZPT engine should
  deal with XML internally. The ZPT engine decides to serialize the
  unicode string as utf-8 and to fix the XML preamble (which will result
  in a valid XML file which should be identical to the original file -
  except that the encoding might be different).



I still don't see what should be ambiguous with this approach.


Ambiguous in that the string seems to say it's in two encodings at once. 
You're then guessing: you're letting the Python string type trump the 
declaration. Then, since we've shown that leads to bugs, you propose to
actually change the encoding declaration of the XML document. I wonder
what people then expect to happen upon serialization. In effect, your 
proposal would, I think, serialize to UTF-8 only, right? (in which case 
the encoding declaration can be dropped as it's the default)
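
The serialization question can be tried out with the standard library's xml.etree.ElementTree in present-day Python 3, as a rough stand-in for whatever the ZPT engine does. The sample document is made up. In the Python versions I have looked at, serializing to 'utf-8' emits no XML declaration at all, while a non-default encoding such as 'iso-8859-1' gets a declaration added by the serializer.

```python
import xml.etree.ElementTree as ET

# Decoded input without any encoding declaration: nothing to contradict.
root = ET.fromstring("<foo>caf\u00e9</foo>")

# Serialize to UTF-8; the declaration is optional here since
# UTF-8 is what a parser assumes by default.
utf8_doc = ET.tostring(root, encoding="utf-8")

# Serialize to Latin-1; the serializer writes a matching declaration,
# so bytes and declaration agree by construction.
latin1_doc = ET.tostring(root, encoding="iso-8859-1")

print(utf8_doc)
print(latin1_doc)
```

In both cases the declaration is produced by the serializer from the encoding it actually used, rather than carried over from the input.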


Regards,

Martijn




[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Tres Seaver

Martijn Faassen wrote:
 Tres Seaver wrote:

 Andreas Jung wrote:
 --On 14. Januar 2007 18:14:45 + Chris Withers [EMAIL PROTECTED] 
 wrote:

 Dieter Maurer wrote:
 A halfway intelligent parser would accept Unicode when it gets it
 and concentrate on the remaining part of its task: either reporting
 structural events or building a parse tree.
 The trivial fix I use in Twiddler is as follows:

 if isinstance(source, unicode):
     source = source.encode('utf-8')

 Of course, this assumes a heading of either <?xml version="1.0"
 encoding="utf-8"?> or a missing encoding attribute, in which case the xml
 spec states that the string must be utf-8 encoded.
 The encoding of the XML preamble should not matter when parsing a XML
 document stored as unicode string.
 That encoding is a *lie*, which is the real problem.  Parsers expect it
 to be *correct*, and if missing, expect the text to be encoded as UTF-8,
 per the spec (if the document comes from an HTTP request, then the
 application may supply the encoding from the request headers).

 Nothing in the XML specs allows or specifies any behavior for XML
 documents serialized as unicode, because such serializations are
 *programming language specific*.
 
 While I agree that the encoding declaration is ambiguous at best and 
 should be rejected, you can find a bit in the spec which supports XML as 
 Python unicode strings. A Python unicode string can be seen as a string 
 with external character encoding information: it's the native encoding 
 of Python. Therefore we can make sense of it in an XML parser. For my 
 previous analysis of the spec see here:
 
 http://codespeak.net/pipermail/lxml-dev/2006-May/001137.html
 
 What however is bad and evil is to just ignore conflicting encoding 
 declarations in an XML document itself. I'd choose either one of:
 
 * bail with a clear error when unicode is supplied at all
 
 * bail with a clear error when unicode is supplied with any explicit 
 encoding declaration in the XML.
 
 It is of importance as soon as you 
 convert the document back to a stream e.g. when we deliver the content
 back to a browser or a FTP client. The ZPublisher (for Zope 2) deals with 
 that by changing the encoding parameter of the preamble for XML documents 
 based on the desired output encoding. utf-8 is always a good choice; however,
 other encodings like iso-8859-15 might raise UnicodeEncodeErrors. The Zope 2
 publisher avoids this problem converting the unicode result using 
 errors='replace' (which is likely something we might discuss :-))
 Unicode XML is not only problematic for streaming. For instance, you
 *can't* pass a Unicode string to libxml2 *at all*, unless you want
 a core dump.  The API requires that you pass it strings encoded as UTF8.
 
 You can in lxml. :) libxml2 as a C API doesn't even support any unicode 
 string type as far as I am aware.

It *requires* UTF-8-encoded strings.  See http://xmlsoft.org/xml.html

  12. So what is this funky xmlChar used all the time?

  It is a null terminated sequence of utf-8 characters. And only
  utf-8! You need to convert strings encoded in different ways to
  utf-8 before passing them to the API. This can be accomplished
  with the iconv library for instance.

Frankly, I don't get the desire to *store* a complete XML document (as
opposed to the extracted contents of attributes or nodes) as unicode:
it isn't as though it can be easily processed in that form without
re-encoding (even if lxml is the one doing the re-encoding).  It isn't
discourse, in the Zope3 sense of text intended for human
consumption, and the tools people use with it are all going to expect
some kind of validly-encoded string.


Tres.
--
Tres Seaver          +1 540-429-0999
Palladion Software   "Excellence by Design"    http://palladion.com



Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-15 Thread Andreas Jung



--On 15. Januar 2007 22:15:46 +0100 Martijn Faassen 
[EMAIL PROTECTED] wrote:


My point is that:

u'<?xml version="1.0" encoding="ISO-8859-1"?><foo>Some non-ascii text</foo>'

is confusing at best. One part of this says it's a unicode string, the
other part says it's in encoding latin-1.


The string above would be used for internal storage but *not* for 
processing. Btw. this is not different from storing HTML files as unicode 
string. An application must convert the unicode string back to a serialized
string - either to the encoding as specified inside the preamble or to a 
'general' encoding (that covers the unicode database) like utf-8 with 
changing the encoding inside the preamble - both are legitimate approaches.

There is no ambiguity. A smart XML parser will represent an XML document
*independent* of the source encoding in the most general way (storing textual
content as unicode, or utf-8 at least).


I still don't see what should be ambiguous with this approach.


Ambiguous in that the string seems to say it's in two encodings at once.
You're then guessing: you're letting the Python string type trump the
declaration. Then, since we've shown that leads to bugs, you propose
actually change the encoding declaration of the XML document. I wonder
what people then expect to happen upon serialization. In effect, your
proposal would, I think, serialize to UTF-8 only, right? (in which case
the encoding declaration can be dropped as it's the default).


When you download a ZPT through FTP/WebDAV then the unicode representation
of the XML will be converted using the 'output_encoding' property of the
corresponding ZPT which is set when uploading a new XML document (and taken
from the preamble). So when you upload a latin1 XML file you should get it
back as valid latin1 through FTP/WebDAV.


When you download text/xml content through the ZPublisher then the
ZPublisher will convert unicode textual content to some encoding which is
either taken from an already set 'content-type: text/...; charset=X'
HTTP Header or as fallback from the zpublisher-default-encoding property
as defined in the zope.conf file.

So the application can specify in both cases the encoding of the serialized
XML content. Where is the problem?
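
The selection logic described above (charset from an existing Content-Type header, otherwise a configured default) could be sketched roughly like this in present-day Python 3. The function names pick_output_encoding and serialize, and the default of 'utf-8', are assumptions for illustration; this is not the actual ZPublisher code.

```python
def pick_output_encoding(content_type, default_encoding="utf-8"):
    """Return the charset parameter of a Content-Type value, or the default."""
    for param in content_type.split(";")[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset" and value.strip():
            return value.strip().strip('"')
    return default_encoding


def serialize(text, content_type=None, default_encoding="utf-8"):
    """Encode decoded text for the response using the chosen charset."""
    encoding = pick_output_encoding(content_type or "", default_encoding)
    # errors='replace' mirrors the lossy fallback mentioned in this thread.
    return text.encode(encoding, errors="replace")


print(pick_output_encoding("text/xml; charset=iso-8859-15"))
print(serialize("\u00fc", "text/xml; charset=iso-8859-1"))
```

Note that a sketch like this only picks the byte encoding; keeping the declaration inside the XML in step with it is a separate concern.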

Andreas





[Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-14 Thread Philipp von Weitershausen

Andreas Jung wrote:

Hi,

the XMLParser.parseString() method  raises an exception

  File "/opt/python-2.4.4/lib/python2.4/unittest.py", line 260, in run
    testMethod()
  File "/Users/ajung_data/sandboxes/Zope/Zope/lib/python/zope/tal/tests/test_xmlparser.py", line 127, in test_xx
    self._run_check(xml, ())
  File "/Users/ajung_data/sandboxes/Zope/Zope/lib/python/zope/tal/tests/test_xmlparser.py", line 106, in _run_check
    parser.parseString(source)
  File "/Users/ajung_data/sandboxes/Zope/Zope/lib/python/zope/tal/xmlparser.py", line 77, in parseString
    self.parser.Parse(s, 1)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 43-48: ordinal not in range(128)


if the string to be parsed is a unicode string and contains some non-ascii
chars. The following snippet from a private unittest (test_xmlparsers.py)
shows the error.

    def test_xx(self):
        xml = unicode('<?xml version="1.0" encoding="utf-8"?><foo>üöä</foo>', 'iso-8859-15')
        self._run_check(xml, ())

I am not sure if this behavior is intentional?! Is the XMLParser supposed
to deal with unicode strings or will it only accept a standard Python 
string?


Traditionally, you parse an 8bit string, figure out its encoding (e.g.
from <?xml encoding="utf-8"?>) and return some representation of that XML
with unicode data. That's why it's actually quite ok for XML parsers to 
only accept string data.
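
As a small illustration of that traditional flow in present-day Python 3 (bytes in, decoded text out), using the standard library's xml.etree.ElementTree as a stand-in parser and a made-up sample document:

```python
import xml.etree.ElementTree as ET

# An 8-bit document whose bytes really are in the declared encoding.
data = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<foo>\u00fc\u00f6\u00e4</foo>"
).encode("iso-8859-1")

# The parser reads the declaration, decodes the bytes accordingly,
# and hands back decoded text.
root = ET.fromstring(data)

print(type(root.text).__name__, repr(root.text))
```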


With ZPTs it's a bit different: When editing ZPTs TTW for example, we 
like to store its source in unicode. So it makes sense for us to be able 
to parse unicode input as XML.



A workaround inside parseString() would be to check for unicode
and convert the string on-the-fly to a Python string with utf-8 encoding.
This is possibly a limitation of the underlying Expat parser...any 
recommendation how to deal with this issue?


Fixed it in 3.3 and trunk. If you had given me a bit more time, this 
could even have been in 2.10.2b :). Oh well, I guess that's what 2.10.2 
will be for ;)



--
http://worldcookery.com -- Professional Zope documentation and training
2nd edition of Web Component Development with Zope 3 is now shipping!



Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-14 Thread Dieter Maurer
Philipp von Weitershausen wrote at 2007-1-14 14:59 +0100:
 ...
Traditionally, you parse an 8bit string, figure out its encoding (e.g.
from <?xml encoding="utf-8"?>) and return some representation of that XML
with unicode data. That's why it's actually quite ok for XML parsers to 
only accept string data.

Parsing usually means rebuilding the structure from a text string and *NOT*
encoding guessing or Unicode decoding.

Therefore, it is actually quite stupid for a parser
to try to encode an already decoded string (i.e. a Unicode string)
only so that it can guess the encoding ;-)
A halfway intelligent parser would accept Unicode when it gets it
and concentrate on the remaining part of its task: either reporting
structural events or building a parse tree.



-- 
Dieter



Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-14 Thread Chris Withers

Dieter Maurer wrote:

A halfway intelligent parser would accept Unicode when it gets it
and concentrate on the remaining part of its task: either reporting
structural events or building a parse tree.


The trivial fix I use in Twiddler is as follows:

if isinstance(source,unicode):
  source = source.encode('utf-8')

Of course, this assumes a heading of either <?xml version="1.0"
encoding="utf-8"?> or a missing encoding attribute, in which case the
xml spec states that the string must be utf-8 encoded.


The problem comes when someone sends you something like:

u'<?xml version="1.0" encoding="something-else"?><node />'

What should be done then?

Chris

--
Simplistix - Content Management, Zope & Python Consulting
   - http://www.simplistix.co.uk



Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-14 Thread Philipp von Weitershausen

On 14 Jan 2007, at 18:37 , Dieter Maurer wrote:

Philipp von Weitershausen wrote at 2007-1-14 14:59 +0100:

...
Traditionally, you parse an 8bit string, figure out its encoding (e.g.
from <?xml encoding="utf-8"?>) and return some representation of that XML
with unicode data. That's why it's actually quite ok for XML parsers to
only accept string data.


Parsing usually means rebuilding the structure from a text string and *NOT*
encoding guessing or Unicode decoding.

Therefore, it is actually quite stupid for a parser
to try to encode an already decoded string (i.e. a Unicode string)
only so that it can guess the encoding ;-)
A halfway intelligent parser would accept Unicode when it gets it
and concentrate on the remaining part of its task: either reporting
structural events or building a parse tree.


Yes, I agree. Unfortunately, expat isn't smart enough, which caused  
this whole discussion.





Re: [Zope3-dev] Re: zope.tal.xmlparser.XMLParser() dislikes unicode

2007-01-14 Thread Philipp von Weitershausen

On 14 Jan 2007, at 19:14 , Chris Withers wrote:

Dieter Maurer wrote:

A halfway intelligent parser would accept Unicode when it gets it
and concentrate on the remaining part of its task: either reporting
structural events or building a parse tree.


The trivial fix I use in Twiddler is as follows:

if isinstance(source,unicode):
  source = source.encode('utf-8')


It's the same fix I used.

Of course, this assumes a heading of either <?xml version="1.0"
encoding="utf-8"?> or a missing encoding attribute, in which case
the xml spec states that the string must be utf-8 encoded.


The problem comes when someone sends you something like:

u'<?xml version="1.0" encoding="something-else"?><node />'

What should be done then?


Not sure. We could ignore it or raise an error. I'm inclined to  
ignore it.



