Andreas Jung wrote:


--On 15 January 2007 22:15:46 +0100 Martijn Faassen
[snip]
I still don't see what should be ambiguous with this approach.

Ambiguous in that the string seems to say it's in two encodings at once.
You're then "guessing": you're letting the Python string type trump the
declaration. Then, since we've shown that leads to bugs, you propose
to actually change the encoding declaration of the XML document. I wonder
what people then expect to happen upon serialization. In effect, your
proposal would, I think, serialize to UTF-8 only, right? (In which case
the encoding declaration can be dropped, as it's the default.)

When you download a ZPT through FTP/WebDAV then the unicode representation
of the XML will be converted using the 'output_encoding' property of the
corresponding ZPT which is set when uploading a new XML document (and taken
from the preamble). So when you upload a latin1 XML file you should get it back as valid latin1 through FTP/WebDAV.

Okay, understood, this makes sense in the case of the FTP/WebDAV support, though recoding to UTF-8 and ripping off the encoding declaration would also be pretty safe in case of XML.

When you download text/xml content through the ZPublisher then the ZPublisher will convert unicode textual content to some encoding which is
either taken from an already set 'content-type: text/...; charset=XXXXX'
HTTP Header or as fallback from the zpublisher-default-encoding property
as defined in the zope.conf file.

And the same behavior actually applies to HTML content, right?

So the application can specify in both case the encoding of the serialized
XML content. Where is the problem?

What I'm trying to express here is that this stuff should not be treated as "where is the problem?" but should be thought through carefully as this is extremely easy to do wrong. I'll think it through carefully here. Let's list some cases:

A) FTP download: stored XML gets downloaded through FTP/WebDAV support.

B) FTP upload: external XML gets uploaded through FTP/WebDAV

C) parse: stored XML is parsed inside of Zope by the page template engine.

D) publisher download: stored XML is downloaded as text/xml directly through the publisher

E) ZPT inclusion: stored XML is included in another page template, for instance to present it in a text area.

F) form submit: Text area is then saved and needs to be stored again.

Andreas Jung proposal (speculation)
===================================

As far as I understand it you're proposing:

* store XML as unicode text

* separately store the encoding on the page template object

* also keep the encoding="" bit in the XML preamble when storing.

Let's go through the cases

A) FTP download: encode this to whatever encoding is stored on the ZPT object using Python unicode support. No encoding mangling necessary.

B) FTP upload: read encoding="" bit and store this on ZPT. Then decode to unicode using that encoding. Could not be implemented by a parse/serialization step without extra encoding="" manipulation afterwards (after decoding to unicode).

C) parse: Rip out the 'encoding=""' bit before you send it into the parser. Encode to UTF-8 just before entering the parser.

D) publisher download: Rip out the 'encoding=""' bit. Then encode according to response header (or zope.conf). Then add back an encoding="" bit naming the output encoding if the output is non-UTF-8 (not Python names like 'latin1' but encoding identifiers XML is aware of).

E) ZPT inclusion: Send the unicode text to the page template. encoding="" bit will be presented in the editor.

F) form submit: decode to unicode according to encoding of page that displayed edit form and store it. Read 'encoding=' bit and store it in ZPT object. Don't manipulate 'encoding=""' bit in XML.

encoding="" removal: C, D
encoding="" adding: D
encoding="" reading: B, F
encode from unicode: A, C, D
decode to unicode: B, F

no encoding="" manipulation required: A, E
no recoding required: E
straightforward: E

The forms editor scenario (E and F) is potentially confusing as the user may be tempted by the ability to use encoding="" to paste latin-1 XML text. Editor could say it only wants it in whatever encoding the page is in, though.
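
To make cases A and B of this scenario concrete, here is a minimal sketch in present-day Python (where "unicode" is `str` and uploaded data is `bytes`). The `StoredTemplate` class and the `declared_encoding` helper are hypothetical names for illustration, not the real ZPT implementation, and the regular expression only handles ASCII-compatible encodings.

```python
import re

# Matches encoding="..." inside a leading XML declaration (bytes input).
_DECL_ENCODING = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(raw, default="utf-8"):
    """Return the encoding named in the XML declaration, or the default."""
    match = _DECL_ENCODING.match(raw)
    return match.group(1).decode("ascii") if match else default


class StoredTemplate:
    """Hypothetical stand-in for a ZPT object; not the real Zope class."""

    def upload(self, raw):
        # Case B: read encoding="" and remember it, then decode to unicode.
        self.output_encoding = declared_encoding(raw)
        self.text = raw.decode(self.output_encoding)

    def download(self):
        # Case A: encode with the remembered encoding; the declaration
        # inside the text is left untouched.
        return self.text.encode(self.output_encoding)
```

Note that the stored unicode text still carries an encoding="" bit that no longer describes it, which is exactly why cases C and D need the extra removal step.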

Martijn Faassen proposal
========================

If you rip out the encoding before data is stored in the page template and then store as unicode, then we have the following cases:

A) FTP download: Encode to UTF-8, output in UTF-8 only. No encoding mangling necessary.

B) FTP upload: read encoding="" bit and decode to unicode accordingly. Rip out encoding="". Could be done by a parse/serialization step, then decode result to unicode.

C) parse: encode to UTF-8 just before entering the parser.

D) publisher download: Encode according to response header or zope.conf. Add in encoding="" if output is non-UTF-8 using XML names for encoding.

E) ZPT inclusion: send unicode text to the page template. No encoding="" bit will be in the XML presented in the editor.

F) form submit: Rip out any encoding="" before storing, ignoring it as XML was in output encoding, then convert to unicode using input encoding.

encoding="" removal: B, F
encoding="" adding: D
encoding="" reading: B
encode from unicode: A, C, D
decode to unicode: B, F

no encoding="" manipulation required: A, C, E
no recoding required: E
straightforward: E

No storage of encoding information on ZPT object is necessary.

Case B) potentially confusing as upon re-download XML document will be recoded to UTF-8 (though XML editors should be able to deal with this as it's the default).

Form edit still potentially confusing as encoding="" bit disappears, but at least suggestion to user is not made that information *presented* in a textarea is in a particular encoding specified in the encoding="" bit.
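
A rough sketch of the two places where this scenario touches the encoding="" bit: removal on input (B, F) and adding on publisher output (D). The function names and the small `XML_NAMES` table are assumptions made for illustration; a real implementation would need a fuller mapping from Python codec names to the names XML knows.

```python
import codecs
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_ENCODING_ATTR = re.compile(
    r'^(<\?xml[^>]*?)\s+encoding\s*=\s*(["\'])[^"\']*\2'
)
# Matches the declaration up to and including the version pseudo-attribute.
_VERSION_ATTR = re.compile(r'^<\?xml\s+version\s*=\s*(["\'])[^"\']*\1')

# Illustrative mapping from Python codec names to XML encoding names.
XML_NAMES = {"iso8859-1": "ISO-8859-1", "cp1252": "windows-1252"}


def strip_encoding(text):
    """Cases B and F: drop encoding="" from already decoded unicode text."""
    return _ENCODING_ATTR.sub(r"\1", text, count=1)


def publish(text, charset):
    """Case D: encode for output, declaring the encoding unless it is UTF-8."""
    python_name = codecs.lookup(charset).name  # normalizes e.g. 'latin1'
    if python_name != "utf-8":
        # Unknown codecs raise KeyError here rather than guessing a name.
        attr = ' encoding="%s"' % XML_NAMES[python_name]
        text, count = _VERSION_ATTR.subn(
            lambda m: m.group(0) + attr, text, count=1
        )
        if not count:
            # No declaration present: add one so the output stays valid.
            text = '<?xml version="1.0"%s?>\n%s' % (attr, text)
    return text.encode(python_name)
```

For example, `publish(strip_encoding(uploaded_text), "latin1")` would yield bytes whose declaration names ISO-8859-1, while a UTF-8 response carries no encoding="" at all.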

Tres Seaver proposal (speculation)
==================================

Storage in UTF-8.

A) FTP download: output in UTF-8 only, can be done directly.

B) FTP upload: read encoding="" bit and, if not UTF-8, decode to unicode accordingly. Then recode to UTF-8. Rip out encoding="". Could be done by an XML parse/serialization step.

C) parse: just pass UTF-8 to parser.

D) publisher download: Decode to unicode. Then recode to desired output encoding (with XML names for encoding added in the encoding="" bit).

E) ZPT inclusion: Decode text to unicode. No encoding="" bit will be in the XML presented in the editor.

F) form submit: Rip out any encoding="" before storing, ignoring it as XML was in output encoding, then convert to unicode using that encoding, then convert again to UTF-8.

encoding="" removal: B, F
encoding="" adding: D
encoding="" reading: B
encode from unicode: B, D, F
decode to unicode: B, D, F

no encoding="" manipulation required: A, C, E
no recoding required: A, C (B and F if UTF-8 uploaded)
straightforward: A, C

No storage of encoding information in ZPT object is necessary.

Case B) potentially confusing as upon re-download XML document will be recoded to UTF-8 (though XML editors should be able to deal with this as it's the default).

Form edit still potentially confusing as encoding="" bit disappears, but at least suggestion to user is not made that information *presented* in a textarea is in a particular encoding specified in the encoding="" bit.
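
The "XML parse/serialization step" mentioned for case B could look roughly like the sketch below, which uses the standard library's `xml.dom.minidom`. The function names are hypothetical, and this assumes the upload is well-formed XML; a parse/serialize round trip also does not preserve the original byte-for-byte formatting.

```python
from xml.dom import minidom


def to_utf8_storage(raw):
    """Case B: parse the upload in its declared encoding, store as UTF-8."""
    # The parser honours the encoding="" bit of the uploaded bytes.
    document = minidom.parseString(raw)
    # toxml() without an encoding argument returns unicode text whose
    # declaration has no encoding="" bit; encode that text to UTF-8.
    return document.toxml().encode("utf-8")


def parse_stored(stored):
    """Case C: stored UTF-8 bytes can be passed to the parser as they are."""
    return minidom.parseString(stored)
```

Since UTF-8 is the XML default, the stored bytes need no declaration of their own for the parser to read them correctly.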

Just store the XML text
=======================

Store the XML text literally as received. Maybe this is actually what Tres meant. :)

A) FTP download: output can be done directly.

B) FTP upload: store input directly

C) parse: just pass text to parser.

D) publisher download: Decode to unicode using encoding="" bit. Remove encoding bit. Then recode to desired output encoding (with XML names for encoding added in the encoding="" bit).

E) ZPT inclusion: Decode text to unicode using encoding="" bit.

F) form submit: Encode text in form from unicode according to encoding="" bit.

encoding="" removal: D
encoding="" adding: D
encoding="" reading: B, D, E, F
encode from unicode: D, F
decode to unicode: D, E

no encoding="" manipulation required: A, C (but B, E, F only reading)
no recoding required: A, B, C
straightforward: A, C

No storage of encoding information in ZPT object is necessary, though could be done to optimize extraction of encoding=""

Form edit potentially confusing as in Andreas Jung scenario.
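
A small sketch of what cases E and F could look like when the bytes are stored as received, with the declared encoding cached on first use. `RawXML` is a hypothetical holder class, not an existing Zope API, and the declaration parsing is deliberately simplistic (ASCII-compatible encodings only).

```python
import re
from functools import cached_property

# Matches encoding="..." inside a leading XML declaration (bytes input).
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


class RawXML:
    """Hypothetical holder for XML bytes stored exactly as received."""

    def __init__(self, raw):
        self.raw = raw  # cases A, B, C use these bytes directly

    @cached_property
    def encoding(self):
        # Read the encoding="" bit once; XML defaults to UTF-8 without it.
        match = _DECL.match(self.raw)
        return match.group(1).decode("ascii") if match else "utf-8"

    def as_unicode(self):
        """Case E: decode for display, e.g. in a text area."""
        return self.raw.decode(self.encoding)

    @classmethod
    def from_form(cls, text):
        """Case F: encode submitted text according to its encoding="" bit."""
        # The declaration itself is ASCII, so a lossy ASCII copy is enough
        # to find out which encoding the user asked for.
        probe = cls(text.encode("ascii", "replace"))
        return cls(text.encode(probe.encoding))
```

If the submitted text contains characters the declared encoding cannot represent, `from_form` raises `UnicodeEncodeError`, which is one way the form scenario can surprise the user.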

..............

Any use cases I missed or got wrong? The scenarios are all complicated. :)


The "Andreas Jung" scenario has the "leave the XML text alone except make it unicode" goal in mind, but actually ends up messing about with "encoding=""" more than the other scenarios.

The "Martijn Faassen" scenario tries to follow the rule: decode to unicode on input, get rid of encoding="" in XML, and encode only on output as much as possible, with the exception of the parser call.

The Tres Seaver scenario as I sketched it has the "turn the XML into UTF-8" goal. It needs to do recoding less frequently than the other scenarios, though more frequently than one would hope.

The "just store the XML" scenario is surprisingly nice. It only needs attention to encoding and decoding in the always complicated ZPublisher direct output scenario, and in the edit form scenario.

The "just store XML" proposal starts to look attractive. It requires very little actual XML text manipulation, only in D, and while it does require more reading of the encoding="" bit, this can be cached and at least doesn't require string manipulation. Care can be taken that there is an API to represent the XML as unicode strings - this is done for display purposes only (clearly human readable text) and this is the only case where the encoding="" bit is rather misleading.

Regards,

Martijn
