Andreas Jung wrote:

[Bernd Dorn]
IMHO it should only accept strings, because the value should be an XML
string and therefore always has to be encoded in 'utf-8' or in the
encoding specified in the processing instruction.

I disagree with that. Since Zope 3 is supposed to use unicode internally
(at least that's the legend) it should support unicode also at the parser level. Other languages like Java store XML also as unicode strings and support parsing it.

Bernd Dorn raises a good point though, and it's one you need to think about carefully. To say "languages like Java store XML also as unicode" is rather ambiguous. While I'm not aware of the details of Java, serialized XML is typically stored in some encoded form, most commonly UTF-8 (the default 8-bit encoding), but Latin-1 is also supported, and there are also multi-byte encodings. *Parsed* XML exposed through a DOM is exposed as unicode strings. I'm sure Java supports this usage pattern, as naturally files on disk need to be parsable.

Here you are talking about parsing XML, so maintaining the position that the input should be encoded is a reasonable one. This is how, for instance, Python's ElementTree operates (parse encoded input, expose unicode (or pure ASCII) strings through the API). It was designed by Fredrik Lundh, who, as you may know, was instrumental in developing Python's unicode support.
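
To make the "parse encoded, expose unicode" pattern concrete, here is a minimal sketch using the ElementTree module from the standard library of a current Python, where `str` is the unicode string type. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Serialized XML: bytes in the encoding named by the XML declaration.
# (The sample document is invented for this example.)
data = u'<?xml version="1.0" encoding="ISO-8859-1"?><foo>caf\u00e9</foo>'.encode("iso-8859-1")

# The parser reads the declaration and decodes the bytes accordingly.
root = ET.fromstring(data)

# The parsed text is exposed as a unicode string, not as Latin-1 bytes.
text = root.text
```

The caller never has to know which encoding the document was stored in; that knowledge lives in the serialized form and is consumed by the parser.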

How would you propose to parse the following unicode string?

u'<?xml version="1.0" encoding="ISO-8859-1"?><foo />'

If you are going to allow the parsing of unicode strings, I would strongly recommend *rejecting* as ambiguous any unicode string that itself declares an encoding: refuse to guess.
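
A small sketch of why guessing is dangerous, again using the standard library ElementTree on a current Python. The document is a variant of the one above with a non-ASCII character added (my addition, for illustration) so that the two possible guesses give visibly different results.

```python
import xml.etree.ElementTree as ET

# Already-decoded text that still claims to be ISO-8859-1.
doc = u'<?xml version="1.0" encoding="ISO-8859-1"?><foo>\u00e9</foo>'

# Guess 1: trust the declaration when turning the text back into bytes.
trusting = ET.fromstring(doc.encode("iso-8859-1")).text

# Guess 2: ignore the declaration and encode as UTF-8. The parser then
# decodes those UTF-8 bytes as ISO-8859-1 and silently corrupts the text.
ignoring = ET.fromstring(doc.encode("utf-8")).text
```

Both calls succeed without any error, but only the first one returns the original character. The second returns two characters of mojibake, which is exactly the kind of silent user error that rejecting such input avoids.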

With lxml (which is an extension of the ElementTree API) we've taken exactly that approach: it's possible to pass a unicode string into the parser, but if it contains an encoding declaration, there will be an error. Underneath we actually re-encode the string back to UTF-8, as that's what the libxml2 parser expects. We made this change despite the objections of Fredrik Lundh, by the way. We felt user errors would mostly be prevented because the parser refuses to guess.
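
The policy described above can be sketched with the standard library alone. This is not lxml's actual implementation: `parse_xml` and `_ENCODING_DECL` are hypothetical names, and the regular expression is a simplified check for an encoding declaration, assumed good enough for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified check for an encoding pseudo-attribute in a
# leading XML declaration.
_ENCODING_DECL = re.compile(
    r'^\s*<\?xml[^>]*?\bencoding\s*=\s*["\'][^"\']+["\']'
)

def parse_xml(source):
    """Parse bytes as declared; parse unicode only when it declares no encoding."""
    if isinstance(source, str):
        if _ENCODING_DECL.match(source):
            # Refuse to guess: the declaration may contradict the text.
            raise ValueError(
                "unicode string with an encoding declaration is ambiguous"
            )
        # No declaration, so re-encode to UTF-8, the parser's default.
        source = source.encode("utf-8")
    return ET.fromstring(source)
```

With this helper, `parse_xml(u'<foo />')` and any byte string work as usual, while the unicode string with the ISO-8859-1 declaration shown earlier raises `ValueError` instead of being parsed on a guess.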



