[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2011-10-28 Thread Florent Xicluna


Florent Xicluna florent.xicl...@gmail.com added the comment:

3.1 is no longer in scope for this issue.

--
resolution:  - out of date
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue8047
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-08-08 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

I would suggest fixing the tostring() behaviour also in a future 3.1.x bug fix
release. After all, the current behaviour means that 3.0 and 3.1 would behave
differently from any other (released or future) Python version here.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-08-08 Thread Florent Xicluna


Florent Xicluna florent.xicl...@gmail.com added the comment:

Done for 3.2 with r83851.

Still open, if someone wants to propose a patch for 3.1.

--
assignee: effbot - 
keywords: +easy -patch
stage: commit review - needs patch
versions:  -Python 3.2


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-08-08 Thread Guido van Rossum


Changes by Guido van Rossum gu...@python.org:


--
nosy:  -gvanrossum


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-07-31 Thread Florent Xicluna


Changes by Florent Xicluna florent.xicl...@gmail.com:


Removed file: http://bugs.python.org/file16543/issue8047_etree_encoding.diff


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-07-31 Thread Florent Xicluna


Florent Xicluna florent.xicl...@gmail.com added the comment:

Patch updated here, and on Rietveld too.
http://codereview.appspot.com/664043

Rules (as discussed):
 - tree.tostring(encoding=None)  => encodes to US-ASCII
   (compatible with 2.7 and lxml.etree)
 - tree.tostring(encoding="unicode") => outputs Unicode
 - tree.tostring(encoding=str) => outputs Unicode
   (compatible with lxml.etree)

For 2.7, no change planned.
For 3.1, do we keep the current behavior?
  - tree.tostring(encoding=None)  => outputs Unicode
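
As a rough illustration of the first two rules, here is a minimal sketch against the standard-library xml.etree.ElementTree of a current Python 3 (3.2 or later is assumed). The element name and text are made up for the example. The encoding=str spelling is left out on purpose: it is the lxml.etree form, and this sketch does not assume the standard library accepts it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample element with one non-ASCII character.
elem = ET.Element("tag")
elem.text = "hell\u00f6"

# No encoding given: bytes in US-ASCII, non-ASCII text as character references.
as_bytes = ET.tostring(elem)

# The "unicode" pseudo-encoding: a str, with no encoding step at all.
as_text = ET.tostring(elem, encoding="unicode")

print(type(as_bytes).__name__, as_bytes)
print(type(as_text).__name__, as_text)
```

The point of the comparison is the return type: the default call yields bytes, and only the explicit "unicode" request yields text.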

--
components: +XML
stage: patch review - commit review
Added file: http://bugs.python.org/file18286/issue8047_etree_encoding_v2.diff


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-22 Thread Florent Xicluna


Florent Xicluna florent.xicl...@gmail.com added the comment:

http://codereview.appspot.com/664043 (patch against 3.x)

IIUC, the changes proposed (for 3.2) are:
 - default encoding or bool(encoding) == False
   ==> fallback to 'US-ASCII' encoding (instead of Unicode)
 - encoding=str or encoding='unicode'
   ==> serialize to Unicode

And it changes the behavior of:
 - ET.write()
 - tostring()
 - tostringlist()

For 2.x we could add the options for Unicode output:
 - encoding=unicode
 - and encoding='unicode'

--
assignee: georg.brandl - effbot
stage: test needed - patch review


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-22 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

> Supporting unicode for lxml.etree compatibility is fine with me, but I
> think it might make sense to support the string "unicode" as well (as
> a pseudo-encoding -- it's pretty clear to me that nobody will ever
> define a real character encoding with that name :-).

The reason I chose the unicode type over a 'unicode' string name at the time 
was that I wanted to make a clear distinction to show that this is not just 
selecting a different codec but that it changes the output type.

I don't really care either way, though, given that this reads a lot less well 
in Py3. If ET supports both, lxml will follow.

Stefan

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-22 Thread Marc-Andre Lemburg


Marc-Andre Lemburg m...@egenix.com added the comment:

Stefan Behnel wrote:
>
> Stefan Behnel sco...@users.sourceforge.net added the comment:
>
> > Supporting unicode for lxml.etree compatibility is fine with me, but I
> > think it might make sense to support the string "unicode" as well (as
> > a pseudo-encoding -- it's pretty clear to me that nobody will ever
> > define a real character encoding with that name :-).
>
> The reason I chose the unicode type over a 'unicode' string name at the time
> was that I wanted to make a clear distinction to show that this is not just
> selecting a different codec but that it changes the output type.
>
> I don't really care either way, though, given that this reads a lot less well
> in Py3. If ET supports both, lxml will follow.

There's always the possibility of adding a new official codec
called 'unicode' which converts Unicode to Unicode as no-op.

This may also be useful to have in other situations where you
want to signal a special case for Unicode input or output.

--
nosy: +lemburg


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-21 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

Hmm.  I'm not entirely sure about giving False a meaning when None has 
traditionally had a different (and documented) meaning.  And sleeping on it 
hasn't convinced me in either direction :-(

(well, I'd say no, but the compatibility argument is somewhat tempting)

I'm not that concerned by changing the default for write -- 3.x users with 
utf-8 as the default output encoding will get different output, but still 
perfectly valid XML.  3.x users with non-utf-8 default encodings  will get 
valid XML also in cases where it didn't work before.

tostring() is more problematic, but I'm leaning towards Guido's torpedoes 
approach there -- changing the default output to bytestrings is more likely to 
cause code to blow up than cause bad output, and you can trivially make your 
program backwards compatible by adding an extra check/decode after the call.  
Supporting unicode for lxml.etree compatibility is fine with me, but I think it
might make sense to support the string "unicode" as well (as a pseudo-encoding
-- it's pretty clear to me that nobody will ever define a real character
encoding with that name :-).
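
The "extra check/decode after the call" mentioned above might look roughly like the following sketch. The helper name tostring_as_text is hypothetical, and the decode step assumes that a bytes result from a default tostring() call is US-ASCII, which is the behaviour proposed in this thread.

```python
import xml.etree.ElementTree as ET

def tostring_as_text(element):
    """Hypothetical helper: return a str whatever type tostring() hands back."""
    data = ET.tostring(element)
    if isinstance(data, bytes):
        # Assumption: the default bytes output is US-ASCII, so this is safe.
        data = data.decode("ascii")
    return data

elem = ET.Element("tag")
elem.text = "hell\u00f6"
result = tostring_as_text(elem)
print(result)
```

On an interpreter where tostring() already returns text, the isinstance check is simply skipped, so the same calling code runs in both cases.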

Have you posted/can you post the patch to Rietveld, btw?  I have some questions
about the code that are independent of the encoding decision.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-14 Thread Florent Xicluna


Florent Xicluna florent.xicl...@gmail.com added the comment:

Currently tree.write(file) returns Unicode in 3.1 (and 3.x).
I would propose the following change:

>>> tree.write(file)
#  ==>  encode to ASCII without xml declaration (compatible 2.x)
>>> tree.write(file, encoding="utf-8")
#  ==>  encode to UTF-8 without xml declaration (compatible 2.x + 3.1)
>>> tree.write(file, encoding=False)
#  ==>  output Unicode, without xml declaration (compatible 3.1)

The xml_declaration keyword argument can be set to True explicitly.

For compatibility with lxml.etree, encoding=str returns the same as 
encoding=False.

Functions tostring() and tostringlist() will inherit the same behavior.
This change could be backported to 2.7, because it is backward compatible.

See proposed patch for implementation details.

--
keywords: +patch
Added file: http://bugs.python.org/file16543/issue8047_etree_encoding.diff


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-14 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

That's a funny idea. I like that. +1

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

> 'None' has always been the documented default for the encoding parameter

That's probably mostly by accident at least in original ET, but the 1.3 draft
docs at effbot.org/elementtree do spell it out explicitly for the 'write'
method:

   Output encoding. If omitted or set to None, defaults to US-ASCII.

Not sure I'd consider this text binding in itself, though (even if I'd argue 
that it's preferred to have the same interpretation of encoding everywhere).

> writing out the Unicode serialisation will result in an incorrect XML
> serialisation

I think Guido meant the ElementTree.write method; is that broken too?

The file.write(et.tostring()) issue is probably my most pressing concern here;
that's a common use case (e.g. when using iterparse to cut pieces from a big
document), and the defaults were chosen to increase the chance that this
automatically does the right thing for non-ASCII even if the programmer never
tests it.  In 3.X, that construct is suddenly dependent on the interpreter's
default encoding.
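
To make the dependence concrete, here is a small sketch, assuming a current Python 3 standard library. It encodes the same text serialisation with two encodings that stand in for two different platform defaults (the choice of UTF-8 and cp1252 is only illustrative) and compares that with the bytes serialisation.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("tag")
elem.text = "hell\u00f6"

# A str serialisation still has to be encoded by whoever writes it out,
# and the bytes on disk depend on which encoding that writer uses.
text = ET.tostring(elem, encoding="unicode")
written_as_utf8 = text.encode("utf-8")
written_as_cp1252 = text.encode("cp1252")

# A bytes serialisation is fixed at serialisation time.
fixed_bytes = ET.tostring(elem)

print(written_as_utf8)
print(written_as_cp1252)
print(fixed_bytes)
```

Neither of the first two byte strings carries an XML declaration, so a parser has no way to tell which encoding was used; the third is plain ASCII and parses the same way everywhere.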

I think I'd prefer old tostring behaviour and a separate tounicode 
function, and I'm still not convinced that the latter is required for the XML 
use case (which implies that maybe it should live in lxml.html for the HTML 
case, even if it ends up calling the same internal implementation).

Or should that be tobytes and tounicode to eliminate all ambiguity?

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

(what's the Python 3 replacement for the array module, btw?)

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

> 'None' has always been the documented default for the encoding parameter

What I meant here was that help(ET.tostring) will show you that as the 
default. Also, in the docs, the signature is tostring(tree, encoding=None), 
so None is the documented default value for the argument, regardless of the 
internal handling.


> > writing out the Unicode serialisation will result in an incorrect
> > XML serialisation
> I think Guido meant the ElementTree.write method; is that broken too?

Yes, the feature has been implemented deep down in the _encode() helper
function, so it impacts the entire serialiser, not only its API.


> I think I'd prefer old tostring behaviour and a separate tounicode
> function, and I'm still not convinced that the latter is required for the XML
> use case (which implies that maybe it should live in lxml.html for the HTML
> case, even if it ends up calling the same internal implementation).

I obviously agree that the use case for XML is feeble, but that alone doesn't
make this a convincing argument to move it into lxml.html when the
implementation will stay in lxml.etree anyway. Besides, that's pretty off-topic
for this bug tracker.


> Or should that be tobytes and tounicode to eliminate all ambiguity?

That might be the clean break-all-bridges solution, but I don't think the name 
tostring() is so inherently broken in Py3 that it needs fixing. It's not 
tostr(), for example.

I wouldn't raise much opposition against tobytes() as an alias for tostring(), 
although that sounds more like duplicating an otherwise simple API.

Stefan

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

> Yes, the feature has been implemented deep down in the _encode() helper
> function, so it impacts the entire serialiser, not only its API

Ouch.

>>> import locale
>>> locale.getpreferredencoding() == "utf-8"
False
>>> from xml.etree.ElementTree import *
>>> e = Element("tag")
>>> e.text = "hellö"
>>> tostring(e)
'<tag>hellö</tag>'
>>> ElementTree(e).write("out.xml")
>>> tree = parse("out.xml")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\lib\xml\etree\ElementTree.py", line 843, in parse
    tree.parse(source, parser)
  File "C:\Python31\lib\xml\etree\ElementTree.py", line 581, in parse
    parser.feed(data)
  File "C:\Python31\lib\xml\etree\ElementTree.py", line 1221, in feed
    self._parser.Parse(data, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9
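
One way to sidestep the locale dependence behind this traceback, sketched for a current Python 3: pass an explicit encoding to write() and request the XML declaration, so the file states its own encoding. The temporary directory and the out.xml name are placeholders for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

elem = ET.Element("tag")
elem.text = "hell\u00f6"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.xml")
    # Explicit encoding plus declaration: nothing is left to the platform default.
    ET.ElementTree(elem).write(path, encoding="utf-8", xml_declaration=True)
    with open(path, "rb") as handle:
        raw = handle.read()
    round_tripped = ET.parse(path).getroot().text

print(raw)
print(round_tripped)
```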

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

> I wouldn't raise much opposition against tobytes() as an alias for tostring(),
> although that sounds more like duplicating an otherwise simple API.

Adding an alias would be a way to address the 2.X/3.X terminology overlap; "string"
traditionally implies 8-bit in 2.X, and apparently now Unicode in 3.X.  That's
likely to cause a lot of confusion for people switching over (and to people
writing 3.X documentation, as well; the array module's documentation is an
example).

ET isn't the only thing with "tostring" functionality, of course -- it's pretty
much the standard name for "serialize data structure to byte string for later
transmission" -- so it probably wouldn't hurt with a python-dev pronouncement
here.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Fredrik Lundh


Changes by Fredrik Lundh fred...@effbot.org:


--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

> I wouldn't raise much opposition against tobytes() as an alias for tostring(),
> although that sounds more like duplicating an otherwise simple API.

Adding an alias would be a way to address the 2.X/3.X terminology overlap; "string"
traditionally implies 8-bit in 2.X, and apparently now Unicode in 3.X.  That's
likely to cause a lot of confusion for people switching from 2 to 3 (and to
people writing 3.X documentation, apparently; the array module's documentation
is an example of that).

(And once everyone has switched over, we can deprecate the tostring spelling...
:)

ET isn't the only thing with "tostring" functionality, of course -- it's pretty
much the standard name for "serialize data structure to byte string for later
transmission" -- so it probably wouldn't hurt with a python-dev pronouncement
here.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Florent Xicluna


Florent Xicluna florent.xicl...@gmail.com added the comment:

I plan to merge ET 1.3 in the 3.x branch tomorrow (See #6472)
Currently, the patch is consistent with 3.1 behaviour.
It could be changed later, depending on the pronouncement on this compatibility 
issue.


> Previously, in ElementTree, serialising without an explicit encoding
> was a way to get a byte encoded serialisation without an XML
> declaration header.

Now you can pass keyword argument xml_declaration=False to skip the header
explicitly.
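
A minimal sketch of that keyword argument, assuming the write() method of a current Python 3 standard library. The serialise helper and the sample element are hypothetical and exist only for this example.

```python
import io
import xml.etree.ElementTree as ET

def serialise(element, declaration):
    """Hypothetical helper: UTF-8 bytes via ElementTree.write()."""
    buf = io.BytesIO()
    ET.ElementTree(element).write(
        buf, encoding="utf-8", xml_declaration=declaration
    )
    return buf.getvalue()

elem = ET.Element("tag")
elem.text = "hell\u00f6"

without_header = serialise(elem, False)
with_header = serialise(elem, True)

print(without_header)
print(with_header)
```

The encoding and the header are thus chosen independently: the first result is the bare element in UTF-8, the second is the same bytes preceded by the declaration.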


> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 9

Now it works better.


~ $ ./python
Python 3.2a0 (py3k:78865M, Mar 12 2010, 13:05:30)
[GCC 4.3.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.getpreferredencoding() == "utf-8"
False
>>> from xml.etree.ElementTree import *
>>> e = Element("tag")
>>> e.text = "hellö"
>>> tostring(e)
'<tag>hellö</tag>'
>>> ElementTree(e).write("out.xml")
>>> tree = parse("out.xml")
>>> dump(tree)
<tag>hellö</tag>

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

Interesting.  But isn't the problem with 3.1 that it relies on the standard 
encoding, which results in code that may or may not work depending on a global 
platform setting?  Who's doing the encoding in the new version?  And what ends 
up in the file?

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Florent Xicluna


Florent Xicluna florent.xicl...@gmail.com added the comment:

>>> tree = parse("out.xml")

Actually the test in my previous message does not prove anything.
locale.getpreferredencoding() returns "UTF-8" != "utf-8".

:)

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

Oops :)  Yeah, that was a pretty lousy way to show what encoding I was using for
that test:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'


(Somewhat related, it would be nice if Python actually normalized 
defaultencoding/preferredencoding to some canonical name for the codec in use, 
i.e. preferred MIME name or at least IANA; we had a rather nice little bug 
recently that wouldn't have happened if that had been the case...)
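
A case-insensitive comparison of encoding names can be sketched with codecs.lookup(), which resolves a label to the name used by Python's own codec registry. That registry name is not necessarily the preferred MIME or IANA name asked for above; the helper name canonical_codec_name is hypothetical.

```python
import codecs
import locale

def canonical_codec_name(label):
    """Return the name Python's codec registry uses for an encoding label."""
    return codecs.lookup(label).name

preferred = locale.getpreferredencoding()
print(preferred, "->", canonical_codec_name(preferred))

# Different spellings of the same encoding resolve to one registry name.
same = canonical_codec_name("UTF-8") == canonical_codec_name("utf-8")
print(same)
```

Comparing the resolved names instead of the raw strings avoids the "UTF-8" versus "utf-8" mismatch seen in the earlier test.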

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-12 Thread Guido van Rossum


Guido van Rossum gu...@python.org added the comment:

I propose that we continue to see Fredrik as elementtree's BDFL. If Fredrik 
wants the API in 3.2 to be changed to be backwards compatible with 2.x, we 
should do that, and damn the torpedoes (um, 3.1 compatibility).

I would do this ASAP; if you can, fix it *before* merging 1.3.

Since I hate XML equally whether it's text or bytes, please leave me out of
this in the future; I apologize for having caused the problem in the first place
(but note that apparently nobody cared or noticed until a week ago).

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Antoine Pitrou


Antoine Pitrou pit...@free.fr added the comment:

> The "no header" thing is very much done on purpose, and it's
> documented in the upstream ElementTree documentation.

I'm sorry, where is that?
I can't find it either at
http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.tostring-function
or
http://effbot.org/zone/pythondoc-elementtree-ElementTree.htm#elementtree.ElementTree.ElementTree.write-method

> I suggest dropping this "Python 3 exists in its own universe"
> nonsense; it's not very professional, and it's hurting Python, its
> users, and all third party developers.

Ha. There has been a very long temporal window (until 3.1, probably)
during which things were very much in flux and anyone with a
professional knowledge of elementtree and XML APIs could chime in and
point out any nonsense in py3k.

Now Python 3.1 is out and as a result py3k also has to ensure upwards
compatibility for its own APIs. Of course we can still make exceptions
if the alleged breakage is truly major. To me, it doesn't /seem/ to be
the case here.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

Sorry, Antoine, but you can't possibly mean what you say here. The culprit in 
question is clearly one of the best hidden features of the new Py3 ET API. The 
only existing reference to it that I can find is the SVN commit comment when it 
was applied. How is that supposed to be any reason for keeping up backwards 
compatibility within the Py3 series?

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread R. David Murray


R. David Murray rdmur...@bitdance.com added the comment:

I suspect that what Antoine is referring to is the fact that Python 3.1 has 
this behavior.  Whether or not it is explicitly documented is a secondary issue.

We're having a similar issue in the unittest package, where there's a new 
function, assertSameElements, that has an unfortunate and poorly documented 
API.  But changing that API now that the function exists in a released version 
(3.1) is not something to be done lightly, if it is done at all.

This is definitely an unfortunate state of affairs no matter how you look at it.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

> if I don't specify an encoding, I get unicode.  If I do specify an encoding,
> I get encoded bytes.

You're confusing the XML document encoding with character set encoding.

A serialized (unparsed) XML document is a byte stream, not a string of Unicode
characters.  And the character set encoding is both embedded in that byte
stream and affects how it's generated in more than one way; you cannot just
recode XML documents willy-nilly and expect things to work.

A parsed XML document (an infoset) -- for ET, that's the tree of Element 
objects -- does indeed contain Unicode strings, but the transformation from the 
byte stream to the Unicode string doesn't just involve character set decoding; 
there are several other constructs that are handled by the XML parser.
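
The recoding hazard can be illustrated with a short sketch, assuming a current Python 3 standard library. The choice of ISO-8859-1 and UTF-8 is arbitrary; the point is only that the encoding name lives inside the byte stream, so changing the bytes without changing the declaration produces a document that misdescribes itself.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("tag")
elem.text = "hell\u00f6"

# The serialiser writes the encoding name into the XML declaration.
latin1_doc = ET.tostring(elem, encoding="iso-8859-1")

# Naive recoding changes the bytes but leaves the declaration untouched.
recoded_doc = latin1_doc.decode("iso-8859-1").encode("utf-8")

original_text = ET.fromstring(latin1_doc).text
recoded_text = ET.fromstring(recoded_doc).text

print(repr(original_text))
print(repr(recoded_text))
```

The recoded document still parses, but the parser trusts the stale declaration and the text no longer matches what was serialised.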

> Ha. There has been a very long temporal window

You should have had plenty of time to fix it, then, right?

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Antoine Pitrou


Antoine Pitrou pit...@free.fr added the comment:

> > Ha. There has been a very long temporal window
>
> You should have had plenty of time to fix it, then, right?

Under the condition that someone would have actually reported it, yes.
We don't magically fix bugs if nobody (including us) detects and reports
them.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

Then I would call that a clear sign that no-one actually stumbled over this 
feature in Py3 before I did, well hidden as it was. Still time to fix it.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread R. David Murray


R. David Murray rdmur...@bitdance.com added the comment:

You may well be correct.  But just because no one reported a bug does not mean 
that no one is using the API.  The person using it may find it perfectly 
logical (and may be writing py3 only code, not porting py2 code).

However, regardless of whether we decide it is acceptable to change the 
behavior, it seems to me that having an interface named 'tostring' that returns 
bytes by default in Python3 would be a broken API.  I don't see any way around 
that terminology problem.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

>>> import array
>>> array.array("i", [1, 2, 3]).tostring()
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Antoine Pitrou


Antoine Pitrou pit...@free.fr added the comment:

On Thu, 11 Mar 2010 22:03:37 +,
Fredrik Lundh rep...@bugs.python.org wrote:
>
> >>> import array
> >>> array.array("i", [1, 2, 3]).tostring()
> b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00'

The fact that array is old, rusty and slightly broken doesn't mean we
should propagate that brokenness to other Python modules.
Also, as David said, the fact that you think there is a bug
here doesn't mean everyone would agree.
Finally, the behaviour you seem to be looking for could be added
as a separate API or an optional method argument. Patches welcome.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Fredrik Lundh


Fredrik Lundh fred...@effbot.org added the comment:

So now it's the domain experts against some hypothetical people that might 
exist?  Tricky.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread R. David Murray


R. David Murray rdmur...@bitdance.com added the comment:

Well, Benjamin pointed out to me that it would be a bad thing if array.tostring 
produced a string.  True, the method is named wrong, but it is less broken than 
returning a string.  I suspect that that is the same argument Fredrik is 
making: that returning the XML as a byte string is less broken than returning 
it as a string when it in fact may contain other encoded stuff.  The email 
package has some of the same problems, and there we are retooling the API to 
deal with this.

Presumably ET needs to have a retooled API for Python3 as well.  Then the 
question becomes what do we do in the meantime?  For email, we are just living 
with the breakage until we can get something better in place, because no one 
has come up with any good short term solutions for email.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Guido van Rossum


Guido van Rossum gu...@python.org added the comment:

Hey, can we all try to get along?

For anyone who didn't follow the link to r56841, that was mine (though 
Christian Heimes provided the basis for much of the patch apart from 
elementtree), and I wrote at the time:

I had to fix a few tests and modules beyond what Christian did, and
invent a few conventions.  E.g. in elementtree, I chose to
write/return Unicode strings when no encoding is given, but bytes when
an explicit encoding is given.

I am not a user of elementtree, so this may well have been a mistake -- at the 
time (in 2007) we were so busy making zillions of tests pass that some mistakes 
were made.  Some of those were caught in time, others apparently not.

My thinking was that since an XML document looks like text, it should probably 
be considered text, at least by default.  (There may have been some unittests 
that appeared to require this -- of course this was probably just the confusion 
between byte strings and 8-bit text strings inherent in Python 2.)

Regarding backwards compatibility, there are now two backwards compatibility 
problems: with 2.x, and with 3.1.  It seems we cannot easily be backwards 
compatible with both (though if someone figures out a way that would be best of 
course).

If I were to propose an API for returning a Unicode string, I would probably 
add a new method (e.g. tounicode()) rather than using a magical argument 
(tostring(encoding=str)), but given that that exists in another 
supposedly-compatible implementation I'm not against it.  Maybe 
tostring(encoding=None) could also be made to work? That would at least make it 
*possible* to write code that receives a text object and that works in 3.1 and 
3.2 both.  In 2.x I think neither of these should work, and there probably 
isn't a need -- apps needing full compatibility will just have to refrain from 
calling tostring() without arguments.

ISTM that the behavior of write() is just fine -- the contents of the file will 
be correct after all.
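
One way to obtain a text object without depending on the default is to pass an
explicit encoding and decode the result. This is a minimal sketch, assuming the
standard library's xml.etree.ElementTree on Python 3; the helper name to_text
is hypothetical and not part of any API discussed here.

```python
import xml.etree.ElementTree as ET


def to_text(element, encoding="utf-8"):
    """Serialise with an explicit encoding name, then decode to text.

    to_text is a hypothetical helper. Passing a real encoding name such as
    "utf-8" makes tostring() return bytes, so the decode step is always valid.
    """
    data = ET.tostring(element, encoding=encoding)
    return data.decode(encoding)


root = ET.Element("root")
ET.SubElement(root, "child").text = "caf\u00e9"

text = to_text(root)
print(type(text).__name__, text)
```

The result is the same whichever way tostring() behaves when it is called
without arguments, because the call never relies on that default.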

--
nosy: +gvanrossum


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Antoine Pitrou


Antoine Pitrou pit...@free.fr added the comment:

Not wanting to waste my time anymore on this.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Antoine Pitrou


Changes by Antoine Pitrou pit...@free.fr:


--
nosy:  -pitrou


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

Hi Guido, your comment was long overdue in this discussion.

Guido van Rossum, 12.03.2010 01:35:
> My thinking was that since an XML document looks like text, it should
> probably be considered text, at least by default.  (There may have
> been some unittests that appeared to require this -- of course this
> was probably just the confusion between byte strings and 8-bit text
> strings inherent in Python 2.)

Well, well, XML...

It does look like text, but it's encoded text that is defined as a stream of 
bytes, and that's the only safe way of dealing with it.

There certainly *is* a use case for treating the serialised result as text, 
that's why lxml has this feature. A minor one is for debug output (which 
certainly doesn't merit being the default), but another one is when dealing 
with HTML, where encoding information is certainly less well defined and *much* 
less often seen in the wild. So users tend to be happy when they get their 
real-world HTML input fixed up into proper Unicode, still happier when they see 
that lxml can parse that correctly and even serialise the result back into a 
Unicode string directly, that they can post-process as text if they need to.

However, the main part here is the input, i.e. getting HTML data properly 
decoded into Unicode. The output part is a lot less important, and it's often 
easier to let lxml.html do the correct serialisation into bytes with proper 
encoding meta information, rather than dealing with it yourself.

Those are the two use cases I see for lxml. Their impact on ElementTree is 
relatively low as it doesn't support *parsing* from a Unicode string, so the 
most important HTML feature isn't there in the first place. The lack of major 
use cases in ElementTree is one of the reasons I'm so opposed to making this 
feature the backwards incompatible default for the output side.


> Regarding backwards compatibility, there are now two backwards
> compatibility problems: with 2.x, and with 3.1.  It seems we cannot
> easily be backwards compatible with both (though if someone figures
> out a way that would be best of course).
>
> If I were to propose an API for returning a Unicode string, I would
> probably add a new method (e.g. tounicode()) rather than using a
> magical argument (tostring(encoding=str)), but given that that
> exists in another supposedly-compatible implementation I'm not
> against it.

Actually, lxml.etree originally had a tounicode() function for this purpose, 
and I deprecated it in favour of tostring(encoding=unicode) to avoid having a 
separate interface for this, while staying just as explicit as before.  I'm 
aware that this wasn't an all-win decision, but I found passing the unicode 
type to be explicit enough, and separate enough from an encoding /name/ to make 
it clear what happens. It's certainly less beautiful in Py3, where you write 
tostring(encoding=str).

I still didn't remove the function from the API, but it's been deprecated for 
years. Reactivating it in lxml.etree, and duplicating it in ET, would save 
lxml.etree from having to break user code (as tostring(encoding=str) could 
simply continue to work, but disappear from the docs). It wouldn't save ET-Py3 
from breaking backwards compatibility with itself, though.


> Maybe tostring(encoding=None) could also be made to work? That would
> at least make it *possible* to write code that receives a text object
> and that works in 3.1 and 3.2 both.  In 2.x I think neither of these
> should work, and there probably isn't a need -- apps needing full
> compatibility will just have to refrain from calling tostring()
> without arguments.

It could be made to work, and it doesn't even read that bad. I can't imagine 
anyone using this explicitly to get the default behaviour, although you never 
know how people put together their keyword argument dicts programmatically. 
'None' has always been the documented default for the encoding parameter, so 
I'm sure there's at least a tiny bit of code that uses it to say "I'm not 
overriding the default here".

Actually, the encoding has been a keyword-only parameter in lxml.etree for 
ages, which was ok with the original default and conforms to the official ET 
documentation. So it would be easy to switch here, although not beautiful in 
the implementation. Same for ElementTree, where the current default None in the 
signature could simply be replaced by the 'real' default 'us-ascii'. Within the 
Py3 series, this change would not preserve backwards compatibility either.

So, as a solution, I do prefer separating this feature out into a separate 
function, so that we can simplify the interface of tostring() into always 
returning a byte string serialisation, as it always was in ET. The rather 
distinct use case of serialising to an unencoded text string can well be 
handled by a tounicode() function.
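
The proposed split can be sketched with two small wrappers. This is an
illustration only: tobytes() and tounicode() are hypothetical names written
for this sketch, and encoding="unicode" is the spelling accepted by the
standard library's xml.etree.ElementTree in Python 3.2 and later, not the
lxml spelling discussed above.

```python
import xml.etree.ElementTree as ET


def tobytes(element, encoding="us-ascii"):
    """Always return an encoded byte string (hypothetical wrapper)."""
    return ET.tostring(element, encoding=encoding)


def tounicode(element):
    """Always return an unencoded text string (hypothetical wrapper)."""
    return ET.tostring(element, encoding="unicode")


root = ET.Element("root")
ET.SubElement(root, "child").text = "caf\u00e9"

as_bytes = tobytes(root)    # non-ASCII text becomes a character reference
as_text = tounicode(root)   # non-ASCII text is kept as-is
print(as_bytes)
print(as_text)
```

Keeping the two return types behind two separate names means a caller never
has to inspect the arguments to know whether bytes or text comes back.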


> ISTM that the behavior of write() is just fine -- the contents of the
> file will be correct after all.

[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-11 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

One more thing: given that many web-frameworks are still not available for Py3 
at this time, and that there are still tons of third-party libraries missing on 
that platform, I would be surprised if there was any ElementTree based XML/HTML 
processing code written specifically and only for Py3 by now. So I cannot 
imagine any noticeable body of code being available that relies on this new Py3 
feature.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-08 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

Antoine, in the same comment, you say that it was not backported to Py2 in 
order to prevent breaking existing code, and then you ask if it's difficult to 
support in lxml. ;-)

Supporting the same behaviour in lxml would either mean that it breaks existing 
code in Py2 (when making the API consistent), or that you can safely (and 
correctly) write the return value to a file in Py2, but that you can't do the 
same in Py3 (when adopting the change only in Py3).

Previously, in ElementTree, serialising without an explicit encoding was a way 
to get a byte encoded serialisation without an XML declaration header, so I 
expect there to be code that depends on this. Since ElementTree 1.3 uses the 
same keyword argument as lxml for this feature, I assume that Florent's patches 
provide at least an alternative here, even if it requires users to adapt their 
code.

I just wish this backwards incompatible feature had been advertised at the 
time, or at least *documented* in any way. Even the latest 3.2-dev docs still 
state that the default encoding of the serialiser is US-ASCII, not a word about 
*ever* returning a unicode string, especially not by default, and totally not 
the required big fat warning that writing to a file will fail with mysterious 
errors if no encoding is specified.
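
For comparison, here is a minimal sketch of the byte-oriented pattern described
above: serialise without an encoding and write the result to a file opened in
binary mode. It assumes Python 3.2 or later, where tostring() called without an
encoding returns bytes; the temporary file is only there to keep the example
self-contained.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element("root")
ET.SubElement(root, "child").text = "caf\u00e9"

# Assumption: Python 3.2+, where the default (no encoding) yields bytes.
data = ET.tostring(root)

# Bytes belong in a file opened in binary mode.
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as f:
    f.write(data)

# Read the file back to check what actually landed on disk.
with open(path, "rb") as f:
    on_disk = f.read()
os.remove(path)

print(on_disk)
```

The output carries no XML declaration, and the non-ASCII character is written
as a character reference, so the bytes are plain 7-bit ASCII.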

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-08 Thread Florent Xicluna


Florent Xicluna florent.xicl...@gmail.com added the comment:

With ET 1.3, you should have an explicit keyword argument xml_declaration:

# 
if xml_declaration or (xml_declaration is None and
                       encoding not in ("utf-8", "us-ascii")):
    if method == "xml":
        write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
# 

In ET 1.2.6, the same snippet looks like:
# 
if encoding != "utf-8" and encoding != "us-ascii":
    file.write("<?xml version='1.0' encoding='%s'?>\n" % encoding)
# 
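
The effect of the xml_declaration keyword can be checked with a short sketch,
assuming the standard library's xml.etree.ElementTree on Python 3.2 or later.
The serialise() helper is hypothetical and only wraps ElementTree.write() with
an in-memory buffer.

```python
import io
import xml.etree.ElementTree as ET

root = ET.Element("root")
ET.SubElement(root, "child").text = "text"
tree = ET.ElementTree(root)


def serialise(**kwargs):
    """Write the tree into an in-memory binary buffer and return the bytes."""
    buf = io.BytesIO()
    tree.write(buf, **kwargs)
    return buf.getvalue()


# Default encoding (us-ascii): no declaration is written.
default = serialise()

# utf-8 would not get a declaration either, unless it is requested.
forced = serialise(encoding="utf-8", xml_declaration=True)

# Any other encoding gets a declaration automatically.
latin1 = serialise(encoding="iso-8859-1")

for result in (default, forced, latin1):
    print(result)
```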

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-08 Thread Antoine Pitrou


Antoine Pitrou pit...@free.fr added the comment:

On Mon, 08 Mar 2010 09:01:19 +,
Stefan Behnel rep...@bugs.python.org wrote:
 
> Antoine, in the same comment, you say that it was not backported to
> Py2 in order to prevent breaking existing code, and then you ask if
> it's difficult to support in lxml. ;-)

I meant breaking existing *user* code. Besides, the fact that
compatibility is broken doesn't mean third-party code is difficult to fix;
hence my question.

> Supporting the same behaviour in lxml would either mean that it
> breaks existing code in Py2 (when making the API consistent), or that
> you can safely (and correctly) write the return value to a file in
> Py2, but that you can't do the same in Py3 (when adopting the change
> only in Py3).

Sorry, I don't understand this. Are you saying it's impossible
for you to define two different behaviours based on the current Python
version? What's bad with
if sys.version_info >= (3, 0, 0): # blah
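
A version switch of that kind could look like the sketch below. The names
text_type and is_text are hypothetical and chosen only for illustration; the
one assumption taken from the interpreter is that sys.version_info compares as
a tuple.

```python
import sys

# Pick the text type once, based on the running interpreter.
if sys.version_info >= (3, 0, 0):
    text_type = str
else:
    text_type = unicode  # only defined on Python 2


def is_text(value):
    """Return True when value is a text (unicode) string rather than bytes."""
    return isinstance(value, text_type)


print(is_text("abc"), is_text(b"abc"))
```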

> Previously, in ElementTree, serialising without an explicit encoding
> was a way to get a byte encoded serialisation without an XML
> declaration header, so I expect there to be code that depends on
> this.

This doesn't seem to be documented. The doc simply says
"encoding is the output encoding (default is US-ASCII)".

In other words, undocumented (and untested) behaviour has been broken
when porting to 3.0, which is the version which deliberately broke
compatibility for documented things. I guess we can live with it ;)

> Even the latest
> 3.2-dev docs still state that the default encoding of the serialiser
> is US-ASCII, not a word about *ever* returning a unicode string,
> especially not by default, and totally not the required big fat
> warning that writing to a file will fail with mysterious errors if no
> encoding is specified.

Ok, perhaps some documentation changes are in order :-)
(I wonder why the default was US-ASCII, though. Sounds a bit braindead)

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-08 Thread Antoine Pitrou


Changes by Antoine Pitrou pit...@free.fr:


--
assignee:  - georg.brandl
components: +Documentation
nosy: +georg.brandl


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-07 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

It has been brought up several times that ET is special in the stdlib in that 
it is an externally maintained package. Correct me if I'm wrong, but the rules 
seem to be: features come outside, adaptation to Py3 can happen inside. What we 
are talking about here is a new feature that makes sense for both Py2 and Py3. 
We are not talking about a bug fix, neither is this an adaptation to Py3. It is 
a new feature that was added inside of the standard library and that is not 
compatible with the external libraries that are supposed to implement the same 
interface, namely, ElementTree and lxml.etree.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-07 Thread Antoine Pitrou


Antoine Pitrou pit...@free.fr added the comment:

As Florent said, it is a rule of py3k to avoid implicit encoding/decoding. The 
fact that it could have made sense for 2.x as well is not relevant, since the 
change was only done in py3k (and for good reason: we normally try not to break 
compatibility without prior notice).

In any case, I have trouble understanding your concern here. Do you think the 
change is bad? Is it really that difficult to support it in lxml?

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-05 Thread Antoine Pitrou


Antoine Pitrou pit...@free.fr added the comment:

I don't know what compatibility you are talking about. Py3k deliberately breaks 
compatibility with many 2.x behaviours that were considered defective or 
suboptimal.

--
nosy: +pitrou


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-03 Thread R. David Murray


R. David Murray rdmur...@bitdance.com added the comment:

I'm not an ElementTree user, but that spelling (etree.tostring(encoding=str), 
or even etree.tostring(encoding=unicode)) strikes me as horrible.  You don't 
encode to unicode, you *decode* to unicode.  Thus the current Python3 interface 
works the way I'd expect: if I don't specify an encoding, I get unicode.  If I 
do specify an encoding, I get encoded bytes.  In general, the fact that you can 
no longer get away with being sloppy about what encoding a byte stream is in, 
the way you could in Python2, is a feature of Python3, not a bug.

If anything, having 'tostring' return bytes is broken, given its name.  But I 
think we fudge that by claiming it is returning a 'byte string' when given an 
encoding.
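
The direction of the two operations is easy to check in Python 3. A minimal
sketch using only built-in types:

```python
text = "caf\u00e9"

# Encoding goes from text (str) to bytes.
encoded = text.encode("utf-8")

# Decoding goes from bytes back to text (str).
decoded = encoded.decode("utf-8")

print(type(encoded).__name__, encoded)
print(type(decoded).__name__, decoded)
```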

That said, I'm not sure how much, if at all, my opinion counts :)

--
nosy: +effbot, flox, r.david.murray
priority:  - normal


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-03 Thread Stefan Behnel


Stefan Behnel sco...@users.sourceforge.net added the comment:

I agree that the lxml API is somewhat clumsy here. I just mentioned it to show 
that there are already ways to do it in a backwards compatible way, so this 
change does two things: it breaks existing code, and it does so in a way that 
is incompatible with other existing implementations. That's what *I* would call 
horrible.

Also, this is absolutely not a feature that is restricted to Py3, so what's the 
equivalent feature in the standard library of Py2 going to be, and how much 
code will it break for the Py2 series?

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-03 Thread R. David Murray


R. David Murray rdmur...@bitdance.com added the comment:

My understanding is that backward compatibility, while nice to retain, was not 
considered a stopper for cleaning up interfaces in py3.  Exactly how considered 
this change was, I have no idea, but as I said it does make sense to me.  As 
for 2.x, what's there is what's there, as far as I can see.  Florent could 
speak to whether or not that API is likely to change in 2.7, but I doubt it 
will.

--


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-03 Thread Florent Xicluna


Florent Xicluna florent.xicl...@gmail.com added the comment:

With ET 1.3, the serializer ElementTree.write() should output bytes only. And 
the default encoding is still US-ASCII.

The new behaviour is specific to the 3.x branch (since 3.0, r56841).
Even if it is not fully backward compatible, I don't find this behavior 
shocking: it is a rule of Python 3 to avoid implicit encoding/decoding.

--
stage:  - test needed


[issue8047] Serialiser in ElementTree returns unicode strings in Py3k

2010-03-02 Thread Stefan Behnel


New submission from Stefan Behnel sco...@users.sourceforge.net:

The xml.etree.ElementTree package in the Python 3.x standard library breaks 
compatibility with existing ET 1.2 code. The serialiser returns a unicode 
string when no encoding is passed. Previously, the serialiser was guaranteed to 
return a byte string. By default, the string was 7-bit ASCII compatible.

This behavioural change breaks all code that relies on the default behaviour of 
ElementTree. Since there is no longer a default encoding in Python 3, unicode 
strings are incompatible with byte strings, which means that the result of the 
serialisation can no longer be written to a file, for example.

XML is well defined as a stream of bytes. Redefining it as a unicode string *by 
default* is hard to understand at best.

Finally, it would have been good to look at the other ET implementation before 
introducing such a change. The lxml.etree package has had support for 
serialising XML into a unicode string for years, and does so in a clear, safe 
and explicit way. It requires the user to pass the 'unicode' (Py3 'str') type 
as encoding parameter, e.g.

tree.tostring(encoding=str)

which is explicit enough to make it clear that this is different from a normal 
encoding.
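
The file-writing failure can be reproduced without ElementTree. In this
minimal sketch a str literal stands in for the serialiser's result in 3.0/3.1,
and io.BytesIO stands in for a file opened in binary mode; both substitutions
are assumptions made to keep the example self-contained.

```python
import io

# A text string, standing in for what the serialiser returned in 3.0/3.1.
serialised_text = "<root><child>text</child></root>"

# Stands in for open(path, "wb").
binary_file = io.BytesIO()

try:
    binary_file.write(serialised_text)
    failed = False
except TypeError:
    # Python 3 refuses to mix text and bytes implicitly.
    failed = True

# Encoding explicitly turns the text into bytes that can be written.
binary_file.write(serialised_text.encode("us-ascii"))

print(failed, binary_file.getvalue())
```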

--
components: Library (Lib)
messages: 100333
nosy: scoder
severity: normal
status: open
title: Serialiser in ElementTree returns unicode strings in Py3k
type: behavior
versions: Python 3.1, Python 3.2

