[issue26369] doc for unicode.decode and str.encode is unnecessarily confusing

2016-02-19 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I was unaware of the seemingly useless behavior you quote.

The intended use for str.encode is same-type transcoding, like this:

>>> 'abc'.encode('base64')
'YWJj\n'
>>> 'YWJj\n'.decode('base64')
'abc'

Here is a similar use for unicode.decode.

>>> u'abc'.encode('base64')
'YWJj\n'
>>> u'YWJj\n'.decode('base64')
'abc'

Any doc change should make the intended use clear, if it is not already.

(Note that the above give lookup errors in 3.x
>>> 'abc'.encode('base64')
...
LookupError: 'base64' is not a text encoding; use codecs.encode() to handle 
arbitrary codecs)
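
For reference, a minimal sketch of the codecs-module route that the 3.x error
message points to. It assumes Python 3.4 or later, where 'base64' is accepted
as a bytes-to-bytes codec name by codecs.encode() and codecs.decode(); the
variable names are only illustrative.

```python
import codecs

# In 3.x the base64 codec is a bytes-to-bytes transform, so the input
# must be a bytes object rather than a text string.
encoded = codecs.encode(b'abc', 'base64')
decoded = codecs.decode(encoded, 'base64')

print(encoded)
print(decoded)
```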

--
nosy: +terry.reedy

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26369] doc for unicode.decode and str.encode is unnecessarily confusing

2016-02-16 Thread Ezio Melotti

Changes by Ezio Melotti :


--
nosy: +ezio.melotti




[issue26369] doc for unicode.decode and str.encode is unnecessarily confusing

2016-02-16 Thread Steven D'Aprano

Steven D'Aprano added the comment:

Perhaps you could suggest a specific change to the docstrings for str.encode 
and unicode.decode?

(BTW, I presume you are aware that the equivalents of (bytes) str.encode and 
unicode.decode are gone in Python 3?)

--
nosy: +steven.daprano




[issue26369] doc for unicode.decode and str.encode is unnecessarily confusing

2016-02-16 Thread Ben Spiller

New submission from Ben Spiller:

It's well known that lots of people struggle writing correct programs using 
non-ASCII strings in Python 2.x, but I think one of the main reasons for this 
could be very easily fixed with a small addition to the documentation for 
str.encode and unicode.decode, which is currently quite vague. 

The decode/encode methods really make most sense when called in the natural 
direction: unicode.encode() on a unicode string to produce a byte string, or 
str.decode() on a byte string to produce a unicode object. 

However, the additional presence of the opposite methods str.encode() and 
unicode.decode() is quite confusing, and a frequent source of errors. For 
example, calling str.encode('utf-8') first DECODES the str object (which might 
already be in utf-8) to a unicode string **using the default encoding of 
"ascii"** (!) before ENCODING to a utf-8 byte str as requested. If there are 
any non-ASCII chars present, this fails at the first stage with the classic 
error "UnicodeDecodeError: 'ascii' codec can't decode byte". It's unfortunate 
that this initial decode/encode stage ignores both the "encoding" argument 
(used only for the subsequent encode/decode) and the "errors" argument 
(commonly used when the programmer is happy with a best-effort conversion, 
e.g. for logging purposes).
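
The two-stage behaviour described above can be sketched as follows. This is 
only an emulation written so that it also runs on Python 3: the function name 
py2_style_str_encode is hypothetical, and the hard-coded "ascii"/"strict" 
first stage stands in for Python 2's default encoding.

```python
def py2_style_str_encode(data, encoding, errors="strict"):
    """Emulate what Python 2's str.encode(encoding, errors) does to a byte string."""
    # Stage 1: implicit decode. The caller's encoding and errors arguments
    # are NOT consulted here; "ascii" and "strict" are always used.
    text = data.decode("ascii", "strict")
    # Stage 2: the encode that was actually requested.
    return text.encode(encoding, errors)


# Pure-ASCII input passes through both stages unchanged.
ascii_result = py2_style_str_encode(b"abc", "utf-8")

# Input that is already valid utf-8 fails in stage 1, even with errors="replace".
try:
    py2_style_str_encode(b"caf\xc3\xa9", "utf-8", "replace")
    failed_in_stage_one = False
except UnicodeDecodeError:
    failed_in_stage_one = True
```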

Anyway, given this behaviour, a lot of time would be saved by a simple sentence 
on the doc for str.encode()/unicode.decode() essentially warning people that 
those methods aren't that useful and they probably really intended to use 
str.decode()/unicode.encode() - the current doc gives absolutely no clue about 
this extra stage which ignores the input arguments and uses 'ascii' and 
'strict'. It might also be worth stating in the documentation that the pattern 
(u.encode(encoding) if isinstance(u, unicode) else u) can be helpful for cases 
where you unavoidably have to deal with both kinds of input, since calling 
str.encode is such a bad idea. 
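
The isinstance pattern mentioned above could be wrapped in a small helper, 
sketched here. The names to_bytes and text_type are hypothetical; 
type(u"") is used instead of the name unicode only so that the same code runs 
on both Python 2 and Python 3.

```python
# unicode on Python 2, str on Python 3.
text_type = type(u"")


def to_bytes(value, encoding="utf-8", errors="strict"):
    """Return a byte string, encoding only when the input is a text string."""
    if isinstance(value, text_type):
        return value.encode(encoding, errors)
    # Already a byte string: return it untouched, so no implicit
    # ascii decode is ever triggered.
    return value


from_text = to_bytes(u"caf\u00e9")
from_bytes = to_bytes(b"caf\xc3\xa9")
```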

In an ideal world I'd love to see the implementation of 
str.encode/unicode.decode changed to be more useful (i.e. instead of using 
ascii, it would be more logical and useful to use the passed-in encoding to 
perform the initial decode/encode, and the passed-in 'errors' value). I wasn't 
sure if that change would be accepted so for now I'm proposing better 
documentation of the existing behaviour as a second-best.
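
As a rough sketch of that idea (not existing behaviour in any Python version), 
the initial decode would reuse the caller's arguments. The function name 
proposed_str_encode is hypothetical.

```python
def proposed_str_encode(data, encoding, errors="strict"):
    """Sketch of the proposal: both stages honour the passed-in arguments."""
    # Stage 1 uses the requested encoding and errors instead of ascii/strict.
    text = data.decode(encoding, errors)
    # Stage 2 is the encode that was requested.
    return text.encode(encoding, errors)


# Valid utf-8 input now round-trips instead of raising.
round_tripped = proposed_str_encode(b"caf\xc3\xa9", "utf-8")

# With errors="replace", an invalid byte becomes U+FFFD rather than an exception.
best_effort = proposed_str_encode(b"caf\xe9", "utf-8", "replace")
```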

--
assignee: docs@python
components: Documentation
messages: 260359
nosy: benspiller, docs@python
priority: normal
severity: normal
status: open
title: doc for unicode.decode and str.encode is unnecessarily confusing
type: behavior
versions: Python 2.7
