[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2020-05-30 Thread Serhiy Storchaka


Change by Serhiy Storchaka :


--
resolution:  -> out of date
stage:  -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-20 Thread Ben Spiller

Ben Spiller added the comment:

Thanks for considering this, anyway. I'll admit I'm disappointed we couldn't 
fix this on the 2.7 train, as to me fixing a method that takes an 
errors='ignore' argument and then throws an exception anyway seems a little 
more like a bug than a feature (and changing it would likely not affect 
behaviour in any existing non-broken programs), but if that's the decision then 
fine. Of course I'm aware (as I mentioned earlier on the thread) that the 
radically different unicode handling in python 3 solves this entirely and only 
wish it was practical to move our existing (enormous) codebase and customers 
over to it, but we're stuck with Python 2.7 - I believe lots of people are in 
the same situation unfortunately. 

As Josh suggested, perhaps we can at least add something to the doc for the 
str/unicode encode and decode methods so users are aware of the behaviour 
without trial and error. I'll update the component of this bug to reflect it's 
now considered a doc issue. 

Based on the input from Terry, and what seems to be the key info that would 
have been helpful to me and to those hitting the same issues for the first 
time, I'd propose the following text (feel free to adjust as you see fit):

For encode:
"For most encodings, the return type is a byte str regardless of whether it is 
called on a str or unicode object. For example, call encode on a unicode object 
with "utf-8" to return a byte str object, or call encode on a str object with 
"base64" to return a base64-encoded str object.

It is _not_ recommended to call this method on "str" objects when using 
codecs such as utf-8 that convert between str and unicode objects, as any 
characters not supported by python's default encoding (usually 7-bit ascii) 
will result in a UnicodeDecodeError exception, even if errors='ignore' was 
specified. For such conversions the str.decode and unicode.encode methods 
should be used. If you need to produce an encoded version of a string that 
could be either a str or unicode object, only call the encode() method after 
checking it is a unicode object not a str object, using isinstance(s, unicode)."

and for decode:
"The return type may be either str or unicode, depending on which encoding is 
used and whether the method is called on a str or unicode object. For example, 
call decode on a str object with "utf-8" to return a unicode object, or call 
decode on a unicode or str object with "base64" to return a base64-decoded str 
object.

It is _not_ recommended to call this method on "unicode" objects when using 
codecs such as utf-8 that convert between str and unicode objects, as any 
characters not supported by python's default encoding (usually 7-bit ascii) 
will result in a UnicodeEncodeError exception, even if errors='ignore' was 
specified. For such conversions the str.decode and unicode.encode methods 
should be used. If you need to produce a decoded version of a string that could 
be either a str or unicode object, only call the decode() method after checking 
it is a str object not a unicode object, using isinstance(s, str)."
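The isinstance check recommended in the proposed text could be wrapped in a 
small helper. This is only a sketch: the name to_utf8_bytes is made up for 
illustration, and text_type is defined so that the same code runs on Python 2 
(where it is unicode) and on Python 3 (where it is str).

```python
# Hypothetical helper; the name does not come from the thread.
text_type = type(u"")  # unicode on Python 2, str on Python 3


def to_utf8_bytes(s):
    """Return UTF-8 bytes for text; pass byte strings through unchanged."""
    if isinstance(s, text_type):
        return s.encode("utf-8")
    # Already a byte string: calling .encode() on it would trigger the
    # implicit default-encoding decode on Python 2, so leave it untouched.
    return s


print(to_utf8_bytes(u"caf\xe9"))
print(to_utf8_bytes(b"caf\xc3\xa9"))
```

Both calls yield the same UTF-8 byte string, so callers need not know which 
kind of string they were handed.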

--
components: +Documentation -Interpreter Core




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-20 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Ben, the methods on strings and Unicode objects in Python 2.x are direct 
interfaces to the underlying codecs. The codecs can handle any number of input 
and output types, so there are some which only work on 8-bit strings (bytes) 
and others which take Unicode as input.

As a result, you sometimes see errors due to the conversion of an 8-bit string 
to Unicode (in the case where the codec expects a Unicode input).

As an example, take the UTF-8 codec. This expects a Unicode input when 
encoding, so when you pass in an 8-bit string, Python will convert this to 
Unicode using the default encoding (which is normally set to 'ascii') and then 
apply the codec operation.

When the 8-bit string is plain ASCII this works great. If not, chances are high 
that you'll run into a Unicode error.
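The two-step behaviour can be imitated on Python 3, where bytes plays the role 
of the Python 2 8-bit string. The function below is an emulation written for 
illustration only; its name and its default_encoding parameter are assumptions, 
not the actual CPython implementation.

```python
def py2_style_str_encode(data, encoding, errors="strict",
                         default_encoding="ascii"):
    """Emulate Python 2's str.encode() on a Python 3 bytes object."""
    # Step 1: implicit conversion to Unicode. Neither `encoding` nor
    # `errors` is consulted here; the default encoding and strict error
    # handling are always used.
    text = data.decode(default_encoding, "strict")
    # Step 2: the codec operation that the caller actually asked for.
    return text.encode(encoding, errors)


# Plain ASCII passes through both steps unchanged.
print(py2_style_str_encode(b"abc", "utf-8"))

# Non-ASCII bytes fail in step 1, even though errors="ignore" was requested.
try:
    py2_style_str_encode(b"caf\xc3\xa9", "utf-8", errors="ignore")
except UnicodeDecodeError as exc:
    print("implicit decode failed with codec:", exc.encoding)
```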

Now, in Python 2.x you can change the default encoding to either make this work 
by assuming that all your 8-bit strings are UTF-8 (set it to 'utf-8' in 
sitecustomize.py), or you can disable the automatic conversion altogether by 
setting the default encoding to 'undefined', which is a codec specifically 
created for this purpose. The latter will also raise an exception when 
attempting to convert an 8-bit string to Unicode - similar to what Python 3 
does, except that the error type is different.
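The effect of an always-failing default codec can be sketched on Python 3 
without touching any global state. The example assumes the standard library's 
'undefined' codec, whose encode and decode functions always raise; the helper 
implicit_coerce is a made-up stand-in for the automatic conversion, not 
something Python provides.

```python
def implicit_coerce(data, default_encoding):
    """Stand-in for Python 2's automatic 8-bit string -> Unicode conversion."""
    return data.decode(default_encoding, "strict")


# With 'utf-8' as the default, UTF-8 byte strings convert cleanly.
print(implicit_coerce(b"caf\xc3\xa9", "utf-8"))

# With 'undefined' as the default, every conversion is refused,
# even for plain ASCII input.
try:
    implicit_coerce(b"abc", "undefined")
except UnicodeError as exc:
    print("conversion refused:", exc)
```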

Hope that helps.

--
nosy: +lemburg




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-19 Thread Josh Rosenberg

Josh Rosenberg added the comment:

Agree with Steven; the whole reason Python 3 changed from unicode and str to 
str and bytes was because having Py2 str be text sometimes, and binary data at 
other times is confusing. The existing behavior can't change in Py2 in any 
meaningful way without breaking existing code, introducing special cases for 
text->text encodings (where Python 3 supports them using the codecs module 
only), behaving in non-obvious ways in corner cases, etc. Silently treating 
str.encode("utf-8") to mean "decode as UTF-8 and throw away the result to 
verify that it's already UTF-8 bytes" is not particularly intuitive either.

It does seem like a doc fix would be useful though; right now, we have only 
"String methods" documented, with no distinction between str and unicode. It 
might be helpful to explicitly deprecate str.encode on str objects, and 
unicode.decode, with a note that while it's meaningful to use these methods in 
Python 2 for text<->text encoding/decoding, the equivalent methods 
(bytes.encode and str.decode) don't exist at all in Python 3.

Otherwise, yes, if you want consistent text/binary types, that's what Python 3 
is for. Python 2 has tons of flaws when it comes to handling unicode (e.g. csv 
module), and fixing any given single problem (creating backward compatibility 
headaches in the process) is not worth the trouble.

If you're concerned about excessive boilerplate, just write a function (or a 
type) that allows you to perform the tests/conversions you care about as a 
single call. For example, the following seems like it achieves your objectives 
(one line usage, handles str by verifying that it's legal in provided encoding 
in strict mode, dropping/replacing characters in ignore/replace mode, etc.):

import sys

def basestringencode(s, encoding=sys.getdefaultencoding(), errors="strict"):
    if isinstance(s, str):
        # Decode with provided rules, so a str with illegal characters
        # raises exception, replaces, ignores, etc. per arguments
        s = s.decode(encoding, errors)
    return s.encode(encoding, errors)

If you don't want to see UnicodeDecodeError, you either pass 'ignore' for 
errors, or wrap the s.decode step in a try/except and raise a different 
exception type.
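The try/except variant might look like the sketch below. The exception class 
and function name are hypothetical, and the code uses bytes rather than str in 
the isinstance check so that it runs on both Python 2 (where bytes is an alias 
of str) and Python 3.

```python
class InvalidByteStringError(Exception):
    """Hypothetical application-level error; not part of Python."""


def encode_any(s, encoding="utf-8", errors="strict"):
    """Encode text, or validate a byte string against `encoding`."""
    if isinstance(s, bytes):
        try:
            # Decode with the caller's rules so that `errors` is honoured.
            s = s.decode(encoding, errors)
        except UnicodeDecodeError as exc:
            raise InvalidByteStringError(
                "byte string is not valid %s: %s" % (encoding, exc))
    return s.encode(encoding, errors)


print(encode_any(u"caf\xe9"))
print(encode_any(b"a\xffb", errors="ignore"))
```

With errors="ignore" the invalid byte is dropped instead of raising, which is 
the behaviour the errors argument suggests.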

The biggest change I could see happening code wise would be a textual change to 
the UnicodeDecodeError error str.encode raises, so str.encode specifically 
replaces the default error message (but not type, for back compat reasons) with 
something like "str.encode cannot perform implicit decode with 
sys.getdefaultencoding(); use .encode only with unicode objects"

--
nosy: +josh.r




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-19 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

> btw If anyone can find the place in the code (sorry I tried and failed!) 
> where str.encode('utf-8', errors=X) is resulting in an implicit call to the 
> equivalent of decode(defaultencoding, errors=strict) (as suggested by the 
> exception message) I think it'll be easier to discuss the details of fixing.

There is no single place. Search for the lines 
"str = PyUnicode_FromObject(str);" in Modules/_codecsmodule.c.

> But that's not what happens - it *silently works* (is a no-op) as long as you 
> happen to be using ASCII characters so this so-called 'programming bug' will 
> go unnoticed by most programmers (and authors of third party library code you 
> might be relying on!)... but the moment a non-ascii character gets introduced 
> suddenly you'll get an exception, maybe in some library code you rely on but 
> can't fix.

The problem is that encoding an ASCII str to UTF-8 is a legal operation in some 
circumstances and a programming bug in others. There is no way to distinguish 
these two cases automatically.

As a non-English speaker I am familiar with the problems you described. This is 
a bug in the design of Python 2, and the only solution is using Python 3.

You can experiment with your idea, but I'm afraid that the patch will be more 
difficult than you expect and will break the tests. I want to warn you that 
even if your experiment is quite successful, there is not much chance of it 
being accepted into 2.7. This is more like a new feature than a bug fix. 
Programs that depend on this feature would be incompatible with previous bugfix 
releases. It is unlikely to help the migration to Python 3, and would rather 
encourage writing code that is incompatible with Python 3.

--




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-19 Thread Steven D'Aprano

Steven D'Aprano added the comment:

Ben, I'm sorry to see you have spent such a long time writing up reasons for 
changing this behaviour. I fear this is a total waste of your time, and ours to 
read it. Python 2.7 is under feature freeze, and changing the behaviour of 
str.encode and unicode.decode is a new feature. So it could only happen in 2.8, 
but there will never be a Python 2.8.

If you want more sensible behaviour, then upgrade to Python 3. If you want to 
improve the docs, then suggest some documentation improvements. But arguing for 
a change in behaviour of Python 2.7 str.encode and unicode.decode is, I fear, a 
waste of everyone's time.

If you still wish to champion this change, feel free to raise the issue on the 
Python-Dev mailing list where the senior developers, including Guido, hang out. 
I doubt it will do any good, but there is at least the theoretical possibility 
that if you convince them that this change will encourage people to migrate to 
Python 3 then you might get your wish.

Just don't hold your breath.

--




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-19 Thread Ben Spiller

Ben Spiller added the comment:

btw If anyone can find the place in the code (sorry I tried and failed!) where 
str.encode('utf-8', errors=X) is resulting in an implicit call to the 
equivalent of decode(defaultencoding, errors=strict) (as suggested by the 
exception message) I think it'll be easier to discuss the details of fixing.

Thanks for your reply - yes I'm aware that theoretically you _could_ globally 
change python's default encoding from ascii, but the prevailing view I've heard 
from python developers seems to be that changing it is not a good idea and may 
cause lots of library code to break. Also it's probably not a good idea for 
individual libraries or modules to be changing global state that affects the 
entire python invocation, and it would be nice to find a less fragile and more 
out-of-the-box solution to this. You may well be using different encodings (not 
just utf-8) in different parts of your program - so changing the 
globally-defined default encoding doesn't seem right, especially for a method 
like str.encode that already takes an 'encoding' argument (used currently only 
for the encoding aspect, not the decoding aspect). 

I do think there's a strong case to be made for changing the str.encode (and 
also unicode.decode) behaviour so that str.encode('utf-8') behaves the same 
whether it's given ascii or non-ascii characters, and also similar to 
unicode.encode('utf-8'). Let me try to persuade you... :)

First, to address the point you made:

> If str.encode() raises a decoding exception, this is a programming bug. It 
> would be bad to hide it.

I totally agree with the general principle of not hiding programming bugs. 
However, if calling str.encode for codecs like utf8 (let's ignore base64 for 
now, which is a very different beast) were *consistently* treated as a 
'programming bug' by python and always resulted in an exception, that would be 
ok (suboptimal usability imho, but still ok), since programmers would quickly 
spot the problem and fix it. But that's not what happens - it *silently works* 
(is a no-op) as long as you happen to be using ASCII characters, so this 
so-called 'programming bug' will go unnoticed by most programmers (and authors 
of third party library code you might be relying on!)... but the moment a 
non-ascii character gets introduced, suddenly you'll get an exception, maybe in 
some library code you rely on but can't fix. For this reason I don't think 
treating this as a programming bug is helping anyone write more robust python 
code - quite the reverse. Plus I think the behaviour of being a no-op is almost 
always 'what you would have wanted it to do' anyway, whereas the behaviour of 
throwing an exception almost never is. 

I think we'd agree that changing str.encode(utf8) to throw an exception in 
*all* cases wouldn't be a realistic option since it would certainly break 
backwards compatibility in painful ways for many existing apps and library 
code. 

So, if we want to make the behaviour of this important built-in type a bit more 
consistent and less error-prone/fragile for this case then I think the only 
option is making str.encode be a no-op for non-ascii characters (at least, 
non-ascii characters that are valid in the specified encoding), just as it is 
for ascii characters. 

Here's why I think ditching the current behaviour would be a good idea:
- calling str.encode() and getting a DecodeError is confusing ("I asked you to 
encode this string, what are you decoding for?")
- calling str.encode('utf-8') and getting an exception about "ascii" is 
confusing as the only encoding I mentioned in the method call was utf-8
- calling encode(..., errors=ignore) and getting an exception is confusing and 
feels like a bug; I've explicitly specified that I do NOT want exceptions from 
calling this method, yet (because neither 'errors' nor 'encoding' argument gets 
passed to the implicit - and undocumented - decode operation), I get unexpected 
behaviour that is far more likely to break my program than a no-op
- the somewhat surprising behaviour we're talking about is not explicitly 
documented anywhere
- having str.encode throw on non-ascii but not ascii makes it very likely that 
code will be written and shipped (including library code you may have no 
control over) that *appears* to work under normal testing but has *hidden* bugs 
that surface only once non-ascii characters are used. 
- in every situation I can think of, having str.encode(encoding, errors=ignore) 
honour the encoding and errors arguments even for the implicit-decode operation 
is more useful than having it ignore those arguments and throw an exception
- a quick google shows lots of people in the Python community (from newbies to 
experts) are seeing this exception and being confused by it, therefore a lot of 
people's lives might be improved if we can somehow make the situation better :)
- even with the best of intentions (and with code written by senior python 

[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

If str.encode() raises a decoding exception, this is a programming bug. It 
would be bad to hide it.

FYI, the default encoding is not hardcoded to 'ascii'. Google "Changing default 
encoding in Python". Maybe this will help in your program.

--




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-12 Thread Ben Spiller

Ben Spiller added the comment:

I'm proposing that str.encode() should _not_ throw a 'decode' exception for 
non-ascii characters and should be effectively a no-op, to match what it 
already does for ascii characters - which therefore shouldn't break behavior 
anyone will be depending on. This could be achieved by passing the encoding 
parameter through to the implicit decode() call (which appears to be where the 
exception is coming from), rather than (arbitrarily and surprisingly) using 
"ascii" (which of course sometimes works and sometimes doesn't, depending on 
the input string).

Does that make sense?

If someone can find the place in the code (sorry I tried and failed!) where 
str.encode('utf-8') is resulting in an implicit call to the equivalent of 
decode('ascii') (as suggested by the exception message) I think it'll be easier 
to discuss the details

--




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

What do you propose? Note that str.encode() doesn't raise an exception. ASCII 
unicode and 8-bit strings are interchangeable. ASCII unicode strings can be 
packed in str for less memory consumption (see xmlrpclib or ElementTree), and a 
lot of str constants are used in unicode context (like os.sep or the empty 
string). Breaking str.encode() will break valid existing code.

--




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-12 Thread Ben Spiller

Ben Spiller added the comment:

yes the situation is loads better in python 3, this issue is specific to 2.x, 
but like many people sadly we're not able to move to 3 for the time being. 

Since making this mistake is quite common and there's some sensible behaviour 
that would make it disappear (resulting in ascii and non-ascii strings being 
treated the same way by these methods) I'd much prefer if we could actually fix 
it for python 2.7

--




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-12 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Note that with the -3 option Python 2.7 already warns about incompatibilities. 

>>> 'abc'.encode('base64')
__main__:1: DeprecationWarning: 'base64' is not a text encoding; use 
codecs.encode() to handle arbitrary codecs
'YWJj\n'
>>> 'YWJj\n'.decode('base64')
__main__:1: DeprecationWarning: 'base64' is not a text encoding; use 
codecs.decode() to handle arbitrary codecs
'abc'
>>> u'abc'.decode('ascii')
__main__:1: DeprecationWarning: decoding Unicode is not supported in 3.x
u'abc'
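The replacement that the warnings point to, codecs.encode() and 
codecs.decode(), also works on Python 3, where it operates on bytes. A short 
sketch using the same input as the session above:

```python
import codecs

# Bytes-to-bytes codecs are reached through the codecs module rather than
# through the .encode()/.decode() methods.
encoded = codecs.encode(b"abc", "base64")
decoded = codecs.decode(encoded, "base64")

print(encoded)  # b'YWJj\n'
print(decoded)  # b'abc'
```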

--
nosy: +serhiy.storchaka




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-12 Thread Ben Spiller

Ben Spiller added the comment:

Thanks that's really helpful

Having thought about it some more, I think if possible it'd be really so much 
better to actually 'fix' the behaviour for the unicode<->str standard codecs 
(i.e. not base64) rather than just documenting around it. The current behaviour 
is not only confusing but leads to bugs that are very easy to miss since the 
methods work correctly when given 7-bit ascii characters. 

I had a poke around in the python source but couldn't quite identify where it's 
happening - presumably there is somewhere in the str.encode('utf-8') 
implementation that first "decodes" the string and does so using the ascii 
codec. If it could be made to use the same encoding that was passed in (e.g. 
utf8) then this would end up being a no-op and there would be no unpleasant 
bugs that only appear when the input includes non-ascii characters. 

It would also allow X.encode('utf-8') to be called successfully whether X is 
already a str or is a unicode object, which would save callers having to 
explicitly check what kind of string they've been passed. 

Is anyone able to look into the code to see where this would need to be fixed 
and how difficult it would be to do? I have a feeling that once the line is 
located it might be quite a straightforward fix

Many thanks

--
components: +Interpreter Core -Documentation
title: doc for unicode.decode and str.encode is unnecessarily confusing -> 
unicode.decode and str.encode are unnecessarily confusing for non-ascii
