[issue40687] Windows py.exe launcher interacts badly with Windows store python.exe shim

2020-05-19 Thread Ben Spiller


Change by Ben Spiller :


--
type:  -> behavior

___
Python tracker 
<https://bugs.python.org/issue40687>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue40687] Windows py.exe launcher interacts badly with Windows store python.exe shim

2020-05-19 Thread Ben Spiller


New submission from Ben Spiller :

The py.exe launcher doc states "If no relevant options are set, the commands 
python and python2 will use the latest Python 2.x version installed" ... which 
was indeed working reliably until Microsoft added their weird python.exe shim 
(which either terminates with no output or brings up the Microsoft Store page) 
as part of 
https://devblogs.microsoft.com/python/python-in-the-windows-10-may-2019-update/

Now, I find scripts that start with "#!/usr/bin/env python" cause py.exe to run 
the Windows python.exe shim which confusingly terminates with no output (unless 
run with no arguments). 

I think to stop lots of developers banging their heads against this brick 
wall, py.exe should include some logic to ignore the 
C:\Users\XXX\AppData\Local\Microsoft\WindowsApps\python.exe shim, since for 
someone with py.exe installed, running that is _never_ what you'd want. (Or 
alternatively, work with Microsoft to get this decision reversed, but that may 
be harder!)

Lots of people are hitting this e.g. 
https://superuser.com/questions/1437590/typing-python-on-windows-10-version-1903-command-prompt-opens-microsoft-stor
 , 
https://stackoverflow.com/questions/57485491/python-python3-executes-in-command-prompt-but-does-not-run-correctly

Here's the output:

py myscript.py
launcher build: 32bit
launcher executable: Console
File 'C:\Users\XXX\AppData\Local\py.ini' non-existent
Using global configuration file 'C:\WINDOWS\py.ini'
Called with command line: apama-build\build.py   -h
maybe_handle_shebang: read 256 bytes
maybe_handle_shebang: BOM not found, using UTF-8
parse_shebang: found command: python
searching PATH for python executable
Python on path: C:\Users\XXX\AppData\Local\Microsoft\WindowsApps\python.exe
located python on PATH: 
C:\Users\XXX\AppData\Local\Microsoft\WindowsApps\python.exe
run_child: about to run 
'C:\Users\XXX\AppData\Local\Microsoft\WindowsApps\python.exe myscript.py'
child process exit code: 9009

>py -0
Installed Pythons found by py Launcher for Windows
 -3.8-64 *
 -3.7-64
 -3.6-64
 -2.7-64

(nb: was surprising that it didn't run any of those installed versions!)
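The fix suggested above could be as simple as filtering the WindowsApps directory out of the PATH search that parse_shebang performs. A minimal sketch of that filtering logic (in Python for illustration, with a hypothetical helper name; the real change would live in the launcher's C PATH-search code):

```python
def strip_store_shim_dirs(path_dirs):
    """Drop PATH entries under ...\\Microsoft\\WindowsApps, where the
    Store's python.exe shim lives (hypothetical helper; the actual fix
    would be in py.exe's C implementation)."""
    return [d for d in path_dirs
            if 'microsoft\\windowsapps' not in d.lower()]

path_dirs = [r'C:\Users\XXX\AppData\Local\Microsoft\WindowsApps',
             r'C:\Python38']
assert strip_store_shim_dirs(path_dirs) == [r'C:\Python38']
```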

--
components: Windows
messages: 369379
nosy: benspiller, paul.moore, steve.dower, tim.golden, zach.ware
priority: normal
severity: normal
status: open
title: Windows py.exe launcher interacts badly with Windows store python.exe 
shim
versions: Python 3.6, Python 3.7, Python 3.8

___
Python tracker 
<https://bugs.python.org/issue40687>



[issue38633] shutil.copystat fails with PermissionError in WSL

2020-03-31 Thread Ben Spiller


Ben Spiller  added the comment:

Looks like on WSL the errno is errno.EACCES rather than EPERM, so we just need 
to change the shutil._copyxattr error handler to also cope with that error code:

     except OSError as e:
-        if e.errno not in (errno.EPERM, errno.ENOTSUP, errno.ENODATA):
+        if e.errno not in (errno.EPERM, errno.ENOTSUP, errno.ENODATA, errno.EACCES):
             raise

If anyone needs a workaround until this is fixed in shutil itself, you can do 
it by monkey-patching _copyxattr:

import errno, shutil

# have to monkey-patch to work with WSL as a workaround for
# https://bugs.python.org/issue38633
orig_copyxattr = shutil._copyxattr

def patched_copyxattr(src, dst, *, follow_symlinks=True):
    try:
        orig_copyxattr(src, dst, follow_symlinks=follow_symlinks)
    except OSError as ex:
        if ex.errno != errno.EACCES:
            raise

shutil._copyxattr = patched_copyxattr

--
nosy: +benspiller

___
Python tracker 
<https://bugs.python.org/issue38633>



[issue38607] Document that cprofile/profile only profile the main thread

2019-10-27 Thread Ben Spiller


New submission from Ben Spiller :

The built-in profiling modules only provide information about the main thread 
(at least when invoked as documented). 

To avoid user confusion we should state this in the documentation at 
https://docs.python.org/3/library/profile.html. 

Potentially we could also suggest mitigations such as manually creating a 
Profile instance in the user's thread code, but the most important thing is to 
make clear what the module does/does not do out of the box. 

(See also https://bugs.python.org/issue9609, which discusses a possible 
non-doc change to help with multi-threading, but it looks like that's stalled, 
so best to push ahead with documenting this.)
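As an illustration of the mitigation mentioned above, a per-thread Profile instance can be created inside the thread's own code (a sketch, not an officially documented recipe; the dict used to hand back results is just for the example):

```python
import cProfile, io, pstats, threading

results = {}

def worker():
    # cProfile.Profile only records the thread that enables it,
    # so create and enable one inside the thread being measured
    pr = cProfile.Profile()
    pr.enable()
    total = sum(i * i for i in range(100_000))  # the work to profile
    pr.disable()
    out = io.StringIO()
    pstats.Stats(pr, stream=out).sort_stats('cumulative').print_stats(5)
    results['stats'] = out.getvalue()   # stats for this thread only
    results['total'] = total

t = threading.Thread(target=worker)
t.start()
t.join()
assert 'function calls' in results['stats']
```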

--
assignee: docs@python
components: Documentation
messages: 355492
nosy: benspiller, docs@python
priority: normal
severity: normal
status: open
title: Document that cprofile/profile only profile the main thread
type: enhancement
versions: Python 3.7, Python 3.8, Python 3.9

___
Python tracker 
<https://bugs.python.org/issue38607>



[issue38278] Need a more efficient way to perform dict.get(key, default)

2019-09-25 Thread Ben Spiller


Ben Spiller  added the comment:

Thanks... yep I realise method calls are slower than operators, am hoping we 
can still find a cunning way to speed up this use case nonetheless. :D For 
example by having a configuration option on dict (or making a new subclass) 
that gives the (speedy!) [] operator the same no-exception semantics you'd get 
from calling get(). As you can see from my timeit benchmarks none of the 
current workarounds are very appealing for this use case, and a 2.2x slowdown 
for this common operation is a shame.

--

___
Python tracker 
<https://bugs.python.org/issue38278>



[issue38278] Need a more efficient way to perform dict.get(key, default)

2019-09-25 Thread Ben Spiller


New submission from Ben Spiller :

In performance-critical python code, it's quite common to need to get an item 
from a dictionary, falling back on a default (e.g. None, 0 etc) if it doesn't 
yet exist. The obvious way to do this based on the documentation is to call the 
dict.get() method:

> python -m timeit -s "d={'abc':123}" "x=d.get('abc',None)"
500 loops, best of 5: 74.6 nsec per loop

... however the performance of this natural approach is very poor (2.2 times 
slower!) compared to the time actually needed to look up the value:
>python -m timeit -s "d={'abc':123}" "x=d['abc']"
500 loops, best of 5: 33 nsec per loop

There are various ways to do this more efficiently, but they all have 
significant performance or functional drawbacks:

custom dict subclass with __missing__() override: promising approach, but use 
of a custom class instead of dict seems to increase [] cost significantly:
>python -m timeit -s "class mydict(dict):" -s "  def __missing__(self,key):return None" -s "d = mydict({'abc':123})" "x=d['abc']"
500 loops, best of 5: 60.4 nsec per loop

get() with caching of function lookup - somewhat better but not great:
>python -m timeit -s "d={'abc':123}; G=d.get" "x=G('abc',None)"
500 loops, best of 5: 59.8 nsec per loop

[] and "in" (inevitably a bit slow due to needing to do the lookup twice):
>python -m timeit -s "d={'abc':123}" "x=d['abc'] if 'abc' in d else None"
500 loops, best of 5: 53.9 nsec per loop

try/except approach: quickest solution if the key exists, but clunky syntax, and 
VERY slow if it doesn't exist:
>python -m timeit -s "d={'abc':123}" "try:" "   x=d['abc']" "except KeyError: pass"
500 loops, best of 5: 41.8 nsec per loop
>python -m timeit -s "d={'abc':123}" "try:" "   x=d['XXX']" "except KeyError: pass"
100 loops, best of 5: 174 nsec per loop

collections.defaultdict: reasonable performance, but unwanted behaviour of 
adding the key if missing (which if used with an unbounded universe of keys 
could produce a memory leak):
>python -m timeit -s "import collections; d=collections.defaultdict(lambda: None); d['abc']=123" "x=d['XXX']"
500 loops, best of 5: 34.3 nsec per loop

I bet we can do better! 

Lots of solutions are possible - maybe some kind of peephole optimization to 
make dict.get() itself perform similarly to the [] operator, or if that's 
challenging perhaps providing a class or option that behaves like defaultdict 
but without the auto-adding behaviour and with comparable [] performance to the 
"dict" type - for example dict.raiseExceptionOnMissing=False, or perhaps even 
some kind of new syntax (e.g. dict['key', default=None]). Which option would be 
easiest/nicest?
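Of the workarounds above, the __missing__ override is the only one that gives the [] operator get()-like semantics without auto-inserting keys; a minimal sketch (the class name is made up for illustration):

```python
class DefaultNoneDict(dict):
    # dict.__getitem__ calls __missing__ for absent keys; unlike
    # collections.defaultdict, the key is NOT inserted afterwards
    def __missing__(self, key):
        return None

d = DefaultNoneDict({'abc': 123})
assert d['abc'] == 123
assert d['xyz'] is None   # no KeyError raised
assert 'xyz' not in d     # and no memory-leaking auto-add
```

The cost is the subclass dispatch overhead measured in the timings above (60.4 vs 33 nsec).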

--
components: Interpreter Core
messages: 353206
nosy: benspiller
priority: normal
severity: normal
status: open
title: Need a more efficient way to perform dict.get(key, default)
type: enhancement
versions: Python 3.7

___
Python tracker 
<https://bugs.python.org/issue38278>



[issue35185] Logger race condition - loses lines if removeHandler called from another thread while logging

2019-06-20 Thread Ben Spiller


Ben Spiller  added the comment:

Interesting conversation. :)

Yes I agree correctness is definitely top priority. :) I'd go further and say 
I'd prefer correctness to be always there automatically, rather than something 
the user must remember to enable by setting a flag such as lockCallHandling. 
(as an aside, adding a separate extra code path and option like that would 
require a bit more doc and testing changes than just fixing the bug by making 
the self.handlers list immutable, which is a small and simple change not 
needing extra doc). 

I'm also not convinced it's worth optimizing the performance of 
add/removeHandler (which sounds like the goal of the callHandlers-locking 
approach you suggest, if I'm understanding correctly?), since in a realistic 
application that's always going to be vastly less frequent than invoking 
callHandlers. Especially if it reduces performance of the main logging path, 
which is invoked much more often. Though admittedly the 1% regression you 
quoted isn't so bad (assuming that holds in CPython/IronPython/Jython/others). 
The test program I provided is a contrived way of quickly reproducing the race 
condition, but I certainly wouldn't use it for measuring or optimizing 
performance as it wasn't designed for that - the ratio of add/removeHandler 
calls to callHandlers calls is likely to be unrepresentative of a real 
application, and there's vastly more contention on adding/removing handlers 
than you'd see in the wild.

Do you see any downsides to the immutable self.handlers approach, other than 
the performance of add/removeHandler being a little lower?

Personally I think we're on safer ground if we permit add/removeHandler to be 
slightly slower (but at least correct! correctness trumps performance), but 
only if we avoid regressing the more important performance of logging itself. 

Does that seem reasonable?

--

___
Python tracker 
<https://bugs.python.org/issue35185>



[issue35185] Logger race condition - loses lines if removeHandler called from another thread while logging

2019-06-20 Thread Ben Spiller


Ben Spiller  added the comment:

I'd definitely suggest we go for a solution that doesn't hit performance of 
normal logging when you're not adding/removing things, being as that's the more 
common case. I guess that's the reason why callHandlers was originally 
implemented without grabbing the mutex, and we should probably keep it that 
way. Logging can be a performance-critical part of some applications and I feel 
more comfortable about the fix (and more confident it won't get vetoed :)) if 
we can avoid changing callHandlers(). 

You make a good point about ensuring the solution works for non-GIL python 
versions. I thought about it some more... correct me if I'm wrong but as far as 
I can see the second idea I suggested should do that, i.e.
- self.handlers.remove(hdlr)
+ newhandlers = list(self.handlers)
+ newhandlers.remove(hdlr)
+ self.handlers = newhandlers

... which effectively changes the model so that the _value_ of the 
self.handlers list is immutable (only which list the self.handlers reference 
points to changes), so without relying on any GIL locking callHandlers will 
still see the old list or the new list but never see an inconsistent value, 
since such a list never exists. That solves the read-write race condition; we'd 
still want to keep the existing locking in add/removeHandler which prevents 
write-write race conditions. 

What do you think?
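The copy-on-write idea above can be sketched in isolation (hypothetical class and method names; the real change would go in logging.Logger.addHandler/removeHandler):

```python
import threading

class HandlerList:
    """Sketch: mutators build a new list under a lock and swap the
    reference, so a reader iterating self.handlers always sees a
    consistent snapshot (the old list or the new, never half-mutated)."""
    def __init__(self):
        self._lock = threading.Lock()   # guards write-write races only
        self.handlers = []

    def add(self, h):
        with self._lock:
            new = list(self.handlers)
            new.append(h)
            self.handlers = new         # atomic reference swap

    def remove(self, h):
        with self._lock:
            new = list(self.handlers)
            new.remove(h)
            self.handlers = new

hl = HandlerList()
hl.add('console')
hl.add('file')
snapshot = hl.handlers        # a reader grabs the current list...
hl.remove('console')          # ...while a concurrent removal swaps in a new one
assert snapshot == ['console', 'file']   # reader's snapshot is unaffected
assert hl.handlers == ['file']
```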

--

___
Python tracker 
<https://bugs.python.org/issue35185>



[issue35915] re.search extreme slowness (looks like hang/livelock), searching for patterns containing .* in a large string

2019-02-06 Thread Ben Spiller


Ben Spiller  added the comment:

Running this command:
time python -c "import re;  re.compile('y.*x').search('y'*(N))"

It's clearly quadratic:
N=100,000 time=7s
N=200,000 time=18s
N=400,000 time=110s
N=1,000,000 time=690s

This illustrates how a simple program that's working correctly can quickly 
degrade to a very long period of unresponsiveness after some fairly modest 
increases in size of input string.
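The quadratic cost comes from search() retrying the match at every start position, with each attempt's .* scanning to the end of the string. Pending a fix in the re module, a cheap guard avoids the pathological case when the mandatory literal can't match at all (a workaround sketch, not an official recommendation):

```python
import re

s = 'y' * 100_000
# re.search retries 'y.*x' at each of the ~100k positions, and each
# attempt's .* scans the rest of the string: O(n^2) when 'x' is absent.
# Checking for the mandatory literal first short-circuits that case:
m = re.search('y.*x', s) if 'x' in s else None
assert m is None  # returns immediately instead of after minutes

# the guard doesn't change behaviour when a match does exist:
s2 = 'yyxy'
m2 = re.search('y.*x', s2) if 'x' in s2 else None
assert m2 is not None and m2.group() == 'yyx'
```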

--

___
Python tracker 
<https://bugs.python.org/issue35915>



[issue35915] re.search extreme slowness (looks like hang/livelock), searching for patterns containing .* in a large string

2019-02-06 Thread Ben Spiller


Ben Spiller  added the comment:

Correction to original report - it doesn't hang indefinitely, it just takes a 
really long time. Specifically, looks like it's quadratic in the length of the 
input string. Increase the size of the input string to 1000*1000 and it's 
really really slow. 

I don't know for sure if it's possible to implement regexes in a way that 
avoids this pathological behaviour, but it's certainly quite risky that an 
otherwise-working bit of code using a pattern containing .* can hang/livelock 
an application for an arbitrary amount of time if passed a 
larger-than-expected (but actually not that big) input string.

--
title: re.search livelock/hang, searching for patterns starting .* in a large 
string -> re.search extreme slowness (looks like hang/livelock), searching for 
patterns containing .* in a large string
type: crash -> performance

___
Python tracker 
<https://bugs.python.org/issue35915>



[issue35915] re.search livelock/hang, searching for patterns starting .* in a large string

2019-02-06 Thread Ben Spiller


New submission from Ben Spiller :

These work fine and return instantly:
python -c "import re;  re.compile('.*x').match('y'*(1000*100))"
python -c "import re;  re.compile('x').search('y'*(1000*100))"
python -c "import re;  re.compile('.*x').search('y'*(1000*10))"

This hangs / freezes / livelocks indefinitely, with lots of CPU usage:
python -c "import re;  re.compile('.*x').search('y'*(1000*100))"

Admittedly performing a search() with a pattern starting .* isn't useful, 
however it's worth fixing as:
- it's easily done by inexperienced developers, or users interacting with code 
that's far removed from the actual regex call
- the failure mode of hanging forever (with the GIL held, of course) is quite 
severe (took us a lot of debugging with gdb before we figured out where our 
complex multi-threaded python program was hanging!), and 
- the fact that the behaviour is different based on the length of the string 
being matched suggests there is some kind of underlying bug in how the buffer 
is handled which might also affect other, more reasonable regex use cases

--
components: Regular Expressions
messages: 334949
nosy: benspiller, ezio.melotti, mrabarnett
priority: normal
severity: normal
status: open
title: re.search livelock/hang, searching for patterns starting .* in a large 
string
type: crash
versions: Python 2.7, Python 3.6

___
Python tracker 
<https://bugs.python.org/issue35915>



[issue35185] Logger race condition - loses lines if removeHandler called from another thread while logging

2018-11-07 Thread Ben Spiller


New submission from Ben Spiller :

I just came across a fairly serious thread-safety / race condition bug in the 
logging.Loggers class, which causes random log lines to be lost i.e. not get 
passed to some of the registered handlers, if (other, unrelated) handlers are 
being added/removed using add/removeHandler from another thread during logging. 
This potentially affects all log handler classes, though for timing reasons 
I've found it easiest to reproduce with the logging.FileHandler. 

See attached test program that reproduces this. 

I did some debugging and looks like although add/removeHandler are protected by 
_acquireLock(), they modify the self.handlers list in-place, and the 
callHandlers method iterates over self.handlers with no locking - so if you're 
unlucky you can end up with some of your handlers not being called. 

A trivial way to fix the bug is by editing callHandlers and copying the list 
before iterating over it:
- for hdlr in c.handlers:
+ for hdlr in list(c.handlers):

However since that could affect the performance of routine log statements a 
better fix is probably to change the implementation of add/removeHandler to 
avoid in-place modification of self.handlers so that (as a result of the GIL) 
it'll be safe to iterate over the list in callHandlers, e.g. change 
removeHandler like this:

- self.handlers.remove(hdlr)
+ newhandlers = list(self.handlers)
+ newhandlers.remove(hdlr)
+ self.handlers = newhandlers

(and the equivalent in addHandler)

--
components: Library (Lib)
files: logger-race.py
messages: 329429
nosy: benspiller
priority: normal
severity: normal
status: open
title: Logger race condition - loses lines if removeHandler called from another 
thread while logging
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6, Python 3.7
Added file: https://bugs.python.org/file47914/logger-race.py

___
Python tracker 
<https://bugs.python.org/issue35185>



[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2018-11-07 Thread Ben Spiller


Change by Ben Spiller :


--
nosy: +Ben Spiller

___
Python tracker 
<https://bugs.python.org/issue5166>



[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2018-10-19 Thread Ben Spiller


Ben Spiller  added the comment:

To help anyone else struggling with this bug, based on 
https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/
 the best workaround I've currently found is to define this:

def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'):
    return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]',
                  replaceWith, unicodeString)

and then copy+paste the following pattern into every bit of code that generates 
XML:

myfile.write(escape_xml_illegal_chars(document.toxml(encoding='utf-8').decode('utf-8')).encode('utf-8'))

It's obviously pretty grim (and unsafe) to expect every python developer to 
copy+paste this kind of thing into their own project to avoid buggy XML 
generation, so would be better to have the escape_xml_illegal_chars function in 
the python standard library (maybe alongside xml.sax.utils.escape - which 
notably does _not_ escape all the unicode characters that aren't valid XML), 
and built-in support for this as part of document.toxml. 

I guess we'd want it to be user-configurable, for any users who are prepared 
to tolerate the possibility of unparseable XML documents being generated in 
return for improved performance in the common case where these characters are 
not present; but not having the capability at all just means most python 
applications that generate XML without special-casing this have a bug. I 
suggest we definitely need some clear warnings about this in the doc.
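To illustrate the problem and the workaround end-to-end (Python 3 spelling with the u prefixes dropped; escape_xml_illegal_chars as defined above):

```python
import re
from xml.dom.minidom import Document, parseString

def escape_xml_illegal_chars(s, replace_with='?'):
    # characters forbidden by XML 1.0: most C0 controls, lone
    # surrogates, and U+FFFE/U+FFFF
    return re.sub('[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]',
                  replace_with, s)

doc = Document()
root = doc.createElement('root')
root.appendChild(doc.createTextNode(escape_xml_illegal_chars('ok\x02bad')))
doc.appendChild(root)
xml = doc.toxml()           # minidom happily serializes either way...
parsed = parseString(xml)   # ...but re-parsing a raw \x02 would raise
assert parsed.documentElement.firstChild.data == 'ok?bad'
```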

--

___
Python tracker 
<https://bugs.python.org/issue5166>



[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML

2018-09-10 Thread Ben Spiller


Ben Spiller  added the comment:

Hi it's been a few years now since this was reported and it's still a problem, 
any chance of a fix for this? The API gives the impression that if you pass 
python strings to the XML API then the library will generate valid XML. It 
takes care of the charset/encoding and entity escaping aspects of XML 
generation so would be logical for it to in some way take care of control 
characters too - especially as silently generating unparseable XML is a 
somewhat dangerous failure mode. 

I think there's a strong case for some built-in functionality to replace/ignore 
the control characters (perhaps as a configurable option, in case of 
performance worries) rather than just throwing an exception, since it's very 
common to have an arbitrary string generated by some other program or user 
input that needs to be written into an XML file (and a lot less common to be 
100% sure in all cases what characters your string might contain). For those 
common use cases, the current situation where every python developer needs to 
implement their own workaround to sanitize strings isn't ideal, especially as 
it's not trivial to get right, and likely a lot of the community who end up 
'rolling their own' are getting it wrong in some way. 

[On the other hand if you guys decide this really isn't going to be fixed, then 
at the very least I'd suggest that the API documentation should prominently 
state that it is up to the users of these libraries to implement their own 
sanitization of control characters, since I'm sure none of us want people using 
python to end up with buggy applications]

--
nosy: +benspiller
versions: +Python 3.5, Python 3.6, Python 3.7

___
Python tracker 
<https://bugs.python.org/issue5166>



[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-20 Thread Ben Spiller

Ben Spiller added the comment:

Thanks for considering this, anyway. I'll admit I'm disappointed we couldn't 
fix this on the 2.7 train, as to me fixing a method that takes an 
errors='ignore' argument and then throws an exception anyway seems a little 
more like a bug than a feature (and changing it would likely not affect 
behaviour in any existing non-broken programs), but if that's the decision then 
fine. Of course I'm aware (as I mentioned earlier on the thread) that the 
radically different unicode handling in python 3 solves this entirely and only 
wish it was practical to move our existing (enormous) codebase and customers 
over to it, but we're stuck with Python 2.7 - I believe lots of people are in 
the same situation unfortunately. 
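For contrast, a sketch of why Python 3 avoids this trap entirely: the str/bytes split makes every conversion explicit, so there is no implicit ascii decode to fail:

```python
text = 'caf\u00e9'                    # a str with a non-ascii character
data = text.encode('utf-8')           # str -> bytes, explicit
assert data.decode('utf-8') == text   # bytes -> str, explicit

# bytes has no .encode and str has no .decode in Python 3, so the
# Python 2 trap (str.encode implicitly decoding as ascii) cannot occur:
assert not hasattr(data, 'encode')
assert not hasattr(text, 'decode')
```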

As Josh suggested, perhaps we can at least add something to the doc for the 
str/unicode encode and decode methods so users are aware of the behaviour 
without trial and error. I'll update the component of this bug to reflect it's 
now considered a doc issue. 

Based on the inputs from Terry, and what seem to be the key info that would 
have been helpful to me and those who are hitting the same issues for the first 
time, I'd propose the following text (feel free to adjust as you see fit):

For encode:
"For most encodings, the return type is a byte str regardless of whether it is 
called on a str or unicode object. For example, call encode on a unicode object 
with "utf-8" to return a byte str object, or call encode on a str object with 
"base64" to return a base64-encoded str object.

It is _not_ recommended to call this method on "str" objects when using 
codecs such as utf-8 that convert between str and unicode objects, as any 
characters not supported by python's default encoding (usually 7-bit ascii) 
will result in a UnicodeDecodeError exception, even if errors='ignore' was 
specified. For such conversions the str.decode and unicode.encode methods 
should be used. If you need to produce an encoded version of a string that 
could be either a str or unicode object, only call the encode() method after 
checking it is a unicode object not a str object, using isinstance(s, unicode)."

and for decode:
"The return type may be either str or unicode, depending on which encoding is 
used and whether the method is called on a str or unicode object. For example, 
call decode on a str object with "utf-8" to return a unicode object, or call 
decode on a unicode or str object with "base64" to return a base64-decoded str 
object.

It is _not_ recommended to call this method on "unicode" objects when using 
codecs such as utf-8 that convert between str and unicode objects, as any 
characters not supported by python's default encoding (usually 7-bit ascii) 
will result in a UnicodeEncodeError exception, even if errors='ignore' was 
specified. For such conversions the str.decode and unicode.encode methods 
should be used. If you need to produce a decoded version of a string that could 
be either a str or unicode object, only call the decode() method after checking 
it is a str object not a unicode object, using isinstance(s, str)."

--
components: +Documentation -Interpreter Core

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26369>



[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-19 Thread Ben Spiller

Ben Spiller added the comment:

btw If anyone can find the place in the code (sorry I tried and failed!) where 
str.encode('utf-8', errors=X) is resulting in an implicit call to the 
equivalent of decode(defaultencoding, errors='strict') (as suggested by the 
exception message) I think it'll be easier to discuss the details of fixing.

Thanks for your reply - yes I'm aware that theoretically you _could_ globally 
change python's default encoding from ascii, but the prevailing view I've heard 
from python developers seems to be that changing it is not a good idea and may 
cause lots of library code to break. Also it's probably not a good idea for 
individual libraries or modules to be changing global state that affects the 
entire python invocation, and it would be nice to find a less fragile and more 
out-of-the-box solution to this. You may well want different encodings (not 
just utf-8) in different parts of your program - so changing the 
globally-defined default encoding doesn't seem right, especially for a method 
like str.encode that already takes an 'encoding' argument (used currently only 
for the encoding aspect, not the decoding aspect). 

I do think there's a strong case to be made for changing the str.encode (and 
also unicode.decode) behaviour so that str.encode('utf-8') behaves the same 
whether it's given ascii or non-ascii characters, and also similar to 
unicode.encode('utf-8'). Let me try to persuade you... :)

First, to address the point you made:

> If str.encode() raises a decoding exception, this is a programming bug. It 
> would be bad to hide it.

I totally agree with the general principal of not hiding programming bugs. 
However if calling str.encode for codecs like utf8 (let's ignore base64 for 
now, which is a very different beast) was *consistently* treated as a 
'programming bug' by python and always resulted in an exception that would be 
ok (suboptimal usability imho, but still ok), since programmers would quickly 
spot the problem and fix it. But that's not what happens - it *silently works* 
(is a no-op) as long as you happen to be using ASCII characters so this 
so-called 'programming bug' will go unnoticed by most programmers (and authors 
of third party library code you might be relying on!)... but the moment a 
non-ascii character get introduced suddenly you'll get an exception, maybe in 
some library code you rely on but can't fix. For this reason I don't think 
treating this as a programming bug is helping anyone write more robust python 
code - quite the reverse. Plus I think the behaviour of being a no-op is almost 
always 'what you would have wanted it to do' anyway, whereas the behaviour of 
throwing an exception almost never is. 

I think we'd agree that changing str.encode(utf8) to throw an exception in 
*all* cases wouldn't be a realistic option since it would certainly break 
backwards compatability in painful ways for many existing apps and library 
code. 

So, if we want to make the behaviour of this important built-in type a bit more 
consistent and less error-prone/fragile for this case then I think the only 
option is making str.encode be a no-op for non-ascii characters (at least, 
non-ascii characters that are valid in the specified encoding), just as it is 
for ascii characters. 

Here's why I think ditching the current behaviour would be a good idea:
- calling str.encode() and getting a DecodeError is confusing ("I asked you to 
encode this string, what are you decoding for?")
- calling str.encode('utf-8') and getting an exception about "ascii" is 
confusing as the only encoding I mentioned in the method call was utf-8
- calling encode(..., errors=ignore) and getting an exception is confusing and 
feels like a bug; I've explicitly specified that I do NOT want exceptions from 
calling this method, yet (because neither 'errors' nor 'encoding' argument gets 
passed to the implicit - and undocumented - decode operation), I get unexpected 
behaviour that is far more likely to break my program than a no-op
- the somewhat surprising behaviour we're talking about is not explicitly 
documented anywhere
- having str.encode throw on non-ascii but not ascii makes it very likely that 
code will be written and shipped (including library code you may have no 
control over) that *appears* to work under normal testing but has *hidden* bugs 
that surface only once non-ascii characters are used. 
- in every situation I can think of, having str.encode(encoding, errors=ignore) 
honour the encoding and errors arguments even for the implicit-decode operation 
is more useful than having it ignore those arguments and throw an exception
- a quick google shows lots of people in the Python community (from newbies to 
experts) are seeing this exception and being confused by it, therefore a lot of 
people's lives might be improved if we can somehow make the situation better :)
- even with the best of intentions (and with cod

[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-12 Thread Ben Spiller

Ben Spiller added the comment:

I'm proposing that str.encode() should _not_ throw a 'decode' exception for 
non-ascii characters, and should instead be effectively a no-op, to match what 
it already does for ascii characters - which therefore shouldn't break 
behaviour anyone is depending on. This could be achieved by passing the 
encoding parameter through to the implicit decode() call (which appears to be 
where the exception is coming from), rather than (arbitrarily and surprisingly) 
using "ascii" (which of course sometimes works and sometimes doesn't, depending 
on the input string).

Does that make sense?
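The implicit step can be reproduced explicitly (using Python 3 syntax to illustrate; in Python 2 the ascii decode happened invisibly inside str.encode):

```python
# A UTF-8 byte string containing a non-ascii character:
data = u"caf\xe9".encode("utf-8")

# Python 2's data.encode('utf-8') effectively did:
#     data.decode('ascii').encode('utf-8')
# and the first, implicit step is where the surprising error came from:
try:
    data.decode("ascii")
except UnicodeDecodeError as e:
    print("implicit decode step fails:", e)

# Passing the real encoding through instead, as proposed, is a no-op round-trip:
assert data.decode("utf-8").encode("utf-8") == data
```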

If someone can find the place in the code (sorry, I tried and failed!) where 
str.encode('utf-8') results in an implicit call to the equivalent of 
decode('ascii') (as suggested by the exception message), I think it'll be 
easier to discuss the details.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue26369>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-12 Thread Ben Spiller

Ben Spiller added the comment:

Yes, the situation is loads better in Python 3; this issue is specific to 2.x, 
but like many people we're sadly not able to move to 3 for the time being. 

Since making this mistake is quite common, and there's some sensible behaviour 
that would make it disappear (resulting in ascii and non-ascii strings being 
treated the same way by these methods), I'd much prefer if we could actually 
fix it for Python 2.7.

--




[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii

2016-05-12 Thread Ben Spiller

Ben Spiller added the comment:

Thanks, that's really helpful.

Having thought about it some more, I think if possible it'd really be so much 
better to actually 'fix' the behaviour for the unicode<->str standard codecs 
(i.e. not base64) rather than just documenting around it. The current behaviour 
is not only confusing but leads to bugs that are very easy to miss, since the 
methods work correctly when given 7-bit ascii characters. 

I had a poke around in the Python source but couldn't quite identify where it's 
happening - presumably somewhere in the str.encode('utf-8') implementation the 
string is first "decoded", using the ascii codec. If that step could be made to 
use the same encoding that was passed in (e.g. utf-8) then it would end up 
being a no-op, and there would be no unpleasant bugs that only appear when the 
input includes non-ascii characters. 

It would also allow X.encode('utf-8') to be called successfully whether X is 
already a str or is a unicode object, which would save callers having to 
explicitly check what kind of string they've been passed. 

Is anyone able to look into the code to see where this would need to be fixed, 
and how difficult it would be? I have a feeling that once the line is located 
it might be quite a straightforward fix.

Many thanks

--
components: +Interpreter Core -Documentation
title: doc for unicode.decode and str.encode is unnecessarily confusing -> 
unicode.decode and str.encode are unnecessarily confusing for non-ascii




[issue26369] doc for unicode.decode and str.encode is unnecessarily confusing

2016-02-16 Thread Ben Spiller

New submission from Ben Spiller:

It's well known that lots of people struggle writing correct programs using 
non-ascii strings in python 2.x, but I think one of the main reasons for this 
could be very easily fixed with a small addition to the documentation for 
str.encode and unicode.decode, which is currently quite vague. 

The decode/encode methods really make most sense when called on a unicode 
string i.e. unicode.encode() to produce a byte string, or on a byte string e.g. 
str.decode() to produce a unicode object from a byte string. 

However, the additional presence of the opposite methods str.encode() and 
unicode.decode() is quite confusing and a frequent source of errors. For 
example, calling str.encode('utf-8') first DECODES the str object (which might 
already be in utf-8) to a unicode string **using the default encoding of 
"ascii"** (!) before ENCODING to a utf-8 byte str as requested; this of course 
fails at the first stage with the classic error "UnicodeDecodeError: 'ascii' 
codec can't decode byte" if any non-ascii chars are present. It's unfortunate 
that this initial decode/encode stage ignores both the "encoding" argument 
(used only for the subsequent encode/decode) and the "errors" argument 
(commonly used when the programmer is happy with a best-effort conversion, 
e.g. for logging purposes).
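For contrast, Python 3 removed this ambiguity at the type level, which is why the mistake cannot be made there; a quick demonstration of the cleaner split:

```python
# Text (str) only encodes; binary data (bytes) only decodes, so calling the
# wrong-direction method is an AttributeError rather than a hidden ascii step:
assert hasattr(u"", "encode") and not hasattr(u"", "decode")
assert hasattr(b"", "decode") and not hasattr(b"", "encode")

# And the 'errors' argument is honoured by the single real conversion
# (the bad byte becomes U+FFFD, the replacement character):
print(b"caf\xe9".decode("ascii", errors="replace"))
```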

Anyway, given this behaviour, a lot of time would be saved by a simple sentence 
in the doc for str.encode()/unicode.decode() essentially warning people that 
those methods aren't that useful and that they probably really intended to use 
str.decode()/unicode.encode() - the current doc gives absolutely no clue about 
this extra stage, which ignores the input arguments and uses 'ascii' and 
'strict'. It might also be worth stating in the documentation that the pattern 
(u.encode(encoding) if isinstance(u, unicode) else u) can be helpful for cases 
where you unavoidably have to deal with both kinds of input, since calling 
str.encode is such a bad idea. 
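The suggested pattern, written out as a small helper (Python 3 syntax, with str playing the role of Python 2's unicode; the name to_bytes is purely illustrative):

```python
def to_bytes(value, encoding="utf-8"):
    # Python 3 equivalent of the suggested Python 2 pattern:
    #     u.encode(encoding) if isinstance(u, unicode) else u
    # i.e. encode only when we actually hold text; never re-encode bytes.
    return value.encode(encoding) if isinstance(value, str) else value

assert to_bytes(u"caf\xe9") == b"caf\xc3\xa9"       # text gets encoded
assert to_bytes(b"caf\xc3\xa9") == b"caf\xc3\xa9"   # bytes are untouched
```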

In an ideal world I'd love to see the implementation of 
str.encode/unicode.decode changed to be more useful (i.e. instead of using 
ascii, it would be more logical and useful to use the passed-in encoding to 
perform the initial decode/encode, and the passed-in 'errors' value). I wasn't 
sure if that change would be accepted, so for now I'm proposing better 
documentation of the existing behaviour as a second-best.

--
assignee: docs@python
components: Documentation
messages: 260359
nosy: benspiller, docs@python
priority: normal
severity: normal
status: open
title: doc for unicode.decode and str.encode is unnecessarily confusing
type: behavior
versions: Python 2.7
