[issue40687] Windows py.exe launcher interacts badly with Windows store python.exe shim
Change by Ben Spiller : -- type: -> behavior ___ Python tracker <https://bugs.python.org/issue40687> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue40687] Windows py.exe launcher interacts badly with Windows store python.exe shim
New submission from Ben Spiller : The py.exe launcher doc states "If no relevant options are set, the commands python and python2 will use the latest Python 2.x version installed" ... which was indeed working reliably until Microsoft added their weird python.exe shim (which either terminates with no output or brings up the Microsoft Store page) as part of https://devblogs.microsoft.com/python/python-in-the-windows-10-may-2019-update/ Now I find that scripts starting with "#!/usr/bin/env python" cause py.exe to run the Windows python.exe shim, which confusingly terminates with no output (unless run with no arguments). To stop lots of developers banging their heads against this brick wall, I think py.exe should include some logic to ignore the C:\Users\XXX\AppData\Local\Microsoft\WindowsApps\python.exe shim, since for someone with py.exe installed, running that is _never_ what you'd want. (Alternatively, work with Microsoft to get this decision reversed, but that may be harder!) Lots of people are hitting this, e.g.
https://superuser.com/questions/1437590/typing-python-on-windows-10-version-1903-command-prompt-opens-microsoft-stor , https://stackoverflow.com/questions/57485491/python-python3-executes-in-command-prompt-but-does-not-run-correctly Here's the output:

>py myscript.py
launcher build: 32bit
launcher executable: Console
File 'C:\Users\XXX\AppData\Local\py.ini' non-existent
Using global configuration file 'C:\WINDOWS\py.ini'
Called with command line: apama-build\build.py -h
maybe_handle_shebang: read 256 bytes
maybe_handle_shebang: BOM not found, using UTF-8
parse_shebang: found command: python
searching PATH for python executable
Python on path: C:\Users\XXX\AppData\Local\Microsoft\WindowsApps\python.exe
located python on PATH: C:\Users\XXX\AppData\Local\Microsoft\WindowsApps\python.exe
run_child: about to run 'C:\Users\XXX\AppData\Local\Microsoft\WindowsApps\python.exe myscript.py'
child process exit code: 9009

>py -0
Installed Pythons found by py Launcher for Windows
-3.8-64 *
-3.7-64
-3.6-64
-2.7-64

(nb: it was surprising that it didn't run any of those installed versions!)

-- components: Windows messages: 369379 nosy: benspiller, paul.moore, steve.dower, tim.golden, zach.ware priority: normal severity: normal status: open title: Windows py.exe launcher interacts badly with Windows store python.exe shim versions: Python 3.6, Python 3.7, Python 3.8 ___ Python tracker <https://bugs.python.org/issue40687> ___
[issue38633] shutil.copystat fails with PermissionError in WSL
Ben Spiller added the comment: Looks like on WSL the errno is errno.EACCES rather than EPERM, so we just need to change the shutil._copyxattr error handler to also cope with that error code:

      except OSError as e:
-         if e.errno not in (errno.EPERM, errno.ENOTSUP, errno.ENODATA):
+         if e.errno not in (errno.EPERM, errno.ENOTSUP, errno.ENODATA, errno.EACCES):
              raise

If anyone needs a workaround until this is fixed in shutil itself, you can do it by monkey-patching _copyxattr:

import errno, shutil

# have to monkey-patch to work with WSL as workaround for https://bugs.python.org/issue38633
orig_copyxattr = shutil._copyxattr
def patched_copyxattr(src, dst, *, follow_symlinks=True):
    try:
        orig_copyxattr(src, dst, follow_symlinks=follow_symlinks)
    except OSError as ex:
        if ex.errno != errno.EACCES:
            raise
shutil._copyxattr = patched_copyxattr

-- nosy: +benspiller ___ Python tracker <https://bugs.python.org/issue38633> ___
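To illustrate why the monkey-patch above is safe, here is a self-contained sketch of the same error-swallowing pattern using a hypothetical stand-in function (fragile_copyxattr is invented for this example, since triggering the real xattr failure depends on the filesystem):

```python
import errno

def fragile_copyxattr(src, dst, *, follow_symlinks=True):
    # Stand-in for shutil._copyxattr: always fails the way WSL does
    raise OSError(errno.EACCES, 'Permission denied')

orig_copyxattr = fragile_copyxattr

def patched_copyxattr(src, dst, *, follow_symlinks=True):
    try:
        orig_copyxattr(src, dst, follow_symlinks=follow_symlinks)
    except OSError as ex:
        if ex.errno != errno.EACCES:  # swallow only the WSL EACCES failure
            raise

patched_copyxattr('src.txt', 'dst.txt')  # EACCES is silently ignored
```

Any other OSError still propagates, so genuine failures are not hidden.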
[issue38607] Document that cprofile/profile only profile the main thread
New submission from Ben Spiller : The built-in profiling modules only provide information about the main thread (at least when invoked as documented). To avoid user confusion we should state this in the documentation at https://docs.python.org/3/library/profile.html. Potentially we could also suggest mitigations such as manually creating a Profile instance in the user's thread code, but the most important thing is to make clear what the module does/does not do out of the box. (See also https://bugs.python.org/issue9609 which discusses a possible non-doc change to help with multi-threading, but it looks like that's stalled, so best to push ahead with documenting this.) -- assignee: docs@python components: Documentation messages: 355492 nosy: benspiller, docs@python priority: normal severity: normal status: open title: Document that cprofile/profile only profile the main thread type: enhancement versions: Python 3.7, Python 3.8, Python 3.9 ___ Python tracker <https://bugs.python.org/issue38607> ___
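The suggested mitigation can be sketched like this (worker and results are just illustrative names): each thread creates and enables its own cProfile.Profile instance, since profiling started the documented way only observes the main thread.

```python
import cProfile
import threading

results = {}

def worker():
    # Each thread must create and enable its own Profile instance;
    # profiling enabled in the main thread does not follow us here.
    prof = cProfile.Profile()
    prof.enable()
    results['total'] = sum(i * i for i in range(1000))
    prof.disable()
    # getstats() returns the raw profiling entries collected in this thread
    results['events'] = len(prof.getstats())

t = threading.Thread(target=worker)
t.start()
t.join()
```

After t.join(), results['events'] is non-zero, confirming the thread's work was actually profiled.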
[issue38278] Need a more efficient way to perform dict.get(key, default)
Ben Spiller added the comment: Thanks... yep I realise method calls are slower than operators, am hoping we can still find a cunning way to speed up this use case nonetheless. :D For example by having a configuration option on dict (or making a new subclass) that gives the (speedy!) [] operator the same no-exception semantics you'd get from calling get(). As you can see from my timeit benchmarks none of the current workarounds are very appealing for this use case, and a 2.2x slowdown for this common operation is a shame. -- ___ Python tracker <https://bugs.python.org/issue38278> ___
[issue38278] Need a more efficient way to perform dict.get(key, default)
New submission from Ben Spiller : In performance-critical python code, it's quite common to need to get an item from a dictionary, falling back on a default (e.g. None, 0 etc) if it doesn't yet exist. The obvious way to do this based on the documentation is to call the dict.get() method:

>python -m timeit -s "d={'abc':123}" "x=d.get('abc',None)"
500 loops, best of 5: 74.6 nsec per loop

... however the performance of this natural approach is very poor (2.2 times slower!) compared to the time really needed to look up the value:

>python -m timeit -s "d={'abc':123}" "x=d['abc']"
500 loops, best of 5: 33 nsec per loop

There are various ways to do this more efficiently, but they all have significant performance or functional drawbacks:

custom dict subclass with __missing__() override: promising approach, but use of a custom class instead of dict seems to increase [] cost significantly:
>python -m timeit -s "class mydict(dict):" -s "  def __missing__(self,key): return None" -s "d = mydict({'abc':123})" "x=d['abc']"
500 loops, best of 5: 60.4 nsec per loop

get() with caching of the function lookup - somewhat better but not great:
>python -m timeit -s "d={'abc':123}; G=d.get" "x=G('abc',None)"
500 loops, best of 5: 59.8 nsec per loop

[] and "in" (inevitably a bit slow due to needing to do the lookup twice):
>python -m timeit -s "d={'abc':123}" "x=d['abc'] if 'abc' in d else None"
500 loops, best of 5: 53.9 nsec per loop

try/except approach: quickest solution if the key exists, but clunky syntax, and VERY slow if it doesn't exist:
>python -m timeit -s "d={'abc':123}" "try:" "  x=d['abc']" "except KeyError: pass"
500 loops, best of 5: 41.8 nsec per loop
>python -m timeit -s "d={'abc':123}" "try:" "  x=d['XXX']" "except KeyError: pass"
100 loops, best of 5: 174 nsec per loop

collections.defaultdict: reasonable performance if the key exists, but unwanted behaviour of adding the key if missing (which, if used with an unbounded universe of keys, could produce a memory leak):
>python -m timeit -s "import collections; d=collections.defaultdict(lambda: None); d['abc']=123" "x=d['XXX']"
500 loops, best of 5: 34.3 nsec per loop

I bet we can do better! Lots of solutions are possible - maybe some kind of peephole optimization to make dict.get() itself perform similarly to the [] operator, or if that's challenging, perhaps a class or option that behaves like defaultdict but without the auto-adding behaviour and with [] performance comparable to the plain dict type - for example dict.raiseExceptionOnMissing=False, or perhaps even some kind of new syntax (e.g. dict['key', default=None]). Which option would be easiest/nicest?

-- components: Interpreter Core messages: 353206 nosy: benspiller priority: normal severity: normal status: open title: Need a more efficient way to perform dict.get(key, default) type: enhancement versions: Python 3.7 ___ Python tracker <https://bugs.python.org/issue38278> ___
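For anyone wanting to reproduce these measurements from a script rather than the command line, here is a minimal sketch using the timeit module directly (absolute numbers will of course vary by machine and Python version):

```python
import timeit

d = {'abc': 123}

# The two lookups agree whenever the key exists...
assert d.get('abc', None) == d['abc'] == 123
# ...and differ only for missing keys: get() falls back, [] raises
assert d.get('XXX', None) is None

# number=100000 iterations; total seconds * 1e4 gives nanoseconds per loop
get_time = timeit.timeit("x=d.get('abc',None)", globals={'d': d}, number=100000)
sub_time = timeit.timeit("x=d['abc']", globals={'d': d}, number=100000)
print('get(): %.1f ns/loop, []: %.1f ns/loop' % (get_time * 1e4, sub_time * 1e4))
```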
[issue35185] Logger race condition - loses lines if removeHandler called from another thread while logging
Ben Spiller added the comment: Interesting conversation. :) Yes, I agree correctness is definitely the top priority. :) I'd go further and say I'd prefer correctness to be always there automatically, rather than something the user must remember to enable by setting a flag such as lockCallHandling. (As an aside, adding a separate extra code path and option like that would require rather more doc and testing changes than just fixing the bug by making the self.handlers list immutable, which is a small and simple change needing no extra doc.) I'm also not convinced it's worth optimizing the performance of add/removeHandler (which sounds like the goal of the callHandlers-locking approach you suggest, if I'm understanding correctly?) - since in a realistic application that's always going to be vastly less frequent than invoking callHandlers. Especially if it reduces the performance of the main logging path, which is invoked much more often. Though admittedly the 1% regression you quoted isn't so bad (assuming that holds in CPython/IronPython/Jython/others). The test program I provided is a contrived way of quickly reproducing the race condition, but I certainly wouldn't use it for measuring or optimizing performance, as it wasn't designed for that - the ratio of add/removeHandler calls to callHandlers calls is likely to be unrepresentative of a real application, and there's vastly more contention on add/removeHandler than you'd see in the wild. Do you see any downsides to the immutable self.handlers approach, other than the performance of add/removeHandler being a little lower? Personally I think we're on safer ground if we permit add/removeHandler to be slightly slower (but at least correct! correctness trumps performance), provided we avoid regressing the more important performance of logging itself. Does that seem reasonable?

-- ___ Python tracker <https://bugs.python.org/issue35185> ___
[issue35185] Logger race condition - loses lines if removeHandler called from another thread while logging
Ben Spiller added the comment: I'd definitely suggest we go for a solution that doesn't hit the performance of normal logging when you're not adding/removing handlers, since that's the more common case. I guess that's the reason why callHandlers was originally implemented without grabbing the mutex, and we should probably keep it that way. Logging can be a performance-critical part of some applications, and I feel more comfortable about the fix (and more confident it won't get vetoed :)) if we can avoid changing callHandlers(). You make a good point about ensuring the solution works for non-GIL python implementations. I thought about it some more... correct me if I'm wrong, but as far as I can see the second idea I suggested should do that, i.e.

- self.handlers.remove(hdlr)
+ newhandlers = list(self.handlers)
+ newhandlers.remove(hdlr)
+ self.handlers = newhandlers

... which effectively changes the model so that the _value_ of the self.handlers list is immutable (only which list the self.handlers reference points to changes). So without relying on any GIL locking, callHandlers will see either the old list or the new list, but never an inconsistent value, since such a list never exists. That solves the read-write race condition; we'd still want to keep the existing locking in add/removeHandler, which prevents write-write race conditions. What do you think?

-- ___ Python tracker <https://bugs.python.org/issue35185> ___
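The copy-on-write idea above can be sketched in isolation. HandlerList below is a simplified stand-in for Logger (not the real class): writers build a new list under a lock and atomically swap the reference, while readers iterate whatever list object they observe, which is never mutated after publication.

```python
import threading

class HandlerList:
    """Simplified stand-in for logging.Logger's handler management."""
    def __init__(self):
        self._lock = threading.Lock()
        self.handlers = []

    def addHandler(self, hdlr):
        with self._lock:  # serializes writers (write-write races)
            newhandlers = list(self.handlers)
            newhandlers.append(hdlr)
            self.handlers = newhandlers  # atomic reference swap

    def removeHandler(self, hdlr):
        with self._lock:
            newhandlers = list(self.handlers)
            newhandlers.remove(hdlr)
            self.handlers = newhandlers

    def callHandlers(self, record):
        # No lock needed: whichever list object we see is never mutated
        for hdlr in self.handlers:
            hdlr(record)

log = HandlerList()
seen = []
log.addHandler(seen.append)
log.callHandlers('hello')   # delivered to the registered handler
log.removeHandler(seen.append)
log.callHandlers('world')   # no handlers left; nothing delivered
```

Readers never observe a half-modified list, because no list is ever modified in place after being assigned to self.handlers.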
[issue35915] re.search extreme slowness (looks like hang/livelock), searching for patterns containing .* in a large string
Ben Spiller added the comment: Running this command:

time python -c "import re; re.compile('y.*x').search('y'*(N))"

It's clearly quadratic:

N=100,000   time=7s
N=200,000   time=18s
N=400,000   time=110s
N=1,000,000 time=690s

This illustrates how a simple program that's working correctly can quickly degrade to a very long period of unresponsiveness after some fairly modest increases in the size of the input string.

-- ___ Python tracker <https://bugs.python.org/issue35915> ___
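One workaround worth noting while the underlying behaviour stands: for single-line input, search() with a pattern starting '.*' succeeds exactly when match() does, because the leading '.*' already spans any prefix - but match() attempts the scan once, at offset 0, instead of retrying it at every offset (which is where the quadratic cost comes from). A hedged sketch:

```python
import re

s = 'y' * 100000  # single-line input: '.' does not match newlines

# match() anchors at position 0; the leading '.*' lets it find an 'x'
# anywhere in a single-line string, so it answers the same question as
# search() here, without the quadratic per-offset retries.
assert re.compile('.*x').match(s) is None
assert re.compile('.*x').match(s + 'x') is not None

# Simpler still, if the question is just "does the string contain x":
assert re.search('x', s) is None
assert 'x' not in s
```

The single-line caveat matters: with embedded newlines, search('.*x') can match past a newline where match('.*x') cannot.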
[issue35915] re.search extreme slowness (looks like hang/livelock), searching for patterns containing .* in a large string
Ben Spiller added the comment: Correction to original report - it doesn't hang indefinitely, it just takes a really long time. Specifically, looks like it's quadratic in the length of the input string. Increase the size of the input string to 1000*1000 and it's really really slow. I don't know for sure if it's possible to implement regexes in a way that avoids this pathological behaviour, but it's certainly quite risky that an otherwise working bit of code using a pattern containing .* can hang/livelock an application for an arbitrary amount of time if passed a larger-than-expected (but actually not that big) input string. -- title: re.search livelock/hang, searching for patterns starting .* in a large string -> re.search extreme slowness (looks like hang/livelock), searching for patterns containing .* in a large string type: crash -> performance ___ Python tracker <https://bugs.python.org/issue35915> ___
[issue35915] re.search livelock/hang, searching for patterns starting .* in a large string
New submission from Ben Spiller : These work fine and return instantly:

python -c "import re; re.compile('.*x').match('y'*(1000*100))"
python -c "import re; re.compile('x').search('y'*(1000*100))"
python -c "import re; re.compile('.*x').search('y'*(1000*10))"

This hangs / freezes / livelocks indefinitely, with lots of CPU usage:

python -c "import re; re.compile('.*x').search('y'*(1000*100))"

Admittedly performing a search() with a pattern starting .* isn't useful, however it's worth fixing as:
- it's easily done by inexperienced developers, or by users interacting with code that's far removed from the actual regex call
- the failure mode of hanging forever (with the GIL held, of course) is quite severe (it took us a lot of debugging with gdb before we figured out where our complex multi-threaded python program was hanging!), and
- the fact that the behaviour differs based on the length of the string being matched suggests there is some kind of underlying bug in how the buffer is handled which might also affect other, more reasonable regex use cases

-- components: Regular Expressions messages: 334949 nosy: benspiller, ezio.melotti, mrabarnett priority: normal severity: normal status: open title: re.search livelock/hang, searching for patterns starting .* in a large string type: crash versions: Python 2.7, Python 3.6 ___ Python tracker <https://bugs.python.org/issue35915> ___
[issue35185] Logger race condition - loses lines if removeHandler called from another thread while logging
New submission from Ben Spiller : I just came across a fairly serious thread-safety / race condition bug in the logging.Logger class, which causes random log lines to be lost, i.e. not passed to some of the registered handlers, if (other, unrelated) handlers are being added/removed using add/removeHandler from another thread during logging. This potentially affects all log handler classes, though for timing reasons I've found it easiest to reproduce with logging.FileHandler. See the attached test program that reproduces this. I did some debugging, and it looks like although add/removeHandler are protected by _acquireLock(), they modify the self.handlers list in-place, and the callHandlers method iterates over self.handlers with no locking - so if you're unlucky you can end up with some of your handlers not being called. A trivial way to fix the bug is to edit callHandlers and copy the list before iterating over it:

- for hdlr in c.handlers:
+ for hdlr in list(c.handlers):

However since that could affect the performance of routine log statements, a better fix is probably to change the implementation of add/removeHandler to avoid in-place modification of self.handlers, so that (as a result of the GIL) it'll be safe to iterate over the list in callHandlers, e.g. change removeHandler like this:

- self.handlers.remove(hdlr)
+ newhandlers = list(self.handlers)
+ newhandlers.remove(hdlr)
+ self.handlers = newhandlers

(and the equivalent in addHandler)

-- components: Library (Lib) files: logger-race.py messages: 329429 nosy: benspiller priority: normal severity: normal status: open title: Logger race condition - loses lines if removeHandler called from another thread while logging type: behavior versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6, Python 3.7 Added file: https://bugs.python.org/file47914/logger-race.py ___ Python tracker <https://bugs.python.org/issue35185> ___
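The essence of the first fix - iterate over a snapshot rather than the live list - can be demonstrated in a reduced form with a toy list in place of real handlers (churn here simulates another thread calling add/removeHandler concurrently):

```python
import threading

handlers = list(range(100))  # toy stand-in for Logger.handlers
stop = False

def churn():
    # Simulates add/removeHandler constantly mutating the list in-place
    while not stop:
        handlers.append('extra')
        handlers.pop()

t = threading.Thread(target=churn)
t.start()
for _ in range(10000):
    # The 'for hdlr in list(c.handlers)' fix: take a snapshot, then iterate.
    # Each snapshot is a consistent value: either 100 or 101 entries,
    # never a torn intermediate state.
    snapshot = list(handlers)
    assert len(snapshot) in (100, 101)
stop = True
t.join()
```

Note this relies on CPython's GIL making list(handlers) atomic with respect to the mutations; the immutable-replacement fix is preferable precisely because it does not.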
[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML
Change by Ben Spiller : -- nosy: +Ben Spiller ___ Python tracker <https://bugs.python.org/issue5166> ___
[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML
Ben Spiller added the comment: To help anyone else struggling with this bug, based on https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/ the best workaround I've currently found is to define this:

def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'):
    return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', replaceWith, unicodeString)

and then copy+paste the following pattern into every bit of code that generates XML:

myfile.write(escape_xml_illegal_chars(document.toxml(encoding='utf-8').decode('utf-8')).encode('utf-8'))

It's obviously pretty grim (and unsafe) to expect every python developer to copy+paste this kind of thing into their own project to avoid buggy XML generation, so it would be better to have the escape_xml_illegal_chars function in the python standard library (maybe alongside xml.sax.saxutils.escape - which notably does _not_ escape the unicode characters that aren't valid in XML), plus built-in support for this as part of document.toxml. I guess we'd want it to be user-configurable, for any users who are prepared to tolerate the possibility of generating unparseable XML documents in return for improved performance in the common case where these characters are not present; not having the capability at all just means most python applications that generate XML without special-casing this have a bug. I suggest we definitely need some clear warnings about this in the doc.

-- ___ Python tracker <https://bugs.python.org/issue5166> ___
[issue5166] ElementTree and minidom don't prevent creation of not well-formed XML
Ben Spiller added the comment: Hi, it's been a few years now since this was reported and it's still a problem - any chance of a fix? The API gives the impression that if you pass python strings to the XML API then the library will generate valid XML. It takes care of the charset/encoding and entity-escaping aspects of XML generation, so it would be logical for it to in some way take care of control characters too - especially as silently generating unparseable XML is a somewhat dangerous failure mode. I think there's a strong case for some built-in functionality to replace/ignore the control characters (perhaps as a configurable option, in case of performance worries) rather than just throwing an exception, since it's very common to have an arbitrary string generated by some other program or user input that needs to be written into an XML file (and a lot less common to be 100% sure in all cases what characters your string might contain). For those common use cases, the current situation where every python developer needs to implement their own workaround to sanitize strings isn't ideal, especially as it's not trivial to get right, and likely a lot of the community who end up 'rolling their own' are getting it wrong in some way. [On the other hand, if you guys decide this really isn't going to be fixed, then at the very least I'd suggest that the API documentation should prominently state that it is up to the users of these libraries to implement their own sanitization of control characters, since I'm sure none of us want people using python to end up with buggy applications.] -- nosy: +benspiller versions: +Python 3.5, Python 3.6, Python 3.7 ___ Python tracker <https://bugs.python.org/issue5166> ___
[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii
Ben Spiller added the comment: Thanks for considering this, anyway. I'll admit I'm disappointed we couldn't fix this on the 2.7 train, as to me a method that takes an errors='ignore' argument and then throws an exception anyway seems a little more like a bug than a feature (and changing it would likely not affect behaviour in any existing non-broken programs), but if that's the decision then fine. Of course I'm aware (as I mentioned earlier in the thread) that the radically different unicode handling in python 3 solves this entirely, and I only wish it was practical to move our existing (enormous) codebase and customers over to it, but we're stuck with Python 2.7 - I believe lots of people are in the same situation unfortunately. As Josh suggested, perhaps we can at least add something to the doc for the str/unicode encode and decode methods so users are aware of the behaviour without trial and error. I'll update the component of this bug to reflect that it's now considered a doc issue. Based on the inputs from Terry, and what seems to be the key info that would have been helpful to me and to those hitting the same issues for the first time, I'd propose the following text (feel free to adjust as you see fit). For encode: "For most encodings, the return type is a byte str regardless of whether it is called on a str or unicode object. For example, call encode on a unicode object with "utf-8" to return a byte str object, or call encode on a str object with "base64" to return a base64-encoded str object. It is _not_ recommended to call this method on "str" objects when using codecs such as utf-8 that convert between str and unicode objects, as any characters not supported by python's default encoding (usually 7-bit ascii) will result in a UnicodeDecodeError exception, even if errors='ignore' was specified. For such conversions the str.decode and unicode.encode methods should be used.
If you need to produce an encoded version of a string that could be either a str or unicode object, only call the encode() method after checking it is a unicode object, not a str object, using isinstance(s, unicode)." And for decode: "The return type may be either str or unicode, depending on which encoding is used and whether the method is called on a str or unicode object. For example, call decode on a str object with "utf-8" to return a unicode object, or call decode on a unicode or str object with "base64" to return a base64-decoded str object. It is _not_ recommended to call this method on "unicode" objects when using codecs such as utf-8 that convert between str and unicode objects, as any characters not supported by python's default encoding (usually 7-bit ascii) will result in a UnicodeEncodeError exception, even if errors='ignore' was specified. For such conversions the str.decode and unicode.encode methods should be used. If you need to produce a decoded version of a string that could be either a str or unicode object, only call the decode() method after checking it is a str object, not a unicode object, using isinstance(s, str)." -- components: +Documentation -Interpreter Core ___ Python tracker <http://bugs.python.org/issue26369> ___
[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii
Ben Spiller added the comment: btw If anyone can find the place in the code (sorry, I tried and failed!) where str.encode('utf-8', errors=X) results in an implicit call to the equivalent of decode(defaultencoding, errors='strict') (as suggested by the exception message), I think it'll be easier to discuss the details of fixing it. Thanks for your reply - yes, I'm aware that theoretically you _could_ globally change python's default encoding from ascii, but the prevailing view I've heard from python developers is that changing it is not a good idea and may cause lots of library code to break. Also it's probably not a good idea for individual libraries or modules to be changing global state that affects the entire python invocation, and it would be nice to find a less fragile, more out-of-the-box solution to this. You may well be using different encodings (not just utf-8) in different parts of your program - so changing the globally-defined default encoding doesn't seem right, especially for a method like str.encode that already takes an 'encoding' argument (used currently only for the encoding step, not the implicit decoding step). I do think there's a strong case to be made for changing the str.encode (and also unicode.decode) behaviour so that str.encode('utf-8') behaves the same whether it's given ascii or non-ascii characters, and also similarly to unicode.encode('utf-8'). Let me try to persuade you... :) First, to address the point you made: > If str.encode() raises a decoding exception, this is a programming bug. It > would be bad to hide it. I totally agree with the general principle of not hiding programming bugs. However, if calling str.encode for codecs like utf-8 (let's ignore base64 for now, which is a very different beast) was *consistently* treated as a 'programming bug' by python and always resulted in an exception, that would be ok (suboptimal usability imho, but still ok), since programmers would quickly spot the problem and fix it.
But that's not what happens - it *silently works* (is a no-op) as long as you happen to be using ASCII characters, so this so-called 'programming bug' will go unnoticed by most programmers (and by authors of third-party library code you might be relying on!)... but the moment a non-ascii character gets introduced, suddenly you'll get an exception, maybe in some library code you rely on but can't fix. For this reason I don't think treating this as a programming bug is helping anyone write more robust python code - quite the reverse. Plus I think the no-op behaviour is almost always 'what you would have wanted it to do' anyway, whereas the behaviour of throwing an exception almost never is. I think we'd agree that changing str.encode(utf8) to throw an exception in *all* cases wouldn't be a realistic option, since it would certainly break backwards compatibility in painful ways for many existing apps and library code. So, if we want to make the behaviour of this important built-in type a bit more consistent and less error-prone/fragile for this case, then I think the only option is making str.encode a no-op for non-ascii characters (at least, non-ascii characters that are valid in the specified encoding), just as it is for ascii characters.
Here's why I think ditching the current behaviour would be a good idea:
- calling str.encode() and getting a DecodeError is confusing ("I asked you to encode this string, what are you decoding for?")
- calling str.encode('utf-8') and getting an exception about "ascii" is confusing, as the only encoding I mentioned in the method call was utf-8
- calling encode(..., errors=ignore) and getting an exception is confusing and feels like a bug; I've explicitly specified that I do NOT want exceptions from calling this method, yet (because neither the 'errors' nor the 'encoding' argument gets passed to the implicit - and undocumented - decode operation) I get unexpected behaviour that is far more likely to break my program than a no-op
- the somewhat surprising behaviour we're talking about is not explicitly documented anywhere
- having str.encode throw on non-ascii but not ascii makes it very likely that code will be written and shipped (including library code you may have no control over) that *appears* to work under normal testing but has *hidden* bugs that surface only once non-ascii characters are used
- in every situation I can think of, having str.encode(encoding, errors=ignore) honour the encoding and errors arguments even for the implicit decode operation is more useful than having it ignore those arguments and throw an exception
- a quick google shows lots of people in the Python community (from newbies to experts) are seeing
[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii
Ben Spiller added the comment: I'm proposing that str.encode() should _not_ throw a 'decode' exception for non-ascii characters and be effectively a no-op, to match what it already does for ascii characters - which therefore shouldn't break behavior anyone will be depending on. This could be achieved by passing the encoding parameter through to the implicit decode() call (which is where the exception is coming from it appears), rather than (arbitrarily and surprisingly) using "ascii" (which of course sometimes works and sometimes doesn't depending on the input string) Does that make sense? If someone can find the place in the code (sorry I tried and failed!) where str.encode('utf-8') is resulting in an implicit call to the equivalent of decode('ascii') (as suggested by the exception message) I think it'll be easier to discuss the details -- ___ Python tracker <http://bugs.python.org/issue26369> ___
[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii
Ben Spiller added the comment:

Yes, the situation is loads better in Python 3 - this issue is specific to 2.x - but like many people we're sadly not able to move to 3 for the time being. Since making this mistake is quite common, and there's some sensible behaviour that would make it disappear (resulting in ascii and non-ascii strings being treated the same way by these methods), I'd much prefer if we could actually fix it in Python 2.7

--

___ Python tracker <http://bugs.python.org/issue26369> ___
[issue26369] unicode.decode and str.encode are unnecessarily confusing for non-ascii
Ben Spiller added the comment:

Thanks, that's really helpful. Having thought about it some more, I think if possible it'd be much better to actually 'fix' the behaviour for the unicode<->str standard codecs (i.e. not base64), rather than just documenting around it. The current behaviour is not only confusing but leads to bugs that are very easy to miss, since the methods work correctly when given 7-bit ascii characters.

I had a poke around in the Python source but couldn't quite identify where it's happening - presumably there is somewhere in the str.encode('utf-8') implementation that first "decodes" the string, and does so using the ascii codec. If it could be made to use the same encoding that was passed in (e.g. utf-8) then this would end up being a no-op, and there would be no unpleasant bugs that only appear when the input includes non-ascii characters. It would also allow X.encode('utf-8') to be called successfully whether X is already a str or a unicode object, which would save callers having to explicitly check what kind of string they've been passed.

Is anyone able to look into the code to see where this would need to be fixed and how difficult it would be? I have a feeling that once the line is located it might be quite a straightforward fix. Many thanks

--
components: +Interpreter Core -Documentation
title: doc for unicode.decode and str.encode is unnecessarily confusing -> unicode.decode and str.encode are unnecessarily confusing for non-ascii

___ Python tracker <http://bugs.python.org/issue26369> ___
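The "works whether X is a str or a unicode object" convenience can also be had today with a small defensive helper - a sketch in Python 3 syntax (in Python 2 the isinstance check would be against unicode rather than str; the helper name is hypothetical):

```python
def to_bytes(value, encoding="utf-8"):
    # Encode only if we were handed a text string; byte strings pass
    # through untouched, so callers never trigger an implicit decode.
    return value.encode(encoding) if isinstance(value, str) else value
```

This is the same pattern suggested elsewhere in this thread, wrapped in a function so callers don't repeat the isinstance check.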
[issue26369] doc for unicode.decode and str.encode is unnecessarily confusing
New submission from Ben Spiller:

It's well known that lots of people struggle to write correct programs using non-ascii strings in Python 2.x, but I think one of the main reasons for this could be very easily addressed with a small addition to the documentation for str.encode and unicode.decode, which is currently quite vague.

The decode/encode methods really make most sense when called on a unicode string, i.e. unicode.encode() to produce a byte string, or on a byte string, i.e. str.decode() to produce a unicode object. However, the additional presence of the opposite methods str.encode() and unicode.decode() is quite confusing, and a frequent source of errors - e.g. calling str.encode('utf-8') first DECODES the str object (which might already be in utf-8) to a unicode string **using the default encoding of "ascii"** (!) before ENCODING to a utf-8 byte str as requested, which of course fails at the first stage with the classic error "UnicodeDecodeError: 'ascii' codec can't decode byte" if any non-ascii chars are present. It's unfortunate that this initial decode/encode stage ignores both the "encoding" argument (used only for the subsequent encode/decode) and the "errors" argument (commonly used when the programmer is happy with a best-effort conversion, e.g. for logging purposes).

Anyway, given this behaviour, a lot of time would be saved by a simple sentence in the doc for str.encode()/unicode.decode() essentially warning people that those methods aren't that useful and that they probably really intended str.decode()/unicode.encode() - the current doc gives absolutely no clue about this extra stage, which ignores the input arguments and uses 'ascii' and 'strict'. It might also be worth stating in the documentation that the pattern (u.encode(encoding) if isinstance(u, unicode) else u) can be helpful for cases where you unavoidably have to deal with both kinds of input, since calling str.encode is such a bad idea.
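To make the two-stage behaviour described above concrete, here is a sketch (in Python 3 syntax, with a hypothetical helper name) of what Python 2's str.encode(encoding) effectively does when called on a byte string - 'ascii' stands in for sys.getdefaultencoding():

```python
def py2_str_encode(byte_string, encoding, errors="strict"):
    # Stage 1, the implicit decode: always uses the default encoding
    # ('ascii') with 'strict' handling, ignoring the caller's encoding
    # and errors arguments - this is where the surprising
    # UnicodeDecodeError comes from whenever non-ascii bytes are present.
    text = byte_string.decode("ascii")
    # Stage 2: only now is the requested encode performed.
    return text.encode(encoding, errors)
```

Note that even errors="ignore" cannot suppress the failure, because the errors argument never reaches the implicit decode in stage 1.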
In an ideal world I'd love to see the implementation of str.encode/unicode.decode changed to be more useful (i.e. instead of using ascii, it would be more logical and useful to use the passed-in encoding to perform the initial decode/encode, along with the passed-in 'errors' value). I wasn't sure whether that change would be accepted, so for now I'm proposing better documentation of the existing behaviour as a second best.

--
assignee: docs@python
components: Documentation
messages: 260359
nosy: benspiller, docs@python
priority: normal
severity: normal
status: open
title: doc for unicode.decode and str.encode is unnecessarily confusing
type: behavior
versions: Python 2.7

___ Python tracker <http://bugs.python.org/issue26369> ___