[issue35883] Change invalid unicode characters to replacement characters in argv

2021-03-13 Thread STINNER Victor
STINNER Victor added the comment:
> https://bugs.python.org/issue25631 "Segmentation fault with invalid Unicode command-line arguments in embedded Python" (actually 'fixed' since it now abort()s)
This issue is different: it is about the Py_Main() function called explicitly when Python

2021-03-13 Thread STINNER Victor
STINNER Victor added the comment: I wrote PR 24843 to fix this issue. With this fix, os.fsencode(sys.argv[1]) returns the original byte sequence as expected. -- I dislike the replace error handler since it loses information. The PEP 383 surrogateescape error handler exists to prevent
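To illustrate the point about information loss, here is a minimal sketch comparing the two error handlers on an invalid byte string. It uses the plain UTF-8 codec rather than the interpreter's own argv decoding path, and the helper name roundtrip is made up for this example.

```python
# Bytes that are not valid UTF-8 (the same argument used in the test case in this thread).
RAW = b"\xfa\xbd\x83\x96\x80"


def roundtrip(data: bytes, errors: str) -> bytes:
    """Decode and re-encode with the same error handler (hypothetical helper)."""
    return data.decode("utf-8", errors).encode("utf-8", errors)


# surrogateescape maps each undecodable byte to a lone surrogate (U+DC80..U+DCFF),
# so encoding with the same handler restores the original bytes.
escaped = roundtrip(RAW, "surrogateescape")

# replace substitutes U+FFFD while decoding, so the original bytes cannot be recovered.
replaced = roundtrip(RAW, "replace")

print(escaped == RAW)
print(replaced == RAW)
```

The first comparison should hold and the second should not, which is the behaviour difference the comment describes.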

2021-03-13 Thread STINNER Victor
Change by STINNER Victor:
keywords: +patch
pull_requests: +23606
stage: -> patch review
pull_request: https://github.com/python/cpython/pull/24843

2021-03-12 Thread Eryk Sun
Change by Eryk Sun:
components: +Unicode
nosy: +ezio.melotti, vstinner
versions: -Python 3.5, Python 3.6, Python 3.7

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: I've also filed https://sourceware.org/bugzilla/show_bug.cgi?id=26034 for glibc, because that's really where the issue seems to be. But perhaps Python should be forgiving of glibc errors here.

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: Like I said above, it could be argued that the bug is in glibc, and then https://p.sipsolutions.net/6a4e9fce82dbbfa0.txt could be used as a simple LD_PRELOAD wrapper to work around this, just to illustrate the problem from that side. Arguably, that makes

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: And wrt. _Py_DecodeUTF8Ex() - it doesn't seem to help. But that's probably because I'm not __ANDROID__, nor __APPLE__, and then regardless of current_locale being non-zero or not, we end up in decode_current_locale() where the impedance mismatch happens.

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: In fact that python one-liner works with just about everything else that you can throw at it, just not something that "looks like utf-8 but isn't". And of course adding LC_CTYPE=ascii or something like that fixes it, as you'd expect. Then the

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: A simple test case is something like ./python -c 'import sys; print(sys.argv[1].encode(sys.getfilesystemencoding(), "surrogateescape"))' "$(echo -ne '\xfa\xbd\x83\x96\x80')" Which you'd probably expect to print b'\xfa\xbd\x83\x96\x80' i.e. the same

2020-05-24 Thread Johannes Berg
Johannes Berg added the comment: Pretty sure this is an issue still, I see it on current git master. This seems to work around it? https://p.sipsolutions.net/603927f1537226b3.txt Basically, it seems that mbstowcs() and mbrtowc() on glibc with utf-8 just blindly decode even invalid UTF-8 to
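As a rough illustration of what "blindly decoding" could mean here, the sketch below decodes one sequence using the older UTF-8 layout that allowed up to six bytes, with no range check. That glibc behaves this way is an assumption drawn from this thread, not something verified here, and decode_legacy_utf8 is a hypothetical helper, not a glibc or CPython function.

```python
def decode_legacy_utf8(seq: bytes) -> int:
    """Decode a single sequence in the old up-to-6-byte UTF-8 layout, without range checks."""
    lead = seq[0]
    if lead < 0x80:
        return lead

    # Count the leading 1 bits of the lead byte to get the sequence length.
    length = 0
    mask = 0x80
    while lead & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 6 or len(seq) != length:
        raise ValueError("malformed lead byte or wrong sequence length")

    # The remaining low bits of the lead byte start the payload.
    value = lead & (mask - 1)
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


# The five-byte argument from the test case in this thread.
ARG = b"\xfa\xbd\x83\x96\x80"
value = decode_legacy_utf8(ARG)
print(hex(value), value > 0x10FFFF)

# Python's own strict UTF-8 decoder rejects the same bytes.
try:
    ARG.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)
```

Under that assumption the bytes are structurally well formed (one lead byte plus four continuation bytes) yet yield a value above U+10FFFF, which is the kind of input described as looking like UTF-8 without being valid UTF-8.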

2019-02-01 Thread Eryk Sun
Eryk Sun added the comment: In Unix, Python 3.6 decodes the char * command line arguments via mbstowcs. In Linux, I see the following misbehavior of mbstowcs when decoding an overlong UTF-8 sequence: >>> mbstowcs = ctypes.CDLL(None, use_errno=True).mbstowcs >>> arg = bytes(x + 128

2019-02-01 Thread SilentGhost
Change by SilentGhost:
nosy: +ncoghlan
versions: +Python 3.7

2019-02-01 Thread Neui
Neui added the comment: I'd say that the terminal is not really relevant here, but rather the locale settings because it uses wide string functions. Prefixing it with LC_ALL=C produces the same output as you had on my Ubuntu machine. I also get that output when running it in Cygwin (and

2019-02-01 Thread SilentGhost
SilentGhost added the comment: Hm, this seems to be due to how the terminal emulator handles those special characters, actually. I can reproduce in another terminal.

2019-02-01 Thread SilentGhost
SilentGhost added the comment: I'm on 4.15.0-44-generic and I cannot reproduce the crash. I get "python3: can't open file '��': [Errno 2] No such file or directory". Could you try this on a different machine / installation?
nosy: +SilentGhost
type: behavior -> crash

2019-02-01 Thread Neui
New submission from Neui: When an invalid Unicode character is given in argv (CLI arguments), Python abort()s with a fatal error about a character not being in range (ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]). I am wondering if this behaviour should change to
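For context on the range named in that error message, the short sketch below checks the upper bound from within Python. It only uses sys.maxunicode and chr() and does not reproduce the startup abort itself.

```python
import sys

# Highest code point a Python str can hold.
print(hex(sys.maxunicode))

# The value from the error message lies far above that bound, so chr() refuses it.
try:
    chr(0x7FFFBEBA)
    rejected = False
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
```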