[issue35883] Change invalid unicode characters to replacement characters in argv

2019-02-01 Thread Neui


New submission from Neui:

When an invalid Unicode character is passed in argv (command-line arguments), 
Python abort()s with a fatal error about a character being out of range 
(ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]).

I am wondering whether this behaviour should change so that such characters 
are replaced with U+FFFD REPLACEMENT CHARACTER (like .decode(..., 'replace')), 
or handled by a similar or better error handler (see 
https://docs.python.org/3/library/codecs.html#error-handlers )
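
To illustrate what the error handlers mentioned above do, here is a minimal 
sketch. It decodes a byte string that is not valid UTF-8; the sample bytes are 
made up for illustration and are not the exact input from this report.

```python
# 0xff can never appear in well-formed UTF-8, so strict decoding would fail.
raw = b"ls\xff-la"

# 'replace' substitutes U+FFFD REPLACEMENT CHARACTER for each bad byte.
replaced = raw.decode("utf-8", errors="replace")

# 'backslashreplace' keeps the bad byte visible as an escape sequence.
escaped = raw.decode("utf-8", errors="backslashreplace")

# 'surrogateescape' maps the bad byte to a lone surrogate (U+DCFF here),
# which can be turned back into the original byte later.
smuggled = raw.decode("utf-8", errors="surrogateescape")

print(repr(replaced))
print(repr(escaped))
print(repr(smuggled))

# Only surrogateescape is lossless: encoding restores the original bytes.
restored = smuggled.encode("utf-8", errors="surrogateescape")
print(restored == raw)
```

'replace' is lossy but always yields a printable string, while 
'surrogateescape' keeps the original data recoverable.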

The reason is that other applications can handle the invalid character, since 
it is just data (GDB, for example, passes it on as an argument to the program 
being debugged), whereas in Python it becomes a limitation, because the script 
(if specified) never runs.

My main motivation is the command-not-found Debian package, which receives the 
mistyped command as a command-line argument. If that argument contains an 
invalid Unicode character, the tool simply fails rather than saying it 
couldn't find the command or a similar one. If this doesn't get changed, the 
package either has to accept this as a limitation, pass the command some other 
way, or be rewritten in a language other than Python.
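
As a rough sketch of what "some other way" could look like, the caller could 
put the mistyped command into an environment variable and the script could 
read it as raw bytes. The variable name CNF_COMMAND and the helper 
read_command are hypothetical and are not taken from the actual 
command-not-found package; os.environb is only available on POSIX systems.

```python
import os

# Hypothetical variable name, chosen only for this illustration.
ENV_NAME = b"CNF_COMMAND"


def read_command(environ=None) -> str:
    """Return the mistyped command as printable text.

    Reads raw bytes from a bytes-keyed mapping (os.environb by default)
    and replaces undecodable bytes with U+FFFD instead of failing.
    """
    if environ is None:
        environ = os.environb  # bytes view of the environment (POSIX only)
    raw = environ.get(ENV_NAME, b"")
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(repr(read_command()))
```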

# Requires bash 4.2+
# Specifying a script omits the first two lines
$ python3.6 $'\U7fffbeba'
Failed checking if argv[0] is an import path entry
ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]
Fatal Python error: no mem for sys.argv
ValueError: character U+7fffbeba is not in range [U+0000; U+10ffff]

Current thread 0x7fd212eaf740 (most recent call first):
Aborted (core dumped)

$ python3.6 --version
Python 3.6.7

$ uname -a
Linux nopea 4.15.0-39-generic #42-Ubuntu SMP Tue Oct 23 15:48:01 UTC 2018 
x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:   bionic

GDB backtrace just before the error is raised (note that argc=2 because the 
first argument is a script):
#0  find_maxchar_surrogates (begin=begin@entry=0xa847a0 L'\x7fffbeba' , end=end@entry=0xa847d0 L"", maxchar=maxchar@entry=0x7fffde94, num_surrogates=num_surrogates@entry=0x7fffde98) at ../Objects/unicodeobject.c:1626
#1  0x004cee4b in PyUnicode_FromUnicode (u=u@entry=0xa847a0 L'\x7fffbeba' , size=12) at ../Objects/unicodeobject.c:2017
#2  0x004db856 in PyUnicode_FromWideChar (w=0xa847a0 L'\x7fffbeba' , size=, size@entry=-1) at ../Objects/unicodeobject.c:2502
#3  0x0043253d in makeargvobject (argc=argc@entry=2, argv=argv@entry=0xa82268) at ../Python/sysmodule.c:2145
#4  0x00433228 in PySys_SetArgvEx (argc=2, argv=0xa82268, updatepath=1) at ../Python/sysmodule.c:2264
#5  0x004332c1 in PySys_SetArgv (argc=, argv=) at ../Python/sysmodule.c:2277
#6  0x0043a5bd in Py_Main (argc=argc@entry=3, argv=argv@entry=0xa82260) at ../Modules/main.c:733
#7  0x00421149 in main (argc=3, argv=0x7fffe178) at ../Programs/python.c:69

Similar issues:
https://bugs.python.org/issue25631 "Segmentation fault with invalid Unicode 
command-line arguments in embedded Python" (actually 'fixed' since it now 
abort()s)
https://bugs.python.org/issue2128 "sys.argv is wrong for unicode strings"

--
components: Interpreter Core
messages: 334703
nosy: Neui
priority: normal
severity: normal
status: open
title: Change invalid unicode characters to replacement characters in argv
type: behavior
versions: Python 3.6

___
Python tracker 
<https://bugs.python.org/issue35883>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35883] Change invalid unicode characters to replacement characters in argv

2019-02-01 Thread Neui


Neui added the comment:

I'd say the terminal is not really relevant here; what matters is the locale 
settings, because Python uses wide-string functions. On my Ubuntu machine, 
prefixing the command with LC_ALL=C produces the same output as you had. I 
also get that output when running it in Cygwin (and MSYS2), although setting 
LC_ALL seems to have no effect there.
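
For context on why the locale matters: on Unix, Python normally decodes 
command-line arguments using the filesystem encoding (derived from the locale) 
with the 'surrogateescape' error handler. The sketch below shows how a script 
can inspect that encoding and recover the original bytes of an argument. The 
helper names raw_argument and display_argument are made up for this example, 
and the sketch does not work around the abort reported here, which happens 
before sys.argv exists.

```python
import os
import sys


def raw_argument(arg: str) -> bytes:
    """Recover the bytes the OS passed for an argument.

    os.fsencode reverses the surrogateescape decoding applied to sys.argv.
    """
    return os.fsencode(arg)


def display_argument(arg: str) -> str:
    """Return a printable form: undecodable bytes become U+FFFD."""
    encoding = sys.getfilesystemencoding()
    return raw_argument(arg).decode(encoding, errors="replace")


if __name__ == "__main__":
    print("filesystem encoding:", sys.getfilesystemencoding())
    for arg in sys.argv[1:]:
        print(raw_argument(arg), repr(display_argument(arg)))
```

Running the script once normally and once with LC_ALL=C prefixed shows whether 
the reported filesystem encoding changes on a given system.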
