[issue16455] sys.getfilesystemencoding() is not the locale encoding on FreeBSD and OpenSolaris when the locale is not set
Roundup Robot added the comment:

New changeset c25635b137cc by Victor Stinner in branch 'default':
Issue #16455: On FreeBSD and Solaris, if the locale is C, the
http://hg.python.org/cpython/rev/c25635b137cc

--
nosy: +python-dev

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue16455
___
___
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
Jesús Cea Avión added the comment:

Victor, any progress on this?
STINNER Victor added the comment:

> Victor, any progress on this?

We have two options, and I don't know which one is better (safer). Does the terminal handle non-ASCII characters with a C locale on FreeBSD or Solaris?
STINNER Victor added the comment:

Hijacking locale.getpreferredencoding() may be dangerous. I attached a new patch, force_ascii.patch, which uses a different approach: be stricter than mbstowcs() and force the ASCII encoding when:

 - the LC_CTYPE locale is C
 - nl_langinfo(CODESET) is ASCII or an alias of ASCII
 - mbstowcs() is able to decode non-ASCII characters

--
Added file: http://bugs.python.org/file27970/force_ascii.patch

diff -r 6a6ad09faad2 Python/fileutils.c
--- a/Python/fileutils.c Mon Nov 12 01:23:51 2012 +0100
+++ b/Python/fileutils.c Mon Nov 12 15:33:24 2012 +0100
@@ -4,6 +4,7 @@
 #endif
 #ifdef HAVE_LANGINFO_H
+#include <locale.h>
 #include <langinfo.h>
 #endif
@@ -39,6 +40,104 @@ PyObject *
 #ifdef HAVE_STAT
+/* Workaround FreeBSD and OpenIndiana locale encoding issue. On these
+   operating systems, nl_langinfo(CODESET) announces an alias of the ASCII
+   encoding, whereas mbstowcs() and wcstombs() functions use the ISO-8859-1
+   encoding. The problem is that os.fsencode() and os.fsdecode() use the
+   Python codec ASCII. For example, if command line arguments are decoded
+   by mbstowcs() and encoded by os.fsencode(), we get a UnicodeEncodeError
+   instead of retrieving the original byte string.
+
+   The workaround is enabled if setlocale(LC_CTYPE, NULL) returns "C" and
+   nl_langinfo(CODESET) returns "ascii". The workaround is not used if
+   setlocale(LC_CTYPE, NULL) failed, or if nl_langinfo() or CODESET is not
+   available.
+
+   Values of locale_is_ascii:
+
+   1: the workaround is used, the ASCII codec is used instead of mbstowcs()
+      and wcstombs() functions
+   0: the workaround is not used
+   -1: unknown, need to call check_locale_force_ascii() to known the value
+*/
+static int locale_force_ascii = -1;
+
+extern char* _Py_GetLocaleEncoding(void);
+
+static int
+check_locale_force_ascii(void)
+{
+#ifdef MS_WINDOWS
+    return 0;
+#else
+    char *encoding, *loc;
+    int i;
+    unsigned char ch;
+    wchar_t wch;
+    size_t res;
+
+    return 1;
+
+    loc = setlocale(LC_CTYPE, NULL);
+    if (loc == NULL || strcmp(loc, "C") != 0) {
+        /* Failed to get the LC_CTYPE locale or it is different than C:
+         * don't use the workaround. */
+        return 0;
+    }
+
+    encoding = _Py_GetLocaleEncoding();
+    if (encoding == NULL) {
+        /* unknown encoding: consider that the encoding is not ASCII */
+        PyErr_Clear();
+        return 0;
+    }
+
+    if (strcmp(encoding, "ascii") != 0) {
+        free(encoding);
+        return 0;
+    }
+    free(encoding);
+
+    /* the locale is not set and nl_langinfo(CODESET) returns ASCII
+       (or an alias of the ASCII encoding). Check if the locale encoding
+       is really ASCII. */
+    for (i=0x80; i<0xff; i++) {
+        ch = (unsigned char)i;
+        res = mbstowcs(&wch, (char*)&ch, 1);
+        if (res == (size_t)-1) {
+            /* decoding a non-ASCII character from the locale encoding failed:
+               the encoding is really ASCII */
+            return 0;
+        }
+    }
+    return 1;
+#endif
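
The three conditions listed in this message can be sketched in Python for experimentation. This is only an illustration under stated assumptions: should_force_ascii and its decode_byte parameter are hypothetical names, decode_byte stands in for mbstowcs(), and codecs.lookup() stands in for the "alias of ASCII" check. The sketch follows the loop as posted (any undecodable byte means the locale encoding really is ASCII) and ignores the unconditional "return 1;" near the top of the posted function.

```python
import codecs


def should_force_ascii(lc_ctype, codeset, decode_byte):
    """Return True when the ASCII codec should be forced (hypothetical helper).

    lc_ctype:    name of the LC_CTYPE locale, for example "C"
    codeset:     value announced by nl_langinfo(CODESET)
    decode_byte: callable that decodes a single byte with the locale
                 encoding and raises UnicodeDecodeError on failure
                 (a stand-in for mbstowcs())
    """
    if lc_ctype != "C":
        return False  # the LC_CTYPE locale is not C: no workaround
    try:
        if codecs.lookup(codeset).name != "ascii":
            return False  # the announced codeset is not ASCII or an alias
    except LookupError:
        return False  # unknown codeset: treat it as not ASCII
    for value in range(0x80, 0xFF):  # same range as the C loop
        try:
            decode_byte(bytes([value]))
        except UnicodeDecodeError:
            return False  # decoding failed: the encoding really is ASCII
    return True  # non-ASCII bytes decode: force the ASCII codec


# Simulated FreeBSD-like behaviour: CODESET says ASCII, decoding is Latin-1.
print(should_force_ascii("C", "US-ASCII", lambda b: b.decode("latin-1")))
```

Passing a strict ASCII decoder instead makes the function return False, which corresponds to a platform where the announced codeset is accurate.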
Changes by Jesús Cea Avión j...@jcea.es:

--
nosy: +jcea
New submission from STINNER Victor:

On FreeBSD and OpenIndiana, sys.getfilesystemencoding() is 'ascii' when the locale is not set, whereas the locale encoding is ISO-8859-1.

This inconsistency causes various issues. For example, os.fsencode(sys.argv[1]) fails if the argument is not ASCII, because sys.argv is decoded from the locale encoding (by _Py_char2wchar()).

sys.getfilesystemencoding() is 'ascii' because nl_langinfo(CODESET) is used to get the locale encoding, and nl_langinfo(CODESET) announces ASCII (or an alias of this encoding).

Python should detect this case and set sys.getfilesystemencoding() to 'iso8859-1' if the locale encoding is 'iso8859-1' whereas nl_langinfo(CODESET) announces ASCII. We can, for example, decode b'\xe9' with mbstowcs() and check whether it fails or the result is U+00E9.

--
components: Unicode
messages: 175401
nosy: ezio.melotti, haypo
priority: normal
severity: normal
status: open
title: sys.getfilesystemencoding() is not the locale encoding on FreeBSD and OpenSolaris when the locale is not set
versions: Python 3.4
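
The failure described in this report can be imitated on any platform by applying the two codecs by hand. This is a minimal sketch, not the interpreter's real code path: roundtrip is a hypothetical helper, 'latin-1' stands in for what mbstowcs() does on the affected systems, and 'ascii' stands in for sys.getfilesystemencoding().

```python
def roundtrip(raw, decode_encoding, fs_encoding):
    """Decode raw bytes with one codec, then re-encode with another.

    Hypothetical helper: decode_encoding plays the role of the locale
    encoding used to decode sys.argv, and fs_encoding plays the role of
    the codec used by os.fsencode().
    """
    text = raw.decode(decode_encoding, "surrogateescape")
    return text.encode(fs_encoding, "surrogateescape")


argument = b"caf\xe9"

# Consistent codecs: the original bytes come back unchanged.
assert roundtrip(argument, "ascii", "ascii") == argument

# Mismatched codecs: U+00E9 is a real character, not an escaped byte,
# so the ASCII codec cannot encode it.
try:
    roundtrip(argument, "latin-1", "ascii")
except UnicodeEncodeError as exc:
    mismatch_error = exc
else:
    mismatch_error = None

print(type(mismatch_error).__name__)
```

The surrogateescape error handler only round-trips bytes that were escaped during decoding, which is why the codec used for decoding and the codec used for encoding have to agree.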
STINNER Victor added the comment:

The attached patch works around the CODESET issue on OpenIndiana and FreeBSD. If the LC_CTYPE locale is C and nl_langinfo(CODESET) returns ASCII (or an alias of this encoding), b'\xE9' is decoded from the locale encoding: if the result is U+00E9, the patched Python uses ISO-8859-1. (If decoding fails, the locale encoding is really ASCII and the workaround is not used.)

If the result is different (b'\xe9' is not decoded from the locale encoding to U+00E9), a ValueError is raised. I wrote this test to detect bugs. I hope that our buildbots will validate the code. We may choose a different behaviour (e.g. keep ASCII).

Example on FreeBSD 8.2, original Python 3.4:

$ ./python
>>> import sys, locale
>>> sys.getfilesystemencoding()
'ascii'
>>> locale.getpreferredencoding()
'US-ASCII'

Example on FreeBSD 8.2, patched Python 3.4:

$ ./python
>>> import sys, locale
>>> sys.getfilesystemencoding()
'iso8859-1'
>>> locale.getpreferredencoding()
'iso8859-1'

--
keywords: +patch
Added file: http://bugs.python.org/file27965/workaround_codeset.patch
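
The decision rules described in this message can be sketched in Python. This is an illustration under stated assumptions, not the patch itself: choose_filesystem_encoding and its decode_e9 parameter are hypothetical names, decode_e9 stands in for mbstowcs(), and codecs.lookup() stands in for the "ASCII or an alias" check.

```python
import codecs


def choose_filesystem_encoding(lc_ctype, codeset, decode_e9):
    """Return the encoding name to use (hypothetical helper).

    lc_ctype:  name of the LC_CTYPE locale, for example "C"
    codeset:   value announced by nl_langinfo(CODESET)
    decode_e9: callable that decodes the single byte 0xE9 with the locale
               encoding and raises UnicodeDecodeError on failure
               (a stand-in for mbstowcs())
    """
    try:
        announced = codecs.lookup(codeset).name
    except LookupError:
        return codeset  # unknown codeset: leave it untouched
    if lc_ctype != "C" or announced != "ascii":
        return codeset  # the workaround does not apply
    try:
        decoded = decode_e9(b"\xe9")
    except UnicodeDecodeError:
        return "ascii"  # the locale encoding really is ASCII
    if decoded == "\xe9":
        return "iso8859-1"  # CODESET says ASCII, but decoding is Latin-1
    # Any other result is unexpected and reported loudly.
    raise ValueError("unexpected result for byte 0xE9: %a" % decoded)


# Simulated FreeBSD-like behaviour: CODESET says ASCII, decoding is Latin-1.
print(choose_filesystem_encoding("C", "US-ASCII", lambda b: b.decode("latin-1")))
```

Injecting the decoder as a parameter keeps the sketch runnable on platforms whose C library does not show the inconsistency.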
STINNER Victor added the comment:

Some tests are failing with the patch:

======================================================================
FAIL: test_undecodable_env (test.test_subprocess.POSIXProcessTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_subprocess.py", line 1606, in test_undecodable_env
    self.assertEqual(stdout.decode('ascii'), ascii(value))
AssertionError: 'abc\\xff' != 'abc\\udcff'
- 'abc\xff'
?      ^
+ 'abc\udcff'
?      ^^^

======================================================================
FAIL: test_strcoll_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_locale.py", line 364, in test_strcoll_with_diacritic
    self.assertLess(locale.strcoll('\xe0', 'b'), 0)
AssertionError: 126 not less than 0

======================================================================
FAIL: test_strxfrm_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/home/haypo/prog/python/default/Lib/test/test_locale.py", line 367, in test_strxfrm_with_diacritic
    self.assertLess(locale.strxfrm('\xe0'), locale.strxfrm('b'))
AssertionError: '\xe0' not less than 'b'
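
One plausible reading of the first failure, offered as an assumption rather than something the message states: the same byte string yields different text depending on whether it is decoded as ISO-8859-1 or as ASCII with the surrogateescape error handler. The short sketch below reproduces the two strings that appear in the assertion message.

```python
raw = b"abc\xff"

# Decoded as ISO-8859-1, byte 0xFF becomes the real character U+00FF.
as_latin1 = raw.decode("iso8859-1")

# Decoded as ASCII with surrogateescape, byte 0xFF becomes the lone
# surrogate U+DCFF, which can later be turned back into the same byte.
as_ascii = raw.decode("ascii", "surrogateescape")

print(ascii(as_latin1))  # 'abc\xff'
print(ascii(as_ascii))   # 'abc\udcff'
```

Under this reading, the test builds its expected value with the ASCII codec, while the patched interpreter decodes the environment with ISO-8859-1.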