Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API
> I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks.

Can you please elaborate what APIs you are talking about exactly?

If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on this proposal. People that explicitly use bytes for file names deserve to get whatever exact platform semantics the platform has to offer. This is true on Unix, and it is also true on Windows.

Regards,
Martin

___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API
On Tue, 25 Oct 2011 00:57:42 +0200 Victor Stinner victor.stin...@haypocalc.com wrote:
> Hi,
>
> I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks. Because this change is incompatible with Python 3.2, even if such filenames are unusable and I consider the problem a (Python?) bug, I would like your opinion on such a change before working on a patch.

+1 from me.

Regards

Antoine.
Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API
On Tuesday 25 October 2011 13:20:12 you wrote:
> Victor Stinner writes:
>> I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks.
>
> By bogus you mean sometimes (?) invalid and the OS will refuse to use them, causing a later hard-to-diagnose exception, rather than not what the user thinks he wants, right?

If the (Unicode) filename cannot be encoded to the ANSI code page, which is usually a small charset (e.g. cp1252 contains 256 code points), Windows replaces unencodable characters with question marks. Imagine that the code page is ASCII: the (Unicode) filename "hého.txt" will be encoded to b"h?ho.txt". You can display this string in a dialog, but you cannot open the file to read its content... If you pass the filename to os.listdir(), it is even worse because ? is interpreted (? means any character, it's a pattern to match a filename).

I would like to raise an error in such a situation, because currently the user cannot be notified otherwise. The user may search for ? in the filename, but Windows also replaces unencodable characters with a *similar glyph* (e.g. é replaced by e).

> In the hard errors case, a hearty +1 (I'm dealing with this in an experimental version of XEmacs and it's a right PITA if the codec doesn't complain).

If you use MultiByteToWideChar and WideCharToMultiByte, you can be notified of errors using some flags, but the functions of the ANSI API don't give access to these flags...

> Backward compatibility is important, but here the costs of fixing such bugs outweigh the value of bug-compatibility.

I only want to change how unencodable filenames are handled; the bytes API will still be available. If your filesystem has the 8dot3name feature enabled, it may work even for unencodable filenames (Windows generates names like HEHO~1.TXT).

Victor
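The difference between the lossy and the strict behaviour described above can be sketched in Python. This is only an approximation: Python's own ascii codec with the "replace" and "strict" error handlers stands in for the ANSI code page, whereas the real Windows ANSI API performs the substitution internally. The helper name encode_filename is hypothetical.

```python
def encode_filename(name, encoding="ascii", strict=True):
    """Encode a text filename, either strictly or with '?' substitution.

    Hypothetical helper: 'ascii' stands in for the ANSI code page.
    """
    errors = "strict" if strict else "replace"
    return name.encode(encoding, errors)


# Lossy behaviour: the unencodable character silently becomes '?'.
lossy = encode_filename("h\u00e9ho.txt", strict=False)
print(lossy)

# Strict behaviour: the problem is reported immediately.
try:
    encode_filename("h\u00e9ho.txt", strict=True)
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```

With the strict variant the caller learns about the problem at the point of conversion instead of later, when the mangled name fails to open.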
Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API
On Tuesday 25 October 2011 09:09:56 you wrote:
>> I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks.
>
> Can you please elaborate what APIs you are talking about exactly?

Basically, all functions processing filenames, so most functions of posixmodule.c. Some examples:

- os.listdir(): FindFirstFileA, FindNextFileA, FindClose
- os.lstat(): CreateFileA
- os.getcwdb(): getcwd()
- os.mkdir(): CreateDirectoryA
- os.chmod(): SetFileAttributesA
- ...

> If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on this proposal. People that explicitly use bytes for file names deserve to get whatever exact platform semantics the platform has to offer. This is true on Unix, and it is also true on Windows.

My proposal is a fix for a bug reported by a user: http://bugs.python.org/issue13247

I want to keep the bytes API for backward compatibility, and it will still work for non-ASCII characters, but only for non-ASCII characters encodable to the ANSI code page.

In practice, characters not encodable to the ANSI code page are very rare. For example, it is difficult to type such characters directly with the keyboard. I bet that very few people will notice the change.

Victor
Re: [Python-Dev] memcmp performance
Richard Saunders, 25.10.2011 01:17:
>> -On [20111024 09:22], Stefan Behnel wrote:
>> I agree. Given that the analysis shows that the libc memcmp() is particularly fast on many Linux systems, it should be up to the Python package maintainers for these systems to set that option externally through the optimisation CFLAGS.
>
> Indeed, this is how I constructed my Python 3.3 and Python 2.7: setenv CFLAGS '-fno-builtin-memcmp' just before I configured.
>
> I would like to revisit changing unicode_compare: adding a special arm for using memcmp when the unicode kinds are the same will only work in two specific instances:
>
> (1) the strings are the same kind, the char size is 1
>     * We could add THIS to unicode_compare, but it seems extremely specialized by itself

But also extremely likely to happen. This means that the strings are pure ASCII, which is highly likely and one of the main reasons why the unicode string layout was rewritten for CPython 3.3. It allows CPython to save a lot of memory (thus clearly proving how likely this case is!), and it would also allow it to do faster comparisons for these strings.

Stefan
Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API
On Tuesday 25 October 2011 09:09:56 you wrote:
> If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on this proposal. People that explicitly use bytes for file names deserve to get whatever exact platform semantics the platform has to offer. This is true on Unix, and it is also true on Windows.

For your information, it took me something like 3 months (when I was working on issue #12281) to understand exactly how Windows handles undecodable bytes and unencodable characters. I did a lot of tests on different Windows versions (XP, Vista and Seven; the behaviour changed in Windows Vista). I had to take notes because it is really complex. Well, I wanted to understand exactly *all* code pages, including CP_UTF7 and CP_UTF8, not only the most common ones like cp1252 or cp932.

See the dedicated section in my book to learn more about these functions: http://www.haypocalc.com/tmp/unicode-2011-07-20/html/operating_systems.html#encode-and-decode-functions

Some information is available in the MultiByteToWideChar and WideCharToMultiByte documentation, but it is not well explained :-p

Victor
Re: [Python-Dev] memcmp performance
On Tuesday 25 October 2011 10:44:16 Stefan Behnel wrote:
> Richard Saunders, 25.10.2011 01:17:
>>> -On [20111024 09:22], Stefan Behnel wrote:
>>> I agree. Given that the analysis shows that the libc memcmp() is particularly fast on many Linux systems, it should be up to the Python package maintainers for these systems to set that option externally through the optimisation CFLAGS.
>>
>> Indeed, this is how I constructed my Python 3.3 and Python 2.7: setenv CFLAGS '-fno-builtin-memcmp' just before I configured.
>>
>> I would like to revisit changing unicode_compare: adding a special arm for using memcmp when the unicode kinds are the same will only work in two specific instances:
>>
>> (1) the strings are the same kind, the char size is 1
>>     * We could add THIS to unicode_compare, but it seems extremely specialized by itself
>
> But also extremely likely to happen. This means that the strings are pure ASCII, which is highly likely and one of the main reasons why the unicode string layout was rewritten for CPython 3.3. It allows CPython to save a lot of memory (thus clearly proving how likely this case is!), and it would also allow it to do faster comparisons for these strings.

Python 3.3 already has some optimizations for latin1: the CPU and the C language process char* strings more efficiently than Py_UCS2 and Py_UCS4 strings. For example, we are using memchr() to search for a single character in a latin1 string.

Victor
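Why a byte-wise memcmp() is a valid comparison for the 1-byte kind can be illustrated in Python. When every code point is below 256, encoding to latin-1 yields exactly one byte per character, so comparing the byte strings gives the same ordering as comparing the text. This is only an illustration of the idea and not CPython's unicode_compare; the function name compare_like_memcmp is hypothetical.

```python
def compare_like_memcmp(a, b):
    """Three-way compare two strings via their latin-1 bytes.

    Illustration only: valid when both strings contain only code
    points below 256 (the 1-byte "kind" discussed in the thread).
    """
    raw_a = a.encode("latin-1")
    raw_b = b.encode("latin-1")
    # bytes objects compare lexicographically, byte by byte, and the
    # shorter operand sorts first when one is a prefix of the other.
    return (raw_a > raw_b) - (raw_a < raw_b)


samples = ["", "abc", "abd", "ab", "caf\u00e9", "cafe", "\u00ff"]
for x in samples:
    for y in samples:
        expected = (x > y) - (x < y)
        assert compare_like_memcmp(x, y) == expected
```

The same shortcut does not carry over directly to the 2-byte and 4-byte kinds, where the byte order within each character affects a raw memory comparison.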
Re: [Python-Dev] [Python-checkins] cpython: #13251: update string description in datamodel.rst.
Hi,

ezio.melotti wrote:
> http://hg.python.org/cpython/rev/11d18ebb2dd1
> changeset: 73116:11d18ebb2dd1
> user: Ezio Melotti ezio.melo...@gmail.com
> date: Tue Oct 25 09:23:42 2011 +0300
> summary:
>   #13251: update string description in datamodel.rst.
>
> files:
>   Doc/reference/datamodel.rst | 20 ++--
>   1 files changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/Doc/reference/datamodel.rst b/Doc/reference/datamodel.rst
> --- a/Doc/reference/datamodel.rst
> +++ b/Doc/reference/datamodel.rst
> @@ -276,16 +276,16 @@
>  single: integer
>  single: Unicode
>
> - The items of a string object are Unicode code units. A Unicode code
> - unit is represented by a string object of one item and can hold either
> - a 16-bit or 32-bit value representing a Unicode ordinal (the maximum
> - value for the ordinal is given in ``sys.maxunicode``, and depends on
> - how Python is configured at compile time). Surrogate pairs may be
> - present in the Unicode object, and will be reported as two separate
> - items. The built-in functions :func:`chr` and :func:`ord` convert
> - between code units and nonnegative integers representing the Unicode
> - ordinals as defined in the Unicode Standard 3.0. Conversion from and to
> - other encodings are possible through the string method :meth:`encode`.
> + A string is a sequence of values that represent Unicode codepoints.
> + All the codepoints in range ``U+0000 - U+10FFFF`` can be represented
> + in a string. Python doesn't have a :c:type:`chr` type, and
> + every characters in the string is represented as a string object

typo ^

Should be "character", right?

> + with length ``1``. The built-in function :func:`chr` converts a
> + character to its codepoint (as an integer); :func:`ord` converts
> + an integer in range ``0 - 10FFFF`` to the corresponding character.

Actually chr() converts an integer to a string and ord() converts a string to an integer. chr and ord are swapped in your text.

> + :meth:`str.encode` can be used to convert a :class:`str` to
> + :class:`bytes` using the given encoding, and :meth:`bytes.decode` can
> + be used to achieve the opposite.

Petri
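The direction of each conversion mentioned in the review can be checked with a few lines of Python. This is a small sketch using only built-ins; the variable names are illustrative.

```python
# chr() maps an integer code point to a length-1 string.
e_acute = chr(0xE9)

# ord() maps a length-1 string back to its integer code point.
code_point = ord(e_acute)

# The upper end of the range is also a single item in Python 3.3+.
highest = chr(0x10FFFF)

# str.encode() goes from str to bytes; bytes.decode() goes back.
encoded = e_acute.encode("utf-8")
decoded = encoded.decode("utf-8")

print(repr(e_acute), code_point, encoded)
```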
Re: [Python-Dev] [Python-checkins] cpython: Issue #13226: Add RTLD_xxx constants to the os module. These constants can by
Hi,

victor.stinner wrote:
> http://hg.python.org/cpython/rev/c75427c0da06
> changeset: 73127:c75427c0da06
> user: Victor Stinner victor.stin...@haypocalc.com
> date: Tue Oct 25 13:34:04 2011 +0200
> summary:
>   Issue #13226: Add RTLD_xxx constants to the os module. These constants can by
>   used with sys.setdlopenflags().
>
> files:
>   Doc/library/os.rst | 13 +
>   Doc/library/sys.rst | 10 +-
>   Lib/test/test_posix.py | 7 +++
>   Misc/NEWS | 3 +++
>   Modules/posixmodule.c | 26 ++
>   5 files changed, 54 insertions(+), 5 deletions(-)

[snip]

> diff --git a/Misc/NEWS b/Misc/NEWS
> --- a/Misc/NEWS
> +++ b/Misc/NEWS
> @@ -341,6 +341,9 @@
>  Library
>  ---
>
> +- Issue #13226: Add RTLD_xxx constants to the os module. These constants can by

Typo: s/by/be/

> + used with sys.setdlopenflags().
> +
>  - Issue #10278: Add clock_getres(), clock_gettime() and CLOCK_xxx constants to
>    the time module. time.clock_gettime(time.CLOCK_MONOTONIC) provides a
>    monotonic clock

Petri
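How the new constants combine with sys.setdlopenflags() might look as follows. This is a hedged sketch: the helper name with_global_symbols is hypothetical, and the code guards against platforms (such as Windows) where sys.setdlopenflags() or the os.RTLD_* constants are not available.

```python
import os
import sys


def with_global_symbols():
    """Temporarily request RTLD_NOW | RTLD_GLOBAL, then restore the old flags.

    Returns the flags that were active inside the block, or None on
    platforms without sys.setdlopenflags() or the os.RTLD_* constants.
    """
    if not (hasattr(sys, "setdlopenflags")
            and hasattr(os, "RTLD_NOW")
            and hasattr(os, "RTLD_GLOBAL")):
        return None
    old_flags = sys.getdlopenflags()
    try:
        sys.setdlopenflags(os.RTLD_NOW | os.RTLD_GLOBAL)
        # An extension module imported here would be loaded with these flags.
        return sys.getdlopenflags()
    finally:
        # Always put the previous flags back.
        sys.setdlopenflags(old_flags)


active = with_global_symbols()
```

Restoring the previous flags in a finally clause keeps the change from leaking into unrelated imports.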
Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API
> My proposal is a fix for a bug reported by a user: http://bugs.python.org/issue13247

So your proposal is that abspath(b".") shall raise a UnicodeError in this case? Are you serious???

> In practice, characters not encodable to the ANSI code page are very rare. For example, it is difficult to type such characters directly with the keyboard. I bet that very few people will notice the change.

Except people running into the very issues you are trying to resolve. I'm not sure these people are really helped by having their applications crash all of a sudden.

Regards,
Martin
Re: [Python-Dev] Modules of plat-* directories
On 24.10.2011 14:06, Victor Stinner wrote:
> There are open issues related to plat-XXX.
>
> On Monday 24 October 2011 00:03:42 Martin v. Löwis wrote:
>> no, we make no changes to them unless a user actually requests a change
>
> Matthias Klose asked for socket SIO* constants in September 2006 (5 years ago). http://bugs.python.org/issue1565071
>
> I would prefer to see such constants in the socket module.

These are not mutually exclusive. You can regenerate IN.py and still add the constants to the socket module.

> Thiemo Seufer noticed that the linux2 platform definition is incorrect for several architectures, namely Alpha, PA-RISC (hppa), MIPS and SPARC, in September 2008 (3 years ago). He proposed to add a sublevel: Lib/plat-linux2/CDROM.py would become:
>
> - Lib/plat-linux2-alpha/CDROM.py
> - Lib/plat-linux2-hppa/CDROM.py
> - Lib/plat-linux2-mips/CDROM.py
> - Lib/plat-linux2-sparc/CDROM.py
> - (and a default for other platforms like Intel x86?)
>
> => http://bugs.python.org/issue3990
>
> I really don't like this idea (of adding the architecture in the directory name) :-p

Neither do I. In the specific case, I'd generate four versions of CDROM.py (with differing names), and provide a CDROM.py that imports the right one.

> IMO plat-XXX is wrong by design.

I disagree. It's limited, not wrong.

> It would be better if at least these files were regenerated at build, but Martin doesn't want to regenerate them. And there is still the problem of Mac OS X, which embeds 3 binaries for 3 architectures in the same FAT file.

These are problems, but not necessarily issues. Even if some of the values are incorrect, the values that are correct may still be useful.

Regards,
Martin
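The "CDROM.py that imports the right one" idea could be sketched roughly as below. Everything specific here is an assumption made for illustration: the variant module names (CDROM_alpha and so on), the CDROM_default fallback, and the prefixes expected from platform.machine() are hypothetical and not taken from the thread or from CPython.

```python
import platform

# Hypothetical mapping from machine-name prefix to variant module name.
# The thread only says the four versions would have "differing names".
_CDROM_VARIANTS = {
    "alpha": "CDROM_alpha",
    "hppa": "CDROM_hppa",
    "parisc": "CDROM_hppa",   # assumed alternative spelling for PA-RISC
    "mips": "CDROM_mips",
    "sparc": "CDROM_sparc",
}


def pick_cdrom_module(machine=None):
    """Return the (hypothetical) variant module name for an architecture."""
    if machine is None:
        machine = platform.machine()
    machine = machine.lower()
    for prefix, module_name in _CDROM_VARIANTS.items():
        if machine.startswith(prefix):
            return module_name
    # Assumed fallback for every other architecture, e.g. Intel x86.
    return "CDROM_default"


print(pick_cdrom_module("sparc64"))
```

A dispatching CDROM.py could then load the chosen module, for instance with importlib.import_module(), and re-export its constants, which keeps a single import name for callers.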
Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API
On Tuesday 25 October 2011 00:57:42, Victor Stinner wrote:
> I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks. Because this change is incompatible with Python 3.2, even if such filenames are unusable and I consider the problem a (Python?) bug, I would like your opinion on such a change before working on a patch.

Most people like the idea, so I wrote a patch and attached it to: http://bugs.python.org/issue13247

The patch only changes os.getcwdb() and os.listdir().

We might use PEP 383 to store undecodable bytes as surrogates (U+DC80-U+DCFF). But the situation is the opposite of the situation on UNIX: on Windows, the problem is more with encoding (text->bytes) than with decoding (bytes->text). On UNIX, problems occur when the system is misconfigured (e.g. wrong locale encoding). On Windows, problems occur when your application uses the old (ANSI) API, whereas your filesystem is fully Unicode compliant and you created Unicode filenames with a program using the new (Windows) API.

I only changed functions returning filenames, so os.mkdir() is unchanged, for example. We may also patch the other functions to simplify the source code.

Victor
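For readers unfamiliar with PEP 383, the surrogate mechanism mentioned above corresponds to Python's "surrogateescape" error handler. The following sketch shows the UNIX-style decoding direction (bytes->text) and the lossless round trip; the sample bytes are made up for illustration.

```python
raw = b"caf\xe9.txt"  # latin-1 encoded name, not valid UTF-8

# Decoding: the undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = text.encode("utf-8", "surrogateescape")

print(ascii(text), round_tripped)
```

As the message points out, this helps with undecodable bytes; it does not address the Windows case where the text itself cannot be encoded to the ANSI code page.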
Re: [Python-Dev] [Python-checkins] cpython: Issue #13226: Add RTLD_xxx constants to the os module. These constants can by
On Tuesday 25 October 2011 14:50:44, Petri Lehtinen wrote:
> Hi,
>
> victor.stinner wrote:
>> http://hg.python.org/cpython/rev/c75427c0da06
>> changeset: 73127:c75427c0da06
>> user: Victor Stinner victor.stin...@haypocalc.com
>> date: Tue Oct 25 13:34:04 2011 +0200
>> summary:
>>   Issue #13226: Add RTLD_xxx constants to the os module. These constants can by
>>   used with sys.setdlopenflags().
>>
>> files:
>>   Doc/library/os.rst | 13 +
>>   Doc/library/sys.rst | 10 +-
>>   Lib/test/test_posix.py | 7 +++
>>   Misc/NEWS | 3 +++
>>   Modules/posixmodule.c | 26 ++
>>   5 files changed, 54 insertions(+), 5 deletions(-)
>
> [snip]
>
>> diff --git a/Misc/NEWS b/Misc/NEWS
>> --- a/Misc/NEWS
>> +++ b/Misc/NEWS
>> @@ -341,6 +341,9 @@
>>  Library
>>  ---
>>
>> +- Issue #13226: Add RTLD_xxx constants to the os module. These constants can by
>
> Typo: s/by/be/

Fixed, thanks.

Victor
Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API
On 10/25/2011 4:31 AM, Victor Stinner wrote:
> On Tuesday 25 October 2011 09:09:56 you wrote:
>>> I propose to raise Unicode errors if a filename cannot be decoded on Windows, instead of creating bogus filenames with question marks.
>>
>> Can you please elaborate what APIs you are talking about exactly?
>
> Basically, all functions processing filenames, so most functions of posixmodule.c. Some examples:

This seems way too broad. From your previous posts, I presumed that you only propose to change behavior when the user asks for the bytes version of a unicode name that cannot be properly converted to a bytes version.

> - os.listdir():

os.listdir(unicode) works fine and should not be changed. os.listdir(bytes) is what the OP of the issue wants changed.

> FindFirstFileA, FindNextFileA, FindClose

These are not Python names. Are they Windows API names?

> - os.lstat(): CreateFileA

This does not create a path and should not be changed as far as I can see.

> - os.getcwdb():

This you might change.

> getcwd()

This should not be, as no bytes are involved.

> - os.mkdir(): CreateDirectoryA
> - os.chmod(): SetFileAttributesA

Like os.lstat, these only accept a path and should do what they are supposed to do.

>> If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on this proposal. People that explicitly use bytes for file names deserve to get whatever exact platform semantics the platform has to offer. This is true on Unix, and it is also true on Windows.
>
> My proposal is a fix for a bug reported by a user: http://bugs.python.org/issue13247
>
> I want to keep the bytes API for backward compatibility, and it will still work for non-ASCII characters, but only for non-ASCII characters encodable to the ANSI code page.
>
> In practice, characters not encodable to the ANSI code page are very rare. For example, it is difficult to type such characters directly with the keyboard. I bet that very few people will notice the change.

Actually, Windows makes switching keyboard setups rather easy once you enable the feature. It might be that people who routinely use non-'ansi' characters in file and directory names do not routinely ask for bytes versions thereof.

The doc says "All functions accepting path or file names accept both bytes and string objects, and result in an object of the same type, if a path or file name is returned." It does that now, though it says nothing about the encoding assumed for input bytes or used for output bytes. It does not mention raising exceptions, so doing so is a feature change that would likely break code. Currently, exceptional situations are signalled with '?' in the returned path rather than with an exception object. It ('?') is a bad choice of signal though, given the other uses of '?' in paths.

--
Terry Jan Reedy
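The "same type in, same type out" rule quoted from the documentation can be observed with os.listdir(). This is a small sketch using a temporary directory and a made-up ASCII file name, so it does not exercise the unencodable-name case under discussion.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Create one file with a plain ASCII name.
    with open(os.path.join(tmp, "example.txt"), "w") as handle:
        handle.write("data")

    # str path in, list of str names out.
    text_names = os.listdir(tmp)

    # bytes path in, list of bytes names out.
    bytes_names = os.listdir(os.fsencode(tmp))

print(text_names, bytes_names)
```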
Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API
In general I agree with what you write, Terry. One clarification and one comment, though.

Terry Reedy writes:
> The doc says "All functions accepting path or file names accept both bytes and string objects, and result in an object of the same type, if a path or file name is returned." It does that now, though it says nothing about the encoding assumed for input bytes or used for output bytes.

That's determined by the OS, and figuring that out is the end user's problem.

> It does not mention raising exceptions, so doing so is a feature change that would likely break code. Currently, exceptional situations are signalled with '?' in the returned path rather than with an exception object. It ('?') is a bad choice of signal though, given the other uses of '?' in paths.

True, but this isn't really Python's problem. And IIUC Martin's post, it is hardly exceptional: it isn't Python doing this, it's just standard Windows behavior, which results in pathnames that are perfectly acceptable to Windows APIs, but unreliable in use because they have different semantics in different Windows APIs. If that is true, there are almost surely user programs that depend on this behavior, even though it sucks.[1]

My original hearty +1 was dependent on my understanding from Victor's post that this substitution could cause later exceptions because the filename is invalid (e.g., contains illegal characters causing Windows to signal an error). If that's not true, I think the proper remedy is to add a strong warning to pylint that use of those APIs is supported (e.g., for interaction with existing programs that use them) but that they require careful error-checking for robust use.

As a card-carrying Unicode nazi I wouldn't mind tagging the bytes APIs with a DeprecationWarning but I know that proposal is going nowhere so I withdraw it in advance. <wink>

Footnotes:
[1] Note that the original rationale for this was surely "since users will have a very hard time using file names with this character in them, using it as a substitution character internally will make the problem evident and Sufficiently Smart Programs can deal with it."