Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Martin v. Löwis
 I propose to raise Unicode errors if a filename cannot be decoded on Windows,
 instead of creating bogus filenames with question marks.

Can you please elaborate what APIs you are talking about exactly?

If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
this proposal. People that explicitly use bytes for file names deserve
to get whatever exact platform semantics the platform has to offer. This
is true on Unix, and it is also true on Windows.

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Antoine Pitrou
On Tue, 25 Oct 2011 00:57:42 +0200
Victor Stinner victor.stin...@haypocalc.com wrote:
 Hi,
 
 I propose to raise Unicode errors if a filename cannot be decoded on Windows,
 instead of creating bogus filenames with question marks. Because this change
 is incompatible with Python 3.2, even if such filenames are unusable and I
 consider the problem a (Python?) bug, I would like your opinion on such a
 change before working on a patch.

+1 from me.

Regards

Antoine.




Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 13:20:12 you wrote:
 Victor Stinner writes:
   I propose to raise Unicode errors if a filename cannot be decoded
   on Windows, instead of creating bogus filenames with question
   marks.
 
 By "bogus" you mean "sometimes (?) invalid and the OS will refuse to
 use them, causing a later hard-to-diagnose exception", rather than
 "not what the user thinks he wants", right?

If the (Unicode) filename cannot be encoded to the ANSI code page, which is
usually a small charset (e.g. cp1252 contains 256 code points), Windows
replaces unencodable characters with question marks.

Imagine that the code page is ASCII: the (Unicode) filename "hého.txt" will
be encoded to b"h?ho.txt". You can display this string in a dialog, but you
cannot open the file to read its content... If you pass the filename to
os.listdir(), it is even worse because "?" is interpreted as a wildcard ("?"
matches any single character; it is a pattern used to match filenames).
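The substitution described above can be reproduced on any platform with Python's own codecs. This is only a sketch under assumptions: the ASCII codec stands in for a small ANSI code page, and `fnmatch` stands in for the wildcard matching done by the Windows directory functions. Neither is what posixmodule.c actually calls.

```python
import fnmatch

name = "h\xe9ho.txt"  # "hého.txt"

# Lossy encoding, similar in effect to what the ANSI API does silently.
lossy = name.encode("ascii", "replace")

# Strict encoding reports the problem instead of hiding it.
try:
    name.encode("ascii", "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Used as a pattern, "?" matches any single character, so the lossy
# name matches unrelated files as well as the original one.
pattern = lossy.decode("ascii")
candidates = ("h\xe9ho.txt", "haho.txt", "hello.txt")
matches = [n for n in candidates if fnmatch.fnmatchcase(n, pattern)]
```

Here `lossy` is b"h?ho.txt", and `matches` contains both "hého.txt" and "haho.txt", which is why the substituted name is unreliable as an identifier.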

I would like to raise an error in such a situation, because currently the user
cannot be notified otherwise. The user may search for "?" in the filename, but
Windows also replaces unencodable characters with a *similar glyph* (e.g. é is
replaced by e).
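To illustrate why searching for "?" is not enough, here is a rough sketch of a round-trip check. The function `toy_best_fit_encode` is a hypothetical stand-in that strips accents through Unicode decomposition; it is not the Windows best-fit table, only something that behaves similarly for this one example.

```python
import unicodedata


def toy_best_fit_encode(text):
    """Hypothetical stand-in for best-fit mapping: strip accents, drop the rest."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore")


def survives_round_trip(text, encoded, encoding="ascii"):
    """Return True only if decoding the bytes gives back the original text."""
    return encoded.decode(encoding) == text


name = "h\xe9ho.txt"  # "hého.txt"
encoded = toy_best_fit_encode(name)
has_question_mark = b"?" in encoded
lossless = survives_round_trip(name, encoded)
```

The encoded name contains no "?" at all, yet the round trip fails, so only a comparison with the original text (or a strict codec) detects the loss.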

 In the hard errors case, a hearty +1 (I'm dealing with this in an
 experimental version of XEmacs and it's a right PITA if the codec
 doesn't complain). 

If you use MultiByteToWideChar and WideCharToMultiByte, you can be notified of
errors using some flags, but the functions of the ANSI API don't give access to
these flags...

 Backward compatibility is important, but here the
 costs of fixing such bugs outweigh the value of bug-compatibility.

I only want to change how unencodable filenames are handled; the bytes API will
still be available. If your filesystem has the 8dot3name feature enabled, it
may work even for unencodable filenames (Windows generates names like
HEHO~1.TXT).

Victor


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 09:09:56 you wrote:
  I propose to raise Unicode errors if a filename cannot be decoded on
  Windows, instead of creating bogus filenames with question marks.
 
 Can you please elaborate what APIs you are talking about exactly?

Basically, all functions processing filenames, so most functions of 
posixmodule.c. Some examples:

- os.listdir(): FindFirstFileA, FindNextFileA, FindClose
- os.lstat(): CreateFileA
- os.getcwdb(): getcwd()
- os.mkdir(): CreateDirectoryA
- os.chmod(): SetFileAttributesA
- ...
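For readers less familiar with the two flavours of these functions, a small portable sketch: the type of the argument selects the type of the result. Nothing here is Windows-specific; on Windows, the bytes variants are the ones that go through the ANSI functions listed above.

```python
import os

# str in, str out
str_names = os.listdir(".")
cwd_text = os.getcwd()

# bytes in, bytes out
bytes_names = os.listdir(b".")
cwd_bytes = os.getcwdb()

all_str = all(isinstance(n, str) for n in str_names)
all_bytes = all(isinstance(n, bytes) for n in bytes_names)
```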

 If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
 this proposal. People that explicitly use bytes for file names deserve
 to get whatever exact platform semantics the platform has to offer. This
 is true on Unix, and it is also true on Windows.

My proposal is a fix for a bug reported by a user:
http://bugs.python.org/issue13247

I want to keep the bytes API for backward compatibility, and it will still 
work for non-ASCII characters, but only for non-ASCII characters encodable to 
the ANSI code page.

In practice, characters not encodable to the ANSI code page are very rare. For
example, it is difficult to type such characters directly with the keyboard. I
bet that very few people will notice the change.

Victor


Re: [Python-Dev] memcmp performance

2011-10-25 Thread Stefan Behnel

Richard Saunders, 25.10.2011 01:17:

-On [20111024 09:22], Stefan Behnel wrote:
  I agree. Given that the analysis shows that the libc memcmp() is
  particularly fast on many Linux systems, it should be up to the Python
  package maintainers for these systems to set that option externally through
  the optimisation CFLAGS.

Indeed, this is how I constructed my Python 3.3 and Python 2.7 :
setenv CFLAGS '-fno-builtin-memcmp'
just before I configured.

I would like to revisit changing unicode_compare: adding a
special arm for using memcmp when the unicode kinds are the
same will only work in two specific instances:

(1) the strings are the same kind, the char size is 1
* We could add THIS to unicode_compare, but it seems extremely
specialized by itself


But also extremely likely to happen. This means that the strings are pure 
ASCII, which is highly likely and one of the main reasons why the unicode 
string layout was rewritten for CPython 3.3. It allows CPython to save a 
lot of memory (thus clearly proving how likely this case is!), and it would 
also allow it to do faster comparisons for these strings.
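The memory effect of the new layout can be observed directly. A small sketch, assuming CPython 3.3 or later (other implementations may store strings differently); exact byte counts vary between versions, so only the ordering is meaningful.

```python
import sys

n = 1000
ascii_text = "a" * n            # 1 byte per character
bmp_text = "\u20ac" * n         # 2 bytes per character
astral_text = "\U0001f600" * n  # 4 bytes per character

# Same length in characters, different storage size.
lengths = [len(s) for s in (ascii_text, bmp_text, astral_text)]
sizes = [sys.getsizeof(s) for s in (ascii_text, bmp_text, astral_text)]
```

With one byte per character, two same-kind strings are plain byte arrays, which is the case where a memcmp()-based comparison applies most directly.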


Stefan



Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 09:09:56 you wrote:
 If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
 this proposal. People that explicitly use bytes for file names deserve
 to get whatever exact platform semantics the platform has to offer. This
 is true on Unix, and it is also true on Windows.

For your information, it took me something like 3 months (when I was working
on issue #12281) to understand exactly how Windows handles undecodable
bytes and unencodable characters. I did a lot of tests on different Windows
versions (XP, Vista and Seven; the behaviour changed in Windows Vista). I had
to take notes because it is really complex. Well, I wanted to understand
exactly *all* code pages, including CP_UTF7 and CP_UTF8, not only the most
common ones like cp1252 or cp932.

See the dedicated section in my book to learn more about these functions:

http://www.haypocalc.com/tmp/unicode-2011-07-20/html/operating_systems.html#encode-and-decode-functions

Some information is available in the MultiByteToWideChar and WideCharToMultiByte
documentation, but it is not well explained :-p

Victor


Re: [Python-Dev] memcmp performance

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 10:44:16 Stefan Behnel wrote:
 Richard Saunders, 25.10.2011 01:17:
  -On [20111024 09:22], Stefan Behnel wrote:
I agree. Given that the analysis shows that the libc memcmp() is
particularly fast on many Linux systems, it should be up to the
Python package maintainers for these systems to set that option
externally through the optimisation CFLAGS.
  
  Indeed, this is how I constructed my Python 3.3 and Python 2.7 :
  setenv CFLAGS '-fno-builtin-memcmp'
  just before I configured.
  
  I would like to revisit changing unicode_compare: adding a
  special arm for using memcmp when the unicode kinds are the
  same will only work in two specific instances:
  
  (1) the strings are the same kind, the char size is 1
  * We could add THIS to unicode_compare, but it seems extremely
  specialized by itself
 
 But also extremely likely to happen. This means that the strings are pure
 ASCII, which is highly likely and one of the main reasons why the unicode
 string layout was rewritten for CPython 3.3. It allows CPython to save a
 lot of memory (thus clearly proving how likely this case is!), and it would
 also allow it to do faster comparisons for these strings.

Python 3.3 already has some optimizations for latin1: the CPU and the C language
process char* strings more efficiently than Py_UCS2 and Py_UCS4 strings.
For example, we use memchr() to search for a single character in a latin1
string.

Victor


Re: [Python-Dev] [Python-checkins] cpython: #13251: update string description in datamodel.rst.

2011-10-25 Thread Petri Lehtinen
Hi,

ezio.melotti wrote:
 http://hg.python.org/cpython/rev/11d18ebb2dd1
 changeset:   73116:11d18ebb2dd1
 user:Ezio Melotti ezio.melo...@gmail.com
 date:Tue Oct 25 09:23:42 2011 +0300
 summary:
   #13251: update string description in datamodel.rst.
 
 files:
   Doc/reference/datamodel.rst |  20 ++--
   1 files changed, 10 insertions(+), 10 deletions(-)
 
 
 diff --git a/Doc/reference/datamodel.rst b/Doc/reference/datamodel.rst
 --- a/Doc/reference/datamodel.rst
 +++ b/Doc/reference/datamodel.rst
 @@ -276,16 +276,16 @@
  single: integer
  single: Unicode
  
 - The items of a string object are Unicode code units.  A Unicode code
 - unit is represented by a string object of one item and can hold 
 either
 - a 16-bit or 32-bit value representing a Unicode ordinal (the maximum
 - value for the ordinal is given in ``sys.maxunicode``, and depends on
 - how Python is configured at compile time).  Surrogate pairs may be
 - present in the Unicode object, and will be reported as two separate
 - items.  The built-in functions :func:`chr` and :func:`ord` convert
 - between code units and nonnegative integers representing the Unicode
 - ordinals as defined in the Unicode Standard 3.0. Conversion from 
 and to
 - other encodings are possible through the string method 
 :meth:`encode`.
 + A string is a sequence of values that represent Unicode codepoints.
 + All the codepoints in range ``U+0000 - U+10FFFF`` can be represented
 + in a string.  Python doesn't have a :c:type:`chr` type, and
 + every characters in the string is represented as a string object
  typo ^

Should be "character", right?

 + with length ``1``.  The built-in function :func:`chr` converts a
 + character to its codepoint (as an integer); :func:`ord` converts
 + an integer in range ``0 - 10FFFF`` to the corresponding character.

Actually chr() converts an integer to a string and ord() converts a
string to an integer. chr and ord are swapped in your text.

 + :meth:`str.encode` can be used to convert a :class:`str` to
 + :class:`bytes` using the given encoding, and :meth:`bytes.decode` 
 can
 + be used to achieve the opposite.
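A quick interactive check of the direction of each function, as a sketch of what the corrected paragraph should say:

```python
# ord(): string of length 1 -> integer code point
code_point = ord("\xe9")  # "é"

# chr(): integer code point -> string of length 1
character = chr(0xE9)
highest = chr(0x10FFFF)

# Values above U+10FFFF are rejected.
try:
    chr(0x110000)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False

# str.encode() and bytes.decode() convert between str and bytes.
encoded = "h\xe9ho".encode("utf-8")
decoded = encoded.decode("utf-8")
```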


Petri


Re: [Python-Dev] [Python-checkins] cpython: Issue #13226: Add RTLD_xxx constants to the os module. These constants can by

2011-10-25 Thread Petri Lehtinen
Hi,

victor.stinner wrote:
 http://hg.python.org/cpython/rev/c75427c0da06
 changeset:   73127:c75427c0da06
 user:Victor Stinner victor.stin...@haypocalc.com
 date:Tue Oct 25 13:34:04 2011 +0200
 summary:
   Issue #13226: Add RTLD_xxx constants to the os module. These constants can 
 by
 used with sys.setdlopenflags().
 
 files:
   Doc/library/os.rst |  13 +
   Doc/library/sys.rst|  10 +-
   Lib/test/test_posix.py |   7 +++
   Misc/NEWS  |   3 +++
   Modules/posixmodule.c  |  26 ++
   5 files changed, 54 insertions(+), 5 deletions(-)

[snip]

 diff --git a/Misc/NEWS b/Misc/NEWS
 --- a/Misc/NEWS
 +++ b/Misc/NEWS
 @@ -341,6 +341,9 @@
  Library
  ---
  
 +- Issue #13226: Add RTLD_xxx constants to the os module. These constants can 
 by

Typo: s/by/be/

 +  used with sys.setdlopenflags().
 +
  - Issue #10278: Add clock_getres(), clock_gettime() and CLOCK_xxx constants 
 to
the time module. time.clock_gettime(time.CLOCK_MONOTONIC) provides a
monotonic clock


Petri


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Martin v. Löwis
 My proposal is a fix for a bug reported by a user:
 http://bugs.python.org/issue13247

So your proposal is that abspath(b".") shall raise a UnicodeError in
this case?

Are you serious???

 In practice, characters not encodable to the ANSI code page are very rare.
 For example, it is difficult to type such characters directly with the
 keyboard. I bet that very few people will notice the change.

Except people running into the very issues you are trying to resolve.
I'm not sure these people are really helped by having their applications
crash all of a sudden.

Regards,
Martin


Re: [Python-Dev] Modules of plat-* directories

2011-10-25 Thread Martin v. Löwis
On 24.10.2011 14:06, Victor Stinner wrote:
 There are open issues related to plat-XXX.
 
 On Monday 24 October 2011 00:03:42 Martin v. Löwis wrote:
 no, we make no changes to them unless a user actually requests a change
 
 Matthias Klose asked for socket SIO* constants in September 2006 (5 years
 ago).
 http://bugs.python.org/issue1565071
 
 I would prefer to see such constants in the socket module.

These are not mutually exclusive. You can regenerate IN.py and still
add the constants to the socket module.

 Thiemo Seufer noticed in September 2008 (3 years ago) that the linux2
 platform definition is incorrect for several architectures, namely Alpha,
 PA-RISC (hppa), MIPS and SPARC. He proposed to add a sublevel:
 Lib/plat-linux2/CDROM.py would become:
 
  - Lib/plat-linux2-alpha/CDROM.py
  - Lib/plat-linux2-hppa/CDROM.py
  - Lib/plat-linux2-mips/CDROM.py,
  - Lib/plat-linux2-sparc/CDROM.py
  - (and a default for other platforms like Intel x86?)
 
 = http://bugs.python.org/issue3990
 
 I really don't like this idea (of adding the architecture in the directory 
 name) :-p

Neither do I. In the specific case, I'd generate four versions of
CDROM.py (with differing names), and provide a CDROM.py that imports the
right one.
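A rough sketch of what such a dispatching CDROM.py could look like. The module names (`CDROM_alpha` and so on) and the prefixes matched against `platform.machine()` are assumptions made for illustration; nothing like this exists in CPython.

```python
import platform

# Hypothetical per-architecture variants, keyed by a prefix of the
# machine name reported by the platform module.
_VARIANTS = {
    "alpha": "CDROM_alpha",
    "parisc": "CDROM_hppa",
    "mips": "CDROM_mips",
    "sparc": "CDROM_sparc",
}


def pick_cdrom_module(machine=None):
    """Return the name of the variant module for the given machine type."""
    machine = (machine or platform.machine()).lower()
    for prefix, module_name in _VARIANTS.items():
        if machine.startswith(prefix):
            return module_name
    # Default for everything else, e.g. Intel x86.
    return "CDROM_generic"
```

The wrapper module would then import the chosen variant, for example with importlib.import_module(pick_cdrom_module()), and re-export its constants.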

 IMO plat-XXX is wrong by design.

I disagree. It's limited, not wrong.

 It would be better if at least these files
 were regenerated at build, but Martin doesn't want to regenerate them. And
 there is still the problem of Mac OS X, which embeds 3 binaries for 3
 architectures in the same FAT file.

These are problems, but not necessarily issues. Even if some of the
values are incorrect, the values that are correct may still be useful.

Regards,
Martin


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 00:57:42, Victor Stinner wrote:
 I propose to raise Unicode errors if a filename cannot be decoded on
 Windows, instead of creating bogus filenames with question marks.
 Because this change is incompatible with Python 3.2, even if such
 filenames are unusable and I consider the problem a (Python?) bug, I
 would like your opinion on such a change before working on a patch.

Most people like the idea, so I wrote a patch and attached it to:

   http://bugs.python.org/issue13247

The patch only changes os.getcwdb() and os.listdir().

 We might use PEP 383 to store undecodable bytes as surrogates
 (U+DC80-U+DCFF). But the situation is the opposite of the situation on UNIX:
 on Windows, the problem is more on encoding (text->bytes) than on decoding
 (bytes->text). On UNIX, problems occur when the system is misconfigured
 (e.g. wrong locale encoding). On Windows, problems occur when your
 application uses the old (ANSI) API, whereas your filesystem is fully
 Unicode compliant and you created Unicode filenames with a program using
 the new (Windows) API.
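A short sketch of why PEP 383 helps in one direction only, using the ASCII codec as a stand-in for a restrictive code page. The `surrogateescape` error handler lets undecodable bytes survive a round trip, but it does nothing for a genuine character that the target encoding lacks.

```python
# Decoding direction (the UNIX problem): the undecodable byte 0xE9 is
# smuggled through as the lone surrogate U+DCE9 and restored on encoding.
raw = b"h\xe9ho.txt"
text = raw.decode("ascii", "surrogateescape")
restored = text.encode("ascii", "surrogateescape")

# Encoding direction (the Windows problem): a real "é" is not a lone
# surrogate, so the handler cannot help and the encoder still fails.
try:
    "h\xe9ho.txt".encode("ascii", "surrogateescape")
    real_char_encodable = True
except UnicodeEncodeError:
    real_char_encodable = False
```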

I only changed functions returning filenames, so os.mkdir() is unchanged for 
example.

We may also patch the other functions to simplify the source code.

Victor


Re: [Python-Dev] [Python-checkins] cpython: Issue #13226: Add RTLD_xxx constants to the os module. These constants can by

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 14:50:44, Petri Lehtinen wrote:
 Hi,
 
 victor.stinner wrote:
  http://hg.python.org/cpython/rev/c75427c0da06
  changeset:   73127:c75427c0da06
  user:Victor Stinner victor.stin...@haypocalc.com
  date:Tue Oct 25 13:34:04 2011 +0200
  
  summary:
Issue #13226: Add RTLD_xxx constants to the os module. These constants
can by
  
  used with sys.setdlopenflags().
  
  files:
Doc/library/os.rst |  13 +
Doc/library/sys.rst|  10 +-
Lib/test/test_posix.py |   7 +++
Misc/NEWS  |   3 +++
Modules/posixmodule.c  |  26 ++
5 files changed, 54 insertions(+), 5 deletions(-)
 
 [snip]
 
  diff --git a/Misc/NEWS b/Misc/NEWS
  --- a/Misc/NEWS
  +++ b/Misc/NEWS
  @@ -341,6 +341,9 @@
  
   Library
   ---
  
  +- Issue #13226: Add RTLD_xxx constants to the os module. These constants
  can by
 
 Typo: s/by/be/

Fixed, thanks.

Victor


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Terry Reedy

On 10/25/2011 4:31 AM, Victor Stinner wrote:

On Tuesday 25 October 2011 09:09:56 you wrote:

I propose to raise Unicode errors if a filename cannot be decoded on
Windows, instead of creating bogus filenames with question marks.


Can you please elaborate what APIs you are talking about exactly?


Basically, all functions processing filenames, so most functions of
posixmodule.c. Some examples:


This seems way too broad. From your previous posts, I presumed that you
only propose to change behavior when the user asks for the bytes
version of a unicode name that cannot be properly converted to a bytes
version.



- os.listdir():


os.listdir(unicode) works fine and should not be changed.
os.listdir(bytes) is what OP of issue wants changed.


FindFirstFileA, FindNextFileA, FindClose


These are not Python names. Are they Windows API names?


- os.lstat(): CreateFileA


This does not create a path and should not be changed as far as I can see.


- os.getcwdb():


This you might change.

 getcwd()

This should not be, as no bytes are involved.


- os.mkdir(): CreateDirectoryA
- os.chmod(): SetFileAttributesA


Like os.lstat, these only accept a path and should do what they
are supposed to do.



If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
this proposal. People that explicitly use bytes for file names deserve
to get whatever exact platform semantics the platform has to offer. This
is true on Unix, and it is also true on Windows.


My proposal is a fix for a bug reported by a user:
http://bugs.python.org/issue13247

I want to keep the bytes API for backward compatibility, and it will still
work for non-ASCII characters, but only for non-ASCII characters encodable to
the ANSI code page.

In practice, characters not encodable to the ANSI code page are very rare. For
example, it is difficult to type such characters directly with the keyboard. I
bet that very few people will notice the change.


Actually, Windows makes switching keyboard setups rather easy once you 
enable the feature. It might be that people who routinely use non-'ansi' 
characters in file and directory names do not routinely ask for bytes 
versions thereof.


The doc says "All functions accepting path or file names accept both
bytes and string objects, and result in an object of the same type, if a
path or file name is returned." It does that now, though it says nothing
about the encoding assumed for input bytes or used for output bytes. It 
does not mention raising exceptions, so doing so is a feature-change 
that would likely break code. Currently, exceptional situations are 
signalled with '?' in returned_path rather than with an exception 
object. It ('?') is a bad choice of signal though, given the other uses 
of '?' in paths.


--
Terry Jan Reedy




Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Stephen J. Turnbull
In general I agree with what you write, Terry.  One clarification and
one comment, though.

Terry Reedy writes:

  The doc says "All functions accepting path or file names accept both
  bytes and string objects, and result in an object of the same type, if a
  path or file name is returned." It does that now, though it says nothing
  about the encoding assumed for input bytes or used for output
  bytes.

That's determined by the OS, and figuring that out is the end user's
problem.

  It does not mention raising exceptions, so doing so is a
  feature-change that would likely break code. Currently, exceptional
  situations are signalled with '?' in returned_path rather than
  with an exception object. It ('?') is a bad choice of signal
  though, given the other uses of '?' in paths.

True, but this isn't really Python's problem.  And IIUC Martin's post,
it is hardly exceptional: it isn't Python doing this, it's just
standard Windows behavior, which results in pathnames that are
perfectly acceptable to Windows APIs, but unreliable in use because
they have different semantics in different Windows APIs.  If that is
true, there are almost surely user programs that depend on this
behavior, even though it sucks.[1]

My original "hearty +1" was dependent on my understanding from
Victor's post that this substitution could cause later exceptions
because the filename is invalid (e.g., contains illegal characters causing
Windows to signal an error).  If that's not true, I think the proper
remedy is to add a strong warning to pylint that use of those APIs is
supported (e.g., for interaction with existing programs that use them)
but that they require careful error-checking for robust use.
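As an illustration of the kind of error-checking meant here, a minimal sketch of a helper that separates suspicious names from a bytes directory listing. The helper name is made up, and the check rests on an assumption: it treats any "?" as a substitution marker, which is reasonable on Windows (where "?" is not a legal filename character) but not on Unix, and it cannot detect best-fit replacements such as é becoming e.

```python
def flag_suspect_names(names):
    """Split bytes file names into (usable, suspect) lists.

    A name is suspect if it contains b"?", which the ANSI API uses in
    place of characters it could not encode.
    """
    usable, suspect = [], []
    for name in names:
        if b"?" in name:
            suspect.append(name)
        else:
            usable.append(name)
    return usable, suspect


usable, suspect = flag_suspect_names([b"a.txt", b"h?ho.txt", b"heho.txt"])
```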

As a card-carrying Unicode nazi I wouldn't mind tagging the bytes APIs
with a DeprecationWarning but I know that proposal is going nowhere so
I withdraw it in advance. <wink>


Footnotes: 
[1]  Note that the original rationale for this was surely "since users
will have a very hard time using file names with this character in
them, using it as a substitution character internally will make the
problem evident and Sufficiently Smart Programs can deal with it."
