Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-26 Thread Victor Stinner
On Tuesday 25 October 2011 10:31:56, Victor Stinner wrote:
> Basically, all functions processing filenames, so most functions of
> posixmodule.c. Some examples:
>
> - os.listdir(): FindFirstFileA, FindNextFileA, FindClose
> - os.lstat(): CreateFileA
> - os.getcwdb(): getcwd()
> - os.mkdir(): CreateDirectoryA
> - os.chmod(): SetFileAttributesA
> - ...

> This seems way too broad.

I changed my mind about this list: I only want to change how filenames are
encoded, not how filenames are decoded. So only os.listdir() and os.getcwdb()
should be changed, as I wrote in another email in this thread and in issue
#13247.

> - os.getcwdb():
> This you might change.

Issue #13247 combines os.getcwdb() and os.listdir(). Read the issue for more 
information.

> It ('?') is a bad choice of signal though, given the other uses
> of '?' in paths.

If I understood correctly, '?' is a pattern to match any character in 
FindFirstFile/FindNextFile. Python cannot configure the replacement character, 
it's hardcoded to ? (U+003F).
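
As a rough illustration of why a '?' inside a returned name is ambiguous, here
is a sketch only: Python's fnmatch module stands in for the pattern matching
done by FindFirstFile/FindNextFile, and the directory listing is made up for
the example.

```python
import fnmatch

# A name as the bytes API might return it, with the unencodable
# character replaced by '?'.
mangled = "h?ho.txt"

# Hypothetical directory contents, chosen for illustration only.
names = ["h\u00e9ho.txt", "haho.txt", "hoho.txt", "other.txt"]

# In a pattern, '?' means "any single character", so the mangled name
# no longer identifies one specific file.
matches = [n for n in names if fnmatch.fnmatchcase(n, mangled)]
print(matches)
```

Three of the four names match, so the mangled name cannot be mapped back to
the original file.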

> it's just
> standard Windows behavior, which results in pathnames that are
> perfectly acceptable to Windows APIs, but unreliable in use because
> they have different semantics in different Windows APIs.

I think that such filenames cannot be used with any Windows function that
accesses the filesystem. An excerpt from the issue:

"Such filenames cannot be used, open() fails with OSError(22, "invalid
argument: '?'") for example."

Such filenames are only useful if you want to display the contents of a
directory; don't expect to be able to read the files' contents.

--

Anyway, you must use Unicode on Windows! The bytes API was just kept for 
backward compatibility.

Victor
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Martin v. Löwis
> I propose to raise Unicode errors if a filename cannot be decoded on Windows,
> instead of creating bogus filenames with question marks.

Can you please elaborate what APIs you are talking about exactly?

If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
this proposal. People that explicitly use bytes for file names deserve
to get whatever exact platform semantics the platform has to offer. This
is true on Unix, and it is also true on Windows.

Regards,
Martin


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Antoine Pitrou
On Tue, 25 Oct 2011 00:57:42 +0200
Victor Stinner victor.stin...@haypocalc.com wrote:
> Hi,
>
> I propose to raise Unicode errors if a filename cannot be decoded on Windows,
> instead of creating bogus filenames with question marks. Because this change
> is incompatible with Python 3.2, even if such filenames are unusable and I
> consider the problem as a (Python?) bug, I would like your opinion on such a
> change before working on a patch.

+1 from me.

Regards

Antoine.




Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 13:20:12, you wrote:
> Victor Stinner writes:
>  > I propose to raise Unicode errors if a filename cannot be decoded
>  > on Windows, instead of creating bogus filenames with question
>  > marks.
>
> By "bogus" you mean "sometimes (?) invalid and the OS will refuse to
> use them, causing a later hard-to-diagnose exception", rather than
> "not what the user thinks he wants", right?

If the (Unicode) filename cannot be encoded to the ANSI code page, which is
usually a small charset (e.g. cp1252 contains 256 code points), Windows
replaces unencodable characters with question marks.

Imagine that the code page is ASCII: the (Unicode) filename "hého.txt" will
be encoded to b"h?ho.txt". You can display this string in a dialog, but you
cannot open the file to read its content... If you pass the filename to
os.listdir(), it is even worse because '?' is interpreted ('?' means any
character, it's a pattern to match a filename).
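
A minimal sketch of that example, using Python's own ascii codec with the
"replace" error handler as a stand-in for what the ANSI API does silently
(the real substitution happens inside Windows, not in Python):

```python
name = "h\u00e9ho.txt"  # "hého.txt"

# errors="replace" mimics the silent substitution: no error is raised
# and the original character is lost.
lossy = name.encode("ascii", errors="replace")
print(lossy)

# errors="strict" (the default) reports the problem when encoding.
try:
    name.encode("ascii")
    strict_failed = False
except UnicodeEncodeError as exc:
    strict_failed = True
    print("cannot encode:", ascii(exc.object[exc.start:exc.end]))
```

The lossy result is b'h?ho.txt', which no longer names the original file.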

I would like to raise an error in such a situation, because currently the user
cannot be notified otherwise. The user may search for '?' in the filename, but
Windows also replaces unencodable characters with a *similar glyph* (e.g. é
replaced by e).

> In the hard errors case, a hearty +1 (I'm dealing with this in an
> experimental version of XEmacs and it's a right PITA if the codec
> doesn't complain).

If you use MultiByteToWideChar and WideCharToMultiByte, you can be notified of
errors using some flags, but the functions of the ANSI API don't give access to
these flags...

> Backward compatibility is important, but here the
> costs of fixing such bugs outweigh the value of bug-compatibility.

I only want to change how unencodable filenames are handled; the bytes API will
still be available. If your filesystem has the 8dot3name feature enabled, it
may work even for unencodable filenames (Windows generates names like
HEHO~1.TXT).

Victor


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 09:09:56, you wrote:
> > I propose to raise Unicode errors if a filename cannot be decoded on
> > Windows, instead of creating bogus filenames with question marks.
>
> Can you please elaborate what APIs you are talking about exactly?

Basically, all functions processing filenames, so most functions of 
posixmodule.c. Some examples:

- os.listdir(): FindFirstFileA, FindNextFileA, FindClose
- os.lstat(): CreateFileA
- os.getcwdb(): getcwd()
- os.mkdir(): CreateDirectoryA
- os.chmod(): SetFileAttributesA
- ...

> If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
> this proposal. People that explicitly use bytes for file names deserve
> to get whatever exact platform semantics the platform has to offer. This
> is true on Unix, and it is also true on Windows.

My proposal is a fix for a bug reported by a user:
http://bugs.python.org/issue13247

I want to keep the bytes API for backward compatibility, and it will still 
work for non-ASCII characters, but only for non-ASCII characters encodable to 
the ANSI code page.

In practice, characters not encodable to the ANSI code page are very rare. For
example: it's difficult to write such characters directly with the keyboard. I
bet that very few people will notice the change.

Victor


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 09:09:56, you wrote:
> If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
> this proposal. People that explicitly use bytes for file names deserve
> to get whatever exact platform semantics the platform has to offer. This
> is true on Unix, and it is also true on Windows.

For your information, it took me something like 3 months (when I was working 
on the issue #12281) to understand exactly how Windows handles undecodable 
bytes and unencodable characters. I did a lot of tests on different Windows 
versions (XP, Vista and Seven, the behaviour changed in Windows Vista). I had 
to take notes because it is really complex. Well, I wanted to understand 
exactly *all* code pages, including CP_UTF7 and CP_UTF8, not only the most 
common ones like cp1252 or cp932.

See the dedicated section in my book to learn more about these functions:

http://www.haypocalc.com/tmp/unicode-2011-07-20/html/operating_systems.html#encode-and-decode-functions

Some information is available in the MultiByteToWideChar and WideCharToMultiByte
documentation, but it is not well explained :-p

Victor


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Martin v. Löwis
> My proposal is a fix for a bug reported by a user:
> http://bugs.python.org/issue13247

So your proposal is that abspath(b".") shall raise a UnicodeError in
this case?

Are you serious???

> In practice, characters not encodable to the ANSI code page are very rare.
> For example: it's difficult to write such characters directly with the
> keyboard. I bet that very few people will notice the change.

Except people running into the very issues you are trying to resolve.
I'm not sure these people are really helped by having their applications
crash all of a sudden.

Regards,
Martin


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Victor Stinner
On Tuesday 25 October 2011 00:57:42, Victor Stinner wrote:
> I propose to raise Unicode errors if a filename cannot be decoded on
> Windows, instead of creating bogus filenames with question marks.
> Because this change is incompatible with Python 3.2, even if such
> filenames are unusable and I consider the problem as a (Python?) bug, I
> would like your opinion on such a change before working on a patch.

Most people like the idea, so I wrote a patch and attached it to:

   http://bugs.python.org/issue13247

The patch only changes os.getcwdb() and os.listdir().

> We might use PEP 383 to store undecodable bytes as surrogates (U+DC80-
> U+DCFF). But the situation is the opposite of the situation on UNIX: on
> Windows, the problem is more on encoding (text -> bytes) than on decoding
> (bytes -> text). On UNIX, problems occur when the system is misconfigured
> (e.g. wrong locale encoding). On Windows, problems occur when your
> application uses the old (ANSI) API, whereas your filesystem is fully
> Unicode compliant and you created Unicode filenames with a program using
> the new (Windows) API.
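
For reference, a short sketch of the PEP 383 mechanism mentioned in the quoted
paragraph, using the "surrogateescape" error handler on the decoding side. The
byte string is an assumption picked for illustration (Latin-1 bytes that are
not valid UTF-8); it shows why this handler addresses undecodable bytes rather
than unencodable characters.

```python
# Latin-1 encoded name; the byte 0xE9 is not valid UTF-8 here.
raw = b"h\xe9ho.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN instead of raising or substituting '?'.
text = raw.decode("utf-8", errors="surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")
print(roundtrip == raw)
```

The round trip is lossless for bytes -> text -> bytes, which is the UNIX case.
It does not help when a real Unicode character has no representation in the
ANSI code page.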

I only changed functions returning filenames, so os.mkdir() is unchanged for 
example.

We may also patch the other functions to simplify the source code.

Victor


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Terry Reedy

On 10/25/2011 4:31 AM, Victor Stinner wrote:
> On Tuesday 25 October 2011 09:09:56, you wrote:
>>> I propose to raise Unicode errors if a filename cannot be decoded on
>>> Windows, instead of creating bogus filenames with question marks.
>>
>> Can you please elaborate what APIs you are talking about exactly?
>
> Basically, all functions processing filenames, so most functions of
> posixmodule.c. Some examples:

This seems way too broad. From your previous posts, I presumed that you
only propose to change behavior when the user asks for the bytes
version of a Unicode name that cannot be properly converted to a bytes
version.



> - os.listdir():

os.listdir(unicode) works fine and should not be changed.
os.listdir(bytes) is what the OP of the issue wants changed.

> FindFirstFileA, FindNextFileA, FindClose

These are not Python names. Are they Windows API names?

> - os.lstat(): CreateFileA

This does not create a path and should not be changed as far as I can see.

> - os.getcwdb():

This you might change.

> getcwd()

This should not be, as no bytes are involved.

> - os.mkdir(): CreateDirectoryA
> - os.chmod(): SetFileAttributesA

Like os.lstat, these only accept a path and should do what they
are supposed to do.



>> If it's the byte APIs (i.e. using bytes as file names), then I'm -1 on
>> this proposal. People that explicitly use bytes for file names deserve
>> to get whatever exact platform semantics the platform has to offer. This
>> is true on Unix, and it is also true on Windows.
>
> My proposal is a fix for a bug reported by a user:
> http://bugs.python.org/issue13247
>
> I want to keep the bytes API for backward compatibility, and it will still
> work for non-ASCII characters, but only for non-ASCII characters encodable to
> the ANSI code page.
>
> In practice, characters not encodable to the ANSI code page are very rare. For
> example: it's difficult to write such characters directly with the keyboard. I
> bet that very few people will notice the change.


Actually, Windows makes switching keyboard setups rather easy once you
enable the feature. It might be that people who routinely use non-'ansi'
characters in file and directory names do not routinely ask for bytes
versions thereof.

The doc says "All functions accepting path or file names accept both
bytes and string objects, and result in an object of the same type, if a
path or file name is returned." It does that now, though it says nothing
about the encoding assumed for input bytes or used for output bytes. It
does not mention raising exceptions, so doing so is a feature-change
that would likely break code. Currently, exceptional situations are
signalled with '?' in returned_path rather than with an exception
object. It ('?') is a bad choice of signal though, given the other uses
of '?' in paths.
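
A small sketch of the documented "same type in, same type out" behaviour,
assuming only that the current directory is readable. It runs on any platform
and says nothing about which encoding is used for the bytes results.

```python
import os

# str argument: the returned names are str.
str_names = os.listdir(".")

# bytes argument: the returned names are bytes.
bytes_names = os.listdir(b".")

print(all(isinstance(n, str) for n in str_names))
print(all(isinstance(n, bytes) for n in bytes_names))

# The same split exists for the current directory.
print(type(os.getcwd()).__name__, type(os.getcwdb()).__name__)
```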


--
Terry Jan Reedy




Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-25 Thread Stephen J. Turnbull
In general I agree with what you write, Terry.  One clarification and
one comment, though.

Terry Reedy writes:

 > The doc says "All functions accepting path or file names accept both
 > bytes and string objects, and result in an object of the same type, if a
 > path or file name is returned." It does that now, though it says nothing
 > about the encoding assumed for input bytes or used for output
 > bytes.

That's determined by the OS, and figuring that out is the end user's
problem.

 > It does not mention raising exceptions, so doing so is a
 > feature-change that would likely break code. Currently, exceptional
 > situations are signalled with '?' in returned_path rather than
 > with an exception object. It ('?') is a bad choice of signal
 > though, given the other uses of '?' in paths.

True, but this isn't really Python's problem.  And IIUC Martin's post,
it is hardly "exceptional": it isn't Python doing this, it's just
standard Windows behavior, which results in pathnames that are
perfectly acceptable to Windows APIs, but unreliable in use because
they have different semantics in different Windows APIs.  If that is
true, there are almost surely user programs that depend on this
behavior, even though it sucks.[1]

My original "hearty +1" was dependent on my understanding from
Victor's post that this substitution could cause later exceptions
because the filename is invalid (eg, contains illegal characters causing
Windows to signal an error).  If that's not true, I think the proper
remedy is to add a strong warning to pylint that use of those APIs is
supported (eg, for interaction with existing programs that use them)
but that they require careful error-checking for robust use.

As a card-carrying Unicode nazi I wouldn't mind tagging the bytes APIs
with a DeprecationWarning but I know that proposal is going nowhere so
I withdraw it in advance. <wink>


Footnotes: 
[1]  Note that the original rationale for this was surely "since users
will have a very hard time using file names with this character in
them, using it as a substitution character internally will make the
problem evident and Sufficiently Smart Programs can deal with it."



Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-24 Thread Nick Coghlan
On Tue, Oct 25, 2011 at 8:57 AM, Victor Stinner
victor.stin...@haypocalc.com wrote:
> The ANSI API uses MultiByteToWideChar (decode) and WideCharToMultiByte
> (encode) functions in the default mode (flags=0): MultiByteToWideChar()
> replaces undecodable bytes by '?' and WideCharToMultiByte() ignores
> unencodable characters (!!!). This behaviour produces invalid filenames (see
> for example the issue #13247) and *the user is unable to detect codec errors*.
>
> In Python 3.2, I changed the MBCS codec to make it strict: it raises a
> UnicodeEncodeError if a character cannot be encoded to the ANSI code page
> (e.g. encode Ł to cp1252) and a UnicodeDecodeError if a character cannot be
> decoded from the ANSI code page (e.g. b'\xff' from cp932).
>
> I propose to reuse our MBCS codec in strict mode (error handler=strict), to
> notice encode/decode errors directly, with the Windows native (wide character)
> API. It should simplify the source code: replace 2 versions of a function by 1
> version + optional code to decode arguments and/or encode the result.

So we'd be taking existing failures that appear at whatever point the
corrupted filename is used and replacing them with explicit failures
at the point where the offending string is converted to or from
encoded bytes? That sounds reasonable to me, and a lot closer to the
way Python behaves on POSIX based systems.
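
A minimal sketch of that "fail at the point of conversion" behaviour. It uses
Python's portable cp1252 codec in place of the Windows-only mbcs codec so that
it runs anywhere, and encode_filename is a hypothetical helper written for
this example, not a function from posixmodule.c or the proposed patch.

```python
def encode_filename(name, encoding="cp1252"):
    # errors="strict" raises UnicodeEncodeError instead of substituting '?'.
    return name.encode(encoding, errors="strict")

# U+00E9 (e with acute accent) exists in cp1252, so this succeeds.
encoded = encode_filename("h\u00e9ho.txt")
print(encoded)

# U+0141 (L with stroke) does not exist in cp1252, so the error shows up
# here, at conversion time, not later when the name is used.
try:
    encode_filename("\u0141.txt")
    failed_early = False
except UnicodeEncodeError as exc:
    failed_early = True
    print("unencodable character at index", exc.start)
```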

Regards,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Use our strict mbcs codec instead of the Windows ANSI API

2011-10-24 Thread Mark Hammond

+1 from me!

Mark

On 25/10/2011 9:57 AM, Victor Stinner wrote:

> Hi,
>
> I propose to raise Unicode errors if a filename cannot be decoded on Windows,
> instead of creating bogus filenames with question marks. Because this change
> is incompatible with Python 3.2, even if such filenames are unusable and I
> consider the problem as a (Python?) bug, I would like your opinion on such a
> change before working on a patch.

--

> Windows has worked internally on Unicode strings since Windows 95 (or
> something like that), but also provides an ANSI API using the ANSI code page
> and byte strings for backward compatibility. It was already proposed to
> completely drop the bytes API in our nt (os) module, but it may break Python
> backward compatibility (and it is difficult to list Python programs using the
> bytes API to access the file system).
>
> The ANSI API uses MultiByteToWideChar (decode) and WideCharToMultiByte
> (encode) functions in the default mode (flags=0): MultiByteToWideChar()
> replaces undecodable bytes by '?' and WideCharToMultiByte() ignores
> unencodable characters (!!!). This behaviour produces invalid filenames (see
> for example the issue #13247) and *the user is unable to detect codec errors*.
>
> In Python 3.2, I changed the MBCS codec to make it strict: it raises a
> UnicodeEncodeError if a character cannot be encoded to the ANSI code page
> (e.g. encode Ł to cp1252) and a UnicodeDecodeError if a character cannot be
> decoded from the ANSI code page (e.g. b'\xff' from cp932).
>
> I propose to reuse our MBCS codec in strict mode (error handler=strict), to
> notice encode/decode errors directly, with the Windows native (wide character)
> API. It should simplify the source code: replace 2 versions of a function by 1
> version + optional code to decode arguments and/or encode the result.

--

> Read also the previous thread:
>
> [Python-Dev] Byte filenames in the posix module on Windows
> Wed Jun 8 00:23:20 CEST 2011
> http://mail.python.org/pipermail/python-dev/2011-June/111831.html
>
> --
>
> FYI I patched Python's MBCS codec again: it now correctly handles the ignore
> and replace modes (to encode and decode), and now also supports any error
> handler.
>
> --
>
> We might use PEP 383 to store undecodable bytes as surrogates (U+DC80-
> U+DCFF). But the situation is the opposite of the situation on UNIX: on
> Windows, the problem is more on encoding (text -> bytes) than on decoding
> (bytes -> text). On UNIX, problems occur when the system is misconfigured (e.g.
> wrong locale encoding). On Windows, problems occur when your application uses
> the old (ANSI) API, whereas your filesystem is fully Unicode compliant and you
> created Unicode filenames with a program using the new (Windows) API.
>
> Only a few programs are fully Unicode compliant. A lot of programs fail if a
> filename cannot be encoded to the ANSI code page (just 2 examples: Mercurial
> and Visual Studio).
>
> Victor