Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Paul Moore
2009/4/25 James Y Knight f...@fuhm.net:
 On Apr 24, 2009, at 6:05 PM, Paul Moore wrote:

 - Windows systems where broken Unicode (lone surrogates or whatever)
 isn't involved
 - Unix systems where the user's stated filesystem encoding is correct

 Can you honestly say that this isn't the vast majority of real-world
 environments? (IIRC, you are based in Japan, so it may well be true
 that the likelihood of problems is a lot higher where you are than
 where I am - the UK - but I suspect that averaging out, things are
 generally as above).

 In my experience, it is normal on most unix systems that some programs
 (mostly daemons) are running in default POSIX locale, others (most user
 programs) are running in the en_US.utf-8 locale, and some luddite users
 have set themselves to en_US.8859-1. All running on the same system.

OK, thanks for the data point.

Following on from that, would this (under Martin's proposal) result in
programs receiving encoded strings, or just semantically-incorrect
ones?

Specifically, the 8859-1 case cannot result in encoded strings, as
8859-1 can represent all byte strings (possibly garbled, but at least
validly). The utf8 case can hit unrepresentable bytes, but only if
there are characters greater than 0x7F in filenames. Is the POSIX
case ASCII? If so, then the same logic applies (>=0x80 is unrepresentable).
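
The three cases can be checked directly. The sketch below is an illustration in Python 3, not part of the proposal; the sample file name and the helper try_decode are made up for the example. It decodes one Latin-1 encoded name strictly under each of the locale encodings mentioned above.

```python
# A name containing one high-bit byte, as a Latin-1 user might create it.
raw = b"caf\xe9"

def try_decode(data, encoding):
    """Return the strictly decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

results = {enc: try_decode(raw, enc)
           for enc in ("iso-8859-1", "utf-8", "ascii")}
for enc, text in results.items():
    print(enc, "->", repr(text))

# ISO-8859-1 assigns a character to every byte value, so it never fails.
all_bytes_decodable = len(bytes(range(256)).decode("iso-8859-1")) == 256
```

Only the 8859-1 decode succeeds for this input. Whether the resulting string is the name the user intended is a separate question.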

So, the next question is - do people on such systems frequently use
high-bit characters in filenames?

Paul.

PS Unfortunately, I suspect that the biggest group of people likely to
be hit badly by this is people using non-latin scripts. And arguing
probabilities without real data is optimistic at best. But those
people are also the *least* likely people to contribute on an
English-speaking list, I guess :-( (Sincere apologies if everyone but
me on this list happens to actually be fluent English-speaking
Russians :-))
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Deprecating PyOS_ascii_formatd

2009-04-25 Thread Eric Smith

Benjamin Peterson wrote:

2009/4/24 Eric Smith e...@trueblade.com:

My proposal is to deprecate PyOS_ascii_formatd in 3.1 and remove it in
3.2.

Having heard no dissent, I'd like to go ahead and deprecate this API. What
are the mechanics of deprecating this? Just documentation, or is there
something I should do in the code to generate a warning? Any pointers to
examples would be great.


You can use PyErr_WarnEx().


Thanks. I created issue 5835 to track this. I marked it as a release 
blocker, but I should have no problem finishing it up this weekend.



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
Cameron Simpson wrote:
 On 22Apr2009 08:50, Martin v. Löwis mar...@v.loewis.de wrote:
 | File names, environment variables, and command line arguments are
 | defined as being character data in POSIX;
 
 Specific citation please? I'd like to check the specifics of this.

For example, on environment variables:

http://opengroup.org/onlinepubs/007908799/xbd/envvar.html

# For values to be portable across XSI-conformant systems, the value
# must be composed of characters from the portable character set (except
# NUL and as indicated below).

# Environment variable names used by the utilities in the XCU
# specification consist solely of upper-case letters, digits and the _
# (underscore) from the characters defined in Portable Character Set.
# Other characters may be permitted by an implementation;

Or, on command line arguments:

http://opengroup.org/onlinepubs/007908799/xsh/execve.html

# The arguments represented by arg0, ... are pointers to null-terminated
# character strings

where a character string is "A contiguous sequence of characters
terminated by and including the first null byte.", and a character
is

# A sequence of one or more bytes representing a single graphic symbol
# or control code. This term corresponds to the ISO C standard term
# multibyte character (multi-byte character), where a single-byte
# character is a special case of a multi-byte character. Unlike the
# usage in the ISO C standard, character here has no necessary
# relationship with storage space, and byte is used when storage space
# is discussed.

 So you're proposing that all POSIX OS interfaces (which use byte strings)
 interpret those byte strings into Python3 str objects, with a codec
 that will accept arbitrary byte sequences losslessly and is totally
 reversible, yes?

Correct.

 And, I hope, that the os.* interfaces silently use it by default.

Correct.

 | Applications that need to process the original byte
 | strings can obtain them by encoding the character strings with the
 | file system encoding, passing python-escape as the error handler
 | name.
 
 -1
 
 This last sentence kills the idea for me, unless I'm missing something.
 Which I may be, of course.
 
 POSIX filesystems _do_not_ have a file system encoding.

Why is that a problem for the PEP?

 If I'm writing a general purpose UNIX tool like chmod or find, I expect
 it to work reliably on _any_ UNIX pathname. It must be totally encoding
 blind. If I speak to the os.* interface to open a file, I expect to hand
 it bytes and have it behave.

See the other messages. If you want to do that, you can continue to.

 I'm very much in favour of being able to work in strings for most
 purposes, but if I use the os.* interfaces on a UNIX system it is
 necessary to be _able_ to work in bytes, because UNIX file pathnames
 are bytes.

Please re-read the PEP. It provides a way of being able to access any
POSIX file name correctly, and still pass strings.

 If there isn't a byte-safe os.* facility in Python3, it will simply be
 unsuitable for writing low level UNIX tools.

Why is that? The mechanism in the PEP is precisely defined to allow
writing low level UNIX tools.

 Finally, I have a small python program whose whole purpose in life
 is to transcode UNIX filenames before transfer to a MacOSX HFS
 directory, because of HFS's enforced particular encoding. What approach
 should a Python app take to transcode UNIX pathnames under your scheme?

Compute the corresponding character strings, and use them.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 | 2. Even if they were taken away (which the PEP does not propose to do),
 |    it would be easy to emulate them for applications that want them.
 |    For example, listdir could be wrapped as
 |
 |    def listdir_b(bytestring):
 |        fse = sys.getfilesystemencoding()
 
 Alas, no

No, what? No, that algorithm would be incorrect?

 because there is no sys.getfilesystemencoding() at the POSIX
 level. It's only the user's current locale stuff on a UNIX system, and
 has _nothing_ to do with the filesystem because UNIX filesystems don't
 have encodings.

So can you produce a specific example where my proposed listdir_b
function would fail to work correctly?
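
For reference, a complete wrapper along these lines might look as follows. This is a sketch, not the exact text under discussion: the quoted fragment is cut off after its second line, and the error handler is spelled surrogateescape here, which is the name it carries in released Python 3, whereas this thread calls it python-escape. The temporary directory and file name are only there to exercise the function.

```python
import os
import sys
import tempfile

def listdir_b(dirname_bytes):
    """Yield directory entries as bytes, built on the str-based os.listdir."""
    fse = sys.getfilesystemencoding()
    # Decode the directory name the same way the escaping scheme would.
    dirname = dirname_bytes.decode(fse, "surrogateescape")
    for name in os.listdir(dirname):
        # name is a str; any undecodable byte shows up as a lone
        # surrogate and is turned back into the original byte here.
        yield name.encode(fse, "surrogateescape")

with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "plain.txt"), "w").close()
    tmp_bytes = tmp.encode(sys.getfilesystemencoding(), "surrogateescape")
    entries = list(listdir_b(tmp_bytes))
print(entries)
```

The wrapper never needs the file system encoding to be the right one; it only needs decoding and encoding to use the same one.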

For it to work, it is not necessary that POSIX has no notion of
character sets on the file system level (which is actually not true -
POSIX very well recognizes the notion of character sets for file
names, and recommends that you restrict yourself to the portable
character set).

 In particular, because the best (or to my mind misleading) you
 can do for this is report what the current user thinks:
   http://docs.python.org/library/sys.html#sys.getfilesystemencoding
 then there's no guarantee that what is chosen has any relationship to
 what was in use when the files being consulted were made.

For this PEP, it's irrelevant. It will work even if the chosen encoding
is a bad choice.

 Now, if I were writing listdir_b() I'd want to be able to do something
 along these lines:
   - set LC_ALL=C (or some equivalent mechanism)
   - have os.listdir() read bytes as numeric values and transcode their values
 _directly_ into the corresponding Unicode code points.
   - yield bytes( ord(c) for c in os_listdir_string )
   - have os.open() et al transcode unicode code points back into bytes.
 i.e. a straight one-to-one mapping, using only codepoints in the range
 1..255.

That would be an alternative approach to the same problem (and one that
I think will fail more badly than the one I'm proposing).
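
The one-to-one mapping described in the quoted list is what the Latin-1 codec already does, so its behaviour can be sketched without changing the locale. The example below is an illustration only; the sample name is an assumption, and it shows the mapping itself rather than any os.* integration.

```python
# Latin-1 maps byte value N to code point N, for all 256 values.
every_byte = bytes(range(256))
as_text = every_byte.decode("latin-1")
one_to_one = [ord(c) for c in as_text] == list(range(256))
lossless = as_text.encode("latin-1") == every_byte

# A name that was written to disk as UTF-8 ("naive" with a diaeresis).
intended = "na\u00efve.txt"
on_disk = intended.encode("utf-8")

# Under the one-to-one scheme each byte becomes its own character, so
# the two-byte UTF-8 sequence turns into two unrelated characters.
seen = on_disk.decode("latin-1")
print(repr(seen))
```

The bytes survive, but every non-ASCII name stored as UTF-8 comes out as mojibake, even on a system where the locale is set correctly.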

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
Simon Cross wrote:
 Unfortunately, for Windows, the situation would
 be exactly the opposite: the byte-oriented interface cannot represent
 all data; only the character-oriented API can.
 
 Is the second part of this actually true? My understanding may be
 flawed, but surely all Unicode data can be converted to and from bytes
 using UTF-8?

[I hope, by second part, you refer to the part that I left]

It's true that UTF-8 could represent all Windows file names. However,
the byte-oriented APIs of Windows do not use UTF-8, but instead, they
use the Windows ANSI code page (which varies with the installation).

 Given this, can't people who
 must have access to all files / environment data just use the bytes
 interface?

No, because the Windows API would interpret the bytes differently,
and not find the right file.
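
A small illustration of both points, in Python 3. The code page cp1252 is an assumption standing in for whatever ANSI code page a given installation uses, and the Cyrillic name is invented for the example.

```python
# A Cyrillic file name, written with escapes to keep the source ASCII.
name = "\u043e\u0442\u0447\u0451\u0442.txt"

def fits(text, code_page):
    """True if every character of text exists in the given code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False

fits_cp1252 = fits(name, "cp1252")   # Western European code page
fits_cp1251 = fits(name, "cp1251")   # Cyrillic code page

# Handing UTF-8 bytes to an API that assumes cp1252 yields another name.
misread = name.encode("utf-8").decode("cp1252", "replace")
print(fits_cp1252, fits_cp1251, misread == name)
```

So a byte interface tied to one code page cannot name this file on a cp1252 system, and UTF-8 bytes passed through it would refer to a differently spelled file.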

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 The problem with this, and other preceding schemes that have been
 discussed here, is that there is no means of ascertaining whether a
 particular file name str was obtained from a str API, or was funny-
 decoded from a bytes API... and thus, there is no means of reliably
 ascertaining whether a particular filename str should be passed to a
 str API, or funny-encoded back to bytes.

Why is it necessary that you are able to make this distinction?

 Picking a character (I don't find U+F01xx in the
 Unicode standard, so I don't know what it is)

It's a private use area. It will never carry an official character
assignment.

 As I realized in the email-sig, in talking about decoding corrupted
 headers, there is only one way to guarantee this... to encode _all_
 character sequences, from _all_ interfaces.  Basically it requires
 reserving an escape character (I'll use ? in these examples -- yes, an
 ASCII question mark -- happens to be illegal in Windows filenames so
 all the better on that platform, but the specific character doesn't
 matter... avoiding / \ and . is probably good, though).

I think you'll have to write an alternative PEP if you want to see
something like this implemented throughout Python.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 Humour aside :), the expectation that filenames are Unicode data
 simply doesn't agree with the reality of POSIX file systems.  I think
 an approach similar to that adopted by glib [1] could work

Are you saying that the approach presented in the PEP will not work?
I believe it would work no matter whether that expectation agrees
with reality or not. The amount of moji-bake that you get is larger
when the disagreement is larger, but it will continue to *work*.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 The part that I haven't seen clearly addressed so far is what happens
 when disks get mounted across OSes (e.g. NFS).
 
 While I agree that there should be a layer on top that can handle most
 situations, it also seems clear that the raw layer needs to be readily
 accessible.

Indeed, with the PEP, the raw layer does remain readily available. If
you know that it was originally bytes, you can get the very same bytes
back if you want to.

However, for disks mounted across OSes, you won't have to, normally.
If you think there is a problem with these, can you please describe a
specific scenario? What application, what file names, what encodings,
what problems?

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 [1] Actually, all the PEP says is "With this PEP, a uniform treatment
 of these data as characters becomes possible." An argument as to why
 this is a good thing would be a useful addition to the PEP. At the
 moment it's more or less treated as
 self-evident - which I agree with, but which clearly the Unix people
 here are not as certain of.

Ok, I have added another paragraph. Not sure whether it helps to clarify
though.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 Because the encoding is not reliably reversible.

Why do you say that? The encoding is completely reversible
(unless we disagree on what reversible means).

 I'm +1 on the concept, -1 on the PEP, due solely to the lack of a
 reversible encoding.

Then please provide an example for a setup where it is not reversible.
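
One way to look for such an example is a brute-force spot check. The sketch below uses the handler as it exists in released Python 3 under the name surrogateescape; the choice of encodings and the restriction to one- and two-byte inputs are assumptions made to keep the run short, so this is evidence rather than proof.

```python
import itertools

ENCODINGS = ("utf-8", "ascii", "iso-8859-1")

def roundtrips(data, encoding):
    """Decode with escaping, re-encode, and compare with the input."""
    text = data.decode(encoding, "surrogateescape")
    return text.encode(encoding, "surrogateescape") == data

# Every possible one-byte and two-byte sequence.
samples = [bytes(combo)
           for length in (1, 2)
           for combo in itertools.product(range(256), repeat=length)]

failures = [(sample, enc)
            for sample in samples
            for enc in ENCODINGS
            if not roundtrips(sample, enc)]
print(len(samples), "samples,", len(failures), "failures")
```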

Regards,
Martin



Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 Following on from that, would this (under Martin's proposal) result in
 programs receiving encoded strings, or just semantically-incorrect
 ones?

Not sure I understand the question - what is an encoded string?

As you analyse below, sometimes, the current (2.x) file system encoding
will do the right thing; sometimes, it will decode successfully, but
still not give the intended string, and sometimes, it will fail. With
the PEP, it won't fail, but give a string back that likely wasn't
intended by the user. This might be confusing if you try to render it to
a user interface; if the application merely passes it back to file
system APIs, it will work fine.

 So, the next question is - do people on such systems frequently use
 high-bit characters in filenames?

They typically do until they run into problems. For example, if they
set the locale to something, and then create files in their
home directory, it will work just fine, and nobody else will ever see
the files (except for the backup software).

When they find that the files they created are inaccessible to others,
they will often stop using funny characters.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 If the bytes are mapped to single half surrogate codes instead of the
 normal pairs (low+high), then I can see that decoding could never be
 ambiguous and encoding could produce the original bytes.

I was confused by Markus Kuhn's original UTF-8b specification. I have
now changed the PEP to avoid using PUA characters at all.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread MRAB

Martin v. Löwis wrote:

If the bytes are mapped to single half surrogate codes instead of the
normal pairs (low+high), then I can see that decoding could never be
ambiguous and encoding could produce the original bytes.


I was confused by Markus Kuhn's original UTF-8b specification. I have
now changed the PEP to avoid using PUA characters at all.


I find the PEP easier to understand now.

In detail I'd say that if a sequence of bytes >=0x80 is found which is
not valid UTF-8, then the first byte is mapped to a half surrogate and
then decoding is continued from the next byte.
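
As a concrete illustration of this rule, here is how the handler behaves in released Python 3, where it is named surrogateescape. The byte strings are invented for the example: each offending byte 0xNN becomes the lone surrogate U+DCNN, and valid sequences around it decode normally.

```python
# Two stray high bytes in the middle of ASCII text.
raw = b"abc\xe4\xf6def"
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# An invalid byte followed by a valid two-byte sequence (e with acute).
mixed = b"\xff\xc3\xa9".decode("utf-8", "surrogateescape")
print(ascii(mixed))

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", "surrogateescape")
```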

The only drawback I can see is if the UTF-8 bytes actually decode to a
half surrogate. However, half surrogates should really only occur in
UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8
anyway!

As for handling this case, you could either:

1. Raise an exception (which is what you're trying to avoid)

or:

2. Treat it as invalid UTF-8 and map the bytes to half surrogates
(encoding would produce the original bytes).

I'd prefer option 2.
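
Option 2 can be observed in released Python 3, whose UTF-8 decoder rejects encoded surrogates and whose surrogateescape handler then escapes the bytes one by one. The sketch below is an illustration of that behaviour, not a statement about what the PEP text said at this point in the thread.

```python
# The three bytes a lenient encoder would produce for U+DC00.
raw = b"\xed\xb0\x80"

try:
    raw.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

# With escaping, the bytes are treated as invalid and mapped to
# lone surrogates, so nothing is lost and nothing is ambiguous.
text = raw.decode("utf-8", "surrogateescape")
all_escapes = all(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)
restored = text.encode("utf-8", "surrogateescape")
print(ascii(text))
```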

Anyway, +1 from me.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Paul Moore
2009/4/25 Martin v. Löwis mar...@v.loewis.de:
 Following on from that, would this (under Martin's proposal) result in
 programs receiving encoded strings, or just semantically-incorrect
 ones?

 Not sure I understand the question - what is an encoded string?

Sorry. I was struggling to come up with terminology for the various
concepts I was trying to express, as I went along.

I was meaning a string which has been created from a non-decodable
byte sequence using the encoding process you specify in the PEP (with
the current version of the PEP, this would be a string with lone half
surrogate codes).

I was distinguishing these because some people seemed to be implying
that such strings were the ones which would result in exceptions. (I
think that was Stephen, when he referred to a careful API).

 As you analyse below, sometimes, the current (2.x) file system encoding
 will do the right thing; sometimes, it will decode successfully, but
 still not give the intended string, and sometimes, it will fail. With
 the PEP, it won't fail, but give a string back that likely wasn't
 intended by the user. This might be confusing if you try to render it to
 a user interface; if the application merely passes it back to file
 system APIs, it will work fine.

OK, looks like my analysis matches yours, except that I wasn't sure if
the third case (a string that likely wasn't intended) could result
in exceptions. From what you're saying, it sounds like it would
actually be similar to the second case - I'm not clear on how
surrogates work, though.

 So, the next question is - do people on such systems frequently use
 high-bit characters in filenames?

 They typically do until they run into problems. For example, if they
 set the locale to something, and then create files in their
 home directory, it will work just fine, and nobody else will ever see
 the files (except for the backup software).

 When they find that the files they created are inaccessible to others,
 they will often stop using funny characters.

Which sounds fairly practical - and the irony of someone with a funny
character in his surname telling me this hasn't escaped me :-)

Paul.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 OK, looks like my analysis matches yours, except that I wasn't sure if
 the third case (a string that likely wasn't intended) could result
 in exceptions. From what you're saying, it sounds like it would
 actually be similar to the second case - I'm not clear on how
 surrogates work, though.

On decoding, there is a guarantee that it decodes successfully. There is
also a guarantee that the result will re-encode successfully, and yield
the same byte string.

If you pass a different string into encoding, you still may get
exceptions. For example, if the filesystem encoding is latin-1,
passing u"\u20ac" will continue to raise exceptions, even under the
python-escape error handler - that error handler will only handle
surrogates.
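
The asymmetry can be sketched in Python 3, using the handler under its released name surrogateescape. The byte 0xFF and the euro sign are just sample inputs.

```python
# A string produced by escaping always re-encodes, even to Latin-1.
escaped = b"\xff".decode("ascii", "surrogateescape")
back = escaped.encode("latin-1", "surrogateescape")

# A string from elsewhere gets no help: the euro sign is not in
# Latin-1 and is not an escape surrogate, so encoding still fails.
try:
    "\u20ac".encode("latin-1", "surrogateescape")
    euro_failed = False
except UnicodeEncodeError:
    euro_failed = True
print(back, euro_failed)
```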

There isn't really that much trickery to surrogates. They *have*
to come in pairs to be meaningful, with the first one in the range
D800..DBFF (high surrogate), and the second in the range DC00..DFFF
(low surrogate). Having a lone low surrogate is not meaningful; this
is how the escaping works.

Proper surrogate pairs encode characters outside the BMP, for use with
UTF-16: each code contributes 10 bits (just count how many codes there
are in each of the two ranges: 1024), together, a pair encodes 20 bits,
allowing for 2**20 characters, starting at U+10000.
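
The arithmetic can be written out as a short worked example. The function names are invented for this sketch; the constants are the standard UTF-16 ones, with high surrogates in D800..DBFF and low surrogates in DC00..DFFF.

```python
def to_surrogate_pair(code_point):
    """Split a code point above the BMP into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000           # a 20-bit number
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Inverse of to_surrogate_pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

codes_per_range = len(range(0xD800, 0xDC00))   # 2**10
pair = to_surrogate_pair(0x1F600)
print([hex(unit) for unit in pair])
```

Python's own UTF-16 codec can serve as a cross-check: encoding "\U0001F600" as utf-16-be gives the same two 16-bit units.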

 When they find that the files they created are inaccessible to others,
 they will often stop using funny characters.
 
 Which sounds fairly practical - and the irony of someone with a funny
 character in his surname telling me this hasn't escaped me :-)

Sure: my Unix account name was always loewis, and even on Windows,
our admins didn't dare to put the umlaut into the account name - it
would be difficult to login with a US keyboard, for example. People
who use non-ASCII characters in filenames around here are primarily
non-IT people who aren't aware that these characters are different
from the rest.

I recognize that for other languages (without trivial transliterations)
the problem is more severe, and people are more likely to create
files with Cyrillic, or Japanese, names (say) if the system accepts
them at all.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 The only drawback I can see is if the UTF-8 bytes actually decode to a
 half surrogate. However, half surrogates should really only occur in
 UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8
 anyway!

Right: that's the rationale for UTF-8b. Encoding half surrogates
violates parts of the Unicode spec, so UTF-8b is safe.

 As for handling this case, you could either:
 
 1. Raise an exception (which is what you're trying to avoid)
 
 or:
 
 2. Treat it as invalid UTF-8 and map the bytes to half surrogates
 (encoding would produce the original bytes).
 
 I'd prefer option 2.

I hadn't thought of this case, but you are right - they *are*
illegal bytes, after all. Raising an exception would be useless
since the whole point of this codec is to never raise unicode
errors.

Regards,
Martin


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Zooko O'Whielacronx
Thanks for writing this PEP 383, MvL.  I recently ran into this  
problem in Python 2.x in the Tahoe project [1].  The Tahoe project  
should be considered a good use case showing what some people need.   
For example, the assumption that a file will later be written back  
into the same local filesystem (and thus luckily use the same  
encoding) from which it originally came doesn't hold for us, because  
Tahoe is used for file-sharing as well as for backup-and-restore.


One of my first conclusions in pursuing this issue is that we can  
never use the Python 2.x unicode APIs on Linux, just as we can never  
use the Python 2.x str APIs on Windows [2].  (You mentioned this  
ugliness in your PEP.)  My next conclusion was that the Linux way of  
doing encoding of filenames really sucks compared to, for example,  
the Mac OS X way.  I'm heartened to see that David Wheeler is trying
to persuade the maintainers of Linux filesystems to improve some of
this: [3].


My final conclusion was that we needed to have two kinds of  
workaround for the Linux suckage: first, if decoding using the  
suggested filesystem encoding fails, then we fall back to mojibake  
[4] by decoding with iso-8859-1 (or else with windows-1252 -- I'm not  
sure if it matters and I haven't yet understood if utf-8b offers  
another alternative for this case).  Second, if decoding succeeds  
using the suggested filesystem encoding on Linux, then write down the  
encoding that we used and include that with the filename.  This  
expands the size of our filenames significantly, but it is the only  
way to allow some future programmer to undo the damage of a falsely- 
successful decoding.  Here's our whole plan: [5].
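
A rough sketch of the two workarounds, in Python 3. The function names, the return shape and the sample bytes are all hypothetical and are not taken from the Tahoe code or from the ticket in [5]; the sketch only illustrates why recording the encoding makes a falsely-successful decoding recoverable.

```python
FALLBACK = "iso-8859-1"

def decode_filename(raw, suggested_encoding):
    """Return (text, encoding_used, fell_back) for a raw file name."""
    try:
        return raw.decode(suggested_encoding), suggested_encoding, False
    except UnicodeDecodeError:
        # ISO-8859-1 accepts every byte, so this branch cannot fail.
        return raw.decode(FALLBACK), FALLBACK, True

def undo(text, recorded_encoding, real_encoding):
    """Re-decode a stored name once the real encoding is known."""
    return text.encode(recorded_encoding).decode(real_encoding)

# A UTF-8 name on a system that claims ISO-8859-1: decoding "succeeds"
# but produces mojibake.
text, used, fell_back = decode_filename(b"caf\xc3\xa9", "iso-8859-1")
fixed = undo(text, used, "utf-8")

# A name that does not decode under the suggested encoding at all.
forced = decode_filename(b"\xff", "utf-8")
print(ascii(text), ascii(fixed), forced)
```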


Regards,

Zooko

[1] http://allmydata.org
[2] http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html (see the footnote of this message)
[3] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[4] http://en.wikipedia.org/wiki/Mojibake
[5] http://allmydata.org/trac/tahoe/ticket/534#comment:47


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Oleg Broytmann
On Sat, Apr 25, 2009 at 05:00:17PM +0200, Martin v. Löwis wrote:
 I recognize that for other languages (without trivial transliterations)
 the problem is more severe, and people are more likely to create
 files with Cyrillic, or Japanese, names (say) if the system accepts
 them at all.

   In different encodings on the same filesystem...

Oleg.
-- 
 Oleg Broytmann            http://phd.pp.ru/            p...@phd.pp.ru
   Programmers don't die, they just GOSUB without RETURN.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Michael Urman
On Sat, Apr 25, 2009 at 10:00, Martin v. Löwis mar...@v.loewis.de wrote:
 On decoding, there is a guarantee that it decodes successfully. There is
 also a guarantee that the result will re-encode successfully, and yield
 the same byte string.

 If you pass a different string into encoding, you still may get
 exceptions. For example, if the filesystem encoding is latin-1,
 passing u"\u20ac" will continue to raise exceptions, even under the
 python-escape error handler - that error handler will only handle
 surrogates.

One angle I've not seen discussed yet is a set of use cases. While the
PEP addresses the need for the python developer to not have to write
insane conditional code that maps between bytes and str depending on
the platform, it doesn't talk about what this allows an application to
provide to a user, and at what risks.

I see two main user-oriented use cases for the resulting Unicode
strings this PEP will produce on all systems: displaying a list of
filenames for the user to select from (an open file dialog), and
allowing a user to edit or supply a filename (a save dialog or a
rename control).

It's clear what this PEP provides for the former. On well-behaved
systems where a simpler filesystemencoding approach would work, the
results are identical; the user can select filenames that are what he
expects to see on both Unix and Windows. On less well-behaved systems,
some characters may appear as junk in the middle of the name (or would
they be invisible?), but should be recognizable enough to choose, or
at least to open sequentially and remember what the last one was. On
particularly poorly behaved systems, the results will be extremely
difficult to read, but no approach is likely to fix this.

What I don't find clear is what the risks are for the latter. On the
less well behaved system, a user may well attempt to use this python
application to fix filenames. Can we estimate a likelihood that edits
to the names would result in a Unicode string that can no longer be
encoded with the python-escape? Will a new name fully provided by a
user on his keyboard (ignoring copy and paste) almost always safely
encode?

-- 
Michael Urman


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
 I see two main user-oriented use cases for the resulting Unicode
 strings this PEP will produce on all systems: displaying a list of
 filenames for the user to select from (an open file dialog), and
 allowing a user to edit or supply a filename (a save dialog or a
 rename control).

There are more, in particular the case "user passes a file name
on the command line", and "web server passes URL in environment
variable".

 It's clear what this PEP provides for the former. On well-behaved
 systems where a simpler filesystemencoding approach would work, the
 results are identical; the user can select filenames that are what he
 expects to see on both Unix and Windows. On less well-behaved systems,
 some characters may appear as junk in the middle of the name (or would
 they be invisible?)

Depends on the rendering. Try print u'\udc00' in your terminal to see
what happens; for me, it renders the glyph for replacement character.
In GUI applications, you often see white boxes (rectangles).
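
For an application that wants to control what the user sees, one possible approach in Python 3 is to convert the name for display explicitly instead of leaving it to the terminal. The sketch below assumes the released handler name surrogateescape and an invented file name; in Python 3 a strict UTF-8 encode of such a string is refused, so the display copy is made with a more lenient handler.

```python
# A name with one undecodable byte, as the escaping scheme returns it.
name = b"report\xff.txt".decode("utf-8", "surrogateescape")

try:
    name.encode("utf-8")
    strict_encodes = True
except UnicodeEncodeError:
    strict_encodes = False

# Two display-only variants; neither can be passed back to the
# file system, so the original string must be kept for that.
shown_replace = name.encode("utf-8", "replace").decode("utf-8")
shown_escape = name.encode("utf-8", "backslashreplace").decode("utf-8")
print(shown_replace)
print(shown_escape)
```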

 What I don't find clear is what the risks are for the latter. On the
 less well behaved system, a user may well attempt to use this python
 application to fix filenames. Can we estimate a likelihood that edits
 to the names would result in a Unicode string that can no longer be
 encoded with the python-escape? Will a new name fully provided by a
 user on his keyboard (ignoring copy and paste) almost always safely
 encode?

That very much depends on the system setup, and your impression is
right that the PEP doesn't address it - it only deals with cases
where you get random unsupported bytes; getting random unsupported
characters from the user is not considered.

If the user has the locale set up in a way that matches his keyboard,
it should work all fine - and will already, even without the PEP.
If the user enters a character that doesn't directly map to a
good file name, you get an exception, and have to tell the user
to pick a different filename.

Notice that it may fail at several layers:
- it may be that characters entered are not supported in what
  Python chooses as the file system encoding.
- it may be that the characters are not supported by the file
  system, e.g. leading spaces in Win32.
- it may be that the file cannot be renamed because the target
  name already exists.
In all these cases, the application has to ask the user to
reconsider; for at least the last case, it should be prepared
to do that, anyway (there is also the case where renaming fails
because of lack of permissions; in that case, picking a different
file name won't help).
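
The layers above could be told apart in a rename helper roughly as follows. This is only a sketch, assuming Python 3 exception classes (UnicodeEncodeError, PermissionError, OSError); the helper name try_rename and its message strings are hypothetical and not part of the PEP.

```python
import os


def try_rename(old, new):
    """Return None on success, or a message the application can show the user."""
    try:
        # Checked explicitly because POSIX rename() silently replaces an existing file.
        if os.path.lexists(new):
            return "target name already exists; pick a different name"
        os.rename(old, new)
    except UnicodeEncodeError:
        # The new name is not representable in the file system encoding.
        return "name cannot be encoded; pick a different name"
    except PermissionError:
        # Picking a different name will not help in this case.
        return "permission denied"
    except OSError as exc:
        # Anything else the system rejects, e.g. a name the file system
        # does not allow or a source file that does not exist.
        return "rename rejected: %s" % exc.strerror
    return None
```

The explicit existence check is racy and is shown only to make the "target already exists" layer visible on systems where rename would otherwise overwrite.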

Regards,
Martin

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Antoine Pitrou
Paul Moore p.f.moore at gmail.com writes:
 But those
 people are also the *least* likely people to contribute on an
 English-speaking list, I guess :-( (Sincere apologies if everyone but
 me on this list happens to actually be fluent English-speaking
 Russians :-))

Actually, we're all Finnish.

Regards,

Åntoine.




Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Michael Urman
On Sat, Apr 25, 2009 at 11:33, Martin v. Löwis mar...@v.loewis.de wrote:
 If the user has the locale set up in a way that matches his keyboard,
 it should work all fine - and will already, even without the PEP.
 If the user enters a character that doesn't directly map to a
 good file name, you get an exception, and have to tell the user
 to pick a different filename.

This sounds good so far - the 90% (or higher) case is still clean.

 Notice that it may fail at several layers:
 - it may be that characters entered are not supported in what
  Python chooses as the file system encoding.
 - it may be that the characters are not supported by the file
  system, e.g. leading spaces in Win32.
 - it may be that the file cannot be renamed because the target
  name already exists.
 In all these cases, the application has to ask the user to
 reconsider; for at least the last case, it should be prepared
 to do that, anyway (there is also the case where renaming fails
 because of lack of permissions; in that case, picking a different
 file name won't help).

This argument sounds good to me too. How will we communicate to
developers what new exception might occur where? It would be a shame
to have a solid application developed under Windows start raising
encoding exceptions on linux. Would the encoding error get mapped to
an IOError for all file APIs that do this encoding?

-- 
Michael Urman


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread MRAB

Martin v. Löwis wrote:

I see two main user-oriented use cases for the resulting Unicode
strings this PEP will produce on all systems: displaying a list of
filenames for the user to select from (an open file dialog), and
allowing a user to edit or supply a filename (a save dialog or a
rename control).


There are more, in particular the case where the user passes a file name
on the command line, and the case where a web server passes a URL in an
environment variable.


It's clear what this PEP provides for the former. On well-behaved
systems where a simpler filesystemencoding approach would work, the
results are identical; the user can select filenames that are what he
expects to see on both Unix and Windows. On less well-behaved systems,
some characters may appear as junk in the middle of the name (or would
they be invisible?)


Depends on the rendering. Try print u'\udc00' in your terminal to see
what happens; for me, it renders the glyph for replacement character.
In GUI applications, you often see white boxes (rectangles).


What I don't find clear is what the risks are for the latter. On the
less well behaved system, a user may well attempt to use this python
application to fix filenames. Can we estimate a likelihood that edits
to the names would result in a Unicode string that can no longer be
encoded with the python-escape? Will a new name fully provided by a
user on his keyboard (ignoring copy and paste) almost always safely
encode?


That very much depends on the system setup, and your impression is
right that the PEP doesn't address it - it only deals with cases
where you get random unsupported bytes; getting random unsupported
characters from the user is not considered.

If the user has the locale set up in a way that matches his keyboard,
it should work all fine - and will already, even without the PEP.
If the user enters a character that doesn't directly map to a
good file name, you get an exception, and have to tell the user
to pick a different filename.

Notice that it may fail at several layers:
- it may be that characters entered are not supported in what
  Python chooses as the file system encoding.
- it may be that the characters are not supported by the file
  system, e.g. leading spaces in Win32.
- it may be that the file cannot be renamed because the target
  name already exists.
In all these cases, the application has to ask the user to
reconsider; for at least the last case, it should be prepared
to do that, anyway (there is also the case where renaming fails
because of lack of permissions; in that case, picking a different
file name won't help).

This has made me think about what happens going the other way, ie when a 
user-supplied Unicode string needs to be converted to UTF-8b. That 
should also be reversible.


Therefore:

When encoding using UTF-8b, codepoints in the range U+DC80..U+DCFF
should map to bytes 0x80..0xFF; all other codepoints, including the
remaining half surrogates, should be encoded normally.

When decoding using UTF-8b, undecodable bytes in the range 0x80..0xFF
should map to U+DC80..U+DCFF; all other bytes, including the encodings
for the remaining half surrogates, should be decoded normally.

This will ensure that even when the user has provided a string
containing half surrogates it can be encoded to bytes and then decoded
back to the original string.
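
For comparison, the round trip can be tried on a Python 3 interpreter. This is a sketch under an assumption: it uses the surrogateescape error handler, which is the name under which the handler discussed in this thread as python-escape is available in Python 3.1 and later. The sample bytes are made up for illustration.

```python
raw = b"caf\xe9.txt"  # Latin-1 bytes; 0xE9 is not valid UTF-8 here

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
name = raw.decode("utf-8", "surrogateescape")

# Encoding maps U+DCE9 back to the byte 0xE9, restoring the original bytes.
restored = name.encode("utf-8", "surrogateescape")

# Half surrogates outside U+DC80..U+DCFF are not covered by this handler,
# so encoding them still fails rather than being "encoded normally".
try:
    "\ud800".encode("utf-8", "surrogateescape")
    other_surrogate_ok = True
except UnicodeEncodeError:
    other_surrogate_ok = False
```

So that handler round-trips bytes to str and back, but it does not give the str-to-bytes-to-str guarantee for arbitrary half surrogates that is asked for above.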


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Jeroen Ruigrok van der Werven
-On [20090425 11:01], Paul Moore (p.f.mo...@gmail.com) wrote:
PS Unfortunately, I suspect that the biggest group of people likely to
be hit badly by this is people using non-latin scripts. And arguing
probabilities without real data is optimistic at best. But those
people are also the *least* likely people to contribute on an
English-speaking list, I guess :-( (Sincere apologies if everyone but
me on this list happens to actually be fluent English-speaking
Russians :-))

Even though I am Dutch I have to deal with a variety of scripts for my i18n
and L10n efforts, which includes contributions to Unicode. Aside from that I
also have the fair share of audio files which have the names/descriptions in
the respective script (Thai, Korean, Chinese, Taiwanese, Japanese, and so
on).

-- 
Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Necessity relieves us of the ordeal of choice...


Re: [Python-Dev] [Python-checkins] r71946 - peps/trunk/pep-0315.txt

2009-04-25 Thread Eric Smith
You might want to note in the PEP that the problem that's being solved
is known as the "loop and a half" problem.


http://www.cs.duke.edu/~ola/patterns/plopd/loops.html#loop-and-a-half

raymond.hettinger wrote:

Author: raymond.hettinger
Date: Sun Apr 26 02:34:36 2009
New Revision: 71946

Log:
Revive PEP 315.

Modified:
   peps/trunk/pep-0315.txt

Modified: peps/trunk/pep-0315.txt
==
--- peps/trunk/pep-0315.txt (original)
+++ peps/trunk/pep-0315.txt Sun Apr 26 02:34:36 2009
@@ -2,9 +2,9 @@
 Title: Enhanced While Loop
 Version: $Revision$
 Last-Modified: $Date$
-Author: W Isaac Carroll icarr...@pobox.com
-Raymond Hettinger pyt...@rcn.com
-Status: Deferred
+Author: Raymond Hettinger pyt...@rcn.com
+W Isaac Carroll icarr...@pobox.com
+Status: Draft
 Type: Standards Track
 Content-Type: text/plain
 Created: 25-Apr-2003
___
Python-checkins mailing list
python-check...@python.org
http://mail.python.org/mailman/listinfo/python-checkins





Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Cameron Simpson
On 25Apr2009 14:07, Martin v. Löwis mar...@v.loewis.de wrote:
| Cameron Simpson wrote:
|  On 22Apr2009 08:50, Martin v. Löwis mar...@v.loewis.de wrote:
|  | File names, environment variables, and command line arguments are
|  | defined as being character data in POSIX;
|  
|  Specific citation please? I'd like to check the specifics of this.
| For example, on environment variables:
| http://opengroup.org/onlinepubs/007908799/xbd/envvar.html
[...]
| http://opengroup.org/onlinepubs/007908799/xsh/execve.html
[...]

Thanks.

|  So you're proposing that all POSIX OS interfaces (which use byte strings)
|  interpret those byte strings into Python3 str objects, with a codec
|  that will accept arbitrary byte sequences losslessly and is totally
|  reversible, yes?
| 
| Correct.
| 
|  And, I hope, that the os.* interfaces silently use it by default.
| 
| Correct.

Ok, then I'm probably good with the PEP. Though I have a quite strong
desire to be able to work in bytes at need without doing multiple
encode/decode steps.

|  | Applications that need to process the original byte
|  | strings can obtain them by encoding the character strings with the
|  | file system encoding, passing python-escape as the error handler
|  | name.
|  
|  -1
|  This last sentence kills the idea for me, unless I'm missing something.
|  Which I may be, of course.
|  POSIX filesystems _do_not_ have a file system encoding.
| 
| Why is that a problem for the PEP?

Because you said above "by encoding the character strings with the file
system encoding", which is a fiction.

|  If I'm writing a general purpose UNIX tool like chmod or find, I expect
|  it to work reliably on _any_ UNIX pathname. It must be totally encoding
|  blind. If I speak to the os.* interface to open a file, I expect to hand
|  it bytes and have it behave.
| 
| See the other messages. If you want to do that, you can continue to.
| 
|  I'm very much in favour of being able to work in strings for most
|  purposes, but if I use the os.* interfaces on a UNIX system it is
|  necessary to be _able_ to work in bytes, because UNIX file pathnames
|  are bytes.
| 
| Please re-read the PEP. It provides a way of being able to access any
| POSIX file name correctly, and still pass strings.
| 
|  If there isn't a byte-safe os.* facility in Python3, it will simply be
|  unsuitable for writing low level UNIX tools.
| 
| Why is that? The mechanism in the PEP is precisely defined to allow
| writing low level UNIX tools.

Then implicitly it's byte safe. Clearly I'm being unclear; I mean
original OS-level byte strings must be obtainable undamaged, and it must
be possible to create/work on OS objects starting with a byte string as
the pathname.

|  Finally, I have a small python program whose whole purpose in life
|  is to transcode UNIX filenames before transfer to a MacOSX HFS
|  directory, because of HFS's enforced particular encoding. What approach
|  should a Python app take to transcode UNIX pathnames under your scheme?
| 
| Compute the corresponding character strings, and use them.

In Python2 I've been going (ignoring checks for unchanged names):

  - Obtain the old name and interpret it into a str() correctly.
    I mean here that I go:
      unicode_name = unicode(name, srcencoding)
    in old Python2 speak. name is a bytes string obtained from listdir()
    and srcencoding is the encoding known to have been used when the old name
    was constructed. Eg iso8859-1.
  - Compute the new name in the desired encoding. For MacOSX HFS,
    that's:
      utf8_name = unicodedata.normalize('NFD', unicode_name).encode('utf8')
    Still in Python2 speak, that's a byte string.
  - os.rename(name, utf8_name)
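
In Python 3 terms the first two steps might look like the sketch below. The function name hfs_name is hypothetical, and it assumes the caller knows the source encoding and already holds the old name as bytes, as described above.

```python
import unicodedata


def hfs_name(name, srcencoding):
    """Transcode a bytes filename to decomposed (NFD) UTF-8 bytes."""
    # Interpret the old name using the encoding it was created with.
    text = name.decode(srcencoding)
    # Decompose accented characters, then encode the result as UTF-8.
    return unicodedata.normalize("NFD", text).encode("utf-8")
```

For example, hfs_name(b"caf\xe9", "iso8859-1") gives b"cafe\xcc\x81", i.e. "e" followed by a combining acute accent.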

Under your scheme I imagine this is amended. I would change your
listdir_b() function as follows:

  import os
  import sys

  def listdir_b(bytestring, fse=None):
      if fse is None:
          fse = sys.getfilesystemencoding()
      # "python-escape" is the error handler name proposed in the PEP
      string = bytestring.decode(fse, "python-escape")
      for fn in os.listdir(string):
          yield fn.encode(fse, "python-escape")

So, internally, os.listdir() takes a string and encodes it to an
_unspecified_ encoding in bytes, and opens the directory with that
byte string using POSIX opendir(3).

How does listdir() ensure that the byte string it passes to the underlying
opendir(3) is identical to 'bytestring' as passed to listdir_b()?
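
One way to check the round trip being asked about, as a sketch: Python 3.2 and later expose "file system encoding plus escaping error handler" as os.fsdecode() and os.fsencode(). The helper name roundtrip is made up for illustration, and the identity is only expected to hold on POSIX systems.

```python
import os


def roundtrip(raw):
    """Decode raw path bytes to str, then encode the str back to bytes."""
    text = os.fsdecode(raw)
    return text, os.fsencode(text)
```

On a POSIX system roundtrip(b"caf\xe9")[1] should be the original bytes, whatever the locale says about how to display them. Separately, os.listdir() in Python 3 returns bytes entries when given a bytes path, which avoids the decode step altogether.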

It seems from the PEP that "On POSIX systems, Python currently applies the
locale's encoding to convert the byte data to Unicode". Your extension
is to augment that by expressing the non-decodable byte sequences in a
non-conflicting way for reversal later, yes?

That seems to double the complexity of my example application, since
it wants to interpret the original bytes in a caller-specified fashion,
not using the locale defaults.

So I must go:

  def macify(dirname, srcencoding):
# I need this to reverse your encoding scheme
fse = sys.getfilesystemencoding()
# I'll pretend dirname is ready for use
# it possibly has had to undergo the