Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Paul Moore
2009/4/25 James Y Knight : > On Apr 24, 2009, at 6:05 PM, Paul Moore wrote: >> >> - Windows systems where broken Unicode (lone surrogates or whatever) >> isn't involved >> - Unix systems where the user's stated filesystem encoding is correct >> >> Can you honestly say that this isn't the vast major

Re: [Python-Dev] Deprecating PyOS_ascii_formatd

2009-04-25 Thread Eric Smith
Benjamin Peterson wrote: 2009/4/24 Eric Smith : My proposal is to deprecate PyOS_ascii_formatd in 3.1 and remove it in 3.2. Having heard no dissent, I'd like to go ahead and deprecate this API. What are the mechanics of deprecating this? Just documentation, or is there something I should do in

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
Cameron Simpson wrote: > On 22Apr2009 08:50, Martin v. Löwis wrote: > | File names, environment variables, and command line arguments are > | defined as being character data in POSIX; > > Specific citation please? I'd like to check the specifics of this. For example, on environment variables: h

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> | 2. Even if they were taken away (which the PEP does not propose to do), > |it would be easy to emulate them for applications that want them. > |For example, listdir could be wrapped as > | > |def listdir_b(bytestring): > |fse = sys.getfilesystemencoding() > > Alas, no No,
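The wrapper quoted above is cut off in this archive, so the following is only a minimal sketch of the general idea (emulating a bytes-returning listdir on top of the str-returning one), not the code from the message. It assumes a Python version that provides `os.fsdecode` and `os.fsencode` (added in 3.2, after this thread); the name `listdir_b` is taken from the quote, and the body and the temporary-directory demo are illustrative assumptions.

```python
import os
import tempfile


def listdir_b(path):
    """Return directory entries as bytes, built on the str-based os.listdir.

    os.fsdecode/os.fsencode apply the filesystem encoding together with the
    platform's error handler (surrogateescape on POSIX systems).
    """
    names = os.listdir(os.fsdecode(path))
    return [os.fsencode(name) for name in names]


# Small demonstration in a throwaway directory (illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "example.txt"), "w").close()
    names = listdir_b(os.fsencode(tmp))
```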

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
Simon Cross wrote: >> Unfortunately, for Windows, the situation would >> be exactly the opposite: the byte-oriented interface cannot represent >> all data; only the character-oriented API can. > > Is the second part of this actually true? My understanding may be > flawed, but surely all Unicode da

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> The problem with this, and other preceding schemes that have been > discussed here, is that there is no means of ascertaining whether a > particular file name str was obtained from a str API, or was funny- > decoded from a bytes API... and thus, there is no means of reliably > ascertaining whethe

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> Humour aside :), the expectation that filenames are Unicode data > simply doesn't agree with the reality of POSIX file systems. I think > an approach similar to that adopted by glib [1] could work Are you saying that the approach presented in the PEP will not work? I believe it would work no ma

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> The part that I haven't seen clearly addressed so far is what happens > when disks get mounted across OSes (e.g. NFS). > > While I agree that there should be a layer on top that can handle "most" > situations, it also seems clear that the raw layer needs to be readily > accessible. Indeed, with

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> [1] Actually, all the PEP says is "With this PEP, a uniform treatment > of these data as characters becomes > possible." An argument as to why this is a good thing would be a > useful addition to the PEP. At the moment it's more or less treated as > self-evident - which I agree with, but which cl

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> Because the encoding is not reliably reversible. Why do you say that? The encoding is completely reversible (unless we disagree on what "reversible" means). > I'm +1 on the concept, -1 on the PEP, due solely to the lack of a > reversible encoding. Then please provide an example for a setup whe

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> Following on from that, would this (under Martin's proposal) result in > programs receiving encoded strings, or just semantically-incorrect > ones? Not sure I understand the question - what is an "encoded string"? As you analyse below, sometimes, the current (2.x) file system encoding will do t

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> If the bytes are mapped to single half surrogate codes instead of the > normal pairs (low+high), then I can see that decoding could never be > ambiguous and encoding could produce the original bytes. I was confused by Markus Kuhn's original UTF-8b specification. I have now changed the PEP to avo
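To illustrate the "single half surrogate" mapping discussed here: in the scheme as it ended up in Python's `surrogateescape` error handler, each undecodable byte 0x80..0xFF becomes one lone low surrogate U+DC80..U+DCFF. The helper names below (`escape_byte`, `unescape_char`) are hypothetical and exist only for this sketch; the last line compares the hand-written mapping with the built-in handler.

```python
def escape_byte(b):
    """Map one undecodable byte (0x80..0xFF) to a lone surrogate U+DC80..U+DCFF."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are escaped")
    return chr(0xDC00 + b)


def unescape_char(ch):
    """Inverse of escape_byte: recover the original byte value."""
    cp = ord(ch)
    if not 0xDC80 <= cp <= 0xDCFF:
        raise ValueError("not an escaped byte")
    return cp - 0xDC00


# The built-in handler produces the same code point for a byte that is
# invalid in the chosen encoding (0xE9 is not ASCII).
builtin = b"\xe9".decode("ascii", "surrogateescape")
matches_builtin = builtin == escape_byte(0xE9)
```

Because every escaped byte maps to exactly one code point that a strict decoder never produces, the mapping can be inverted without ambiguity.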

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread MRAB
Martin v. Löwis wrote: If the bytes are mapped to single half surrogate codes instead of the normal pairs (low+high), then I can see that decoding could never be ambiguous and encoding could produce the original bytes. I was confused by Markus Kuhn's original UTF-8b specification. I have now ch

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Paul Moore
2009/4/25 "Martin v. Löwis" : >> Following on from that, would this (under Martin's proposal) result in >> programs receiving encoded strings, or just semantically-incorrect >> ones? > > Not sure I understand the question - what is an "encoded string"? Sorry. I was struggling to come up with termi

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> OK, looks like my analysis matches yours, except that I wasn't sure if > the third case (a string that "likely wasn't intended") could result > in exceptions. From what you're saying, it sounds like it would > actually be similar to the second case - I'm not clear on how > surrogates work, though

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> The only drawback I can see is if the UTF-8 bytes actually decode to a > half surrogate. However, half surrogates should really only occur in > UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8 > anyway! Right: that's the rationale for UTF-8b. Encoding half surrogates violates p

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Zooko O'Whielacronx
Thanks for writing this PEP 383, MvL. I recently ran into this problem in Python 2.x in the Tahoe project [1]. The Tahoe project should be considered a good use case showing what some people need. For example, the assumption that a file will later be written back into the same local file

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Oleg Broytmann
On Sat, Apr 25, 2009 at 05:00:17PM +0200, "Martin v. Löwis" wrote: > I recognize that for other languages (without trivial transliterations) > the problem is more severe, and people are more likely to create > files with Cyrillic, or Japanese, names (say) if the systems accepts > them at all. I

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Michael Urman
On Sat, Apr 25, 2009 at 10:00, "Martin v. Löwis" wrote: > On decoding, there is a guarantee that it decodes successfully. There is > also a guarantee that the result will re-encode successfully, and yield > the same byte string. > > If you pass a different string into encoding, you still may get >
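The round-trip guarantee quoted above can be checked with the `surrogateescape` error handler that Python 3.1 and later ship. This is a small sketch rather than anything from the thread: the function name `roundtrip` and the sample byte strings are made up for illustration.

```python
def roundtrip(raw, encoding="utf-8"):
    """Decode bytes with surrogateescape, then encode the result back."""
    text = raw.decode(encoding, "surrogateescape")  # never raises for bytes input
    return text.encode(encoding, "surrogateescape")


samples = [
    b"plain.txt",          # pure ASCII
    b"caf\xc3\xa9.txt",    # valid UTF-8
    b"caf\xe9.txt",        # Latin-1 byte, invalid as UTF-8
    b"\xff\xfe\x80",       # nothing but invalid bytes
]
results = [roundtrip(raw) == raw for raw in samples]
```

Every entry in `results` is `True`: valid sequences decode normally, and each invalid byte is carried through as a lone surrogate and restored on encoding.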

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
> I see two main user-oriented use cases for the resulting Unicode > strings this PEP will produce on all systems: displaying a list of > filenames for the user to select from (an open file dialog), and > allowing a user to edit or supply a filename (a save dialog or a > rename control). There are

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Antoine Pitrou
Paul Moore writes: > But those > people are also the *least* likely people to contribute on an > English-speaking list, I guess (Sincere apologies if everyone but > me on this list happens to actually be fluent English-speaking > Russians ) Actually, we're all Finnish. Regards, Ånto

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Michael Urman
On Sat, Apr 25, 2009 at 11:33, "Martin v. Löwis" wrote: > If the user has the locale setup in way that matches his keyboard, > it should work all fine - and will already, even without the PEP. > If the user enters a character that doesn't directly map to a > good file name, you get an exception, a

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread MRAB
Martin v. Löwis wrote: I see two main user-oriented use cases for the resulting Unicode strings this PEP will produce on all systems: displaying a list of filenames for the user to select from (an open file dialog), and allowing a user to edit or supply a filename (a save dialog or a rename contr

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Jeroen Ruigrok van der Werven
-On [20090425 11:01], Paul Moore (p.f.mo...@gmail.com) wrote: >PS Unfortunately, I suspect that the biggest group of people likely to >be hit badly by this is people using non-latin scripts. And arguing >probabilities without real data is optimistic at best. But those >people are als

Re: [Python-Dev] [Python-checkins] r71946 - peps/trunk/pep-0315.txt

2009-04-25 Thread Eric Smith
You might want to note in the PEP that the problem that's being solved is known as the "loop and a half" problem. http://www.cs.duke.edu/~ola/patterns/plopd/loops.html#loop-and-a-half raymond.hettinger wrote: Author: raymond.hettinger Date: Sun Apr 26 02:34:36 2009 New Revision: 71946 Log: Re

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Cameron Simpson
On 25Apr2009 14:07, "Martin v. Löwis" wrote: | Cameron Simpson wrote: | > On 22Apr2009 08:50, Martin v. Löwis wrote: | > | File names, environment variables, and command line arguments are | > | defined as being character data in POSIX; | > | > Specific citation please? I'd like to check the spe