Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-05-17 Thread Piet van Oostrum
> Ned Deily (ND) wrote: >ND> In article , Piet van Oostrum >ND> wrote: >>> > Ronald Oussoren (RO) wrote: >>> >RO> For what it's worth, the OSX APIs seem to behave as follows: >>> >RO> * If you create a file with a non-UTF8 name on an HFS+ filesystem the >>> >RO> system automatically enc

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-05-01 Thread Stephen J. Turnbull
James Y Knight writes: > in python. It seems like the most common reason why people want to use > SJIS is to make old pre-unicode apps work right in WINE -- in which > case it doesn't actually affect unix python at all. Mounting external drives, especially USB memory sticks which tend to b

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Ronald Oussoren
On 30 Apr, 2009, at 21:33, Piet van Oostrum wrote: Ronald Oussoren (RO) wrote: RO> For what it's worth, the OSX APIs seem to behave as follows: RO> * If you create a file with a non-UTF8 name on an HFS+ filesystem the RO> system automatically encodes the name. RO> That is, open(chr(255

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Steven D'Aprano
On Fri, 1 May 2009 06:55:48 am Thomas Breuel wrote: > You can get the same error on Linux: > > $ python > Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) > [GCC 4.3.3] on linux2 > Type "help", "copyright", "credits" or "license" for more > information. > > >>> f=open(chr(255),'w') > > Traceb

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Toshio Kuratomi
Thomas Breuel wrote: > Not for me (I am using Python 2.6.2). > > >>> f = open(chr(255), 'w') > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > IOError: [Errno 22] invalid mode ('w') or filename: '\xff' > >>> > > > You can get the same error on Linux: > > $ p

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Terry Reedy
James Y Knight wrote: On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote: I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Yes. The practical upshot of this is that users who br

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott
On 30 Apr 2009, at 21:06, Martin v. Löwis wrote: How do I get a printable Unicode version of these path strings if they contain non-Unicode data? Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by pr

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Thomas Breuel
> > Not for me (I am using Python 2.6.2). > > >>> f = open(chr(255), 'w') > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > IOError: [Errno 22] invalid mode ('w') or filename: '\xff' > >>> You can get the same error on Linux: $ python Python 2.6.2 (release26-maint, Apr 19 2009, 01:5

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread James Y Knight
On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote: I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Yes. The practical upshot of this is that users who brokenly use "ja_JP.SJI

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread MRAB
Barry Scott wrote: On 30 Apr 2009, at 05:52, Martin v. Löwis wrote: How do I get a printable Unicode version of these path strings if they contain non-Unicode data? Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
>>> How do I get a printable Unicode version of these path strings if they >>> contain non-Unicode data? >> >> Define "printable". One way would be to use a regular expression, >> replacing all codes in a certain range with a question mark. > > What I mean by printable is that the string must be va

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Ned Deily
In article , Piet van Oostrum wrote: > > Ronald Oussoren (RO) wrote: > >RO> For what it's worth, the OSX APIs seem to behave as follows: > >RO> * If you create a file with a non-UTF8 name on an HFS+ filesystem the > >RO> system automatically encodes the name. > > >RO> That is, open(chr(255)

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott
On 30 Apr 2009, at 05:52, Martin v. Löwis wrote: How do I get a printable Unicode version of these path strings if they contain non-Unicode data? Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by prin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Piet van Oostrum
> Ronald Oussoren (RO) wrote: >RO> For what it's worth, the OSX APIs seem to behave as follows: >RO> * If you create a file with a non-UTF8 name on an HFS+ filesystem the >RO> system automatically encodes the name. >RO> That is, open(chr(255), 'w') will silently create a file named '%FF' >RO

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
MRAB wrote: > One further question: should the encoder accept a string like > u'\xDCC2\xDC80'? That would encode to b'\xC2\x80' Indeed so. > which, when decoded, would give u'\x80'. Assuming the encoding is UTF-8, yes. > Does the PEP only guarantee that strings decoded > from the filesystem are

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread MRAB
One further question: should the encoder accept a string like u'\xDCC2\xDC80'? That would encode to b'\xC2\x80', which, when decoded, would give u'\x80'. Does the PEP only guarantee that strings decoded from the filesystem are reversible, but not check what might be de novo strings?
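The asymmetry MRAB describes can be reproduced with the error handler that later shipped in Python 3.1 under the name 'surrogateescape' (the thread still calls it "python-escape" or utf-8b). This is only a sketch: it assumes the literal in the post was meant as the two lone surrogates U+DCC2 and U+DC80, and the variable names are illustrative.

```python
# Sketch, assuming Python 3.1+ and that the posted literal was meant
# as the two lone surrogates U+DCC2 U+DC80.
de_novo = '\udcc2\udc80'   # built by hand, never decoded from bytes

# Encoding turns each escaped surrogate back into its raw byte.
raw = de_novo.encode('utf-8', 'surrogateescape')

# Those two bytes happen to be valid UTF-8, so decoding does not
# escape them again: the result is a different string.
round_tripped = raw.decode('utf-8', 'surrogateescape')

print(raw)
print(ascii(round_tripped))
print(round_tripped == de_novo)
```

Under these assumptions the bytes-to-str-to-bytes direction stays lossless, while the str-to-bytes-to-str direction does not for strings that never came from the decoder.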

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Stephen J. Turnbull
Cameron Simpson writes: > On 29Apr2009 22:14, Stephen J. Turnbull wrote: > | Baptiste Carvello writes: > | > By contrast, if the new utf-8b codec would *supersede* the old one, > | > \udcxx would always mean raw bytes (at least on UCS-4 builds, where > | > surrogates are unused). Thus ambi

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Aahz
[top-posting for once to preserve full quoting] Glenn, Could you please reduce your suggestions into sample text for the PEP? We seem to be now at the stage where nobody is objecting to the PEP, so the focus should be on making the PEP clearer. If you still want to create an alternative PEP impl

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
> I think it has to be excluded from mapping in order to not introduce > security issues. I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Regards, Martin
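A rough way to observe the exclusion of ASCII bytes, using the 'surrogateescape' handler as it later shipped in Python 3.1+ (an assumption about which implementation the reader has at hand). The helper name encodes_with_escape is hypothetical.

```python
def encodes_with_escape(text, encoding='utf-8'):
    """Return True if text can be encoded with the escape handler."""
    try:
        text.encode(encoding, 'surrogateescape')
        return True
    except UnicodeEncodeError:
        return False

# U+DCFF stands for the raw byte 0xFF, which is in the mapped range.
high = encodes_with_escape('\udcff')

# U+DC2F would stand for 0x2F ('/'), an ASCII byte. Letting it through
# would allow a string to smuggle a path separator past checks made
# on the str form, so the handler is expected to refuse it.
low = encodes_with_escape('\udc2f')

print(high, low)
```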

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
> Assuming people agree that this is an accurate summary, it should be > incorporated into the PEP. Done! Regards, Martin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Glenn Linderman
On approximately 4/29/2009 7:50 PM, came the following characters from the keyboard of Aahz: On Thu, Apr 30, 2009, Cameron Simpson wrote: The lengthy discussion mostly revolves around: - Glenn points out that strings that came _not_ from listdir, and that are _not_ well-formed unicode (==

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Glenn Linderman
On approximately 4/29/2009 8:46 PM, came the following characters from the keyboard of Terry Reedy: Glenn Linderman wrote: On approximately 4/29/2009 1:28 PM, came the following characters from So where is the ambiguity here? None. But not everyone can read all the Python source code to tr

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Thomas Breuel
On Wed, Apr 29, 2009 at 23:03, Terry Reedy wrote: > Thomas Breuel wrote: > >> >>Sure. However, that requires you to provide meaningful, reproducible >>counter-examples, rather than a stenographic formulation that might >>hint some problem you apparently see (which I believe is just no

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
> Thanks for clarifying the Windows behavior, here. A little more > clarification in the PEP could have avoided lots of discussion. It > would seem that a PEP, proposed to modify a poorly documented (and > therefore likely poorly understood) area, should be educational about > the status quo, as

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
> How do I get a printable Unicode version of these path strings if they > contain non-Unicode data? Define "printable". One way would be to use a regular expression, replacing all codes in a certain range with a question mark. > I'm guessing that an app has to understand that filenames come in tw
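The regular-expression approach Martin mentions might look like the following sketch. It assumes the escaped bytes occupy U+DC80..U+DCFF, as described elsewhere in the thread, and uses the 'surrogateescape' handler from Python 3.1+; the names _ESCAPED and printable are made up for illustration.

```python
import re

# Lone surrogates that stand for undecodable bytes 0x80..0xFF.
_ESCAPED = re.compile('[\udc80-\udcff]')

def printable(name):
    """Replace every escaped byte with a question mark for display."""
    return _ESCAPED.sub('?', name)

# A Latin-1 encoded name seen through a UTF-8 decoder.
name = b'caf\xe9.txt'.decode('utf-8', 'surrogateescape')
shown = printable(name)
print(shown)
```

The display form is lossy by design, so an application would keep the original string for file operations and use the substituted form only for output.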

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy
Glenn Linderman wrote: On approximately 4/29/2009 1:28 PM, came the following characters from So where is the ambiguity here? None. But not everyone can read all the Python source code to try to understand it; they expect the documentation to help them avoid that. Because the documentatio

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Aahz
On Thu, Apr 30, 2009, Cameron Simpson wrote: > > The lengthy discussion mostly revolves around: > > - Glenn points out that strings that came _not_ from listdir, and that are > _not_ well-formed unicode (== "have bare surrogates in them") but that > were intended for use as filenames wil

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 23:41, Barry Scott wrote: > On 22 Apr 2009, at 07:50, Martin v. Löwis wrote: >> If the locale's encoding is UTF-8, the file system encoding is set to >> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes >> (which must be >= 0x80) into half surrogate codes U+DC80..U

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Barry Scott
On 22 Apr 2009, at 07:50, Martin v. Löwis wrote: If the locale's encoding is UTF-8, the file system encoding is set to a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF. Forgive me if this has been covered.
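The decoding rule quoted above can be tried out with the 'surrogateescape' error handler, which is the form in which the utf-8b idea later appeared in Python 3.1+. A minimal sketch with made-up sample bytes:

```python
# Two bytes that can never appear in well-formed UTF-8.
raw = b'abc\xff\x80'

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
text = raw.decode('utf-8', 'surrogateescape')
codepoints = [hex(ord(ch)) for ch in text]

# Encoding with the same handler restores the original bytes.
back = text.encode('utf-8', 'surrogateescape')

print(codepoints)
print(back == raw)
```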

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 22:14, Stephen J. Turnbull wrote: | Baptiste Carvello writes: | > By contrast, if the new utf-8b codec would *supersede* the old one, | > \udcxx would always mean raw bytes (at least on UCS-4 builds, where | > surrogates are unused). Thus ambiguity could be avoided. | | Unfortunat

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 17:03, Terry Reedy wrote: > Thomas Breuel wrote: >> Sure. However, that requires you to provide meaningful, reproducible >> counter-examples, rather than a stenographic formulation that might >> hint some problem you apparently see (which I believe is just not >> there

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 1:28 PM, came the following characters from the keyboard of Martin v. Löwis: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, acc

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy
Thomas Breuel wrote: Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is just not there). Well, here's another one: PEP 383 would disall

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy
Glenn Linderman wrote: On approximately 4/29/2009 4:36 AM, came the following characters from the keyboard of Cameron Simpson: On 29Apr2009 02:56, Glenn Linderman wrote: os.listdir(b"") I find that on my Windows system, with all ASCII path file names, that I get quite different results wh

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
> So while out of scope of the PEP, I don't think it's at all > artificial. Sure - but I see this as the same case as "the file got renamed". If you have a LRU list in your app, and a file gets renamed, then the LRU list breaks (unless you also store the inode number in the LRU list, and lookup th

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
>>> C. File on disk with the invalid surrogate code, accessed via the >>> str interface, no decoding happens, matches in memory the file on disk >>> with the byte that translates to the same surrogate, accessed via the >>> bytes interface. Ambiguity. >> What does that mean? What sp

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
> Sure. However, that requires you to provide meaningful, reproducible > counter-examples, rather than a stenographic formulation that might > hint some problem you apparently see (which I believe is just not > there). > > > Well, here's another one: PEP 383 would disallow UTF-8 e

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > I find the case pretty artificial, though: if the locale encoding > changes, all file names will look incorrect to the user, so he'll > quickly switch back, or rename all the files. It's not necessarily the case that the locale encoding changes, but rather the name

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Stephen J. Turnbull
Baptiste Carvello writes: > By contrast, if the new utf-8b codec would *supersede* the old one, > \udcxx would always mean raw bytes (at least on UCS-4 builds, where > surrogates are unused). Thus ambiguity could be avoided. Unfortunately, that's false. It could have come from a literal strin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 4:36 AM, came the following characters from the keyboard of Cameron Simpson: On 29Apr2009 02:56, Glenn Linderman wrote: os.listdir(b"") I find that on my Windows system, with all ASCII path file names, that I get quite different results when I pass os.listdir an

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 4:07 AM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote: On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Gle

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 02:56, Glenn Linderman wrote: > os.listdir(b"") > > I find that on my Windows system, with all ASCII path file names, that I > get quite different results when I pass os.listdir an empty str vs an > empty bytes. > > Rather than keep you guessing, I get the root directory contents

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread R. David Murray
On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote: On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: > C. File on disk with the invalid surrogate code, accessed via the str > interfac

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 12:29 AM, came the following characters from the keyboard of Martin v. Löwis: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, ac

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 12:38 AM, came the following characters from the keyboard of Baptiste Carvello: Glenn Linderman wrote: 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. The problem with this

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Thomas Breuel
> Sure. However, that requires you to provide meaningful, reproducible > counter-examples, rather than a stenographic formulation that might > hint some problem you apparently see (which I believe is just not > there). Well, here's another one: PEP 383 would disallow UTF-8 encodings of half surro

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello
Glenn Linderman wrote: If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates? The

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello
Lino Mastrodomenico wrote: Only for the new utf-8b encoding (if Martin agrees), while the existing utf-8 is fine as is (or at least waaay outside the scope of this PEP). This is questionable. This would have the consequence that \udcxx in a python string would sometimes mean a surrogate,

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Hrvoje Niksic
Zooko O'Whielacronx wrote: If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. Why do you say
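The collision Hrvoje describes can be shown with a small sketch of the "UTF-8 first, iso8859-15 on failure" strategy. The function decode_with_fallback is a hypothetical stand-in for that strategy, not code from Tahoe or from the PEP.

```python
def decode_with_fallback(raw):
    """Decode as UTF-8, falling back to iso8859-15 if that fails."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso8859-15')

# Two different on-disk names...
a = decode_with_fallback(b'\xff')       # not valid UTF-8, falls back
b = decode_with_fallback(b'\xc3\xbf')   # valid UTF-8 for U+00FF

# ...end up as the same str, so the original bytes cannot be recovered.
print(ascii(a), ascii(b), a == b)
```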

Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 08:27, Martin v. Löwis wrote: | > I would like utility functions to perform: | > os-bytes->funny-encoded | > funny-encoded->os-bytes | > or explicit example code snippets for same in the PEP text. | | Done! Thanks! -- Cameron Simpson DoD#743 http://www.cskk.ezoshosting.com/cs/

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello
Glenn Linderman wrote: 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. The problem with this strategy is: paths are often sliced, so your 2 codepoints could get separated. The good thing with the PEP'

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
> C. File on disk with the invalid surrogate code, accessed via the str > interface, no decoding happens, matches in memory the file on disk > with > the byte that translates to the same surrogate, accessed via the bytes > interface. Ambiguity. Is that an alternative to A

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 10:52 PM, came the following characters from the keyboard of Martin v. Löwis: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, ac

Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
> I would like utility functions to perform: > os-bytes->funny-encoded > funny-encoded->os-bytes > or explicit example code snippets for same in the PEP text. Done! Martin
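The two conversions Cameron asks for could be sketched as below. The function names and the fixed 'utf-8' default are assumptions made for illustration; the snippets Martin added to the PEP are not reproduced in this excerpt. Later Python versions (3.2+) provide os.fsdecode and os.fsencode, which apply the filesystem encoding for the same purpose.

```python
def os_bytes_to_funny(raw, encoding='utf-8'):
    """os-bytes -> funny-encoded str (undecodable bytes are escaped)."""
    return raw.decode(encoding, 'surrogateescape')

def funny_to_os_bytes(text, encoding='utf-8'):
    """funny-encoded str -> os-bytes (escapes become raw bytes again)."""
    return text.encode(encoding, 'surrogateescape')

on_disk = b'r\xe9sum\xe9.txt'            # Latin-1 bytes, not valid UTF-8
as_str = os_bytes_to_funny(on_disk)
restored = funny_to_os_bytes(as_str)
print(ascii(as_str), restored == on_disk)
```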

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
> I'm more concerned with your (yours? someone else's?) mention of shift > characters. I'm unfamiliar with these encodings: to translate such a > thing into a Latin example, is it the case that there are schemes with > valid encodings that look like: > > [SHIFT] a b c > > which would produce "A

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
>> The Python UTF-8 codec will happily encode half-surrogates; people argue >> that it is a bug that it does so, however, it would help in this >> specific case. > > Can we use this encoding scheme for writing into files as well? We've > turned the filename with undecodable bytes into a string wi

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
>>> C. File on disk with the invalid surrogate code, accessed via the str >>> interface, no decoding happens, matches in memory the file on disk with >>> the byte that translates to the same surrogate, accessed via the bytes >>> interface. Ambiguity. >> >> Is that an alternative to A and B? > > I

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 4:06 PM, came the following characters from the keyboard of Cameron Simpson: I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving). Close. You at least resolved what you thought my issue was

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk w

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
On 28Apr2009 13:37, Glenn Linderman wrote: > On approximately 4/28/2009 1:25 PM, came the following characters from > the keyboard of Martin v. Löwis: >>> The UTF-8b representation suffers from the same potential ambiguities as >>> the PUA characters... >> >> Not at all the same ambiguities. He

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Toshio Kuratomi
Martin v. Löwis wrote: >> Since the serialization of the Unicode string is likely to use UTF-8, >> and the string for such a file will include half surrogates, the >> application may raise an exception when encoding the names for a >> configuration file. These encoding exceptions will be as rare a

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
On 28Apr2009 14:37, Thomas Breuel wrote: | But the biggest problem with the proposal is that it isn't needed: if you | want to be able to turn arbitrary byte sequences into unicode strings and | back, just set your encoding to iso8859-15. That already works and it | doesn't require any changes.
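Thomas's claim about iso8859-15 can be checked directly, together with the cost that other posters raise. This is a sketch using only the standard codecs; the sample name is made up.

```python
# Every byte value decodes under iso8859-15 and encodes back unchanged.
all_bytes = bytes(range(256))
as_text = all_bytes.decode('iso8859-15')
round_trip_ok = as_text.encode('iso8859-15') == all_bytes

# The price: a name that really is UTF-8 is shown as mojibake.
utf8_name = '\u00e9'.encode('utf-8')     # 'e' with acute accent, two bytes
shown = utf8_name.decode('iso8859-15')   # two unrelated characters

print(round_trip_ok, ascii(shown))
```

So the round trip holds, but correctly encoded UTF-8 names no longer decode to the characters the user expects.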

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread R. David Murray
On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Unles

Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
On 28Apr2009 11:49, Antoine Pitrou wrote: | Paul Moore writes: | > | > I've yet to hear anyone claim that they would have an actual problem | > with a specific piece of code they have written. | | Yep, that's the problem. Lots of theoretical problems no one has ever encountered | bro

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Toshio Kuratomi
Zooko O'Whielacronx wrote: > On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote: >> If you switch to iso8859-15 only in the presence of undecodable UTF-8, >> then you have the same round-trip problem as the PEP: both b'\xff' and >> b'\xc3\xbf' will be converted to u'\u00ff' without a way to >> unambi

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 2:01 PM, came the following characters from the keyboard of MRAB: Glenn Linderman wrote: On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding menti

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving). On 27Apr2009 23:52, Glenn Linderman wrote: > On approximately 4/27/2009 7:11 PM, came the following characters from > the keyboard of Cameron Simpson: [...] >> There

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 2:02 PM, came the following characters from the keyboard of Martin v. Löwis: Glenn Linderman wrote: On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis: The UTF-8b representation suffers from the same potential ambigu

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
Glenn Linderman wrote: > On approximately 4/28/2009 1:25 PM, came the following characters from > the keyboard of Martin v. Löwis: >>> The UTF-8b representation suffers from the same potential ambiguities as >>> the PUA characters... >> >> Not at all the same ambiguities. Here, again, the two choi

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread MRAB
Glenn Linderman wrote: On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
> Others have made this suggestion, and it is helpful to the PEP, but not > sufficient. As implemented as an error handler, I'm not sure that the > b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 > decoder is happy with it. Which, in my testing, it is. Rest assured that th

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis: The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to repres

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 6:01 AM, came the following characters from the keyboard of Lino Mastrodomenico: 2009/4/28 Glenn Linderman : The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though. The very fact that you think you can

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of "python-escape" only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the sense of having special

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
> The UTF-8b representation suffers from the same potential ambiguities as > the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to represent undecodable bytes, in particular for UTF-8 (the PEP actually never proposed this to happen).

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 10:53 AM, came the following characters from the keyboard of James Y Knight: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Zooko O'Whielacronx
On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote: Are you proposing to unconditionally encode file names as iso8859-15, or to do so only when undecodable bytes are encountered? For what it is worth, what we have previously planned to do for the Tahoe project is the second of these -- decod

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread MRAB
James Y Knight wrote: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that s

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 10:00 AM, came the following characters from the keyboard of Martin v. Löwis: An alternative that doesn't suffer from the risk of not being able to store decoded strings would have been the use of PUA characters, but people rejected it because of the potential ambigui

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread James Y Knight
On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forb

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
> If the PEP depends on this being changed, it should be mentioned in the > PEP. The PEP says that the utf-8b codec decodes invalid bytes into low surrogates. I have now clarified that a strict definition of UTF-8 is assumed for utf-8b. Regards, Martin ___
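For readers trying this in a current interpreter: the behaviour described here for utf-8b is available in Python 3.1 and later as the `surrogateescape` error handler rather than as a separate codec. The following is a minimal sketch under that assumption; the helper name `decode_utf8b` is made up for illustration and does not come from the PEP or the thread.

```python
def decode_utf8b(raw: bytes) -> str:
    """Decode strict UTF-8; each undecodable byte 0xNN becomes U+DCNN."""
    # Hypothetical helper: a thin wrapper over the stdlib error handler.
    return raw.decode("utf-8", errors="surrogateescape")


name = decode_utf8b(b"caf\xe9")  # a lone \xe9 is not valid UTF-8
escaped = name[-1]               # the code point standing in for the bad byte

# The stand-in is a low surrogate, as described above.
print(hex(ord(escaped)))

# Encoding with the same handler restores the original bytes.
print(name.encode("utf-8", errors="surrogateescape"))
```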

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
> Since the serialization of the Unicode string is likely to use UTF-8, > and the string for such a file will include half surrogates, the > application may raise an exception when encoding the names for a > configuration file. These encoding exceptions will be as rare as the > unusual names (whic
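The concern quoted here can be reproduced in a few lines. This is only a sketch, assuming Python 3.1 or later, where the PEP's scheme is exposed as the `surrogateescape` error handler; the file name is invented for the example.

```python
# A file name containing a byte that is not valid UTF-8.
name = b"\xff.txt".decode("utf-8", errors="surrogateescape")

# Serializing the name with a strict UTF-8 encoder (as a configuration
# file writer typically would) fails on the lone surrogate.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with the same error handler recovers the original bytes.
restored = name.encode("utf-8", errors="surrogateescape")
print(strict_ok, restored)
```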

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
> It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is > not a valid Unicode character (not a character at all, really) and the > only way you can put this in a POSIX filename is if you use a very > lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. > > Since this byte sequence

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Stephen J. Turnbull
Paul Moore writes: > But it seems to me that there is an assumption that problems will > arise when code gets a potentially funny-decoded string and doesn't > know where it came from. > > Is that a real concern? Yes, it's a real concern. I don't think it's possible to show a small piece of

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Michael Urman
On Mon, Apr 27, 2009 at 23:43, Stephen J. Turnbull wrote: > Nobody said we were at the stage of *saving* the [attachment]! But speaking of saving files, I think that's the biggest hole in this that has been nagging at the back of my mind. This PEP intends to allow easy access to filenames and oth

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Lino Mastrodomenico
2009/4/28 Hrvoje Niksic : > Lino Mastrodomenico wrote: >> >> Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid >> character when >> decoded with UTF-8, it should simply be considered an invalid UTF-8 >> sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* >> '\udcff

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Hrvoje Niksic
Lino Mastrodomenico wrote: Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). "Should be considered" or "will be co
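The decoding Lino describes can be checked directly in current CPython 3, where the strict UTF-8 decoder rejects encoded surrogates. A minimal sketch, assuming the `surrogateescape` error handler that later implemented this PEP; the variable names are illustrative only.

```python
# Bytes a lenient encoder would produce for U+DCFF.
lenient = b"\xed\xb3\xbf"
decoded = lenient.decode("utf-8", errors="surrogateescape")

# A genuinely undecodable single byte, for comparison.
single = b"\xff".decode("utf-8", errors="surrogateescape")

# The two inputs decode to different strings, so both round-trip
# back to their own bytes without colliding.
print(ascii(decoded), ascii(single))
print(decoded.encode("utf-8", errors="surrogateescape"))
print(single.encode("utf-8", errors="surrogateescape"))
```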

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Lino Mastrodomenico
2009/4/28 Glenn Linderman : > The switch from PUA to half-surrogates does not resolve the issues with the > encoding not being a 1-to-1 mapping, though.  The very fact that you  think > you can get away with use of lone surrogates means that other people might, > accidentally or intentionally, also

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Hrvoje Niksic
Thomas Breuel wrote: But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes. Are you proposing to unc
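The round-trip claim for iso8859-15 is easy to verify, along with its cost for names that were actually written in UTF-8. This is a sketch only, using Python's standard `iso8859-15` codec; the sample name is invented for the example.

```python
# Every possible byte value decodes and re-encodes unchanged.
every_byte = bytes(range(256))
as_text = every_byte.decode("iso8859-15")
round_tripped = as_text.encode("iso8859-15")
print(round_tripped == every_byte)

# The price: a valid UTF-8 name is shown as mojibake when the
# encoding is applied unconditionally.
utf8_name = "é".encode("utf-8")        # b'\xc3\xa9'
shown = utf8_name.decode("iso8859-15")
print(shown)
```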

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Thomas Breuel
> > Yep, that's the problem. Lots of theoretical problems noone has ever > encountered > brought up against a PEP which resolves some actual problems people > encounter on > a regular basis. How can you bring up practical problems against something that hasn't been implemented? The fact that no

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Ronald Oussoren
For what it's worth, the OSX APIs seem to behave as follows: * If you create a file with a non-UTF8 name on an HFS+ filesystem, the system automatically encodes the name. That is, open(chr(255), 'w') will silently create a file named '%FF' instead of the name you'd expect on a unix system.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Michael Foord
Paul Moore wrote: 2009/4/28 Antoine Pitrou : Paul Moore gmail.com> writes: I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. Yep, that's the problem. Lots of theoretical problems noone has ever encountered brou

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Paul Moore
2009/4/28 Antoine Pitrou : > Paul Moore gmail.com> writes: >> >> I've yet to hear anyone claim that they would have an actual problem >> with a specific piece of code they have written. > > Yep, that's the problem. Lots of theoretical problems noone has ever > encountered > brought up against a P

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Paul Moore
2009/4/28 Glenn Linderman : > So assume a non-decodable sequence in a name.  That puts us into Martin's > funny-decode scheme.  His funny-decode scheme produces a bare string, > indistinguishable from a bare string that would be produced by a str API > that happens to contain that same sequence.  D

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
> Does the PEP take into consideration the normalising behaviour of Mac > OSX ? We've had some ongoing challenges in bzr related to this with bzr. No, that's completely out of scope, AFAICT. I don't even know what the issues are, so I'm not able to propose a solution, at the moment. Regards, Mart

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson: On 27Apr2009 18:15, Glenn Linderman wrote: The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Martin v. Löwis
James Y Knight wrote: > Hopefully it can be assumed that your locale encoding really is a > non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? > I'm a bit scared at the prospect that U+DCAF

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 8:39 PM, came the following characters from the keyboard of Martin v. Löwis: I'm not suggesting the PEP should solve the problem of mounting foreign file systems, although if it doesn't it should probably point that out. I'm just suggesting that if the people that writ

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Robert Collins
On Mon, 2009-04-27 at 22:25 -0700, Glenn Linderman wrote: > > Indeed, that was the missing piece. I'd forgotten about the > encodings > that use escape sequences, rather than UTF-8, and DBCS. I don't > think > those encodings are permitted by POSIX file systems, but I suppose > they > could s

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 8:35 PM, came the following characters from the keyboard of Martin v. Löwis: Glenn Linderman wrote: On approximately 4/27/2009 12:42 PM, came the following characters from the keyboard of Martin v. Löwis: It's a private use area. It will never carry an official charac
