Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-05-17 Thread Piet van Oostrum
Ned Deily n...@acm.org (ND) wrote: ND In article m2ocueq6mm@cs.uu.nl, Piet van Oostrum p...@cs.uu.nl ND wrote: Ronald Oussoren ronaldousso...@mac.com (RO) wrote: RO For what it's worth, the OSX APIs seem to behave as follows: RO * If you create a file with a non-UTF8 name on an HFS+

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-05-01 Thread Stephen J. Turnbull
James Y Knight writes: in python. It seems like the most common reason why people want to use SJIS is to make old pre-unicode apps work right in WINE -- in which case it doesn't actually affect unix python at all. Mounting external drives, especially USB memory sticks which tend to be

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Thomas Breuel
On Wed, Apr 29, 2009 at 23:03, Terry Reedy tjre...@udel.edu wrote: Thomas Breuel wrote: Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Glenn Linderman
On approximately 4/29/2009 8:46 PM, came the following characters from the keyboard of Terry Reedy: Glenn Linderman wrote: On approximately 4/29/2009 1:28 PM, came the following characters from So where is the ambiguity here? None. But not everyone can read all the Python source code to

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Glenn Linderman
On approximately 4/29/2009 7:50 PM, came the following characters from the keyboard of Aahz: On Thu, Apr 30, 2009, Cameron Simpson wrote: The lengthy discussion mostly revolves around: - Glenn points out that strings that came _not_ from listdir, and that are _not_ well-formed unicode

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
Assuming people agree that this is an accurate summary, it should be incorporated into the PEP. Done! Regards, Martin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
I think it has to be excluded from mapping in order to not introduce security issues. I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Regards, Martin
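A minimal sketch of what excluding ASCII bytes means in practice, assuming Python 3.1 or later, where the handler proposed by this PEP is available as the `surrogateescape` error handler. The helper name `can_escape_encode` is hypothetical and used only for illustration.

```python
# Undecodable bytes >= 0x80 are mapped to lone surrogates U+DC80..U+DCFF.
escaped = b"caf\xe9".decode("ascii", errors="surrogateescape")
round_trip = escaped.encode("ascii", errors="surrogateescape")


def can_escape_encode(text):
    """Return True if `text` encodes under ascii + surrogateescape (hypothetical helper)."""
    try:
        text.encode("ascii", errors="surrogateescape")
    except UnicodeEncodeError:
        return False
    return True


# U+DC2F would correspond to the ASCII byte 0x2F ('/'); the handler
# refuses to produce it, so a crafted string cannot smuggle in a slash.
slash_smuggled = can_escape_encode("\udc2f")
high_byte_ok = can_escape_encode("\udcff")
```

Only surrogates in the range U+DC80..U+DCFF are turned back into bytes, which is why encodings that are not ASCII compatible are effectively unsupported.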

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Aahz
[top-posting for once to preserve full quoting] Glenn, Could you please reduce your suggestions into sample text for the PEP? We seem to be now at the stage where nobody is objecting to the PEP, so the focus should be on making the PEP clearer. If you still want to create an alternative PEP

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Stephen J. Turnbull
Cameron Simpson writes: On 29Apr2009 22:14, Stephen J. Turnbull step...@xemacs.org wrote: | Baptiste Carvello writes: | By contrast, if the new utf-8b codec would *supersede* the old one, | \udcxx would always mean raw bytes (at least on UCS-4 builds, where | surrogates are

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread MRAB
One further question: should the encoder accept a string like u'\uDCC2\uDC80'? That would encode to b'\xC2\x80', which, when decoded, would give u'\x80'. Does the PEP only guarantee that strings decoded from the filesystem are reversible, but not check what might be de novo strings?

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
MRAB wrote: One further question: should the encoder accept a string like u'\uDCC2\uDC80'? That would encode to b'\xC2\x80' Indeed so. which, when decoded, would give u'\x80'. Assuming the encoding is UTF-8, yes. Does the PEP only guarantee that strings decoded from the filesystem are
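The case MRAB raises can be reproduced with the `surrogateescape` error handler in Python 3.1 or later. This is a sketch that assumes the string in question consists of the two lone surrogates U+DCC2 and U+DC80, and that the encoding is UTF-8.

```python
# A "de novo" string that never came from the filesystem.
de_novo = "\udcc2\udc80"

# Encoding turns each escaped surrogate back into its raw byte.
raw = de_novo.encode("utf-8", errors="surrogateescape")

# The two bytes happen to form valid UTF-8, so decoding yields a
# different string from the one we started with.
decoded = raw.decode("utf-8", errors="surrogateescape")
```

The round trip bytes -> str -> bytes is lossless, but str -> bytes -> str is not guaranteed for strings that were not produced by the decoder.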

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Piet van Oostrum
Ronald Oussoren ronaldousso...@mac.com (RO) wrote: RO For what it's worth, the OSX APIs seem to behave as follows: RO * If you create a file with a non-UTF8 name on an HFS+ filesystem the RO system automatically encodes the name. RO That is, open(chr(255), 'w') will silently create a file named

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott
On 30 Apr 2009, at 05:52, Martin v. Löwis wrote: How do get a printable unicode version of these path strings if they contain none unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Ned Deily
In article m2ocueq6mm@cs.uu.nl, Piet van Oostrum p...@cs.uu.nl wrote: Ronald Oussoren ronaldousso...@mac.com (RO) wrote: RO For what it's worth, the OSX APIs seem to behave as follows: RO * If you create a file with a non-UTF8 name on an HFS+ filesystem the RO system automatically encodes

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
How do get a printable unicode version of these path strings if they contain none unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must be valid unicode

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread MRAB
Barry Scott wrote: On 30 Apr 2009, at 05:52, Martin v. Löwis wrote: How do get a printable unicode version of these path strings if they contain none unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread James Y Knight
On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote: I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Yes. The practical upshot of this is that users who brokenly use

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Thomas Breuel
Not for me (I am using Python 2.6.2). >>> f = open(chr(255), 'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' You can get the same error on Linux: $ python Python 2.6.2 (release26-maint, Apr 19 2009,

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott
On 30 Apr 2009, at 21:06, Martin v. Löwis wrote: How do get a printable unicode version of these path strings if they contain none unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Terry Reedy
James Y Knight wrote: On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote: I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Yes. The practical upshot of this is that users who

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Toshio Kuratomi
Thomas Breuel wrote: Not for me (I am using Python 2.6.2). >>> f = open(chr(255), 'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' You can get the same error on Linux: $ python

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Steven D'Aprano
On Fri, 1 May 2009 06:55:48 am Thomas Breuel wrote: You can get the same error on Linux: $ python Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> f=open(chr(255),'w') Traceback (most recent call

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Ronald Oussoren
On 30 Apr, 2009, at 21:33, Piet van Oostrum wrote: Ronald Oussoren ronaldousso...@mac.com (RO) wrote: RO For what it's worth, the OSX APIs seem to behave as follows: RO * If you create a file with a non-UTF8 name on an HFS+ filesystem the RO system automatically encodes the name. RO

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
The Python UTF-8 codec will happily encode half-surrogates; people argue that it is a bug that it does so, however, it would help in this specific case. Can we use this encoding scheme for writing into files as well? We've turned the filename with undecodable bytes into a string with half

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
I'm more concerned with your (yours? someone else's?) mention of shift characters. I'm unfamiliar with these encodings: to translate such a thing into a Latin example, is it the case that there are schemes with valid encodings that look like: [SHIFT] a b c which would produce ABC in

Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
I would like utility functions to perform: os-bytes->funny-encoded funny-encoded->os-bytes or explicit example code snippets for same in the PEP text. Done! Martin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/28/2009 10:52 PM, came the following characters from the keyboard of Martin v. Löwis: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate,

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Is that an alternative to A and B? I guess it is an

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello
Glenn Linderman a écrit : 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. The problem with this strategy is: paths are often sliced, so your 2 codepoints could get separated. The good thing with the

Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 08:27, Martin v. Löwis mar...@v.loewis.de wrote: | I would like utility functions to perform: |os-bytes->funny-encoded |funny-encoded->os-bytes | or explicit example code snippets for same in the PEP text. | | Done! Thanks! -- Cameron Simpson c...@zip.com.au DoD#743

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Hrvoje Niksic
Zooko O'Whielacronx wrote: If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously recover the original file name. Why do you say
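The ambiguity Hrvoje describes can be shown with a short sketch. The function `decode_with_fallback` is a hypothetical stand-in for the "try UTF-8, fall back to iso8859-15" strategy under discussion, not code from any project mentioned in the thread.

```python
def decode_with_fallback(raw):
    """Decode as UTF-8, falling back to iso8859-15 on failure (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso8859-15")


# b'\xff' is not valid UTF-8, so it goes through the fallback.
from_fallback = decode_with_fallback(b"\xff")

# b'\xc3\xbf' is the valid UTF-8 encoding of the same character.
from_utf8 = decode_with_fallback(b"\xc3\xbf")
```

Both inputs produce U+00FF, so the original bytes cannot be recovered from the resulting string alone.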

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello
Lino Mastrodomenico a écrit : Only for the new utf-8b encoding (if Martin agrees), while the existing utf-8 is fine as is (or at least waaay outside the scope of this PEP). This is questionable. This would have the consequence that \udcxx in a python string would sometimes mean a surrogate,

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Baptiste Carvello
Glenn Linderman a écrit : If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates? The

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Thomas Breuel
Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is just not there). Well, here's another one: PEP 383 would disallow UTF-8 encodings of half

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 12:38 AM, came the following characters from the keyboard of Baptiste Carvello: Glenn Linderman a écrit : 3. When an undecodable byte 0xPQ is found, decode to the escape codepoint, followed by codepoint U+01PQ, where P and Q are hex digits. The problem with this

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 12:29 AM, came the following characters from the keyboard of Martin v. Löwis: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate,

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread R. David Murray
On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote: On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: C. File on disk with the invalid surrogate code, accessed via the str

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 02:56, Glenn Linderman v+pyt...@g.nevcal.com wrote: os.listdir(b) I find that on my Windows system, with all ASCII path file names, that I get quite different results when I pass os.listdir an empty str vs an empty bytes. Rather than keep you guessing, I get the root

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 4:07 AM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 20:29, Glenn Linderman wrote: On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37,

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Stephen J. Turnbull
Baptiste Carvello writes: By contrast, if the new utf-8b codec would *supersede* the old one, \udcxx would always mean raw bytes (at least on UCS-4 builds, where surrogates are unused). Thus ambiguity could be avoided. Unfortunately, that's false. It could have come from a literal string

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Stephen J. Turnbull
Martin v. Löwis writes: I find the case pretty artificial, though: if the locale encoding changes, all file names will look incorrect to the user, so he'll quickly switch back, or rename all the files. It's not necessarily the case that the locale encoding changes, but rather the name of

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is just not there). Well, here's another one: PEP 383 would disallow UTF-8

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. What does that mean? What specific interface are you

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
So while out of scope of the PEP, I don't think it's at all artificial. Sure - but I see this as the same case as the file got renamed. If you have a LRU list in your app, and a file gets renamed, then the LRU list breaks (unless you also store the inode number in the LRU list, and lookup the

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy
Glenn Linderman wrote: On approximately 4/29/2009 4:36 AM, came the following characters from the keyboard of Cameron Simpson: On 29Apr2009 02:56, Glenn Linderman v+pyt...@g.nevcal.com wrote: os.listdir(b) I find that on my Windows system, with all ASCII path file names, that I get quite

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy
Thomas Breuel wrote: Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is just not there). Well, here's another one: PEP 383 would

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Glenn Linderman
On approximately 4/29/2009 1:28 PM, came the following characters from the keyboard of Martin v. Löwis: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate,

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 17:03, Terry Reedy tjre...@udel.edu wrote: Thomas Breuel wrote: Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint some problem you apparently see (which I believe is just not

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 22:14, Stephen J. Turnbull step...@xemacs.org wrote: | Baptiste Carvello writes: | By contrast, if the new utf-8b codec would *supersede* the old one, | \udcxx would always mean raw bytes (at least on UCS-4 builds, where | surrogates are unused). Thus ambiguity could be avoided.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Barry Scott
On 22 Apr 2009, at 07:50, Martin v. Löwis wrote: If the locale's encoding is UTF-8, the file system encoding is set to a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF. Forgive me if this has been covered.
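For readers on a current Python: the behaviour quoted here ended up as the `surrogateescape` error handler rather than as a separate codec named utf-8b. A minimal sketch of the byte-to-surrogate mapping, assuming Python 3.1 or later:

```python
# A name mixing valid UTF-8 (b'\xc3\xa9' is U+00E9) with a stray 0xFF byte.
raw = b"abc\xff\xc3\xa9"

text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")

# Every lone byte 0x80..0xFF is invalid UTF-8 on its own and maps to
# the lone surrogate U+DC00 + byte value.
mapping_holds = all(
    bytes([b]).decode("utf-8", errors="surrogateescape") == chr(0xDC00 + b)
    for b in range(0x80, 0x100)
)
```

Valid sequences decode normally; only the bytes the decoder rejects are escaped, which is what makes the bytes -> str -> bytes round trip exact.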

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Cameron Simpson
On 29Apr2009 23:41, Barry Scott ba...@barrys-emacs.org wrote: On 22 Apr 2009, at 07:50, Martin v. Löwis wrote: If the locale's encoding is UTF-8, the file system encoding is set to a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes (which must be >= 0x80) into half surrogate

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Aahz
On Thu, Apr 30, 2009, Cameron Simpson wrote: The lengthy discussion mostly revolves around: - Glenn points out that strings that came _not_ from listdir, and that are _not_ well-formed unicode (== have bare surrogates in them) but that were intended for use as filenames will

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Terry Reedy
Glenn Linderman wrote: On approximately 4/29/2009 1:28 PM, came the following characters from So where is the ambiguity here? None. But not everyone can read all the Python source code to try to understand it; they expect the documentation to help them avoid that. Because the

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
How do get a printable unicode version of these path strings if they contain none unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. I'm guessing that an app has to understand that filenames come in two
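The regular-expression approach Martin suggests might look like the following sketch. It assumes the escaped bytes occupy U+DC80..U+DCFF, as in the PEP; the helper name `printable` is hypothetical.

```python
import re

# Matches the lone surrogates used to carry undecodable bytes.
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def printable(name):
    """Replace each escaped byte with a question mark (hypothetical helper)."""
    return _ESCAPED_BYTE.sub("?", name)


# A file name containing one byte that is not valid UTF-8.
name = b"report-\xff.txt".decode("utf-8", errors="surrogateescape")
shown = printable(name)
```

The result is safe to display but is no longer usable for opening the file, so an application would keep the original string alongside the displayed one.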

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
Thanks for clarifying the Windows behavior, here. A little more clarification in the PEP could have avoided lots of discussion. It would seem that a PEP, proposed to modify a poorly documented (and therefore likely poorly understood) area, should be educational about the status quo, as well

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is forbidden? I'm a bit scared at the prospect that U+DCAF

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron Simpson: On 27Apr2009 18:15, Glenn Linderman v+pyt...@g.nevcal.com wrote: The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
Does the PEP take into consideration the normalising behaviour of Mac OSX ? We've had some ongoing challenges in bzr related to this with bzr. No, that's completely out of scope, AFAICT. I don't even know what the issues are, so I'm not able to propose a solution, at the moment. Regards,

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Paul Moore
2009/4/28 Glenn Linderman v+pyt...@g.nevcal.com: So assume a non-decodable sequence in a name.  That puts us into Martin's funny-decode scheme.  His funny-decode scheme produces a bare string, indistinguishable from a bare string that would be produced by a str API that happens to contain that

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Paul Moore
2009/4/28 Antoine Pitrou solip...@pitrou.net: Paul Moore p.f.moore at gmail.com writes: I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. Yep, that's the problem. Lots of theoretical problems no one has ever encountered

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Michael Foord
Paul Moore wrote: 2009/4/28 Antoine Pitrou solip...@pitrou.net: Paul Moore p.f.moore at gmail.com writes: I've yet to hear anyone claim that they would have an actual problem with a specific piece of code they have written. Yep, that's the problem. Lots of theoretical problems

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Ronald Oussoren
For what it's worth, the OSX APIs seem to behave as follows: * If you create a file with a non-UTF8 name on an HFS+ filesystem the system automatically encodes the name. That is, open(chr(255), 'w') will silently create a file named '%FF' instead of the name you'd expect on a unix system.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Thomas Breuel
Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis. How can you bring up practical problems against something that hasn't been implemented? The fact that no other

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Hrvoje Niksic
Thomas Breuel wrote: But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works and it doesn't require any changes. Are you proposing to

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Lino Mastrodomenico
2009/4/28 Glenn Linderman v+pyt...@g.nevcal.com: The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though. The very fact that you think you can get away with use of lone surrogates means that other people might, accidentally or

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Hrvoje Niksic
Lino Mastrodomenico wrote: Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not* '\udcff'). Should be considered or will be
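Lino's point can be checked against the released implementation. This sketch assumes Python 3.1 or later, where the UTF-8 decoder is strict about surrogates and the PEP's handler is named `surrogateescape`; `surrogatepass` is shown only as a contrast.

```python
# The lenient UTF-8 encoding of the lone surrogate U+DCFF.
raw = b"\xed\xb3\xbf"

# A strict decoder rejects all three bytes, so each is escaped separately.
escaped = raw.decode("utf-8", errors="surrogateescape")

# By contrast, the surrogatepass handler accepts the sequence as U+DCFF.
passed = raw.decode("utf-8", errors="surrogatepass")

# Encoding the escaped form restores the original three bytes.
restored = escaped.encode("utf-8", errors="surrogateescape")
```

Because the three-byte form and the single byte 0xFF decode to different strings, a file name containing one cannot be confused with a file name containing the other.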

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Lino Mastrodomenico
2009/4/28 Hrvoje Niksic hrvoje.nik...@avl.com: Lino Mastrodomenico wrote: Since this byte sequence [b'\xed\xb3\xbf'] doesn't represent a valid character when decoded with UTF-8, it should simply be considered an invalid UTF-8 sequence of three bytes and decoded to '\udced\udcb3\udcbf' (*not*

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Michael Urman
On Mon, Apr 27, 2009 at 23:43, Stephen J. Turnbull step...@xemacs.org wrote: Nobody said we were at the stage of *saving* the [attachment]! But speaking of saving files, I think that's the biggest hole in this that has been nagging at the back of my mind. This PEP intends to allow easy access to

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Stephen J. Turnbull
Paul Moore writes: But it seems to me that there is an assumption that problems will arise when code gets a potentially funny-decoded string and doesn't know where it came from. Is that a real concern? Yes, it's a real concern. I don't think it's possible to show a small piece of

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
It does solve this issue, because (unlike e.g. U+F01FF) '\udcff' is not a valid Unicode character (not a character at all, really) and the only way you can put this in a POSIX filename is if you use a very lenient UTF-8 encoder that gives you b'\xed\xb3\xbf'. Since this byte sequence

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
Since the serialization of the Unicode string is likely to use UTF-8, and the string for such a file will include half surrogates, the application may raise an exception when encoding the names for a configuration file. These encoding exceptions will be as rare as the unusual names (which
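The failure mode described here is easy to reproduce. In this sketch the function `serialize_name` and its `backslashreplace` fallback are hypothetical choices made for illustration; the thread does not prescribe how an application should handle the exception.

```python
# A file name carrying one undecodable byte, as os.listdir() might return it.
name = b"data-\xff.cfg".decode("utf-8", errors="surrogateescape")


def serialize_name(text):
    """Encode a name for a UTF-8 config file (hypothetical helper).

    Strict UTF-8 refuses lone surrogates, so fall back to a visible,
    lossy backslash escape instead of crashing.
    """
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError:
        return text.encode("utf-8", errors="backslashreplace")


stored = serialize_name(name)
```

An application that needs to reopen the file later would instead have to store the name with `surrogateescape` or as raw bytes, since the backslash form does not round-trip automatically.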

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
If the PEP depends on this being changed, it should be mentioned in the PEP. The PEP says that the utf-8b codec decodes invalid bytes into low surrogates. I have now clarified that a strict definition of UTF-8 is assumed for utf-8b. Regards, Martin

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread James Y Knight
On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that such overlapping is

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 10:00 AM, came the following characters from the keyboard of Martin v. Löwis: An alternative that doesn't suffer from the risk of not being able to store decoded strings would have been the use of PUA characters, but people rejected it because of the potential

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread MRAB
James Y Knight wrote: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is required by POSIX... Can you please point to the part of the POSIX spec that says that

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Zooko O'Whielacronx
On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote: Are you proposing to unconditionally encode file names as iso8859-15, or to do so only when undecodeable bytes are encountered? For what it is worth, what we have previously planned to do for the Tahoe project is the second of these --

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 10:53 AM, came the following characters from the keyboard of James Y Knight: On Apr 28, 2009, at 2:50 AM, Martin v. Löwis wrote: James Y Knight wrote: Hopefully it can be assumed that your locale encoding really is a non-overlapping superset of ASCII, as is

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to represent undecodable bytes, in particular for UTF-8 (the PEP actually never proposed this to happen).

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of python-escape only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the sense of having special

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 6:01 AM, came the following characters from the keyboard of Lino Mastrodomenico: 2009/4/28 Glenn Linderman v+pyt...@g.nevcal.com: The switch from PUA to half-surrogates does not resolve the issues with the encoding not being a 1-to-1 mapping, though. The very fact

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis: The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A. use PUA characters to

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
Others have made this suggestion, and it is helpful to the PEP, but not sufficient. As implemented as an error handler, I'm not sure that the b'\xed\xb3\xbf' sequence would trigger the error handler, if the UTF-8 decoder is happy with it. Which, in my testing, it is. Rest assured that the
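
The question above, whether b'\xed\xb3\xbf' reaches the error handler at all, can be checked directly. The following is only a sketch, assuming a current CPython 3 release and the handler under its eventual name `surrogateescape`; the variable names are illustrative and not from the thread. On such an interpreter the strict UTF-8 decoder rejects the sequence, so the handler sees each byte individually.

```python
# Bytes that would be the UTF-8 form of the lone surrogate U+DCFF.
raw = b'\xed\xb3\xbf'

# Assumption: a current CPython 3, where strict UTF-8 rejects encoded surrogates.
try:
    raw.decode('utf-8')
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With the handler, every undecodable byte maps to U+DC00 + byte value.
escaped = raw.decode('utf-8', 'surrogateescape')

# Encoding with the same handler restores the original bytes.
round_trip = escaped.encode('utf-8', 'surrogateescape')

print(strict_ok, [hex(ord(c)) for c in escaped], round_trip == raw)
```

Under those assumptions the three bytes become three separate escape code points rather than a single U+DCFF, which is why the sequence does not collide with the escape for the lone byte 0xFF.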

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread MRAB
Glenn Linderman wrote: On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of python-escape only in terms of UTF-8, the only encoding mentioned in the PEP. In UTF-8, bytes 0x00 to 0x7F are decodable. UTF-8 is only mentioned in the

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
Glenn Linderman wrote: On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis: The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... Not at all the same ambiguities. Here, again, the two choices: A.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 2:02 PM, came the following characters from the keyboard of Martin v. Löwis: Glenn Linderman wrote: On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis: The UTF-8b representation suffers from the same potential

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving). On 27Apr2009 23:52, Glenn Linderman v+pyt...@g.nevcal.com wrote: On approximately 4/27/2009 7:11 PM, came the following characters from the keyboard of Cameron

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 2:01 PM, came the following characters from the keyboard of MRAB: Glenn Linderman wrote: On approximately 4/28/2009 11:55 AM, came the following characters from the keyboard of MRAB: I've been thinking of python-escape only in terms of UTF-8, the only encoding

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Toshio Kuratomi
Zooko O'Whielacronx wrote: On Apr 28, 2009, at 6:46 AM, Hrvoje Niksic wrote: If you switch to iso8859-15 only in the presence of undecodable UTF-8, then you have the same round-trip problem as the PEP: both b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a way to unambiguously
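
The round-trip problem described in this message can be reproduced in a few lines. This is a minimal sketch, not code from the thread: `decode_with_fallback` is a hypothetical helper standing in for "switch to iso8859-15 only in the presence of undecodable UTF-8", and `surrogateescape` is the handler name as it exists in current Python 3.

```python
def decode_with_fallback(raw):
    """Hypothetical helper: try UTF-8 first, fall back to ISO 8859-15."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return raw.decode('iso8859-15')

# Two different byte strings collapse to the same text under the fallback.
a = decode_with_fallback(b'\xff')      # invalid UTF-8, decoded as ISO 8859-15
b = decode_with_fallback(b'\xc3\xbf')  # valid UTF-8 for U+00FF
print(a == b)

# With surrogateescape the two inputs stay distinct and both round-trip.
c = b'\xff'.decode('utf-8', 'surrogateescape')
d = b'\xc3\xbf'.decode('utf-8', 'surrogateescape')
print(c != d)
print(c.encode('utf-8', 'surrogateescape'), d.encode('utf-8', 'surrogateescape'))
```

Once the fallback has produced u'\u00ff', nothing in the string records which of the two byte sequences it came from, so the original file name cannot be recovered reliably.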

Re: [Python-Dev] Python-Dev PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
On 28Apr2009 11:49, Antoine Pitrou solip...@pitrou.net wrote: | Paul Moore p.f.moore at gmail.com writes: | | I've yet to hear anyone claim that they would have an actual problem | with a specific piece of code they have written. | | Yep, that's the problem. Lots of theoretical problems no one

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread R. David Murray
On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
On 28Apr2009 14:37, Thomas Breuel tmb...@gmail.com wrote: | But the biggest problem with the proposal is that it isn't needed: if you | want to be able to turn arbitrary byte sequences into unicode strings and | back, just set your encoding to iso8859-15. That already works and it | doesn't

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Toshio Kuratomi
Martin v. Löwis wrote: Since the serialization of the Unicode string is likely to use UTF-8, and the string for such a file will include half surrogates, the application may raise an exception when encoding the names for a configuration file. These encoding exceptions will be as rare as the

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Cameron Simpson
On 28Apr2009 13:37, Glenn Linderman v+pyt...@g.nevcal.com wrote: On approximately 4/28/2009 1:25 PM, came the following characters from the keyboard of Martin v. Löwis: The UTF-8b representation suffers from the same potential ambiguities as the PUA characters... Not at all the same

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 7:40 PM, came the following characters from the keyboard of R. David Murray: On Tue, 28 Apr 2009 at 13:37, Glenn Linderman wrote: C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Glenn Linderman
On approximately 4/28/2009 4:06 PM, came the following characters from the keyboard of Cameron Simpson: I think I may be able to resolve Glenn's issues with the scheme lower down (through careful use of definitions and hand waving). Close. You at least resolved what you thought my issue

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Martin v. Löwis
C. File on disk with the invalid surrogate code, accessed via the str interface, no decoding happens, matches in memory the file on disk with the byte that translates to the same surrogate, accessed via the bytes interface. Ambiguity. Is that an alternative to A and B? I guess it is an

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/25/2009 5:35 AM, came the following characters from the keyboard of Martin v. Löwis: Because the encoding is not reliably reversible. Why do you say that? The encoding is completely reversible (unless we disagree on what reversible means). I'm +1 on the concept, -1 on the

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/25/2009 5:22 AM, came the following characters from the keyboard of Martin v. Löwis: The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular file name str was obtained from a str API, or

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Cameron Simpson
On 26Apr2009 23:39, Glenn Linderman v+pyt...@g.nevcal.com wrote: [...snip...] There are still issues regarding how Windows and POSIX programs that are sharing cross-mounted file systems might communicate file names between each other, which is not at all clear from the PEP. If this is an

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-27 Thread Glenn Linderman
On approximately 4/27/2009 12:55 AM, came the following characters from the keyboard of Cameron Simpson: On 26Apr2009 23:39, Glenn Linderman v+pyt...@g.nevcal.com wrote: [...snip...] There are still issues regarding how Windows and POSIX programs that are sharing cross-mounted file systems
