Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott
On 30 Apr 2009, at 05:52, Martin v. Löwis wrote: How do get a printable unicode version of these path strings if they contain none unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Martin v. Löwis
How do get a printable unicode version of these path strings if they contain none unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must be valid unicode

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Barry Scott
On 30 Apr 2009, at 21:06, Martin v. Löwis wrote: How do get a printable unicode version of these path strings if they contain none unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread norseman
Martin v. Löwis wrote: How do get a printable unicode version of these path strings if they contain none unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Barry Scott
On 22 Apr 2009, at 07:50, Martin v. Löwis wrote: If the locale's encoding is UTF-8, the file system encoding is set to a new encoding utf-8b. The UTF-8b codec decodes non-decodable bytes (which must be = 0x80) into half surrogate codes U+DC80..U+DCFF. Forgive me if this has been covered.

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Martin v. Löwis
How do get a printable unicode version of these path strings if they contain none unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. I'm guessing that an app has to understand that filenames come in two

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-26 Thread Adrian
How about another str-like type, a sequence of char-or-bytes? Could be called strbytes or stringwithinvalidcharacters. It would support whatever subset of str functionality makes sense / is easy to implement plus a to_escaped_str() method (that does the escaping the PEP talks about) for people who

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-26 Thread Martin v. Löwis
How about another str-like type, a sequence of char-or-bytes? That would be a different PEP. I personally like my own proposal more, but feel free to propose something different. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
Cameron Simpson wrote: On 22Apr2009 08:50, Martin v. Löwis mar...@v.loewis.de wrote: | File names, environment variables, and command line arguments are | defined as being character data in POSIX; Specific citation please? I'd like to check the specifics of this. For example, on environment

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-25 Thread Martin v. Löwis
If the bytes are mapped to single half surrogate codes instead of the normal pairs (low+high), then I can see that decoding could never be ambiguous and encoding could produce the original bytes. I was confused by Markus Kuhn's original UTF-8b specification. I have now changed the PEP to avoid

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-24 Thread Lino Mastrodomenico
2009/4/22 Martin v. Löwis mar...@v.loewis.de: To convert non-decodable bytes, a new error handler python-escape is introduced, which decodes non-decodable bytes using into a private-use character U+F01xx, which is believed to not conflict with private-use characters that currently exist in

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-24 Thread Martin v. Löwis
Why not use U+DCxx for non-UTF-8 encodings too? I thought of that, and was tricked into believing that only U+DC8x is a half surrogate. Now I see that you are right, and have fixed the PEP accordingly. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-23 Thread James Y Knight
On Apr 22, 2009, at 2:50 AM, Martin v. Löwis wrote: I'm proposing the following PEP for inclusion into Python 3.1. Please comment. +1. Even if some people still want a low-level bytes API, it's important that the easy case be easy. That is: the majority of Python applications should

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-23 Thread MRAB
Martin v. Löwis wrote: MRAB wrote: Martin v. Löwis wrote: [snip] To convert non-decodable bytes, a new error handler python-escape is introduced, which decodes non-decodable bytes using into a private-use character U+F01xx, which is believed to not conflict with private-use characters that

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Nick Coghlan
Martin v. Löwis wrote: I'm proposing the following PEP for inclusion into Python 3.1. Please comment. That seems like a much nicer solution than having parallel bytes/Unicode APIs everywhere. When the locale encoding is UTF-8, would UTF-8b also be used for the command line decoding and

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote: I'm proposing the following PEP for inclusion into Python 3.1. Please comment. Regards, Martin PEP: 383 Title: Non-decodable Bytes in System Character Interfaces Version: $Revision: 71793 $ Last-Modified: $Date: 2009-04-22 08:42:06 +0200 (Mi, 22. Apr 2009) $

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
correct - corrected Thanks, fixed. To convert non-decodable bytes, a new error handler python-escape is introduced, which decodes non-decodable bytes using into a private-use character U+F01xx, which is believed to not conflict with private-use characters that currently exist in Python

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
MRAB wrote: Martin v. Löwis wrote: [snip] To convert non-decodable bytes, a new error handler python-escape is introduced, which decodes non-decodable bytes using into a private-use character U+F01xx, which is believed to not conflict with private-use characters that currently exist in

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Walter Dörwald
Martin v. Löwis wrote: correct - corrected Thanks, fixed. To convert non-decodable bytes, a new error handler python-escape is introduced, which decodes non-decodable bytes using into a private-use character U+F01xx, which is believed to not conflict with private-use characters that

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread M.-A. Lemburg
On 2009-04-22 22:06, Walter Dörwald wrote: Martin v. Löwis wrote: correct - corrected Thanks, fixed. To convert non-decodable bytes, a new error handler python-escape is introduced, which decodes non-decodable bytes using into a private-use character U+F01xx, which is believed to not

Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-22 Thread Martin v. Löwis
The python-escape codec is only used/meaningful if the env encoding is not UTF-8. For any other encoding, it is assumed that no character actually maps to the private-use characters. Which should be true for any encoding from the pre-unicode era, but not for UTF-16/32 and variants. Right.