Re: [Python-3000] Unicode and OS strings

2007-09-28 Thread Martin v. Löwis
> msvcrt ships with the operating system - I'd call that a conforming > implementation. Yes, but it's not part of the operating system interface; Microsoft documents it as "for future use only by system-level components". > I still regard handling argv as anything other the raw bytes that come >

Re: [Python-3000] Unicode and OS strings

2007-09-27 Thread Nicholas Bastin
On 9/28/07, "Martin v. Löwis" <[EMAIL PROTECTED]> wrote: > Nicholas Bastin wrote: > > On 9/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > >> argc/argv does not exist on Windows (that you seem to see it > >> anyway is an illusion), and if it did exist, it would be characters, > >> not bytes

Re: [Python-3000] Unicode and OS strings

2007-09-27 Thread Stephen Hansen
On 9/27/07, Nicholas Bastin <[EMAIL PROTECTED]> wrote: > > On 9/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > > argc/argv does not exist on Windows (that you seem to see it > > anyway is an illusion), and if it did exist, it would be characters, > > not bytes. > > Of course it exists on Win

Re: [Python-3000] Unicode and OS strings

2007-09-27 Thread Martin v. Löwis
Nicholas Bastin wrote: > On 9/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: >> argc/argv does not exist on Windows (that you seem to see it >> anyway is an illusion), and if it did exist, it would be characters, >> not bytes. > > Of course it exists on Windows. argc/argv are defined by th

Re: [Python-3000] Unicode and OS strings

2007-09-27 Thread Nicholas Bastin
On 9/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > argc/argv does not exist on Windows (that you seem to see it > anyway is an illusion), and if it did exist, it would be characters, > not bytes. Of course it exists on Windows. argc/argv are defined by the C standard, and say what you wil

Re: [Python-3000] Unicode and OS strings

2007-09-22 Thread Martin v. Löwis
> The filesystem is unrelated to sys.argv, except for the need to pass > filenames through argv. If the filesystem is using bytes rather than > characters, then sys.argv must offer the same option, or else certain > scripts will (under some rare circumstances) fail. The same holds for file names

Re: [Python-3000] Unicode and OS strings

2007-09-22 Thread Jim Jewett
On 9/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: > Quoting Jim Jewett <[EMAIL PROTECTED]>: > > > On 9/21/07, Paul Moore <[EMAIL PROTECTED]> wrote: > >> On 21/09/2007, Jim Jewett <[EMAIL PROTECTED]> wrote: [The original context, expressed with some detail by Michael Urman in http://mail.p

Re: [Python-3000] Unicode and OS strings

2007-09-22 Thread Marcin 'Qrczak' Kowalczyk
On Fri, 21 Sep 2007 at 10:00 -0400, Jim Jewett wrote: > Is it reasonable to expose sys.argv.buffer? > (Since this would be bytes rather than text, I assume this would be a > single array, rather than a list of already separated arguments.) On Unix the arguments are already separated

Re: [Python-3000] Unicode and OS strings

2007-09-21 Thread martin
Quoting Jim Jewett <[EMAIL PROTECTED]>: > On 9/21/07, Paul Moore <[EMAIL PROTECTED]> wrote: >> On 21/09/2007, Jim Jewett <[EMAIL PROTECTED]> wrote: >> > (Outside ASCII), if you treat sys.argv as text, that is probably >> > impossible without filesystem support. Before python even sees the >> >

Re: [Python-3000] Unicode and OS strings

2007-09-21 Thread Terry Reedy
"Michael Urman" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] | If there's not something straightforward to put in the ... below that | would allow simple iteration and processing of all files passed on the | command line, preferably interchangeably on both unix (where filenames |

Re: [Python-3000] Unicode and OS strings

2007-09-21 Thread Paul Moore
On 21/09/2007, Jim Jewett <[EMAIL PROTECTED]> wrote: > If you are using text (as opposed to bytes), then À can be either > U+00C0 or . If the file system makes a distinction, > then it is using bytes, and any program interacting with it needs* to > use bytes too. OK. I don't know enough about Uni
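[Archive note] The point about À having more than one representation can be checked directly. The following is a minimal sketch, assuming the two forms under discussion are the precomposed code point U+00C0 and the decomposed sequence "A" plus U+0300 (the second form is missing from the snippet above, so this is an assumption). It uses only the standard `unicodedata` module.

```python
import unicodedata

composed = "\u00c0"      # À as one precomposed code point
decomposed = "A\u0300"   # 'A' followed by COMBINING GRAVE ACCENT (assumed form)

# Canonically equivalent text: normalizing to NFC makes the strings equal.
same_text = unicodedata.normalize("NFC", decomposed) == composed

# Different bytes: a byte-oriented file system sees two distinct names.
composed_bytes = composed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")
same_bytes = composed_bytes == decomposed_bytes

print(same_text)          # True
print(same_bytes)         # False
print(composed_bytes)     # b'\xc3\x80'
print(decomposed_bytes)   # b'A\xcc\x80'
```

A file system that compares names byte for byte would treat these as two different files, which is the distinction being argued about.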

Re: [Python-3000] Unicode and OS strings

2007-09-21 Thread Michael Urman
On 9/21/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > (Outside ASCII), if you treat sys.argv as text, that is probably > impossible without filesystem support. Before python even sees the > data, the terminal itself is allowed to change between canonical > equivalents, which have different binary re

Re: [Python-3000] Unicode and OS strings

2007-09-21 Thread Thomas Heller
Jean-Paul Calderone wrote: > On Fri, 21 Sep 2007 10:00:38 -0400, Jim Jewett <[EMAIL PROTECTED]> wrote: >> [snip] >> >>It does sound like we need a way to get to the original bytes, similar >>to sys.stdin.buffer. Is it reasonable to expose sys.argv.buffer? >>(Since this would be bytes rather than

Re: [Python-3000] Unicode and OS strings

2007-09-21 Thread Jim Jewett
On 9/21/07, Paul Moore <[EMAIL PROTECTED]> wrote: > On 21/09/2007, Jim Jewett <[EMAIL PROTECTED]> wrote: > > (Outside ASCII), if you treat sys.argv as text, that is probably > > impossible without filesystem support. Before python even sees the > > data, the terminal itself is allowed to change be

Re: [Python-3000] Unicode and OS strings

2007-09-21 Thread Jean-Paul Calderone
On Fri, 21 Sep 2007 10:00:38 -0400, Jim Jewett <[EMAIL PROTECTED]> wrote: > [snip] > >It does sound like we need a way to get to the original bytes, similar >to sys.stdin.buffer. Is it reasonable to expose sys.argv.buffer? >(Since this would be bytes rather than text, I assume this would be a >sin

Re: [Python-3000] Unicode and OS strings

2007-09-21 Thread Jim Jewett
On 9/18/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > On 9/18/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > > ... given that defenc is now always UTF-8, won't exposing > > it in the public typedef then just be an attractive nuisance? > *ALL* fields of the struct def are strictly internal. Is t

Re: [Python-3000] Unicode and OS strings

2007-09-21 Thread Paul Moore
On 21/09/2007, Jim Jewett <[EMAIL PROTECTED]> wrote: > (Outside ASCII), if you treat sys.argv as text, that is probably > impossible without filesystem support. Before python even sees the > data, the terminal itself is allowed to change between canonical > equivalents, which have different binary

Re: [Python-3000] Unicode and OS strings

2007-09-21 Thread Jim Jewett
On 9/18/07, James Y Knight <[EMAIL PROTECTED]> wrote: > On Sep 18, 2007, at 11:11 AM, Guido van Rossum wrote: > One of the more common things to do with command line arguments is > open them. So, it'd really be nice if: > python -c 'import sys; open(sys.argv[1])' [some filename] > would always w

Re: [Python-3000] Unicode and OS strings

2007-09-20 Thread martin
> On Linux, filenames are *byte* string and not *character* string. That's not true, although this is a wide-spread misunderstanding. The POSIX standard defines that the file names must be a superset of the portable character set, which includes things such as '/', which is the path separator. >

Re: [Python-3000] Unicode and OS strings

2007-09-19 Thread Stephen J. Turnbull
Victor Stinner writes: > On Thursday 13 September 2007 18:22:12 Marcin 'Qrczak' Kowalczyk wrote: > > What should happen when a command line argument or an environment > > variable is not decodable using the system encoding (on Unix where > > from the OS point of view it is an array of bytes)?

Re: [Python-3000] Unicode and OS strings

2007-09-19 Thread Victor Stinner
Hi, On Thursday 13 September 2007 18:22:12 Marcin 'Qrczak' Kowalczyk wrote: > What should happen when a command line argument or an environment > variable is not decodable using the system encoding (on Unix where > from the OS point of view it is an array of bytes)? On Linux, filenames are *byte*

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread Stephen J. Turnbull
James Y Knight writes: > iso-2022 or some other abomination. This has upsides (simple, doesn't > trample on PUA codepoints, only needs one new codec, never throws > exception in the above example, and really is correct much of the > time), and downsides (if the system locale is iso-2022,

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread Guido van Rossum
On 9/18/07, James Y Knight <[EMAIL PROTECTED]> wrote: > > On Sep 18, 2007, at 11:11 AM, Guido van Rossum wrote: > > If they contain > > non-ASCII bytes I am currently in favor of doing a best-effort > > decoding using the default locale encoding, replacing errors with '?' > > rather than throwing e
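[Archive note] The objection to best-effort decoding with replacement is that it loses information. The sketch below illustrates this with Python's standard `"replace"` error handler, which substitutes U+FFFD rather than a literal '?'; the file name `caf\xe9.txt` is an invented example, not taken from the thread.

```python
raw_name = b"caf\xe9.txt"   # Latin-1 bytes; not valid UTF-8 (hypothetical name)

# Best-effort decode: the undecodable byte becomes U+FFFD.
lossy = raw_name.decode("utf-8", errors="replace")

# Encoding the result again does not give back the original bytes,
# so the name no longer identifies the file it came from.
round_tripped = lossy.encode("utf-8")

print(ascii(lossy))               # 'caf\ufffd.txt'
print(round_tripped == raw_name)  # False
```

A program that decoded an argument this way and then tried to open it would ask the OS for a different name than the one the user passed.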

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread James Y Knight
On Sep 18, 2007, at 11:11 AM, Guido van Rossum wrote: > If they contain > non-ASCII bytes I am currently in favor of doing a best-effort > decoding using the default locale encoding, replacing errors with '?' > rather than throwing exception. One of the more common things to do with command line

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread Guido van Rossum
On 9/18/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > On 9/18/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > > On 9/18/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > > > On 9/18/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > > > > > There's no UTF-8 in Python's internal string encoding. > > > >

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread Jim Jewett
On 9/18/07, Guido van Rossum <[EMAIL PROTECTED]> wrote: > On 9/18/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > > On 9/18/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > > > There's no UTF-8 in Python's internal string encoding. > > (At least as of a few days ago) > > In Python 3 there is; st

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread Guido van Rossum
On 9/18/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > Guido has stated that the > internal representation used by Python strings is a sequence of > Unicode code units, not characters. I don't think that's reached the > status of "pronouncement" yet, but you will probably need a PEP to get >
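[Archive note] The difference between code units and characters (code points) can be illustrated with a character outside the Basic Multilingual Plane. This is a sketch run against a current Python 3, where `len()` counts code points; on the narrow builds common in 2007, `len()` would have reported UTF-16 code units instead.

```python
text = "a\U0001d11e"   # 'a' plus MUSICAL SYMBOL G CLEF (outside the BMP)

code_points = len(text)                           # code points in the string
utf16_units = len(text.encode("utf-16-le")) // 2  # 16-bit code units
utf8_units = len(text.encode("utf-8"))            # 8-bit code units

print(code_points)   # 2
print(utf16_units)   # 3 (the clef needs a surrogate pair)
print(utf8_units)    # 5 (the clef needs four bytes)
```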

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread Guido van Rossum
On 9/18/07, Jim Jewett <[EMAIL PROTECTED]> wrote: > On 9/18/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > > > There's no UTF-8 in Python's internal string encoding. What are you > > talking about? > > (At least as of a few days ago) > > In Python 3 there is; strings are unicode. A PyUnicod

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread Jim Jewett
On 9/18/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > There's no UTF-8 in Python's internal string encoding. What are you > talking about? (At least as of a few days ago) In Python 3 there is; strings are unicode. A PyUnicodeObject object has two encodings that you can grab from a point

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread Stephen J. Turnbull
> "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: >> > This is wrong: UTF-8 is specified for PUA. PUA is not special from the >> > point of view of UTF-8. > >> It is from the point of view of the Unicode standard, specifically v5. >> Please see section 16.5, especially about the

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread Guido van Rossum
On 9/17/07, Stephen J. Turnbull <[EMAIL PROTECTED]> wrote: > Note that some people are currently arguing that sys.argv should be an > array of bytes objects, and Guido has not yet said "no". Then let me say "no" now. I'd be happy to support a lower-level API for getting at the actual bytes in the

Re: [Python-3000] Unicode and OS strings

2007-09-18 Thread Marcin 'Qrczak' Kowalczyk
On Tue, 18 Sep 2007 at 13:08 +0900, Stephen J. Turnbull wrote: > > This is wrong: UTF-8 is specified for PUA. PUA is not special from the > > point of view of UTF-8. > > It is from the point of view of the Unicode standard, specifically v5. > Please see section 16.5, especially abou

Re: [Python-3000] Unicode and OS strings

2007-09-17 Thread Stephen J. Turnbull
> "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: >> When a codec encounters something it can't handle, whether it's a >> valid character in a legacy encoding, a private use character in a >> UTF, or an invalid sequence of code units, it throws an exception >> specifying the charac

Re: [Python-3000] Unicode and OS strings

2007-09-17 Thread Stephen J. Turnbull
> "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: >> > Well, for any scheme which attempts to modify UTF-8 by accepting >> > arbitrary byte strings is used, *something* must be interpreted >> > differently than in real UTF-8. >> Wrong. In my scheme everything ends up in the PUA

Re: [Python-3000] Unicode and OS strings

2007-09-17 Thread Mike Klaas
On 16-Sep-07, at 4:03 PM, Greg Ewing wrote: > Paul Moore wrote: >> On 15/09/2007, Gregory P. Smith <[EMAIL PROTECTED]> wrote: >> >>> similarly for the environment. os.environ dict >>> should be bytes object keys and values >> >> You can't have bytes as keys - the type isn't hashable... > > Has th

Re: [Python-3000] Unicode and OS strings

2007-09-17 Thread Marcin 'Qrczak' Kowalczyk
On Sun, 16 Sep 2007 at 16:13 +0900, Stephen J. Turnbull wrote: > When a codec encounters something it can't handle, whether it's a > valid character in a legacy encoding, a private use character in a > UTF, or an invalid sequence of code units, it throws an exception > specifying the c

Re: [Python-3000] Unicode and OS strings

2007-09-17 Thread Martin v. Löwis
> Yes. I'm recovering from moving from Japan to California, and will be > busy until the beginning of October, I'll get started on it then. For > this kind of thing, what is the deadline for submission of a patch? > Before the alpha, early beta? Either would work fine, unless somebody else does

Re: [Python-3000] Unicode and OS strings

2007-09-17 Thread Marcin 'Qrczak' Kowalczyk
On Sat, 15 Sep 2007 at 09:13 +0900, Stephen J. Turnbull wrote: > > Well, for any scheme which attempts to modify UTF-8 by accepting > > arbitrary byte strings is used, *something* must be interpreted > > differently than in real UTF-8. > > Wrong. In my scheme everything ends up i

Re: [Python-3000] Unicode and OS strings

2007-09-16 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > > The basic idea is to allocate code points in private space as-needed. > > Ok, thanks. Would you be interested in implementing that scheme? Yes. I'm recovering from moving from Japan to California, and will be busy until the beginning of October, I'll get started

Re: [Python-3000] Unicode and OS strings

2007-09-16 Thread Greg Ewing
Paul Moore wrote: > On 15/09/2007, Gregory P. Smith <[EMAIL PROTECTED]> wrote: > >>similarly for the environment. os.environ dict >>should be bytes object keys and values > > You can't have bytes as keys - the type isn't hashable... Has there been any consensus reached yet on whether there will

Re: [Python-3000] Unicode and OS strings

2007-09-16 Thread Greg Ewing
Gregory P. Smith wrote: > argv is the C/C++ name for bytes, let's not > confuse people. C has never made a clear distinction between characters and bytes, using the type 'char' for both. It got away with it for the same reason that Python did until unicode came along. I'm pretty sure most people us

Re: [Python-3000] Unicode and OS strings

2007-09-16 Thread Gregory P. Smith
On 9/16/07, Paul Moore <[EMAIL PROTECTED]> wrote: > On 16/09/2007, Fred Drake <[EMAIL PROTECTED]> wrote: > > On Sep 15, 2007, at 10:00 PM, Nicholas Bastin wrote: > > > Then let's stop beating around the bush and implement an immutable > > > bytes type. Why put ourselves through contortions trying t

Re: [Python-3000] Unicode and OS strings

2007-09-16 Thread Paul Moore
On 16/09/2007, Fred Drake <[EMAIL PROTECTED]> wrote: > On Sep 15, 2007, at 10:00 PM, Nicholas Bastin wrote: > > Then let's stop beating around the bush and implement an immutable > > bytes type. Why put ourselves through contortions trying to jam a > > square peg into a round hole and not just deci

Re: [Python-3000] Unicode and OS strings

2007-09-16 Thread Martin v. Löwis
> The basic idea is to allocate code points in private space as-needed. Ok, thanks. Would you be interested in implementing that scheme? Regards, Martin

Re: [Python-3000] Unicode and OS strings

2007-09-16 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > > What I'm suggesting is to provide a way for processes to record and > > communicate that information without needing to provide a "source > > encoding" slot for strings, and which is able to handle strings > > containing unrecognized (including corrupt) characters

Re: [Python-3000] Unicode and OS strings

2007-09-15 Thread Fred Drake
On Sep 15, 2007, at 10:00 PM, Nicholas Bastin wrote: > Then let's stop beating around the bush and implement an immutable > bytes type. Why put ourselves through contortions trying to jam a > square peg into a round hole and not just decide to make a round peg? +42 -Fred -- Fred Drake

Re: [Python-3000] Unicode and OS strings

2007-09-15 Thread Bill Janssen
> > You can't have bytes as keys - the type isn't hashable... > > That's why people keep arguing for an immutable bytes types. I keep > seeing long discussions that end up with a tortured mechanism for making > the keys unicode. Why don't we just bite the bullet and make things > easier and have

Re: [Python-3000] Unicode and OS strings

2007-09-15 Thread Nicholas Bastin
On 9/15/07, Paul Moore <[EMAIL PROTECTED]> wrote: > On 15/09/2007, Gregory P. Smith <[EMAIL PROTECTED]> wrote: > > similarly for the environment. os.environ dict > > should be bytes object keys and values > > You can't have bytes as keys - the type isn't hashable... Then let's stop beating around

Re: [Python-3000] Unicode and OS strings

2007-09-15 Thread Aahz
On Sat, Sep 15, 2007, Paul Moore wrote: > On 15/09/2007, Gregory P. Smith <[EMAIL PROTECTED]> wrote: >> >> similarly for the environment. os.environ dict >> should be bytes object keys and values > > You can't have bytes as keys - the type isn't hashable... That's why people keep arguing for an

Re: [Python-3000] Unicode and OS strings

2007-09-15 Thread Aahz
On Fri, Sep 14, 2007, "Martin v. Löwis" wrote: >Hagen: >> >> And what if we skillfully conserve unknown bytes in a private use or >> surrogate area and the application author actually knows the encoding >> and wants correctly decoded strings? > > They can easily roundtrip that then to the encodi

Re: [Python-3000] Unicode and OS strings

2007-09-15 Thread Gregory P. Smith
On 9/15/07, Paul Moore <[EMAIL PROTECTED]> wrote: > On 15/09/2007, Gregory P. Smith <[EMAIL PROTECTED]> wrote: > > similarly for the environment. os.environ dict > > should be bytes object keys and values > > You can't have bytes as keys - the type isn't hashable... ugh, yeah. as much as i hate

Re: [Python-3000] Unicode and OS strings

2007-09-15 Thread Paul Moore
On 15/09/2007, Gregory P. Smith <[EMAIL PROTECTED]> wrote: > similarly for the environment. os.environ dict > should be bytes object keys and values You can't have bytes as keys - the type isn't hashable... Paul
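[Archive note] The hashability complaint applied to the mutable bytes type under discussion at the time. In Python 3 as eventually released, `bytes` is immutable and hashable, while the mutable variant is `bytearray`. The sketch below shows the difference; the environment-style keys are invented for illustration.

```python
# Immutable bytes objects are hashable and work as dictionary keys.
env = {b"PATH": b"/usr/bin"}
lookup = env[b"PATH"]

# The mutable counterpart is not hashable, so it cannot be a key.
try:
    {bytearray(b"PATH"): b"/usr/bin"}
    error = None
except TypeError as exc:
    error = str(exc)

print(lookup)   # b'/usr/bin'
print(error)    # message mentions that bytearray is unhashable
```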

Re: [Python-3000] Unicode and OS strings

2007-09-15 Thread Gregory P. Smith
On 9/14/07, Greg Ewing <[EMAIL PROTECTED]> wrote: > Hagen Fürstenau wrote: > > sys.argv could be of type bytes and sys.arguments (or whatever) could be > > a function taking an encoding parameter (which defaults to UTF-8) and > > returning strings. > > > > Of course that's backwards incompatible an

Re: [Python-3000] Unicode and OS strings

2007-09-15 Thread Hagen Fürstenau
>> sys.argv could be of type bytes and sys.arguments (or whatever) could be >> a function taking an encoding parameter (which defaults to UTF-8) and >> returning strings. >> > It would be pretty disruptive to ask everyone to change > their habit of thinking of sys.argv as a list of strings. The

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Martin v. Löwis
> What I'm suggesting is to provide a way for processes to record and > communicate that information without needing to provide a "source > encoding" slot for strings, and which is able to handle strings > containing unrecognized (including corrupt) characters from multiple > source encodings. Can

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Stephen J. Turnbull
Greg Ewing writes: > Stephen J. Turnbull wrote: > > You chose the context of round-tripping *across > > encodings*, not me. Please stick with your context. > > Maybe we have different ideas of what the problem is. I thought > the problem is to take arbitrary byte sequences coming in as >

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Stephen J. Turnbull
Hagen Fürstenau writes: > And what if we skillfully conserve unknown bytes in a private use or > surrogate area and the application author actually knows the encoding > and wants correctly decoded strings? This is what my proposal would do, but my proposal would return a string, not by

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Stephen J. Turnbull
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: >> And it *is* needed, because these characters by assumption >> are not present in Unicode at all. (More precisely, they may be >> present, but the tables we happen to have don't have mappings for >> them.) > They are present! For UTF

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Guido van Rossum
On 9/14/07, Greg Ewing <[EMAIL PROTECTED]> wrote: > Guido van Rossum wrote: > > Great idea, but sys.argv doesn't need to be magic for this approach to work. > > Are you sure? I thought part of the problem was that > if an argv entry couldn't be decoded, you got an error > too soon to do anything ab

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Greg Ewing
Guido van Rossum wrote: > Great idea, but sys.argv doesn't need to be magic for this approach to work. Are you sure? I thought part of the problem was that if an argv entry couldn't be decoded, you got an error too soon to do anything about it. Making sys.argv lazy would avoid that. -- Greg

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Guido van Rossum
On 9/14/07, Greg Ewing <[EMAIL PROTECTED]> wrote: > It would be pretty disruptive to ask everyone to change > their habit of thinking of sys.argv as a list of strings. Indeed. > I would suggest doing it the other way around -- have > sys.argv be an object that automatically converts to > unicode

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Greg Ewing
Hagen Fürstenau wrote: > sys.argv could be of type bytes and sys.arguments (or whatever) could be > a function taking an encoding parameter (which defaults to UTF-8) and > returning strings. > > Of course that's backwards incompatible and I'm not sure if it's too > late for something like this

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Greg Ewing
Stephen J. Turnbull wrote: > You chose the context of round-tripping *across > encodings*, not me. Please stick with your context. Maybe we have different ideas of what the problem is. I thought the problem is to take arbitrary byte sequences coming in as command-line args and represent them as u

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Jim Jewett
On 9/14/07, Hagen Fürstenau <[EMAIL PROTECTED]> wrote: > Is it too unreasonable to keep the byte strings we get from the OS as > byte strings in Python (since we're not sure about their encoding) and > offer functions for getting strings? > sys.argv could be of type bytes and sys.arguments (or wha

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Hagen Fürstenau
> They can easily roundtrip that then to the encoding that it should have: > > good_string = sys.argv[bad_string_index].\ >encode(sys.argv_encoding, "pua-replace").decode(real_encoding) To me this doesn't look easier than sys.arguments() in the standard case and sys.arguments(encoding="whate
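[Archive note] The round trip quoted above uses names from the proposal (`sys.argv_encoding`, the `"pua-replace"` handler) that are hypothetical. The same idea can be sketched with the `"surrogateescape"` error handler that the standard library provides today; this is a stand-in for the proposed handler, not the mechanism described in the thread. The helper name `redecode` and the sample bytes are invented for illustration.

```python
def redecode(arg: str, assumed_encoding: str, real_encoding: str) -> str:
    """Recover the original bytes of `arg`, then decode them correctly."""
    raw = arg.encode(assumed_encoding, errors="surrogateescape")
    return raw.decode(real_encoding)


raw_arg = b"caf\xe9"   # Latin-1 bytes handed to a program that assumes UTF-8

# What the program sees: the bad byte is smuggled through as U+DCE9.
as_seen = raw_arg.decode("utf-8", errors="surrogateescape")

# An author who knows the real encoding can still get the right string.
fixed = redecode(as_seen, "utf-8", "latin-1")

print(ascii(as_seen))   # 'caf\udce9'
print(ascii(fixed))     # 'caf\xe9'
```

The application author pays the cost of the extra step only in the unusual case where the real encoding differs from the assumed one.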

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Martin v. Löwis
> Are you sure that "strings in an unknown encoding" are conceptually > strings and not rather bytes? For file names, most definitely. For command line arguments, I am fairly sure: the argc/argv calling convention does not allow for arbitrary bytes. > And what if we skillfully conserve unknown by

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Hagen Fürstenau
> That is not a concern. However, it is fundamentally the wrong thing to > do. Most people rightfully view command line arguments and file names > as strings, as they use the keyboard to enter them, and the computer > uses letters from a font to display them. They are not bytes > conceptually - the

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Martin v. Löwis
> Is it too unreasonable to keep the byte strings we get from the OS as > byte strings in Python (since we're not sure about their encoding) and > offer functions for getting strings? I think people will complain if command line arguments aren't strings, and they will complain even more so if fi

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Barry Warsaw
On Sep 14, 2007, at 5:15 AM, Hagen Fürstenau wrote: > Is it too unreasonable to keep the byte strings we get from the OS as > byte strings in Python (since we're not sure about their encoding) and > offer functions for getting strings? > > sys.argv co

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Barry Warsaw
On Sep 14, 2007, at 1:08 AM, Greg Ewing wrote: > Stephen J. Turnbull wrote: >> You can't win that, because Unicode is the only encoding that >> attempts >> to guarantee even the possibility of round-tripping. > > Rubbish -- I can do print [ord(c) fo

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Hagen Fürstenau
Is it too unreasonable to keep the byte strings we get from the OS as byte strings in Python (since we're not sure about their encoding) and offer functions for getting strings? sys.argv could be of type bytes and sys.arguments (or whatever) could be a function taking an encoding parameter (whi

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Stephen J. Turnbull
Greg Ewing writes: > Stephen J. Turnbull wrote: > > You can't win that, because Unicode is the only encoding that attempts > > to guarantee even the possibility of round-tripping. > > Rubbish -- I can do print [ord(c) for c in my_unicode_string] > and get perfect round-trippability if I wan

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Marcin 'Qrczak' Kowalczyk
On Thu, 13 Sep 2007 at 23:41 -0400, James Y Knight wrote: > Here's a suggestion I made on the SBCL dev list a while back, in > response to the same issues. After a second thought, this (escaping undecodable UTF-8 bytes by unpaired low surrogates) might be a good idea. (I don't rem

Re: [Python-3000] Unicode and OS strings

2007-09-14 Thread Marcin 'Qrczak' Kowalczyk
On Fri, 14 Sep 2007 at 15:02 +0900, Stephen J. Turnbull wrote: > > PUA already has a representation in UTF-8, so this is more incompatible > > with UTF-8 than needed, > > Hm? It's not incompatible at all, and we're not interested in a > representation in UTF-8, but rather in UTF-1

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Stephen J. Turnbull
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: >> This means that a way of handling such points is very useful, and >> as long as there's enough PUA space, the approach I suggested can >> handle all of these various issues. > PUA already has a representation in UTF-8, so this is more

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Greg Ewing
Stephen J. Turnbull wrote: > You can't win that, because Unicode is the only encoding that attempts > to guarantee even the possibility of round-tripping. Rubbish -- I can do print [ord(c) for c in my_unicode_string] and get perfect round-trippability if I want. You can ask people to use pre-exis

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Stephen J. Turnbull
Greg Ewing writes: > Stephen J. Turnbull wrote: > > What should happen internally is that all undecodable characters > > (which PUA characters are by definition for standard codecs) are > > mapped to unused codepoints in the PUA, chosen by Python. > > You mean chosen dynamically? Yes. >

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Greg Ewing
Stephen J. Turnbull wrote: > What should > happen internally is that all undecodable characters (which PUA > characters are by definition for standard codecs) are mapped to unused > codepoints in the PUA, chosen by Python. You mean chosen dynamically? What happens if these PUA characters get encod
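[Archive note] The idea of allocating private-use code points as needed can be sketched as follows. This is a simplified illustration, not the scheme as proposed: it maps single undecodable bytes (rather than characters of a legacy encoding) to code points in U+E000..U+F8FF, and the class name `PuaAllocator` and its methods are invented here. It also shows the ambiguity raised in the thread: input that already contains PUA characters is not distinguished from allocated ones.

```python
class PuaAllocator:
    """Hands out private-use code points for undecodable bytes on demand."""

    def __init__(self, first=0xE000, last=0xF8FF):
        self._next = first
        self._last = last
        self._byte_to_char = {}
        self._char_to_byte = {}

    def _char_for(self, byte):
        # Allocate a new PUA code point the first time a byte is seen.
        if byte not in self._byte_to_char:
            if self._next > self._last:
                raise RuntimeError("private-use area exhausted")
            ch = chr(self._next)
            self._next += 1
            self._byte_to_char[byte] = ch
            self._char_to_byte[ch] = byte
        return self._byte_to_char[byte]

    def decode(self, raw, encoding="utf-8"):
        parts = []
        pos = 0
        while pos < len(raw):
            try:
                parts.append(raw[pos:].decode(encoding))
                break
            except UnicodeDecodeError as exc:
                bad = pos + exc.start          # index of the offending byte
                parts.append(raw[pos:bad].decode(encoding))
                parts.append(self._char_for(raw[bad]))
                pos = bad + 1
        return "".join(parts)

    def encode(self, text, encoding="utf-8"):
        out = bytearray()
        for ch in text:
            if ch in self._char_to_byte:       # an allocated placeholder
                out.append(self._char_to_byte[ch])
            else:
                out += ch.encode(encoding)
        return bytes(out)


alloc = PuaAllocator()
raw = b"ab\xffcd\xfe\xff"
text = alloc.decode(raw)

print(ascii(text))                 # 'ab\ue000cd\ue001\ue000'
print(alloc.encode(text) == raw)   # True
```

Because the mapping lives in the allocator object, the placeholders only round-trip within one process; passing the string to another process would lose the association, which is one of the objections raised in the replies.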

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread James Y Knight
On Sep 13, 2007, at 12:22 PM, Marcin 'Qrczak' Kowalczyk wrote: > What should happen when a command line argument or an environment > variable is not decodable using the system encoding (on Unix where > from the OS point of view it is an array of bytes)? Here's a suggestion I made on the SBCL dev l

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Marcin 'Qrczak' Kowalczyk
On Fri, 14 Sep 2007 at 06:12 +0900, Stephen J. Turnbull wrote: > This means that a way of handling such points > is very useful, and as long as there's enough PUA space, the approach > I suggested can handle all of these various issues. PUA already has a representation in UTF-8, so t

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Stephen J. Turnbull
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: >> Of course, if the input data already contains PUA characters, >> there would be an ambiguity. We can rule this out for most codecs, >> as they don't support PUA characters. The major exception would >> be UTF-8, > Most codecs other t

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Marcin 'Qrczak' Kowalczyk
On Thu, 13 Sep 2007 at 19:08 +0200, "Martin v. Löwis" wrote: > Of course, if the input data already contains PUA characters, > there would be an ambiguity. We can rule this out for most codecs, > as they don't support PUA characters. The major exception would > be UTF-8, Most codecs

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Martin v. Löwis
> > We would make a list of all interfaces that use the PUA error > > handler: file names, environment variables, command line > > arguments. > > In general, I don't consider this an error. I don't, either. However, given the current codec design, this is the least intrusive way to enhance "al

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Stephen J. Turnbull
"Martin v. Löwis" writes: > One "universal" solution is to use Unicode private-use-area > characters. +1 > Of course, if the input data already contains PUA characters, > there would be an ambiguity. That may be true in the implementation, but it shouldn't. What should happen internally i

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Martin v. Löwis
> Yes, I have noticed this too. Environment variables, command line > arguments, locale properties, TZ names, and so on, are often given as > 8-bit strings in who knows what encoding. I'm not sure what the > solution is, but we need one. One "universal" solution is to use Unicode private-use-area

Re: [Python-3000] Unicode and OS strings

2007-09-13 Thread Guido van Rossum
Yes, I have noticed this too. Environment variables, command line arguments, locale properties, TZ names, and so on, are often given as 8-bit strings in who knows what encoding. I'm not sure what the solution is, but we need one. I'm guessing one thing we need to do is research how various systems

[Python-3000] Unicode and OS strings

2007-09-13 Thread Marcin 'Qrczak' Kowalczyk
What should happen when a command line argument or an environment variable is not decodable using the system encoding (on Unix where from the OS point of view it is an array of bytes)? This is an unfortunate side effect of switching to Unicode. It's unfortunate because often the data is only passe