Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Thomas Breuel
On Wed, Apr 29, 2009 at 23:03, Terry Reedy tjre...@udel.edu wrote:

 Thomas Breuel wrote:


Sure. However, that requires you to provide meaningful, reproducible
counter-examples, rather than a stenographic formulation that might
hint some problem you apparently see (which I believe is just not
there).


 Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
 surrogates.


 By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows
 that.


If we use conformance to Unicode 5.1 as the basis for our discussion, then
PEP 383 is off the table anyway.  I'm all for strict Unicode compliance.
But apparently, the Python community doesn't care.

CESU-8 is described in Unicode Technical Report #26, so it at least has some
official recognition.  More importantly, it's also widely used.  So, my
question: what are the implications of PEP 383 for CESU-8 encodings on
Python?
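
To make the CESU-8 question concrete, here is a minimal Python 3 sketch.  It
assumes the `surrogatepass` and `surrogateescape` error handlers that later
Python 3 releases provide (neither existed when this message was written), and
`cesu8_encode` is a hypothetical helper, not a standard-library function.  It
only illustrates that CESU-8 output consists of UTF-8-style encodings of
surrogate halves, which a strict UTF-8 decoder rejects.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode text as CESU-8 (UTR #26)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split a supplementary code point into a UTF-16 surrogate pair,
            # then encode each half as a three-byte UTF-8-style sequence.
            cp -= 0x10000
            pair = chr(0xD800 + (cp >> 10)) + chr(0xDC00 + (cp & 0x3FF))
            out += pair.encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


cesu = cesu8_encode("\U0001F600")

# A strict UTF-8 decoder refuses the encoded half surrogates.
try:
    cesu.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape the bytes still round-trip unchanged.
escaped = cesu.decode("utf-8", "surrogateescape")
roundtrip = escaped.encode("utf-8", "surrogateescape")
```

Whether a round trip of the raw bytes is enough for code that expects CESU-8
input to decode into real surrogate pairs is exactly the open question.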

My meta-point is: there are probably many more such issues hidden away and
it is a really bad idea to rush something like PEP 383 out.  Unicode is hard
anyway, and tinkering with its semantics requires a lot of thought.

Tom
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-30 Thread Thomas Breuel
On Thu, Apr 30, 2009 at 05:40, Curt Hagenlocher c...@hagenlocher.org wrote:

  IronPython will inherit whatever behavior Mono has implemented. The
 Microsoft CLR defines the native string type as UTF-16 and all of the
 managed APIs for things like file names and environmental variables
 operate on UTF-16 strings -- there simply are no byte string APIs.


Yes.  Now think about the implications.  This means that adopting PEP 383
will make IronPython and Jython running on UNIX intrinsically incompatible
with CPython running on UNIX, and there's no way to fix that.

Tom


[Python-Dev] what Windows and Linux really do Re: PEP 383 (again)

2009-04-30 Thread Thomas Breuel
Given the stated rationale of PEP 383, I was wondering what Windows actually
does.  So, I created some ISO8859-15 and ISO8859-8 encoded file names on a
device, plugged them into my Windows Vista machine, and fired up Python 3.0.

First, os.listdir("f:") returns a list of strings for those file names...
but those unicode strings are illegal.

You can't even print them without getting an error from Python.  In fact,
you also can't print strings containing the proposed half-surrogate
encodings either: in both cases, the output encoder rejects them with a
UnicodeEncodeError.   (If not even Python, with its generally lenient
attitude, can print those things, some other libraries probably will fail,
too.)
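
The output-encoder failure described above can be reproduced in a few lines.
This is only a sketch: it assumes Python 3 with the `surrogateescape` error
handler that PEP 383 eventually introduced, and the file name is made up for
illustration.

```python
# A file name whose ISO 8859-15 byte 0xE9 is not valid UTF-8.
raw = b"caf\xe9.txt"
name = raw.decode("utf-8", "surrogateescape")

# The strict UTF-8 encoder used for output rejects the lone surrogate.
try:
    name.encode("utf-8")
    printable = True
except UnicodeEncodeError:
    printable = False

# One way to display such a name anyway is to escape it visibly.
shown = name.encode("ascii", "backslashreplace").decode("ascii")
```

Any library that hands the string to a strict encoder would presumably hit
the same error unless it escapes the name first.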

What about round tripping? So, if you take a malformed file name from an
external device (say, because it was actually encoded iso8859-15 or East
Asian) and write it to an NTFS directory, it seems to write malformed UTF-16
file names.  In essence, Windows doesn't really use unicode, it just
implements 16bit raw character strings, just like UNIX historically
implements raw 8bit character strings.

Then I tried the same thing on my Ubuntu 9.04 machine.  It turns out that,
unlike Windows, Linux seems to be moving to consistent use of valid
UTF-8.  If you plug in an external device and nothing else is known about
it, it gets mounted with the utf8 option and the kernel actually seems to
enforce UTF-8 encoding.   I think this calls into question the rationale
behind PEP 383, and we should first look into what the roadmap for
UNIX/Linux and UTF-8 actually is.  UNIX may have consistent unicode support
(via UTF-8) before Windows.

As I was saying, I think PEP 383 needs a lot more thought and research...

Tom


Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-30 Thread Thomas Breuel
  Yes.  Now think about the implications.  This means that adopting PEP
  383 will make IronPython and Jython running on UNIX intrinsically
  incompatible with CPython running on UNIX, and there's no way to fix
 that.

 *Not* adopting the PEP will also make CPython and IronPython
 incompatible, and there's no way to fix that.


CPython and IronPython are incompatible.  And they will stay incompatible if
the PEP is adopted.

They would become compatible if CPython adopted Mono and/or Java semantics.


Since both have had to deal with this, have you looked at what they actually
do before proposing PEP 383?  What did you find?  Why did you choose an
incompatible approach for PEP 383?

Tom


Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-30 Thread Thomas Breuel

  Since both have had to deal with this, have you looked at what they
  actually do before proposing PEP 383?  What did you find?

 See

 http://mail.python.org/pipermail/python-3000/2007-September/010450.html


Thanks, that's very useful.


  Why did you choose an incompatible approach for PEP 383?

 Because in Python, we want to be able to access all files on disk.
 Neither Java nor Mono are capable of doing that.


OK, so what's wrong with os.listdir() and similar functions returning
unicode strings for names that decode correctly, and byte strings for names
that are not valid unicode?

The file I/O functions already seem to deal with byte strings correctly, you
never get byte strings on platforms that are fully unicode, and they are
well supported.
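
A minimal sketch of that suggestion, assuming Python 3 and a UTF-8 file system
encoding.  Both `decode_or_keep` and `listdir_mixed` are hypothetical names
used only for illustration; nothing like them exists in the os module.

```python
import os
import tempfile


def decode_or_keep(raw: bytes, encoding: str = "utf-8"):
    """Return str when raw decodes cleanly, otherwise the original bytes."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return raw


def listdir_mixed(path: bytes, encoding: str = "utf-8"):
    """Hypothetical listdir variant that mixes str and bytes results."""
    # os.listdir() returns bytes names when it is given a bytes path.
    return [decode_or_keep(raw, encoding) for raw in os.listdir(path)]


with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "plain.txt"), "w").close()
    names = listdir_mixed(os.fsencode(tmp))
```

The cost of this design is that every caller has to be prepared to receive
either type in the same list.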

Tom


Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-30 Thread Thomas Breuel
On Thu, Apr 30, 2009 at 12:32, Martin v. Löwis mar...@v.loewis.de wrote:

  OK, so what's wrong with os.listdir() and similar functions returning a
  unicode string for strings that correctly encode/decode, and with byte
  strings for strings that are not valid unicode?

 See http://bugs.python.org/issue3187
 in particular msg71655


Why didn't you point to that discussion from PEP 383?  And why didn't
you point to Kowalczyk's message on encodings in Mono, Java, etc. from the
PEP?  You could have saved us all a lot of time.

Under the set of constraints that Guido imposes, plus the requirement that
round-trip works for illegal encodings, there is no other solution than PEP
383.  That doesn't make PEP 383 right--I still think it's a bad
decision--but it makes it pointless to discuss it any further.

Tom


Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-30 Thread Thomas Breuel

 Java is not capable of doing that.  Mono, as I keep pointing out, is. It
 uses NULLs to escape invalid UNIX filenames.  Please see:

 http://go-mono.com/docs/index.aspx?link=T%3AMono.Unix.UnixEncoding

 The upshot to all this is that Mono.Unix and Mono.Unix.Native can list,
 access, and open all files on your filesystem, regardless of encoding.


OK, so why not adopt the Mono solution in CPython?  It seems to produce
valid unicode strings, removing at least one issue with PEP 383.  It also
means that IronPython and CPython actually would be compatible.

Tom


Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)

2009-04-30 Thread Thomas Breuel
On Thu, Apr 30, 2009 at 10:21, Martin v. Löwis mar...@v.loewis.de wrote:

 Thomas Breuel wrote:
  Given the stated rationale of PEP 383, I was wondering what Windows
  actually does.  So, I created some ISO8859-15 and ISO8859-8 encoded file
  names on a device, plugged them into my Windows Vista machine, and fired
  up Python 3.0.

 How did you do that, and what were the specific names that you
 had chosen?


There are several different ways I tried it.  The easiest was to mount a
vfat file system with various encodings on Linux and use the Python byte
interface to write file names, then plug that flash drive into Windows.


 I think you misinterpreted what you saw. To find out what way you
 misinterpreted it, we would have to know what it is that you saw.


I didn't interpret it much at all.  I'm just saying that the PEP 383
assumption that these problems can't occur on Windows isn't true.

I can plug in a flash drive with malformed strings, and somewhere between
the disk and Python, something maps those strings onto unicode in some way,
and it's done in a way that's different from PEP 383.  Mono and Java must
have their own solutions that are different from PEP 383.

My point remains that I think PEP 383 shouldn't be rushed through, and one
should look more carefully first at what the Windows kernel does in these
situations, and what Mono and Java do.

Tom


Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-30 Thread Thomas Breuel

 And then it goes on to say: You won't be able to pass non-Unicode
 filenames as command-line arguments.(*)  Not only that, but you can't
 reliably use such files with System.IO (whatever that is, but it
 sounds pretty basic).  This support is only available within the
 Mono.Unix and Mono.Unix.Native namespaces.  Now, I don't know what
 that means (never having touched Mono), but it doesn't sound like
 it simplifies cross-platform support, which is what PEP 383 is aiming for.


The problem there isn't how the characters are quoted, but that they are
quoted at all, and that the ECMA and Microsoft libraries don't understand
this quoting convention.  Since command line parsing is handled through
ECMA, you happen not to be able to get at those files (that's fixable, but
why bother).

The analogous problem exists with Martin's proposal on Python: if you pass a
unicode string from Python to some library through a unicode API and that
library attempts to open the file, it will fail because it doesn't use the
proposed Python utf-8b decoder.  There just is no way to fix that, no matter
which quoting convention you use.

In contrast to PEP 383, quoting with U+0000 at least results in valid unicode
strings in Python.  And command line arguments (and environment variables
etc.) would work in Python because in Python, those should also use the new
encoding for invalid UTF-8 inputs.

Tom


Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-30 Thread Thomas Breuel


  The upshot to all this is that Mono.Unix and Mono.Unix.Native can list,
  access, and open all files on your filesystem, regardless of encoding.

 I think this is misleading. With Mono 2.0.1, I get


This has nothing to do with how Mono quotes.  The reason for this is that
Mono quotes at all and that the Mono developers decided not to change
System.IO to understand UNIX quoting.

If Mono used PEP 383 quoting, this would fail the same way.

And analogous failures will exist with PEP 383 in Python, because there will
be more and more libraries with unicode interfaces that then use their own
internal decoder (which doesn't understand utf8b) to get a UNIX file name.

Tom


Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-30 Thread Thomas Breuel

 What's an analogous failure? Or, rather, why would a failure analogous
 to the one I got when using System.IO.DirectoryInfo ever exist in
 Python?


Mono.Unix uses an encoder and a decoder that knows about special quoting
rules.  System.IO uses a different encoder and decoder because it's a
reimplementation of a Microsoft library and the Mono developers chose not to
implement Mono.Unix quoting rules in it.  There is nothing technical
preventing System.IO from using the Mono.Unix codec, it's just that the
developers didn't want to change the behavior of an ECMA and Microsoft
library.

The analogous phenomenon will exist in Python with PEP 383.  Let's say I
have a C library with wide character interfaces and I pass it a unicode
string from Python.(*)  That C library now turns that unicode string into
UTF-8 for writing to disk using its internal UTF-8 converter.   The result
is that the file can be opened using Python's open, but it can't be opened
using the other library.  There simply is no way you can guarantee that all
libraries turn unicode strings into pathnames using utf-8b.   I'm not
arguing about whether that's good or bad anymore, since it's obvious that
the only proposal acceptable to Guido uses some form of non-standard
encoding / quoting.
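
The mismatch can be sketched in Python 3, assuming the `surrogateescape`
error handler as the utf-8b codec.  `third_party_bytes` is a hypothetical
stand-in for a library that converts path names with its own strict UTF-8
encoder; it does not model any particular real library.

```python
raw = b"scan-\xff.tif"                          # bytes as stored on disk
name = raw.decode("utf-8", "surrogateescape")   # what Python hands to the program


def python_side_bytes(path: str) -> bytes:
    """What a utf-8b-aware caller would send back to the OS."""
    return path.encode("utf-8", "surrogateescape")


def third_party_bytes(path: str) -> bytes:
    """Hypothetical library that uses its own strict UTF-8 encoder."""
    return path.encode("utf-8")


# Python itself recovers the original file name ...
same_file = python_side_bytes(name) == raw

# ... but the strict converter cannot produce it.
try:
    third_party_bytes(name)
    library_ok = True
except UnicodeEncodeError:
    library_ok = False
```

A real converter might emit different bytes or a replacement character
instead of raising; either way it would not arrive at the same file name.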

I'm simply pointing out that the failure you observed with System.IO has
nothing to do with which quoting convention you choose, but results from the
fact that the developers of System.IO are not using the same encoder/decoder
as Mono.Unix (in that case, by choice).

So, I don't see any reason to prefer your half surrogate quoting to the Mono
U+0000-based quoting.  Both seem to achieve the same goal with respect to
round tripping file names, displaying them, etc., but Mono quoting actually
results in valid unicode strings.  It works because null is the one
character that's not legal in a UNIX path name.

So, why do you prefer half surrogate coding to U+0000 quoting?
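
For comparison, here is a rough Python 3 sketch of NUL-based quoting.  It is
written in the spirit of the Mono approach but is not Mono's code, and the
exact escape layout (a NUL followed by one character carrying the byte value)
is an assumption.  The handler name `nulescape_sketch` and both functions are
hypothetical.

```python
import codecs


def _nul_escape(exc):
    """Decode-error handler: quote each undecodable byte as NUL + chr(byte)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("\x00" + chr(b) for b in bad), exc.end


codecs.register_error("nulescape_sketch", _nul_escape)


def decode_path(raw: bytes) -> str:
    return raw.decode("utf-8", "nulescape_sketch")


def encode_path(text: str) -> bytes:
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        if ch == "\x00":
            # NUL never occurs in a real UNIX path, so it can only be a quote.
            out.append(ord(next(chars)))
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"caf\xe9.txt"
text = decode_path(raw)
restored = encode_path(text)
strict_bytes = text.encode("utf-8")   # succeeds: no lone surrogates involved
```

The quoted string contains only ordinary code points, so strict encoders
accept it, although they would of course not undo the quoting.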

Tom

(*) There's actually a second, subtle issue.  PEP 383 intends utf-8b only to
be used for file names.  But that means that I might have to bind the first
argument to TIFFOpen with utf-8b conversion, while I might have to bind
other arguments with utf-8 conversion.


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-30 Thread Thomas Breuel

 Not for me (I am using Python 2.6.2).

 >>> f = open(chr(255), 'w')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 IOError: [Errno 22] invalid mode ('w') or filename: '\xff'
 


You can get the same error on Linux:

$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
[GCC 4.3.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> f=open(chr(255),'w')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 22] invalid mode ('w') or filename: '\xff'


(Some file system drivers do not enforce valid utf8 yet, but I suspect they
will in the future.)

Tom


Re: [Python-Dev] PEP 383 (again)

2009-04-29 Thread Thomas Breuel
On Wed, Apr 29, 2009 at 07:45, Martin v. Löwis mar...@v.loewis.de wrote:

 Your claim was
 that PEP 383 may have unfortunate effects on Windows,


No, I simply think that PEP 383 is not sufficiently specified to be able to
tell.


 and I'm telling
 you that it won't, because the behavior of Python on Windows won't
 change at all.


A justification for your proposal is that there are differences between
Python on UNIX and Windows that you would like to reduce.  But depending on
where you introduce utf-8b coding on UNIX, you may also have to introduce it
on Windows in order to keep the platforms consistent.

 So whatever the problem - it's there already, and the
 PEP is not going to change it.


OK, so you are saying that under PEP 383, utf-8b wouldn't be used anywhere
on Windows by default.  That's not clear from your proposal.

It's also not clear from your proposal where utf-8b will get used on UNIX
systems.  Some of the places that have been suggested are: open, os.listdir,
sys.argv, os.getenv. There are other potential ones, like print, write, and
os.system.  And what about text file and string conversions: will utf-8b
become the default, or optional, or unavailable?

Each of those choices potentially has significant implications.  I'm just
asking what those choices are so that one can then talk about the
implications and see whether this proposal is a good one or whether other
alternatives are better.

Tom


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-29 Thread Thomas Breuel
 Sure. However, that requires you to provide meaningful, reproducible
 counter-examples, rather than a stenographic formulation that might
 hint some problem you apparently see (which I believe is just not
 there).


Well, here's another one: PEP 383 would disallow UTF-8 encodings of half
surrogates.  But such encodings are currently supported by Python, and they
are used as part of CESU-8 coding.  That's, in fact, a common way of
converting UTF-16 to UTF-8.  How are you going to deal with existing code
that relies on being able to code half surrogates as UTF-8?

Tom


Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-29 Thread Thomas Breuel

 The whole purpose of PEP 383 is to send the exact same bytes that were
 read from the OS back to the OS = violating (2) (for whatever the
 apparent system file-encoding is, not limited to UTF-8),


It's fine to read a file name from a file system and write the same file
back as the same raw byte sequence.  That I don't have a problem with; it's
not quite right, but it's harmless.

The problem with this PEP is that the malformed unicode it produces can end
up in so many other places: as file names on another file system, in string
processing libraries, in text files, in databases, in user interfaces,
etc.   Some of those destinations will use the utf-8b decoder, so they will
get byte sequences that never could occur before and that are illegal under
unicode.

Nobody knows what will happen.  And, yes, Martin is proposing that this is
the default behavior.

There are several other issues that are unresolved: utf-8b makes some
current practices illegal; for example, it might break CESU-8 encodings.
Also, what are Jython and IronPython supposed to do on UNIX?  Can they
implement these semantics at all?


 and that has overwhelmingly popular support.


I think people don't fully understand the tradeoffs.  I certainly don't.
Although there is a slight benefit, there are unknown and potentially large
costs. We'd be changing Python's entire unicode string behavior for the sake
of one use case.  Since our uses of Python actually involve a lot of
unicode, I am wary of having malformed unicode crop up legally in Python
code.

And that's why I think this proposal should be shelved for a while until
people have had more time to try to understand the issues and also come up
with alternative proposals.  Once this is adopted and implemented in
CPython, Python is stuck with it forever.

Tom


[Python-Dev] PEP 383 (again)

2009-04-28 Thread Thomas Breuel
I thought PEP-383 was a fairly neat approach, but after thinking about it, I
now think that it is wrong.

PEP-383 attempts to represent non-UTF-8 byte sequences in Unicode strings in
a reversible way.  But how do those non-UTF-8 byte sequences get into those
path names in the first place?  Most likely because an encoding other than
UTF-8 was used to write the file system, but you're now trying to interpret
its path names as UTF-8.

Quietly escaping a bad UTF-8 encoding with private Unicode characters is
unlikely to be the right thing, since using the wrong encoding likely means
that other characters are decoded incorrectly as well.   As a result, the
path name may fail in string comparisons and pattern matching, and will look
wrong to the user in print statements and dialog boxes. Therefore, when
Python encounters path names on a file system that are not consistent with
the (assumed) encoding for that file system, Python should raise an error.

If you really don't care what the string looks like and you just want an
encoding that round-trips without loss, you can probably just set your
encoding to one of the 8 bit encodings, like ISO 8859-15.   Decoding
arbitrary byte sequences to unicode strings as ISO 8859-15 is no less
correct than decoding them as the proposed utf-8b.  In fact, the most
likely source of non-UTF-8 sequences is ISO 8859 encodings.
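
The round-trip claim is easy to check.  The sketch below assumes Python's
built-in `iso8859-15` codec, which assigns a character to every byte value,
and a made-up file name.  It also shows the price: a name that really was
UTF-8 comes out as mojibake.

```python
every_byte = bytes(range(256))

# Any byte string decodes as ISO 8859-15 and encodes back without loss.
text = every_byte.decode("iso8859-15")
lossless = text.encode("iso8859-15") == every_byte

# A genuine UTF-8 name (e-acute is the two bytes C3 A9) is displayed wrongly,
# but still maps back to the same bytes.
utf8_name = "\u00e9.txt".encode("utf-8")
shown = utf8_name.decode("iso8859-15")
back = shown.encode("iso8859-15")
```

Here `shown` is a two-character garble followed by ".txt" rather than the
intended single accented letter, which is the sense in which this decoding is
lossless but not more correct than any other guess.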

As for what the byte-oriented interfaces should do, they are simply platform
dependent.  On UNIX, they should do the obvious thing.  On Windows, they can
either hook up to the low-level byte-oriented system calls that the systems
supply, or Windows could fake it and have the byte-oriented interfaces use
UTF-8 encodings always and reject non-UTF-8 sequences as illegal (there are
already many illegal byte sequences anyway).

Tom


Re: [Python-Dev] PEP 383 (again)

2009-04-28 Thread Thomas Breuel
  Therefore, when Python encounters path names on a file system
  that are not consistent with the (assumed) encoding for that file
  system, Python should raise an error.

 This is what happens currently, and users are quite unhappy about it.


We need to keep users and programmers distinct here.

Programmers may find it inconvenient that they have to spend time figuring
out and dealing with platform-dependent file system encoding issues and
errors.  But internationalization and unicode are hard, that's just a fact
of life.

End users, however, are going to be quite unhappy if they get a string of
gibberish for a file name because you decided to interpret some non-Unicode
string as UTF-8-with-extra-bytes.

Or some Python program might copy files from an ISO8859-15 encoded file
system to a UTF-8 encoded file system, and instead of getting an error when
the encodings are set incorrectly, Python would quietly create ISO8859-15
encoded file names, making the target file system inconsistent.

There is a lot of potential for major problems for end users with your
proposals.  In both cases, what should happen is that the end user gets an
error, submits a bug, and the programmer figures out how to deal with the
encoding issues correctly.


 Yes, users can do that (to a degree), but they are still unhappy about
 it. The approach actually fails for command line arguments


As it should: if I give an ISO8859-15 encoded command line argument to a
Python program that expects a UTF-8 encoding, the Python program should tell
me that there is something wrong when it notices that.  Quietly continuing
is the wrong thing to do.

If we follow your approach, that ISO8859-15 string will get turned into an
escaped unicode string inside Python.  If I understand your proposal
correctly, if it's an output file name and gets passed to Python's open
function, Python will then decode that string and end up with an ISO8859-15
byte sequence, which it will write to disk literally, even if the encoding
for the system is UTF-8.   That's the wrong thing to do.
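
The pass-through described here can be sketched as follows, assuming Python 3
and using the `surrogateescape` error handler to stand in for the proposed
utf-8b codec.  The argument value is invented for illustration.

```python
# A name typed on an ISO 8859-15 terminal, arriving on a UTF-8 system.
arg_bytes = "na\u00efve.txt".encode("iso8859-15")

# Escaping decoding accepts it silently ...
arg = arg_bytes.decode("utf-8", "surrogateescape")

# ... and encoding it again for open() reproduces the original bytes verbatim.
written = arg.encode("utf-8", "surrogateescape")
passed_through = written == arg_bytes

# Those bytes are not valid UTF-8; a strict decoder would have flagged them.
try:
    written.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

Nothing in this path raises, so the program never learns that the argument
was in the wrong encoding.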

 As is, these interfaces are incomplete - they don't support command
 line arguments, or environment variables. If you want to complete them,
 you should write a PEP.


There's no point in scratching when there's no itch.

Tom

PS:

 Quietly escaping a bad UTF-8 encoding with private Unicode characters is
  unlikely to be the right thing

 And indeed, the PEP stopped using PUA characters.


Let me rephrase this: quietly escaping a bad UTF-8 encoding is unlikely to
be the right thing; it doesn't matter how you do it.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] PEP 383 (again)

2009-04-28 Thread Thomas Breuel


Until it's hard there will be no internationalization. A fact of life,
 damn it. Programmers are lazy, and have many problems to solve.


PEP 383 doesn't make it any easier; it just turns one set of problems into
another.  Actually, it makes it worse, since any problems that show up now
show up far from the source of the problem, and since it can lead to
security problems and/or data loss.


And the programmer answers The program is expected a correct
 environment, good filenames, etc. and closes the issue with the resolution
 User error, will not fix.


The problem may well be with the program using the wrong encodings or
incorrectly ignoring encoding information.  Furthermore, even if it is user
error, the program needs to validate its inputs and put up a meaningful
error message, not mangle the disk.  To detect such program bugs, it's
important that when Python detects an incorrect encoding, it doesn't
quietly continue with an incorrect string.

Furthermore, if you don't provide clear error messages, it often takes a
significant amount of time for each issue to determine that it is user
error.


   I am not arguing for or against the PEP in question. Python certainly
 has to have a way to make portable i18n less hard or else the number of
 portable internationalized program will be about zero. What the way should
 be - I don't know.


Returning an error for an incorrect encoding doesn't make
internationalization harder, it makes it easier because it makes debugging
easier.

Tom


Re: [Python-Dev] PEP 383 (again)

2009-04-28 Thread Thomas Breuel
On Tue, Apr 28, 2009 at 11:00, Oleg Broytmann p...@phd.pp.ru wrote:

 On Tue, Apr 28, 2009 at 10:37:45AM +0200, Thomas Breuel wrote:
  Returning an error for an incorrect encoding doesn't make
  internationalization harder, it makes it easier because it makes
 debugging
  easier.

What is a correct encoding?

   I have an FTP server to which clients with different local encodings
 are connecting. FTP protocol doesn't have a notion of encoding so filenames
 on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one
 directory! What should os.listdir() return for that directory? What is a
 correct encoding for that directory?!


I don't know what it should do (ftplib needs to worry about that). I do know
what it shouldn't do, however: it should not return a utf-8b string which,
when used to create a file, will create a file reproducing the byte sequence
of the remote machine; that's wrong.

  If any program starts to raise errors Python becomes completely unusable
 for me! But is there anything I can debug here?


If we follow PEP 383, you will get lots of errors anyway because those
strings, when encoded in utf-8b, will result in an error when you try to
write them on a Windows file system or any other system that doesn't allow
the byte sequences that the utf-8b encodes.

Tom


Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

2009-04-28 Thread Thomas Breuel

 Yep, that's the problem. Lots of theoretical problems noone has ever
 encountered
 brought up against a PEP which resolves some actual problems people
 encounter on
 a regular basis.


How can you bring up practical problems against something that hasn't been
implemented?

The fact that no other language or library does this is perhaps an
indication that it isn't the right thing to do.

But the biggest problem with the proposal is that it isn't needed: if you
want to be able to turn arbitrary byte sequences into unicode strings and
back, just set your encoding to iso8859-15.  That already works and it
doesn't require any changes.

Tom


Re: [Python-Dev] PEP 383 (again)

2009-04-28 Thread Thomas Breuel

 However, it is mission creep: Martin didn't volunteer to
 write a PEP for it, he volunteered to write a PEP to solve the
 roundtrip the value of os.listdir() problem.  And he succeeded, up
 to some minor details.


Yes, it solves that problem.  But that doesn't come without cost.

Most importantly, now Python writes illegal UTF-8 strings even if the user
chose a UTF-8 encoding.   That means that illegal UTF-8 encodings can
propagate anywhere, without warning.

Furthermore, I don't believe that PEP 383 works consistently on Windows, and
it causes programs to behave differently in unintuitive ways on Windows and
Linux.

I'll suggest an alternative in a separate message.

Tom


[Python-Dev] a suggestion ... Re: PEP 383 (again)

2009-04-28 Thread Thomas Breuel
I think we should break up this problem into several parts:

(1) Should the default UTF-8 decoder fail if it gets an illegal byte
sequence?

It's probably OK for the default decoder to be lenient in some way (see
below).

(2) Should the default UTF-8 encoder for file system operations be allowed
to generate illegal byte sequences?

I think that's a definite no; if I set the encoding for a device to UTF-8, I
never want Python to try to write illegal UTF-8 strings to my device.

(3) What kind of representation should the UTF-8 decoder return for illegal
inputs?

There are actually several choices: (a) it could guess what the actual
encoding is and use that, (b) it could return a valid unicode string that
indicates the illegal characters but does not re-encode to the original byte
sequence, or (c) it could return some kind of non-standard representation
that encodes back into the original byte sequence.

PEP 383 violates (2), and I think that's a bad thing.

I think the best solution would be to use (3a) and fall back to (3b) if that
doesn't work.  If people try to write those strings, they will always get
written as correctly encoded UTF-8 strings.
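
A rough sketch of what "(3a), then (3b)" could look like follows. The function name decode_name and the choice of cp1252 as the guessed encoding are assumptions for illustration only, not part of any proposal.

```python
def decode_name(raw: bytes, guesses=("cp1252",)) -> str:
    """Decode a file name without ever producing a malformed string.

    Hypothetical helper: try strict UTF-8, then each guessed legacy
    encoding (3a), and finally substitute U+FFFD for bad bytes (3b).
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for encoding in guesses:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # (3b): a valid string that marks the damage but does not round-trip.
    return raw.decode("utf-8", errors="replace")
```

Whatever this returns can be encoded as well-formed UTF-8; the price is that the original bytes are not always recoverable.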

If people really want the option of (3c), then I think encoders related to
the file system should by default reject those strings as illegal because
the potential problems from writing them are just too serious.  Printing
routines and UI routines could display them without error (but with some
clear indication), of course.

There is yet another option, which is arguably the right one: make the
results of os.listdir() subclasses of string that keep track of where they
came from.  If you write back to the same device, it just writes the same
byte sequence.  But if you write to other devices and the byte sequence is
illegal according to its encoding, you get an error.
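
As a sketch of that last option only: the class name FsName, the device label, and the to_bytes method below are all hypothetical, and a real design would need much more care.

```python
class FsName(str):
    """A str that remembers the raw bytes and device it came from.

    Hypothetical illustration; not an existing Python API.
    """

    def __new__(cls, raw: bytes, device: str, encoding: str = "utf-8"):
        # The visible text is always a valid string (bad bytes -> U+FFFD).
        self = super().__new__(cls, raw.decode(encoding, errors="replace"))
        self.raw = raw
        self.device = device
        return self

    def to_bytes(self, device: str, encoding: str = "utf-8") -> bytes:
        # Same device: hand back the original byte sequence untouched.
        if device == self.device:
            return self.raw
        # Other device: a strict decode raises UnicodeDecodeError if the
        # bytes are illegal in the target encoding.
        self.raw.decode(encoding)
        return self.raw
```

Writing back to the originating device always works; writing elsewhere fails loudly instead of silently spreading bad bytes.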

Tom


Re: [Python-Dev] PEP 383 (again)

2009-04-28 Thread Thomas Breuel
On Tue, Apr 28, 2009 at 20:45, Martin v. Löwis mar...@v.loewis.de wrote:

  Furthermore, I don't believe that PEP 383 works consistently on Windows,

 What makes you say that? PEP 383 will have no effect on Windows,
 compared to the status quo, whatsoever.


That's what you believe, but it's not clear to me that that follows from
your proposal.

Your proposal says that utf-8b would be used for file systems, but then you
also say that it might be used for command line arguments and environment
variables.  So, which specific APIs will it be used with on Windows and on
POSIX systems?   Or will utf-8b simply not be available on Windows at all?
What happens if I create a Python version of tar, utf-8b strings slip in
there, and I try to use them on Windows?

You also assume that all Windows file system functions strictly conform to
UTF-16 in practice (not just on paper).  Have you verified that?  It
certainly isn't true across all versions of Windows (since NT originally
used UCS-2).   What's the situation on Windows CE?

Another question on Linux: what happens when I decode a file system path
with utf-8b and then pass the resulting unicode string to Gnome?  To Qt?  To
windows.forms?  To Java?  To a unicode regular expression library?  To
wprintf?  AFAIK, the behavior of most libraries is undefined for the kinds
of unicode strings you construct, and it may be undefined in a bad way
(crash, buffer overflow, whatever).

Tom


Re: [Python-Dev] PEP 383 (again)

2009-04-28 Thread Thomas Breuel

 On Windows, the Wide APIs are already used throughout the code base,
  e.g. SetEnvironmentVariableW/_wenviron. If you need to find out the
 specific API for a specific functionality, please read the source code.
 [...]

 No, I don't assume that. I assume that all functions are strictly
 available in a Wide character version, and have verified that they are.


The wide APIs use UTF-16.  UTF-16 suffers from the same problem as UTF-8:
not all sequences of 16-bit words are valid UTF-16 sequences.  In particular,
sequences containing isolated (unpaired) surrogates are not well-formed
according to the Unicode standard.  Therefore, the existence of a wide
character API function does not guarantee that the wide character strings it
returns can be converted into valid unicode strings.  And, in fact, Windows
Vista happily creates files with malformed UTF-16 names, and os.listdir()
happily returns them.
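
A small sketch of the well-formedness point, using Python 3's strict utf-16-le codec as the checker. The helper name is_well_formed_utf16 is made up, and the byte strings stand in for what a wide API might return.

```python
def is_well_formed_utf16(data: bytes) -> bool:
    """Return True if data is well-formed little-endian UTF-16.

    Hypothetical helper: relies on the strict utf-16-le codec rejecting
    unpaired surrogates.
    """
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# A proper surrogate pair (U+1F600) versus a high surrogate followed by "a".
paired = "\U0001F600".encode("utf-16-le")
unpaired = b"\x00\xd8" + "a".encode("utf-16-le")
```

Both byte strings are legal sequences of 16-bit words, but only the first is well-formed UTF-16.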


 If you can crash Python that way,
 nothing gets worse by this PEP - you can then *already* crash Python
 in that way.


Yes, but AFAIK, Python does not currently have functions that, as part of
correct usage and normal operation, are intended to generate malformed
unicode strings.

Under your proposal, passing the output from a correctly implemented file
system or other OS function to a correctly written library using unicode
strings may crash Python.  In order to avoid that, every library that's
built into Python would have to be checked and updated to deal with both the
Unicode standard and your extension to it.

Tom


Re: [Python-Dev] PEP 383 (again)

2009-04-28 Thread Thomas Breuel

 It cannot crash Python; it can only crash
 hypothetical third-party programs or libraries with deficient error
 checking and
 unreasonable assumptions about input data.


The error checking isn't necessarily deficient.  For example, a safe and
legitimate thing to do is for third party libraries to throw a C++
exception, raise a Python exception, or delete the half surrogate.  Any of
those would break one of the use cases people have been talking about,
namely being able to present the output from os.listdir() to the user, say
in a file selector, and then access that file.
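
Here is a sketch of how a well-behaved library could still break that use case. The function strip_lone_surrogates stands in for a hypothetical third-party sanitizer, and the decoding assumes the "surrogateescape" handler name used in Python 3.

```python
def strip_lone_surrogates(text: str) -> str:
    """Drop surrogate code points, as a cautious library might (hypothetical)."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


raw = b"\xffreport.txt"  # made-up file name with one undecodable byte
name = raw.decode("utf-8", errors="surrogateescape")

# The library "cleans" the string before showing it in a file selector.
cleaned = strip_lone_surrogates(name)

# Encoding the cleaned string no longer yields the original file name.
recovered = cleaned.encode("utf-8", errors="surrogateescape")
```

The sanitizer did nothing wrong by the Unicode standard, yet the user can no longer open the file they selected.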

 (and, of course, you haven't even proven those programs or libraries exist)


PEP 383 is a proposal that suggests changing Python such that malformed
unicode strings become a required part of Python and such that Python writes
illegal UTF-8 encodings to UTF-8 encoded file systems.  Those are big
changes, and it's legitimate to ask that PEP 383 address the implications of
that choice before it's made.

Tom