Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Wed, Apr 29, 2009 at 23:03, Terry Reedy tjre...@udel.edu wrote:
> Thomas Breuel wrote:
>>> Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint at some problem you apparently see (which I believe is just not there).
>> Well, here's another one: PEP 383 would disallow UTF-8 encodings of half surrogates.
> By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that.

If we use conformance to Unicode 5.1 as the basis for our discussion, then PEP 383 is off the table anyway. I'm all for strict Unicode compliance, but apparently the Python community doesn't care. CESU-8 is described in Unicode Technical Report #26, so it at least has some official recognition. More importantly, it's also widely used.

So, my question: what are the implications of PEP 383 for CESU-8 encodings in Python?

My meta-point is: there are probably many more such issues hidden away, and it is a really bad idea to rush something like PEP 383 out. Unicode is hard anyway, and tinkering with its semantics requires a lot of thought.

Tom
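P.S. For concreteness, a sketch (runnable on a modern CPython; the non-BMP character is just an arbitrary example) of how CESU-8 differs from UTF-8: the character is first turned into a UTF-16 surrogate pair, and each half surrogate is then encoded separately, which a strict UTF-8 codec must reject.

    ch = "\U00010400"                      # arbitrary non-BMP character
    print(ch.encode("utf-8"))              # b'\xf0\x90\x90\x80' -- standard UTF-8

    # Its UTF-16 representation is a surrogate pair:
    print(ch.encode("utf-16-be").hex())    # 'd801dc00'

    # CESU-8 encodes each surrogate with the three-byte UTF-8 pattern instead,
    # giving b'\xed\xa0\x81\xed\xb0\x80', which strict UTF-8 refuses to decode:
    cesu8 = b"\xed\xa0\x81\xed\xb0\x80"
    try:
        cesu8.decode("utf-8")
    except UnicodeDecodeError as e:
        print("strict UTF-8 rejects CESU-8:", e)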
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 05:40, Curt Hagenlocher c...@hagenlocher.org wrote:
> IronPython will inherit whatever behavior Mono has implemented. The Microsoft CLR defines the native string type as UTF-16, and all of the managed APIs for things like file names and environment variables operate on UTF-16 strings -- there simply are no byte string APIs.

Yes. Now think about the implications. This means that adopting PEP 383 will make IronPython and Jython running on UNIX intrinsically incompatible with CPython running on UNIX, and there's no way to fix that.

Tom
[Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
Given the stated rationale of PEP 383, I was wondering what Windows actually does. So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged it into my Windows Vista machine, and fired up Python 3.0.

First, os.listdir("f:") returns a list of strings for those file names... but those unicode strings are illegal. You can't even print them without getting an error from Python. In fact, you also can't print strings containing the proposed half-surrogate encodings: in both cases, the output encoder rejects them with a UnicodeEncodeError. (If not even Python, with its generally lenient attitude, can print those things, some other libraries probably will fail, too.)

What about round tripping? If you take a malformed file name from an external device (say, because it was actually encoded in ISO8859-15 or an East Asian encoding) and write it to an NTFS directory, Windows seems to write malformed UTF-16 file names. In essence, Windows doesn't really use unicode; it just implements 16-bit raw character strings, just like UNIX historically implements raw 8-bit character strings.

Then I tried the same thing on my Ubuntu 9.04 machine. It turns out that, unlike Windows, Linux seems to be moving to consistent use of valid UTF-8. If you plug in an external device and nothing else is known about it, it gets mounted with the utf8 option, and the kernel actually seems to enforce UTF-8 encoding.

I think this calls into question the rationale behind PEP 383, and we should first look into what the roadmap for UNIX/Linux and UTF-8 actually is. UNIX may have consistent unicode support (via UTF-8) before Windows does.

As I was saying, I think PEP 383 needs a lot more thought and research...

Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
>> Yes. Now think about the implications. This means that adopting PEP 383 will make IronPython and Jython running on UNIX intrinsically incompatible with CPython running on UNIX, and there's no way to fix that.
> *Not* adopting the PEP will also make CPython and IronPython incompatible, and there's no way to fix that.

CPython and IronPython are incompatible, and they will stay incompatible if the PEP is adopted. They would become compatible if CPython adopted Mono and/or Java semantics.

Since both have had to deal with this, did you look at what they actually do before proposing PEP 383? What did you find? Why did you choose an incompatible approach for PEP 383?

Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
>> Since both have had to deal with this, have you looked at what they actually do before proposing PEP 383? What did you find?
> See http://mail.python.org/pipermail/python-3000/2007-September/010450.html

Thanks, that's very useful.

>> Why did you choose an incompatible approach for PEP 383?
> Because in Python, we want to be able to access all files on disk. Neither Java nor Mono is capable of doing that.

OK, so what's wrong with os.listdir() and similar functions returning a unicode string for names that decode correctly, and a byte string for names that are not valid unicode? The file I/O functions already seem to handle byte strings correctly, you never get byte strings on platforms that are fully unicode, and they are well supported.

Tom
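P.S. A minimal sketch of what I mean (not what PEP 383 does), written against a modern CPython; listdir_mixed is just a hypothetical helper name. On POSIX, passing a bytes path to os.listdir() yields bytes names, so a wrapper can hand back str where a name decodes and bytes where it doesn't.

    import os

    def listdir_mixed(path="."):
        # Sketch only: str for names that decode as UTF-8, bytes otherwise.
        names = []
        for raw in os.listdir(os.fsencode(path)):   # bytes path in, bytes names out
            try:
                names.append(raw.decode("utf-8"))
            except UnicodeDecodeError:
                names.append(raw)                   # not valid UTF-8: keep the bytes
        return names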
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 12:32, Martin v. Löwis mar...@v.loewis.de wrote:
>> OK, so what's wrong with os.listdir() and similar functions returning a unicode string for strings that correctly encode/decode, and with byte strings for strings that are not valid unicode?
> See http://bugs.python.org/issue3187, in particular msg71655

Why didn't you point to that discussion from PEP 383? And why didn't you point to Kowalczyk's message on encodings in Mono, Java, etc. from the PEP? You could have saved us all a lot of time.

Under the set of constraints that Guido imposes, plus the requirement that round-tripping work for illegal encodings, there is no solution other than PEP 383. That doesn't make PEP 383 right--I still think it's a bad decision--but it makes it pointless to discuss it any further.

Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
> Java is not capable of doing that. Mono, as I keep pointing out, is. It uses NULLs to escape invalid UNIX filenames. Please see: http://go-mono.com/docs/index.aspx?link=T%3AMono.Unix.UnixEncoding The upshot of all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding.

OK, so why not adopt the Mono solution in CPython? It seems to produce valid unicode strings, removing at least one issue with PEP 383. It would also mean that IronPython and CPython actually were compatible.

Tom
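P.S. I don't know Mono's exact algorithm, but here is a rough Python sketch of the general idea as I understand it -- escape each undecodable byte behind U+0000, which can never otherwise appear in a POSIX path -- just to show that it round-trips while producing strings with no lone surrogates. This is illustrative only, not Mono's actual UnixEncoding.

    def nul_escape_decode(data: bytes) -> str:
        # Decode UTF-8 greedily; hide each undecodable byte behind U+0000.
        out, i = [], 0
        while i < len(data):
            for j in range(min(4, len(data) - i), 0, -1):
                try:
                    out.append(data[i:i + j].decode("utf-8"))
                    i += j
                    break
                except UnicodeDecodeError:
                    pass
            else:
                out.append("\x00" + chr(data[i]))   # escape the bad byte
                i += 1
        return "".join(out)

    def nul_escape_encode(text: str) -> bytes:
        out, i = bytearray(), 0
        while i < len(text):
            if text[i] == "\x00" and i + 1 < len(text):
                out.append(ord(text[i + 1]))        # restore the escaped byte
                i += 2
            else:
                out.extend(text[i].encode("utf-8"))
                i += 1
        return bytes(out)

    raw = b"caf\xe9 + \xc3\xa9"                     # mixed ISO 8859-15 / UTF-8 junk
    assert nul_escape_encode(nul_escape_decode(raw)) == raw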
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 10:21, Martin v. Löwis mar...@v.loewis.de wrote:
> Thomas Breuel wrote:
>> Given the stated rationale of PEP 383, I was wondering what Windows actually does. So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0.
> How did you do that, and what were the specific names that you had chosen?

There are several ways I tried it. The easiest was to mount a vfat file system with various encodings on Linux and use the Python byte interface to write file names, then plug that flash drive into Windows.

> I think you misinterpreted what you saw. To find out in what way you misinterpreted it, we would have to know what it is that you saw.

I didn't interpret it much at all. I'm just saying that the PEP 383 assumption that these problems can't occur on Windows isn't true. I can plug in a flash drive with malformed strings, and somewhere between the disk and Python, something maps those strings onto unicode in some way, and it's done in a way that's different from PEP 383. Mono and Java must have their own solutions that are different from PEP 383 as well.

My point remains that I think PEP 383 shouldn't be rushed through; one should first look more carefully at what the Windows kernel does in these situations, and at what Mono and Java do.

Tom
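P.S. Roughly, the write side of that experiment looks like this (a sketch; /mnt/stick is a hypothetical mount point for the flash drive, and the filesystem has to be one that doesn't enforce UTF-8):

    import os

    os.chdir("/mnt/stick")                 # hypothetical vfat mount point
    # 0xE9 is 'é' in ISO 8859-15 but is not valid UTF-8 on its own:
    with open(b"caf\xe9.txt", "w") as f:
        f.write("test\n")
    print(os.listdir("."))                 # what does this name come back as?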
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
> And then it goes on to say: "You won't be able to pass non-Unicode filenames as command-line arguments." (*) Not only that, but you can't reliably use such files with System.IO (whatever that is, but it sounds pretty basic). "This support is only available within the Mono.Unix and Mono.Unix.Native namespaces." Now, I don't know what that means (never having touched Mono), but it doesn't sound like it simplifies cross-platform support, which is what PEP 383 is aiming for.

The problem there isn't how the characters are quoted, but that they are quoted at all, and that the ECMA and Microsoft libraries don't understand this quoting convention. Since command line parsing is handled through the ECMA libraries, you happen not to be able to get at those files (that's fixable, but why bother). The analogous problem exists with Martin's proposal in Python: if you pass a unicode string from Python to some library through a unicode API and that library attempts to open the file, it will fail because it doesn't use the proposed Python utf-8b decoder. There just is no way to fix that, no matter which quoting convention you use.

In contrast to PEP 383, quoting with U+0000 at least results in valid unicode strings in Python. And command line arguments (and environment variables, etc.) would work in Python because in Python, those would also use the new encoding for invalid UTF-8 inputs.

Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
>> The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding.
> I think this is misleading. With Mono 2.0.1, I get

This has nothing to do with how Mono quotes. The failure happens because Mono quotes at all, and because the Mono developers decided not to change System.IO to understand the UNIX quoting. If Mono used PEP 383 quoting, this would fail in the same way. And analogous failures will exist with PEP 383 in Python, because there will be more and more libraries with unicode interfaces that then use their own internal decoder (which doesn't understand utf-8b) to get a UNIX file name.

Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
> What's an analogous failure? Or, rather, why would a failure analogous to the one I got when using System.IO.DirectoryInfo ever exist in Python?

Mono.Unix uses an encoder and a decoder that know about the special quoting rules. System.IO uses a different encoder and decoder, because it's a reimplementation of a Microsoft library and the Mono developers chose not to implement the Mono.Unix quoting rules in it. There is nothing technical preventing System.IO from using the Mono.Unix codec; it's just that the developers didn't want to change the behavior of an ECMA and Microsoft library.

The analogous phenomenon will exist in Python with PEP 383. Let's say I have a C library with wide character interfaces and I pass it a unicode string from Python.(*) That C library then turns the unicode string into UTF-8 for writing to disk using its own internal UTF-8 converter. The result is that the file can be opened using Python's open, but it can't be opened using the other library. There simply is no way you can guarantee that all libraries turn unicode strings into pathnames using utf-8b.

I'm not arguing about whether that's good or bad anymore, since it's obvious that the only proposal acceptable to Guido uses some form of non-standard encoding/quoting. I'm simply pointing out that the failure you observed with System.IO has nothing to do with which quoting convention you choose, but results from the fact that the developers of System.IO are not using the same encoder/decoder as Mono.Unix (in that case, by choice).

So, I don't see any reason to prefer your half-surrogate quoting to the Mono U+0000-based quoting. Both seem to achieve the same goal with respect to round-tripping file names, displaying them, etc., but Mono quoting actually results in valid unicode strings. It works because NUL is the one character that's not legal in a UNIX path name. So, why do you prefer half-surrogate coding to U+0000 quoting?

Tom

(*) There's actually a second, subtle issue. PEP 383 intends utf-8b to be used only for file names. But that means that I might have to bind the first argument of TIFFOpen with utf-8b conversion, while binding other arguments with utf-8 conversion.
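For concreteness, here is a sketch of the round trip in question, using the surrogateescape error handler that implements utf-8b on a modern CPython (3.1 and later): the bytes round-trip, but only through that one codec/handler combination, which is exactly why a library with its own UTF-8 converter sees something different.

    raw = b"caf\xe9"                                   # ISO 8859-15 bytes, not valid UTF-8
    name = raw.decode("utf-8", "surrogateescape")      # 'caf\udce9' -- contains a lone surrogate
    assert name.encode("utf-8", "surrogateescape") == raw   # round-trips through utf-8b

    # A strict UTF-8 encoder (any library using its own converter) refuses it:
    try:
        name.encode("utf-8")
    except UnicodeEncodeError as e:
        print("strict UTF-8 refuses the name:", e)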
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Not for me (I am using Python 2.6.2).
>
> >>> f = open(chr(255), 'w')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

You can get the same error on Linux:

    $ python
    Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41)
    [GCC 4.3.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> f=open(chr(255),'w')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

(Some file system drivers do not enforce valid utf8 yet, but I suspect they will in the future.)

Tom
Re: [Python-Dev] PEP 383 (again)
On Wed, Apr 29, 2009 at 07:45, Martin v. Löwis mar...@v.loewis.de wrote:
> Your claim was that PEP 383 may have unfortunate effects on Windows,

No, I simply think that PEP 383 is not sufficiently specified to be able to tell.

> and I'm telling you that it won't, because the behavior of Python on Windows won't change at all.

A justification for your proposal is that there are differences between Python on UNIX and Windows that you would like to reduce. But depending on where you introduce utf-8b coding on UNIX, you may also have to introduce it on Windows in order to keep the platforms consistent.

> So whatever the problem - it's there already, and the PEP is not going to change it.

OK, so you are saying that under PEP 383, utf-8b wouldn't be used anywhere on Windows by default. That's not clear from your proposal. It's also not clear from your proposal where utf-8b will get used on UNIX systems. Some of the places that have been suggested are open, os.listdir, sys.argv, and os.getenv. There are other potential ones, like print, write, and os.system. And what about text file and string conversions: will utf-8b become the default, optional, or unavailable? Each of those choices potentially has significant implications. I'm just asking what those choices are, so that one can then talk about the implications and see whether this proposal is a good one or whether other alternatives are better.

Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint at some problem you apparently see (which I believe is just not there).

Well, here's another one: PEP 383 would disallow UTF-8 encodings of half surrogates. But such encodings are currently supported by Python, and they are used as part of CESU-8 coding. That is, in fact, a common way of converting UTF-16 to UTF-8. How are you going to deal with existing code that relies on being able to encode half surrogates as UTF-8?

Tom
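P.S. To make the question concrete -- this sketch is for a modern CPython, where the strict UTF-8 codec now rejects surrogates and 'surrogatepass' is the explicit opt-in; the Python 3.0 behavior discussed in this thread differed.

    lone = "\ud801"                                # a half surrogate
    try:
        data = lone.encode("utf-8")                # rejected by a strict codec
    except UnicodeEncodeError:
        data = lone.encode("utf-8", "surrogatepass")
    print(data)                                    # b'\xed\xa0\x81' -- the CESU-8 style bytes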
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
> The whole purpose of PEP 383 is to send the exact same bytes that were read from the OS back to the OS = violating (2) (for whatever the apparent system file-encoding is, not limited to UTF-8),

It's fine to read a file name from a file system and write the same file back with the same raw byte sequence. That I don't have a problem with; it's not quite right, but it's harmless. The problem with this PEP is that the malformed unicode it produces can end up in so many other places: as file names on another file system, in string processing libraries, in text files, in databases, in user interfaces, etc. Some of those destinations will use the utf-8b encoder, so they will get byte sequences that never could occur before and that are illegal under unicode. Nobody knows what will happen. And, yes, Martin is proposing that this be the default behavior.

There are several other issues that are unresolved: utf-8b makes some current practices illegal; for example, it might break CESU-8 encodings. Also, what are Jython and IronPython supposed to do on UNIX? Can they implement these semantics at all?

> and that has overwhelmingly popular support.

I think people don't fully understand the tradeoffs. I certainly don't. Although there is a slight benefit, there are unknown and potentially large costs. We'd be changing Python's entire unicode string behavior for the sake of one use case. Since our uses of Python actually involve a lot of unicode, I am wary of having malformed unicode crop up legally in Python code.

And that's why I think this proposal should be shelved for a while, until people have had more time to understand the issues and to come up with alternative proposals. Once this is adopted and implemented in CPython, Python is stuck with it forever.

Tom
[Python-Dev] PEP 383 (again)
I thought PEP 383 was a fairly neat approach, but after thinking about it, I now think that it is wrong.

PEP 383 attempts to represent non-UTF-8 byte sequences in unicode strings in a reversible way. But how do those non-UTF-8 byte sequences get into those path names in the first place? Most likely because an encoding other than UTF-8 was used to write the file system, but you're now trying to interpret its path names as UTF-8. Quietly escaping a bad UTF-8 encoding with private Unicode characters is unlikely to be the right thing, since using the wrong encoding likely means that other characters are decoded incorrectly as well. As a result, the path name may fail in string comparisons and pattern matching, and will look wrong to the user in print statements and dialog boxes.

Therefore, when Python encounters path names on a file system that are not consistent with the (assumed) encoding for that file system, Python should raise an error.

If you really don't care what the string looks like and you just want an encoding that round-trips without loss, you can probably just set your encoding to one of the 8-bit encodings, like ISO 8859-15. Decoding arbitrary byte sequences to unicode strings as ISO 8859-15 is no less correct than decoding them as the proposed utf-8b. In fact, the most likely source of non-UTF-8 sequences is ISO 8859 encodings.

As for what the byte-oriented interfaces should do, they are simply platform dependent. On UNIX, they should do the obvious thing. On Windows, they can either hook up to the low-level byte-oriented system calls that the system supplies, or Windows could fake it and have the byte-oriented interfaces always use UTF-8 and reject non-UTF-8 sequences as illegal (there are already many illegal byte sequences anyway).

Tom
Re: [Python-Dev] PEP 383 (again)
>> Therefore, when Python encounters path names on a file system that are not consistent with the (assumed) encoding for that file system, Python should raise an error.
> This is what happens currently, and users are quite unhappy about it.

We need to keep users and programmers distinct here. Programmers may find it inconvenient that they have to spend time figuring out and dealing with platform-dependent file system encoding issues and errors. But internationalization and unicode are hard; that's just a fact of life.

End users, however, are going to be quite unhappy if they get a string of gibberish for a file name because you decided to interpret some non-Unicode string as UTF-8-with-extra-bytes. Or some Python program might copy files from an ISO8859-15 encoded file system to a UTF-8 encoded file system, and instead of getting an error when the encodings are set incorrectly, Python would quietly create ISO8859-15 encoded file names, making the target file system inconsistent. There is a lot of potential for major problems for end users with your proposal. In both cases, what should happen is that the end user gets an error, submits a bug, and the programmer figures out how to deal with the encoding issues correctly.

> Yes, users can do that (to a degree), but they are still unhappy about it.

> The approach actually fails for command line arguments

As it should: if I give an ISO8859-15 encoded command line argument to a Python program that expects a UTF-8 encoding, the Python program should tell me that there is something wrong when it notices that. Quietly continuing is the wrong thing to do.

If we follow your approach, that ISO8859-15 string will get turned into an escaped unicode string inside Python. If I understand your proposal correctly, if it's an output file name and gets passed to Python's open function, Python will then decode that string and end up with an ISO8859-15 byte sequence, which it will write to disk literally, even if the encoding for the system is UTF-8. That's the wrong thing to do.

> As is, these interfaces are incomplete - they don't support command line arguments, or environment variables. If you want to complete them, you should write a PEP.

There's no point in scratching when there's no itch.

Tom

PS:
>> Quietly escaping a bad UTF-8 encoding with private Unicode characters is unlikely to be the right thing
> And indeed, the PEP stopped using PUA characters.

Let me rephrase this: quietly escaping a bad UTF-8 encoding is unlikely to be the right thing; it doesn't matter how you do it.
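P.P.S. A sketch of the copy scenario above, assuming a UTF-8 filesystem encoding and a modern CPython: the name read back through the PEP 383 handler encodes to exactly the original ISO 8859-15 bytes, so that is what lands on the UTF-8 target volume.

    import os

    iso_name = b"r\xe9sum\xe9.txt"                       # ISO 8859-15 bytes from the source volume
    name = iso_name.decode("utf-8", "surrogateescape")   # what listdir() would hand the program
    print(os.fsencode(name))                             # b'r\xe9sum\xe9.txt' -- written verbatim to the target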
Re: [Python-Dev] PEP 383 (again)
> Until it's hard there will be no internationalization. A fact of life, damn it. Programmers are lazy, and have many problems to solve.

PEP 383 doesn't make it any easier; it just turns one set of problems into another. Actually, it makes things worse, since any problems that show up now show up far from the source of the problem, and since it can lead to security problems and/or data loss.

> And the programmer answers "The program expects a correct environment, good filenames, etc." and closes the issue with the resolution "User error, will not fix".

The problem may well be with the program using the wrong encodings or incorrectly ignoring encoding information. Furthermore, even if it is user error, the program needs to validate its inputs and put up a meaningful error message, not mangle the disk. To detect such program bugs, it's important that when Python detects an incorrect encoding, it doesn't quietly continue with an incorrect string. Furthermore, if you don't provide clear error messages, it often takes a significant amount of time for each issue to determine that it is user error.

> I am not arguing for or against the PEP in question. Python certainly has to have a way to make portable i18n less hard, or else the number of portable internationalized programs will be about zero. What the way should be - I don't know.

Returning an error for an incorrect encoding doesn't make internationalization harder; it makes it easier, because it makes debugging easier.

Tom
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 11:00, Oleg Broytmann p...@phd.pp.ru wrote:
> On Tue, Apr 28, 2009 at 10:37:45AM +0200, Thomas Breuel wrote:
>> Returning an error for an incorrect encoding doesn't make internationalization harder, it makes it easier because it makes debugging easier.
> What is a correct encoding? I have an FTP server to which clients with different local encodings are connecting. The FTP protocol doesn't have a notion of encoding, so filenames on the filesystem are in koi8-r, cp1251 and utf-8 encodings - all in one directory! What should os.listdir() return for that directory? What is a correct encoding for that directory?!

I don't know what it should do (ftplib needs to worry about that). I do know what it shouldn't do, however: it should not return a utf-8b string which, when used to create a file, will create a file reproducing the byte sequence from the remote machine; that's wrong.

> If any program starts to raise errors Python becomes completely unusable for me! But is there anything I can debug here?

If we follow PEP 383, you will get lots of errors anyway, because those strings, when encoded with utf-8b, will result in an error when you try to write them on a Windows file system or any other system that doesn't allow the byte sequences that utf-8b produces.

Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Yep, that's the problem. Lots of theoretical problems no one has ever encountered brought up against a PEP which resolves some actual problems people encounter on a regular basis.

How can you bring up practical problems against something that hasn't been implemented? The fact that no other language or library does this is perhaps an indication that it isn't the right thing to do.

But the biggest problem with the proposal is that it isn't needed: if you want to be able to turn arbitrary byte sequences into unicode strings and back, just set your encoding to iso8859-15. That already works, and it doesn't require any changes.

Tom
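P.S. A quick check of that claim (any CPython 3.x): the latin-style 8-bit codecs map every byte value, so arbitrary byte sequences decode and re-encode without loss.

    every_byte = bytes(range(256))
    text = every_byte.decode("iso8859-15")       # always succeeds; every byte maps to a character
    assert text.encode("iso8859-15") == every_byte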
Re: [Python-Dev] PEP 383 (again)
> However, it is mission creep: Martin didn't volunteer to write a PEP for it, he volunteered to write a PEP to solve the "roundtrip the value of os.listdir()" problem. And he succeeded, up to some minor details.

Yes, it solves that problem. But that doesn't come without cost. Most importantly, Python now writes illegal UTF-8 strings even if the user chose a UTF-8 encoding. That means that illegal UTF-8 encodings can propagate anywhere, without warning. Furthermore, I don't believe that PEP 383 works consistently on Windows, and it causes programs to behave differently, in unintuitive ways, on Windows and Linux.

I'll suggest an alternative in a separate message.

Tom
[Python-Dev] a suggestion ... Re: PEP 383 (again)
I think we should break this problem up into several parts:

(1) Should the default UTF-8 decoder fail if it gets an illegal byte sequence? It's probably OK for the default decoder to be lenient in some way (see below).

(2) Should the default UTF-8 encoder for file system operations be allowed to generate illegal byte sequences? I think that's a definite no; if I set the encoding for a device to UTF-8, I never want Python to try to write illegal UTF-8 strings to my device.

(3) What kind of representation should the UTF-8 decoder return for illegal inputs? There are actually several choices: (a) it could guess what the actual encoding is and use that, (b) it could return a valid unicode string that indicates the illegal characters but does not re-encode to the original byte sequence, or (c) it could return some kind of non-standard representation that encodes back into the original byte sequence.

PEP 383 violates (2), and I think that's a bad thing.

I think the best solution would be to use (3a) and fall back to (3b) if that doesn't work. If people try to write those strings, they will always get written as correctly encoded UTF-8 strings. If people really want the option of (3c), then I think encoders related to the file system should by default reject those strings as illegal, because the potential problems from writing them are just too serious. Printing routines and UI routines could display them without error (but with some clear indication), of course.

There is yet another option, which is arguably the right one: make the results of os.listdir() subclasses of string that keep track of where they came from. If you write back to the same device, it just writes the same byte sequence. But if you write to another device and the byte sequence is illegal according to its encoding, you get an error.

Tom
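P.S. A rough sketch of that last option (the names and details are hypothetical, and a real version would need to hook the filesystem calls): a str subclass that carries the original bytes and their origin, so writing back to the same device can reuse the raw bytes, while other destinations can check the encoding first.

    class FileName(str):
        """Hypothetical: a display string that remembers its raw bytes and origin."""

        def __new__(cls, raw: bytes, device: str, encoding: str = "utf-8"):
            text = raw.decode(encoding, "replace")   # always a valid unicode string (option 3b)
            self = super().__new__(cls, text)
            self.raw = raw
            self.device = device
            return self

        def bytes_for(self, device: str, encoding: str = "utf-8") -> bytes:
            if device == self.device:
                return self.raw                      # same device: write the same byte sequence
            try:
                self.raw.decode(encoding)            # legal under the target's encoding?
            except UnicodeDecodeError:
                raise ValueError("name is not valid for the target device's encoding")
            return self.raw

    # name = FileName(b"caf\xe9", device="/dev/sdb1")
    # str(name) displays as 'caf\ufffd'; name.bytes_for("/dev/sdb1") is b'caf\xe9'.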
Re: [Python-Dev] PEP 383 (again)
On Tue, Apr 28, 2009 at 20:45, Martin v. Löwis mar...@v.loewis.de wrote:
>> Furthermore, I don't believe that PEP 383 works consistently on Windows,
> What makes you say that? PEP 383 will have no effect on Windows, compared to the status quo, whatsoever.

That's what you believe, but it's not clear to me that it follows from your proposal. Your proposal says that utf-8b would be used for file systems, but then you also say that it might be used for command line arguments and environment variables. So, which specific APIs will it be used with on Windows and on POSIX systems? Or will utf-8b simply not be available on Windows at all? What happens if I create a Python version of tar, utf-8b strings slip in there, and I try to use them on Windows?

You also assume that all Windows file system functions strictly conform to UTF-16 in practice (not just on paper). Have you verified that? It certainly isn't true across all versions of Windows (since NT originally used UCS-2). What's the situation on Windows CE?

Another question, on Linux: what happens when I decode a file system path with utf-8b and then pass the resulting unicode string to Gnome? To Qt? To Windows.Forms? To Java? To a unicode regular expression library? To wprintf? AFAIK, the behavior of most libraries is undefined for the kinds of unicode strings you construct, and it may be undefined in a bad way (crash, buffer overflow, whatever).

Tom
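P.S. One concrete instance of the interoperability worry, runnable on a modern CPython: a name produced by the utf-8b decoder contains a lone surrogate, so anything that insists on well-formed Unicode -- here simply a strict UTF-16 encode, standing in for a wide-character API boundary -- refuses it.

    name = b"caf\xe9".decode("utf-8", "surrogateescape")   # 'caf\udce9'
    try:
        name.encode("utf-16")                               # e.g. marshalling to a UTF-16 API
    except UnicodeEncodeError as e:
        print("cannot hand this to a UTF-16 interface:", e)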
Re: [Python-Dev] PEP 383 (again)
> On Windows, the Wide APIs are already used throughout the code base, e.g. SetEnvironmentVariableW/_wenviron. If you need to find out the specific API for a specific functionality, please read the source code. [...] No, I don't assume that. I assume that all functions are strictly available in a Wide character version, and have verified that they are.

The wide APIs use UTF-16. UTF-16 suffers from the same problem as UTF-8: not all sequences of words are valid UTF-16 sequences. In particular, sequences containing unpaired surrogates are not well-formed according to the Unicode standard. Therefore, the existence of a wide character API function does not guarantee that the wide character strings it returns can be converted into valid unicode strings. And, in fact, Windows Vista happily creates files with malformed UTF-16 encodings, and os.listdir() happily returns them.

> If you can crash Python that way, nothing gets worse by this PEP - you can then *already* crash Python in that way.

Yes, but AFAIK, Python does not currently have functions that, as part of correct usage and normal operation, are intended to generate malformed unicode strings. Under your proposal, passing the output of a correctly implemented file system or other OS function to a correctly written library that uses unicode strings may crash Python. To avoid that, every library that's built into Python would have to be checked and updated to deal with both the Unicode standard and your extension to it.

Tom
Re: [Python-Dev] PEP 383 (again)
> It cannot crash Python; it can only crash hypothetical third-party programs or libraries with deficient error checking and unreasonable assumptions about input data.

The error checking isn't necessarily deficient. For example, a safe and legitimate thing for a third-party library to do is to throw a C++ exception, raise a Python exception, or delete the half surrogate. Any of those would break one of the use cases people have been talking about, namely being able to present the output of os.listdir() to the user, say in a file selector, and then access that file.

> (and, of course, you haven't even proven those programs or libraries exist)

PEP 383 proposes changing Python such that malformed unicode strings become a required part of Python, and such that Python writes illegal UTF-8 encodings to UTF-8 encoded file systems. Those are big changes, and it's legitimate to ask that PEP 383 address the implications of that choice before it's made.

Tom