Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
-On [20090430 07:18], Martin v. Löwis (mar...@v.loewis.de) wrote: Suppose I create a new directory, and run the following script in 3.x:

>>> open("x", "w").close()
>>> open(b"\xff", "w").close()
>>> os.listdir(".")
['x']

That is actually a regression in 3.x:

Python 2.6.1 (r261:67515, Mar 8 2009, 11:36:21)
>>> import os
>>> open("x", "w").close()
>>> open(b"\xff", "w").close()
>>> os.listdir(".")
['x', '\xff']

[Apologies if that was completely clear through the entire discussion, but I've lost track at a given point.] -- Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai イェルーン ラウフロック ヴァン デル ウェルヴェン http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B Heart is the engine of your body, but Mind is the engine of Life... ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Wed, Apr 29, 2009 at 23:03, Terry Reedy tjre...@udel.edu wrote: Thomas Breuel wrote: Sure. However, that requires you to provide meaningful, reproducible counter-examples, rather than a stenographic formulation that might hint at some problem you apparently see (which I believe is just not there). Well, here's another one: PEP 383 would disallow UTF-8 encodings of half surrogates. By my reading, the current Unicode 5.1 definition of 'UTF-8' disallows that. If we use conformance to Unicode 5.1 as the basis for our discussion, then PEP 383 is off the table anyway. I'm all for strict Unicode compliance. But apparently, the Python community doesn't care. CESU-8 is described in Unicode Technical Report #26, so it at least has some official recognition. More importantly, it's also widely used. So, my question: what are the implications of PEP 383 for CESU-8 encodings on Python? My meta-point is: there are probably many more such issues hidden away and it is a really bad idea to rush something like PEP 383 out. Unicode is hard anyway, and tinkering with its semantics requires a lot of thought. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 05:40, Curt Hagenlocher c...@hagenlocher.org wrote: IronPython will inherit whatever behavior Mono has implemented. The Microsoft CLR defines the native string type as UTF-16 and all of the managed APIs for things like file names and environmental variables operate on UTF-16 strings -- there simply are no byte string APIs. Yes. Now think about the implications. This means that adopting PEP 383 will make IronPython and Jython running on UNIX intrinsically incompatible with CPython running on UNIX, and there's no way to fix that. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Jeroen Ruigrok van der Werven wrote: -On [20090430 07:18], Martin v. Löwis (mar...@v.loewis.de) wrote: Suppose I create a new directory, and run the following script in 3.x:

>>> open("x", "w").close()
>>> open(b"\xff", "w").close()
>>> os.listdir(".")
['x']

That is actually a regression in 3.x:

Correct - and precisely the issue that this PEP wants to address. For comparison, do os.listdir(u"."), though:

>>> os.listdir(u".")
[u'x', '\xff']

Regards, Martin
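For readers following the thread with a current interpreter: the round-trip mechanism PEP 383 proposes here eventually shipped in Python 3.1 as the `surrogateescape` error handler (the PEP's drafts call it "python-escape"). A minimal sketch of the behavior under discussion:

```python
# Sketch of the PEP's round-trip, using the surrogateescape error
# handler -- the name under which the PEP's mechanism shipped in 3.1.

raw = b"\xff"  # a file-name byte that is not valid UTF-8

# Decoding maps the undecodable byte 0xFF to the half surrogate U+DCFF...
name = raw.decode("utf-8", "surrogateescape")
assert name == "\udcff"

# ...and encoding maps it back, so os.* calls can reach the real file.
assert name.encode("utf-8", "surrogateescape") == raw
```

With this handler installed as the file-system error handler, `os.listdir(".")` no longer swallows the `b"\xff"` entry shown above.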
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Thomas Breuel wrote: On Thu, Apr 30, 2009 at 05:40, Curt Hagenlocher c...@hagenlocher.org wrote: IronPython will inherit whatever behavior Mono has implemented. The Microsoft CLR defines the native string type as UTF-16 and all of the managed APIs for things like file names and environmental variables operate on UTF-16 strings -- there simply are no byte string APIs. Yes. Now think about the implications. This means that adopting PEP 383 will make IronPython and Jython running on UNIX intrinsically incompatible with CPython running on UNIX, and there's no way to fix that. *Not* adopting the PEP will also make CPython and IronPython incompatible, and there's no way to fix that. Regards, Martin
[Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
Given the stated rationale of PEP 383, I was wondering what Windows actually does. So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0. First, os.listdir("f:") returns a list of strings for those file names... but those unicode strings are illegal. You can't even print them without getting an error from Python. In fact, you also can't print strings containing the proposed half-surrogate encodings either: in both cases, the output encoder rejects them with a UnicodeEncodeError. (If not even Python, with its generally lenient attitude, can print those things, some other libraries probably will fail, too.) What about round tripping? So, if you take a malformed file name from an external device (say, because it was actually encoded iso8859-15 or East Asian) and write it to an NTFS directory, it seems to write malformed UTF-16 file names. In essence, Windows doesn't really use unicode, it just implements 16-bit raw character strings, just like UNIX historically implements raw 8-bit character strings. Then I tried the same thing on my Ubuntu 9.04 machine. It turns out that, unlike Windows, Linux seems to be moving to consistent use of valid UTF-8. If you plug in an external device and nothing else is known about it, it gets mounted with the utf8 option and the kernel actually seems to enforce UTF-8 encoding. I think this calls into question the rationale behind PEP 383, and we should first look into what the roadmap for UNIX/Linux and UTF-8 actually is. UNIX may have consistent unicode support (via UTF-8) before Windows. As I was saying, I think PEP 383 needs a lot more thought and research... Tom
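The printing failure Tom describes is reproducible without any external device: a lone surrogate is not a valid Unicode scalar value, so any strict encoder (such as the one a terminal's output stream may use) rejects it. A small sketch:

```python
# Lone surrogates are not valid scalar values, so a strict encoder
# refuses them -- which is why printing such names can fail.
s = "\udcff"  # the kind of half surrogate PEP 383 would produce

try:
    s.encode("utf-8")  # strict mode, as an output encoder might use
except UnicodeEncodeError as e:
    print("rejected:", e.reason)
```

Whether `print(s)` itself fails depends on the terminal's encoding and error handler, but the strict-encode rejection above is unconditional.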
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Yes. Now think about the implications. This means that adopting PEP 383 will make IronPython and Jython running on UNIX intrinsically incompatible with CPython running on UNIX, and there's no way to fix that. *Not* adopting the PEP will also make CPython and IronPython incompatible, and there's no way to fix that. CPython and IronPython are incompatible. And they will stay incompatible if the PEP is adopted. They would become compatible if CPython adopted Mono and/or Java semantics. Since both have had to deal with this, have you looked at what they actually do before proposing PEP 383? What did you find? Why did you choose an incompatible approach for PEP 383? Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 8:46 PM, came the following characters from the keyboard of Terry Reedy: Glenn Linderman wrote: On approximately 4/29/2009 1:28 PM, came the following characters from So where is the ambiguity here? None. But not everyone can read all the Python source code to try to understand it; they expect the documentation to help them avoid that. Because the documentation is lacking in this area, it makes your concisely stated PEP rather hard to understand. If you think a section of the doc is grossly inadequate, and there is no existing issue on the tracker, feel free to add one. Thanks for clarifying the Windows behavior, here. A little more clarification in the PEP could have avoided lots of discussion. It would seem that a PEP, proposed to modify a poorly documented (and therefore likely poorly understood) area, should be educational about the status quo, as well as presenting the suggested change. Where the PEP proposes to change, it should start with the status quo. But Martin's somewhat reasonable position is that since he is not proposing to change behavior on Windows, it is not his responsibility to document what he is not proposing to change more adequately. This means, of course, that any observed change on Windows would then be a bug, or at least a break of the promise. On the other hand, I can see that this is enough related to what he is proposing to change that better doc would help. Yes; the very fact that the PEP discusses Windows, speaks about cross-platform code, and doesn't explicitly state that no Windows functionality will change, is confusing. An example of how to initialize things within a sample cross-platform application might help, especially if that initialization only happens if the platform is POSIX, or is commented to the effect that it has no effect on Windows, but makes POSIX happy. Or maybe it is all buried within the initialization of Python itself, and is not exposed to the application at all. 
I still haven't figured that out, but was not (and am still not) as concerned about that as ensuring that the overall algorithms are functional and useful and user-friendly. Showing it might have been helpful in making it clear that no Windows functionality would change, however. A statement that additional features are being added to allow cross-platform programs to deal with non-decodable bytes obtained from POSIX APIs using the same code that already works on Windows, would have made things much clearer. The present Abstract does, in fact, talk only about POSIX, but later statements about Windows muddy the water. Rationale paragraph 3 explicitly talks about cross-platform programs needing to work one way on Windows and another way on POSIX to deal with all the cases. It calls that a proposal, which I guess it is for command line and environment, but it is already implemented in both bytes and str forms for file names... so that further muddies the water. It is, of course, easier to point out deficiencies in a document than to write a better document; however, it is incumbent upon the PEP author to write a PEP that is good enough to get approved, and that means making it understandable enough that people are in favor... or to respond to the plethora of comments until people are in favor. I'm not sure which one is more time-consuming. I've reached the point, based on PEP and comment responses, where I now believe that the PEP is a solution to the problem it is trying to solve, and doesn't create ambiguities in the naming. I don't believe it is the best solution. The basic problem is the overuse of fake characters... normalizing them for display results in large data loss -- many characters would be translated to the same replacement characters. 
Solutions exist that would allow the use of fewer different fake characters in the strings, while still having a fake character as the escape character, to preserve the invariant that all the strings manipulated by python-escape from the PEP were, and become, strings containing fake characters (from a strict Unicode perspective), which is a nice invariant*. There even exist solutions that would use only one fake character (repeatedly if necessary), and all other characters generated would be displayable characters. This would ease the burden on the program in displaying the strings, and also on the user that might view the resulting mojibake in trying to differentiate one such string from another. Those are outlined in various emails in this thread, although some include my misconception that strings obtained via Unicode-enabled OS APIs would also need to be encoded and altered. If there is any interest in using a more readable encoding, I'd be glad to rework them to remove those misconceptions. * It would be nice to point out that invariant in the PEP, also. -- Glenn -- http://nevcal.com/ ===
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On approximately 4/29/2009 10:17 PM, came the following characters from the keyboard of Martin v. Löwis: I don't understand the proposal and issues. I see a lot of people claiming that they do, and then spending all their time either talking past each other, or disagreeing. If everyone who claims they understand the issues actually does, why is it so hard to reach a consensus? Because the problem is difficult, and any solution has trade-offs. People disagree on which trade-offs are worse than others. I'd like to see some real examples of how things can break in the current system Suppose I create a new directory, and run the following script in 3.x:

>>> open("x", "w").close()
>>> open(b"\xff", "w").close()
>>> os.listdir(".")
['x']

but...

>>> os.listdir(b".")
[b'x', b'\xff']

If I quit Python, I can now do

mar...@mira:~/work/3k/t$ ls
?  x
mar...@mira:~/work/3k/t$ ls -b
\377  x

As you can see, there are two files in the current directory, but only one of them is reported by os.listdir. The same happens to command line arguments and environment variables: Python might swallow some of them. There is presently no solution for command line and environment variables, I guess... which adds some amount of urgency to the implementation of _something_, even if not this PEP. and I'd like any potential solution to be made available as a third-party package before it goes into the standard library (if possible). Unfortunately, at least for my solution, this isn't possible. I need to change the implementation of the existing file IO APIs. Other than initializing them to use UTF-8b instead of UTF-8, and to use the new python-escape handler? I'm sure if I read the code for that, I'd be able to figure out the answer... I don't find any documented way of adding an encoding/decoding handler to the file IO encoding technique, though which lends credence to your statement, but then that could also be an oversight on my part. 
One could envision a staged implementation: the addition of the ability to add encoding/decoding handlers to the file IO encoding/decoding process, and the external selection of your new python-escape handler during application startup. That way, the hooks would be in the file system to allow your solution to be used, but not require that it be used; competing solutions using similar technology could be implemented and evaluated. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
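At the codec level, the registration hook Glenn envisions does exist: `codecs.register_error` installs a named error handler that any decode call can select. A minimal sketch of a competing handler built this way (the handler name `"demo-escape"` and the byte-to-surrogate mapping below are illustrative, not the PEP's actual implementation):

```python
import codecs

def demo_escape(exc):
    # Illustrative handler: map each undecodable byte b to U+DC00+b,
    # the same shape as the PEP's proposed "python-escape" mapping.
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    raise exc

codecs.register_error("demo-escape", demo_escape)

# An undecodable byte is smuggled through as a half surrogate:
assert b"a\xffb".decode("utf-8", "demo-escape") == "a\udcffb"
```

What the hook does not provide, and what Martin says requires interpreter changes, is a way to make the file-system APIs themselves use such a handler: that wiring is the part that cannot ship as a third-party package.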
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On approximately 4/29/2009 7:50 PM, came the following characters from the keyboard of Aahz: On Thu, Apr 30, 2009, Cameron Simpson wrote: The lengthy discussion mostly revolves around:

- Glenn points out that strings that came _not_ from listdir, and that are _not_ well-formed unicode (== have bare surrogates in them) but that were intended for use as filenames will conflict with the PEP's scheme - programs must know that these strings came from outside and must be translated into the PEP's funny-encoding before use in the os.* functions. Previous to the PEP they would get used directly and encode differently after the PEP, thus producing different POSIX filenames. Breakage.
- Glenn would like the encoding to use Unicode scalar values only, using a rare-in-filenames character. That would avoid the issue with 'outside' strings that contain surrogates. To my mind it just moves the punning from rare illegal strings to merely uncommon but legal characters.
- Some parties think it would be better to not return strings from os.listdir but a subclass of string (or at least a duck-type of string) that knows where it came from and is also handily recognisable as not-really-a-string for purposes of deciding whether it is PEP-funny-encoded by direct inspection.

Assuming people agree that this is an accurate summary, it should be incorporated into the PEP. I'll agree that once other misconceptions were explained away, that the remaining issues are those Cameron summarized. Thanks for the summary! Point two could be modified because I've changed my opinion; I like the invariant Cameron first (I think) explicitly stated about the PEP as it stands, and that I just reworded in another message, that the strings that are altered by the PEP in either direction are in the subset of strings that contain fake (from a strict Unicode viewpoint) characters. 
I still think an encoding that uses mostly real characters that have assigned glyphs would be better than the encoding in the PEP; but would now suggest that an escape character be a fake character. I'll note here that while the PEP encoding causes illegal bytes to be translated to one fake character, the 3-byte sequence that looks like the range of fake characters would also be translated to a sequence of 3 fake characters. This is 512 combinations that must be translated, and understood by the user (or at least by the programmer). The escape sequence approach requires changing only 257 combinations, and each altered combination would result in exactly 2 characters. Hence, this seems simpler to understand, and to manually encode and decode for debugging purposes. -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
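The two cases Glenn distinguishes can be checked concretely against the handler as it eventually shipped (`surrogateescape`): a single illegal byte becomes one fake character, while the 3-byte sequence that spells a half surrogate in UTF-8 is itself invalid UTF-8 and is escaped byte by byte, yielding three:

```python
# One illegal byte -> one half surrogate:
assert b"\xff".decode("utf-8", "surrogateescape") == "\udcff"

# The 3-byte UTF-8 spelling of U+DC80 (ED B2 80) is invalid UTF-8,
# so each byte is escaped separately -> three half surrogates:
assert (b"\xed\xb2\x80".decode("utf-8", "surrogateescape")
        == "\udced\udcb2\udc80")
```

So a byte string that already "looks like" the escape range cannot collide with a genuinely escaped byte; the two decode to different strings, which is the round-trip property the PEP relies on.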
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
Thomas Breuel wrote: Given the stated rationale of PEP 383, I was wondering what Windows actually does. So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0. How did you do that, and what were the specific names that you had chosen? How does explorer display the file names? First, os.listdir("f:") returns a list of strings for those file names... but those unicode strings are illegal. What was the exact result that you got? You can't even print them without getting an error from Python. This is unrelated to the PEP. Try to run the same code in IDLE, or use the ascii() function. What about round tripping? So, if you take a malformed file name from an external device (say, because it was actually encoded iso8859-15 or East Asian) and write it to an NTFS directory, it seems to write malformed UTF-16 file names. In essence, Windows doesn't really use unicode, it just implements 16bit raw character strings, just like UNIX historically implements raw 8bit character strings. I think you misinterpreted what you saw. To find out what way you misinterpreted it, we would have to know what it is that you saw. I think this calls into question the rationale behind PEP 383, and we should first look into what the roadmap for UNIX/Linux and UTF-8 actually is. UNIX may have consistent unicode support (via UTF-8) before Windows. If so, PEP 383 won't hurt. If you never get decode errors for file names, you can just ignore PEP 383. It's only for those of us who do get decode errors. Regards, Martin
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
CPython and IronPython are incompatible. And they will stay incompatible if the PEP is adopted. They would become compatible if CPython adopted Mono and/or Java semantics. Which one should it adopt? Mono semantics, or Java semantics? Since both have had to deal with this, have you looked at what they actually do before proposing PEP 383? What did you find? See http://mail.python.org/pipermail/python-3000/2007-September/010450.html Why did you choose an incompatible approach for PEP 383? Because in Python, we want to be able to access all files on disk. Neither Java nor Mono are capable of doing that. Regards, Martin
[Python-Dev] PEP 383 and GUI libraries
I checked how GUI libraries deal with half surrogates. In pygtk, a warning gets issued to the console /tmp/helloworld.py:71: PangoWarning: Invalid UTF-8 string passed to pango_layout_set_text() self.window.show() and then the widget contains three crossed boxes. wxpython (in its wxgtk version) behaves the same way. PyQt displays a single square box. Regards, Martin
Re: [Python-Dev] PEP 383 and GUI libraries
On approximately 4/30/2009 1:48 AM, came the following characters from the keyboard of Martin v. Löwis: I checked how GUI libraries deal with half surrogates. In pygtk, a warning gets issued to the console /tmp/helloworld.py:71: PangoWarning: Invalid UTF-8 string passed to pango_layout_set_text() self.window.show() and then the widget contains three crossed boxes. wxpython (in its wxgtk version) behaves the same way. PyQt displays a single square box. Interesting. Did you use a name with other characters? Were they displayed? Both before and after the surrogates? Did you use one or three half surrogates, to produce the three crossed boxes? Did you use one or three half surrogates, to produce the single square box? -- Glenn -- http://nevcal.com/ === A protocol is complete when there is nothing left to remove. -- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Assuming people agree that this is an accurate summary, it should be incorporated into the PEP. Done! Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
I think it has to be excluded from mapping in order to not introduce security issues. I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Regards, Martin
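The restriction Martin describes here is visible in the handler as shipped: with an ASCII-compatible codec, bytes below 0x80 decode normally and only high bytes become half surrogates. A sketch of why this closes the security hole:

```python
# Bytes < 0x80 pass through untouched; only bytes >= 0x80 are escaped.
name = b"report\xff.txt".decode("ascii", "surrogateescape")
assert name == "report\udcff.txt"

# In particular, an ASCII "/" can never be hidden inside an escape,
# so a smuggled path separator cannot appear after re-encoding --
# the security property under discussion.
assert "/" not in b"\xff".decode("ascii", "surrogateescape")
```

The cost, as Martin notes, is that encodings that are not ASCII-compatible (where a high byte might be part of an ASCII character's representation) fall outside the scheme.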
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Since both have had to deal with this, have you looked at what they actually do before proposing PEP 383? What did you find? See http://mail.python.org/pipermail/python-3000/2007-September/010450.html Thanks, that's very useful. Why did you choose an incompatible approach for PEP 383? Because in Python, we want to be able to access all files on disk. Neither Java nor Mono are capable of doing that. OK, so what's wrong with os.listdir() and similar functions returning a unicode string for strings that correctly encode/decode, and with byte strings for strings that are not valid unicode? The file I/O functions already seem to deal with byte strings correctly, you never get byte strings on platforms that are fully unicode, and they are well supported. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
OK, so what's wrong with os.listdir() and similar functions returning a unicode string for strings that correctly encode/decode, and with byte strings for strings that are not valid unicode? See http://bugs.python.org/issue3187 in particular msg71655 Regards, Martin
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
Thomas Breuel tmbdev at gmail.com writes: So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0. First, os.listdir("f:") returns a list of strings for those file names... but those unicode strings are illegal. Sorry, when you report such experiments, is it too much to ask for a cut and paste of your Python session? You are being unhelpful with such unsubstantiated statements, and your mails are taking a lot of valuable bandwidth. Antoine.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 12:32, Martin v. Löwis mar...@v.loewis.de wrote: OK, so what's wrong with os.listdir() and similar functions returning a unicode string for strings that correctly encode/decode, and with byte strings for strings that are not valid unicode? See http://bugs.python.org/issue3187 in particular msg71655 Why didn't you point to that discussion from PEP 383? And why didn't you point to Kowalczyk's message on encodings in Mono, Java, etc. from the PEP? You could have saved us all a lot of time. Under the set of constraints that Guido imposes, plus the requirement that round-trip works for illegal encodings, there is no other solution than PEP 383. That doesn't make PEP 383 right--I still think it's a bad decision--but it makes it pointless to discuss it any further. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
2009/4/30 Martin v. Löwis mar...@v.loewis.de: OK, so what's wrong with os.listdir() and similar functions returning a unicode string for strings that correctly encode/decode, and with byte strings for strings that are not valid unicode? See http://bugs.python.org/issue3187 in particular msg71655 Can I suggest that a pointer to this issue be added to the PEP? It certainly seems like a lot of the discussion of options available is captured there. And the fact that Guido's views are noted there is also useful (as he hasn't been contributing to this thread). 2009/4/30 Thomas Breuel tmb...@gmail.com: Since both have had to deal with this, have you looked at what they actually do before proposing PEP 383? What did you find? See http://mail.python.org/pipermail/python-3000/2007-September/010450.html Thanks, that's very useful. This reference could probably be usefully added to the PEP as well. Paul.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On 08:25 am, mar...@v.loewis.de wrote: Why did you choose an incompatible approach for PEP 383? Because in Python, we want to be able to access all files on disk. Neither Java nor Mono are capable of doing that. Java is not capable of doing that. Mono, as I keep pointing out, is. It uses NULLs to escape invalid UNIX filenames. Please see: http://go-mono.com/docs/index.aspx?link=T%3AMono.Unix.UnixEncoding The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding.
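The shape of the NUL-escaping approach being credited to Mono can be illustrated with a toy codec. This is a sketch only: the details of Mono's actual UnixEncoding differ, and the `decode_filename`/`encode_filename` names and the "NUL + raw byte" escape form are inventions for illustration. The point it demonstrates is the alternative trade-off: every decoded name is made of valid Unicode scalar values (no surrogates), at the price of NUL acting as a reserved escape character.

```python
ESC = "\x00"  # NUL as the escape character; POSIX filenames cannot
              # contain NUL, so it is free to reserve.

def decode_filename(raw: bytes) -> str:
    """Decode UTF-8, escaping each undecodable byte as ESC + chr(byte)."""
    out = []
    i = 0
    while i < len(raw):
        # Try the longest decodable prefix (UTF-8 chars are <= 4 bytes).
        for j in range(min(len(raw), i + 4), i, -1):
            try:
                out.append(raw[i:j].decode("utf-8"))
                i = j
                break
            except UnicodeDecodeError:
                continue
        else:
            out.append(ESC + chr(raw[i]))  # escape the bad byte
            i += 1
    return "".join(out)

def encode_filename(name: str) -> bytes:
    """Invert decode_filename, recovering the original bytes."""
    out = bytearray()
    i = 0
    while i < len(name):
        if name[i] == ESC and i + 1 < len(name):
            out.append(ord(name[i + 1]))  # escaped raw byte
            i += 2
        else:
            out += name[i].encode("utf-8")
            i += 1
    return bytes(out)

raw = b"ab\xffc"                    # not valid UTF-8
name = decode_filename(raw)         # "ab\x00\xffc" -- all scalar values
assert encode_filename(name) == raw # round-trips
```

Unlike the PEP's half surrogates, such strings print and pass through strict encoders; the punning moves instead to strings that happen to contain the escape character.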
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
There are several different ways I tried it. The easiest was to mount a vfat file system with various encodings on Linux and use the Python byte interface to write file names, then plug that flash drive into Windows. So can you share precisely what you have done, to allow others to reproduce it? I think you misinterpreted what you saw. To find out what way you misinterpreted it, we would have to know what it is that you saw. I didn't interpret it much at all. I'm just saying that the PEP 383 assumption that these problems can't occur on Windows isn't true. What are these problems, and where does PEP 383 say they can't occur on Windows? What could Python do differently on Windows? I can plug in a flash drive with malformed strings, and somewhere between the disk and Python, something maps those strings onto unicode in some way, and it's done in a way that's different from PEP 383. Of course it is. The Windows FAT driver has chosen some mapping for the file names to Unicode, and most likely not the encoding that you meant it to use. There is now no way for a Win32 application to find out how the file name is actually represented on disk, short of implementing the FAT file system itself. So what Python does is the best possible solution already - report the file names as-is, with no interpretation. My point remains that I think PEP 383 shouldn't be rushed through, and one should look more carefully first at what the Windows kernel does in these situations, and what Mono and Java do. These questions really have been studied on this list for the last eight years, over and over again. It's not being rushed. Regards, Martin
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Java is not capable of doing that. Mono, as I keep pointing out, is. It uses NULLs to escape invalid UNIX filenames. Please see: http://go-mono.com/docs/index.aspx?link=T%3AMono.Unix.UnixEncoding The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding. OK, so why not adopt the Mono solution in CPython? It seems to produce valid unicode strings, removing at least one issue with PEP 383. It also means that IronPython and CPython actually would be compatible. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Thu, 30 Apr 2009 at 11:26, gl...@divmod.com wrote: On 08:25 am, mar...@v.loewis.de wrote: Why did you choose an incompatible approach for PEP 383? Because in Python, we want to be able to access all files on disk. Neither Java nor Mono are capable of doing that. Java is not capable of doing that. Mono, as I keep pointing out, is. It uses NULLs to escape invalid UNIX filenames. Please see: http://go-mono.com/docs/index.aspx?link=T%3AMono.Unix.UnixEncoding The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding. And then it goes on to say: You won't be able to pass non-Unicode filenames as command-line arguments.(*) Not only that, but you can't reliably use such files with System.IO (whatever that is, but it sounds pretty basic). This support is only available within the Mono.Unix and Mono.Unix.Native namespaces. Now, I don't know what that means (never having touched Mono), but it doesn't sound like it simplifies cross-platform support, which is what PEP 383 is aiming for. So it doesn't sound like Mono has solved the problem that Martin is trying to solve, even if it is possible to put Unix specific code into your Mono app to deal with byte filenames on disk from within your GUI. FWIW I'm +1 on seeing PEP 383 in 3.1, if Martin can manage the patch in time. --David (*) I'd argue that in an important sense that makes Martin's statement about Mono being unable to access all files on disk a true statement; but, then, I freely admit that I have a bias against GUI programs in general :)
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 10:21, Martin v. Löwis mar...@v.loewis.de wrote: Thomas Breuel wrote: Given the stated rationale of PEP 383, I was wondering what Windows actually does. So, I created some ISO8859-15 and ISO8859-8 encoded file names on a device, plugged them into my Windows Vista machine, and fired up Python 3.0. How did you do that, and what were the specific names that you had chosen? There are several different ways I tried it. The easiest was to mount a vfat file system with various encodings on Linux and use the Python byte interface to write file names, then plug that flash drive into Windows. I think you misinterpreted what you saw. To find out what way you misinterpreted it, we would have to know what it is that you saw. I didn't interpret it much at all. I'm just saying that the PEP 383 assumption that these problems can't occur on Windows isn't true. I can plug in a flash drive with malformed strings, and somewhere between the disk and Python, something maps those strings onto unicode in some way, and it's done in a way that's different from PEP 383. Mono and Java must have their own solutions that are different from PEP 383. My point remains that I think PEP 383 shouldn't be rushed through, and one should look more carefully first at what the Windows kernel does in these situations, and what Mono and Java do. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Why didn't you point to that discussion from the PEP 383? And why didn't you point to Kowalczyk's message on encodings in Mono, Java, etc. from the PEP? Because I assumed that readers of the PEP would know (and I'm sure many of them do - this has been *really* discussed over and over again). Under the set of constraints that Guido imposes, plus the requirement that round-trip works for illegal encodings, there is no other solution than PEP 383. Well, there actually is an alternative: expose byte-oriented interfaces in parallel with the string-oriented ones. In the rationale, the PEP explains why I consider this the worse choice. Regards, Martin
Re: [Python-Dev] PEP 383 and GUI libraries
Did you use a name with other characters? Were they displayed? Both before and after the surrogates? Yes, yes, and yes (IOW, I put the surrogate in the middle). Did you use one or three half surrogates, to produce the three crossed boxes? Only one, and it produced three boxes - probably one for each UTF-8 byte that pango considered invalid. Did you use one or three half surrogates, to produce the single square box? Again, only one. Apparently, PyQt passes the Python Unicode string to Qt in a character-by-character representation, rather than going through UTF-8. Regards, Martin
[Python-Dev] PEP 382 update
Guido found out that I had misunderstood the existing pkg mechanism: If a zope package is imported, and it uses pkgutil.extend_path, then it won't glob for files ending in .pkg, but instead searches the path for files named zope.pkg. IOW, this is unsuitable as a foundation of PEP 382. I have now changed the PEP to call the files .pth, more in line with how top-level .pth files work, and added a statement that the import feature of .pth files is not provided for package .pth files (use __init__.py instead). Regards, Martin
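The existing pkgutil.extend_path mechanism that PEP 382 builds on can be demonstrated directly. The sketch below (package name `pkg383demo` and the temp-dir layout are invented for this illustration) creates two "portions" of one package on different sys.path roots; the portion that is imported first calls extend_path in its __init__.py, which scans sys.path for other directories of the same name and appends them to __path__:

```python
import os
import sys
import tempfile

# Two hypothetical sys.path roots, each contributing a portion of the
# same package "pkg383demo" (all names invented for this sketch).
root_a, root_b = tempfile.mkdtemp(), tempfile.mkdtemp()
for root, body in (
    (root_a, "from pkgutil import extend_path\n"
             "__path__ = extend_path(__path__, __name__)\n"),
    (root_b, ""),
):
    os.makedirs(os.path.join(root, "pkg383demo"))
    with open(os.path.join(root, "pkg383demo", "__init__.py"), "w") as f:
        f.write(body)

sys.path[:0] = [root_a, root_b]
import pkg383demo

# extend_path found the second portion, so __path__ now spans both roots.
print(pkg383demo.__path__)
```

This is the str-based directory scan the message refers to; the *.pkg-file behavior Guido pointed out (searching for a file literally named zope.pkg) is a separate branch of the same function.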
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
And then it goes on to say: You won't be able to pass non-Unicode filenames as command-line arguments.(*) Not only that, but you can't reliably use such files with System.IO (whatever that is, but it sounds pretty basic). This support is only available within the Mono.Unix and Mono.Unix.Native namespaces. Now, I don't know what that means (never having touched Mono), but it doesn't sound like it simplifies cross-platform support, which is what PEP 383 is aiming for. The problem there isn't how the characters are quoted, but that they are quoted at all, and that the ECMA and Microsoft libraries don't understand this quoting convention. Since command line parsing is handled through ECMA, you happen not to be able to get at those files (that's fixable, but why bother). The analogous problem exists with Martin's proposal in Python: if you pass a unicode string from Python to some library through a unicode API and that library attempts to open the file, it will fail because it doesn't use the proposed Python utf-8b decoder. There just is no way to fix that, no matter which quoting convention you use. In contrast to PEP 383, quoting with U+0000 at least results in valid unicode strings in Python. And command line arguments (and environment variables etc.) would work in Python because in Python, those should also use the new encoding for invalid UTF-8 inputs. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Because in Python, we want to be able to access all files on disk. Neither Java nor Mono are capable of doing that. Java is not capable of doing that. Mono, as I keep pointing out, is. It uses NULLs to escape invalid UNIX filenames. Please see: http://go-mono.com/docs/index.aspx?link=T%3AMono.Unix.UnixEncoding The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding. I think this is misleading. With Mono 2.0.1, I get

** (/tmp/a.exe:30553): WARNING **: FindNextFile: Bad encoding for '/home/martin/work/3k/t/\xff' Consider using MONO_EXTERNAL_ENCODINGS

when running the program

using System.IO;
class X {
    public static void Main(string[] args) {
        DirectoryInfo di = new DirectoryInfo(".");
        foreach (FileInfo fi in di.GetFiles())
            System.Console.WriteLine("Next:" + fi.Name);
    }
}

On the other hand, when I write

using Mono.Unix;
class X {
    public static void Main(string[] args) {
        UnixDirectoryInfo di = new UnixDirectoryInfo(".");
        foreach (UnixFileSystemInfo fi in di.GetFileSystemEntries())
            System.Console.WriteLine("Next:" + fi.Name);
    }
}

I do indeed get all files listed (and can also find out the other stat results). Of course, the resulting application will be Mono-specific (it links with Mono.Posix), and will not work on Microsoft .NET anymore. IOW, IronPython likely won't use this API. Python, of course, already has the equivalent of that: os.listdir, with a bytes parameter, will give you access to all files. If you wanted to closely emulate the Mono API, you could set the file system encoding to the Mono-lookalike codec. Regards, Martin
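The bytes-parameter behavior Martin mentions can be sketched as follows. This assumes a POSIX filesystem that accepts arbitrary non-NUL bytes in names (e.g. ext4 on Linux; macOS HFS+ would reject the b'\xff' name):

```python
import os
import tempfile

# Create two files in a temp dir, one with a name that is not valid
# UTF-8 (assumes a Linux-style filesystem that permits raw bytes).
d = tempfile.mkdtemp().encode()
open(os.path.join(d, b"\xff"), "w").close()
open(os.path.join(d, b"plain"), "w").close()

# bytes in, bytes out: every file is visible, no codec involved.
print(sorted(os.listdir(d)))
```

This is the "byte-oriented interface in parallel" that already exists; PEP 383 is about making the str-returning form of listdir equally complete.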
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
OK, so why not adopt the Mono solution in CPython? It seems to produce valid unicode strings, removing at least one issue with PEP 383. It also means that IronPython and CPython actually would be compatible. See my other message. The Mono solution may not be what you expect it to be. Regards, Martin
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
The upshot to all this is that Mono.Unix and Mono.Unix.Native can list, access, and open all files on your filesystem, regardless of encoding. I think this is misleading. With Mono 2.0.1, I get This has nothing to do with how Mono quotes. The reason for this is that Mono quotes at all and that the Mono developers decided not to change System.IO to understand UNIX quoting. If Mono used PEP 383 quoting, this would fail the same way. And analogous failures will exist with PEP 383 in Python, because there will be more and more libraries with unicode interfaces that then use their own internal decoder (which doesn't understand utf8b) to get a UNIX file name. Tom
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
This has nothing to do with how Mono quotes. The reason for this is that Mono quotes at all and that the Mono developers decided not to change System.IO to understand UNIX quoting. If Mono used PEP 383 quoting, this would fail the same way. And analogous failures will exist with PEP 383 in Python, because there will be more and more libraries with unicode interfaces that then use their own internal decoder (which doesn't understand utf8b) to get a UNIX file name. What's an analogous failure? Or, rather, why would a failure analogous to the one I got when using System.IO.DirectoryInfo ever exist in Python? Regards, Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
[top-posting for once to preserve full quoting] Glenn, Could you please reduce your suggestions into sample text for the PEP? We seem to be now at the stage where nobody is objecting to the PEP, so the focus should be on making the PEP clearer. If you still want to create an alternative PEP implementation, please provide step-by-step walkthroughs, preferably in a new thread -- if you did previously provide that, it's gotten lost in the flood of messages. On Thu, Apr 30, 2009, Glenn Linderman wrote: On approximately 4/29/2009 8:46 PM, came the following characters from the keyboard of Terry Reedy: Glenn Linderman wrote: On approximately 4/29/2009 1:28 PM, came the following characters from So where is the ambiguity here? None. But not everyone can read all the Python source code to try to understand it; they expect the documentation to help them avoid that. Because the documentation is lacking in this area, it makes your concisely stated PEP rather hard to understand. If you think a section of the doc is grossly inadequate, and there is no existing issue on the tracker, feel free to add one. Thanks for clarifying the Windows behavior, here. A little more clarification in the PEP could have avoided lots of discussion. It would seem that a PEP, proposed to modify a poorly documented (and therefore likely poorly understood) area, should be educational about the status quo, as well as presenting the suggested change. Where the PEP proposes to change, it should start with the status quo. But Martin's somewhat reasonable position is that since he is not proposing to change behavior on Windows, it is not his responsibility to document what he is not proposing to change more adequately. This means, of course, that any observed change on Windows would then be a bug, or at least a break of the promise. On the other hand, I can see that this is enough related to what he is proposing to change that better doc would help. 
Yes; the very fact that the PEP discusses Windows, speaks about cross-platform code, and doesn't explicitly state that no Windows functionality will change, is confusing. An example of how to initialize things within a sample cross-platform application might help, especially if that initialization only happens if the platform is POSIX, or is commented to the effect that it has no effect on Windows, but makes POSIX happy. Or maybe it is all buried within the initialization of Python itself, and is not exposed to the application at all. I still haven't figured that out, but was not (and am still not) as concerned about that as ensuring that the overall algorithms are functional and useful and user-friendly. Showing it might have been helpful in making it clear that no Windows functionality would change, however. A statement that additional features are being added to allow cross-platform programs deal with non-decodable bytes obtained from POSIX APIs using the same code that already works on Windows, would have made things much clearer. The present Abstract does, in fact, talk only about POSIX, but later statements about Windows muddy the water. Rationale paragraph 3, explicitly talks about cross-platform programs needing to work one way on Windows and another way on POSIX to deal with all the cases. It calls that a proposal, which I guess it is for command line and environment, but it is already implemented in both bytes and str forms for file names... so that further muddies the water. It is, of course, easier to point out deficiencies in a document than to write a better document; however, it is incumbent upon the PEP author to write a PEP that is good enough to get approved, and that means making it understandable enough that people are in favor... or to respond to the plethora of comments until people are in favor. I'm not sure which one is more time-consuming. 
I've reached the point, based on PEP and comment responses, where I now believe that the PEP is a solution to the problem it is trying to solve, and doesn't create ambiguities in the naming. I don't believe it is the best solution. The basic problem is the overuse of fake characters... normalizing them for display results in large data loss -- many characters would be translated to the same replacement characters. Solutions exist that would allow the use of fewer different fake characters in the strings, while still having a fake character as the escape character, to preserve the invariant that all the strings manipulated by python-escape from the PEP were, and become, strings containing fake characters (from a strict Unicode perspective), which is a nice invariant*. There even exist solutions that would use only one fake character (repeatedly if necessary), and all other characters generated would be displayable characters. This would ease
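One hypothetical reading of Glenn's "single fake escape character" idea can be sketched in a few lines. This is not part of PEP 383 and not Glenn's concrete proposal; it just illustrates the shape of the alternative: one lone surrogate serves as the escape, and each undecodable byte is rendered as two displayable hex digits after it (a literal ESC in the input is a corner case this sketch does not handle):

```python
# Hypothetical single-escape-character scheme; an illustration only.
ESC = "\udc00"   # the one "fake" (lone surrogate) character used

def hexescape_decode(raw: bytes) -> str:
    s = raw.decode("utf-8", "surrogateescape")
    return "".join(
        ESC + format(ord(c) & 0xFF, "02x")   # bad byte -> ESC + hex digits
        if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in s
    )

def hexescape_encode(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == ESC:                   # escape + two hex digits
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)

raw = b"log-\xfe\xff.txt"
assert hexescape_encode(hexescape_decode(raw)) == raw   # round trip holds
```

Displayed naively, such a string shows at most one unrenderable glyph per bad byte, followed by readable hex, which is the display property Glenn is arguing for.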
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Martin v. Löwis wrote: OK, so why not adopt the Mono solution in CPython? It seems to produce valid unicode strings, removing at least one issue with PEP 383. It also means that IronPython and CPython actually would be compatible. See my other message. The Mono solution may not be what you expect it to be. Have we considered discussing the problem with the developers and users of the other languages to reach a common solution? ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] what Windows and Linux really do Re: PEP 383 (again)
You can't even print them without getting an error from Python. In fact, you also can't print strings containing the proposed half-surrogate encodings either: in both cases, the output encoder rejects them with a UnicodeEncodeError. (If not even Python, with its generally lenient attitude, can print those things, some other libraries probably will fail, too.) I think you may be confusing two completely separate things; it's a long-known issue that the Windows console is simply not a Unicode-aware display device naturally. You have to manually set the codepage (by typing 'chcp 65001' -- that's utf8) *and* manually make sure you have a unicode-enabled font chosen for it (which for console fonts is extremely limited to none, and last I looked the default font didn't support unicode) before you can even try to successfully print valid unicode. The default codepage is 437 (for me at least; I think it depends on which language of Windows you're using) which is ASCII-/ish/. You have to do your test in an environment which actually supports displaying unicode at all, or it's meaningless. Personally and for all the use cases I have to deal with at work, I would /love/ to see this PEP succeed. Being able to query a list of files in a directory and get them -all-, display them all to a user (which necessitates it being converted to unicode one way or the other. I don't care if certain characters don't display: as long as any arbitrary file will always end up looking like a distinct series of readable and unreadable glyphs so the user can select it clearly), and then perform operations on any selected file regardless of whatever nonsense may be going on underneath with confused users and encodings... in a cross-platform way, would be a tremendous boon to future py3k porting efforts. I ramble. If there's inconsistent encodings used by users on a posix system so that they can only make sense of half of what the names really are... that's for other programs to deal with.
I just want to be able to access the files they tell me they want. For anyone who is doing something low-level, they can use the bytes API. --Stephen
Re: [Python-Dev] PEP 383 and GUI libraries
FWIW, I'm in agreement with this PEP (i.e. its status is now Accepted). Martin, you can update the PEP and start the implementation. On Thu, Apr 30, 2009 at 2:12 AM, Martin v. Löwis mar...@v.loewis.de wrote: Did you use a name with other characters? Were they displayed? Both before and after the surrogates? Yes, yes, and yes (IOW, I put the surrogate in the middle). Did you use one or three half surrogates, to produce the three crossed boxes? Only one, and it produced three boxes - probably one for each UTF-8 byte that pango considered invalid. Did you use one or three half surrogates, to produce the single square box? Again, only one. Apparently, PyQt passes the Python Unicode string to Qt in a character-by-character representation, rather than going through UTF-8. -- --Guido van Rossum (home page: http://www.python.org/~guido/)
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
What's an analogous failure? Or, rather, why would a failure analogous to the one I got when using System.IO.DirectoryInfo ever exist in Python? Mono.Unix uses an encoder and a decoder that knows about special quoting rules. System.IO uses a different encoder and decoder because it's a reimplementation of a Microsoft library and the Mono developers chose not to implement Mono.Unix quoting rules in it. There is nothing technical preventing System.IO from using the Mono.Unix codec, it's just that the developers didn't want to change the behavior of an ECMA and Microsoft library. The analogous phenomenon will exist in Python with PEP 383. Let's say I have a C library with wide character interfaces and I pass it a unicode string from Python.(*) That C library now turns that unicode string into UTF-8 for writing to disk using its internal UTF-8 converter. The result is that the file can be opened using Python's open, but it can't be opened using the other library. There simply is no way you can guarantee that all libraries turn unicode strings into pathnames using utf-8b. I'm not arguing about whether that's good or bad anymore, since it's obvious that the only proposal acceptable to Guido uses some form of non-standard encoding / quoting. I'm simply pointing out that the failure you observed with System.IO has nothing to do with which quoting convention you choose, but results from the fact that the developers of System.IO are not using the same encoder/decoder as Mono.Unix (in that case, by choice). So, I don't see any reason to prefer your half surrogate quoting to the Mono U+0000-based quoting. Both seem to achieve the same goal with respect to round tripping file names, displaying them, etc., but Mono quoting actually results in valid unicode strings. It works because NUL is the one character that's not legal in a UNIX path name. So, why do you prefer half surrogate coding to U+0000 quoting? Tom (*) There's actually a second, subtle issue.
PEP 383 intends utf-8b only to be used for file names. But that means that I might have to bind the first argument to TIFFOpen with utf-8b conversion, while I might have to bind other arguments with utf-8 conversion.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On 02:42 pm, tmb...@gmail.com wrote: So, why do you prefer half surrogate coding to U+0000 quoting? I have also been eagerly waiting for an answer to this question. I am afraid I have lost it somewhere in the storm of this thread :). Martin, if you're going to stick with the half-surrogate trick, would you mind adding a section to the PEP on alternate encoding strategies, explaining why the NULL method was not selected?
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
2009/4/30 Thomas Breuel tmb...@gmail.com: The analogous phenomenon will exist in Python with PEP 383. Let's say I have a C library with wide character interfaces and I pass it a unicode string from Python.(*) [...] (*) There's actually a second, subtle issue. PEP 383 intends utf-8b only to be used for file names. But that means that I might have to bind the first argument to TIFFOpen with utf-8b conversion, while I might have to bind other arguments with utf-8 conversion. The footnote seems to imply that you have a concrete case rather than a hypothetical one. The discussion would be much easier if you would supply the concrete details. Then other participants in the discussion could offer concrete suggestions on how your issue could be addressed. Of course, there are 2 provisos here: 1. Maybe you don't care any more, having accepted that the PEP is going to be implemented. That's fine, but there's also no point continuing to argue your case in that event. 2. Maybe you aren't going to accept suggestions that don't conform to your idea of how things should be done. In which case, your reasoning is circular, and you're wasting people's time. Sorry, that sounds grumpy. But I get a headache at the best of times trying to understand Unicode issues, and theoretical, vague, descriptions of problems just make my headache worse... I suggest the discussion should be dropped now, as the PEP has been accepted. Paul.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
What's an analogous failure? Or, rather, why would a failure analogous to the one I got when using System.IO.DirectoryInfo ever exist in Python? Mono.Unix uses an encoder and a decoder that knows about special quoting rules. System.IO uses a different encoder and decoder because it's a reimplementation of a Microsoft library and the Mono developers chose not to implement Mono.Unix quoting rules in it. There is nothing technical preventing System.IO from using the Mono.Unix codec, it's just that the developers didn't want to change the behavior of an ECMA and Microsoft library. The analogous phenomenon will exist in Python with PEP 383. Let's say I have a C library with wide character interfaces and I pass it a unicode string from Python.(*) That C library now turns that unicode string into UTF-8 for writing to disk using its internal UTF-8 converter. What specific library do you have in mind? Would it always use UTF-8? If so, it will fail in many other ways, as well - if the locale charset is different from UTF-8. I fail to see the analogy. In Python, the standard library works, and the extension fails; in Mono, it's actually vice versa, and not at all analogous. So, I don't see any reason to prefer your half surrogate quoting to the Mono U+0000-based quoting. Both seem to achieve the same goal with respect to round tripping file names, displaying them, etc., but Mono quoting actually results in valid unicode strings. It works because NUL is the one character that's not legal in a UNIX path name. So, why do you prefer half surrogate coding to U+0000 quoting? If I pass a string with an embedded U+0000 to gtk, gtk will truncate the string, and stop rendering it at this character. This is worse than what it does for invalid UTF-8 sequences. Chances are fairly high that other C libraries will fail in the same way, in particular if they expect char* (which is very common in C).
So I prefer the half surrogate because its failure mode is better th (*) There's actually a second, subtle issue. PEP 383 intends utf-8b only to be used for file names. But that means that I might have to bind the first argument to TIFFOpen with utf-8b conversion, while I might have to bind other arguments with utf-8 conversion. I couldn't find a Python wrapper for libtiff. If a wrapper was written, it would indeed have to use the file system encoding for the file name parameters. However, it would have to do that even without PEP 383, since the file name should be encoded in the locale's encoding, not in UTF-8, anyway. Regards, Martin
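Martin's point about char*-based C code stopping at a NUL can be observed from Python via ctypes. A small sketch, assuming a Unix-like system where the C library can be located:

```python
import ctypes
import ctypes.util

# Assumes a Unix-like system; find_library locates the C library.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

name = "report\x00ff"          # a NUL-marked name, Mono-style
raw = name.encode("utf-8", "surrogateescape")

# strlen stops at the embedded NUL: the C side never sees the rest.
n = libc.strlen(raw)
print(n)   # 6 -- only "report" is visible through a char* interface
```

This is exactly the failure mode Martin describes for gtk: any library that takes char* treats the escape character as a terminator.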
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On Thu, Apr 30, 2009 at 09:42, Thomas Breuel tmb...@gmail.com wrote: So, I don't see any reason to prefer your half surrogate quoting to the Mono U+0000-based quoting. Both seem to achieve the same goal with respect to round tripping file names, displaying them, etc., but Mono quoting actually results in valid unicode strings. It works because NUL is the one character that's not legal in a UNIX path name. This seems to summarize only half of the problem. Mono's U+0000 quoting creates a string which is an invalid filename; PEP 383's creates one which is an unsanctioned collection of code units. Neither can be passed directly to the posix filesystem in question. I favor PEP 383 because its Unicode strings can be usefully passed to most APIs that would display it usefully. Mono's U+0000 probably truncates most strings. And since such non-valid Unicode strings can occur on the Windows filesystem, I don't find their use in PEP 383 to be a flaw. -- Michael Urman
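The two halves of Michael's comparison can be demonstrated concretely (the file name "report" here is invented for the illustration):

```python
import os

# Mono-style marker: a well-formed Unicode string, but the embedded
# NUL means no filesystem call will ever accept it as a name.
try:
    os.stat("report\x00ff")
except ValueError:
    print("NUL-marked name rejected before it even reaches the kernel")

# PEP 383-style marker: a lone surrogate is an unsanctioned code unit,
# so the strict codec refuses it -- but surrogateescape round-trips it
# back to the original byte, which is what os.* does internally.
name = "report\udcff"
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate is not strict UTF-8")
print(name.encode("utf-8", "surrogateescape"))   # b'report\xff'
```

So neither marker is "valid" all the way down; the designs differ in where the invalidity surfaces.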
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
Martin, if you're going to stick with the half-surrogate trick, would you mind adding a section to the PEP on alternate encoding strategies, explaining why the NULL method was not selected? In the PEP process, it isn't my job to criticize competing proposals. Instead, proponents of competing proposals should write alternative PEPs, which then get criticized on their own. As the PEP author, I would have to collect the objections to the PEP in the PEP, which I did; I'm not convinced that I would have to also collect all alternative proposals that people come up with in the PEP (except when they are in fact amendments that I accept). I hope I had made it clear that I don't try to shoot down alternative proposals, but have rather asked people making alternative proposals to write their own PEPs. At some point (when the amount of alternative proposals grew unreasonably), I stopped responding to each and every alternative proposal that this should be proposed in a separate PEP. Wrt. escaping with U+0000: I personally disliked it because I considered it difficult to implement. In particular, on encoding: how do you arrange the encoder not to encode the NUL character in the encoding, as it would surely be a valid character? The surrogate approach works much better here, as it will automatically invoke the error handler. With further testing, I found that in practice, the proposal also suffers from the problem that the character would be taken as a terminating character by APIs - I found that to be a real problem in gtk, and have added that to the PEP. Regards, Martin
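Martin's implementation argument can be seen in a few lines with the error handler that PEP 383 led to (shipped as "surrogateescape" in Python 3.1): the half surrogate is itself unencodable by the strict codec, so re-encoding automatically invokes the error handler, whereas a NUL is a perfectly ordinary character that a codec would pass through silently:

```python
raw = b"gr\xfc.txt"                     # Latin-1 'grü.txt', not valid UTF-8
s = raw.decode("utf-8", "surrogateescape")
assert s == "gr\udcfc.txt"              # bad byte -> lone half surrogate

# Encoding trips over the surrogate, so the handler restores the byte:
assert s.encode("utf-8", "surrogateescape") == raw   # lossless round trip

# A NUL, by contrast, encodes without ever calling an error handler,
# which is why a NUL-escape scheme needs a custom codec end to end:
assert "\x00".encode("utf-8") == b"\x00"
```

The codec machinery does the bookkeeping for free in the surrogate design; the NUL design would have to special-case the escape character in every encoder and decoder.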
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On 03:35 pm, mar...@v.loewis.de wrote: So, why do you prefer half surrogate coding to U+0000 quoting? If I pass a string with an embedded U+0000 to gtk, gtk will truncate the string, and stop rendering it at this character. This is worse than what it does for invalid UTF-8 sequences. Chances are fairly high that other C libraries will fail in the same way, in particular if they expect char* (which is very common in C). Hmm. I believe the intended failure mode here, for PyGTK at least, is actually this: TypeError: GtkLabel.set_text() argument 1 must be string without null bytes, not unicode APIs in PyGTK which accept NULLs and silently truncate are probably broken. Although perhaps I've just made your point even more strongly; one because the behavior is inconsistent, and two because it sometimes raises an exception if a NULL is present, and apparently the goal here is to prevent exceptions from being raised anywhere in the process. For this idiom to be of any use to GTK programs, gtk.FileChooser.get_filename() will probably need to be changed, since (in py2) it currently returns a str, not unicode. The PEP should say something about how GUI libraries should handle file choosers, so that they'll be consistent and compatible with the standard library. Perhaps only that file choosers need to take this PEP into account, and the rest is obvious. Or maybe the right thing for GTK to do would be to continue to use bytes on POSIX and convert to text on Windows, since open(), listdir() et. al. will continue to accept bytes for filenames? So I prefer the half surrogate because its failure mode is better th Heh heh heh.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On 04:07 pm, mar...@v.loewis.de wrote: Martin, if you're going to stick with the half-surrogate trick, would you mind adding a section to the PEP on alternate encoding strategies, explaining why the NULL method was not selected? In the PEP process, it isn't my job to criticize competing proposals. Instead, proponents of competing proposals should write alternative PEPs, which then get criticized on their own. As the PEP author, I would have to collect the objections to the PEP in the PEP, which I did; I'm not convinced that I would have to also collect all alternative proposals that people come up with in the PEP (except when they are in fact amendments that I accept). Fair enough. I have probably misunderstood the process. I dimly recalled reading some PEPs which addressed alternate approaches in this way and I thought it was part of the process. Anyway, congratulations on getting the PEP accepted, good luck with the implementation. Thanks for addressing my question.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
If I pass a string with an embedded U+0000 to gtk, gtk will truncate the string, and stop rendering it at this character. This is worse than what it does for invalid UTF-8 sequences. Chances are fairly high that other C libraries will fail in the same way, in particular if they expect char* (which is very common in C). Hmm. I believe the intended failure mode here, for PyGTK at least, is actually this: TypeError: GtkLabel.set_text() argument 1 must be string without null bytes, not unicode It may depend on the widget also, I tried it with wxMessageDialog (I only had the wx example available, and am using wxgtk). APIs in PyGTK which accept NULLs and silently truncate are probably broken. Although perhaps I've just made your point even more strongly; one because the behavior is inconsistent, and two because it sometimes raises an exception if a NULL is present, and apparently the goal here is to prevent exceptions from being raised anywhere in the process. Indeed so. For this idiom to be of any use to GTK programs, gtk.FileChooser.get_filename() will probably need to be changed, since (in py2) it currently returns a str, not unicode. Perhaps - the entire PEP is about Python 3 only. I don't know whether PyGTK already works with 3.x. The PEP should say something about how GUI libraries should handle file choosers, so that they'll be consistent and compatible with the standard library. Perhaps only that file choosers need to take this PEP into account, and the rest is obvious. Or maybe the right thing for GTK to do would be to continue to use bytes on POSIX and convert to text on Windows, since open(), listdir() et al. will continue to accept bytes for filenames? In Python 3, the file chooser should definitely return strings, and it would be good if they were PEP 383 compliant. So I prefer the half surrogate because its failure mode is better th Heh heh heh. 
And it wasn't even intentional :-) Martin
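The half-surrogate scheme under debate can be sketched in modern Python 3, where the error handler discussed here eventually shipped under the name "surrogateescape" (the thread calls the codec "utf-8b"; the handler name and this demo are not from the 2009 codebase under discussion):

```python
# PEP 383 round-trip: an undecodable byte is escaped as a lone half
# surrogate in U+DC80..U+DCFF, and the encoder maps it back to the
# original byte, so os.listdir() results survive a listdir/open cycle.
raw = b"caf\xff"                               # not valid UTF-8
text = raw.decode("utf-8", "surrogateescape")
assert text == "caf\udcff"                     # 0xFF escaped as U+DCFF
back = text.encode("utf-8", "surrogateescape")
assert back == raw                             # original bytes restored exactly
```

Note that the escaped result is exactly the kind of non-conforming string (a lone surrogate) that makes C libraries such as GTK a concern in this thread.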
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Cameron Simpson writes: On 29Apr2009 22:14, Stephen J. Turnbull step...@xemacs.org wrote: | Baptiste Carvello writes: | By contrast, if the new utf-8b codec would *supercede* the old one, | \udcxx would always mean raw bytes (at least on UCS-4 builds, where | surrogates are unused). Thus ambiguity could be avoided. | | Unfortunately, that's false. [Because Python strings are | intended to be used as containers for widechars which are to be | interpreted as Unicode when that makes sense, but there's no | restriction against nonsense code points, including in UCS-4 | Python.] [...] Wouldn't you then be bypassing the implicit encoding anyway, at least to some extent, and thus not trip over the PEP? Sure. I'm not really arguing the PEP here; the point is that under the current definition of Python strings, ambiguity is unavoidable. The best we can ask for is fewer exceptions, and an attempt to reduce ambiguity to a bare minimum in the code paths that we open up when we make a definition that allows a formerly erroneous computation to succeed. Martin is well aware of this, the PEP is clear enough about that (to me, but I'm a mail and multilingual editor internals kinda guy *wink*). I'd rather have more validation of strings, but *shrug* Martin's doing the work. OTOH, the Unicode fans need to understand that past policy of Python is not to validate; Python is intended to provide all the tools needed to write validating apps, but it isn't one itself. Martin's PEP is quite narrow in that sense. All it is about is an invertible encoding of broken encodings. It does have the downside that it guarantees that Python itself can produce non-conforming strings, but that's not the end of the world, and an app can keep track of them or even refuse them by setting the error handler, if it wants to.
Re: [Python-Dev] a suggestion ... Re: PEP 383 (again)
On 2009.04.30 18:21:03 +0200, Martin v. Löwis wrote: Perhaps - the entire PEP is about Python 3 only. I don't know whether PyGTK already works with 3.x. It does not. There is a bug in the Gnome tracker for it, and I believe some work has been done to start porting PyGObject, but it appears that a full PyGTK on Python 3 is a ways off. -- David Ripton drip...@ripton.net
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
One further question: should the encoder accept a string like u'\uDCC2\uDC80'? That would encode to b'\xC2\x80', which, when decoded, would give u'\x80'. Does the PEP only guarantee that strings decoded from the filesystem are reversible, but not check what might be de novo strings?
[Python-Dev] #!/usr/bin/env python -- python3 where applicable
Jared Grubb wrote: Ok, so if I understand, the situation is: * python points to 2.x version * python3 points to 3.x version * need to be able to run certain 3k scripts from cmdline (since we're talking about shebangs) using Python3k even though python points to 2.x So, if I got the situation right, then do these same scripts understand that PYTHONPATH and PYTHONHOME and all the others are also probably pointing to 2.x code? Would it make sense to introduce PYTHON2PATH and PYTHON3PATH (or even PYTHON27PATH and PYTHON32PATH) et al? Or is this an area where we just figure that whoever moved the file locations around for distribution can hardcode things properly? -jJ
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
MRAB wrote: One further question: should the encoder accept a string like u'\uDCC2\uDC80'? That would encode to b'\xC2\x80' Indeed so. which, when decoded, would give u'\x80'. Assuming the encoding is UTF-8, yes. Does the PEP only guarantee that strings decoded from the filesystem are reversible, but not check what might be de novo strings? Exactly so. Regards, Martin
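MRAB's example can be checked directly in modern Python 3, using the error handler that the final implementation named "surrogateescape" (the thread calls the codec "utf-8b"; the handler name postdates this exchange):

```python
# A "de novo" string of escaped-byte surrogates whose mapped bytes
# happen to spell a valid UTF-8 sequence.  As Martin confirms, the
# round-trip guarantee only covers strings decoded from the filesystem.
s = "\udcc2\udc80"
b = s.encode("utf-8", "surrogateescape")
assert b == b"\xc2\x80"                 # U+DCC2 -> 0xC2, U+DC80 -> 0x80
decoded = b.decode("utf-8", "surrogateescape")
assert decoded == "\x80"                # C2 80 is valid UTF-8 for U+0080
assert decoded != s                     # reversibility is lost for this string
```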
Re: [Python-Dev] PEP 383 and GUI libraries
On 30-Apr-09, at 7:39 AM, Guido van Rossum wrote: FWIW, I'm in agreement with this PEP (i.e. its status is now Accepted). Martin, you can update the PEP and start the implementation. +1 Kudos to Martin for seeing this through with (imo) considerable patience and dignity. -Mike
Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath
Counting the votes for http://bugs.python.org/issue5799 : +1 from Mark Hammond (via private mail) +1 from Paul Moore (via the tracker) +1 from Tim Golden (in Python-ideas, though what he literally said was I'm up for it) +1 from Michael Foord +1 from Eric Smith There have been no other votes. Is that enough consensus for it to go in? If so, are there any core developers who could help me get it in before the 3.1 feature freeze? The patch should be in good shape; it has unit tests and updated documentation. /larry/
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Ronald Oussoren ronaldousso...@mac.com (RO) wrote: RO For what it's worth, the OSX APIs seem to behave as follows: RO * If you create a file with a non-UTF8 name on a HFS+ filesystem the RO system automatically encodes the name. RO That is, open(chr(255), 'w') will silently create a file named '%FF' RO instead of the name you'd expect on a unix system. Not for me (I am using Python 2.6.2). f = open(chr(255), 'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' I once got a tar file from a Linux system which contained a file with a non-ASCII, ISO-8859-1 encoded filename. The tar file refused to be unpacked on a HFS+ filesystem. -- Piet van Oostrum p...@cs.uu.nl URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4] Private email: p...@vanoostrum.org
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr 2009, at 05:52, Martin v. Löwis wrote: How do I get a printable unicode version of these path strings if they contain non-unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must be valid unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to valid UTF-8 that the outside-of-Python world likes. Did I get this point wrong? I'm guessing that an app has to understand that filenames come in two forms, unicode and bytes, if it's not UTF-8 data. Why not simply return a string if it's valid UTF-8 and otherwise return bytes? That would have been an alternative solution, and the one that 2.x uses for listdir. People didn't like it. In our application we are running fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the files are in CP-1251(?) and not valid UTF-8. What we have to do is detect these non-UTF-8 filenames and get the users to rename them. Having an algorithm that says if it's a string, no problem; if it's bytes, deal with the exceptions seems simple. How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the utf-8 decode myself? Barry
Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath
Larry Hastings wrote: Counting the votes for http://bugs.python.org/issue5799 : +1 from Mark Hammond (via private mail) +1 from Paul Moore (via the tracker) +1 from Tim Golden (in Python-ideas, though what he literally said was I'm up for it) +1 from Michael Foord +1 from Eric Smith There have been no other votes. Is that enough consensus for it to go in? If so, are there any core developers who could help me get it in before the 3.1 feature freeze? The patch should be in good shape; it has unit tests and updated documentation. +1 from me.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
In article m2ocueq6mm@cs.uu.nl, Piet van Oostrum p...@cs.uu.nl wrote: Ronald Oussoren ronaldousso...@mac.com (RO) wrote: RO For what it's worth, the OSX APIs seem to behave as follows: RO * If you create a file with a non-UTF8 name on a HFS+ filesystem the RO system automatically encodes the name. RO That is, open(chr(255), 'w') will silently create a file named '%FF' RO instead of the name you'd expect on a unix system. Not for me (I am using Python 2.6.2). f = open(chr(255), 'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' What version of OSX are you using? On Tiger 10.4.11 I see the failure you see but on Leopard 10.5.6 the behavior Ronald reports. -- Ned Deily, n...@acm.org
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
How do I get a printable unicode version of these path strings if they contain non-unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must be valid unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to valid UTF-8 that the outside-of-Python world likes. Did I get this point wrong? You are right. However, if your *only* requirement is that it should be printable, then this is fairly underspecified. One way to get a printable string would be this function def printable_string(unprintable): return This will always return a printable version of the input string... In our application we are running fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the files are in CP-1251(?) and not valid UTF-8. That would be a bug in your FTP server, no? If you want all file names to be UTF-8, then your FTP server should arrange for that. Having an algorithm that says if it's a string, no problem; if it's bytes, deal with the exceptions seems simple. How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the utf-8 decode myself? No, you should encode using the strict error handler, with the locale encoding. If the file name encodes successfully, it's correct, otherwise, it's broken. Regards, Martin
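Martin's detection advice can be sketched as a small helper (the function name is ours, not from the thread, and the default encoding stands in for the locale encoding he mentions): strict re-encoding of a decoded file name fails exactly when the name contained escaped, undecodable bytes.

```python
# Sketch of "encode using the strict error handler" as a validity check.
# Names decoded under PEP 383 carry undecodable bytes as surrogates in
# U+DC80..U+DCFF, which a strict encoder rejects.
def is_clean(name, encoding="utf-8"):
    """Return True if *name* contains no PEP 383 escaped bytes."""
    try:
        name.encode(encoding, "strict")
        return True
    except UnicodeEncodeError:
        return False
```

For example, `is_clean("readme.txt")` is True, while a name decoded from broken bytes, such as `"caf\udcff"`, fails the check.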
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Barry Scott wrote: On 30 Apr 2009, at 05:52, Martin v. Löwis wrote: How do I get a printable unicode version of these path strings if they contain non-unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must be valid unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to valid UTF-8 that the outside-of-Python world likes. Did I get this point wrong? I'm guessing that an app has to understand that filenames come in two forms, unicode and bytes, if it's not UTF-8 data. Why not simply return a string if it's valid UTF-8 and otherwise return bytes? That would have been an alternative solution, and the one that 2.x uses for listdir. People didn't like it. In our application we are running fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the files are in CP-1251(?) and not valid UTF-8. What we have to do is detect these non-UTF-8 filenames and get the users to rename them. Having an algorithm that says if it's a string, no problem; if it's bytes, deal with the exceptions seems simple. How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the utf-8 decode myself? What do you do currently? The PEP just offers a way of reading all filenames as Unicode, if that's what you want. So what if the strings can't be encoded to normal UTF-8! The filenames aren't valid UTF-8 anyway! :-)
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote: I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Yes. The practical upshot of this is that users who brokenly use ja_JP.SJIS as their locale (which, note, first requires editing some files in /var/lib/locales manually to enable its use..) may still have python not work with invalid-in-shift-jis filenames. Since that locale is widely recognized as a bad idea to use, and is not supported by any distros, it certainly doesn't bother me that it isn't 100% supported in python. It seems like the most common reason why people want to use SJIS is to make old pre-unicode apps work right in WINE -- in which case it doesn't actually affect unix python at all. I'd personally be fine with python just declaring that the filesystem encoding will *always* be utf-8b and ignore the locale... but I expect some other people might complain about that. Of course, application authors can decide to do that themselves by calling sys.setfilesystemencoding('utf-8b') at the start of their program. James
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Not for me (I am using Python 2.6.2). f = open(chr(255), 'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' You can get the same error on Linux: $ python Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. f=open(chr(255),'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' (Some file system drivers do not enforce valid utf8 yet, but I suspect they will in the future.) Tom
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr 2009, at 21:06, Martin v. Löwis wrote: How do I get a printable unicode version of these path strings if they contain non-unicode data? Define printable. One way would be to use a regular expression, replacing all codes in a certain range with a question mark. What I mean by printable is that the string must be valid unicode that I can print to a UTF-8 console or place as text in a UTF-8 web page. I think your PEP gives me a string that will not encode to valid UTF-8 that the outside-of-Python world likes. Did I get this point wrong? You are right. However, if your *only* requirement is that it should be printable, then this is fairly underspecified. One way to get a printable string would be this function def printable_string(unprintable): return Ha ha! Indeed this works, but I would have to try to turn enough of the string into a reasonable hint at the name of the file so the user has some chance of knowing what is being reported. This will always return a printable version of the input string... In our application we are running fedora with the assumption that the filenames are UTF-8. When Windows systems FTP files to our system the files are in CP-1251(?) and not valid UTF-8. That would be a bug in your FTP server, no? If you want all file names to be UTF-8, then your FTP server should arrange for that. Not a bug, it's the lack of a feature. We use ProFTPd, which has just implemented what is required. I forget the exact details - they are at work - when the ftp client asks for the FEAT of the ftp server, the server can say use UTF-8. Supporting that in the server was apparently non-trivial. Having an algorithm that says if it's a string, no problem; if it's bytes, deal with the exceptions seems simple. How do I do this detection with the PEP proposal? Do I end up using the byte interface and doing the utf-8 decode myself? No, you should encode using the strict error handler, with the locale encoding. 
If the file name encodes successfully, it's correct, otherwise, it's broken. O.k. I understand. Barry
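The "reasonable hint" Barry asks for can be sketched by replacing only the PEP 383 escape range with question marks, rather than returning a constant as in Martin's joke function (this helper and its name are illustrative, not from the thread):

```python
# Replace each escaped byte (a lone surrogate in U+DC80..U+DCFF) with
# '?', keeping every cleanly decoded character, so most of the file
# name survives into the printable version.
def printable_string(name):
    return "".join("?" if "\udc80" <= ch <= "\udcff" else ch for ch in name)
```

For example, a name decoded from Latin-1 bytes, such as `"r\udce9sum\udce9.txt"`, becomes `"r?sum?.txt"`, which is safe to send to a UTF-8 console or web page.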
[Python-Dev] 3.1 beta deferred
Hi everyone! In the interest of letting Martin implement PEP 383 for 3.1, I am deferring the release of the 3.1 beta until next Wednesday, May 6th. Thank you, Benjamin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
James Y Knight wrote: On Apr 30, 2009, at 5:42 AM, Martin v. Löwis wrote: I think you are right. I have now excluded ASCII bytes from being mapped, effectively not supporting any encodings that are not ASCII compatible. Does that sound ok? Yes. The practical upshot of this is that users who brokenly use ja_JP.SJIS as their locale (which, note, first requires editing some files in /var/lib/locales manually to enable its use..) may still have python not work with invalid-in-shift-jis filenames. Since that locale is widely recognized as a bad idea to use, and is not supported by any distros, it certainly doesn't bother me that it isn't 100% supported in python. It seems like the most common reason why people want to use SJIS is to make old pre-unicode apps work right in WINE -- in which case it doesn't actually affect unix python at all. I'd personally be fine with python just declaring that the filesystem encoding will *always* be utf-8b and ignore the locale...but I expect some other people might complain about that. Of course, application authors can decide to do that themselves by calling sys.setfilesystemencoding('utf-8b') at the start of their program. It seems to me that the 3.1+ doc set (or wiki) could be usefully extended with a How-to on working with filenames. I am not sure that everything useful fits anywhere in particular in the ref manuals.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thomas Breuel wrote: Not for me (I am using Python 2.6.2). f = open(chr(255), 'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' You can get the same error on Linux: $ python Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. f=open(chr(255),'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' (Some file system drivers do not enforce valid utf8 yet, but I suspect they will in the future.) Do you suspect that from discussing the issue with kernel developers or reading a thread on lkml? If not, then your suspicion seems to be pretty groundless. The fact that VFAT enforces an encoding does not lend itself to your argument for two reasons: 1) VFAT is not a Unix filesystem. It's a filesystem that's compatible with Windows/DOS. If Windows and DOS have filesystem encodings, then it makes sense for that driver to enforce that as well. Filesystems intended to be used natively on Linux/Unix do not necessarily make this design decision. 2) The encoding is specified when mounting the filesystem. This means that you can still mix encodings in a number of ways. If you mount with an encoding that has full byte coverage, for instance, each user can put filenames from different encodings on there. If you mount with utf8 on a system which uses euc-jp as the default encoding, you can have full paths that contain a mix of utf-8 and euc-jp. Etc. -Toshio
Re: [Python-Dev] 3.1 beta deferred
Benjamin Peterson wrote: Hi everyone! In the interest of letting Martin implement PEP 383 for 3.1, I am deferring the release of the 3.1 beta until next Wednesday, May 6th. That might also give time for Larry Hastings' UNC path patch. (and anything else essentially ready ;-)
Re: [Python-Dev] Proposed: a new function-based C API for declaring Python types
On Tue, Apr 28, 2009 at 8:03 PM, Larry Hastings la...@hastings.org wrote: EXECUTIVE SUMMARY I've written a patch against py3k trunk creating a new function-based API for creating extension types in C. This allows PyTypeObject to become a (mostly) private structure. THE PROBLEM Here's how you create an extension type using the current API. * First, find some code that already has a working type declaration. Copy and paste their fifty-line PyTypeObject declaration, then hack it up until it looks like what you need. * Next--hey! There *is* no next, you're done. You can immediately create an object using your type and pass it into the Python interpreter and it would work fine. You are encouraged to call PyType_Ready(), but this isn't required and it's often skipped. This approach causes two problems. 1) The Python interpreter *must support* and *cannot change* the PyTypeObject structure, forever. Any meaningful change to the structure will break every extension. This has many consequences: a) Fields that are no longer used must be left in place, forever, as ignored placeholders if need be. Py3k cleaned up a lot of these, but it's already picked up a new one (tp_compare is now tp_reserved). b) Internal implementation details of the type system must be public. c) The interpreter can't even use a different structure internally, because extensions are free to pass in objects using PyTypeObjects the interpreter has never seen before. 2) As a programming interface this lacks a certain gentility. It clearly *works*, but it requires programmers to copy and paste with a large structure mostly containing NULLs, which they must pick carefully through to change just a few fields. THE SOLUTION My patch creates a new function-based extension type definition API. 
You create a type by calling PyType_New(), then call various accessor functions on the type (PyType_SetString and the like), and when your type has been completely populated you must call PyType_Activate() to enable it for use. With this API available, extension authors no longer need to directly see the innards of the PyTypeObject structure. Well, most of the fields anyway. There are a few shortcut macros in CPython that need to continue working for performance reasons, so the tp_flags and tp_dealloc fields need to remain publicly visible. One feature worth mentioning is that the API is type-safe. Many such APIs would have had one generic PyType_SetPointer, taking an identifier for the field and a void * for its value, but this would have lost type safety. Another approach would have been to have one accessor per field (PyType_SetAddFunction), but this would have exploded the number of functions in the API. My API splits the difference: each distinct *type* has its own set of accessors (PyType_GetSSizeT) which takes an identifier specifying which field you wish to get or set. SIDE-EFFECTS OF THE API The major change resulting from this API: all PyTypeObjects must now be *pointers* rather than static instances. For example, the external declaration of PyType_Type itself changes from this: PyAPI_DATA(PyTypeObject) PyType_Type; to this: PyAPI_DATA(PyTypeObject *) PyType_Type; This gives rise to the first headache caused by the API: type casts on type objects. It took me a day and a half to realize that this, from Modules/_weakref.c: PyModule_AddObject(m, ref, (PyObject *) &_PyWeakref_RefType); really needed to be this: PyModule_AddObject(m, ref, (PyObject *) _PyWeakref_RefType); Hopefully I've already found most of these in CPython itself, but this sort of code surely lurks in extensions yet to be touched. 
(Pro-tip: if you're working with this patch, and you see a crash, and gdb shows you something like this at the top of the stack: #0 0x081056d8 in visit_decref (op=0x8247aa0, data=0x0) at Modules/gcmodule.c:323 323 if (PyObject_IS_GC(op)) { your problem is an errant &, likely on a type object you're passing in to the interpreter. Think--what did you touch recently? Or debug it by salting your code with calls to collect(NUM_GENERATIONS-1).) Another irksome side-effect of the API: because of tp_flags and tp_dealloc, I now have two declarations of PyTypeObject. There's the externally-visible one in Include/object.h, which lets external parties see tp_dealloc and tp_flags. Then there's the internal one in Objects/typeprivate.h which is the real structure. Since declaring a type twice is a no-no, the external one is gated on #ifndef PY_TYPEPRIVATE If you're a normal Python extension programmer, you'd include Python.h as normal: #include "Python.h" Python implementation files that need to see the real PyTypeObject structure now look like this: #define
Re: [Python-Dev] Proposed: add support for UNC paths to all functions in ntpath
Larry Hastings wrote: Counting the votes for http://bugs.python.org/issue5799 : +1 from Mark Hammond (via private mail) +1 from Paul Moore (via the tracker) +1 from Tim Golden (in Python-ideas, though what he literally said was I'm up for it) +1 from Michael Foord +1 from Eric Smith There have been no other votes. Is that enough consensus for it to go in? If so, are there any core developers who could help me get it in before the 3.1 feature freeze? The patch should be in good shape; it has unit tests and updated documentation. I've taken the liberty of explicitly CCing Martin just in case he missed the thread with all the noise regarding PEP383. If there are no objections from Martin or anyone else here, please feel free to assign it to me (and mail if I haven't taken action by the day before the beta freeze...) Cheers, Mark
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Fri, 1 May 2009 06:55:48 am Thomas Breuel wrote: You can get the same error on Linux: $ python Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. f=open(chr(255),'w') Traceback (most recent call last): File "<stdin>", line 1, in <module> IOError: [Errno 22] invalid mode ('w') or filename: '\xff' Works for me under Fedora using ext3 as the file system. $ python2.6 Python 2.6.1 (r261:67515, Dec 24 2008, 00:33:13) [GCC 4.1.2 20070502 (Red Hat 4.1.2-12)] on linux2 Type "help", "copyright", "credits" or "license" for more information. f=open(chr(255),'w') f.close() import os os.remove(chr(255)) Given that chr(255) is a valid filename on my file system, I would consider it a bug if Python couldn't deal with a file with that name. -- Steven D'Aprano
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 30 Apr, 2009, at 21:33, Piet van Oostrum wrote:
> > Ronald Oussoren <ronaldousso...@mac.com> (RO) wrote:
>
> RO> For what it's worth, the OSX API's seem to behave as follows:
> RO> * If you create a file with a non-UTF8 name on a HFS+ filesystem,
> RO>   the system automatically encodes the name.  That is,
> RO>   open(chr(255), 'w') will silently create a file named '%FF'
> RO>   instead of the name you'd expect on a unix system.
>
> Not for me (I am using Python 2.6.2).
>
> >>> f = open(chr(255), 'w')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> IOError: [Errno 22] invalid mode ('w') or filename: '\xff'

That's odd.  Which version of OSX do you use?

ron...@rivendell-2[0]$ sw_vers
ProductName:    Mac OS X
ProductVersion: 10.5.6
BuildVersion:   9G55

[~/testdir] ron...@rivendell-2[0]$ /usr/bin/python
Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(chr(255), 'w').write('x')
>>> os.listdir('.')
['%FF']

And likewise with python 2.6.1+ (after cleaning the directory):

[~/testdir] ron...@rivendell-2[0]$ python2.6
Python 2.6.1+ (release26-maint:70603, Mar 26 2009, 08:38:03)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('.')
[]
>>> open(chr(255), 'w').write('x')
>>> os.listdir('.')
['%FF']

I once got a tar file from a Linux system which contained a file with a
non-ASCII, ISO-8859-1 encoded filename.  The tar file refused to be
unpacked on a HFS+ filesystem.

-- 
Piet van Oostrum <p...@cs.uu.nl>
URL: http://pietvanoostrum.com [PGP 8DAE142BE17999C4]
Private email: p...@vanoostrum.org
Re: [Python-Dev] PEP 383 and GUI libraries
Folks:

My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
binary names from the filesystem and store them so that I can regenerate
the same byte string later, but it also requires that I *know* whether
what I got was a valid string in the expected encoding (which might be
utf-8) or whether it was not and I need to fall back to storing the
bytes.

So far, it looks like PEP 383 doesn't provide both of these
requirements, so I am going to have to continue working around the
Python API even after PEP 383.  In fact, it might actually increase the
amount of working-around that I have to do.

If I understand correctly, .decode(encoding, 'strict') will not be
changed by PEP 383.  A new error handler is added, so .decode('utf-8',
'python-escape') performs the utf-8b decoding.  Am I right so far?

Therefore if I have a string of bytes, I can attempt to decode it with
'strict', and if that fails I can set the flag showing that it was not a
valid byte string in the expected encoding, and then I can invoke
.decode('utf-8', 'python-escape') on it.  So far, so good.  (Note that I
never want to do .decode(expected_encoding, 'python-escape') -- if it
wasn't a valid bytestring in the expected_encoding, then I want to
decode it with utf-8b, regardless of what the expected encoding was.)

Anyway, I can use it like this:

class FName:
    def __init__(self, name, failed_decode=False):
        self.name = name
        self.failed_decode = failed_decode

def fs_to_unicode(bytes):
    try:
        return FName(bytes.decode(sys.getfilesystemencoding(),
                                  'strict'))
    except UnicodeDecodeError:
        return FName(bytes.decode('utf-8', 'python-escape'),
                     failed_decode=True)

And what about unicode-oriented APIs such as os.listdir()?  Uh-oh, the
PEP says that on systems with locale 'utf-8', it will automatically be
changed to 'utf-8b'.  This means I can't reliably find out whether the
entries in the directory *were* named with valid encodings in utf-8?
That's not acceptable for my use case.
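The strict-then-fallback scheme described above can be sketched in terms of the error handler as it eventually shipped in Python 3, where it is named 'surrogateescape' rather than the 'python-escape' of this draft discussion:

```python
import sys

class FName:
    """A decoded filename plus a flag recording whether strict
    decoding in the expected encoding failed."""
    def __init__(self, name, failed_decode=False):
        self.name = name
        self.failed_decode = failed_decode

def fs_to_unicode(raw):
    try:
        # First attempt: strict decode in the expected encoding.
        return FName(raw.decode(sys.getfilesystemencoding(), 'strict'))
    except UnicodeDecodeError:
        # Not valid in the expected encoding: decode losslessly with
        # utf-8b so the original bytes can be regenerated later.
        return FName(raw.decode('utf-8', 'surrogateescape'),
                     failed_decode=True)

assert fs_to_unicode(b'plain').failed_decode is False
assert fs_to_unicode(b'\xff').failed_decode is True
# The undecodable byte survives as a lone low surrogate:
assert fs_to_unicode(b'\xff').name == '\udcff'
```

The failed_decode flag is exactly the information the author complains the unicode-oriented os.listdir() discards: once utf-8b has been applied silently, callers can no longer tell a cleanly decoded name from an escaped one without re-checking.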
I would have to refrain from using the unicode-oriented os.listdir() on
POSIX, and instead do something like this:

if platform.system() in ('Windows', 'Darwin'):
    def listdir(d):
        return [FName(n) for n in os.listdir(d)]
elif platform.system() in ('Linux', 'SunOs'):
    def listdir(d):
        bytesd = d.encode(sys.getfilesystemencoding())
        return [fs_to_unicode(n) for n in os.listdir(bytesd)]
else:
    raise NotImplementedError("Please classify platform.system() == %s "
                              "as either unicode-safe or unicode-unsafe."
                              % platform.system())

In fact, if 'utf-8' gets automatically converted to 'utf-8b' when
*decoding* as well as encoding, then I would have to change my
fs_to_unicode() function to check for that and make sure to use strict
utf-8 in the first attempt:

def fs_to_unicode(bytes):
    fse = sys.getfilesystemencoding()
    if fse == 'utf-8b':
        fse = 'utf-8'
    try:
        return FName(bytes.decode(fse, 'strict'))
    except UnicodeDecodeError:
        return FName(bytes.decode('utf-8', 'python-escape'),
                     failed_decode=True)

Would it be possible for Python unicode objects to have a flag
indicating whether the 'python-escape' error handler was used?  That
would serve the same purpose as my failed_decode flag above, and would
basically allow me to use the Python APIs directly and make all this
work-around code disappear.  Failing that, I can't see any way to use
os.listdir() in its unicode-oriented mode to satisfy Tahoe's
requirements.

If you take the above code and then add the fact that you want to use
the failed_decode flag when *encoding* the d argument to os.listdir(),
then you get this code: [2].
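The POSIX branch of the work-around above depends on the bytes-oriented form of os.listdir(), which exists in Python 3 as described: pass bytes, get bytes back, with no decoding applied. A minimal demonstration (the temp directory and file name are illustrative):

```python
import os
import tempfile

# When os.listdir() is given a bytes path, it returns the raw bytes
# names, leaving any decode-or-fallback decision to the caller.
d = tempfile.mkdtemp()
open(os.path.join(d, 'plain'), 'w').close()

names = os.listdir(os.fsencode(d))
assert all(isinstance(n, bytes) for n in names)
assert b'plain' in names
```

This is why the author can route Linux and Solaris through fs_to_unicode(): the bytes API hands over exactly what the kernel reported, so the "did strict decoding succeed?" question can still be asked per entry.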
Oh, I just realized that I *could* use the PEP 383 os.listdir(), like
this:

def listdir(d):
    fse = sys.getfilesystemencoding()
    if fse == 'utf-8b':
        fse = 'utf-8'
    ns = []
    for fn in os.listdir(d):
        bytes = fn.encode(fse, 'python-escape')
        try:
            ns.append(FName(bytes.decode(fse, 'strict')))
        except UnicodeDecodeError:
            ns.append(FName(bytes.decode('utf-8', 'python-escape'),
                            failed_decode=True))
    return ns

(And I guess I could define listdir() like this only on the
non-unicode-safe platforms, as above.)  However, that strikes me as even
more horrible than the previous listdir() work-around, in part because
it means decoding, re-encoding, and re-decoding every name, so I think I
would stick with the previous version.

Oh, one more note: for Tahoe's purposes you can, in all of the code
above, replace .decode('utf-8', 'python-escape') with
.decode('windows-1252') and it works just as well.  While utf-8b seems
like a really cool hack, and it would produce more legible results if
utf-8-encoded strings were partially corrupted, I guess I should just
use 'windows-1252', which is already implemented in Python 2 (as well as
in all other software in the world).  I guess this means that PEP 383,
which I
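The decode/re-encode/re-decode dance being weighed here rests on the key property of utf-8b: undecodable bytes become lone low surrogates on decode, and encoding with the same handler restores the exact original bytes. With the handler as it finally shipped ('surrogateescape'), the round trip looks like this:

```python
# A name that is mostly valid UTF-8 with two stray bytes at the end.
raw = b'valid \xc3\xa9 then bad \xff\xfe'

# Decoding with utf-8b never fails; bad bytes map into U+DC80..U+DCFF.
s = raw.decode('utf-8', 'surrogateescape')
assert '\udcff' in s and '\udcfe' in s

# Encoding with the same handler is lossless: we get the bytes back.
assert s.encode('utf-8', 'surrogateescape') == raw
```

This lossless round trip is what the proposed listdir() exploits when it calls fn.encode(fse, 'python-escape') to recover the on-disk bytes from an already-decoded name; the cost the author objects to is performing it for every directory entry.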