Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/25 James Y Knight f...@fuhm.net:
> On Apr 24, 2009, at 6:05 PM, Paul Moore wrote:
>> - Windows systems where broken Unicode (lone surrogates or whatever) isn't involved
>> - Unix systems where the user's stated filesystem encoding is correct
>>
>> Can you honestly say that this isn't the vast majority of real-world environments? (IIRC, you are based in Japan, so it may well be true that the likelihood of problems is a lot higher where you are than where I am - the UK - but I suspect that averaging out, things are generally as above).
>
> In my experience, it is normal on most unix systems that some programs (mostly daemons) are running in default POSIX locale, others (most user programs) are running in the en_US.utf-8 locale, and some luddite users have set themselves to en_US.8859-1. All running on the same system.

OK, thanks for the data point.

Following on from that, would this (under Martin's proposal) result in programs receiving encoded strings, or just semantically-incorrect ones?

Specifically, the 8859-1 case cannot result in encoded strings, as 8859-1 can represent all byte strings (possibly garbled, but at least validly). The utf8 case can hit unrepresentable bytes, but only if there are characters greater than 0x7F in filenames. Is the POSIX case ASCII? If so, then the same logic applies (>=0x80 is unrepresentable).

So, the next question is - do people on such systems frequently use high-bit characters in filenames?

Paul.

PS Unfortunately, I suspect that the biggest group of people likely to be hit badly by this is people using non-latin scripts. And arguing probabilities without real data is optimistic at best.
But those people are also the *least* likely people to contribute on an English-speaking list, I guess :-( (Sincere apologies if everyone but me on this list happens to actually be fluent English-speaking Russians :-)) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
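The per-encoding analysis in this message can be checked directly. The following is a minimal sketch, assuming Python 3; the helper name `undecodable_bytes` is hypothetical and only looks at isolated single bytes, so it says nothing about valid multi-byte UTF-8 sequences.

```python
def undecodable_bytes(encoding):
    """Return the single byte values that a strict decode in `encoding` rejects."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)  # strict error handling is the default
        except UnicodeDecodeError:
            bad.append(value)
    return bad


# ISO-8859-1 assigns a character to every byte value, so nothing is rejected.
latin1_bad = undecodable_bytes("iso-8859-1")

# ASCII rejects every byte with the high bit set.
ascii_bad = undecodable_bytes("ascii")

# In UTF-8 a byte >= 0x80 is never valid on its own.
utf8_bad = undecodable_bytes("utf-8")
```

Under these assumptions only the ASCII and UTF-8 locales can ever produce undecodable bytes, which matches the reasoning above.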
Re: [Python-Dev] Deprecating PyOS_ascii_formatd
Benjamin Peterson wrote:
> 2009/4/24 Eric Smith e...@trueblade.com:
>>> My proposal is to deprecate PyOS_ascii_formatd in 3.1 and remove it in 3.2.
>>
>> Having heard no dissent, I'd like to go ahead and deprecate this API. What are the mechanics of deprecating this? Just documentation, or is there something I should do in the code to generate a warning? Any pointers to examples would be great.
>
> You can use PyErr_WarnEx().

Thanks. I created issue 5835 to track this. I marked it as a release blocker, but I should have no problem finishing it up this weekend.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Cameron Simpson wrote:
> On 22Apr2009 08:50, Martin v. Löwis mar...@v.loewis.de wrote:
> | File names, environment variables, and command line arguments are
> | defined as being character data in POSIX;
>
> Specific citation please? I'd like to check the specifics of this.

For example, on environment variables:
http://opengroup.org/onlinepubs/007908799/xbd/envvar.html

# For values to be portable across XSI-conformant systems, the value
# must be composed of characters from the portable character set (except
# NUL and as indicated below).

# Environment variable names used by the utilities in the XCU
# specification consist solely of upper-case letters, digits and the _
# (underscore) from the characters defined in Portable Character Set.
# Other characters may be permitted by an implementation;

Or, on command line arguments:
http://opengroup.org/onlinepubs/007908799/xsh/execve.html

# The arguments represented by arg0, ... are pointers to null-terminated
# character strings

where a character string is "A contiguous sequence of characters terminated by and including the first null byte.", and a character is

# A sequence of one or more bytes representing a single graphic symbol
# or control code. This term corresponds to the ISO C standard term
# multibyte character (multi-byte character), where a single-byte
# character is a special case of a multi-byte character. Unlike the
# usage in the ISO C standard, character here has no necessary
# relationship with storage space, and byte is used when storage space
# is discussed.

> So you're proposing that all POSIX OS interfaces (which use byte strings) interpret those byte strings into Python3 str objects, with a codec that will accept arbitrary byte sequences losslessly and is totally reversible, yes?

Correct.

> And, I hope, that the os.* interfaces silently use it by default.

Correct.

> | Applications that need to process the original byte
> | strings can obtain them by encoding the character strings with the
> | file system encoding, passing python-escape as the error handler
> | name.
>
> -1
>
> This last sentence kills the idea for me, unless I'm missing something. Which I may be, of course. POSIX filesystems _do_not_ have a file system encoding.

Why is that a problem for the PEP?

> If I'm writing a general purpose UNIX tool like chmod or find, I expect it to work reliably on _any_ UNIX pathname. It must be totally encoding blind. If I speak to the os.* interface to open a file, I expect to hand it bytes and have it behave.

See the other messages. If you want to do that, you can continue to.

> I'm very much in favour of being able to work in strings for most purposes, but if I use the os.* interfaces on a UNIX system it is necessary to be _able_ to work in bytes, because UNIX file pathnames are bytes.

Please re-read the PEP. It provides a way of being able to access any POSIX file name correctly, and still pass strings.

> If there isn't a byte-safe os.* facility in Python3, it will simply be unsuitable for writing low level UNIX tools.

Why is that? The mechanism in the PEP is precisely defined to allow writing low level UNIX tools.

> Finally, I have a small python program whose whole purpose in life is to transcode UNIX filenames before transfer to a MacOSX HFS directory, because of HFS's enforced particular encoding. What approach should a Python app take to transcode UNIX pathnames under your scheme?

Compute the corresponding character strings, and use them.

Regards,
Martin
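The byte-recovery step quoted from the PEP can be sketched as follows. This is a minimal illustration, not the PEP's own text: the thread calls the error handler python-escape, whereas the Python releases that shipped the PEP (3.1 and later) register it as `surrogateescape`. The encoding is hard-coded to UTF-8 here as an assumption so the result does not depend on the local locale.

```python
# A file name created under a Latin-1 locale; not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Decode as a UTF-8 system interface would under the PEP: the undecodable
# byte 0xE9 becomes the lone surrogate U+DCE9 instead of raising.
name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same codec and error handler restores the original bytes.
recovered = name.encode("utf-8", "surrogateescape")
```

In Python 3.2 and later, `os.fsdecode` and `os.fsencode` apply the same handler together with the actual file system encoding.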
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> | 2. Even if they were taken away (which the PEP does not propose to do),
> |    it would be easy to emulate them for applications that want them.
> |    For example, listdir could be wrapped as
> |
> |    def listdir_b(bytestring):
> |        fse = sys.getfilesystemencoding()
>
> Alas, no

No, what? No, that algorithm would be incorrect?

> because there is no sys.getfilesystemencoding() at the POSIX level. It's only the user's current locale stuff on a UNIX system, and has _nothing_ to do with the filesystem because UNIX filesystems don't have encodings.

So can you produce a specific example where my proposed listdir_b function would fail to work correctly? For it to work, it is not necessary that POSIX has no notion of character sets on the file system level (which is actually not true - POSIX very well recognizes the notion of character sets for file names, and recommends that you restrict yourself to the portable character set).

> In particular, because the best (or to my mind misleading) you can do for this is report what the current user thinks:
> http://docs.python.org/library/sys.html#sys.getfilesystemencoding
> then there's no guarantee that what is chosen has any relationship to what was in use when the files being consulted were made.

For this PEP, it's irrelevant. It will work even if the chosen encoding is a bad choice.

> Now, if I were writing listdir_b() I'd want to be able to do something along these lines:
> - set LC_ALL=C (or some equivalent mechanism)
> - have os.listdir() read bytes as numeric values and transcode their values _directly_ into the corresponding Unicode code points.
> - yield bytes( ord(c) for c in os_listdir_string )
> - have os.open() et al transcode unicode code points back into bytes.
> i.e. a straight one-to-one mapping, using only codepoints in the range 1..255.

That would be an alternative approach to the same problem (and one that I think will fail more badly than the one I'm proposing).

Regards,
Martin
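The quoted `listdir_b` wrapper is cut off after its first line. One way it might be completed is sketched below; this is an assumption about the intent, not the PEP's actual code, and it uses the handler name `surrogateescape` under which the PEP's error handler ships in Python 3.1 and later. It is written with POSIX systems in mind.

```python
import os
import sys


def listdir_b(dirname_bytes):
    """Bytes-in, bytes-out listdir emulated on top of the str interface."""
    fse = sys.getfilesystemencoding()
    # Decoding never fails: undecodable bytes turn into lone surrogates.
    dirname = dirname_bytes.decode(fse, "surrogateescape")
    # Re-encoding each name with the same codec and handler yields the
    # bytes the operating system originally reported.
    return [name.encode(fse, "surrogateescape") for name in os.listdir(dirname)]
```

The round trip depends only on the same encoding being used in both directions, not on that encoding being the one the files were created under.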
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Simon Cross wrote:
>> Unfortunately, for Windows, the situation would be exactly the opposite: the byte-oriented interface cannot represent all data; only the character-oriented API can.
>
> Is the second part of this actually true? My understanding may be flawed, but surely all Unicode data can be converted to and from bytes using UTF-8?

[I hope, by second part, you refer to the part that I left]

It's true that UTF-8 could represent all Windows file names. However, the byte-oriented APIs of Windows do not use UTF-8, but instead, they use the Windows ANSI code page (which varies with the installation).

> Given this, can't people who must have access to all files / environment data just use the bytes interface?

No, because the Windows API would interpret the bytes differently, and not find the right file.

Regards,
Martin
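The representability limit described here can be illustrated without calling any Windows API. This is a minimal sketch, assuming cp1252 as one example of an ANSI code page; the actual code page varies with the installation, and the file name is made up for the example.

```python
# A Japanese file name, written with escapes to keep the source ASCII-only.
name = "\u65e5\u672c\u8a9e.txt"

# UTF-8 can represent any such name.
utf8_bytes = name.encode("utf-8")

# A single-byte ANSI code page cannot.
try:
    name.encode("cp1252")
    fits_in_cp1252 = True
except UnicodeEncodeError:
    fits_in_cp1252 = False
```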
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> The problem with this, and other preceding schemes that have been discussed here, is that there is no means of ascertaining whether a particular file name str was obtained from a str API, or was funny-decoded from a bytes API... and thus, there is no means of reliably ascertaining whether a particular filename str should be passed to a str API, or funny-encoded back to bytes.

Why is it necessary that you are able to make this distinction?

> Picking a character (I don't find U+F01xx in the Unicode standard, so I don't know what it is)

It's a private use area. It will never carry an official character assignment.

> As I realized in the email-sig, in talking about decoding corrupted headers, there is only one way to guarantee this... to encode _all_ character sequences, from _all_ interfaces. Basically it requires reserving an escape character (I'll use ? in these examples -- yes, an ASCII question mark -- happens to be illegal in Windows filenames so all the better on that platform, but the specific character doesn't matter... avoiding / \ and . is probably good, though).

I think you'll have to write an alternative PEP if you want to see something like this implemented throughout Python.

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Humour aside :), the expectation that filenames are Unicode data simply doesn't agree with the reality of POSIX file systems. I think an approach similar to that adopted by glib [1] could work

Are you saying that the approach presented in the PEP will not work? I believe it would work no matter whether that expectation agrees with reality or not. The amount of moji-bake that you get is larger when the disagreement is larger, but it will continue to *work*.

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> The part that I haven't seen clearly addressed so far is what happens when disks get mounted across OSes (e.g. NFS). While I agree that there should be a layer on top that can handle most situations, it also seems clear that the raw layer needs to be readily accessible.

Indeed, with the PEP, the raw layer does remain readily available. If you know that it was originally bytes, you can get the very same bytes back if you want to.

However, for disks mounted across OSes, you won't have to, normally. If you think there is a problem with these, can you please describe a specific scenario? What application, what file names, what encodings, what problems?

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> [1] Actually, all the PEP says is "With this PEP, a uniform treatment of these data as characters becomes possible." An argument as to why this is a good thing would be a useful addition to the PEP. At the moment it's more or less treated as self-evident - which I agree with, but which clearly the Unix people here are not as certain of.

Ok, I have added another paragraph. Not sure whether it helps to clarify though.

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Because the encoding is not reliably reversible.

Why do you say that? The encoding is completely reversible (unless we disagree on what reversible means).

> I'm +1 on the concept, -1 on the PEP, due solely to the lack of a reversible encoding.

Then please provide an example for a setup where it is not reversible.

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> Following on from that, would this (under Martin's proposal) result in programs receiving encoded strings, or just semantically-incorrect ones?

Not sure I understand the question - what is an encoded string?

As you analyse below, sometimes, the current (2.x) file system encoding will do the right thing; sometimes, it will decode successfully, but still not give the intended string, and sometimes, it will fail. With the PEP, it won't fail, but give a string back that likely wasn't intended by the user. This might be confusing if you try to render it to a user interface; if the application merely passes it back to file system APIs, it will work fine.

> So, the next question is - do people on such systems frequently use high-bit characters in filenames?

They typically do until they run into problems. For example, if they set the locale to something, and then create files in their home directory, it will work just fine, and nobody else will ever see the files (except for the backup software). When they find that the files they created are inaccessible to others, they will often stop using funny characters.

Regards,
Martin
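The three outcomes described for strict decoding, and the change the PEP makes to the third, can be reproduced with a short sketch. It assumes Python 3 and the handler name `surrogateescape` (the thread's python-escape); the sample name is made up.

```python
sample = "\u00e9t\u00e9"                    # a name with two accented letters
utf8_name = sample.encode("utf-8")          # b'\xc3\xa9t\xc3\xa9'
latin1_name = sample.encode("iso-8859-1")   # b'\xe9t\xe9'

# 1. The assumed encoding matches: the intended string comes back.
right = utf8_name.decode("utf-8")

# 2. Decoding succeeds but gives an unintended string (mojibake):
#    UTF-8 bytes read as ISO-8859-1.
wrong = utf8_name.decode("iso-8859-1")

# 3. Strict decoding fails: ISO-8859-1 bytes read as UTF-8.
try:
    latin1_name.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With the PEP's error handler the third case no longer fails. The result
# is not what the user intended, but it re-encodes to the same bytes.
escaped = latin1_name.decode("utf-8", "surrogateescape")
```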
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> If the bytes are mapped to single half surrogate codes instead of the normal pairs (low+high), then I can see that decoding could never be ambiguous and encoding could produce the original bytes.

I was confused by Markus Kuhn's original UTF-8b specification. I have now changed the PEP to avoid using PUA characters at all.

Regards,
Martin
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
>> If the bytes are mapped to single half surrogate codes instead of the normal pairs (low+high), then I can see that decoding could never be ambiguous and encoding could produce the original bytes.
>
> I was confused by Markus Kuhn's original UTF-8b specification. I have now changed the PEP to avoid using PUA characters at all.

I find the PEP easier to understand now.

In detail I'd say that if a sequence of bytes >=0x80 is found which is not valid UTF-8, then the first byte is mapped to a half surrogate and then decoding is continued from the next byte.

The only drawback I can see is if the UTF-8 bytes actually decode to a half surrogate. However, half surrogates should really only occur in UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8 anyway!

As for handling this case, you could either:

1. Raise an exception (which is what you're trying to avoid)

or:

2. Treat it as invalid UTF-8 and map the bytes to half surrogates (encoding would produce the original bytes).

I'd prefer option 2.

Anyway, +1 from me.
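The byte-by-byte behaviour described in this message can be observed with the error handler as it ships in Python 3.1 and later under the name `surrogateescape`. A minimal sketch with a made-up byte string:

```python
# Two invalid bytes (0xFF and a stray continuation byte 0x80) surround a
# valid two-byte sequence for U+00E9.
data = b"ok\xff\xc3\xa9\x80"

text = data.decode("utf-8", "surrogateescape")

# Each invalid byte b is mapped on its own to the lone surrogate U+DC00 + b,
# and decoding then continues with the next byte.
escaped = [hex(ord(ch)) for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding reverses the mapping and produces the original bytes.
round_trip = text.encode("utf-8", "surrogateescape")
```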
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
2009/4/25 Martin v. Löwis mar...@v.loewis.de: Following on from that, would this (under Martin's proposal) result in programs receiving encoded strings, or just semantically-incorrect ones? Not sure I understand the question - what is an encoded string? Sorry. I was struggling to come up with terminology for the various concepts I was trying to express, as I went along. I was meaning a string which has been created from a non-decodable byte sequence using the encoding process you specify in the PEP (with the current version of the PEP, this would be a string with lone half surrogate codes). I was distinguishing these because some people seemed to be implying that such strings were the ones which would result in exceptions. (I think that was Stephen, when he referred to a careful API). As you analyse below, sometimes, the current (2.x) file system encoding will do the right thing; sometimes, it will decode successfully, but still not give the intended string, and sometimes, it will fail. With the PEP, it won't fail, but give a string back that likely wasn't intended by the user. This might be confusing if you try to render it to a user interface; if the application merely passes it back to file system APIs, it will work fine. OK, looks like my analysis matches yours, except that I wasn't sure if the third case (a string that likely wasn't intended) could result in exceptions. From what you're saying, it sounds like it would actually be similar to the second case - I'm not clear on how surrogates work, though. So, the next question is - do people on such systems frequently use high-bit characters in filenames? They typically do until they run into problems. For example, if they set the locale to something, and then create files in their homedirectory, it will work just fine, and nobody else will ever see the files (except for the backup software). When they find that the files they created are inaccessible to others, they will often stop using funny characters. 
Which sounds fairly practical - and the irony of someone with a funny character in his surname telling me this hasn't escaped me :-)

Paul.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> OK, looks like my analysis matches yours, except that I wasn't sure if the third case (a string that likely wasn't intended) could result in exceptions. From what you're saying, it sounds like it would actually be similar to the second case - I'm not clear on how surrogates work, though.

On decoding, there is a guarantee that it decodes successfully. There is also a guarantee that the result will re-encode successfully, and yield the same byte string.

If you pass a different string into encoding, you still may get exceptions. For example, if the filesystem encoding is latin-1, passing u"\u20ac" will continue to raise exceptions, even under the python-escape error handler - that error handler will only handle surrogates.

There isn't really that much trickery to surrogates. They *have* to come in pairs to be meaningful, with the first one in the range D800..DBFF (high surrogate), and the second in the range DC00..DFFF (low surrogate). Having a lone low surrogate is not meaningful; this is how the escaping works. Proper surrogate pairs encode characters outside the BMP, for use with UTF-16: each code contributes 10 bits (just count how many codes there are in D800..DBFF), together, a pair encodes 20 bits, allowing for 2**20 characters, starting at U+10000.

>> When they find that the files they created are inaccessible to others, they will often stop using funny characters.
>
> Which sounds fairly practical - and the irony of someone with a funny character in his surname telling me this hasn't escaped me :-)

Sure: my Unix account name was always loewis, and even on Windows, our admins didn't dare to put the umlaut into the account name - it would be difficult to login with a US keyboard, for example. People who use non-ASCII characters in filenames around here are primarily non-IT people who aren't aware that these characters are different from the rest.

I recognize that for other languages (without trivial transliterations) the problem is more severe, and people are more likely to create files with Cyrillic, or Japanese, names (say) if the system accepts them at all.

Regards,
Martin
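The surrogate arithmetic and the latin-1 example from this message can be worked through in a few lines. This is a minimal sketch, assuming Python 3 and the shipped handler name `surrogateescape`; the helper `combine_surrogates` is hypothetical. It uses the standard UTF-16 ranges: high surrogates D800..DBFF, low surrogates DC00..DFFF.

```python
def combine_surrogates(high, low):
    """Combine a high and a low surrogate into the code point they encode."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    # Each half contributes 10 bits; the 20-bit result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


codes_per_half = 0xDBFF - 0xD800 + 1             # 1024 codes, i.e. 10 bits
first = combine_surrogates(0xD800, 0xDC00)       # lowest code point of a pair
last = combine_surrogates(0xDBFF, 0xDFFF)        # highest code point of a pair

# The error handler only deals with lone surrogates. A character that the
# target encoding lacks, such as the euro sign in ISO-8859-1, still fails.
try:
    "\u20ac".encode("iso-8859-1", "surrogateescape")
    euro_encodes = True
except UnicodeEncodeError:
    euro_encodes = False
```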
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> The only drawback I can see is if the UTF-8 bytes actually decode to a half surrogate. However, half surrogates should really only occur in UTF-16 (as I understand it), so they shouldn't be encoded in UTF-8 anyway!

Right: that's the rationale for UTF-8b. Encoding half surrogates violates parts of the Unicode spec, so UTF-8b is safe.

> As for handling this case, you could either:
>
> 1. Raise an exception (which is what you're trying to avoid)
>
> or:
>
> 2. Treat it as invalid UTF-8 and map the bytes to half surrogates (encoding would produce the original bytes).
>
> I'd prefer option 2.

I hadn't thought of this case, but you are right - they *are* illegal bytes, after all. Raising an exception would be useless since the whole point of this codec is to never raise unicode errors.

Regards,
Martin
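Option 2 is how the released implementation behaves. The sketch below assumes CPython 3 and the handler name `surrogateescape`; the input is the three-byte sequence that a lax UTF-8 encoder would emit for the lone surrogate U+DC00.

```python
encoded_surrogate = b"\xed\xb0\x80"

# Strict UTF-8 rejects the sequence, because surrogates must not be encoded.
try:
    encoded_surrogate.decode("utf-8")
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

# With the error handler, each of the three bytes is treated as invalid and
# escaped on its own, so the original bytes remain recoverable.
text = encoded_surrogate.decode("utf-8", "surrogateescape")
round_trip = text.encode("utf-8", "surrogateescape")
```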
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Thanks for writing this PEP 383, MvL. I recently ran into this problem in Python 2.x in the Tahoe project [1]. The Tahoe project should be considered a good use case showing what some people need. For example, the assumption that a file will later be written back into the same local filesystem (and thus luckily use the same encoding) from which it originally came doesn't hold for us, because Tahoe is used for file-sharing as well as for backup-and-restore.

One of my first conclusions in pursuing this issue is that we can never use the Python 2.x unicode APIs on Linux, just as we can never use the Python 2.x str APIs on Windows [2]. (You mentioned this ugliness in your PEP.) My next conclusion was that the Linux way of doing encoding of filenames really sucks compared to, for example, the Mac OS X way. I'm heartened to see that David Wheeler is trying to persuade the maintainers of Linux filesystems to improve some of this: [3].

My final conclusion was that we needed to have two kinds of workaround for the Linux suckage: first, if decoding using the suggested filesystem encoding fails, then we fall back to mojibake [4] by decoding with iso-8859-1 (or else with windows-1252 -- I'm not sure if it matters and I haven't yet understood if utf-8b offers another alternative for this case). Second, if decoding succeeds using the suggested filesystem encoding on Linux, then write down the encoding that we used and include that with the filename. This expands the size of our filenames significantly, but it is the only way to allow some future programmer to undo the damage of a falsely-successful decoding. Here's our whole plan: [5].

Regards,

Zooko

[1] http://allmydata.org
[2] http://allmydata.org/pipermail/tahoe-dev/2009-March/001379.html # see the footnote of this message
[3] http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html
[4] http://en.wikipedia.org/wiki/Mojibake
[5] http://allmydata.org/trac/tahoe/ticket/534#comment:47
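The two workarounds described in this message might look roughly like the following. This is a simplified sketch, not Tahoe's actual code: the function name `decode_filename` is hypothetical, and it records the encoding in both cases, which may differ from the plan in the linked ticket.

```python
def decode_filename(raw, suggested_encoding):
    """Return (text, encoding_used) for a raw file name.

    Try the suggested file system encoding first. If that fails, fall back
    to ISO-8859-1, which accepts every byte sequence (possibly as mojibake).
    """
    try:
        return raw.decode(suggested_encoding), suggested_encoding
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1"), "iso-8859-1"


# A Latin-1 name on a system whose suggested encoding is UTF-8.
text, used = decode_filename(b"caf\xe9", "utf-8")

# Storing the encoding next to the name lets a later reader get the bytes back.
original = text.encode(used)
```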
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Sat, Apr 25, 2009 at 05:00:17PM +0200, Martin v. Löwis wrote:
> I recognize that for other languages (without trivial transliterations) the problem is more severe, and people are more likely to create files with Cyrillic, or Japanese, names (say) if the system accepts them at all.

In different encodings on the same filesystem...

Oleg.
--
Oleg Broytmann http://phd.pp.ru/ p...@phd.pp.ru
Programmers don't die, they just GOSUB without RETURN.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Sat, Apr 25, 2009 at 10:00, Martin v. Löwis mar...@v.loewis.de wrote:
> On decoding, there is a guarantee that it decodes successfully. There is also a guarantee that the result will re-encode successfully, and yield the same byte string.
>
> If you pass a different string into encoding, you still may get exceptions. For example, if the filesystem encoding is latin-1, passing u"\u20ac" will continue to raise exceptions, even under the python-escape error handler - that error handler will only handle surrogates.

One angle I've not seen discussed yet is a set of use cases. While the PEP addresses the need for the python developer to not have to write insane conditional code that maps between bytes and str depending on the platform, it doesn't talk about what this allows an application to provide to a user, and at what risks.

I see two main user-oriented use cases for the resulting Unicode strings this PEP will produce on all systems: displaying a list of filenames for the user to select from (an open file dialog), and allowing a user to edit or supply a filename (a save dialog or a rename control).

It's clear what this PEP provides for the former. On well-behaved systems where a simpler filesystemencoding approach would work, the results are identical; the user can select filenames that are what he expects to see on both Unix and Windows. On less well-behaved systems, some characters may appear as junk in the middle of the name (or would they be invisible?), but should be recognizable enough to choose, or at least to open sequentially and remember what the last one was. On particularly poorly behaved systems, the results will be extremely difficult to read, but no approach is likely to fix this.

What I don't find clear is what the risks are for the latter. On the less well behaved system, a user may well attempt to use this python application to fix filenames. Can we estimate a likelihood that edits to the names would result in a Unicode string that can no longer be encoded with the python-escape? Will a new name fully provided by a user on his keyboard (ignoring copy and paste) almost always safely encode?

--
Michael Urman
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
> I see two main user-oriented use cases for the resulting Unicode strings this PEP will produce on all systems: displaying a list of filenames for the user to select from (an open file dialog), and allowing a user to edit or supply a filename (a save dialog or a rename control).

There are more, in particular the case "user passes a file name on the command line", and "web server passes URL in environment variable".

> It's clear what this PEP provides for the former. On well-behaved systems where a simpler filesystemencoding approach would work, the results are identical; the user can select filenames that are what he expects to see on both Unix and Windows. On less well-behaved systems, some characters may appear as junk in the middle of the name (or would they be invisible?)

Depends on the rendering. Try print u'\udc00' in your terminal to see what happens; for me, it renders the glyph for replacement character. In GUI applications, you often see white boxes (rectangles).

> What I don't find clear is what the risks are for the latter. On the less well behaved system, a user may well attempt to use this python application to fix filenames. Can we estimate a likelihood that edits to the names would result in a Unicode string that can no longer be encoded with the python-escape? Will a new name fully provided by a user on his keyboard (ignoring copy and paste) almost always safely encode?

That very much depends on the system setup, and your impression is right that the PEP doesn't address it - it only deals with cases where you get random unsupported bytes; getting random unsupported characters from the user is not considered.

If the user has the locale set up in a way that matches his keyboard, it should work all fine - and will already, even without the PEP. If the user enters a character that doesn't directly map to a good file name, you get an exception, and have to tell the user to pick a different filename.

Notice that it may fail at several layers:
- it may be that characters entered are not supported in what Python chooses as the file system encoding.
- it may be that the characters are not supported by the file system, e.g. leading spaces in Win32.
- it may be that the file cannot be renamed because the target name already exists.

In all these cases, the application has to ask the user to reconsider; for at least the last case, it should be prepared to do that, anyway (there is also the case where renaming fails because of lack of permissions; in that case, picking a different file name won't help).

Regards,
Martin
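An application-side sketch of the failure layers listed above might look like this. The function name `try_rename` and the message texts are hypothetical, and the existence check is a simplification: it is racy, and it is there because on POSIX `os.rename` silently replaces an existing file.

```python
import os


def try_rename(old, new):
    """Attempt a rename; return None on success or a message for the user."""
    try:
        if os.path.exists(new):
            return "target name already exists; pick a different name"
        os.rename(old, new)
    except UnicodeEncodeError:
        # The name contains characters the file system encoding lacks.
        return "name cannot be encoded for the file system; pick a different name"
    except PermissionError:
        # A different name would not help in this case.
        return "permission denied"
    except OSError as exc:
        # For example, characters the file system itself does not accept.
        return "the file system rejected the name: %s" % exc
    return None
```

A caller would loop, showing the returned message and asking for another name, until the function returns `None` or reports a permission problem.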
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Paul Moore p.f.moore at gmail.com writes:
> But those people are also the *least* likely people to contribute on an English-speaking list, I guess :-(
> (Sincere apologies if everyone but me on this list happens to actually be fluent English-speaking Russians :-))

Actually, we're all Finnish.

Regards,
Åntoine.
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On Sat, Apr 25, 2009 at 11:33, Martin v. Löwis mar...@v.loewis.de wrote:
> If the user has the locale set up in a way that matches his keyboard, it should all work fine - and will already, even without the PEP. If the user enters a character that doesn't directly map to a good file name, you get an exception, and have to tell the user to pick a different filename.

This sounds good so far - the 90% (or higher) case is still clean.

> Notice that it may fail at several layers:
> - it may be that the characters entered are not supported in what Python chooses as the file system encoding.
> - it may be that the characters are not supported by the file system, e.g. leading spaces in Win32.
> - it may be that the file cannot be renamed because the target name already exists.
>
> In all these cases, the application has to ask the user to reconsider; for at least the last case, it should be prepared to do that, anyway (there is also the case where renaming fails because of lack of permissions; in that case, picking a different file name won't help).

This argument sounds good to me too. How will we communicate to developers what new exception might occur where? It would be a shame to have a solid application developed under Windows start raising encoding exceptions on Linux. Would the encoding error get mapped to an IOError for all file APIs that do this encoding?

--
Michael Urman
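For context on the question above: in the Python 3 releases that later shipped, UnicodeEncodeError is a subclass of ValueError, not of OSError (IOError is an alias of OSError since 3.3), so an application that wants to survive both kinds of failure has to catch both. The sketch below only illustrates that; describe_failure is a hypothetical helper, not part of any library.

```python
import os


def describe_failure(action):
    """Run a zero-argument callable; return None or a (kind, message) pair."""
    try:
        action()
    except UnicodeError as exc:
        # The name could not be encoded to, or decoded from, bytes.
        return ("encoding", str(exc))
    except OSError as exc:
        # The operating system refused the call (missing file, permissions...).
        return ("os", str(exc))
    return None
```

A caller could wrap any file operation, for example describe_failure(lambda: os.rename(old, new)), and report the two kinds of failure differently.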
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
Martin v. Löwis wrote:
>> I see two main user-oriented use cases for the resulting Unicode strings this PEP will produce on all systems: displaying a list of filenames for the user to select from (an open file dialog), and allowing a user to edit or supply a filename (a save dialog or a rename control).
>
> There are more, in particular the case where the user passes a file name on the command line, and the case where a web server passes a URL in an environment variable.
>
>> It's clear what this PEP provides for the former. On well-behaved systems where a simpler filesystemencoding approach would work, the results are identical; the user can select filenames that are what he expects to see on both Unix and Windows. On less well-behaved systems, some characters may appear as junk in the middle of the name (or would they be invisible?)
>
> Depends on the rendering. Try print u'\udc00' in your terminal to see what happens; for me, it renders the glyph for the replacement character. In GUI applications, you often see white boxes (rectangles).
>
>> What I don't find clear is what the risks are for the latter. On the less well behaved system, a user may well attempt to use this python application to fix filenames. Can we estimate a likelihood that edits to the names would result in a Unicode string that can no longer be encoded with the python-escape? Will a new name fully provided by a user on his keyboard (ignoring copy and paste) almost always safely encode?
>
> That very much depends on the system setup, and your impression is right that the PEP doesn't address it - it only deals with cases where you get random unsupported bytes; getting random unsupported characters from the user is not considered.
>
> If the user has the locale set up in a way that matches his keyboard, it should all work fine - and will already, even without the PEP. If the user enters a character that doesn't directly map to a good file name, you get an exception, and have to tell the user to pick a different filename.
>
> Notice that it may fail at several layers:
> - it may be that the characters entered are not supported in what Python chooses as the file system encoding.
> - it may be that the characters are not supported by the file system, e.g. leading spaces in Win32.
> - it may be that the file cannot be renamed because the target name already exists.
>
> In all these cases, the application has to ask the user to reconsider; for at least the last case, it should be prepared to do that, anyway (there is also the case where renaming fails because of lack of permissions; in that case, picking a different file name won't help).

This has made me think about what happens going the other way, i.e. when a user-supplied Unicode string needs to be converted to UTF-8b. That should also be reversible. Therefore:

When encoding using UTF-8b, codepoints in the range U+DC80..U+DCFF should map to bytes 0x80..0xFF; all other codepoints, including the remaining half surrogates, should be encoded normally.

When decoding using UTF-8b, undecodable bytes in the range 0x80..0xFF should map to U+DC80..U+DCFF; all other bytes, including the encodings for the remaining half surrogates, should be decoded normally.

This will ensure that even when the user has provided a string containing half surrogates it can be encoded to bytes and then decoded back to the original string.
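The byte-to-surrogate mapping described above can be tried out with the surrogateescape error handler that later shipped in Python 3.1, which is assumed here to correspond to the UTF-8b / python-escape scheme under discussion. The helper names to_text, to_bytes and encodes_cleanly are made up for this sketch.

```python
def to_text(raw, encoding="utf-8"):
    """Decode OS-level bytes; undecodable bytes 0x80..0xFF become U+DC80..U+DCFF."""
    return raw.decode(encoding, "surrogateescape")


def to_bytes(text, encoding="utf-8"):
    """Inverse of to_text: U+DC80..U+DCFF turn back into the original bytes."""
    return text.encode(encoding, "surrogateescape")


def encodes_cleanly(text, encoding="utf-8"):
    """Report whether text can be turned into bytes with this scheme."""
    try:
        to_bytes(text, encoding)
    except UnicodeEncodeError:
        return False
    return True


raw = b"caf\xe9.txt"          # Latin-1 bytes, not valid UTF-8
name = to_text(raw)           # 'caf\udce9.txt'
assert to_bytes(name) == raw  # the original bytes come back unchanged
```

One difference from the proposal above: as shipped, the handler only covers U+DC80..U+DCFF. Other lone surrogates, such as U+D800, are rejected by the UTF-8 codec instead of being encoded normally, so encodes_cleanly("\ud800") is False.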
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
-On [20090425 11:01], Paul Moore (p.f.mo...@gmail.com) wrote:
> PS Unfortunately, I suspect that the biggest group of people likely to be hit badly by this is people using non-latin scripts. And arguing probabilities without real data is optimistic at best. But those people are also the *least* likely people to contribute on an English-speaking list, I guess :-( (Sincere apologies if everyone but me on this list happens to actually be fluent English-speaking Russians :-))

Even though I am Dutch I have to deal with a variety of scripts for my i18n and L10n efforts, which includes contributions to Unicode. Aside from that I also have my fair share of audio files which have the names/descriptions in the respective script (Thai, Korean, Chinese, Taiwanese, Japanese, and so on).

--
Jeroen Ruigrok van der Werven asmodai(-at-)in-nomine.org / asmodai
イェルーン ラウフロック ヴァン デル ウェルヴェン
http://www.in-nomine.org/ | http://www.rangaku.org/ | GPG: 2EAC625B
Necessity relieves us of the ordeal of choice...
Re: [Python-Dev] [Python-checkins] r71946 - peps/trunk/pep-0315.txt
You might want to note in the PEP that the problem that's being solved is known as the "loop and a half" problem.

http://www.cs.duke.edu/~ola/patterns/plopd/loops.html#loop-and-a-half

raymond.hettinger wrote:
> Author: raymond.hettinger
> Date: Sun Apr 26 02:34:36 2009
> New Revision: 71946
>
> Log:
> Revive PEP 315.
>
> Modified:
>    peps/trunk/pep-0315.txt
>
> Modified: peps/trunk/pep-0315.txt
> ==
> --- peps/trunk/pep-0315.txt (original)
> +++ peps/trunk/pep-0315.txt Sun Apr 26 02:34:36 2009
> @@ -2,9 +2,9 @@
>  Title: Enhanced While Loop
>  Version: $Revision$
>  Last-Modified: $Date$
> -Author: W Isaac Carroll icarr...@pobox.com
> -Raymond Hettinger pyt...@rcn.com
> -Status: Deferred
> +Author: Raymond Hettinger pyt...@rcn.com
> +W Isaac Carroll icarr...@pobox.com
> +Status: Draft
>  Type: Standards Track
>  Content-Type: text/plain
>  Created: 25-Apr-2003
> ___
> Python-checkins mailing list
> python-check...@python.org
> http://mail.python.org/mailman/listinfo/python-checkins
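For readers unfamiliar with the term, here is a small sketch of the loop-and-a-half shape as it is usually written in Python today: part of the body has to run before the exit test, and the rest after it. The function read_until_blank is a made-up example, not taken from the PEP.

```python
def read_until_blank(lines):
    """Collect items up to (not including) the first empty string."""
    source = iter(lines)
    collected = []
    while True:
        line = next(source, "")   # first half: fetch the next item
        if line == "":            # exit test sits in the middle of the body
            break
        collected.append(line)    # second half: process the item
    return collected
```

The "while True ... break" form avoids duplicating the fetch step before the loop and again at the end of the body.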
Re: [Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
On 25Apr2009 14:07, Martin v. Löwis mar...@v.loewis.de wrote:
| Cameron Simpson wrote:
| > On 22Apr2009 08:50, Martin v. Löwis mar...@v.loewis.de wrote:
| > | File names, environment variables, and command line arguments are
| > | defined as being character data in POSIX;
| >
| > Specific citation please? I'd like to check the specifics of this.
| For example, on environment variables:
| http://opengroup.org/onlinepubs/007908799/xbd/envvar.html
[...]
| http://opengroup.org/onlinepubs/007908799/xsh/execve.html
[...]

Thanks.

| > So you're proposing that all POSIX OS interfaces (which use byte strings)
| > interpret those byte strings into Python3 str objects, with a codec
| > that will accept arbitrary byte sequences losslessly and is totally
| > reversible, yes?
|
| Correct.
|
| > And, I hope, that the os.* interfaces silently use it by default.
|
| Correct.

Ok, then I'm probably good with the PEP. Though I have a quite strong desire to be able to work in bytes at need without doing multiple encode/decode steps.

| > | Applications that need to process the original byte
| > | strings can obtain them by encoding the character strings with the
| > | file system encoding, passing python-escape as the error handler
| > | name.
| >
| > -1
| > This last sentence kills the idea for me, unless I'm missing something.
| > Which I may be, of course.
| > POSIX filesystems _do_not_ have a file system encoding.
|
| Why is that a problem for the PEP?

Because you said above "by encoding the character strings with the file system encoding", which is a fiction.

| > If I'm writing a general purpose UNIX tool like chmod or find, I expect
| > it to work reliably on _any_ UNIX pathname. It must be totally encoding
| > blind. If I speak to the os.* interface to open a file, I expect to hand
| > it bytes and have it behave.
|
| See the other messages. If you want to do that, you can continue to.
|
| > I'm very much in favour of being able to work in strings for most
| > purposes, but if I use the os.* interfaces on a UNIX system it is
| > necessary to be _able_ to work in bytes, because UNIX file pathnames
| > are bytes.
|
| Please re-read the PEP. It provides a way of being able to access any
| POSIX file name correctly, and still pass strings.
|
| > If there isn't a byte-safe os.* facility in Python3, it will simply be
| > unsuitable for writing low level UNIX tools.
|
| Why is that? The mechanism in the PEP is precisely defined to allow
| writing low level UNIX tools.

Then implicitly it's byte safe. Clearly I'm being unclear; I mean original OS-level byte strings must be obtainable undamaged, and it must be possible to create/work on OS objects starting with a byte string as the pathname.

| > Finally, I have a small python program whose whole purpose in life
| > is to transcode UNIX filenames before transfer to a MacOSX HFS
| > directory, because of HFS's enforced particular encoding. What approach
| > should a Python app take to transcode UNIX pathnames under your scheme?
|
| Compute the corresponding character strings, and use them.

In Python2 I've been going (ignoring checks for unchanged names):

  - Obtain the old name and interpret it into a str() correctly.
    I mean here that I go:

        unicode_name = unicode(name, srcencoding)

    in old Python2 speak. name is a bytes string obtained from listdir()
    and srcencoding is the encoding known to have been used when the old
    name was constructed. Eg iso8859-1.

  - Compute the new name in the desired encoding. For MacOSX HFS, that's:

        utf8_name = unicodedata.normalize('NFD',unicode_name).encode('utf8')

    Still in Python2 speak, that's a byte string.

  - os.rename(name, utf8_name)

Under your scheme I imagine this is amended.
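For comparison, the three Python2 steps above could be written roughly as follows in Python 3 when working purely in bytes, which os.listdir() and os.rename() accept on POSIX. This is only a sketch of the transcoding step; hfs_name is a hypothetical helper name, and the NFD form is taken from the description above without being checked against HFS itself.

```python
import unicodedata


def hfs_name(name_bytes, srcencoding):
    """Transcode a file name from srcencoding to NFD-normalized UTF-8 bytes."""
    # Step 1: interpret the old name using the encoding it was created with.
    text = name_bytes.decode(srcencoding)
    # Step 2: decompose accented characters, then encode as UTF-8.
    return unicodedata.normalize("NFD", text).encode("utf-8")
```

With both names as bytes, the final step is os.rename(name_bytes, hfs_name(name_bytes, srcencoding)), and the locale's encoding is never consulted.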
I would change your listdir_b() function as follows:

    def listdir_b(bytestring, fse=None):
        if fse is None:
            fse = sys.getfilesystemencoding()
        string = bytestring.decode(fse, "python-escape")
        for fn in os.listdir(string):
            yield fn.encode(fse, "python-escape")

So, internally, os.listdir() takes a string and encodes it to an _unspecified_ encoding in bytes, and opens the directory with that byte string using POSIX opendir(3).

How does listdir() ensure that the byte string it passes to the underlying opendir(3) is identical to 'bytestring' as passed to listdir_b()?

It seems from the PEP that "On POSIX systems, Python currently applies the locale's encoding to convert the byte data to Unicode". Your extension is to augment that by expressing the non-decodable byte sequences in a non-conflicting way for reversal later, yes?

That seems to double the complexity of my example application, since it wants to interpret the original bytes in a caller-specified fashion, not using the locale defaults.

So I must go:

    def macify(dirname, srcencoding):
        # I need this to reverse your encoding scheme
        fse = sys.getfilesystemencoding()
        # I'll pretend dirname is ready for use
        # it possibly has had to undergo the