On approximately 5/6/2009 1:58 PM, came the following characters from
the keyboard of Zooko Wilcox-O'Hearn:
> Thank you for your interesting message about this perplexing issue. I
> haven't yet read all of your message, but just so you know I went and
> added [email protected] to the list of senders whose posts are
> automatically approved to go to tahoe-dev. Please feel free to Cc:
> [email protected] in the future, and if you would be willing to
> resend your message (quoted below) to tahoe-dev I would appreciate it.
>
> Regards,
>
> Zooko
I'll go ahead and resend to your list.

I did read the other message about requirements that you mentioned in
your other response:
<http://allmydata.org/pipermail/tahoe-dev/2009-May/001714.html>

Per that message, I would say that my "Uncertainty 2)" applies. I'll
make a few more comments at the bottom, based on reading the message at
the above link.

Because it sounds like an interesting project, I'm willing to read and
comment on any emails on this topic that are Cc:'d to me, if and when my
interest causes me to spend time that should probably be spent doing
something else, but I don't have time to join the group and go looking
for them.

> On May 6, 2009, at 14:17 PM, Glenn Linderman wrote:
>
>> On approximately 5/6/2009 12:18 PM, came the following characters from
>> the keyboard of Zooko Wilcox-O'Hearn:
>>> On May 6, 2009, at 10:54 AM, Antoine Pitrou wrote:
>>>> Zooko Wilcox-O'Hearn <zooko <at> zooko.com> writes:
>>>>>
>>>>> I'm not thinking of API compatibility as much as data compatibility
>>>>> -- someone used Python 3.1 to write down some filenames, and now a
>>>>> few years later they are trying to use the latest and greatest
>>>>> Python release to read those filenames...
>>>>
>>>> Well, if the filenames are generated by Python (as opposed to read
>>>> from an existing directory on disk), they should be regular unicode
>>>> objects without any lone surrogates, so I don't see the
>>>> compatibility problem.
>>> I meant that the application reads filenames from an existing
>>> directory on disk, saves those filenames, and then later, using a
>>> future version of Python, wants to read them and use them.
>>
>>
>> Regarding future versions of Python: in the worst case, even if
>> Python's default behavior changes, the transcoding done by PEP 383 can
>> be done in other software too... it is a straightforward, fully
>> specified, 1-to-1, reversible transcoding process, affecting and
>> generating only invalid byte encodings on one side, and invalid
>> Unicode sequences on the other.
>>
>> So if Python's default behavior should change, the transcoding
>> implemented by PEP 383 could be easily reimplemented to enable a
>> future version of a Python application to manipulate the transcoded,
>> saved, filenames.
>>
>> By easily, I mean that I could code it in a couple hours, max.
>>
>>
>>> I'm not saying that I know this would be a problem. I'm saying that
>>> I personally can't tell whether it would be a problem or not, and the
>>> extensive discussions so far have not convinced me that there is
>>> anyone who both understands PEP 383 and considers this use case.
>>
>>
>> Does the above help?
>>
>>
>>> Many people who apparently understand encoding issues well have said
>>> something to the effect that there is no problem, but those people
>>> haven't yet managed to get through my thick skull how I would use PEP
>>> 383 safely for this sort of use case -- the one where data generated
>>> by os.listdir() travels forward in time or the one where that data
>>> travels sideways to other systems, including Windows or other systems
>>> that validate incoming unicode.
>>
>>
>> Regarding data traveling sideways, some comments:
>>
>> 1) PEP 383's effect could be recoded in other languages as easily as
>> it is in Python (or the C in which Python is implemented). So that
>> could be a solution.
>>
>> 2) You mention "Windows" and "other systems that validate incoming
>> unicode" in the same phrase, as if you think that "Windows" qualifies
>> as an "other systems that validate incoming unicode", but it does not
>> (at least not universally).
>>
>>
>>> That's why I am a bit uncomfortable about PEP 383 being quickly
>>> implemented and deployed in Python 3.1.
>>
>>
>> Does the above help?
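
As an illustration of the claim quoted above that the PEP 383 transcoding
could be reimplemented by hand, here is a minimal sketch in Python. It
assumes an ASCII-compatible encoding such as UTF-8 and the mapping defined
by PEP 383 (each undecodable byte 0x80..0xFF becomes the lone surrogate
U+DC80..U+DCFF). The function names escape_decode and escape_encode are
hypothetical, chosen only for this example; they are not part of Python or
of Tahoe.

```python
def escape_decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, turning each undecodable byte into a lone surrogate.

    Hypothetical helper: a hand-written equivalent of
    raw.decode(encoding, "surrogateescape") for ASCII-compatible encodings.
    """
    pieces = []
    pos = 0
    while pos < len(raw):
        try:
            # Try to decode everything that is left in one go.
            pieces.append(raw[pos:].decode(encoding, "strict"))
            break
        except UnicodeDecodeError as err:
            # err.start is relative to the slice raw[pos:].
            bad_index = pos + err.start
            pieces.append(raw[pos:bad_index].decode(encoding, "strict"))
            bad_byte = raw[bad_index]
            if bad_byte < 0x80:
                # PEP 383 never escapes ASCII bytes; give up instead.
                raise
            pieces.append(chr(0xDC00 + bad_byte))
            pos = bad_index + 1
    return "".join(pieces)


def escape_encode(name: str, encoding: str = "utf-8") -> bytes:
    """Encode a str, turning U+DC80..U+DCFF back into the original bytes."""
    out = bytearray()
    for ch in name:
        code_point = ord(ch)
        if 0xDC80 <= code_point <= 0xDCFF:
            out.append(code_point - 0xDC00)
        else:
            # Any other lone surrogate makes the strict encoder raise.
            out += ch.encode(encoding, "strict")
    return bytes(out)


# A Latin-1 encoded name is not valid UTF-8, yet it survives the round trip.
original = b"caf\xe9.txt"
assert escape_encode(escape_decode(original)) == original
```

The sketch escapes one byte at a time and then retries, which keeps it
short at the cost of some speed; for filename-sized inputs that should not
matter.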
>>
>>
>>> By the way, much of the detailed discussion about what Tahoe requires
>>> and how that may or may not benefit from PEP 383 has now moved to the
>>> tahoe-dev mailing list:
>>> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev .
>>
>>
>> I have no background with Tahoe, nor particular interest, although it
>> sounds like a useful project... so I won't be joining that list. I
>> have no idea if there is an installed base of existing Tahoe file
>> systems; my suggestions below assume that there is not, and that you
>> are presently inventing them. Therefore, I provide no migration path,
>> although I could invent one, but it would take longer to describe.
>>
>> However, since I'm responding here, and have read what you have posted
>> here, it seems like the following could be true.
>>
>> Assumptions from your emails:
>>
>> A) Tahoe wants to provide a UTF-8 file name system
>> B) Tahoe wants to interface to POSIX systems that use (and do not
>> validate) byte interfaces.
>> C) Tahoe wants to interface to non-POSIX systems that use 16-bit file
>> name interfaces, with no validation.
>> D) Tahoe wants to interface to non-POSIX systems that use 16-bit file
>> name interfaces, with validation.
>>
>> Uncertainties: I'm not clear on what your goals are for Tahoe
>> filenames. There seem to be 2 possibilities:
>>
>> 1) you want to reject attempts to use non-validating Unicode, be it
>> from a 16-bit interface, or a bytes interface.
>> 2) you don't want to reject non-validating Unicode, but you want to
>> convert it to valid Unicode for (D) systems.
>>
>> 3) Orthogonally, you might want to store only valid Unicode in the
>> names, or you might not care, if you can meet the other goals.
>>
>> Truisms:
>>
>> If you want to support (D), and (2), then you must transform names at
>> some point, using some scheme, because not all names supplied by (B)
>> systems will be acceptable to (D) systems. You can choose to do this
>> transformation when a (B) system provides an invalid (per Unicode)
>> name, or you can choose to do the transformation when a (D) system
>> accesses a file with an invalid (per Unicode) name.
>>
>> If the (B) and (D) systems talk to each other outside of Tahoe, they
>> will have to do similar transformations, or, if they both access the
>> same Tahoe system, they will have to do the identical transformation,
>> to be sure that they can access the same file.
>>
>> All transcoding schemes have the possibility of data puns between
>> non-transcoded names and transcoded names. In order to successfully
>> and properly manipulate a name, you must know whether or not it has
>> been transcoded, and how.
>>
>> PEP 383 limits its transcoding to names that are invalid (per
>> Unicode). Names that cannot be properly decoded to Unicode are
>> decoded to invalid Unicode. Names that are invalid Unicode are
>> encoded to invalid byte sequences (per the encoding scheme specified).
>>
>> For PEP 383 and Python, transcoded names can be distinguished by
>> checking for the existence of lone surrogates in the str form of the
>> filename, or by attempting to do a strict decoding of the bytes form
>> of the filename, depending on what you have (generally, the former).
>>
>> For PEP 383 and Python, the names will round trip from the POSIX bytes
>> interfaces to the program, and back to POSIX bytes interfaces, as long
>> as only Python wrappers of system functions are used, and the
>> filesystem encoding is not changed between calls (or is restored).
>> Passing them to 3rd party libraries or other systems requires extra
>> work, if there is a desire to manipulate files with names that are not
>> decodable to Unicode by the standard decoding algorithm for that
>> encoding.
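
The two detection methods and the round trip described in the quoted text
above can be sketched in Python as follows. This is only an illustration,
assuming UTF-8 as the filesystem encoding and the "surrogateescape" error
handler that PEP 383 defines; the helper names is_pep383_transcoded and
is_strictly_decodable are hypothetical and exist only for this example.

```python
def is_pep383_transcoded(name: str) -> bool:
    """Return True if the str form holds a PEP 383 escape.

    PEP 383 maps each undecodable byte 0x80..0xFF to the lone surrogate
    U+DC80..U+DCFF, so finding one of those marks a transcoded name.
    """
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in name)


def is_strictly_decodable(raw: bytes, encoding: str = "utf-8") -> bool:
    """Return True if the bytes form decodes without any escaping."""
    try:
        raw.decode(encoding, "strict")
    except UnicodeDecodeError:
        return False
    return True


# A name as a POSIX system might hand it over: Latin-1 bytes, not UTF-8.
raw_name = b"caf\xe9.txt"

# What a PEP 383 aware str interface would produce from those bytes.
str_name = raw_name.decode("utf-8", "surrogateescape")

# Either form of the name reveals that transcoding took place.
assert is_pep383_transcoded(str_name)
assert not is_strictly_decodable(raw_name)

# Encoding with the same handler restores the original bytes exactly.
assert str_name.encode("utf-8", "surrogateescape") == raw_name

# A name that was valid UTF-8 to begin with is left untouched.
good_name = "caf\u00e9.txt".encode("utf-8")
assert is_strictly_decodable(good_name)
assert not is_pep383_transcoded(good_name.decode("utf-8", "surrogateescape"))
```

An application that wants to store only valid Unicode could apply a check
like the first one to each name it receives from a str interface before
saving it.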

Comments about your interfaces (quote from the linked message you sent):

> Then, there are at least five interfaces for connecting the WAPI up to
> other things:

1 & 2 sound like special purpose client programs. Such programs can
access APIs beyond a mapping of file-system APIs, and do name
validations, transcodings, and any other necessary tasks that help.

For interface 3, Windows CIFS/SMB, you need to make sure to validate the
incoming 16-bit codes for Unicode validity. Windows doesn't. Windows
won't supply certain characters in file names that it considers illegal,
which include at least : \ ? * (I think there are a few more; the list
is documented). You need to have a plan for what to do when a
non-Windows system creates a filename that may be legal Unicode, but is
not a legal Windows filename.

I know nothing about interfaces 4, 5, or 6.

Comments about your requirements (quote from the linked message you sent):

> Okay, that's the setting, now the five possible requirements are:

Requirement 1 makes it sound like you want to always store a valid
Unicode filename. I think that is a good thing, overall.

Requirement 2 makes it sound like you want to decode bytes to Unicode
using the current filesystem encoding, on POSIX systems. Because you
only want valid Unicode, it sounds like you would not benefit from PEP
383, which produces invalid Unicode. However, if you are running (in the
future) on a POSIX system that uses PEP 383, you would either need to
use the bytes interfaces, and do your own strict Unicode decoding, or
use the str interfaces, but validate that the result contains no lone
surrogates. Either of these would enable you to determine the faithful
unicode if decodable case.

Requirement 3 requires some sort of transcoding; you could start from
either the original bytes, or the PEP 383 invalid str.
If you produce a name that contains only valid Unicode, then it will
match what could be a valid Unicode name that was produced in other ways
(by the user typing it, for example). If you have collisions, you will
not know whether the two names were supposed to be the same, or were
supposed to be different, except that you could keep track of the fact
that one was generated by transcoding, and the other not. But not all of
your interfaces (particularly Windows CIFS/SMB) will be able to access
that information, or use it in any meaningful manner. So you have the
benefit of having readable, valid Unicode names, but you have the cost
of having data puns. The only scheme I can think of for transcoding in
this manner is to have a reserved directory (that lends itself to path
name puns, too, of course) for names that have been transcoded, such
that /foo and /transcoded/foo are different, even though they otherwise
look the same. This would be cumbersome, and would require client
programs using filesystem interfaces to either not see those names, or
have to go looking for them in particular. Of course, the whole issue
could be avoided by people using only valid Unicode names.

Requirements 4 & 5 can only be met for files initially created with
invalid Unicode names via extra metadata, or via a scheme like the
reserved directory and a reversible transformation. Otherwise, CIFS/SMB
access could make a copy, not copy the extra metadata (which cannot be
available on that interface), and delete the original. Copying the copy
back to the original client wouldn't find the metadata to know what the
original name was.

Whatever the scheme, configuration or command-line options could
"tighten" the restrictions, so that only valid Unicode names would be
acceptable.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
