Stephen J. Turnbull wrote:
> what's important for this discussion is what appears at the interfaces
> (user-client, client-server).
...
> But that means that what I saw before with Unix ls is no longer what I see in
> Tahoe, *or in the destination Unix system with ls*.

I still don't understand why you say that. Let me back up and see if we have the same model of the components and interfaces.

There is a system, which either offers a unicode-safe interface (Windows, Mac) or a bytes interface (Linux, Solaris). If it offers a bytes interface then it also offers a declared encoding. Tahoe runs as a user-space process and relies on Python (version 2) to tell Tahoe what this declared encoding is, with sys.getfilesystemencoding() (for the filesystem) and sys.getdefaultencoding() (for command-line arguments and stdin/stdout).

Then, a Tahoe client writes something down, which must include a valid unicode filename as its primary key, and which may also have optional metadata. Then, another Tahoe client reads what was written and loads it into its memory in-process.

The standard interface for that second Tahoe client to emit information is the WUI/WAPI (Web User Interface / Web Application Programming Interface). See http://testgrid.allmydata.org:3567/uri/URI%3ADIR2%3Adjrdkfawoqihigoett4g6auz6a%3Ajx5mplfpwexnoqff7y5e4zjus4lidm76dcuarpct7cckorh2dpgq for an example. An HTTP client contacts the Tahoe client (which acts as an HTTP server), sends an HTTP request, and receives an answer which includes a view of the Tahoe filesystem, such as a directory listing.

Then, there are at least five interfaces for connecting the WAPI up to other things:

1. The CLI (Command Line Interface) is a process that contacts the Tahoe client over HTTP and, for its "tahoe ls" and related commands, emits results on stdout/stderr.

2. The CLI's "tahoe cp" command reads files and directories over HTTP and writes them directly into the local filesystem (when the target of the cp is the local filesystem).

3. The Windows native client serves CIFS/SMB to the local Windows operating system and presents the results returned by the Tahoe client.

4. The iPhone client presents the results on iPhone.

5. The FUSE plugins present the results to the Linux or Mac VFS layer.

6. There may be others that I don't fully appreciate. Probably the two Ruby libraries are sensitive to the decisions we make in this design, but I'm not sure.

Okay, that's the setting. Now the five possible requirements are:

Requirement 1 "valid unicode filename": This is mandated by backwards compatibility with the current Tahoe clients, and also because the five or more external components listed above expect valid unicode in the "filename" slot.

Requirement 2 "faithful unicode if decodable": If a filename decodes with the encoding reported by getfilesystemencoding(), then we'll use the resulting unicode as the filename.

Requirement 3 "no file left behind": If a filename doesn't decode with the encoding reported by getfilesystemencoding(), then we'll invent a unicode string with which to refer to that file, so that the file will at least be present even if badly named.

(Note that these first three requirements already require Tahoe to implement some handling of collisions, for when the unicode string we invented to name a file with an undecodable name happens to be the same as the name of another file in the same directory.)

Requirement 5 "no loss of information": A future cyborg archaeologist can dig into the Tahoe metadata and figure out what the bits were before the filesystem was copied into Tahoe.

Possible Requirement 4 "round trip == faithful bytes": This is the tricky one. The motivation is that if you have a Linux or Solaris system, and you do a backup with Tahoe, and then later do a restore with Tahoe, you want the same bytestrings for all your filenames to be restored, even if your locale was set such that those bytestrings were undecodable when you did the backup, or even if your locale was set so that the bytestrings were decodable but were mojibake. On the other hand, if this requirement is satisfied by default, then what you see when you view a Tahoe directory through the WUI, "tahoe ls", etc. will be different from what you get when you restore that Tahoe directory to your local filesystem.

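To make Requirements 2 and 3 and the collision note concrete, here is a minimal sketch. It is not Tahoe's code. It is written for Python 3 with filenames as bytes (the discussion above is about Python 2), the function name unicode_names is hypothetical, and both the U+FFFD replacement and the " (N)" counter suffix are just one possible way to invent a name and resolve a collision.

```python
def unicode_names(raw_names, encoding):
    """Map each raw bytes filename in one directory to a unique unicode name.

    Assumption: faithfully decodable names are claimed first, so an invented
    name is the one that gets adjusted when the two collide.
    """
    result = {}
    taken = set()
    undecodable = []

    # Requirement 2: use the faithful decoding whenever one exists.
    for raw in raw_names:
        try:
            name = raw.decode(encoding)
        except UnicodeDecodeError:
            undecodable.append(raw)
            continue
        result[raw] = name
        taken.add(name)

    # Requirement 3: invent a name so that no file is left behind.
    for raw in undecodable:
        base = raw.decode(encoding, errors="replace")  # bad bytes -> U+FFFD
        candidate = base
        counter = 1
        # Hypothetical collision handling: append a counter until unique.
        while candidate in taken:
            candidate = "%s (%d)" % (base, counter)
            counter += 1
        result[raw] = candidate
        taken.add(candidate)

    return result


if __name__ == "__main__":
    listing = [b"a\xff", b"caf\xc3\xa9", b"a\xfe", "a\ufffd".encode("utf-8")]
    for raw, name in unicode_names(listing, "utf-8").items():
        print(repr(raw), "->", repr(name))
```

Note that this only covers Requirements 1 through 3. Nothing here records the original bytes, so Requirement 5 would need them stored somewhere else, such as in the metadata.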
Also, since everyone is moving toward utf-8, they may consider ill-encoded filenames to be a problem that they would like to learn about as early as possible, such as when they are doing the original backup into Tahoe.

Also, at least one person has told me that he would be horrified for a "tahoe cp -r tahoe: hislocalsystem/" to insert filenames into his local system which were *not* valid encodings in his filesystem. He has the exact opposite requirement of "round trip": even if the original filenames were ill-encoded, he doesn't want Tahoe to write ill-encoded filenames into his system.

So I'm having a hard time making up my mind about this one, and at the moment I'm leaning toward making it an option like '--handle-ill-encoded-filenames' with a default value of 'mangle' and alternative values of 'forcebytes', 'stop', or 'skipfile'. (Which, by the way, is rather like a suggestion Brian Warner made quite a while back.)

My current thinking is that if 'mangle' is set then we should emulate the behavior of Nautilus and, to a lesser extent, of GNU ls, which is to decode while replacing undecodable bytes with the U+FFFD char, and then append " (badly encoded filename)" to the end of the filename.

Okay, even though you've written much more which deserves a response, I'm going to stop here and send this just to see if you and I (and everyone else) are on the same page. As I currently understand it, what you see on Unix (using GNU ls or Nautilus, for example) will be what you see on Tahoe and on the target local system, unless you pass --handle-ill-encoded-filenames=forcebytes, in which case it depends on the original and target systems' encodings matching.

Regards,

Zooko

_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
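For reference, the proposed '--handle-ill-encoded-filenames' option could be sketched roughly as follows. This is a sketch under assumptions, not an implementation that exists in Tahoe. It uses Python 3, and the function name, the exception class, and the return shape (a pair of unicode name and bytes to restore) are all hypothetical. In particular, how 'forcebytes' would carry the original bytes is not settled in the discussion above, so the sketch simply hands them back to the caller.

```python
BAD_SUFFIX = " (badly encoded filename)"


class IllEncodedFilename(Exception):
    """Raised by the 'stop' policy (hypothetical exception type)."""


def handle_filename(raw, encoding, policy="mangle"):
    """Return (unicode_name, bytes_to_restore), or None to skip the file.

    bytes_to_restore is None unless the caller should write the original
    bytes back on restore (the assumed meaning of 'forcebytes').
    """
    try:
        # A well-encoded name needs no special handling under any policy.
        return raw.decode(encoding), None
    except UnicodeDecodeError:
        pass

    # Replace undecodable bytes with U+FFFD and flag the name as suspect.
    mangled = raw.decode(encoding, errors="replace") + BAD_SUFFIX

    if policy == "mangle":
        return mangled, None
    if policy == "forcebytes":
        # Assumption: still show a valid unicode name, but keep the bytes.
        return mangled, raw
    if policy == "skipfile":
        return None
    if policy == "stop":
        raise IllEncodedFilename(repr(raw))
    raise ValueError("unknown policy: %r" % (policy,))


if __name__ == "__main__":
    for policy in ("mangle", "forcebytes", "skipfile"):
        print(policy, handle_filename(b"r\xe9sum\xe9.txt", "utf-8", policy))
```

With 'mangle' as the default, the name shown through the WUI or "tahoe ls" and the name written by a restore stay the same; only 'forcebytes' lets them differ, and only when the encodings on the two systems don't match.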
