Glenn Linderman writes:

 > On approximately 5/7/2009 8:40 AM, came the following characters from
 > the keyboard of Zooko O'Whielacronx:
 > > Dear Glenn Linderman and SJT:
 > >
 > > You two encoding experts who have volunteered some ideas for Tahoe
 > > might also be interested in this post that David-Sarah Hopwood just
 > > sent:
 > >
 > > http://allmydata.org/pipermail/tahoe-dev/2009-May/001717.html
 > >
 > Regarding this proposal,

I agree with everything Glenn wrote, except that I disagree with

 > I think a scheme along these lines is workable, though, but some
 > refinements will be needed, and sufficient use cases provided to help
 > explain how the various schemes work together, once they are refined,
 > and if they do work together.

While great effort is made to disambiguate the notation, in the end
Tahoe only controls Tahoe filenames ... but there is no problem with
them, since they are well-specified as Unicode.

I think that the %% notation is going to suffer from the problems that
">From" stuffing and URL encoding do. Programs and users are going to
get confused about whether a string has already been decoded, with at
best hilarious results. Of course a sufficiently complex set of rules
will probably work in theory, but too often it will not be implemented
properly. Especially not by users.

The choice of "%" as the "escape" character is unfortunate, for the
reasons Glenn gives but also because of the collision with URL
encoding. Spidering tools and the like regularly produce URL-encoded
filenames, and this will collide with that. E.g., as a regular visitor
to Japanese sites, I occasionally end up with URL-encoded file names
on my system when I save a page. And if a URL-encoded filename gets
Tahoe-encoded or vice versa, you'll need to know which order to decode
in; they do not commute IIUC.

Attempting to upload a file with a %%-encoded name is likely to
produce bad results on systems that could handle the name.

More positive suggestions:

If nonetheless you decide to use such an encoding, a similar
possibility that avoids collision with URL encoding would be to
represent names unrepresentable on the target file system using the
old Mac OS convention of representing a high-bit-set octet with ":XX"
where the Xs are of course uppercase hex digits. Another possibility
would be simply to use a leading ":" to signal that all of the
characters in the name are hex digits.
Of course both imply that a file whose name already starts with ":"
must be hex-encoded.

Another possibility would be MIME-word encoding.

The Unicode normalization proposed by several of the authors has
(probably solvable) issues, especially since NFC is chosen. The
problem is that an NFC name may fail to roundtrip *via other
utilities* with a Mac in the middle. On several occasions I've found
myself looking at two files with the same name on a Linux system
because I copied an NFC file name (as bytes) to the Mac, which
recognized those bytes as a Unicode transformation format, and when an
updated version of the file was copied back, the name went back as
bytes, but of course it was now NFD. Other utilities are Unicode
conformant and get this right, but I don't think you can count on it
yet.

Finally, here's a radically different suggestion. Use a separate
filesystem in a file, such as a zip file, for those files with
unusable names, and provide a utility for browsing it, as well as
extracting file names. This could implement David-Sarah's suggestion
for automatic extraction of all files as an option. The UI I envision
would be

    $ tahoe cp tahoe:mystuff ./
    Copying ... done.
    There were 17 files with names that cannot be represented on your system.
    (B)rowse, (I)nteractively rename, (A)utomatically rename, (Q)uit? Q
    16 files were added to undecodable.tahoezip.
    1 file was replaced in undecodable.tahoezip.
    To access them, use "tahoe zipview undecodable.tahoezip".
    $

Of course this could all be handled invisibly by a FUSE filesystem,
where FUSE is available.

Finally, this problem has been encountered before in ISO 9660. That
standard has extensions (I believe that these are the so-called "Rock
Ridge extensions") that allow for long and/or internationalized file
names. Perhaps those conventions (about which I know none of the
details, sorry) could be used.
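
To make the NFC/NFD roundtrip problem mentioned above concrete, here
is a minimal Python sketch. The file name is a hypothetical example,
not one taken from Tahoe, and the sketch only shows why two
normalization forms of the "same" name become two different names on
a byte-oriented file system. It does not reproduce what any
particular Mac utility does.

```python
import unicodedata

# Hypothetical file name, chosen only for illustration.
name = "caf\u00e9.txt"

nfc = unicodedata.normalize("NFC", name)  # precomposed: U+00E9
nfd = unicodedata.normalize("NFD", name)  # decomposed: "e" + U+0301

nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

# Both names render identically, but they differ as byte strings, so a
# file system that compares names as bytes sees two distinct files.
same_bytes = nfc_bytes == nfd_bytes

# Normalizing both sides to one form before comparing detects the clash.
same_after_normalizing = (
    unicodedata.normalize("NFC", nfc) == unicodedata.normalize("NFC", nfd)
)

print(len(nfc), len(nfd))
print(same_bytes, same_after_normalizing)
```

A tool that copies names back as raw bytes would need a comparison
like the last one before writing, or it ends up with both variants
side by side.
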
