On approximately 5/7/2009 8:40 AM, came the following characters from the keyboard of Zooko O'Whielacronx:
> Dear Glenn Linderman and SJT:
>
> You two encoding experts who have volunteered some ideas for Tahoe
> might also be interested in this post that David-Sarah Hopwood just
> sent:
>
> http://allmydata.org/pipermail/tahoe-dev/2009-May/001717.html

Regarding this proposal, I assume (but the proposal should clarify) that it operates on a filename, not a pathname, and that each directory name in a path would be processed independently by the proposal's algorithms.

The proposal has a lot of merit: it avoids the use of metadata that, as I pointed out in yesterday's comments, could get lost in transitions between filesystems. My comments yesterday suggested a directory into which transcoded files could be placed, which was problematic (for the unstated reason that it separates files into two buckets). This proposal instead suggests reserving the %%, %u, and %U filename prefixes for transcoded files. That keeps the files in the same buckets (directories), which is good, but it raises the question of whether the prefixes are unique enough to mostly avoid name collisions.

If some prefix can be considered rare enough to avoid problematic collisions, I would think it should be used consistently: just one prefix rather than three, which triples the chances of collision. It seems that the distinction between the 4-digit and 6-digit Unicode %U encodings is the + after the %.

The comment that % need not be escaped in shell commands in any common operating system makes me wonder whether the author has ever heard of Microsoft Windows, or has tried to access a file named %%my%dear%faraway%Abby.doc from a Windows command shell that has environment variables named "my", "dear", and "faraway" defined.

The definitions of the %% and %u encodings do not mention escaping the escape character. While the author seems to think that % is rare in filenames, it cannot be guaranteed to be absent, and so a % character in a filename that for other reasons must be %% or %u encoded would introduce ambiguity in the escape sequences.

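To make the escape-character ambiguity concrete, here is a minimal Python sketch. It is not the proposal's encoding: the function names and the toy rule (escape every byte outside printable ASCII as %HH) are assumptions made purely for illustration. It shows that a literal % followed by two hex digits cannot be distinguished from an escape unless % is itself escaped.

```python
import re

def encode_naive(raw: bytes) -> str:
    """Toy scheme: escape bytes outside printable ASCII as %HH,
    but leave a literal '%' untouched (the problematic case)."""
    return "".join(chr(b) if 0x20 <= b < 0x7F else "%%%02X" % b for b in raw)

def encode_safe(raw: bytes) -> str:
    """Same toy scheme, except a literal '%' (0x25) is also escaped."""
    return "".join(
        chr(b) if 0x20 <= b < 0x7F and b != 0x25 else "%%%02X" % b
        for b in raw
    )

def decode(name: str) -> bytes:
    """Turn every %HH sequence back into a single byte."""
    return re.sub(
        rb"%([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        name.encode("ascii"),
    )

# A name containing a literal "%41" plus one undecodable byte.
original = b"100%41\xe9"

naive = encode_naive(original)          # "100%41%E9"
assert decode(naive) != original        # the literal "%41" came back as "A"

safe = encode_safe(original)            # "100%2541%E9"
assert decode(safe) == original         # round trip is exact
```

The only difference between the two encoders is whether 0x25 is treated as a byte that needs escaping, which is the omission noted above.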
While a %% prefix on a filename is arguably rare, the %u or %U prefixes would, by the same argument, be less rare, and the combination of three prefixes less rare still. Perhaps %% should be used as a flag that the name has been transcoded, followed by U, u, B, or b to indicate whether it uses Unicode or bytes escaping?

Any escaping scheme like this could run into length limits on names; such proposals should include some discussion of that issue.

The description of the %% encoding seems unusual: there are no bytes that do not correspond to ISO Latin-1 characters, except possibly the control characters between 1 and 31 inclusive, if they are outlawed in Tahoe filenames (are they? Need they be?). So it seems that %% encoding would only add a %% in front, and the result would then be mojibake if the byte encoding was not originally ISO Latin-1. The %HH sequence seems an almost unnecessary concept, unless the claimed encoding fails to decode and only those characters that fail to decode are then encoded via %HH sequences. Preexisting % bytes would also seem to need %HH encoding if anything else might be encoded that way.

The description of the %U encoding also suffers from not mentioning preexisting % characters in the string.

The comment that "The %% and %U encodings are never mixed" seems impossible to satisfy. I posit a POSIX filename with a non-decodable sequence in its original encoding; this forces %% encoding inside Tahoe. If such a name contains a ":", then it must be %U encoded when a Windows system wants to access the file. How is the mixture avoided? There is no description of how to handle this case.

I think a scheme along these lines is workable, but some refinements will be needed, and sufficient use cases should be provided to help explain how the various schemes work together, once they are refined, and if they do work together.

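As one possible reading of the single-prefix suggestion above, the following Python sketch reserves only %% and follows it with a one-letter flag. Everything in it is hypothetical: the names `encode_name` and `decode_name`, the choice of B and U as flags, the rule that % is escaped as %25, and the 255-character limit are assumptions for illustration, not part of the proposal or of Tahoe.

```python
import re

PREFIX = "%%"        # the one reserved prefix (hypothetical choice)
MAX_NAME_LEN = 255   # assumed limit in characters; real limits vary

def _escape_bytes(raw: bytes) -> str:
    # Escape '%' and every byte outside printable ASCII as %HH.
    return "".join(
        chr(b) if 0x20 <= b < 0x7F and b != 0x25 else "%%%02X" % b
        for b in raw
    )

def _unescape_bytes(text: str) -> bytes:
    # Inverse of _escape_bytes: each %HH becomes one byte.
    return re.sub(
        rb"%([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        text.encode("ascii"),
    )

def encode_name(raw: bytes, claimed: str = "utf-8") -> str:
    """Return a name that is flagged only when flagging is needed."""
    try:
        text = raw.decode(claimed)
    except UnicodeDecodeError:
        # Undecodable in the claimed encoding: bytes escaping, flag B.
        result = PREFIX + "B" + _escape_bytes(raw)
    else:
        if text.startswith(PREFIX):
            # Decodable but collides with the prefix: flag U, escape '%'.
            result = PREFIX + "U" + text.replace("%", "%25")
        else:
            result = text
    if len(result) > MAX_NAME_LEN:
        raise ValueError("encoded name exceeds the assumed length limit")
    return result

def decode_name(name: str, claimed: str = "utf-8") -> bytes:
    """Recover the original bytes from a name made by encode_name."""
    body = name[len(PREFIX) + 1:]
    if name.startswith(PREFIX + "B"):
        return _unescape_bytes(body)
    if name.startswith(PREFIX + "U"):
        return body.replace("%25", "%").encode(claimed)
    return name.encode(claimed)

assert encode_name(b"caf\xe9.txt") == "%%Bcaf%E9.txt"
assert encode_name(b"%%my%dear.doc") == "%%U%25%25my%25dear.doc"
assert decode_name(encode_name(b"%%my%dear.doc")) == b"%%my%dear.doc"
```

Because a name that already begins with %% is always re-encoded under the U flag, no unflagged name can be mistaken for a flagged one. The length check also shows how quickly %HH escaping inflates a name: 100 undecodable bytes become 303 characters here and are rejected.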
If some unique prefix can be accepted as rare enough to be used as an encoding prefix by the Tahoe user community, then the rest of the problems are solvable, but I think there are cases here that are not solved yet.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
