Glenn Linderman wrote:
> On approximately 5/7/2009 8:40 AM, came the following characters from
> the keyboard of Zooko O'Whielacronx:
>> Dear Glenn Linderman and SJT:
>>
>> You two encoding experts who have volunteered some ideas for Tahoe
>> might also be interested in this post that David-Sarah Hopwood just
>> sent:
>>
>> http://allmydata.org/pipermail/tahoe-dev/2009-May/001717.html
>
> Regarding this proposal, I would assume (but the proposal should
> clarify) that the proposal is looking at a filename, not a pathname, and
> that each directory name in a path name would be independently processed
> by the algorithms in this proposal.

Yes.

> The proposal has a lot of merit; it avoids the use of meta-data that, as
> I pointed out in yesterday's comments, could get lost by transitions
> between filesystems.
>
> Whereas my comments yesterday suggested a directory into which
> transcoded files could be placed, and that that was problematic (for the
> unstated reason of separating files into two buckets), this proposal
> suggests reserving the %% and %u and %U file prefixes for transcoded
> files. While it keeps the files in the same buckets (directories) which
> is good, it raises the question of whether the prefix(es) is/are unique
> enough to mostly avoid problems with name collisions.

True. However, if the representation of an incorrectly decoded filename
is not an invalid string, then it must necessarily step on some subset
of valid strings. It would be possible to use non-NFC strings or strings
containing Unicode noncharacters, but since those aren't reliably
representable as filenames in non-Tahoe filesystems, that wouldn't
satisfy the goal of allowing lossless transitions between filesystems.
(Also, using noncharacters is strictly speaking not compliant with the
Unicode standard -- Unicode APIs are permitted to strip them or treat
them as an error.)

> If some prefix can be thought to be rare enough to avoid problematical
> collisions, I would think it should be used consistently, just one
> prefix, rather than 3 prefixes, which triple the chances for collisions.

The choice of prefixes is a minor detail, I think. The constraints on
the prefixes for the Unicode and byte-oriented encodings are:

 - they can be distinguished from each other;
 - they are printable ASCII, and representable in all common filesystems;
 - they are sufficiently rare at the start of real filenames;
 - if they contain cased characters, those characters are treated as
   case-insensitive;
 - they are not possible prefixes of reserved filenames.

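Most of the constraints listed above can be checked mechanically. The
following Python sketch is only an illustration and is not part of the
proposal: the function name prefix_problems is hypothetical,
"representable in all common filesystems" is approximated by excluding
the characters that Windows forbids in filenames, and "reserved
filenames" is approximated by the Windows device names. The rarity
constraint cannot be checked this way and is left out.

```python
import string

# Windows device names, which are reserved as filenames.
RESERVED = {"CON", "PRN", "AUX", "NUL"}
RESERVED |= {"COM%d" % i for i in range(1, 10)}
RESERVED |= {"LPT%d" % i for i in range(1, 10)}

# Assumption: printable, non-space ASCII minus the characters Windows forbids
# in filenames stands in for "representable in all common filesystems".
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation)
ALLOWED -= set('<>:"/\\|?*')


def prefix_problems(prefixes):
    """Return a list of constraint violations; an empty list means none found."""
    problems = []
    # Cased characters are treated case-insensitively, so compare folded forms.
    folded = [p.upper() for p in prefixes]
    for p, f in zip(prefixes, folded):
        if not p or not set(p) <= ALLOWED:
            problems.append("%r is empty or uses a disallowed character" % p)
        if any(name.startswith(f) for name in RESERVED):
            problems.append("%r could begin a reserved filename" % p)
    # Prefixes must be distinguishable: none may be a prefix of another.
    for i, a in enumerate(folded):
        for j, b in enumerate(folded):
            if i != j and b.startswith(a):
                problems.append("%r is a prefix of %r" % (prefixes[i], prefixes[j]))
    return problems
```

Under these assumptions, prefix_problems(["%%", "%U"]) reports nothing,
while ["%u", "%U"] is reported as ambiguous because the two differ only
in case.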
> Seems like the distinction between 4-digit and 6-digit Unicode %U
> encodings is the + after the %.

Yes. An alternative here is to use %HHHH%HHHH (or @h...@hhhh) where each
4-digit hex value represents a UTF-16 code unit. Just using %HHHHHH
would obviously be ambiguous.

> The comment that % need not be escaped from shell commands in any common
> operating system makes me wonder if the author has ever heard of
> Microsoft Windows, or has tried to access a file name name
>
> %%my%dear%faraway%Abby.doc
>
> from a Windows command shell that has environment variables named "my",
> "dear", and "faraway" defined.

Oops. I use cygwin on Windows; I had forgotten about the environment
variable convention in the cmd.exe shell. There are other characters,
such as '@', that could be used instead, and the rest of the proposal is
independent of which escape character is used.

> The definitions of %% and %u encodings do not mention escaping the
> escape character.

Yes, I had considered that but just forgot to mention it. Mea culpa.

[...]

> Any such escaping scheme like this could possibly run into length limits
> on the names, some discussion about that issue should be included in
> such proposals.

This was mentioned in my proposal:

# - whenever a Tahoe filename is converted to a name for a
#   particular filesystem, if the result is too long for
#   that filesystem, then fail the operation.

There is little else that can be done: as you say, *any* escaping scheme
(including one using UTF-8B or private-use characters) might run into a
length limit for a particular filesystem. A filename that has no need
for escaping could also run into a length limit shorter than that of
Tahoe.

> The description of %% encoding seems unusual... there are no bytes that
> do not correspond to ISO Latin-1 characters, except possibly for control
> characters between 1 and 31 inclusive, if they are outlawed in Tahoe
> file names (are they? Need they be?).

"ISO-Latin-1 characters" was intended to mean Unicode characters
U+0000..U+00FF inclusive.

> So it seems that %% encoding
> would only add a %% in front, and then be mojibake, if the byte encoding
> was not originally ISO Latin-1.

That's not correct; canonical %%-encoding only generates filenames
containing POSIX-portable characters plus the escape character.

The conversions mention ISO-Latin-1 only because it is possible to
construct Tahoe filenames that start with %%, but contain Unicode
characters above U+00FF (and therefore are not a %%-encoding at all,
never mind a canonical one). Since the conversion functions from a Tahoe
filename to a Unicode or byte-oriented filename are intended to be total
provided that no length constraint is hit, they must specify what to do
in this case.

It would probably have been clearer, however, just to say "is not a
%%-encoding" rather than mentioning ISO-Latin-1. That would also cover
the case of filenames starting with "%%" but that contain an escape
character not followed by two hex digits.

Of course mojibake is still possible if a filename accidentally decodes
using the wrong decoding, but there is nothing much that can be done
about that.

> The comment that "The %% and %U encodings are never mixed" seems
> impossible.

The comment is correct. The conversions only generate canonical
%%-encodings; since those

 - only contain POSIX portable characters plus the escape character;
 - start with a prefix that excludes them from being reserved filenames;

they should be representable on all filesystems, and so it is
unnecessary to further %U-encode them.

(Note that this does not mean that a filename that starts with "%%"
cannot be %U-encoded. But any given filename cannot be both a %% and a
%U-encoding.)

> I posit a POSIX file name with a non-decodable sequence in
> its original encoding; this forces %% encoding inside Tahoe. If such a
> name contains a ":", then when a Windows system wants to access the
> file, it must be %U encoded.
> How is the mixture avoided?
> There is no description of how to handle this case.

This case will not occur for %%-encodings generated by Tahoe, because
':' is not a portable POSIX filename character, and so it would be
represented as %3A in the canonical %%-encoding.

It is possible for a filename such as "%%x:y" to occur other than by
canonical %%-encoding, and that case was handled as intended in the
description of Tahoe -> Unicode conversion. The corresponding Windows
filename would be "%U%0025%0025x%003Ay".

(If the escape character is changed to '@', then the equivalent example
is that the Windows filename for "@@x:y" would be
"@u...@0040@00...@003ay".)

> I think a scheme along these lines is workable, though, but some
> refinements will be needed, and sufficient use cases provided to help
> explain how the various schemes work together, once they are refined,
> and if they do work together.

I agree; the description needs some work, but I believe the proposal is
technically sound.

-- 
David-Sarah Hopwood ⚥
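The two conversions discussed in this message can be sketched in Python
as follows. This is a minimal illustration under stated assumptions, not
the proposal's own definition: all function names are hypothetical, the
portable set is taken to be the POSIX portable filename character set
(A-Z, a-z, 0-9, '.', '_', '-'), and the set of characters escaped by the
%U sketch (the escape character, control characters, and the characters
Windows forbids in filenames) is an assumption chosen so that it
reproduces the "%%x:y" example. Reserved device names, length limits,
and the 6-digit form for code points above U+FFFF are not handled.

```python
import string

ESC = "%"  # escape character used in this message; '@' was floated instead

# POSIX portable filename character set, as byte values.
PORTABLE = set((string.ascii_letters + string.digits + "._-").encode("ascii"))

# Assumption: characters the %U sketch escapes, besides ESC and controls.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')


def encode_bytes_name(raw):
    """%%-encoding sketch: prefix, portable bytes as-is, all others as %HH."""
    parts = [ESC + ESC]
    for b in raw:
        parts.append(chr(b) if b in PORTABLE else "%s%02X" % (ESC, b))
    return "".join(parts)


def decode_bytes_name(name):
    """Return the original bytes, or None if name is not canonical here."""
    if not name.startswith(ESC + ESC):
        return None
    body, out, i = name[2:], bytearray(), 0
    while i < len(body):
        ch = body[i]
        if ch == ESC:
            hh = body[i + 1:i + 3]
            if len(hh) != 2 or any(c not in string.hexdigits for c in hh):
                return None  # escape character not followed by two hex digits
            out.append(int(hh, 16))
            i += 3
        elif ord(ch) in PORTABLE:
            out.append(ord(ch))
            i += 1
        else:
            return None  # e.g. ':' or a character above U+00FF
    result = bytes(out)
    # Canonical means re-encoding gives back exactly the same name.
    return result if encode_bytes_name(result) == name else None


def encode_unicode_name(name):
    """%U-encoding sketch for code points up to U+FFFF (4 hex digits each)."""
    parts = [ESC + "U"]
    for ch in name:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError("6-digit form is not covered by this sketch")
        if ch == ESC or ch in WINDOWS_FORBIDDEN or cp < 0x20:
            parts.append("%s%04X" % (ESC, cp))
        else:
            parts.append(ch)
    return "".join(parts)
```

With these definitions, encode_bytes_name(b"a:b\xff") gives "%%a%3Ab%FF",
decode_bytes_name("%%x:y") gives None because ':' is not a portable
character, and encode_unicode_name("%%x:y") gives "%U%0025%0025x%003Ay".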
