Folks:

Regarding my Strategy 2.d [1], François's Strategy 2.d&1/2 [2], and Alberto's Strategy 2.e [3], the question is which behavior is more desirable in the following case: a filename in a local filesystem isn't actually a valid encoding in that filesystem's default codec, that file gets "tahoe backup"'ed or "tahoe cp"'ed into a tahoe directory, and *then* an old or lazy tahoe client reads that filename out of the tahoe directory and gives it to you. Do you want this old or lazy tahoe client to give you:

2.d: Whatever that filename would have been if it had actually been encoded in latin-1 in the first place. (I.e., some sort of gibberish, if it wasn't actually latin-1.)

2.d&1/2: The same as 2.d, but prepended with the U+FFFC char.

2.e: Whichever characters of that filename *are* legitimate for the filesystem's default codec, interspersed with U+FFFD "replacement characters" [4] for any characters that aren't legitimate for the default codec.

I tend to think that the first of those three options is the best, but I would defer to any established "best practices" among unicode gurus.

Remember that we're only talking about backwards-compatibility here -- the behavior of old tahoe clients which don't know how to do anything but treat the "child name" as a unicode string. The same applies to lazy tahoe clients which don't bother to check for this condition, get the original bytes, and do "whatever it is that diligent clients are supposed to do with a bunch of bytes in some unknown encoding".

Regards,

Zooko

[1] http://allmydata.org/pipermail/tahoe-dev/2009-February/001343.html
[2] http://allmydata.org/pipermail/tahoe-dev/2009-February/001346.html
[3] http://allmydata.org/pipermail/tahoe-dev/2009-February/001348.html
[4] http://en.wikipedia.org/wiki/Replacement_character
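
To make the three options above concrete, here is a minimal Python sketch of what an old or lazy client would end up showing under each one. It is only an illustration, not tahoe code: the function names, the sample filename bytes, and the assumption that the filesystem's default codec is UTF-8 are all made up for this example.

```python
# Hypothetical filename bytes as found on disk. They happen to be valid
# latin-1 ("caf\xe9.txt") but are NOT valid UTF-8.
raw_name = b"caf\xe9.txt"

# Assumption for this sketch: the filesystem's default codec is UTF-8.
fs_codec = "utf-8"


def strategy_2d(raw: bytes) -> str:
    """2.d: decode the bytes as if they were latin-1.

    latin-1 maps every byte to a code point, so this never fails, but the
    result is gibberish when the bytes were not really latin-1.
    """
    return raw.decode("latin-1")


def strategy_2d_half(raw: bytes) -> str:
    """2.d&1/2: same as 2.d, with U+FFFC prepended."""
    return "\ufffc" + raw.decode("latin-1")


def strategy_2e(raw: bytes, codec: str = fs_codec) -> str:
    """2.e: keep what is legitimate in the default codec, and put U+FFFD
    in place of anything that is not."""
    return raw.decode(codec, errors="replace")


for label, fn in [("2.d", strategy_2d),
                  ("2.d&1/2", strategy_2d_half),
                  ("2.e", strategy_2e)]:
    # repr() of an ASCII-escaped form keeps the special characters visible.
    print(label, ascii(fn(raw_name)))
```

With these sample bytes, 2.d yields the name with U+00E9 in it, 2.d&1/2 yields the same string with U+FFFC in front, and 2.e yields "caf", one U+FFFD, then ".txt". Note also that in this sketch the 2.d result can be encoded back to latin-1 to recover the original bytes, whereas the U+FFFD in the 2.e result no longer says which byte it replaced.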
