On Feb 17, 2009, at 23:56, Shawn Willden wrote:

> On Tuesday 17 February 2009 09:12:51 pm Kevin Reid wrote:
>> What I'm thinking is:
>>
>> Will supporting unknown-bunch-of-bytes filenames be used sufficiently
>> often to be worth the systemwide complexity in handling them (being
>> not Just Strings), within Tahoe and all client software?
>>
>> If someone knows they have various-encodings filenames then they can
>> just pretend they're Latin-1 -- no information will be lost.
>
> Hmmm. That is certainly a very simple solution.
>
> Just to make sure I understand you, you're suggesting that Tahoe
> clients who are uploading files do the following:

Actually, I wasn't entirely suggesting a specific behavior for clients,
but rather to avoid complicating Tahoe's internals. But I do have a plan
for your scenario:

> (2) If the locale decoder can't parse the name, convert it to Unicode
> using the latin1 decoder. This will always work because latin1 allows
> all values from 0x00 to 0xFF.

No. (2) is not automatic, but rather the user sets the locale, or tells
Tahoe "pretend my locale is latin1".

> Tahoe clients downloading files simply retrieve the UTF-8 name and
> convert it to the locale encoding.

Yes, but respecting the above override.

> The downside, of course, is that when files with such funky names are
> retrieved, they'll be wrong on EVERY platform.

They will be not-wrong to the original uploader when he downloads with
the same settings.

There could also be a flag bit on the filenames which says "this was
uploaded in the byte-preserving mode" and triggers the reverse when
downloading to a compatible filesystem. The advantage of this over
having a "byte-or-Unicode-string" type is that it is always acceptable
for software which just doesn't do raw bytes (e.g. web interfaces) to
ignore that bit, rather than being required to handle it.

Disadvantage: I'm downloading to my filesystem and I expect all my
filenames to be valid UTF-8 and am surprised.

Also, I think "some-other-encoding bytes treated as codepoints and
stuffed into UTF-8" is a not-unheard-of encoding failure mode, and so it
might be not too hard to recognize and repair.

-- 
Kevin Reid <http://homepage.mac.com/kpreid/>

_______________________________________________
tahoe-dev mailing list
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
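
To make the scheme discussed above concrete, here is a rough Python sketch of the user-selected latin1 override plus the "byte-preserving" flag bit. It is only an illustration under assumptions: the function names, the `byte_preserving` and `restore_bytes` parameters, and the way the flag is carried are all hypothetical and are not part of Tahoe.

```python
# Hypothetical sketch; none of these names are Tahoe APIs.

def name_for_upload(raw, locale_encoding, byte_preserving=False):
    """Turn an on-disk filename (bytes) into (unicode_name, flag)."""
    if byte_preserving:
        # The user said "pretend my locale is latin1": every byte
        # 0x00..0xFF maps to the code point of the same value, so
        # decoding cannot fail and no information is lost.
        return raw.decode("latin-1"), True
    # Otherwise trust the configured locale. This raises
    # UnicodeDecodeError on names the locale cannot parse; there is
    # no automatic fallback to latin1.
    return raw.decode(locale_encoding), False


def name_for_download(name, flag, locale_encoding, restore_bytes=True):
    """Turn a stored Unicode name back into on-disk bytes."""
    if flag and restore_bytes:
        # Reverse the byte-preserving mode: code points back to bytes.
        return name.encode("latin-1")
    # Software that does not do raw bytes may ignore the flag and
    # just encode the name for its own locale.
    return name.encode(locale_encoding)


raw = b"caf\xe9 \xff\xfe"          # not valid UTF-8
name, flag = name_for_upload(raw, "utf-8", byte_preserving=True)
stored = name.encode("utf-8")      # what would be stored as the UTF-8 name
restored = name_for_download(name, flag, "utf-8")
ignored = name_for_download(name, flag, "utf-8", restore_bytes=False)
```

Here `restored` equals the original bytes, while `ignored` is valid UTF-8 that no longer matches them, which is the "surprised" case mentioned above for a downloader that expects UTF-8 filenames.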
