So I *had* been thinking that tahoe should do what Francois's patch currently does:

Strategy 1: decode the filename using the declared codec of the filesystem; if that fails, raise an exception.

However, Andrej's "crude user logic" (;-)) has shown the problem with that strategy. I now think that we should do:

Strategy 2: decode the filename using the declared codec of the filesystem; if that fails, just copy the bytes without decoding them.

And we must mark down somewhere that this is a "just the bytes" filename instead of a utf-8 encoded filename. I think the easiest place to mark this down might be to add a flag to the "metadata" dict associated with that name, something like "unknown_codec: True".

I no longer think that we should try to decode the filename with codecs other than the one suggested by the system. If we pass the binary bytes through, then the user on the other side can attempt such guessing. If tahoe guesses, it doesn't give the other side any information that the other side couldn't have figured out for itself, and it risks destroying information (when tahoe guesses and gets an apparent success which was actually wrong).

Note that this strategy could cause failures in older tahoe clients, which expect utf-8 encoded names in the name field. They could get a decode error. Newer tahoe clients would know to check for the "unknown_codec" flag before decoding. Hm -- that doesn't sound good.

I can think of three options:

Strategy 2.a. ... if that fails, copy the bytes into the "name" slot and add a flag to the metadata saying that the name isn't a normal utf-8 encoded name (this is what I suggested above).

Strategy 2.b. ... if that fails, put some placeholder, like "?1", "?2", "?3", etc., in the name slot, and put the bytes into the metadata in a "name_bytes" field. Old tahoe clients (or very simple ones) end up getting the incrementing "?N" names. Smarter tahoe clients check for the "name_bytes" field first, and if there is anything there then they use the name_bytes and do their best to represent them to the user, and they don't use the "?N" placeholder at all.

Strategy 2.c. ... if that fails, encode the bytes in some magical way such that a later utf-8 decoding of them will get the same bytes back. This might be the hack that Brian suggested that KDE uses to shovel undecodable strings into some unused corner of the unicode space -- I didn't really understand that idea. This smells to me like the same sort of slop which created these problems in the first place (trying to shoehorn semantically incompatible things into the same bits without explicit flagging). If we did this, then for example Python code which called .decode() on that string would get back a unicode object which didn't actually contain unicode chars, but contained bytes in some unknown encoding. Hopefully we don't need to do this, since some other strategy ought to do better.

Okay, folks, what do you think? One of these Strategy 2 options, or yet a different Strategy?

By the way, Andrej, the reason that we were earlier proposing to do Strategy 1, which Francois's patch implemented, and which rejects your filename, is that Tahoe can't know whether that filename will come out as gibberish in certain views, such as the ls/nautilus/konqueror that you mentioned, or if you share the file with a friend. However, I guess in this case it is better to pass the data through and let it be Someone Else's Problem.

Regards,

Zooko

_______________________________________________
tahoe-dev mailing list
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
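
To make Strategy 2.b concrete, here is a minimal sketch in Python 3. It is only an illustration under stated assumptions, not what Francois's patch or any tahoe release does: the function names encode_entry and display_name are hypothetical, the metadata is modelled as a plain dict, and the raw bytes are stored directly under "name_bytes" (a real wire format such as JSON would need to wrap them, for example in base64).

```python
def encode_entry(raw_name: bytes, fs_encoding: str, placeholder_index: int):
    """Writer side (hypothetical helper): return (name, metadata) per Strategy 2.b."""
    try:
        # Normal case: the filesystem's declared codec decodes the name,
        # so no extra metadata is needed.
        return raw_name.decode(fs_encoding), {}
    except UnicodeDecodeError:
        # Fallback: never guess another codec. Put a "?N" placeholder in the
        # name slot and pass the untouched bytes through in the metadata.
        return "?%d" % placeholder_index, {"name_bytes": raw_name}


def display_name(name: str, metadata: dict) -> str:
    """Reader side (hypothetical helper): what a newer client might show."""
    raw = metadata.get("name_bytes")
    if raw is None:
        # Ordinary entry: the name slot already holds the decoded name.
        return name
    # "name_bytes" wins over the placeholder. Rendering with backslash
    # escapes is lossless and does not pretend to know the codec.
    return raw.decode("ascii", errors="backslashreplace")


if __name__ == "__main__":
    for i, raw in enumerate([b"caf\xc3\xa9.txt", b"caf\xe9.txt"], start=1):
        name, meta = encode_entry(raw, "utf-8", i)
        print(repr(name), meta, "->", repr(display_name(name, meta)))
```

An old client that only reads the name slot would see "?2" for the second file, while a client that knows about "name_bytes" can still recover the original bytes and try its own guessing, which is the division of labour argued for above.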
