On Feb 27, 2009, at 10:45 AM, Brian Warner wrote:

> [must be brief, typing on an iphone, I'll write more on Monday when
> I've got a real keyboard]

... he said before writing a note as thorough and detailed as most programmers ever write.

> On the inbound side, if we can't decode the filename with the
> user's preferred encoding (which can default to utf-8, or utf-16 on
> windows, or something configured into python, etc)

Fortunately on Windows all filenames really are utf-16-encoded (or UCS-2, or whatever encoding it is that the filesystem specifies), so you'll never get a decode error, nor a silent misdecoding to random gibberish. (Insert joke here about treating unix as a first class citizen even though it doesn't deserve it.)

> then we pretend to decode it with Latin-1, so that a human looking
> at the mangled unicode name can hopefully guess what the proper
> name should have been. We use the unicode result as the childname.
> In all cases, we store the original bytestring in the metadata.

As I understand it from Shawn and Kevin, taking an arbitrary byte string and decoding it with latin-1 to produce a unicode object is lossless -- a subsequent encode of that unicode object with latin-1 will always yield the same bytes. Is that right?

In that case, we don't need the separate base32-encoded bytestring, just the flag to say whether the child name element was the result of a successful decode using the encoding declared by the filesystem, or else the result of a "fallback" latin-1 decode.

This simpler approach *does* mean that we lose information whenever there is a file which isn't *actually* encoded in the declared encoding of the local filesystem, but which happens to decode when you try. However, I'm not sure it is worth the complexity of preserving the bytes of that file's name (which after all nobody else can decode either except by guessing at encodings).
Also, note that almost certainly the local user examining that local filename with his local tools will see the gibberish that results from decoding that name with his local filesystem encoding, raising the question of what "actually" actually means in the previous sentence.

So I propose Strategy 2.d (but who's counting?):

Decode the filename with the declared encoding. If that succeeds, then put that unicode string (utf-8 encoded) into the child name and set the flag "latin_1_fallback: False". If that fails, then decode the filename with latin-1 (which can't fail), put that unicode string (utf-8 encoded) into the child name, and set the flag "latin_1_fallback: True".

Now old tahoe clients (or lazy new ones) will just get the child name bytes, utf-8 decode them to get a unicode string, and use it. It will either be right, or it will be the gibberish that you get from interpreting whatever-it-originally-was as latin-1.

New and diligent tahoe clients will check the "latin_1_fallback" flag first. If it is False, they proceed as before, knowing that they're getting the right name. If it is True, then they take the unicode object (which they got by utf-8-decoding the child name bytes), and they encode it with latin-1. This gives them back the original bytes (right?). Now they do whatever diligent tahoe clients do with the original bytes of a filename in an unknown encoding.

This seems simpler to me than your proposal, but I'm not sure if I understood everything in your proposal, so I'm not sure if there is something that this proposal wouldn't do as well.

Please everyone who understands this let me know if this would work.

Regards,

Zooko
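
P.S. For illustration, here is a rough Python sketch of Strategy 2.d and of the latin-1 round trip it relies on. This is only a sketch under assumptions: the function names (encode_childname, recover_original_bytes) and the dict used for metadata are hypothetical, not Tahoe's actual code or API. Only the flag name "latin_1_fallback" comes from the proposal above.

```python
def encode_childname(raw_name: bytes, declared_encoding: str):
    """Hypothetical helper: turn a local filename (bytes) into
    (utf-8 child name bytes, metadata dict) following Strategy 2.d."""
    try:
        name = raw_name.decode(declared_encoding)
        fallback = False
    except UnicodeDecodeError:
        # latin-1 maps every byte 0x00-0xFF to a code point, so this cannot fail.
        name = raw_name.decode("latin-1")
        fallback = True
    return name.encode("utf-8"), {"latin_1_fallback": fallback}


def recover_original_bytes(childname_utf8: bytes, metadata: dict):
    """Hypothetical helper for a 'diligent' client: return the original
    filename bytes if the latin-1 fallback was used, else None."""
    name = childname_utf8.decode("utf-8")
    if metadata.get("latin_1_fallback", False):
        return name.encode("latin-1")
    return None


# The round trip is lossless for every possible byte value.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin-1").encode("latin-1") == all_bytes

# A latin-1 encoded name on a filesystem that declares utf-8:
raw = b"caf\xe9"
childname, meta = encode_childname(raw, "utf-8")
assert meta["latin_1_fallback"] is True
assert recover_original_bytes(childname, meta) == raw
```

Note that the sketch shares the limitation described above: a name that is not really in the declared encoding but happens to decode anyway gets "latin_1_fallback: False", and its original bytes are not recoverable from the flag alone.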
