On Tue, 17 Feb 2009 12:52:56 -0700 Shawn Willden <[email protected]> wrote:
> Since you control the dirnode format, wouldn't it be easier just to
> add a "this isn't Unicode" flag, rather than translating to a
> reserved range? If the flag is set, the name is opaque binary data.
> Otherwise, it's UTF-8.
>
> Hmm. That suggests another option for Zooko's list: Provide
> per-file name encoding in the dirnode format. Set it to whatever the
> FS says it should be set to.

Hrm. Well, we could rev the dirnode format (introducing a
compatibility break: older tahoe clients would be unable to read those
directories). We've been planning to do this anyways, when we move to
ECDSA-based dirnodes (to add traversal caps, and to remove the
now-redundant HMAC, and to let the tables be parsed faster). Such a
break would let us add a childname-encoding field for each child.

Or, we could add a childname-encoding field into the metadata in the
current dirnode format, which would be more backwards-compatible.

The larger problem remains though: even if we change Tahoe's internal
format to use (encoding, encoded-name-bytestring) instead of
(utf8-name-bytestring), how should the webapi share this with the
outside world? The primary machine-oriented webapi (which zooko refers
to as the "WAPI", to distinguish it from the human/browser-oriented
"WUI") uses JSON to publish the directory contents, and JSON only
supports unicode strings, so any bytestrings would have to be encoded
down to ASCII (like, base64 or something), and the clients would have
to expect an (encoding, encoded-name-base64string) where they currently
get (unicode-name). Eww. This would expose the compatibility break to
webapi clients.

There would also be a number of internal changes; we'd probably want
to define an AnyEncodingString class, which would behave somewhat like
a unicode object, but would internally contain an encoding-name and a
bytestring.

The idea of defining tahoe dirnodes as using Unicode was to accommodate
everything.
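
A minimal sketch of what such an AnyEncodingString might look like,
and of the (encoding, encoded-name-base64string) shape a WAPI client
would then see. The method names and the JSON keys "encoding" and
"name_b64" are made up for illustration; nothing here is Tahoe's
actual API or dirnode format.

```python
import base64
import json


class AnyEncodingString:
    """Hypothetical sketch: a child name kept as (encoding, bytestring)."""

    def __init__(self, encoding, raw):
        self.encoding = encoding  # codec name, e.g. "utf-8" or "latin-1"
        self.raw = raw            # the name exactly as stored, as bytes

    def to_unicode(self):
        # Decode with the recorded encoding. Raises UnicodeDecodeError
        # if the bytes are not valid in that encoding.
        return self.raw.decode(self.encoding)

    def to_json_obj(self):
        # JSON has no bytestring type, so the raw bytes go through base64.
        # The key names are assumptions, not an existing webapi schema.
        return {
            "encoding": self.encoding,
            "name_b64": base64.b64encode(self.raw).decode("ascii"),
        }

    @classmethod
    def from_json_obj(cls, obj):
        return cls(obj["encoding"], base64.b64decode(obj["name_b64"]))


# Round trip through JSON, as a webapi client would have to do.
name = AnyEncodingString("latin-1", b"caf\xe9.txt")
wire = json.dumps(name.to_json_obj())
roundtrip = AnyEncodingString.from_json_obj(json.loads(wire))
```

The cost shows up on the client side: where it used to read a single
unicode name out of the JSON, it now has to base64-decode the name and
then decode it again using the named encoding.
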
It's a pity that the problems seem to lie in 0x80-0xff, rather than in
some more exotic code plane... like a runner jumping out of the
starting blocks to find that their shoelaces are tied together.

I suppose that 99% of local file names *are* representable in unicode
somehow, but the real problem is that the node (on the near side of
os.listdir) doesn't know what encoding to use, and has no clear way to
pick one.

Ah, which means that storing childname-encoding in the dirnode doesn't
actually help, because the real problem is that we don't know what
that encoding is. If we knew what value to store, we could have simply
converted the childname into unicode and then into UTF-8. Unless we
permitted an "I don't know what encoding this bytestring is" value:
that would perhaps tell the output side to simply feed the
unknown-encoding bytestring to open() and hope that the downloading
user is using the same conventions as the uploading user was.

Sigh. As the t-shirt says, "I (empty square box) Unicode".

-Brian
_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
