On Tue, 23 Dec 2008 16:09:25 -0600 "Dan McNair" <[email protected]> wrote:
> Curious: does Tahoe support arbitrary binary strings as filenames in the
> backend, or only accept certain encodings? HTTP certainly supports arbitrary
> byte sequences, ugly though it may be. I don't recall anything from my scan
> of the DIR2 documentation that would cause problems with filenames in
> arbitrary encoding(s).

Tahoe's directories are specified to have child names which are unicode
strings. Internally it encodes those unicode strings into UTF-8 before
serializing them into the mutable file contents, but that should be
opaque to clients.

As a result of this specification, Tahoe cannot accept arbitrary
0x80-0xff bytes in filenames. When the user is trying to take a
non-unicode bytestring (say, from their local disk filesystem) and use
it in a Tahoe directory, we'll have problems.

There was a thread on the python-dev mailing list about this sort of
thing about a month ago, in the context of how python3.0 ought to handle
the program's external boundaries (sys.argv, os.environ, os.listdir,
etc). I think it was Glyph who pointed out that some systems (KDE?)
actually convert high-bit non-ASCII bytes into a special reserved range
of unicode, so that they can at least reverse the transformation and
restore the original (non-unicode who-knows-what-encoding) filename
later on. Tahoe could conceivably do the same.

Tahoe's internal dirnode interfaces (add_child, list, rename, delete,
etc) are all defined in terms of unicode objects (and throw an exception
if you give them a bytestring instead of a unicode instance). We should
push this requirement out as far as we can, which is basically the
boundary of the program (sys.argv, or the webapi's HTTP URL / form
body). If the OS has some way to define what encoding is being used for
the filename-ish pieces of sys.argv (maybe sys.getfilesystemencoding()
or something?), then we can use that, otherwise the intent of the
current CLI code is to assume UTF-8.
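
To make the boundary idea concrete, here is a rough sketch in modern
Python 3. This is not Tahoe's code: to_unicode_name, escape_name,
unescape_name and this toy add_child are hypothetical names made up for
illustration. The reversible mapping uses Python's "surrogateescape"
error handler, which is one way to get the "reserved range" trick
described above (undecodable bytes 0x80-0xff become U+DC80-U+DCFF); it
is only an assumption that Tahoe would pick that particular mechanism.

```python
import sys


def to_unicode_name(raw: bytes, encoding: str = "") -> str:
    """Decode a filename bytestring at the program boundary (hypothetical helper).

    Uses the given encoding if provided, otherwise whatever the OS reports
    via sys.getfilesystemencoding(), otherwise UTF-8.
    """
    enc = encoding or sys.getfilesystemencoding() or "utf-8"
    # Strict decoding: raises UnicodeDecodeError on bytes invalid in `enc`.
    return raw.decode(enc)


def escape_name(raw: bytes) -> str:
    """Reversibly map bytes to str; undecodable bytes land in U+DC80-U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def unescape_name(name: str) -> bytes:
    """Inverse of escape_name: restore the original bytestring."""
    return name.encode("utf-8", errors="surrogateescape")


def add_child(children: dict, name, value) -> None:
    """Toy stand-in for a dirnode method: accepts only unicode names."""
    if not isinstance(name, str):
        raise TypeError("child names must be unicode strings, not bytes")
    children[name] = value


if __name__ == "__main__":
    children = {}
    # Well-formed UTF-8 decodes cleanly at the boundary.
    add_child(children, to_unicode_name(b"caf\xc3\xa9", "utf-8"), "filenode")
    # A Latin-1 style name is not valid UTF-8, but still round-trips.
    odd = b"caf\xe9"
    assert unescape_name(escape_name(odd)) == odd
    print(sorted(children))
```

One caveat with this particular mechanism: the escaped string contains
lone surrogates, so a strict name.encode("utf-8") raises
UnicodeEncodeError. A directory format that serializes names as plain
UTF-8 would have to use the same error handler (or some other escaping)
when writing them out.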

The webapi is intended to require UTF-8 in the URL, and to use the
"_charset" convention in form bodies (and default to UTF-8 if not
provided).

cheers,
 -Brian
_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
