Glenn Linderman writes:

> This branch of this thread has migrated to tahoe-dev, Stephen, not
> python-dev. So you need to think about their needs if you respond here,
> not the needs of Python or python-dev.

Zooko asked for my comments on the protocol for translating from valid
Unicode in Tahoe to whatever on a POSIX system, and the reverse; I
intend to stick to that until there's an explicit suggestion that the
principle that "Tahoe filenames are valid Unicode" is being
reconsidered.

> A PU character registry would remove from Tahoe the ability
> for Tahoe clients to use PU characters for their own, actual character
> purposes, which may also not be acceptable.

Did you read the post where I explained how this could be done in a
way that does *not* interfere with client use of the PUA? This use of
the PUA would be *entirely* internal to Tahoe (including display of
the file names), and therefore does not encroach on clients' uses.
(OTOH, the clients can "DoS" Tahoe by using whole planes of PU
characters in file names, but this seems kind of unlikely.)

> > > I question how many programs, faced with apparently URL-encoded
> > > filenames, actually attempt to URL-decode the name. Most of what
> > > I've seen is that the names simply linger, containing their
> > > URL-encoding, and looking ugly.
> >
> > I decode such on an ad hoc basis all the time. I suspect other users
> > in non-Latin locales will do so, too.
>
> So if you have an extra layer of encoding, you will either figure out
> how it works, and how and when do the appropriate decoding, or you will
> do it wrong and be confused.

Yes. I think the latter case will occur frequently for the proposed
%%/%U/%u encoding, which to a great extent offsets its useful
features.

> If Tahoe enforces a consistent normalization, then it would need a
> scheme for dealing with the potential duplications that could result
> from file systems that don't.

It does, and it does. The point of the example is that certain types
of use cases are likely to suffer from this a lot, even if "world
wide" it is extremely uncommon on average.

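To make the duplication problem concrete, here is a minimal Python
sketch. It is not Tahoe code: the function name and the sample file
names are made up for illustration, and NFC is only assumed as the
normalization form. It groups names that are distinct on a file system
which does not normalize, but become identical once a consistent
normalization is applied.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names, form="NFC"):
    """Return {normalized_name: [original names]} for names that collide.

    Hypothetical helper for illustration; `form` is an assumption,
    not something the Tahoe design is known to specify.
    """
    groups = defaultdict(list)
    for name in names:
        # Names that differ only in composed vs. decomposed form
        # end up under the same normalized key.
        groups[unicodedata.normalize(form, name)].append(name)
    # Keep only the keys that more than one original name maps to.
    return {key: originals for key, originals in groups.items()
            if len(originals) > 1}


# "é" as one precomposed code point vs. "e" + combining acute accent.
names = ["caf\u00e9.txt", "cafe\u0301.txt", "readme.txt"]
collisions = find_normalization_collisions(names)
for normalized, originals in collisions.items():
    print(repr(normalized), "<-", [ascii(n) for n in originals])
```

Whatever scheme is used to resolve such a group (renaming, extra
metadata, refusing the second file) is a separate decision; the sketch
only detects the collision.
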
> The solution for Rock Ridge and Joliet each seem to depend on the
> flexibility of the original ISO 9660 system having an "escape" system to
> allow alternate names, and each defines a rigid way of using those
> alternate names.
>
> Unfortunately, none of the file systems we are talking about do that.
>

Except, Tahoe _could_. In fact, Tahoe can do it both internally (by
adding metadata) and externally (by convention, e.g. creating a file
named TRANS.TBL in the same directory which maps Unicode names to
original bytes). External conventions are not terribly reliable, but
might work in enough cases.

> Remember that the %% and %u encoding proposal that we are
> responding to is intended to avoid the idea of fragile metadata
> that could get lost;

The problem with the encoding proposal is that we already *have* a
universal encoding, and it's called "Unicode". If Unicode is not going
to work, inventing a new universal encoding is unlikely to work very
well either. The best bet is to keep any complexity (such as a PU
character registry) entirely internal to Tahoe, while making the
external interface as simple and unambiguous as possible.

Note that "ambiguity" is not entirely determined by the quality of
your algorithms, but also by the kinds of encoding that are used in
the environment.

_______________________________________________
tahoe-dev mailing list
[email protected]
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
