On approximately 5/9/2009 9:58 AM, came the following characters from the keyboard of Stephen J. Turnbull:
> Glenn Linderman writes:
>
> > > While great effort to disambiguate the notation is made, in the end
> > > Tahoe only controls Tahoe filenames ... but there is no problem with
> > > them, since they are well-specified as Unicode.
> >
> > Well, Stephen, you are correct that there is no problem with Tahoe
> > filenames... except that the fact that they are restricted to Unicode,
> > and POSIX filenames are not, _is_ a problem.
>
> Sure, but it's a *solved* problem (surrogate-escape coding systems do
> it simply, a PU character registry does it in a more complicated way).
> Tahoe doesn't seem to like those schemes, too bad for Tahoe -- but
> it's not *our* problem in this thread.

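For anyone who has not followed the python-dev side, the surrogate-escape scheme mentioned in the quote above can be sketched roughly as follows. This is only an illustration, assuming Python 3's `surrogateescape` error handler; the filename and the helper `is_strict_unicode` are made up for the example and are not taken from Tahoe or from this thread.

```python
# A POSIX filename is just bytes; this hypothetical one is Latin-1,
# so it is not valid UTF-8.
raw_name = b"caf\xe9.txt"

# The "surrogateescape" error handler maps each undecodable byte 0xNN
# to the lone surrogate code point U+DC00 + 0xNN instead of failing.
escaped = raw_name.decode("utf-8", errors="surrogateescape")

# The mapping is reversible, so the original bytes come back unchanged.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")


def is_strict_unicode(name: str) -> bool:
    """Return True if name encodes as strict UTF-8 (no lone surrogates)."""
    try:
        name.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


print(repr(escaped))
print(round_tripped == raw_name)
print(is_strict_unicode(escaped))
```

The round trip is lossless, but the intermediate string carries a lone surrogate, so a strict UTF-8 encoder rejects it.
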
This branch of this thread has migrated to tahoe-dev, Stephen, not python-dev. So you need to think about their needs if you respond here, not the needs of Python or python-dev.

So while one variety of solution to the problem has been proposed and accepted for Python on POSIX, where invalid Unicode surrogate-escape sequences have been pronounced to be acceptable, even though they are totally unreadable, the environment in which Tahoe is operating is constrained to strictly legal Unicode, so they cannot use that solution. A PU character registry would remove from Tahoe the ability for Tahoe clients to use PU characters for their own, actual character purposes, which may also not be acceptable.

So yes, it is a solved problem in the same sense that telling the peasants to eat cake solved the famine in France.

> > As presently defined, %% notation has problems, I agree. And if other
> > programs get in the act of interpreting the names, and trying to
> > re-encode them, "just like Tahoe would"
>
> You might have a hope if the intent was to emulate Tahoe. But
> those names may get munged by other transports etc. and people will
> undoubtedly be using ad hoc algorithms.
>
> > I question how many programs, faced with apparently URL-encoded
> > filenames, actually attempt to URL-decode the name. Most of what
> > I've seen is that the names simply linger, containing their
> > URL-encoding, and looking ugly.
>
> I decode such on an ad hoc basis all the time. I suspect other users
> in non-Latin locales will do so, too.
>

So if you have an extra layer of encoding, you will either figure out how it works, and how and when to do the appropriate decoding, or you will do it wrong and be confused.

> > At this point, it is appropriate to point out that the transcoding
> > algorithms between Tahoe and any particular non-Tahoe system need not be
> > the same as the transcoding algorithms between Tahoe and any other
> > particular non-Tahoe system.
>
> I don't think you want to go there.
> That will confuse the heck out of
> multihomed users, who would at least like to see the same mojibake on
> different systems.
>

Every system has its quirks... that is why this thread even exists. It is not clear that encoding names that are unacceptable to one type of system on all types of systems (encoding to the lowest common denominator) is beneficial. For example, as far as I know, there is no reason a file named "prn.foo" should be encoded on a Mac, but it certainly needs to be encoded on Windows.

It may be that the encoding _system_ can be uniform, at least on large groups of platforms, such that the same decoding algorithm will work on all systems, but it is probably true that the choice of what names must be encoded, and what names need not be, is platform dependent.

> > > The Unicode normalization proposed by several of the authors has
> > > (probably solvable) issues, especially since NFC is chosen. The
> > > problem is that an NFC name may fail to roundtrip *via other
> > > utilities* with a Mac in the middle. On several occasions I've found
> > > myself looking at two files with the same name on a Linux system
>
> > The Unicode normalization issues for a specific platform can be solved
> > by the Tahoe client programs created for that platform. In other words,
> > NFD names found on Mac OS X can be renormalized to NFC by Tahoe client
> > programs, or upon receipt by a Tahoe server that knows it is talking to
> > a Mac OS X client.
>
> That's true, but it has nothing to do with my example, which shows how
> Tahoe could encounter two names that are identical as Unicode but
> different in POSIX in the same client directory.
>

I'm not a Mac user. If Mac consistently renormalizes to NFD, then within the Mac, it should be consistent, and could be returned to NFC when interfacing to Tahoe. But yes, if the Mac talks to a filesystem that doesn't enforce a consistent Unicode normalization (POSIX), then that file system could have both styles of normalization...
but then that file system could have both styles of normalization anyway. If Tahoe enforces a consistent normalization, then it would need a scheme for dealing with the potential duplications that could result from file systems that don't.

> > The [zipfile] idea suffers from the same problem as my earlier
> > suggestion of using a separate directory, rather than a prefix, for
> > encoded names... the files get placed in separate buckets, and
> > globs don't work as uniformly.
>
> It's not clear that users will generally want globs to work on broken
> names. If they do, of course a method for "exploding" the file into
> the current directory with some sort of names would be needed. The
> advantage of the zipfile over a directory is precisely that most
> programs that recurse into subdirectories won't do that with the
> zipfile.
>

Clearly that is for the Tahoe users to decide. Encoded names are not necessarily "broken". I was only pointing out the con. The zip file idea may or may not be an acceptable solution for them. If it is, though, so would an extra directory that has the file names encoded somehow, and the extra directory would be simpler to deal with, having no need to do unarchiving to access it.

> > I think ISO 9660 limited filenames to A-Z0-9 and 8.3 format. Rock Ridge
> > allows other character sets; I suppose one of the allowable other
> > character sets might be Unicode UTF-8, or POSIX bytes, I haven't looked
> > that up. The Joliet (MS) extension allows UCS-2, except for control
> > characters and 6 blacklisted characters.
> >
> > I don't think the problems correspond particularly well.
>
> Maybe not, but that doesn't mean the solutions won't. This is a hard
> problem, and it's not a new one. Hope springs eternal, but I think it
> unlikely that we'll invent a new scheme that *really works* after all
> these years. At the very least we need to see how people solved
> similar or related problems in the past.
>

The solutions for Rock Ridge and Joliet each seem to depend on the flexibility of the original ISO 9660 system, which has an "escape" mechanism that allows alternate names, and each defines a rigid way of using those alternate names. Unfortunately, none of the file systems we are talking about do that. Except, Tahoe _could_.

Remember that the %% and %u encoding proposal that we are responding to is intended to avoid the idea of fragile metadata that could get lost; an earlier Tahoe proposal was to keep both a translated or encoded name (of some sort) together with the original name from the original system in metadata. As a reminder, the con of that system is that once the file is processed on and replaced by a different system, the original name would be lost, and the original system might not recognize the translated or encoded name.

Given that the original name was illegal Unicode, that may or may not be perceived as a catastrophe; it appears that Tahoe users are divided over the issue, some preferring to keep (or translate back to) the original name, and others preferring to convert to Unicode, and keep the name Unicode thenceforth.

--
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking
_______________________________________________
tahoe-dev mailing list
http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
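
P.S. To make the normalization point from earlier in this message concrete, here is a rough sketch of how two POSIX names that differ as code points can collapse to one name once a consistent normalization is enforced. It assumes Python's standard `unicodedata` module; the filenames and the helper `find_normalization_collisions` are hypothetical and are not part of any Tahoe proposal.

```python
import unicodedata

# Two names that render identically but are different code point sequences.
nfc_name = "caf\u00e9.txt"    # precomposed e-acute (NFC form)
nfd_name = "cafe\u0301.txt"   # "e" + combining acute accent (NFD form)


def find_normalization_collisions(names, form="NFC"):
    """Group names by normalized form; return groups with more than one member."""
    groups = {}
    for name in names:
        key = unicodedata.normalize(form, name)
        groups.setdefault(key, []).append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


# A directory listing from a file system that does not enforce normalization.
listing = [nfc_name, nfd_name, "plain.txt"]
collisions = find_normalization_collisions(listing)

print(nfc_name == nfd_name)
print(collisions)
```

Any group reported here is a pair (or more) of files that would need some disambiguation scheme if normalization were enforced.
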
