* Juliano F. Ravasi <[EMAIL PROTECTED]> [2008-08-27 19:55]:
> Unicode (and thus UTF-8) is a superset of all encodings. It is
> required by the specification of the Unicode that any conversion
> X->Unicode->X MUST be lossless. It is important to keep in mind
> that Unicode has versions, and that some writing systems are only
> fully supported by some Unicode versions.
>
> Due to the complexity of some writing systems, Unicode allows
> that the same character sequence be represented by more than one
> way. For example, "ü" (u-umlaut) may be represented by U+00FC
> alone, or by the sequence U+0075 U+0308. But for any of such
> ambiguities, there is *always* the one of them that is the
> "normalized" version (the normalized one may change from one
> version of Unicode to another, but this is usually avoided
> whenever possible).
>
> So, for proper Unicode support, you must forbid any
> non-normalized UTF-8 input for filenames, so that there is always
> unique character sequences stored in the repository, and there
> will always be unique conversions to any other encoding.
>
> So, it doesn't hold true that it is possible to encode something
> to UTF-8 and get errors when converting it back. Of course,
> Unicode->X conversion fails if the Unicode sequence contains
> characters that are not present in X, this is something to
> expect. But the X->Unicode conversion MUST be valid and MUST have
> an unique representation, whatever X is.
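(As an aside, the U+00FC vs U+0075 U+0308 ambiguity mentioned above is easy to see for yourself. This is just a quick illustration using Python's standard unicodedata module, not anything from the quoted mail:)

```python
import unicodedata

precomposed = "\u00fc"    # "ü" as a single code point, U+00FC
decomposed = "u\u0308"    # "u" followed by U+0308 COMBINING DIAERESIS

# The two spellings render identically but are different sequences.
assert precomposed != decomposed

# NFC normalization maps both spellings to the same composed form,
assert unicodedata.normalize("NFC", decomposed) == precomposed

# while NFD maps both to the fully decomposed form.
assert unicodedata.normalize("NFD", precomposed) == decomposed
```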
Would that all this were true.

The standard Shift_JIS encoding cannot be mapped to Unicode without
ambiguities. Worse, there is a whole range of variants of this
charset that all go by the name of Shift_JIS, yet map codepoints
that are unused in the official Shift_JIS standard to different
characters. If you have something labelled Shift_JIS, you cannot
safely convert it to Unicode without risking data loss. There is
nothing that the Unicode consortium can do about this either, since
the problem is that Shift_JIS is a mess, not that the Unicode
mapping for it or Unicode's character coverage is somehow defective.
The problem is unfixable.

"I swear, text will be the death of me."
    —Dan Sugalski, initial Parrot VM lead architect, in
    http://www.sidhe.org/~dan/blog/archives/000281.html

>> Treating everything as a sequence of bytes is far safer (not to
>> mention faster) than converting everything every time it's
>> committed or checked out.
>
> Sure it is faster, but I don't think it is safer. See the
> problems that Git and Mercurial present when they are ported to
> systems that expect all filenames to be clear and valid Unicode
> sequences.

The problem is that on Unix, at least, the filesystem actually
gives you nothing but octet sequences. The only invalid filenames
are ones containing slashes or nulls; everything else is fair game.
Filenames can be any random garbage whatsoever. Take a look at the
contortions that the GNOME people had to go through for the file
selector dialog and similar places where gtk+ and friends touch
upon the filesystem.

> Safer, for me, is to forbid the addition to the repository of any
> file name that is known to give problems when converted to any
> other encoding. Forbid the inclusion of any data incompatible
> with the user's LC_CTYPE, including non-normalized UTF-8.
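(The Shift_JIS variant problem described above can be demonstrated concretely. This sketch of mine uses Python's "shift_jis" and "cp932" codecs as stand-ins for two of the variants; the wave-dash position is one of the well-known points of disagreement between them:)

```python
# The same bytes decode to different characters depending on which
# "Shift_JIS" you mean. 0x81 0x60 is the wave-dash position.
raw = b"\x81\x60"

as_sjis = raw.decode("shift_jis")   # U+301C WAVE DASH
as_cp932 = raw.decode("cp932")      # U+FF5E FULLWIDTH TILDE
assert as_sjis != as_cp932

# Cross the variants and the roundtrip fails outright: the strict
# shift_jis codec has no mapping back for U+FF5E. If the label on
# your data named the wrong variant, that is data loss.
try:
    as_cp932.encode("shift_jis")
except UnicodeEncodeError:
    pass  # expected: no unambiguous way back
```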
Then from an SCM design point of view the question turns into this:
do we want the repository to be unable to store some actual files
that some users may conceivably have good reason to have? (See the
Shift_JIS mess above.)

The problem space is unfortunately much bigger and messier than
what you have portrayed in your mail. IMO the fact that git punts
is regrettable, but also pretty much inevitable, so it is not
something that I hold against git. There is simply no truly sane
way of untangling this incredibly ugly yarnball.

Sorry. :-((

Regards,
-- 
Aristotle Pagaltzis // <http://plasmasturm.org/>

_______________________________________________
vcs-home mailing list
vcs-home@lists.madduck.net
http://lists.madduck.net/listinfo/vcs-home