* Juliano F. Ravasi <[EMAIL PROTECTED]> [2008-08-27 19:55]:
> Unicode (and thus UTF-8) is a superset of all encodings. It is
> required by the Unicode specification that any conversion
> X->Unicode->X MUST be lossless. It is important to
> keep in mind that Unicode has versions, and that some writing
> systems are only fully supported by some Unicode versions.
> Due to the complexity of some writing systems, Unicode allows
> the same abstract character to be represented in more than one
> way. For example, "ü" (u-umlaut) may be represented by
> U+00FC alone, or by the sequence U+0075 U+0308. But for any of
> such ambiguities, there is *always* one of them that is the
> "normalized" version (the normalized one may change from one
> version of Unicode to another, but this is usually avoided
> whenever possible).
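(For illustration, the two representations of "ü" mentioned above can be compared and normalized with Python's standard unicodedata module; a quick sketch, not part of the quoted mail:)

```python
import unicodedata

composed = "\u00fc"     # "ü" as a single precomposed code point
decomposed = "u\u0308"  # "u" followed by U+0308 COMBINING DIAERESIS

# The two strings differ code-point-for-code-point...
print(composed == decomposed)                                # False
# ...but normalizing to NFC maps the decomposed form onto the
# precomposed one, and NFD goes the other way.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```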
> So, for proper Unicode support, you must forbid any
> non-normalized UTF-8 input for filenames, so that there is
> always a unique character sequence stored in the repository,
> and there will always be a unique conversion to any other
> encoding.
> So, it doesn't hold true that it is possible to encode
> something to UTF-8 and get errors when converting it back. Of
> course, Unicode->X conversion fails if the Unicode sequence
> contains characters that are not present in X; this is to be
> expected. But the X->Unicode conversion MUST be valid and MUST
> have a unique representation, whatever X is.
Would that all this were true.
The standard Shift_JIS encoding cannot be mapped to Unicode
without ambiguities. Worse, there is a whole range of variants
of this charset that all go by the name of Shift_JIS, yet map
codepoints that are unused in the official Shift_JIS standard
to different characters. If you have something labelled
Shift_JIS, you cannot safely convert it to Unicode without
risking data loss.
There is nothing that the Unicode consortium can do about this
either, since the problem is that Shift_JIS is a mess, not that
the Unicode mapping for it or Unicode’s character coverage is
somehow defective. The problem is unfixable.
“I swear, text will be the death of me.”
—Dan Sugalski, initial Parrot VM lead architect, in
>> Treating everything as a sequence of bytes is far safer (not
>> to mention faster) than converting everything every time it's
>> committed or checked out.
> Sure it is faster, but I don't think it is safer. See the
> problems that Git and Mercurial present when they are ported to
> systems that expect all filenames to be clear and valid Unicode.
The problem is that on Unix, at least, the filesystem actually
gives you nothing but octet sequences. The only invalid filenames
are ones containing slashes or nulls. Everything else is fair
game. Filenames can be any random garbage whatsoever.
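You can watch this happen from any scripting language; here is a
small Python sketch (assuming a POSIX filesystem that permits
arbitrary filename bytes, as Linux does):

```python
import os
import tempfile

d = tempfile.mkdtemp()

# b'\xff' is not valid UTF-8, but it is a perfectly legal
# filename byte on POSIX: anything goes except slash and NUL.
weird = os.path.join(os.fsencode(d), b"\xff")
open(weird, "wb").close()

print(os.listdir(os.fsencode(d)))  # [b'\xff'] -- raw octets back
# Asking for str names forces Python to smuggle the undecodable
# byte through as a surrogate escape (PEP 383):
print(os.listdir(d))               # ['\udcff']
```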
Take a look at the contortions that the GNOME people had to go
through for the file selector dialog and similar things where
gtk+ and friends touch upon the filesystem.
> Safer, for me, is to forbid the addition to the repository of
> any file name that is known to give problems when converted to
> any other encoding. Forbid the inclusion of any data
> incompatible with the user's LC_CTYPE, including non-normalized
> sequences.
Then from an SCM design point of view the question turns into
this: do we want the repository to be unable to store some
actual files that some users may conceivably have good reason to
have? (See Shift_JIS mess above.)
The problem space is unfortunately much bigger and messier than
what you have portrayed in your mail. IMO the fact that git punts
is regrettable, but also pretty much inevitable, so it is not
something that I hold against git. There is simply no truly sane
way of untangling this incredibly ugly yarnball.
Aristotle Pagaltzis // <http://plasmasturm.org/>
vcs-home mailing list