[Sorry Benjamin, the message was supposed to go to the list]

Benjamin M. A'Lee wrote:
> There is a good reason for this, and it's explained in the git-log
> manual page: it's not necessarily possible to convert something to UTF-8
> (or any other Unicode encoding) and convert it back without introducing
> errors, especially with some less-commonly-used character sets.

Unicode (and thus UTF-8) is a superset of all encodings. The Unicode
specification requires that any conversion X->Unicode->X MUST be
lossless. It is important to keep in mind that Unicode has versions, and
that some writing systems are only fully supported by some Unicode
versions.

Due to the complexity of some writing systems, Unicode allows the same
character to be represented in more than one way. For example, "ü"
(u-umlaut) may be represented by U+00FC alone, or by the sequence
U+0075 U+0308. But for any such ambiguity, there is *always* one
representation that is the "normalized" version (the normalized one may
change from one version of Unicode to another, but this is usually
avoided whenever possible).

So, for proper Unicode support, you must forbid any non-normalized UTF-8
input for filenames, so that only unique character sequences are stored
in the repository, and there is always a unique conversion to any other
encoding.

So, it doesn't hold true that it is possible to encode something to
UTF-8 and get errors when converting it back. Of course, a Unicode->X
conversion fails if the Unicode sequence contains characters that are
not present in X; that is to be expected. But the X->Unicode conversion
MUST be valid and MUST have a unique representation, whatever X is.

> Treating
> everything as a sequence of bytes is far safer (not to mention faster)
> than converting everything every time it's committed or checked out.

Sure it is faster, but I don't think it is safer. See the problems that
Git and Mercurial present when they are ported to systems that expect
all filenames to be clear and valid Unicode sequences.

Safer, for me, is to forbid adding to the repository any file name that
is known to cause problems when converted to any other encoding. Forbid
the inclusion of any data incompatible with the user's LC_CTYPE,
including non-normalized UTF-8.

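As a rough illustration of the points above, here is a minimal sketch in
Python using only the standard unicodedata module. It assumes NFC is the
normalization form meant by "normalized", and the function names
(is_nfc, encodable_in, acceptable_filename) are hypothetical; nothing
like this exists in Git itself. The target encoding is passed in
explicitly as a stand-in for whatever the user's LC_CTYPE implies.

```python
import unicodedata


def is_nfc(name: str) -> bool:
    """Return True if name is already in Normalization Form C (assumed form)."""
    return unicodedata.normalize("NFC", name) == name


def encodable_in(name: str, encoding: str) -> bool:
    """Return True if every character of name exists in the target encoding."""
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def acceptable_filename(name: str, encoding: str = "utf-8") -> bool:
    """Hypothetical policy: accept only NFC names the target encoding can hold."""
    return is_nfc(name) and encodable_in(name, encoding)


composed = "\u00fc"      # U+00FC, precomposed u-umlaut
decomposed = "u\u0308"   # U+0075 U+0308, "u" plus combining diaeresis

# Different code point sequences, same string after NFC normalization.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# X -> Unicode -> X round trip, with X = ISO-8859-1.
raw = b"\xfc"
assert raw.decode("iso-8859-1").encode("iso-8859-1") == raw
```

With this policy, acceptable_filename("\u00fc", "iso-8859-1") is
accepted, while the decomposed spelling and a name containing "\u20ac"
(not present in ISO-8859-1) are both rejected.
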
Note that this is not a real problem for Git while you use it for what
it was designed to be: an SCM. You should never use file names outside
the ASCII set for your source code files, unless you are really
expecting to have problems. This just becomes an issue when Git is used
for storing other things.

> Not true. Though I don't know about LANG and LC_CTYPE support, it's
> certainly not true that Git expects UTF8 no matter what; you can
> override the i18n.commitencoding and i18n.logoutputencoding settings as
> necessary.

Ok, I missed these options. I don't know if they are new or if I just
missed them when I read the documentation for the first time.

Well, it makes Git a little less bad. It stores the encoding along with
the commit message (without conversion), and tries to recode to whatever
output encoding is requested during the log output.

It still breaks the expected behavior that every other locale-aware
program conforms to. I.e., these two commands should produce proper,
different outputs:

    LC_ALL=en_US.UTF-8 git --no-pager log
    LC_ALL=en_US.ISO-8859-1 git --no-pager log

Compare with, for example:

    LC_ALL=pt_BR.UTF-8 iconv --help | head
    LC_ALL=pt_BR.ISO-8859-1 iconv --help | head

(the second one should display some broken characters if your system is
UTF-8; that is expected).

-- 
Juliano F. Ravasi ·· http://juliano.info/
5105 46CC B2B7 F0CD 5F47 E740 72CA 54F4 DF37 9E96

"A candle loses nothing by lighting another candle." -- Erin Majors

* NOTE: Don't try to reach me through this address, use "contact@" instead.
_______________________________________________
vcs-home mailing list
vcs-home@lists.madduck.net
http://lists.madduck.net/listinfo/vcs-home