[Sorry Benjamin, the message was supposed to go to the list]

Benjamin M. A'Lee wrote:
> There is a good reason for this, and it's explained in the git-log
> manual page: it's not necessarily possible to convert something to UTF-8
> (or any other Unicode encoding) and convert it back without introducing
> errors, especially with some less-commonly-used character sets.

Unicode (and thus UTF-8) is a superset of all encodings. The Unicode
specification requires that any conversion X->Unicode->X MUST be
lossless. It is important to keep in mind that Unicode has versions, and
that some writing systems are only fully supported by some Unicode
versions.

Due to the complexity of some writing systems, Unicode allows the same
character to be represented in more than one way. For example, "ü"
(u-umlaut) may be represented by U+00FC alone, or by the sequence
U+0075 U+0308. But for any such ambiguity, there is *always* exactly one
alternative that is the "normalized" version (the normalized one may
change from one version of Unicode to another, but this is usually
avoided whenever possible).
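
To make the u-umlaut example concrete, here is a minimal Python sketch
using the standard unicodedata module. It assumes that "normalized" means
the composed form NFC; Unicode defines other normalization forms as well,
and which one a tool should pick is not settled by this message.

```python
import unicodedata

precomposed = "\u00fc"   # U+00FC, a single code point
decomposed = "u\u0308"   # U+0075 followed by U+0308 (combining diaeresis)

# The two spellings are different code point sequences, so a plain
# string comparison treats them as different.
differ = precomposed != decomposed

# NFC composes the sequence into U+00FC; NFD decomposes U+00FC again.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(differ, nfc_equal, nfd_equal)
```

Both spellings render identically, which is exactly why storing them
unnormalized leads to file names that look the same but compare unequal.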

So, for proper Unicode support, you must forbid any non-normalized UTF-8
input for filenames, so that only unique character sequences are ever
stored in the repository, and there will always be unique conversions to
any other encoding.

So, it doesn't hold true that you can encode something to UTF-8 and get
errors when converting it back. Of course, a Unicode->X conversion fails
if the Unicode sequence contains characters that are not present in X;
this is to be expected. But the X->Unicode conversion MUST be valid and
MUST have a unique representation, whatever X is.
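
As an illustration of both directions, the following Python sketch takes
X = ISO-8859-1. It only demonstrates the behavior for this one character
set, as implemented by Python's codecs; it is not a proof of the general
claim for every X.

```python
# Every possible byte value, treated as ISO-8859-1 text.
original = bytes(range(256))

# X -> Unicode -> X: decode to a Unicode string, then encode it back.
as_unicode = original.decode("iso-8859-1")
round_trip = as_unicode.encode("iso-8859-1")
lossless = round_trip == original

# Unicode -> X can fail: U+20AC (euro sign) has no ISO-8859-1 code.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(lossless, euro_fits)
```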

> Treating
> everything as a sequence of bytes is far safer (not to mention faster)
> than converting everything every time it's commited or checked out.

Sure it is faster, but I don't think it is safer. See the problems that
Git and Mercurial present when they are ported to systems that expect
all filenames to be clear and valid Unicode sequences.

Safer, for me, is to forbid adding to the repository any file name that
is known to cause problems when converted to any other encoding. Forbid
the inclusion of any data incompatible with the user's LC_CTYPE,
including non-normalized UTF-8.
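
A rough sketch of such a check, in Python. The function name
is_acceptable_name is hypothetical, the choice of NFC as "normalized" is
an assumption, and the target encoding is passed in explicitly rather
than read from the locale. This is not something Git does today.

```python
import unicodedata


def is_acceptable_name(name: str, encoding: str) -> bool:
    """Hypothetical policy check for a file name.

    Accepts the name only if it is already in NFC (assumed to be the
    "normalized" form) and every character exists in `encoding`.
    """
    # Reject names whose NFC form differs from what was supplied.
    if unicodedata.normalize("NFC", name) != name:
        return False
    # Reject names that cannot be represented in the target encoding.
    try:
        name.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


print(is_acceptable_name("f\u00fcr.txt", "iso-8859-1"))   # precomposed u-umlaut
print(is_acceptable_name("fu\u0308r.txt", "utf-8"))       # decomposed u-umlaut
```

In a real tool the encoding would presumably come from the user's
LC_CTYPE, for example via locale.getpreferredencoding(False).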

Note that this is not a real problem for Git as long as you use it for
what it was designed to be: an SCM. You should never use file names
outside the ASCII set for your source code files, unless you really
expect to have problems. This only becomes an issue when Git is used for
storing other things.

> Not true. Though I don't know about LANG and LC_CTYPE support, it's
> certainly not true that Git expects UTF8 no matter what; you can
> override the i18n.commitencoding and i18n.logoutputencoding settings as
> necessary.

Ok, I missed these options. I don't know if they are new or if I just
missed them when I read the documentation for the first time. Well, it
makes Git a little less bad. It stores the encoding along with the
commit message (without conversion), and tries to recode to whatever
output encoding is requested during the log output.
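
The behavior described above (keep the original bytes, remember their
encoding, recode only for display) can be sketched in Python as follows.
The helper recode_message is hypothetical and only illustrates the idea;
it is not Git's actual implementation.

```python
def recode_message(raw: bytes, stored_encoding: str, output_encoding: str) -> bytes:
    """Hypothetical helper: recode stored message bytes for display."""
    # Interpret the stored bytes using the encoding recorded with them.
    text = raw.decode(stored_encoding)
    # Characters missing from the output encoding become "?" instead of
    # aborting the whole log output.
    return text.encode(output_encoding, errors="replace")


# "correção", stored as ISO-8859-1 bytes and untouched in the repository.
stored = "corre\u00e7\u00e3o".encode("iso-8859-1")

# Recoded only at display time, here for a UTF-8 reader.
shown = recode_message(stored, "iso-8859-1", "utf-8")
print(shown.decode("utf-8"))
```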

It still breaks the expected behavior that every other locale-aware
program conforms to. I.e., these two commands should produce proper,
different outputs:

        LC_ALL=en_US.UTF-8      git --no-pager log
        LC_ALL=en_US.ISO-8859-1 git --no-pager log

Compare with, for example:

        LC_ALL=pt_BR.UTF-8      iconv --help | head
        LC_ALL=pt_BR.ISO-8859-1 iconv --help | head

(the second one should display some broken characters if your system is
UTF-8, that is expected).
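
Why the broken characters are expected can be seen in a small Python
sketch. The sample word is arbitrary, and errors="replace" merely stands
in for whatever substitution glyph a real UTF-8 terminal would draw.

```python
text = "op\u00e7\u00e3o"  # "opção", an arbitrary sample word

# A locale-aware program emits different bytes for each locale charset.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")
different_bytes = utf8_bytes != latin1_bytes

# A UTF-8 terminal that receives the ISO-8859-1 bytes cannot decode the
# accented characters, so it shows replacement glyphs in their place.
seen_on_utf8_terminal = latin1_bytes.decode("utf-8", errors="replace")

print(different_bytes)
print(seen_on_utf8_terminal)
```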

Juliano F. Ravasi ·· http://juliano.info/
5105 46CC B2B7 F0CD 5F47 E740 72CA 54F4 DF37 9E96

"A candle loses nothing by lighting another candle." -- Erin Majors

* NOTE: Don't try to reach me through this address, use "contact@" instead.
vcs-home mailing list
