Re: [PATCH 1/2] utf8: refactor code to decide fallback encoding

2016-09-27 Thread Junio C Hamano
Jeff King  writes:

> But once we introduce other fallbacks, then "utf8 -> latin1" may become
> "UTF-8 -> iso8859-1". A system that knows only "utf8" and "iso8859-1"
> _could_ work if we turned the knobs individually, but won't if we turn
> them both at once. Worse, a system that knows only "UTF-8" and "latin1"
> works now, but would break with your patches.
>
> I'm not convinced it's worth worrying about, though. The existence of
> such a system is theoretical at this point. I'm not even sure how common
> the "know about utf8 but not UTF-8" thing is, or if we were merely being
> overly cautious.

Yeah, I did consider having to try the permutations until it works,
but suspecting that somebody takes "utf8" without taking "UTF-8" is
to pretty much invalidate the basic premise of the existing code,
i.e. spelling it as "UTF-8" is the most likely to work anywhere as
long as UTF-8 is supported, so I stopped worrying about it at that
point.

I'd actually welcome a more generic suggestions we can put in our
documentation so that we can _lose_ the fallback entirely (e.g. "if
your contributor spelled 'utf8' and your system, which does take
'UTF-8', does not like it, then here is what you can do to your
/etc/locale.alias").



Re: [PATCH 1/2] utf8: refactor code to decide fallback encoding

2016-09-26 Thread Jeff King
On Mon, Sep 26, 2016 at 06:22:10PM -0700, Junio C Hamano wrote:

> @@ -501,17 +516,9 @@ char *reencode_string_len(const char *in, int insz,
>  
>   conv = iconv_open(out_encoding, in_encoding);
>   if (conv == (iconv_t) -1) {
> - /*
> -  * Some platforms do not have the variously spelled variants of
> -  * UTF-8, so let's fall back to trying the most official
> -  * spelling. We do so only as a fallback in case the platform
> -  * does understand the user's spelling, but not our official
> -  * one.
> -  */
> - if (is_encoding_utf8(in_encoding))
> - in_encoding = "UTF-8";
> - if (is_encoding_utf8(out_encoding))
> - out_encoding = "UTF-8";
> + in_encoding = fallback_encoding(in_encoding);
> + out_encoding = fallback_encoding(out_encoding);
> +

This comment is interesting. We're concerned about a platform knowing
"utf8" but not "UTF-8". When we fallback, we do it for both the input
and output encodings, because we don't know which may have caused the
problem. So is it possible that we improve one case but break the other?

With just UTF-8, I don't think so. That could only be the case with
something like "utf8 -> utf-8" because they both become "UTF-8". So
either it improves the situation or not (because we either understand
UTF-8 or not).

But once we introduce other fallbacks, then "utf8 -> latin1" may become
"UTF-8 -> iso8859-1". A system that knows only "utf8" and "iso8859-1"
_could_ work if we turned the knobs individually, but won't if we turn
them both at once. Worse, a system that knows only "UTF-8" and "latin1"
works now, but would break with your patches.

I'm not convinced it's worth worrying about, though. The existence of
such a system is theoretical at this point. I'm not even sure how common
the "know about utf8 but not UTF-8" thing is, or if we were merely being
overly cautious.

-Peff