Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-09 Thread Lars Schneider

> On 07 Mar 2018, at 19:04, Eric Sunshine  wrote:
> 
> On Wed, Mar 7, 2018 at 12:30 PM,   wrote:
>> Check that new content is valid with respect to the user defined
>> 'working-tree-encoding' attribute.
>> 
>> Signed-off-by: Lars Schneider 
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct 
>> text_stat *stats,
>> +static int validate_encoding(const char *path, const char *enc,
>> + const char *data, size_t len, int die_on_error)
>> +{
>> +   /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
>> +   if (startscase_with(enc, "UTF")) {
>> +   /*
>> +* Check for detectable errors in UTF encodings
>> +*/
>> +   if (has_prohibited_utf_bom(enc, data, len)) {
>> +   const char *error_msg = _(
>> +   "BOM is prohibited in '%s' if encoded as 
>> %s");
>> +   /*
>> +* This advice is shown for UTF-??BE and UTF-??LE 
>> encodings.
>> +* We cut off the last two characters of the 
>> encoding name
>> +# to generate the encoding name suitable for BOMs.
> 
> s/#/*/

Of course!


>> diff --git a/t/t0028-working-tree-encoding.sh 
>> b/t/t0028-working-tree-encoding.sh
>> @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes 
>> support' '
>> for i in 16 32
>> do
>> +   test_expect_success "check prohibited UTF-${i} BOM" '
>> +   test_when_finished "git reset --hard HEAD" &&
>> +
>> +   echo "*.utf${i}be text working-tree-encoding=utf-${i}be" 
>> >>.gitattributes &&
>> +   echo "*.utf${i}le text working-tree-encoding=utf-${i}le" 
>> >>.gitattributes &&
> 
> v10 is checking only hyphenated lowercase encoding name; earlier
> versions checked uppercase. For better coverage, it would be nice to
> check several combinations: all uppercase, all lowercase, mixed case,
> hyphenated, not hyphenated.
> 
> I'm not suggesting running all the tests repeatedly but rather just
> varying the format of the encoding name in these tests you're adding.
> For instance, the above could instead be:
> 
>echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes 
> &&
>echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes &&
> 
> or something.

The casing is a good idea - I will do that. I don't want to do "hyphenated, not 
hyphenated" as this would make the tests fail on macOS (and I believe on 
Windows).


Thanks,
Lars

Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Junio C Hamano
Lars Schneider  writes:

> I would like to advise the dashed form as this seems to be the
> canonical form and it avoids cross platform issues. My macOS
> iconv does not support the form without dashes.

Sure, that is why I said canonicalization without inserting dash
does not make much sense, hence an interim step with only upcasing
is not a good idea.  A possible interim solution would be to do
nothing (no dash insertion, no upcasing) and fixing both in a later
follow-up patch, but as I said, I do not care too strongly either
way.

> Would this approach work for you?
>
>   const char *advise_msg = _(
>   "The file '%s' contains a byte order "
>   "mark (BOM). Please use UTF-%s as "
>   "working-tree-encoding.");
>   const char *stripped;
>   char *upper = xstrdup_toupper(enc);
>   upper[strlen(upper)-2] = '\0';
>   skip_prefix(upper, "UTF-", ) ||
>   skip_prefix(stripped, "UTF", );
>   advise(advise_msg, path, stripped);

Are you now interested in not having any interim step and jump
directly to the endgame solution?  If so, that is fine by me, too,
but as I already said earlier (i.e. not doing this BOM check for an
encoding that is not spelled in your canonical upcase-with-dash form
might be a feature that leaves an escape hatch), I am not all that
interested in enforcing policy at this point in the codepath to
begin with, so...



Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Lars Schneider

> On 07 Mar 2018, at 23:57, Junio C Hamano  wrote:
> 
> Lars Schneider  writes:
> 
>> At this point I thought it would make sense to make the advised
>> encoding name uppercase in both situations. OK with you?
> 
> In the endgame, if upcased and properly dashed form is always used,
> that would be good (if we are enforcing the policy, which I am not
> onboard 100% but it's your code and I do not care too strongly about
> it).  I do not see much point in an interim step that only upcases
> without doing the dash insertion.

I would like to advise the dashed form as this seems to be the
canonical form and it avoids cross platform issues. My macOS
iconv does not support the form without dashes.

Would this approach work for you?

const char *advise_msg = _(
"The file '%s' contains a byte order "
"mark (BOM). Please use UTF-%s as "
"working-tree-encoding.");
const char *stripped;
char *upper = xstrdup_toupper(enc);
upper[strlen(upper)-2] = '\0';
skip_prefix(upper, "UTF-", ) ||
skip_prefix(stripped, "UTF", );
advise(advise_msg, path, stripped);

Thanks,
Lars


Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Junio C Hamano
Lars Schneider  writes:

> In the case of has_prohibited_utf_bom() you are right as we are 
> dropping the BE/LE suffix in the advise. However, look at the 
> is_missing_required_utf_bom() advise. Here we *add* BE/LE.

So?  Then add BE/LE like "Utf-16BE" or "utf16BE".  You do not have
enough information to decide what is best for the user anyway (which
is why I am not convinced that it is even a good idea to always
upcase in the first place).

> At this point I thought it would make sense to make the advised
> encoding name uppercase in both situations. OK with you?

In the endgame, if upcased and properly dashed form is always used,
that would be good (if we are enforcing the policy, which I am not
onboard 100% but it's your code and I do not care too strongly about
it).  I do not see much point in an interim step that only upcases
without doing the dash insertion.




Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Lars Schneider

> On 07 Mar 2018, at 23:32, Junio C Hamano  wrote:
> 
> Lars Schneider  writes:
> 
>> I also would have liked to advise "UTF-16" instead of "UTF16" as
>> you suggested. However, that required a few more lines and I wanted
>> to keep the change to a minimum. I feel this could be added in a
>> follow up patch.
> 
> I'd say the whole upcase thing belongs to such a follow-up patch if
> that is the case.
> 
>>> On the other hand, if we are not enforcing such a policy decision
>>> but merely explaining a way to work around this check, then it may
>>> be better to give a variant with the smaller difference from the
>>> original (i.e. without up-casing).
>> 
>> See example mentioned above: "Utf-16". How would you handle that?
> 
> Dropping LE suffix from "Utf-16LE" or "Utf16LE" would yield "Utf-16"
> or "Utf16" if the advise message does not force policy, or "UTF-16"
> in the canoical form if it does.  Is there a problem?

In the case of has_prohibited_utf_bom() you are right as we are 
dropping the BE/LE suffix in the advise. However, look at the 
is_missing_required_utf_bom() advise. Here we *add* BE/LE.

At this point I thought it would make sense to make the advised
encoding name uppercase in both situations. OK with you?

- Lars






Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Junio C Hamano
Lars Schneider  writes:

> I also would have liked to advise "UTF-16" instead of "UTF16" as
> you suggested. However, that required a few more lines and I wanted
> to keep the change to a minimum. I feel this could be added in a
> follow up patch.

I'd say the whole upcase thing belongs to such a follow-up patch if
that is the case.

>> On the other hand, if we are not enforcing such a policy decision
>> but merely explaining a way to work around this check, then it may
>> be better to give a variant with the smaller difference from the
>> original (i.e. without up-casing).
>
> See example mentioned above: "Utf-16". How would you handle that?

Dropping LE suffix from "Utf-16LE" or "Utf16LE" would yield "Utf-16"
or "Utf16" if the advise message does not force policy, or "UTF-16"
in the canoical form if it does.  Is there a problem?


Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Lars Schneider

> On 07 Mar 2018, at 20:49, Junio C Hamano  wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> +static int validate_encoding(const char *path, const char *enc,
>> +  const char *data, size_t len, int die_on_error)
>> +{
>> +/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
>> +if (startscase_with(enc, "UTF")) {
>> +/*
>> + * Check for detectable errors in UTF encodings
>> + */
>> +if (has_prohibited_utf_bom(enc, data, len)) {
>> +const char *error_msg = _(
>> +"BOM is prohibited in '%s' if encoded as %s");
>> +/*
>> + * This advice is shown for UTF-??BE and UTF-??LE 
>> encodings.
>> + * We cut off the last two characters of the encoding 
>> name
>> + # to generate the encoding name suitable for BOMs.
>> + */
> 
> Yuck.  The code pretends to abstract away the details in a helper
> has_prohibited_x() yet the caller still knows quite a lot.

True, but has_prohibited_x() cannot create a proper error/advise
message unless we give it more parameters (e.g. path name).
Therefore, I don't see a better way right now.


>> +const char *advise_msg = _(
>> +"The file '%s' contains a byte order "
>> +"mark (BOM). Please use %s as "
>> +"working-tree-encoding.");
>> +char *upper_enc = xstrdup_toupper(enc);
>> +upper_enc[strlen(upper_enc)-2] = '\0';
>> +advise(advise_msg, path, upper_enc);
>> +free(upper_enc);
> 
> I think this up-casing is more problematic than without, not from
> the point of view of the internal code, but from the point of view
> of the end user experience.  When the user writes utf16le or
> utf-16le and the data does not trigger the BOM check, we are likely
> to successfully convert it.  I do not see the merit of suggesting
> UTF16 or UTF-16 in such a case, over telling them to just drop the
> byte-order suffix from the encoding names (i.e. utf16 or utf-16).
> 
> If you are trying to force/nudge people in the direction of
> canonical way of spelling things (which may not be a bad idea), then
> "utf16le" as the original input would want to result in "UTF-16"
> with dash in the advise, no?

Correct. In the error messages I kept the encoding name "as-is" and
only in the advise message I used the uppercase variant to steer
people into the canonical direction. My initial reason for this was
that in is_missing_required_utf_bom() we add "BE/LE" to the encoding
in the advise message. Let's say the user used "Utf-16" as encoding.
 Should "BE/LE" be upper case or lower case? To avoid that question 
I made it always upper case.

I also would have liked to advise "UTF-16" instead of "UTF16" as
you suggested. However, that required a few more lines and I wanted
to keep the change to a minimum. I feel this could be added in a
follow up patch.


> On the other hand, if we are not enforcing such a policy decision
> but merely explaining a way to work around this check, then it may
> be better to give a variant with the smaller difference from the
> original (i.e. without up-casing).

See example mentioned above: "Utf-16". How would you handle that?


Thanks,
Lars



Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Junio C Hamano
lars.schnei...@autodesk.com writes:

> +static int validate_encoding(const char *path, const char *enc,
> +   const char *data, size_t len, int die_on_error)
> +{
> + /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
> + if (startscase_with(enc, "UTF")) {
> + /*
> +  * Check for detectable errors in UTF encodings
> +  */
> + if (has_prohibited_utf_bom(enc, data, len)) {
> + const char *error_msg = _(
> + "BOM is prohibited in '%s' if encoded as %s");
> + /*
> +  * This advice is shown for UTF-??BE and UTF-??LE 
> encodings.
> +  * We cut off the last two characters of the encoding 
> name
> +  # to generate the encoding name suitable for BOMs.
> +  */

Yuck.  The code pretends to abstract away the details in a helper
has_prohibited_x() yet the caller still knows quite a lot.

> + const char *advise_msg = _(
> + "The file '%s' contains a byte order "
> + "mark (BOM). Please use %s as "
> + "working-tree-encoding.");
> + char *upper_enc = xstrdup_toupper(enc);
> + upper_enc[strlen(upper_enc)-2] = '\0';
> + advise(advise_msg, path, upper_enc);
> + free(upper_enc);

I think this up-casing is more problematic than without, not from
the point of view of the internal code, but from the point of view
of the end user experience.  When the user writes utf16le or
utf-16le and the data does not trigger the BOM check, we are likely
to successfully convert it.  I do not see the merit of suggesting
UTF16 or UTF-16 in such a case, over telling them to just drop the
byte-order suffix from the encoding names (i.e. utf16 or utf-16).

If you are trying to force/nudge people in the direction of
canonical way of spelling things (which may not be a bad idea), then
"utf16le" as the original input would want to result in "UTF-16"
with dash in the advise, no?

On the other hand, if we are not enforcing such a policy decision
but merely explaining a way to work around this check, then it may
be better to give a variant with the smaller difference from the
original (i.e. without up-casing).


Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Eric Sunshine
On Wed, Mar 7, 2018 at 12:30 PM,   wrote:
> Check that new content is valid with respect to the user defined
> 'working-tree-encoding' attribute.
>
> Signed-off-by: Lars Schneider 
> ---
> diff --git a/convert.c b/convert.c
> @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct 
> text_stat *stats,
> +static int validate_encoding(const char *path, const char *enc,
> + const char *data, size_t len, int die_on_error)
> +{
> +   /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
> +   if (startscase_with(enc, "UTF")) {
> +   /*
> +* Check for detectable errors in UTF encodings
> +*/
> +   if (has_prohibited_utf_bom(enc, data, len)) {
> +   const char *error_msg = _(
> +   "BOM is prohibited in '%s' if encoded as %s");
> +   /*
> +* This advice is shown for UTF-??BE and UTF-??LE 
> encodings.
> +* We cut off the last two characters of the encoding 
> name
> +# to generate the encoding name suitable for BOMs.

s/#/*/

> +*/
> +   const char *advise_msg = _(
> +   "The file '%s' contains a byte order "
> +   "mark (BOM). Please use %s as "
> +   "working-tree-encoding.");
> +   char *upper_enc = xstrdup_toupper(enc);
> +   upper_enc[strlen(upper_enc)-2] = '\0';

Due to startscase_with(...,"UTF"), we know at this point that the
string is at least 3 characters long, thus it's safe to back up by 2.
Good.

> +   advise(advise_msg, path, upper_enc);
> +   free(upper_enc);
> +   if (die_on_error)
> +   die(error_msg, path, enc);
> +   else {
> +   return error(error_msg, path, enc);
> +   }
> diff --git a/t/t0028-working-tree-encoding.sh 
> b/t/t0028-working-tree-encoding.sh
> @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes 
> support' '
>  for i in 16 32
>  do
> +   test_expect_success "check prohibited UTF-${i} BOM" '
> +   test_when_finished "git reset --hard HEAD" &&
> +
> +   echo "*.utf${i}be text working-tree-encoding=utf-${i}be" 
> >>.gitattributes &&
> +   echo "*.utf${i}le text working-tree-encoding=utf-${i}le" 
> >>.gitattributes &&

v10 is checking only hyphenated lowercase encoding name; earlier
versions checked uppercase. For better coverage, it would be nice to
check several combinations: all uppercase, all lowercase, mixed case,
hyphenated, not hyphenated.

I'm not suggesting running all the tests repeatedly but rather just
varying the format of the encoding name in these tests you're adding.
For instance, the above could instead be:

echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes &&
echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes &&

or something.

> +   # Here we add a UTF-16 (resp. UTF-32) files with BOM 
> (big/little-endian)
> +   # but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. 
> UTF-32).
> +   # In these cases the BOM is prohibited.
> +   cp bebom.utf${i}be.raw bebom.utf${i}be &&
> +   test_must_fail git add bebom.utf${i}be 2>err.out &&
> +   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" 
> err.out &&
> +
> +   cp lebom.utf${i}le.raw lebom.utf${i}be &&
> +   test_must_fail git add lebom.utf${i}be 2>err.out &&
> +   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" 
> err.out &&
> +
> +   cp bebom.utf${i}be.raw bebom.utf${i}le &&
> +   test_must_fail git add bebom.utf${i}le 2>err.out &&
> +   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" 
> err.out &&
> +
> +   cp lebom.utf${i}le.raw lebom.utf${i}le &&
> +   test_must_fail git add lebom.utf${i}le 2>err.out &&
> +   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out
> +   '
> +
> +   test_expect_success "check required UTF-${i} BOM" '
> +   test_when_finished "git reset --hard HEAD" &&
> +
> +   echo "*.utf${i} text working-tree-encoding=utf-${i}" 
> >>.gitattributes &&

This is another opportunity for checking a variation (case,
hyphenation) of the encoding name rather than using only hyphenated
lowercase.

> +
> +   cp nobom.utf${i}be.raw nobom.utf${i} &&
> +   test_must_fail git add nobom.utf${i} 2>err.out &&
> +   test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out &&
> +
> +   cp 

[PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread lars . schneider
From: Lars Schneider 

Check that new content is valid with respect to the user defined
'working-tree-encoding' attribute.

Signed-off-by: Lars Schneider 
---
 convert.c| 55 +++
 t/t0028-working-tree-encoding.sh | 56 
 2 files changed, 111 insertions(+)

diff --git a/convert.c b/convert.c
index 80549ff2b5..f19a15dd02 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 
 }
 
+static int validate_encoding(const char *path, const char *enc,
+ const char *data, size_t len, int die_on_error)
+{
+   /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
+   if (startscase_with(enc, "UTF")) {
+   /*
+* Check for detectable errors in UTF encodings
+*/
+   if (has_prohibited_utf_bom(enc, data, len)) {
+   const char *error_msg = _(
+   "BOM is prohibited in '%s' if encoded as %s");
+   /*
+* This advice is shown for UTF-??BE and UTF-??LE 
encodings.
+* We cut off the last two characters of the encoding 
name
+# to generate the encoding name suitable for BOMs.
+*/
+   const char *advise_msg = _(
+   "The file '%s' contains a byte order "
+   "mark (BOM). Please use %s as "
+   "working-tree-encoding.");
+   char *upper_enc = xstrdup_toupper(enc);
+   upper_enc[strlen(upper_enc)-2] = '\0';
+   advise(advise_msg, path, upper_enc);
+   free(upper_enc);
+   if (die_on_error)
+   die(error_msg, path, enc);
+   else {
+   return error(error_msg, path, enc);
+   }
+
+   } else if (is_missing_required_utf_bom(enc, data, len)) {
+   const char *error_msg = _(
+   "BOM is required in '%s' if encoded as %s");
+   const char *advise_msg = _(
+   "The file '%s' is missing a byte order "
+   "mark (BOM). Please use %sBE or %sLE "
+   "(depending on the byte order) as "
+   "working-tree-encoding.");
+   char *upper_enc = xstrdup_toupper(enc);
+   advise(advise_msg, path, upper_enc, upper_enc);
+   free(upper_enc);
+   if (die_on_error)
+   die(error_msg, path, enc);
+   else {
+   return error(error_msg, path, enc);
+   }
+   }
+
+   }
+   return 0;
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -291,6 +343,9 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
if (!buf && !src)
return 1;
 
+   if (validate_encoding(path, enc, src, src_len, die_on_error))
+   return 0;
+
dst = reencode_string_len(src, src_len, default_encoding, enc,
  _len);
if (!dst) {
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 89b4dbee1d..aa05f82166 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' 
'
 
 for i in 16 32
 do
+   test_expect_success "check prohibited UTF-${i} BOM" '
+   test_when_finished "git reset --hard HEAD" &&
+
+   echo "*.utf${i}be text working-tree-encoding=utf-${i}be" 
>>.gitattributes &&
+   echo "*.utf${i}le text working-tree-encoding=utf-${i}le" 
>>.gitattributes &&
+
+   # Here we add a UTF-16 (resp. UTF-32) files with BOM 
(big/little-endian)
+   # but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. 
UTF-32).
+   # In these cases the BOM is prohibited.
+   cp bebom.utf${i}be.raw bebom.utf${i}be &&
+   test_must_fail git add bebom.utf${i}be 2>err.out &&
+   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out 
&&
+
+   cp lebom.utf${i}le.raw lebom.utf${i}be &&
+   test_must_fail git add lebom.utf${i}be 2>err.out &&
+   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out 
&&
+
+   cp bebom.utf${i}be.raw bebom.utf${i}le &&
+