Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
> On 07 Mar 2018, at 19:04, Eric Sunshinewrote: > > On Wed, Mar 7, 2018 at 12:30 PM, wrote: >> Check that new content is valid with respect to the user defined >> 'working-tree-encoding' attribute. >> >> Signed-off-by: Lars Schneider >> --- >> diff --git a/convert.c b/convert.c >> @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct >> text_stat *stats, >> +static int validate_encoding(const char *path, const char *enc, >> + const char *data, size_t len, int die_on_error) >> +{ >> + /* We only check for UTF here as UTF?? can be an alias for UTF-?? */ >> + if (startscase_with(enc, "UTF")) { >> + /* >> +* Check for detectable errors in UTF encodings >> +*/ >> + if (has_prohibited_utf_bom(enc, data, len)) { >> + const char *error_msg = _( >> + "BOM is prohibited in '%s' if encoded as >> %s"); >> + /* >> +* This advice is shown for UTF-??BE and UTF-??LE >> encodings. >> +* We cut off the last two characters of the >> encoding name >> +# to generate the encoding name suitable for BOMs. > > s/#/*/ Of course! >> diff --git a/t/t0028-working-tree-encoding.sh >> b/t/t0028-working-tree-encoding.sh >> @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes >> support' ' >> for i in 16 32 >> do >> + test_expect_success "check prohibited UTF-${i} BOM" ' >> + test_when_finished "git reset --hard HEAD" && >> + >> + echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >> >>.gitattributes && >> + echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >> >>.gitattributes && > > v10 is checking only hyphenated lowercase encoding name; earlier > versions checked uppercase. For better coverage, it would be nice to > check several combinations: all uppercase, all lowercase, mixed case, > hyphenated, not hyphenated. > > I'm not suggesting running all the tests repeatedly but rather just > varying the format of the encoding name in these tests you're adding. > For instance, the above could instead be: > >echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes > && >echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes && > > or something. The casing is a good idea - I will do that. I don't want to do "hyphenated, not hyphenated" as this would make the tests fail on macOS (and I believe on Windows). Thanks, Lars
Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
Lars Schneiderwrites: > I would like to advise the dashed form as this seems to be the > canonical form and it avoids cross platform issues. My macOS > iconv does not support the form without dashes. Sure, that is why I said canonicalization without inserting dash does not make much sense, hence an interim step with only upcasing is not a good idea. A possible interim solution would be to do nothing (no dash insertion, no upcasing) and fixing both in a later follow-up patch, but as I said, I do not care too strongly either way. > Would this approach work for you? > > const char *advise_msg = _( > "The file '%s' contains a byte order " > "mark (BOM). Please use UTF-%s as " > "working-tree-encoding."); > const char *stripped; > char *upper = xstrdup_toupper(enc); > upper[strlen(upper)-2] = '\0'; > skip_prefix(upper, "UTF-", ) || > skip_prefix(stripped, "UTF", ); > advise(advise_msg, path, stripped); Are you now interested in not having any interim step and jump directly to the endgame solution? If so, that is fine by me, too, but as I already said earlier (i.e. not doing this BOM check for an encoding that is not spelled in your canonical upcase-with-dash form might be a feature that leaves an escape hatch), I am not all that interested in enforcing policy at this point in the codepath to begin with, so...
Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
> On 07 Mar 2018, at 23:57, Junio C Hamanowrote: > > Lars Schneider writes: > >> At this point I thought it would make sense to make the advised >> encoding name uppercase in both situations. OK with you? > > In the endgame, if upcased and properly dashed form is always used, > that would be good (if we are enforcing the policy, which I am not > onboard 100% but it's your code and I do not care too strongly about > it). I do not see much point in an interim step that only upcases > without doing the dash insertion. I would like to advise the dashed form as this seems to be the canonical form and it avoids cross platform issues. My macOS iconv does not support the form without dashes. Would this approach work for you? const char *advise_msg = _( "The file '%s' contains a byte order " "mark (BOM). Please use UTF-%s as " "working-tree-encoding."); const char *stripped; char *upper = xstrdup_toupper(enc); upper[strlen(upper)-2] = '\0'; skip_prefix(upper, "UTF-", ) || skip_prefix(stripped, "UTF", ); advise(advise_msg, path, stripped); Thanks, Lars
Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
Lars Schneiderwrites: > In the case of has_prohibited_utf_bom() you are right as we are > dropping the BE/LE suffix in the advise. However, look at the > is_missing_required_utf_bom() advise. Here we *add* BE/LE. So? Then add BE/LE like "Utf-16BE" or "utf16BE". You do not have enough information to decide what is best for the user anyway (which is why I am not convinced that it is even a good idea to always upcase in the first place). > At this point I thought it would make sense to make the advised > encoding name uppercase in both situations. OK with you? In the endgame, if upcased and properly dashed form is always used, that would be good (if we are enforcing the policy, which I am not onboard 100% but it's your code and I do not care too strongly about it). I do not see much point in an interim step that only upcases without doing the dash insertion.
Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
> On 07 Mar 2018, at 23:32, Junio C Hamanowrote: > > Lars Schneider writes: > >> I also would have liked to advise "UTF-16" instead of "UTF16" as >> you suggested. However, that required a few more lines and I wanted >> to keep the change to a minimum. I feel this could be added in a >> follow up patch. > > I'd say the whole upcase thing belongs to such a follow-up patch if > that is the case. > >>> On the other hand, if we are not enforcing such a policy decision >>> but merely explaining a way to work around this check, then it may >>> be better to give a variant with the smaller difference from the >>> original (i.e. without up-casing). >> >> See example mentioned above: "Utf-16". How would you handle that? > > Dropping LE suffix from "Utf-16LE" or "Utf16LE" would yield "Utf-16" > or "Utf16" if the advise message does not force policy, or "UTF-16" > in the canoical form if it does. Is there a problem? In the case of has_prohibited_utf_bom() you are right as we are dropping the BE/LE suffix in the advise. However, look at the is_missing_required_utf_bom() advise. Here we *add* BE/LE. At this point I thought it would make sense to make the advised encoding name uppercase in both situations. OK with you? - Lars
Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
Lars Schneiderwrites: > I also would have liked to advise "UTF-16" instead of "UTF16" as > you suggested. However, that required a few more lines and I wanted > to keep the change to a minimum. I feel this could be added in a > follow up patch. I'd say the whole upcase thing belongs to such a follow-up patch if that is the case. >> On the other hand, if we are not enforcing such a policy decision >> but merely explaining a way to work around this check, then it may >> be better to give a variant with the smaller difference from the >> original (i.e. without up-casing). > > See example mentioned above: "Utf-16". How would you handle that? Dropping LE suffix from "Utf-16LE" or "Utf16LE" would yield "Utf-16" or "Utf16" if the advise message does not force policy, or "UTF-16" in the canoical form if it does. Is there a problem?
Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
> On 07 Mar 2018, at 20:49, Junio C Hamanowrote: > > lars.schnei...@autodesk.com writes: > >> +static int validate_encoding(const char *path, const char *enc, >> + const char *data, size_t len, int die_on_error) >> +{ >> +/* We only check for UTF here as UTF?? can be an alias for UTF-?? */ >> +if (startscase_with(enc, "UTF")) { >> +/* >> + * Check for detectable errors in UTF encodings >> + */ >> +if (has_prohibited_utf_bom(enc, data, len)) { >> +const char *error_msg = _( >> +"BOM is prohibited in '%s' if encoded as %s"); >> +/* >> + * This advice is shown for UTF-??BE and UTF-??LE >> encodings. >> + * We cut off the last two characters of the encoding >> name >> + # to generate the encoding name suitable for BOMs. >> + */ > > Yuck. The code pretends to abstract away the details in a helper > has_prohibited_x() yet the caller still knows quite a lot. True, but has_prohibited_x() cannot create a proper error/advise message unless we give it more parameters (e.g. path name). Therefore, I don't see a better way right now. >> +const char *advise_msg = _( >> +"The file '%s' contains a byte order " >> +"mark (BOM). Please use %s as " >> +"working-tree-encoding."); >> +char *upper_enc = xstrdup_toupper(enc); >> +upper_enc[strlen(upper_enc)-2] = '\0'; >> +advise(advise_msg, path, upper_enc); >> +free(upper_enc); > > I think this up-casing is more problematic than without, not from > the point of view of the internal code, but from the point of view > of the end user experience. When the user writes utf16le or > utf-16le and the data does not trigger the BOM check, we are likely > to successfully convert it. I do not see the merit of suggesting > UTF16 or UTF-16 in such a case, over telling them to just drop the > byte-order suffix from the encoding names (i.e. utf16 or utf-16). > > If you are trying to force/nudge people in the direction of > canonical way of spelling things (which may not be a bad idea), then > "utf16le" as the original input would want to result in "UTF-16" > with dash in the advise, no? Correct. In the error messages I kept the encoding name "as-is" and only in the advise message I used the uppercase variant to steer people into the canonical direction. My initial reason for this was that in is_missing_required_utf_bom() we add "BE/LE" to the encoding in the advise message. Let's say the user used "Utf-16" as encoding. Should "BE/LE" be upper case or lower case? To avoid that question I made it always upper case. I also would have liked to advise "UTF-16" instead of "UTF16" as you suggested. However, that required a few more lines and I wanted to keep the change to a minimum. I feel this could be added in a follow up patch. > On the other hand, if we are not enforcing such a policy decision > but merely explaining a way to work around this check, then it may > be better to give a variant with the smaller difference from the > original (i.e. without up-casing). See example mentioned above: "Utf-16". How would you handle that? Thanks, Lars
Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
lars.schnei...@autodesk.com writes: > +static int validate_encoding(const char *path, const char *enc, > + const char *data, size_t len, int die_on_error) > +{ > + /* We only check for UTF here as UTF?? can be an alias for UTF-?? */ > + if (startscase_with(enc, "UTF")) { > + /* > + * Check for detectable errors in UTF encodings > + */ > + if (has_prohibited_utf_bom(enc, data, len)) { > + const char *error_msg = _( > + "BOM is prohibited in '%s' if encoded as %s"); > + /* > + * This advice is shown for UTF-??BE and UTF-??LE > encodings. > + * We cut off the last two characters of the encoding > name > + # to generate the encoding name suitable for BOMs. > + */ Yuck. The code pretends to abstract away the details in a helper has_prohibited_x() yet the caller still knows quite a lot. > + const char *advise_msg = _( > + "The file '%s' contains a byte order " > + "mark (BOM). Please use %s as " > + "working-tree-encoding."); > + char *upper_enc = xstrdup_toupper(enc); > + upper_enc[strlen(upper_enc)-2] = '\0'; > + advise(advise_msg, path, upper_enc); > + free(upper_enc); I think this up-casing is more problematic than without, not from the point of view of the internal code, but from the point of view of the end user experience. When the user writes utf16le or utf-16le and the data does not trigger the BOM check, we are likely to successfully convert it. I do not see the merit of suggesting UTF16 or UTF-16 in such a case, over telling them to just drop the byte-order suffix from the encoding names (i.e. utf16 or utf-16). If you are trying to force/nudge people in the direction of canonical way of spelling things (which may not be a bad idea), then "utf16le" as the original input would want to result in "UTF-16" with dash in the advise, no? On the other hand, if we are not enforcing such a policy decision but merely explaining a way to work around this check, then it may be better to give a variant with the smaller difference from the original (i.e. without up-casing).
Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings
On Wed, Mar 7, 2018 at 12:30 PM,wrote: > Check that new content is valid with respect to the user defined > 'working-tree-encoding' attribute. > > Signed-off-by: Lars Schneider > --- > diff --git a/convert.c b/convert.c > @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct > text_stat *stats, > +static int validate_encoding(const char *path, const char *enc, > + const char *data, size_t len, int die_on_error) > +{ > + /* We only check for UTF here as UTF?? can be an alias for UTF-?? */ > + if (startscase_with(enc, "UTF")) { > + /* > +* Check for detectable errors in UTF encodings > +*/ > + if (has_prohibited_utf_bom(enc, data, len)) { > + const char *error_msg = _( > + "BOM is prohibited in '%s' if encoded as %s"); > + /* > +* This advice is shown for UTF-??BE and UTF-??LE > encodings. > +* We cut off the last two characters of the encoding > name > +# to generate the encoding name suitable for BOMs. s/#/*/ > +*/ > + const char *advise_msg = _( > + "The file '%s' contains a byte order " > + "mark (BOM). Please use %s as " > + "working-tree-encoding."); > + char *upper_enc = xstrdup_toupper(enc); > + upper_enc[strlen(upper_enc)-2] = '\0'; Due to startscase_with(...,"UTF"), we know at this point that the string is at least 3 characters long, thus it's safe to back up by 2. Good. > + advise(advise_msg, path, upper_enc); > + free(upper_enc); > + if (die_on_error) > + die(error_msg, path, enc); > + else { > + return error(error_msg, path, enc); > + } > diff --git a/t/t0028-working-tree-encoding.sh > b/t/t0028-working-tree-encoding.sh > @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes > support' ' > for i in 16 32 > do > + test_expect_success "check prohibited UTF-${i} BOM" ' > + test_when_finished "git reset --hard HEAD" && > + > + echo "*.utf${i}be text working-tree-encoding=utf-${i}be" > >>.gitattributes && > + echo "*.utf${i}le text working-tree-encoding=utf-${i}le" > >>.gitattributes && v10 is checking only hyphenated lowercase encoding name; earlier versions checked uppercase. For better coverage, it would be nice to check several combinations: all uppercase, all lowercase, mixed case, hyphenated, not hyphenated. I'm not suggesting running all the tests repeatedly but rather just varying the format of the encoding name in these tests you're adding. For instance, the above could instead be: echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes && echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes && or something. > + # Here we add a UTF-16 (resp. UTF-32) files with BOM > (big/little-endian) > + # but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. > UTF-32). > + # In these cases the BOM is prohibited. > + cp bebom.utf${i}be.raw bebom.utf${i}be && > + test_must_fail git add bebom.utf${i}be 2>err.out && > + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" > err.out && > + > + cp lebom.utf${i}le.raw lebom.utf${i}be && > + test_must_fail git add lebom.utf${i}be 2>err.out && > + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" > err.out && > + > + cp bebom.utf${i}be.raw bebom.utf${i}le && > + test_must_fail git add bebom.utf${i}le 2>err.out && > + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" > err.out && > + > + cp lebom.utf${i}le.raw lebom.utf${i}le && > + test_must_fail git add lebom.utf${i}le 2>err.out && > + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}le" err.out > + ' > + > + test_expect_success "check required UTF-${i} BOM" ' > + test_when_finished "git reset --hard HEAD" && > + > + echo "*.utf${i} text working-tree-encoding=utf-${i}" > >>.gitattributes && This is another opportunity for checking a variation (case, hyphenation) of the encoding name rather than using only hyphenated lowercase. > + > + cp nobom.utf${i}be.raw nobom.utf${i} && > + test_must_fail git add nobom.utf${i} 2>err.out && > + test_i18ngrep "fatal: BOM is required .* utf-${i}" err.out && > + > + cp
[PATCH v10 7/9] convert: check for detectable errors in UTF encodings
From: Lars SchneiderCheck that new content is valid with respect to the user defined 'working-tree-encoding' attribute. Signed-off-by: Lars Schneider --- convert.c| 55 +++ t/t0028-working-tree-encoding.sh | 56 2 files changed, 111 insertions(+) diff --git a/convert.c b/convert.c index 80549ff2b5..f19a15dd02 100644 --- a/convert.c +++ b/convert.c @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats, } +static int validate_encoding(const char *path, const char *enc, + const char *data, size_t len, int die_on_error) +{ + /* We only check for UTF here as UTF?? can be an alias for UTF-?? */ + if (startscase_with(enc, "UTF")) { + /* +* Check for detectable errors in UTF encodings +*/ + if (has_prohibited_utf_bom(enc, data, len)) { + const char *error_msg = _( + "BOM is prohibited in '%s' if encoded as %s"); + /* +* This advice is shown for UTF-??BE and UTF-??LE encodings. +* We cut off the last two characters of the encoding name +# to generate the encoding name suitable for BOMs. +*/ + const char *advise_msg = _( + "The file '%s' contains a byte order " + "mark (BOM). Please use %s as " + "working-tree-encoding."); + char *upper_enc = xstrdup_toupper(enc); + upper_enc[strlen(upper_enc)-2] = '\0'; + advise(advise_msg, path, upper_enc); + free(upper_enc); + if (die_on_error) + die(error_msg, path, enc); + else { + return error(error_msg, path, enc); + } + + } else if (is_missing_required_utf_bom(enc, data, len)) { + const char *error_msg = _( + "BOM is required in '%s' if encoded as %s"); + const char *advise_msg = _( + "The file '%s' is missing a byte order " + "mark (BOM). Please use %sBE or %sLE " + "(depending on the byte order) as " + "working-tree-encoding."); + char *upper_enc = xstrdup_toupper(enc); + advise(advise_msg, path, upper_enc, upper_enc); + free(upper_enc); + if (die_on_error) + die(error_msg, path, enc); + else { + return error(error_msg, path, enc); + } + } + + } + return 0; +} + static const char *default_encoding = "UTF-8"; static int encode_to_git(const char *path, const char *src, size_t src_len, @@ -291,6 +343,9 @@ static int encode_to_git(const char *path, const char *src, size_t src_len, if (!buf && !src) return 1; + if (validate_encoding(path, enc, src, src_len, die_on_error)) + return 0; + dst = reencode_string_len(src, src_len, default_encoding, enc, _len); if (!dst) { diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh index 89b4dbee1d..aa05f82166 100755 --- a/t/t0028-working-tree-encoding.sh +++ b/t/t0028-working-tree-encoding.sh @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' ' for i in 16 32 do + test_expect_success "check prohibited UTF-${i} BOM" ' + test_when_finished "git reset --hard HEAD" && + + echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes && + echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >>.gitattributes && + + # Here we add a UTF-16 (resp. UTF-32) files with BOM (big/little-endian) + # but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. UTF-32). + # In these cases the BOM is prohibited. + cp bebom.utf${i}be.raw bebom.utf${i}be && + test_must_fail git add bebom.utf${i}be 2>err.out && + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out && + + cp lebom.utf${i}le.raw lebom.utf${i}be && + test_must_fail git add lebom.utf${i}be 2>err.out && + test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out && + + cp bebom.utf${i}be.raw bebom.utf${i}le && +