Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute
> On 05 Apr 2018, at 18:41, Torsten Bögershausenwrote: > > On 01.04.18 15:24, Lars Schneider wrote: >>> TRUE or false are values, but just wrong ones. >>> If this test is removed, the user will see "failed to encode "TRUE" to >>> "UTF-8", >>> which should give enough information to fix it. >> >> I see your point. However, I would like to stop the processing right >> there for these invalid values. How about >> >> error(_("true/false are no valid working-tree-encodings")); >> >> I think that is the most straight forward/helpful error message >> for the enduser (I consider the term "boolean" but dismissed it >> as potentially confusing to folks not familiar with the term). >> >> OK with you? > > Yes. Great! > Another thing that came up recently, independent of your series: > > What should happen if a user specifies "UTF-8" and the file > has an UTF-8 encoded BOM ? > I ask because I stumbled over such a file coming from a Windows > which the java compiler under Linux didn't accept. > > And because some tools love to put an UTF-8 encoded BOM > into text files. > > The clearest thing would be to extend the BOM check in 5/9 > to cover UTF-32, UTF-16 and UTF-8. > > Are there any plans to do so? If `working-tree-encoding` is not defined or defined as UTF-8, then we would return from encode_to_git() early. That means we would never run validate_encoding() which would check the BOM. However, adding the UTF-8 BOM would still make sense. This way Git could scream if a user set `working-tree-encoding` to UTF-16 but the file is really UTF-8 encoded. > And thanks for the work. Thanks :-) - Lars
Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute
On 01.04.18 15:24, Lars Schneider wrote: >> TRUE or false are values, but just wrong ones. >> If this test is removed, the user will see "failed to encode "TRUE" to >> "UTF-8", >> which should give enough information to fix it. > > I see your point. However, I would like to stop the processing right > there for these invalid values. How about > > error(_("true/false are no valid working-tree-encodings")); > > I think that is the most straight forward/helpful error message > for the enduser (I consider the term "boolean" but dismissed it > as potentially confusing to folks not familiar with the term). > > OK with you? Yes. Another thing that came up recently, independent of your series: What should happen if a user specifies "UTF-8" and the file has an UTF-8 encoded BOM ? I ask because I stumbled over such a file coming from a Windows which the java compiler under Linux didn't accept. And because some tools love to put an UTF-8 encoded BOM into text files. The clearest thing would be to extend the BOM check in 5/9 to cover UTF-32, UTF-16 and UTF-8. Are there any plans to do so? And thanks for the work.
Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute
> On 18 Mar 2018, at 08:24, Torsten Bögershausenwrote: > > Some comments inline > > On Fri, Mar 09, 2018 at 06:35:32PM +0100, lars.schnei...@autodesk.com wrote: >> From: Lars Schneider >> >> Git recognizes files encoded with ASCII or one of its supersets (e.g. >> UTF-8 or ISO-8859-1) as text files. All other encodings are usually >> interpreted as binary and consequently built-in Git text processing >> tools (e.g. 'git diff') as well as most Git web front ends do not >> visualize the content. >> >> Add an attribute to tell Git what encoding the user has defined for a >> given file. If the content is added to the index, then Git converts the > > Minor comment: > "Git converts the content" > Everywhere else (?) "encodes or reencodes" is used. > "Git reencodes the content" may be more consistent. OK, will change. >> >> +static const char *default_encoding = "UTF-8"; >> + >> +static int encode_to_git(const char *path, const char *src, size_t src_len, >> + struct strbuf *buf, const char *enc, int conv_flags) >> +{ >> +char *dst; >> +int dst_len; >> +int die_on_error = conv_flags & CONV_WRITE_OBJECT; >> + >> +/* >> + * No encoding is specified or there is nothing to encode. >> + * Tell the caller that the content was not modified. >> + */ >> +if (!enc || (src && !src_len)) >> +return 0; > > (This may have been discussed before. > As we checked (enc != NULL) I think we can add here:) > if (is_encoding_utf8(enc)) > return 0; This should be covered in git_path_check_encoding(), introduced in v12: /* Don't encode to the default encoding */ if (same_encoding(value, default_encoding)) return NULL; In that function the encoding of a certain file is read from the .gitattributes. If the encoding matches the compile-time defined default encoding (= UTF-8), then the encoding is set to NULL. >> >> + >> +static int encode_to_worktree(const char *path, const char *src, size_t >> src_len, >> + struct strbuf *buf, const char *enc) >> +{ >> +char *dst; >> +int dst_len; >> + >> +/* >> + * No encoding is specified or there is nothing to encode. >> + * Tell the caller that the content was not modified. >> + */ >> +if (!enc || (src && !src_len)) >> +return 0; > > Same as above: > if (is_encoding_utf8(enc)) > return 0; > >> + >> +dst = reencode_string_len(src, src_len, enc, default_encoding, >> + _len); >> +if (!dst) { >> +error("failed to encode '%s' from %s to %s", >> +path, default_encoding, enc); >> +return 0; >> +} >> + >> +strbuf_attach(buf, dst, dst_len, dst_len + 1); >> +return 1; >> +} >> + >> static int crlf_to_git(const struct index_state *istate, >> const char *path, const char *src, size_t len, >> struct strbuf *buf, >> @@ -978,6 +1051,25 @@ static int ident_to_worktree(const char *path, const >> char *src, size_t len, >> return 1; >> } >> >> +static const char *git_path_check_encoding(struct attr_check_item *check) >> +{ >> +const char *value = check->value; >> + >> +if (ATTR_UNSET(value) || !strlen(value)) >> +return NULL; >> + > > >> +if (ATTR_TRUE(value) || ATTR_FALSE(value)) { >> +error(_("working-tree-encoding attribute requires a value")); >> +return NULL; >> +} > > TRUE or false are values, but just wrong ones. > If this test is removed, the user will see "failed to encode "TRUE" to > "UTF-8", > which should give enough information to fix it. I see your point. However, I would like to stop the processing right there for these invalid values. How about error(_("true/false are no valid working-tree-encodings")); I think that is the most straight forward/helpful error message for the enduser (I consider the term "boolean" but dismissed it as potentially confusing to folks not familiar with the term). OK with you? > >> + >> +/* Don't encode to the default encoding */ >> +if (!strcasecmp(value, default_encoding)) >> +return NULL; > Same as above ?: > if (is_encoding_utf8(value)) > return 0; Yes, that was fixed in v12 as mentioned above :-) - Lars
Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute
Some comments inline On Fri, Mar 09, 2018 at 06:35:32PM +0100, lars.schnei...@autodesk.com wrote: > From: Lars Schneider> > Git recognizes files encoded with ASCII or one of its supersets (e.g. > UTF-8 or ISO-8859-1) as text files. All other encodings are usually > interpreted as binary and consequently built-in Git text processing > tools (e.g. 'git diff') as well as most Git web front ends do not > visualize the content. > > Add an attribute to tell Git what encoding the user has defined for a > given file. If the content is added to the index, then Git converts the Minor comment: "Git converts the content" Everywhere else (?) "encodes or reencodes" is used. "Git reencodes the content" may be more consistent. [No comments on the .gitattributes] > > diff --git a/convert.c b/convert.c > index b976eb968c..aa59ecfe49 100644 > --- a/convert.c > +++ b/convert.c > @@ -7,6 +7,7 @@ > #include "sigchain.h" > #include "pkt-line.h" > #include "sub-process.h" > +#include "utf8.h" > > /* > * convert.c - convert a file when checking it out and checking it in. > @@ -265,6 +266,78 @@ static int will_convert_lf_to_crlf(size_t len, struct > text_stat *stats, > > } > > +static const char *default_encoding = "UTF-8"; > + > +static int encode_to_git(const char *path, const char *src, size_t src_len, > + struct strbuf *buf, const char *enc, int conv_flags) > +{ > + char *dst; > + int dst_len; > + int die_on_error = conv_flags & CONV_WRITE_OBJECT; > + > + /* > + * No encoding is specified or there is nothing to encode. > + * Tell the caller that the content was not modified. > + */ > + if (!enc || (src && !src_len)) > + return 0; (This may have been discussed before. As we checked (enc != NULL) I think we can add here:) if (is_encoding_utf8(enc)) return 0; > + > + /* > + * Looks like we got called from "would_convert_to_git()". > + * This means Git wants to know if it would encode (= modify!) > + * the content. Let's answer with "yes", since an encoding was > + * specified. > + */ > + if (!buf && !src) > + return 1; > + > + dst = reencode_string_len(src, src_len, default_encoding, enc, > + _len); > + if (!dst) { > + /* > + * We could add the blob "as-is" to Git. However, on checkout > + * we would try to reencode to the original encoding. This > + * would fail and we would leave the user with a messed-up > + * working tree. Let's try to avoid this by screaming loud. > + */ > + const char* msg = _("failed to encode '%s' from %s to %s"); > + if (die_on_error) > + die(msg, path, enc, default_encoding); > + else { > + error(msg, path, enc, default_encoding); > + return 0; > + } > + } > + > + strbuf_attach(buf, dst, dst_len, dst_len + 1); > + return 1; > +} > + > +static int encode_to_worktree(const char *path, const char *src, size_t > src_len, > + struct strbuf *buf, const char *enc) > +{ > + char *dst; > + int dst_len; > + > + /* > + * No encoding is specified or there is nothing to encode. > + * Tell the caller that the content was not modified. > + */ > + if (!enc || (src && !src_len)) > + return 0; Same as above: if (is_encoding_utf8(enc)) return 0; > + > + dst = reencode_string_len(src, src_len, enc, default_encoding, > + _len); > + if (!dst) { > + error("failed to encode '%s' from %s to %s", > + path, default_encoding, enc); > + return 0; > + } > + > + strbuf_attach(buf, dst, dst_len, dst_len + 1); > + return 1; > +} > + > static int crlf_to_git(const struct index_state *istate, > const char *path, const char *src, size_t len, > struct strbuf *buf, > @@ -978,6 +1051,25 @@ static int ident_to_worktree(const char *path, const > char *src, size_t len, > return 1; > } > > +static const char *git_path_check_encoding(struct attr_check_item *check) > +{ > + const char *value = check->value; > + > + if (ATTR_UNSET(value) || !strlen(value)) > + return NULL; > + > + if (ATTR_TRUE(value) || ATTR_FALSE(value)) { > + error(_("working-tree-encoding attribute requires a value")); > + return NULL; > + } TRUE or false are values, but just wrong ones. If this test is removed, the user will see "failed to encode "TRUE" to "UTF-8", which should give enough information to fix it. > + > + /* Don't encode to the default encoding */ > + if (!strcasecmp(value, default_encoding)) > + return NULL; Same as above ?:
Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute
> On 09 Mar 2018, at 20:10, Junio C Hamanowrote: > > lars.schnei...@autodesk.com writes: > >> +static const char *default_encoding = "UTF-8"; >> + >> ... >> +static const char *git_path_check_encoding(struct attr_check_item *check) >> +{ >> +const char *value = check->value; >> + >> +if (ATTR_UNSET(value) || !strlen(value)) >> +return NULL; >> + >> +if (ATTR_TRUE(value) || ATTR_FALSE(value)) { >> +error(_("working-tree-encoding attribute requires a value")); >> +return NULL; >> +} > > Hmph, so we decide to be loud but otherwise ignore an undefined > configuration? Shouldn't we rather die instead to avoid touching > the user data in unexpected ways? OK. >> + >> +/* Don't encode to the default encoding */ >> +if (!strcasecmp(value, default_encoding)) >> +return NULL; > > Is this an optimization to avoid "recode one encoding to the same > encoding" no-op overhead? Correct. > We already have the optimization in the > same spirit in may existing codepaths that has nothing to do with > w-t-e, and I think we should share the code. Two pieces of thought > comes to mind. > > One is a lot smaller in scale: Is same_encoding() sufficient for > this callsite instead of strcasecmp()? Yes! > The other one is a lot bigger: Looking at all the existing callers > of same_encoding() that call reencode_string() when it returns false, > would it make sense to drop same_encoding() and move the optimization > to reencode_string() instead? > > I suspect that the answer to the smaller one is "yes, and even if > not, it should be easy to enhance/extend same_encoding() to make it > do what we want it to, and such a change will benefit even existing > callers." The answer to the larger one is likely "the optimization > is not about skipping only reencode_string() call but other things > are subtly different among callers of same_encoding(), so such a > refactoring would not be all that useful." I agree. reencode_string() would need to signal 3 cases: 1. reencode performed 2. reencode not necessary 3. reencode failed We could model "reencode not necessary" as "char *in == char *return". However, I think this should be tackled in a separate series. Thanks Lars
Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute
lars.schnei...@autodesk.com writes: > +static const char *default_encoding = "UTF-8"; > + > ... > +static const char *git_path_check_encoding(struct attr_check_item *check) > +{ > + const char *value = check->value; > + > + if (ATTR_UNSET(value) || !strlen(value)) > + return NULL; > + > + if (ATTR_TRUE(value) || ATTR_FALSE(value)) { > + error(_("working-tree-encoding attribute requires a value")); > + return NULL; > + } Hmph, so we decide to be loud but otherwise ignore an undefined configuration? Shouldn't we rather die instead to avoid touching the user data in unexpected ways? > + > + /* Don't encode to the default encoding */ > + if (!strcasecmp(value, default_encoding)) > + return NULL; Is this an optimization to avoid "recode one encoding to the same encoding" no-op overhead? We already have the optimization in the same spirit in may existing codepaths that has nothing to do with w-t-e, and I think we should share the code. Two pieces of thought comes to mind. One is a lot smaller in scale: Is same_encoding() sufficient for this callsite instead of strcasecmp()? The other one is a lot bigger: Looking at all the existing callers of same_encoding() that call reencode_string() when it returns false, would it make sense to drop same_encoding() and move the optimization to reencode_string() instead? I suspect that the answer to the smaller one is "yes, and even if not, it should be easy to enhance/extend same_encoding() to make it do what we want it to, and such a change will benefit even existing callers." The answer to the larger one is likely "the optimization is not about skipping only reencode_string() call but other things are subtly different among callers of same_encoding(), so such a refactoring would not be all that useful." The above still holds for the code after 10/10 touches this part.