Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
Torsten Bögershausenwrites: >> the user explicitly tells us it is in UTF-16, right? Is there such a >> thing as UTF-16 binary? > > I don't think so, by definiton UTF-16 is ment to be text. > (this means that git ls-files --eol needs some update, I can have a look) > > Do we agree that UTF-16 is text ? > If yes, could Git assume that the "text" attribute is set when > working-tree-encoding is set ? These are two different questions. It seems that between the two of us, we established that "UTF-16 binary" is a nonsense, and a path that is given working-tree-encoding=UTF-16 must be treated as holding a text file. But that does not have direct relevance to the other question you are asking: "is a question 'does this path have text attribute set?' be answered with 'yes' if the path has wte attribute set to UTF-16?" I think the answer to that latter question ought to be "no". By the way, a related tangent is if it makes sense to give working-tree-encoding to anything that is binary, regardless of the value---I am inclined to say it is not, so the issue here is not "by definition UTF-16 is text", but "any path that has wte set to some enconding should be treated the same way as if the path also has text attribute set as far as convert machinery is concerned.".
Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
On Fri, Feb 02, 2018 at 11:17:04AM -0800, Junio C Hamano wrote: > Torsten Bögershausenwrites: > > > There are 2 opposite opionions/user expectations here: > > > > a) They are binary in the working tree, so git should leave the line endings > >as is. (Unless specified otherwise in the .attributes file) > > ... > > b) They are text files in the index. Git will convert line endings > >if core.autocrlf is true (or the .gitattributes file specifies "-text") > > I sense that you seem to be focusing on the distinction between "in > the working tree" vs "in the index" while contrasting. The "binary > vs text" in your "binary in wt, text in index" is based on the > default heuristics without any input from end-users or the project > that uses Git that happens to contain such files. If the users and > the project that uses Git want to treat contents in a path as text, > it is text even when it is (re-)encoded to UTF-16, no? > > Such files may be (mis)classified as binary with the default > heuristics when there is no help from what is written in the > .gitattributes file, but here we are talking about the case where > the user explicitly tells us it is in UTF-16, right? Is there such a > thing as UTF-16 binary? I don't think so, by definiton UTF-16 is ment to be text. (this means that git ls-files --eol needs some update, I can have a look) Do we agree that UTF-16 is text ? If yes, could Git assume that the "text" attribute is set when working-tree-encoding is set ? I would even go a step further and demand that the user -must- make a decision about the line endings for working-tree-encoded files: working-tree-encoding=UTF-16 # illegal, die() working-tree-encoding=UTF-16 text=auto # illegal, die() working-tree-encoding=UTF-16 -text # no eol conversion working-tree-encoding=UTF-16 text# eol according to core.eol working-tree-encoding=UTF-16 text eol=lf # LF working-tree-encoding=UTF-16 text eol=crlf # CRLF What do you think ?
Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
Torsten Bögershausenwrites: > There are 2 opposite opionions/user expectations here: > > a) They are binary in the working tree, so git should leave the line endings >as is. (Unless specified otherwise in the .attributes file) > ... > b) They are text files in the index. Git will convert line endings >if core.autocrlf is true (or the .gitattributes file specifies "-text") I sense that you seem to be focusing on the distinction between "in the working tree" vs "in the index" while contrasting. The "binary vs text" in your "binary in wt, text in index" is based on the default heuristics without any input from end-users or the project that uses Git that happens to contain such files. If the users and the project that uses Git want to treat contents in a path as text, it is text even when it is (re-)encoded to UTF-16, no? Such files may be (mis)classified as binary with the default heuristics when there is no help from what is written in the .gitattributes file, but here we are talking about the case where the user explicitly tells us it is in UTF-16, right? Is there such a thing as UTF-16 binary?
Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
> On 31 Jan 2018, at 18:28, Torsten Bögershausenwrote: > > [] >>> That is a good one. >>> If you ever plan a re-roll (I don't at the moment) the *.proj extemsion >>> make much more sense in Documentation/gitattributes that *.tx >>> There no text files encoded in UTF-16 wich are called xxx.txt, but those >>> are non-ideal examples. *.proj makes good sense as an example. >> >> OK, I'll do that. Would that fix the problem which this patch tries to >> address for you? >> (I would also explicitly add a paragraph to discuss line endings) > > Please let me see the patch first, before I can have a comment. Sure! I'll have it ready tomorrow. > But back to the more general question: > > How should Git handle the line endings of UTF-16 files in the woring-tree, > which are UTF-8 in the index? > > > There are 2 opposite opionions/user expectations here: > > a) They are binary in the working tree, so git should leave the line endings > as is. (Unless specified otherwise in the .attributes file) Well, if you consider your UTF-16 files binary then you would not change *anything*. You would not enable the new "working-tree-encoding" attribute. As a consequence, Git's behavior would not change. Git would leave all line endings as they are for the UTF-16 files. > b) They are text files in the index. Git will convert line endings > if core.autocrlf is true (or the .gitattributes file specifies "-text") This would *only* happen if you enable the new "working-tree-encoding" attribute. In this case a user has already made the conscious decision to treat these files as text files. Therefore, the user expects Git to handle them in the same way other text files are handled. > My feeling is that both arguments are valid, so let's ask for opinions > and thoughts of others. > Erik, Junio, Johannes, Johannes, Jeff, Ramsay, everybody: > What do yo think ? - Lars
Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
[] > > That is a good one. > > If you ever plan a re-roll (I don't at the moment) the *.proj extemsion > > make much more sense in Documentation/gitattributes that *.tx > > There no text files encoded in UTF-16 wich are called xxx.txt, but those > > are non-ideal examples. *.proj makes good sense as an example. > > OK, I'll do that. Would that fix the problem which this patch tries to > address for you? > (I would also explicitly add a paragraph to discuss line endings) Please let me see the patch first, before I can have a comment. But back to the more general question: How should Git handle the line endings of UTF-16 files in the woring-tree, which are UTF-8 in the index? There are 2 opposite opionions/user expectations here: a) They are binary in the working tree, so git should leave the line endings as is. (Unless specified otherwise in the .attributes file) b) They are text files in the index. Git will convert line endings if core.autocrlf is true (or the .gitattributes file specifies "-text") My feeling is that both arguments are valid, so let's ask for opinions and thoughts of others. Erik, Junio, Johannes, Johannes, Jeff, Ramsay, everybody: What do yo think ?
Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
> On 30 Jan 2018, at 15:40, Torsten Bögershausenwrote: > > On Tue, Jan 30, 2018 at 12:23:47PM +0100, Lars Schneider wrote: >> >>> On 29 Jan 2018, at 21:19, tbo...@web.de wrote: >>> >>> From: Torsten Bögershausen >>> >>> UTF-16 encoded files are treated as "binary" by Git, and no CRLF >>> conversion is done. >>> When the UTF-16 encoded files are converted into UF-8 using the new >> s/UF-8/UTF-8/ >> >> >>> "working-tree-encoding", the CRLF are converted if core.autocrlf is true. >>> >>> This may lead to confusion: >>> A tool writes an UTF-16 encoded file with CRLF. >>> The file is commited with core.autocrlf=true, the CLRF are converted into >>> LF. >>> The repo is pushed somewhere and cloned by a different user, who has >>> decided to use core.autocrlf=false. >>> He uses the same tool, and now the CRLF are not there as expected, but LF, >>> make the file useless for the tool. >>> >>> Avoid this (possible) confusion by ignoring core.autocrlf for all files >>> which have "working-tree-encoding" defined. >> >> Maybe I don't understand your use case but I think this will generate even >> more confusion because that's not what I would expect as a user. I think Git >> should behave consistently independent of the used encoding. Here are my >> arguments: > > To start with: I have probably seen too many repos with CRLF messed up. > >> >> (1) Legacy users are *not* affected. If you don't use the >> "working-tree-encoding" >> attribute then nothing changes for you. > > People who don't use "working-tree-encoding" are not affected, > I never ment to state that. > > I am thinking about people who use "working-tree-encoding" without thinking > about line endings. > Or the ones that have in mind that core.autocrlf=true will leave the > line endings for UTF-16 encoded files as is, but that changes as soon as they > are converted into UTF-8 and the "auto" check is now done > -after- the conversion. I would find that confusing. > >> >> (2) If you use the "working-tree-encoding" attribute *and* you want to >> ensure >> your file keeps CRLF then you can define that in the attributes too. >> E.g.: >> >> *.proj textworking-tree-encoding=UTF-16 eol=crlf > > That is a good one. > If you ever plan a re-roll (I don't at the moment) the *.proj extemsion > make much more sense in Documentation/gitattributes that *.tx > There no text files encoded in UTF-16 wich are called xxx.txt, but those > are non-ideal examples. *.proj makes good sense as an example. OK, I'll do that. Would that fix the problem which this patch tries to address for you? (I would also explicitly add a paragraph to discuss line endings) - Lars
Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
On Tue, Jan 30, 2018 at 12:23:47PM +0100, Lars Schneider wrote: > > > On 29 Jan 2018, at 21:19, tbo...@web.de wrote: > > > > From: Torsten Bögershausen> > > > UTF-16 encoded files are treated as "binary" by Git, and no CRLF > > conversion is done. > > When the UTF-16 encoded files are converted into UF-8 using the new > s/UF-8/UTF-8/ > > > > "working-tree-encoding", the CRLF are converted if core.autocrlf is true. > > > > This may lead to confusion: > > A tool writes an UTF-16 encoded file with CRLF. > > The file is commited with core.autocrlf=true, the CLRF are converted into > > LF. > > The repo is pushed somewhere and cloned by a different user, who has > > decided to use core.autocrlf=false. > > He uses the same tool, and now the CRLF are not there as expected, but LF, > > make the file useless for the tool. > > > > Avoid this (possible) confusion by ignoring core.autocrlf for all files > > which have "working-tree-encoding" defined. > > Maybe I don't understand your use case but I think this will generate even > more confusion because that's not what I would expect as a user. I think Git > should behave consistently independent of the used encoding. Here are my > arguments: To start with: I have probably seen too many repos with CRLF messed up. > > (1) Legacy users are *not* affected. If you don't use the > "working-tree-encoding" > attribute then nothing changes for you. People who don't use "working-tree-encoding" are not affected, I never ment to state that. I am thinking about people who use "working-tree-encoding" without thinking about line endings. Or the ones that have in mind that core.autocrlf=true will leave the line endings for UTF-16 encoded files as is, but that changes as soon as they are converted into UTF-8 and the "auto" check is now done -after- the conversion. I would find that confusing. > > (2) If you use the "working-tree-encoding" attribute *and* you want to > ensure > your file keeps CRLF then you can define that in the attributes too. > E.g.: > > *.proj textworking-tree-encoding=UTF-16 eol=crlf That is a good one. If you ever plan a re-roll (I don't at the moment) the *.proj extemsion make much more sense in Documentation/gitattributes that *.tx There no text files encoded in UTF-16 wich are called xxx.txt, but those are non-ideal examples. *.proj makes good sense as an example. > > - Lars > > > > > The user can still use a .gitattributes file and specify the line endings > > like "text=auto", "text", or "text eol=crlf" and let that .gitattribute > > file travel together with push and clone. > > > > Change convert.c to e more careful, simplify the initialization when > > attributes are retrived (and none are specified) and update the > > documentation. > > > > Signed-off-by: Torsten Bögershausen > > --- > > Documentation/gitattributes.txt | 9 ++--- > > convert.c | 15 --- > > 2 files changed, 18 insertions(+), 6 deletions(-) > > > > diff --git a/Documentation/gitattributes.txt > > b/Documentation/gitattributes.txt > > index a8dbf4be3..3665c4677 100644 > > --- a/Documentation/gitattributes.txt > > +++ b/Documentation/gitattributes.txt > > @@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you > > cannot store a file in > > UTF-8 encoding and if you want Git to be able to process the content as > > text. > > > > +Note that when `working-tree-encoding` is defined, core.autocrlf is > > ignored. > > +Set the `text` attribute (or `text=auto`) to enable CRLF conversions. > > + > > Use the following attributes if your '*.txt' files are UTF-16 encoded > > -with byte order mark (BOM) and you want Git to perform automatic line > > -ending conversion based on your platform. > > +with byte order mark (BOM) and you want Git to perform line > > +ending conversion based on core.eol. > > > > > > -*.txt text working-tree-encoding=UTF-16 > > +*.txt working-tree-encoding=UTF-16 text > > > > > > Use the following attributes if your '*.txt' files are UTF-16 little > > diff --git a/convert.c b/convert.c > > index 13fad490c..e7f11d1db 100644 > > --- a/convert.c > > +++ b/convert.c > > @@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, > > const char *path) > > } > > ca->checkout_encoding = git_path_check_encoding(ccheck + 5); > > } else { > > - ca->drv = NULL; > > - ca->crlf_action = CRLF_UNDEFINED; > > - ca->ident = 0; > > + memset(ca, 0, sizeof(*ca)); > > } > > > > /* Save attr and make a decision for action */ > > ca->attr_action = ca->crlf_action; > > if (ca->crlf_action == CRLF_TEXT) > > ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : > > CRLF_TEXT_INPUT; > > + /* > > +* Often UTF-16 encoded files are read and written by
Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
> On 29 Jan 2018, at 21:19, tbo...@web.de wrote: > > From: Torsten Bögershausen> > UTF-16 encoded files are treated as "binary" by Git, and no CRLF > conversion is done. > When the UTF-16 encoded files are converted into UF-8 using the new s/UF-8/UTF-8/ > "working-tree-encoding", the CRLF are converted if core.autocrlf is true. > > This may lead to confusion: > A tool writes an UTF-16 encoded file with CRLF. > The file is commited with core.autocrlf=true, the CLRF are converted into LF. > The repo is pushed somewhere and cloned by a different user, who has > decided to use core.autocrlf=false. > He uses the same tool, and now the CRLF are not there as expected, but LF, > make the file useless for the tool. > > Avoid this (possible) confusion by ignoring core.autocrlf for all files > which have "working-tree-encoding" defined. Maybe I don't understand your use case but I think this will generate even more confusion because that's not what I would expect as a user. I think Git should behave consistently independent of the used encoding. Here are my arguments: (1) Legacy users are *not* affected. If you don't use the "working-tree-encoding" attribute then nothing changes for you. (2) If you use the "working-tree-encoding" attribute *and* you want to ensure your file keeps CRLF then you can define that in the attributes too. E.g.: *.proj textworking-tree-encoding=UTF-16 eol=crlf - Lars > The user can still use a .gitattributes file and specify the line endings > like "text=auto", "text", or "text eol=crlf" and let that .gitattribute > file travel together with push and clone. > > Change convert.c to e more careful, simplify the initialization when > attributes are retrived (and none are specified) and update the documentation. > > Signed-off-by: Torsten Bögershausen > --- > Documentation/gitattributes.txt | 9 ++--- > convert.c | 15 --- > 2 files changed, 18 insertions(+), 6 deletions(-) > > diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt > index a8dbf4be3..3665c4677 100644 > --- a/Documentation/gitattributes.txt > +++ b/Documentation/gitattributes.txt > @@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you > cannot store a file in > UTF-8 encoding and if you want Git to be able to process the content as > text. > > +Note that when `working-tree-encoding` is defined, core.autocrlf is ignored. > +Set the `text` attribute (or `text=auto`) to enable CRLF conversions. > + > Use the following attributes if your '*.txt' files are UTF-16 encoded > -with byte order mark (BOM) and you want Git to perform automatic line > -ending conversion based on your platform. > +with byte order mark (BOM) and you want Git to perform line > +ending conversion based on core.eol. > > > -*.txttext working-tree-encoding=UTF-16 > +*.txtworking-tree-encoding=UTF-16 text > > > Use the following attributes if your '*.txt' files are UTF-16 little > diff --git a/convert.c b/convert.c > index 13fad490c..e7f11d1db 100644 > --- a/convert.c > +++ b/convert.c > @@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, > const char *path) > } > ca->checkout_encoding = git_path_check_encoding(ccheck + 5); > } else { > - ca->drv = NULL; > - ca->crlf_action = CRLF_UNDEFINED; > - ca->ident = 0; > + memset(ca, 0, sizeof(*ca)); > } > > /* Save attr and make a decision for action */ > ca->attr_action = ca->crlf_action; > if (ca->crlf_action == CRLF_TEXT) > ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : > CRLF_TEXT_INPUT; > + /* > + * Often UTF-16 encoded files are read and written by programs which > + * really need CRLF, and it is important to keep the CRLF "as is" when > + * files are committed with core.autocrlf=true and the repo is pushed. > + * The CRLF would be converted into LF when the repo is cloned to > + * a machine with core.autocrlf=false. > + * Obey the "text" and "eol" attributes and be independent on the > + * local core.autocrlf for all "encoded" files. > + */ > + if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding) > + ca->crlf_action = CRLF_BINARY; > if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE) > ca->crlf_action = CRLF_BINARY; > if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE) > -- > 2.16.0.rc0.2.g64d3e4d0cc.dirty >
[PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding
From: Torsten BögershausenUTF-16 encoded files are treated as "binary" by Git, and no CRLF conversion is done. When the UTF-16 encoded files are converted into UF-8 using the new "working-tree-encoding", the CRLF are converted if core.autocrlf is true. This may lead to confusion: A tool writes an UTF-16 encoded file with CRLF. The file is commited with core.autocrlf=true, the CLRF are converted into LF. The repo is pushed somewhere and cloned by a different user, who has decided to use core.autocrlf=false. He uses the same tool, and now the CRLF are not there as expected, but LF, make the file useless for the tool. Avoid this (possible) confusion by ignoring core.autocrlf for all files which have "working-tree-encoding" defined. The user can still use a .gitattributes file and specify the line endings like "text=auto", "text", or "text eol=crlf" and let that .gitattribute file travel together with push and clone. Change convert.c to e more careful, simplify the initialization when attributes are retrived (and none are specified) and update the documentation. Signed-off-by: Torsten Bögershausen --- Documentation/gitattributes.txt | 9 ++--- convert.c | 15 --- 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt index a8dbf4be3..3665c4677 100644 --- a/Documentation/gitattributes.txt +++ b/Documentation/gitattributes.txt @@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you cannot store a file in UTF-8 encoding and if you want Git to be able to process the content as text. +Note that when `working-tree-encoding` is defined, core.autocrlf is ignored. +Set the `text` attribute (or `text=auto`) to enable CRLF conversions. + Use the following attributes if your '*.txt' files are UTF-16 encoded -with byte order mark (BOM) and you want Git to perform automatic line -ending conversion based on your platform. +with byte order mark (BOM) and you want Git to perform line +ending conversion based on core.eol. -*.txt text working-tree-encoding=UTF-16 +*.txt working-tree-encoding=UTF-16 text Use the following attributes if your '*.txt' files are UTF-16 little diff --git a/convert.c b/convert.c index 13fad490c..e7f11d1db 100644 --- a/convert.c +++ b/convert.c @@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path) } ca->checkout_encoding = git_path_check_encoding(ccheck + 5); } else { - ca->drv = NULL; - ca->crlf_action = CRLF_UNDEFINED; - ca->ident = 0; + memset(ca, 0, sizeof(*ca)); } /* Save attr and make a decision for action */ ca->attr_action = ca->crlf_action; if (ca->crlf_action == CRLF_TEXT) ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : CRLF_TEXT_INPUT; + /* +* Often UTF-16 encoded files are read and written by programs which +* really need CRLF, and it is important to keep the CRLF "as is" when +* files are committed with core.autocrlf=true and the repo is pushed. +* The CRLF would be converted into LF when the repo is cloned to +* a machine with core.autocrlf=false. +* Obey the "text" and "eol" attributes and be independent on the +* local core.autocrlf for all "encoded" files. +*/ + if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding) + ca->crlf_action = CRLF_BINARY; if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE) ca->crlf_action = CRLF_BINARY; if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE) -- 2.16.0.rc0.2.g64d3e4d0cc.dirty