Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

2018-02-07 Thread Junio C Hamano
Torsten Bögershausen  writes:

>> the user explicitly tells us it is in UTF-16, right?  Is there such a
>> thing as UTF-16 binary?
>
> I don't think so, by definiton UTF-16 is ment to be text.
> (this means that git ls-files --eol needs some update, I can have a look)
>
> Do we agree that UTF-16 is text ?
> If yes, could Git assume that the "text" attribute is set when
> working-tree-encoding is set ?

These are two different questions.  It seems that between the two of
us, we established that "UTF-16 binary" is a nonsense, and a path
that is given working-tree-encoding=UTF-16 must be treated as
holding a text file.  But that does not have direct relevance to the
other question you are asking: "is a question 'does this path have
text attribute set?' be answered with 'yes' if the path has wte
attribute set to UTF-16?"  I think the answer to that latter
question ought to be "no".

By the way, a related tangent is if it makes sense to give
working-tree-encoding to anything that is binary, regardless of the
value---I am inclined to say it is not, so the issue here is not "by
definition UTF-16 is text", but "any path that has wte set to some
enconding should be treated the same way as if the path also has
text attribute set as far as convert machinery is concerned.".



Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

2018-02-06 Thread Torsten Bögershausen
On Fri, Feb 02, 2018 at 11:17:04AM -0800, Junio C Hamano wrote:
> Torsten Bögershausen  writes:
> 
> > There are 2 opposite opionions/user expectations here:
> >
> > a) They are binary in the working tree, so git should leave the line endings
> >as is. (Unless specified otherwise in the .attributes file)
> > ...
> > b) They are text files in the index. Git will convert line endings
> >if core.autocrlf is true (or the .gitattributes file specifies "-text")
> 
> I sense that you seem to be focusing on the distinction between "in
> the working tree" vs "in the index" while contrasting.  The "binary
> vs text" in your "binary in wt, text in index" is based on the
> default heuristics without any input from end-users or the project
> that uses Git that happens to contain such files.  If the users and
> the project that uses Git want to treat contents in a path as text,
> it is text even when it is (re-)encoded to UTF-16, no?
> 
> Such files may be (mis)classified as binary with the default
> heuristics when there is no help from what is written in the
> .gitattributes file, but here we are talking about the case where
> the user explicitly tells us it is in UTF-16, right?  Is there such a
> thing as UTF-16 binary?

I don't think so, by definiton UTF-16 is ment to be text.
(this means that git ls-files --eol needs some update, I can have a look)

Do we agree that UTF-16 is text ?
If yes, could Git assume that the "text" attribute is set when
working-tree-encoding is set ?

I would even go a step further and demand that the user -must- make a decision
about the line endings for working-tree-encoded files:
working-tree-encoding=UTF-16 # illegal, die()
working-tree-encoding=UTF-16 text=auto   # illegal, die()
working-tree-encoding=UTF-16 -text   # no eol conversion
working-tree-encoding=UTF-16 text# eol according to core.eol
working-tree-encoding=UTF-16 text eol=lf # LF
working-tree-encoding=UTF-16 text eol=crlf   # CRLF

What do you think ?





Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

2018-02-02 Thread Junio C Hamano
Torsten Bögershausen  writes:

> There are 2 opposite opionions/user expectations here:
>
> a) They are binary in the working tree, so git should leave the line endings
>as is. (Unless specified otherwise in the .attributes file)
> ...
> b) They are text files in the index. Git will convert line endings
>if core.autocrlf is true (or the .gitattributes file specifies "-text")

I sense that you seem to be focusing on the distinction between "in
the working tree" vs "in the index" while contrasting.  The "binary
vs text" in your "binary in wt, text in index" is based on the
default heuristics without any input from end-users or the project
that uses Git that happens to contain such files.  If the users and
the project that uses Git want to treat contents in a path as text,
it is text even when it is (re-)encoded to UTF-16, no?

Such files may be (mis)classified as binary with the default
heuristics when there is no help from what is written in the
.gitattributes file, but here we are talking about the case where
the user explicitly tells us it is in UTF-16, right?  Is there such a
thing as UTF-16 binary?


Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

2018-01-31 Thread Lars Schneider

> On 31 Jan 2018, at 18:28, Torsten Bögershausen  wrote:
> 
> []
>>> That is a good one.
>>> If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
>>> make much more sense in Documentation/gitattributes that *.tx
>>> There no text files encoded in UTF-16 wich are called xxx.txt, but those
>>> are non-ideal examples. *.proj makes good sense as an example.
>> 
>> OK, I'll do that. Would that fix the problem which this patch tries to 
>> address for you?
>> (I would also explicitly add a paragraph to discuss line endings)
> 
> Please let me see the patch first, before I can have a comment.

Sure! I'll have it ready tomorrow.


> But back to the more general question:
> 
> How should Git handle the line endings of UTF-16 files in the woring-tree,
> which are UTF-8 in the index?
> 
> 
> There are 2 opposite opionions/user expectations here:
> 
> a) They are binary in the working tree, so git should leave the line endings
>   as is. (Unless specified otherwise in the .attributes file)

Well, if you consider your UTF-16 files binary then you would not change
*anything*. You would not enable the new "working-tree-encoding" attribute.
As a consequence, Git's behavior would not change. Git would leave all line 
endings as they are for the UTF-16 files.


> b) They are text files in the index. Git will convert line endings
>   if core.autocrlf is true (or the .gitattributes file specifies "-text")

This would *only* happen if you enable the new "working-tree-encoding"
attribute. In this case a user has already made the conscious decision to
treat these files as text files. Therefore, the user expects Git to handle
them in the same way other text files are handled.


> My feeling is that both arguments are valid, so let's ask for opinions
> and thoughts of others.
> Erik, Junio, Johannes, Johannes, Jeff, Ramsay, everybody:
> What do yo think ?

- Lars

Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

2018-01-31 Thread Torsten Bögershausen
[]
> > That is a good one.
> > If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
> > make much more sense in Documentation/gitattributes that *.tx
> > There no text files encoded in UTF-16 wich are called xxx.txt, but those
> > are non-ideal examples. *.proj makes good sense as an example.
> 
> OK, I'll do that. Would that fix the problem which this patch tries to 
> address for you?
> (I would also explicitly add a paragraph to discuss line endings)

Please let me see the patch first, before I can have a comment.

But back to the more general question:

How should Git handle the line endings of UTF-16 files in the woring-tree,
which are UTF-8 in the index?


There are 2 opposite opionions/user expectations here:

a) They are binary in the working tree, so git should leave the line endings
   as is. (Unless specified otherwise in the .attributes file)
b) They are text files in the index. Git will convert line endings
   if core.autocrlf is true (or the .gitattributes file specifies "-text")

My feeling is that both arguments are valid, so let's ask for opinions
and thoughts of others.
Erik, Junio, Johannes, Johannes, Jeff, Ramsay, everybody:
What do yo think ?


Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

2018-01-30 Thread Lars Schneider

> On 30 Jan 2018, at 15:40, Torsten Bögershausen  wrote:
> 
> On Tue, Jan 30, 2018 at 12:23:47PM +0100, Lars Schneider wrote:
>> 
>>> On 29 Jan 2018, at 21:19, tbo...@web.de wrote:
>>> 
>>> From: Torsten Bögershausen 
>>> 
>>> UTF-16 encoded files are treated as "binary" by Git, and no CRLF
>>> conversion is done.
>>> When the UTF-16 encoded files are converted into UF-8 using the new
>> s/UF-8/UTF-8/
>> 
>> 
>>> "working-tree-encoding", the CRLF are converted if core.autocrlf is true.
>>> 
>>> This may lead to confusion:
>>> A tool writes an UTF-16 encoded file with CRLF.
>>> The file is commited with core.autocrlf=true, the CLRF are converted into 
>>> LF.
>>> The repo is pushed somewhere and cloned by a different user, who has
>>> decided to use core.autocrlf=false.
>>> He uses the same tool, and now the CRLF are not there as expected, but LF,
>>> make the file useless for the tool.
>>> 
>>> Avoid this (possible) confusion by ignoring core.autocrlf for all files
>>> which have "working-tree-encoding" defined.
>> 
>> Maybe I don't understand your use case but I think this will generate even 
>> more confusion because that's not what I would expect as a user. I think Git 
>> should behave consistently independent of the used encoding. Here are my 
>> arguments:
> 
> To start with: I have probably seen too many repos with CRLF messed up.
> 
>> 
>>  (1) Legacy users are *not* affected. If you don't use the 
>> "working-tree-encoding"
>>  attribute then nothing changes for you.
> 
> People who don't use "working-tree-encoding" are not affected,
> I never ment to state that.
> 
> I am thinking about people who use "working-tree-encoding" without thinking
> about line endings.
> Or the ones that have in mind that core.autocrlf=true will leave the
> line endings for UTF-16 encoded files as is, but that changes as soon as they
> are converted into UTF-8 and the "auto" check is now done
> -after- the conversion. I would find that confusing.
> 
>> 
>>  (2) If you use the "working-tree-encoding" attribute *and* you want to 
>> ensure 
>>  your file keeps CRLF then you can define that in the attributes too. 
>> E.g.:
>> 
>>  *.proj textworking-tree-encoding=UTF-16 eol=crlf
> 
> That is a good one.
> If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
> make much more sense in Documentation/gitattributes that *.tx
> There no text files encoded in UTF-16 wich are called xxx.txt, but those
> are non-ideal examples. *.proj makes good sense as an example.

OK, I'll do that. Would that fix the problem which this patch tries to address 
for you?
(I would also explicitly add a paragraph to discuss line endings)

- Lars

Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

2018-01-30 Thread Torsten Bögershausen
On Tue, Jan 30, 2018 at 12:23:47PM +0100, Lars Schneider wrote:
> 
> > On 29 Jan 2018, at 21:19, tbo...@web.de wrote:
> > 
> > From: Torsten Bögershausen 
> > 
> > UTF-16 encoded files are treated as "binary" by Git, and no CRLF
> > conversion is done.
> > When the UTF-16 encoded files are converted into UF-8 using the new
> s/UF-8/UTF-8/
> 
> 
> > "working-tree-encoding", the CRLF are converted if core.autocrlf is true.
> > 
> > This may lead to confusion:
> > A tool writes an UTF-16 encoded file with CRLF.
> > The file is commited with core.autocrlf=true, the CLRF are converted into 
> > LF.
> > The repo is pushed somewhere and cloned by a different user, who has
> > decided to use core.autocrlf=false.
> > He uses the same tool, and now the CRLF are not there as expected, but LF,
> > make the file useless for the tool.
> > 
> > Avoid this (possible) confusion by ignoring core.autocrlf for all files
> > which have "working-tree-encoding" defined.
> 
> Maybe I don't understand your use case but I think this will generate even 
> more confusion because that's not what I would expect as a user. I think Git 
> should behave consistently independent of the used encoding. Here are my 
> arguments:

To start with: I have probably seen too many repos with CRLF messed up.

> 
>   (1) Legacy users are *not* affected. If you don't use the 
> "working-tree-encoding"
>   attribute then nothing changes for you.

People who don't use "working-tree-encoding" are not affected,
I never ment to state that.

I am thinking about people who use "working-tree-encoding" without thinking
about line endings.
Or the ones that have in mind that core.autocrlf=true will leave the
line endings for UTF-16 encoded files as is, but that changes as soon as they
are converted into UTF-8 and the "auto" check is now done
-after- the conversion. I would find that confusing.

> 
>   (2) If you use the "working-tree-encoding" attribute *and* you want to 
> ensure 
>   your file keeps CRLF then you can define that in the attributes too. 
> E.g.:
>   
>   *.proj textworking-tree-encoding=UTF-16 eol=crlf

That is a good one.
If you ever plan a re-roll (I don't at the moment) the *.proj extemsion
make much more sense in Documentation/gitattributes that *.tx
There no text files encoded in UTF-16 wich are called xxx.txt, but those
are non-ideal examples. *.proj makes good sense as an example.


> 
> - Lars
> 
> 
> 
> > The user can still use a .gitattributes file and specify the line endings
> > like "text=auto", "text", or "text eol=crlf" and let that .gitattribute
> > file travel together with push and clone.
> > 
> > Change convert.c to e more careful, simplify the initialization when
> > attributes are retrived (and none are specified) and update the 
> > documentation.
> > 
> > Signed-off-by: Torsten Bögershausen 
> > ---
> > Documentation/gitattributes.txt |  9 ++---
> > convert.c   | 15 ---
> > 2 files changed, 18 insertions(+), 6 deletions(-)
> > 
> > diff --git a/Documentation/gitattributes.txt 
> > b/Documentation/gitattributes.txt
> > index a8dbf4be3..3665c4677 100644
> > --- a/Documentation/gitattributes.txt
> > +++ b/Documentation/gitattributes.txt
> > @@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you 
> > cannot store a file in
> > UTF-8 encoding and if you want Git to be able to process the content as
> > text.
> > 
> > +Note that when `working-tree-encoding` is defined, core.autocrlf is 
> > ignored.
> > +Set the `text` attribute (or `text=auto`) to enable CRLF conversions.
> > +
> > Use the following attributes if your '*.txt' files are UTF-16 encoded
> > -with byte order mark (BOM) and you want Git to perform automatic line
> > -ending conversion based on your platform.
> > +with byte order mark (BOM) and you want Git to perform line
> > +ending conversion based on core.eol.
> > 
> > 
> > -*.txt  text working-tree-encoding=UTF-16
> > +*.txt  working-tree-encoding=UTF-16 text
> > 
> > 
> > Use the following attributes if your '*.txt' files are UTF-16 little
> > diff --git a/convert.c b/convert.c
> > index 13fad490c..e7f11d1db 100644
> > --- a/convert.c
> > +++ b/convert.c
> > @@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, 
> > const char *path)
> > }
> > ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
> > } else {
> > -   ca->drv = NULL;
> > -   ca->crlf_action = CRLF_UNDEFINED;
> > -   ca->ident = 0;
> > +   memset(ca, 0, sizeof(*ca));
> > }
> > 
> > /* Save attr and make a decision for action */
> > ca->attr_action = ca->crlf_action;
> > if (ca->crlf_action == CRLF_TEXT)
> > ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : 
> > CRLF_TEXT_INPUT;
> > +   /*
> > +* Often UTF-16 encoded files are read and written by 

Re: [PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

2018-01-30 Thread Lars Schneider

> On 29 Jan 2018, at 21:19, tbo...@web.de wrote:
> 
> From: Torsten Bögershausen 
> 
> UTF-16 encoded files are treated as "binary" by Git, and no CRLF
> conversion is done.
> When the UTF-16 encoded files are converted into UF-8 using the new
s/UF-8/UTF-8/


> "working-tree-encoding", the CRLF are converted if core.autocrlf is true.
> 
> This may lead to confusion:
> A tool writes an UTF-16 encoded file with CRLF.
> The file is commited with core.autocrlf=true, the CLRF are converted into LF.
> The repo is pushed somewhere and cloned by a different user, who has
> decided to use core.autocrlf=false.
> He uses the same tool, and now the CRLF are not there as expected, but LF,
> make the file useless for the tool.
> 
> Avoid this (possible) confusion by ignoring core.autocrlf for all files
> which have "working-tree-encoding" defined.

Maybe I don't understand your use case but I think this will generate even 
more confusion because that's not what I would expect as a user. I think Git 
should behave consistently independent of the used encoding. Here are my 
arguments:

  (1) Legacy users are *not* affected. If you don't use the 
"working-tree-encoding"
  attribute then nothing changes for you.

  (2) If you use the "working-tree-encoding" attribute *and* you want to ensure 
  your file keeps CRLF then you can define that in the attributes too. E.g.:
  
  *.proj textworking-tree-encoding=UTF-16 eol=crlf

- Lars



> The user can still use a .gitattributes file and specify the line endings
> like "text=auto", "text", or "text eol=crlf" and let that .gitattribute
> file travel together with push and clone.
> 
> Change convert.c to e more careful, simplify the initialization when
> attributes are retrived (and none are specified) and update the documentation.
> 
> Signed-off-by: Torsten Bögershausen 
> ---
> Documentation/gitattributes.txt |  9 ++---
> convert.c   | 15 ---
> 2 files changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index a8dbf4be3..3665c4677 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you 
> cannot store a file in
> UTF-8 encoding and if you want Git to be able to process the content as
> text.
> 
> +Note that when `working-tree-encoding` is defined, core.autocrlf is ignored.
> +Set the `text` attribute (or `text=auto`) to enable CRLF conversions.
> +
> Use the following attributes if your '*.txt' files are UTF-16 encoded
> -with byte order mark (BOM) and you want Git to perform automatic line
> -ending conversion based on your platform.
> +with byte order mark (BOM) and you want Git to perform line
> +ending conversion based on core.eol.
> 
> 
> -*.txttext working-tree-encoding=UTF-16
> +*.txtworking-tree-encoding=UTF-16 text
> 
> 
> Use the following attributes if your '*.txt' files are UTF-16 little
> diff --git a/convert.c b/convert.c
> index 13fad490c..e7f11d1db 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, 
> const char *path)
>   }
>   ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
>   } else {
> - ca->drv = NULL;
> - ca->crlf_action = CRLF_UNDEFINED;
> - ca->ident = 0;
> + memset(ca, 0, sizeof(*ca));
>   }
> 
>   /* Save attr and make a decision for action */
>   ca->attr_action = ca->crlf_action;
>   if (ca->crlf_action == CRLF_TEXT)
>   ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : 
> CRLF_TEXT_INPUT;
> + /*
> +  * Often UTF-16 encoded files are read and written by programs which
> +  * really need CRLF, and it is important to keep the CRLF "as is" when
> +  * files are committed with core.autocrlf=true and the repo is pushed.
> +  * The CRLF would be converted into LF when the repo is cloned to
> +  * a machine with core.autocrlf=false.
> +  * Obey the "text" and "eol" attributes and be independent on the
> +  * local core.autocrlf for all "encoded" files.
> +  */
> + if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding)
> + ca->crlf_action = CRLF_BINARY;
>   if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE)
>   ca->crlf_action = CRLF_BINARY;
>   if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE)
> -- 
> 2.16.0.rc0.2.g64d3e4d0cc.dirty
> 



[PATCH/RFC v5 7/7] Careful with CRLF when using e.g. UTF-16 for working-tree-encoding

2018-01-29 Thread tboegi
From: Torsten Bögershausen 

UTF-16 encoded files are treated as "binary" by Git, and no CRLF
conversion is done.
When the UTF-16 encoded files are converted into UF-8 using the new
"working-tree-encoding", the CRLF are converted if core.autocrlf is true.

This may lead to confusion:
A tool writes an UTF-16 encoded file with CRLF.
The file is commited with core.autocrlf=true, the CLRF are converted into LF.
The repo is pushed somewhere and cloned by a different user, who has
decided to use core.autocrlf=false.
He uses the same tool, and now the CRLF are not there as expected, but LF,
make the file useless for the tool.

Avoid this (possible) confusion by ignoring core.autocrlf for all files
which have "working-tree-encoding" defined.

The user can still use a .gitattributes file and specify the line endings
like "text=auto", "text", or "text eol=crlf" and let that .gitattribute
file travel together with push and clone.

Change convert.c to e more careful, simplify the initialization when
attributes are retrived (and none are specified) and update the documentation.

Signed-off-by: Torsten Bögershausen 
---
 Documentation/gitattributes.txt |  9 ++---
 convert.c   | 15 ---
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a8dbf4be3..3665c4677 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -308,12 +308,15 @@ Use the `working-tree-encoding` attribute only if you 
cannot store a file in
 UTF-8 encoding and if you want Git to be able to process the content as
 text.
 
+Note that when `working-tree-encoding` is defined, core.autocrlf is ignored.
+Set the `text` attribute (or `text=auto`) to enable CRLF conversions.
+
 Use the following attributes if your '*.txt' files are UTF-16 encoded
-with byte order mark (BOM) and you want Git to perform automatic line
-ending conversion based on your platform.
+with byte order mark (BOM) and you want Git to perform line
+ending conversion based on core.eol.
 
 
-*.txt  text working-tree-encoding=UTF-16
+*.txt  working-tree-encoding=UTF-16 text
 
 
 Use the following attributes if your '*.txt' files are UTF-16 little
diff --git a/convert.c b/convert.c
index 13fad490c..e7f11d1db 100644
--- a/convert.c
+++ b/convert.c
@@ -1264,15 +1264,24 @@ static void convert_attrs(struct conv_attrs *ca, const 
char *path)
}
ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
} else {
-   ca->drv = NULL;
-   ca->crlf_action = CRLF_UNDEFINED;
-   ca->ident = 0;
+   memset(ca, 0, sizeof(*ca));
}
 
/* Save attr and make a decision for action */
ca->attr_action = ca->crlf_action;
if (ca->crlf_action == CRLF_TEXT)
ca->crlf_action = text_eol_is_crlf() ? CRLF_TEXT_CRLF : 
CRLF_TEXT_INPUT;
+   /*
+* Often UTF-16 encoded files are read and written by programs which
+* really need CRLF, and it is important to keep the CRLF "as is" when
+* files are committed with core.autocrlf=true and the repo is pushed.
+* The CRLF would be converted into LF when the repo is cloned to
+* a machine with core.autocrlf=false.
+* Obey the "text" and "eol" attributes and be independent on the
+* local core.autocrlf for all "encoded" files.
+*/
+   if ((ca->crlf_action == CRLF_UNDEFINED) && ca->checkout_encoding)
+   ca->crlf_action = CRLF_BINARY;
if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_FALSE)
ca->crlf_action = CRLF_BINARY;
if (ca->crlf_action == CRLF_UNDEFINED && auto_crlf == AUTO_CRLF_TRUE)
-- 
2.16.0.rc0.2.g64d3e4d0cc.dirty