Re: [PATCH v3 0/7] convert: add support for different encodings

2018-01-08 Thread Torsten Bögershausen
On Mon, Jan 08, 2018 at 03:38:48PM +0100, Lars Schneider wrote:
[]
> > Some comments:
> > 
> > I would like to have the CRLF conversion a little bit more strict -
> > many users tend to set core.autocrlf=true or write "* text=auto"
> > in the .gitattributes.
> > Reading all the effort about BOM markers and UTF-16LE, I think there
> > should ne some effort to make the line endings round trip.
> > Therefore I changed convert.c to demand that the "text" attribute
> > is set to enable CRLF conversions.
> > (If I had submitted the patch, I would have demanded
> > "text eol=lf" or "text eol=crlf", but the test case t0028 indicates
> > that there is a demand to produce line endings as configured in core.eol)
> 
> But wouldn't that be inconvenient for the users? E.g. if I add a UTF-16
> file on Windows with CRLF then it would be nice if Git would automatically
> convert the line endings to LF on Linux, no?
> 
> IOW: Why should we handle text files that have a defined checkout-encoding
> differently compared to UTF-8 encoded text files? Wouldn't that be unexpected
> to the user?
> 
> Thanks,
> Lars

The problem is, if user A has core.autocrlf=true and user B
core.autocrlf=false.
(The line endings don't show up as expected at user B)

Having said that in all shortness, you convinced me:
If text=auto, we care about line endings.
If text,  we care about line endings.

If the .gitattributes don't say anything about text,
we don't convert eol.
(In other words: we don't look at core.autocrlf, when checkout-encoding
is defined)

A new branch is push to github/tboegi




Re: [PATCH v3 0/7] convert: add support for different encodings

2018-01-08 Thread Lars Schneider

> On 07 Jan 2018, at 10:38, Torsten Bögershausen  wrote:
> 
> On Sat, Jan 06, 2018 at 01:48:01AM +0100, lars.schnei...@autodesk.com wrote:
>> From: Lars Schneider 
>> 
>> Hi,
>> 
>> Patches 1-5 and 6 are helper functions and preparation.
>> Patch 6 is the actual change.
>> 
>> I am still torn between "checkout-encoding" and "working-tree-encoding"
>> as attribute name. I am happy to hear arguments for/against one or the
>> other.
> 
> checkout-encoding is probably misleading, as it is even the checkin-encoding.

Yeah, I start to think the same.


> What is wrong with working-tree-encoding ?
> I think the 2 "-".
> 
> What was wrong with workingtree-encoding ?

Yeah, the two dashes are a minor annoyance.

However, consider this:

$ git grep 'working tree' -- '*.txt' | wc -l
 570

$ git grep 'working-tree' -- '*.txt' | wc -l
   6

$ git grep 'workingtree' -- '*.txt' | wc -l
   0


$ git grep 'working tree' -- po | wc -l
 704

$ git grep 'working-tree' -- po | wc -l
   0

$ git grep 'workingtree' -- po | wc -l
   0

I think "working tree" is a pretty established term that
endusers might be able to understand. Therefore, I would
like to go with "working-tree-encoding" as it was written
that way at least 6 times in the Git tree before.

Would that work for you?


> Or
> workdir-encoding ?

Although I like the shortness, the term "workdir" might already 
be occupied [1]. Could that cause confusion?

[1] 4f01748d51 (contrib/workdir: add a simple script to create a working 
directory, 2007-03-27)


>> 
>> * Removed unnecessary NUL assignment in xstrdup_tolower() (Torsten)
>> 
>> * Set "git config core.eol lf" to made the test run on Windows (Dscho)
>> 
>> * Made BOM arrays static (Ramsay)
> 
> 
> Some comments:
> 
> I would like to have the CRLF conversion a little bit more strict -
> many users tend to set core.autocrlf=true or write "* text=auto"
> in the .gitattributes.
> Reading all the effort about BOM markers and UTF-16LE, I think there
> should ne some effort to make the line endings round trip.
> Therefore I changed convert.c to demand that the "text" attribute
> is set to enable CRLF conversions.
> (If I had submitted the patch, I would have demanded
> "text eol=lf" or "text eol=crlf", but the test case t0028 indicates
> that there is a demand to produce line endings as configured in core.eol)

But wouldn't that be inconvenient for the users? E.g. if I add a UTF-16
file on Windows with CRLF then it would be nice if Git would automatically
convert the line endings to LF on Linux, no?

IOW: Why should we handle text files that have a defined checkout-encoding
differently compared to UTF-8 encoded text files? Wouldn't that be unexpected
to the user?

Thanks,
Lars



> 
> Anyway, I rebased it onto git.git/master, changed the docu, and pushed it to
> https://github.com/tboegi/git/tree/180107-0935-For-lars-schneider-encode-V3B
> 
> Here is a inter-diff against your version:
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 1bc03e69c..b8d9f91c8 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -281,7 +281,7 @@ interpreted as binary and consequently built-in Git text 
> processing
>  tools (e.g. 'git diff') as well as most Git web front ends do not
>  visualize the content.
> 
> -In these cases you can teach Git the encoding of a file in the working
> +In these cases you can tell Git the encoding of a file in the working

Oops. I meant to change that already. Thanks!

>  directory with the `checkout-encoding` attribute. If a file with this
>  attributes is added to Git, then Git reencodes the content from the
>  specified encoding to UTF-8 and stores the result in its internal data
> @@ -308,17 +308,20 @@ Use the `checkout-encoding` attribute only if you 
> cannot store a file in
>  UTF-8 encoding and if you want Git to be able to process the content as
>  text.
> 
> +Note that when `checkout-encoding` is defined, by default the line
> +endings are not converted. `text=auto` and core.autocrlf are ignored.
> +Set the `text` attribute to enable CRLF conversions.
> +
>  Use the following attributes if your '*.txt' files are UTF-16 encoded
> -with byte order mark (BOM) and you want Git to perform automatic line
> -ending conversion based on your platform.
> +with byte order mark (BOM).
> 
>  
> -*.txttext checkout-encoding=UTF-16
> +*.txtcheckout-encoding=UTF-16
>  
> 
>  Use the following attributes if your '*.txt' files are UTF-16 little
> -endian encoded without BOM and you want Git to use Windows line endings
> -in the working directory.
> +endian encoded without BOM and you want Git to use LF in the repo and
> +CRLF in the working directory.
> 
>  
>  *.txtcheckout-encoding=UTF-16LE text eol=CRLF
> diff --git a/convert.c b/convert.c
> index 

Re: [PATCH v3 0/7] convert: add support for different encodings

2018-01-07 Thread Torsten Bögershausen
On Sat, Jan 06, 2018 at 01:48:01AM +0100, lars.schnei...@autodesk.com wrote:
> From: Lars Schneider 
> 
> Hi,
> 
> Patches 1-5 and 6 are helper functions and preparation.
> Patch 6 is the actual change.
> 
> I am still torn between "checkout-encoding" and "working-tree-encoding"
> as attribute name. I am happy to hear arguments for/against one or the
> other.

checkout-encoding is probably misleading, as it is even the checkin-encoding.

What is wrong with working-tree-encoding ?
I think the 2 "-".

What was wrong with workingtree-encoding ?
Or
workdir-encoding ?



> 
> Changes since v2:
> 
> * Added Torsten's crlfsave refactoring patch (patch 5)
>   @Torsten: I tried to make the commit message more clean, added
> some comments to and renamed conv_flags_eol to
> global_conv_flags_eol.
> 
> * Improved documentation and commit message (Torsten)

Good, thanks.
> 
> * Removed unnecessary NUL assignment in xstrdup_tolower() (Torsten)
> 
> * Set "git config core.eol lf" to made the test run on Windows (Dscho)
> 
> * Made BOM arrays static (Ramsay)


Some comments:

I would like to have the CRLF conversion a little bit more strict -
many users tend to set core.autocrlf=true or write "* text=auto"
in the .gitattributes.
Reading all the effort about BOM markers and UTF-16LE, I think there
should ne some effort to make the line endings round trip.
Therefore I changed convert.c to demand that the "text" attribute
is set to enable CRLF conversions.
(If I had submitted the patch, I would have demanded
"text eol=lf" or "text eol=crlf", but the test case t0028 indicates
that there is a demand to produce line endings as configured in core.eol)

Anyway, I rebased it onto git.git/master, changed the docu, and pushed it to
https://github.com/tboegi/git/tree/180107-0935-For-lars-schneider-encode-V3B

Here is a inter-diff against your version:

 diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
 index 1bc03e69c..b8d9f91c8 100644
 --- a/Documentation/gitattributes.txt
 +++ b/Documentation/gitattributes.txt
 @@ -281,7 +281,7 @@ interpreted as binary and consequently built-in Git text 
processing
  tools (e.g. 'git diff') as well as most Git web front ends do not
  visualize the content.
  
 -In these cases you can teach Git the encoding of a file in the working
 +In these cases you can tell Git the encoding of a file in the working
  directory with the `checkout-encoding` attribute. If a file with this
  attributes is added to Git, then Git reencodes the content from the
  specified encoding to UTF-8 and stores the result in its internal data
 @@ -308,17 +308,20 @@ Use the `checkout-encoding` attribute only if you cannot 
store a file in
  UTF-8 encoding and if you want Git to be able to process the content as
  text.
  
 +Note that when `checkout-encoding` is defined, by default the line
 +endings are not converted. `text=auto` and core.autocrlf are ignored.
 +Set the `text` attribute to enable CRLF conversions.
 +
  Use the following attributes if your '*.txt' files are UTF-16 encoded
 -with byte order mark (BOM) and you want Git to perform automatic line
 -ending conversion based on your platform.
 +with byte order mark (BOM).
  
  
 -*.txt text checkout-encoding=UTF-16
 +*.txt checkout-encoding=UTF-16
  
  
  Use the following attributes if your '*.txt' files are UTF-16 little
 -endian encoded without BOM and you want Git to use Windows line endings
 -in the working directory.
 +endian encoded without BOM and you want Git to use LF in the repo and
 +CRLF in the working directory.
  
  
  *.txt checkout-encoding=UTF-16LE text eol=CRLF
 diff --git a/convert.c b/convert.c
 index 13f766d2a..1e29f515e 100644
 --- a/convert.c
 +++ b/convert.c
 @@ -221,18 +221,27 @@ static void check_global_conv_flags_eol(const char 
*path, enum crlf_action crlf_
}
  }
  
  
  static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
 @@ -432,7 +441,7 @@ static int crlf_to_git(const struct index_state *istate,
 * cherry-pick.
 */
if ((!(conv_flags & CONV_EOL_RENORMALIZE)) &&
 -  has_cr_in_index(istate, path))
 +  has_crlf_in_index(istate, path))
convert_crlf_into_lf = 0;
}
if (((conv_flags & CONV_EOL_RNDTRP_WARN) ||
 @@ -1214,9 +1223,28 @@ static void convert_attrs(struct conv_attrs *ca, const 
char *path)
ca->crlf_action = git_path_check_crlf(ccheck + 0);
ca->ident = git_path_check_ident(ccheck + 1);
ca->drv = git_path_check_convert(ccheck + 2);
 +  ca->checkout_encoding = git_path_check_encoding(ccheck + 5);
if (ca->crlf_action != CRLF_BINARY) {
enum eol eol_attr = git_path_check_eol(ccheck + 3);
 -  

[PATCH v3 0/7] convert: add support for different encodings

2018-01-05 Thread lars . schneider
From: Lars Schneider 

Hi,

Patches 1-5 and 6 are helper functions and preparation.
Patch 6 is the actual change.

I am still torn between "checkout-encoding" and "working-tree-encoding"
as attribute name. I am happy to hear arguments for/against one or the
other.

Changes since v2:

* Added Torsten's crlfsave refactoring patch (patch 5)
  @Torsten: I tried to make the commit message more clean, added
some comments to and renamed conv_flags_eol to
global_conv_flags_eol.

* Improved documentation and commit message (Torsten)

* Removed unnecessary NUL assignment in xstrdup_tolower() (Torsten)

* Set "git config core.eol lf" to made the test run on Windows (Dscho)

* Made BOM arrays static (Ramsay)


Thanks,
Lars


RFC: 
https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/
 v1: 
https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/
 v2: 
https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/



Base Ref: master
Web-Diff: https://github.com/larsxschneider/git/commit/f21a1841a4
Checkout: git fetch https://github.com/larsxschneider/git encoding-v3 && git 
checkout f21a1841a4


### Interdiff (v2..v3):

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 0039bd38c3..1bc03e69cb 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -285,11 +285,18 @@ In these cases you can teach Git the encoding of a file 
in the working
 directory with the `checkout-encoding` attribute. If a file with this
 attributes is added to Git, then Git reencodes the content from the
 specified encoding to UTF-8 and stores the result in its internal data
-structure. On checkout the content is encoded back to the specified
-encoding.
+structure (called "the index"). On checkout the content is encoded
+back to the specified encoding.

-Please note that using the `checkout-encoding` attribute has a number
-of drawbacks:
+Please note that using the `checkout-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `checkout-encoding` attribute
+  will checkout the respective files UTF-8 encoded and not in the
+  expected encoding. Consequently, these files will appear different
+  which typically causes trouble. This is in particular the case for
+  older Git versions and alternative Git implementations such as JGit
+  or libgit2 (as of January 2018).

 - Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
   errors as the conversion might not be round trip safe.
@@ -297,12 +304,6 @@ of drawbacks:
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').

-- Git clients that do not support the `checkout-encoding` attribute or
-  the used encoding will checkout the respective files as UTF-8 encoded.
-  That means the content appears to be different which could cause
-  trouble. Affected clients are older Git versions and alternative Git
-  implementations such as JGit or libgit2 (as of January 2018).
-
 Use the `checkout-encoding` attribute only if you cannot store a file in
 UTF-8 encoding and if you want Git to be able to process the content as
 text.
diff --git a/apply.c b/apply.c
index c4bd5cf1f2..f8b67bfee2 100644
--- a/apply.c
+++ b/apply.c
@@ -2263,8 +2263,8 @@ static void show_stats(struct apply_state *state, struct 
patch *patch)
 static int read_old_data(struct stat *st, struct patch *patch,
 const char *path, struct strbuf *buf)
 {
-   enum safe_crlf safe_crlf = patch->crlf_in_old ?
-   SAFE_CRLF_KEEP_CRLF : SAFE_CRLF_RENORMALIZE;
+   int conv_flags = patch->crlf_in_old ?
+   CONV_EOL_KEEP_CRLF : CONV_EOL_RENORMALIZE;
switch (st->st_mode & S_IFMT) {
case S_IFLNK:
if (strbuf_readlink(buf, path, st->st_size) < 0)
@@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch 
*patch,
 * should never look at the index when explicit crlf option
 * is given.
 */
-   convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 
0);
+   convert_to_git(NULL, path, buf->buf, buf->len, buf, conv_flags);
return 0;
default:
return -1;
diff --git a/blame.c b/blame.c
index 388b66897b..2893f3c103 100644
--- a/blame.c
+++ b/blame.c
@@ -229,7 +229,7 @@ static struct commit *fake_working_tree_commit(struct 
diff_options *opt,
if (strbuf_read(, 0, 0) < 0)
die_errno("failed to read from stdin");
}
-   convert_to_git(_index, path, buf.buf, buf.len, , 0, 0);
+   convert_to_git(_index, path, buf.buf, buf.len, , 0);
origin->file.ptr = buf.buf;
origin->file.size = buf.len;
pretend_sha1_file(buf.buf, buf.len, OBJ_BLOB, origin->blob_oid.hash);
diff --git