Re: [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute

2018-02-14 Thread Lars Schneider

> On 10 Feb 2018, at 10:48, Torsten Bögershausen  wrote:
> 
> On Fri, Feb 09, 2018 at 02:28:28PM +0100, lars.schnei...@autodesk.com wrote:
>> From: Lars Schneider 
>> 
>> ...
>> 
>> +Please note that using the `working-tree-encoding` attribute may have a
>> +number of pitfalls:
>> +
>> +- Git clients that do not support the `working-tree-encoding` attribute
> 
> A client to Git ?
> Or may be "third party Git implementations"

OK, I'll go with "Third party Git implementations".


>> 
>> +As an example, use the following attributes if your '*.proj' files are
>> +UTF-16 encoded with byte order mark (BOM) and you want Git to perform
>> +automatic line ending conversion based on your platform.
>> +
>> +
>> +*.proj  text working-tree-encoding=UTF-16
>> +
>> +
>> +Use the following attributes if your '*.proj' files are UTF-16 little
>> +endian encoded without BOM and you want Git to use Windows line endings
>> +in the working directory. Please note, it is highly recommended to
>> +explicitly define the line endings with `eol` if the `working-tree-encoding`
>> +attribute is used to avoid ambiguity.
>> +
>> +
>> +*.proj  working-tree-encoding=UTF-16LE text eol=CRLF
>> +
>> +
>> +You can get a list of all available encodings on your platform with the
>> +following command:
> 
> One question:
> +*.proj   text working-tree-encoding=UTF-16
> vs
> *.projworking-tree-encoding=UTF-16LE text eol=CRLF
> 
> Technically the order of attributes doesn't matter, but that is not what we
> want to demonstrate here and now.
> I would probably move the "text" attribute to the end of the line.
> So that readers don't start to wonder if the order is important.

I agree in general. However, I would move "text" to the beginning to be
consistent with the gitattribute pattern above. OK?


>> 
>> +if (has_prohibited_utf_bom(enc->name, src, src_len)) {
>> +const char *error_msg = _(
>> +"BOM is prohibited for '%s' if encoded as %s");
>> +const char *advise_msg = _(
>> +"You told Git to treat '%s' as %s. A byte order mark "
>> +"(BOM) is prohibited with this encoding. Either use "
>> +"%.6s as working tree encoding or remove the BOM from 
>> the "
>> +"file.");
> 
> "You told Git" is probly right from Gits point of view, and advises are 
> really helpfull.
> But what should the user do about it ?
> Could we give a better advise ?
> 
> 
> "A byte order mark (BOM) is prohibited with %s.
> Please remove the BOM from the file %s 
> or use "%s as working-tree-encoding"
> 
> I would probably suspect that a tool wrote the BOM, and that is
> good and can or should not be changed by a user.
> 
> So a simply message like this could be the preferred (and only)
> solution for a user:
> "A byte order mark (BOM) is prohibited with %s.
> Please use "%s as working-tree-encoding"

OK. I like the last one!


> (And why %.6s and not simply %s ?)

The encodings is UTF-16LE, UTF-16BE, UTF-32LE, or UTF-32BE.
I just use the first 6 characters to print the encoding that
allows BOMs (UTF-16 or UTF-32). I'll add a comment to explain 
the trickery in the code!

Thanks,
Lars

Re: [PATCH v6 5/7] convert: add 'working-tree-encoding' attribute

2018-02-10 Thread Torsten Bögershausen
On Fri, Feb 09, 2018 at 02:28:28PM +0100, lars.schnei...@autodesk.com wrote:
> From: Lars Schneider 
> 
> Git recognizes files encoded with ASCII or one of its supersets (e.g.
> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> interpreted as binary and consequently built-in Git text processing
> tools (e.g. 'git diff') as well as most Git web front ends do not
> visualize the content.
> 
> Add an attribute to tell Git what encoding the user has defined for a
> given file. If the content is added to the index, then Git converts the
> content to a canonical UTF-8 representation. On checkout Git will
> reverse the conversion.
> 
> Signed-off-by: Lars Schneider 
> ---
>  Documentation/gitattributes.txt  |  66 
>  convert.c| 157 -
>  convert.h|   1 +
>  sha1_file.c  |   2 +-
>  t/t0028-working-tree-encoding.sh | 210 
> +++
>  5 files changed, 434 insertions(+), 2 deletions(-)
>  create mode 100755 t/t0028-working-tree-encoding.sh
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 30687de81a..4ecdcd4859 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -272,6 +272,72 @@ few exceptions.  Even though...
>catch potential problems early, safety triggers.
>  
>  
> +`working-tree-encoding`
> +^^^
> +
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.
> +
> +In these cases you can tell Git the encoding of a file in the working
> +directory with the `working-tree-encoding` attribute. If a file with this
> +attribute is added to Git, then Git reencodes the content from the
> +specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
> +content in its internal data structure (called "the index"). On checkout
> +the content is reencoded back to the specified encoding.
> +
> +Please note that using the `working-tree-encoding` attribute may have a
> +number of pitfalls:
> +
> +- Git clients that do not support the `working-tree-encoding` attribute

A client to Git ?
Or may be "third party Git implementations"

> +  will checkout the respective files UTF-8 encoded and not in the
> +  expected encoding. Consequently, these files will appear different
> +  which typically causes trouble. This is in particular the case for
> +  older Git versions and alternative Git implementations such as JGit
> +  or libgit2 (as of February 2018).
> +
> +- Reencoding content requires resources that might slow down certain
> +  Git operations (e.g 'git checkout' or 'git add').
> +
> +Use the `working-tree-encoding` attribute only if you cannot store a file
> +in UTF-8 encoding and if you want Git to be able to process the content
> +as text.
> +
> +As an example, use the following attributes if your '*.proj' files are
> +UTF-16 encoded with byte order mark (BOM) and you want Git to perform
> +automatic line ending conversion based on your platform.
> +
> +
> +*.proj   text working-tree-encoding=UTF-16
> +
> +
> +Use the following attributes if your '*.proj' files are UTF-16 little
> +endian encoded without BOM and you want Git to use Windows line endings
> +in the working directory. Please note, it is highly recommended to
> +explicitly define the line endings with `eol` if the `working-tree-encoding`
> +attribute is used to avoid ambiguity.
> +
> +
> +*.proj   working-tree-encoding=UTF-16LE text eol=CRLF
> +
> +
> +You can get a list of all available encodings on your platform with the
> +following command:

One question:
 +*.projtext working-tree-encoding=UTF-16
vs
*.proj  working-tree-encoding=UTF-16LE text eol=CRLF

Technically the order of attributes doesn't matter, but that is not what we
want to demonstrate here and now.
I would probably move the "text" attribute to the end of the line.
So that readers don't start to wonder if the order is important.

> +
> +
> +iconv --list
> +
> +
> +If you do not know the encoding of a file, then you can use the `file`
> +command to guess the encoding:
> +
> +
> +file foo.proj
> +
> +
> +
>  `ident`
>  ^^^
>  
> diff --git a/convert.c b/convert.c
> index b976eb968c..dc9e2db6b5 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -7,6 +7,7 @@
>  #include "sigchain.h"
>  #include "pkt-line.h"
>  #include "sub-process.h"
> +#include "utf8.h"
>  
>  /*
>   * convert.c - convert a file 

[PATCH v6 5/7] convert: add 'working-tree-encoding' attribute

2018-02-09 Thread lars . schneider
From: Lars Schneider 

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider 
---
 Documentation/gitattributes.txt  |  66 
 convert.c| 157 -
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 210 +++
 5 files changed, 434 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..4ecdcd4859 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,72 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded with ASCII or one of its supersets (e.g.
+UTF-8 or ISO-8859-1) as text files.  All other encodings are usually
+interpreted as binary and consequently built-in Git text processing
+tools (e.g. 'git diff') as well as most Git web front ends do not
+visualize the content.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Git clients that do not support the `working-tree-encoding` attribute
+  will checkout the respective files UTF-8 encoded and not in the
+  expected encoding. Consequently, these files will appear different
+  which typically causes trouble. This is in particular the case for
+  older Git versions and alternative Git implementations such as JGit
+  or libgit2 (as of February 2018).
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.proj' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+
+*.proj text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.proj' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+
+*.proj working-tree-encoding=UTF-16LE text eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+
+file foo.proj
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..dc9e2db6b5 100644
--- a/convert.c
+++ b/convert.c
@@ -7,6 +7,7 @@
 #include "sigchain.h"
 #include "pkt-line.h"
 #include "sub-process.h"
+#include "utf8.h"
 
 /*
  * convert.c - convert a file when checking it out and checking it in.
@@ -265,6 +266,110 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 
 }
 
+static struct encoding {
+   const char *name;
+   struct encoding *next;
+} *encoding, **encoding_tail;
+static const char *default_encoding = "UTF-8";
+
+static int encode_to_git(const char *path, const char *src, size_t src_len,
+struct strbuf *buf, struct encoding *enc, int 
conv_flags)
+{
+   char *dst;
+   int dst_len;
+
+   /*
+* No encoding is specified or there is nothing to encode.
+* Tell the caller that the content was not modified.
+*/
+   if (!enc || (src && !src_len))
+   return 0;
+
+   /*
+*