Re: Import/Export as a fast way to purge files from Git?

2018-10-31 Thread Lars Schneider



> On Sep 24, 2018, at 7:24 PM, Elijah Newren  wrote:
> 
> On Sun, Sep 23, 2018 at 6:08 AM Lars Schneider  
> wrote:
>> 
>> Hi,
>> 
>> I recently had to purge files from large Git repos (many files, many 
>> commits).
>> The usual recommendation is to use `git filter-branch --index-filter` to 
>> purge
>> files. However, this is *very* slow for large repos (e.g. it takes 45min to
>> remove the `builtin` directory from git core). I realized that I can remove
>> files *way* faster by exporting the repo, removing the file references,
>> and then importing the repo (see Perl script below, it takes ~30sec to remove
>> the `builtin` directory from git core). Do you see any problem with this
>> approach?
> 
> It looks like others have pointed you at other tools, and you're
> already shifting to that route.  But I think it's a useful question to
> answer more generally, so for those that are really curious...
> 
> 
> The basic approach is fine, though if you try to extend it much you
> can run into a few possible edge/corner cases (more on that below).
> I've been using this basic approach for years and even created a
> mini-python library[1] designed specifically to allow people to create
> "fast-filters", used as
>   git fast-export  | your-fast-filter | git fast-import 
> 
> But that library didn't really take off; even I have rarely used it,
> often opting for filter-branch despite its horrible performance or a
> simple fast-export | long-sed-command | fast-import (with some extra
> pre-checking to make sure the sed wouldn't unintentionally munge other
> data).  BFG is great, as long as you're only interested in removing a
> few big items, but otherwise doesn't seem very useful (to be fair,
> it's very upfront about only wanting to solve that problem).
> Recently, due to continuing questions on filter-branch and folks still
> getting confused with it, I looked at existing tools, decided I didn't
> think any quite fit, and started looking into converting
> git_fast_filter into a filter-branch-like tool instead of just a
> libary.  Found some bugs and missing features in fast-export along the
> way (and have some patches I still need to send in).  But I kind of
> got stuck -- if the tool is in python, will that limit adoption too
> much?  It'd be kind of nice to have this tool in core git.  But I kind
> of like leaving open the possibility of using it as a tool _or_ as a
> library, the latter for the special cases where case-specific
> programmatic filtering is needed.  But a developer-convenience library
> makes almost no sense unless in a higher level language, such as
> python.  I'm still trying to make up my mind about what I want (and
> what others might want), and have been kind of blocking on that.  (If
> others have opinions, I'm all ears.)

That library sounds like a very interesting idea. Unfortunately, the 
referenced repo seems not to be available anymore:
git://gitorious.org/git_fast_filter/mainline.git

I very much like Python. However, more recently I started to
write Git tools in Perl as they work out of the box on every
machine with Git installed ... and I think Perl can be quite
readable if no shortcuts are used :-). 


> Anyway, the edge/corner cases you can watch out for:
> 
>  - Signed tags are a problem; you may need to specify
> --signed-tags=strip to fast-export
> 
>  - References to other commits in your commit messages will now be
> incorrect.  I think a good tool should either default to rewriting
> commit ids in commit messages or at least have an option to do so
> (BFG does this; filter-branch doesn't; fast-export format makes it
> really hard for a filter based on it to do so)
> 
>  - If the paths you remove are the only paths modified in a commit,
> the commit can become empty.  If you're only filtering a few paths
> out, this might be nothing more than a minor inconvenience for you.
> However, if you're trying to prune directories (and perhaps several
> toplevel ones), then it can be extremely annoying to have a new
> history with the vast majority of all commits being empty.
> (filter-branch has an option for this; BFG does not; tools based on
> fast-export output can do it with sufficient effort).
> 
>  - If you start pruning empty commits, you have to worry about
> rewriting branches and tags to remaining parents.  This _might_ happen
> for free depending on your history's structure and the fast-export
> stream, but to be correct in general you will have to specify the new
> commit for some branches or tags.
> 
>  - If you start pruning empty commits, you have to decide whether to
> allow pruning of merge commits.  Your first reaction might be to not
> allow it, but if one parent a

Re: Import/Export as a fast way to purge files from Git?

2018-09-23 Thread Lars Schneider



> On Sep 23, 2018, at 4:55 PM, Eric Sunshine  wrote:
> 
> On Sun, Sep 23, 2018 at 9:05 AM Lars Schneider  
> wrote:
>> I recently had to purge files from large Git repos (many files, many 
>> commits).
>> The usual recommendation is to use `git filter-branch --index-filter` to 
>> purge
>> files. However, this is *very* slow for large repos (e.g. it takes 45min to
>> remove the `builtin` directory from git core). I realized that I can remove
>> files *way* faster by exporting the repo, removing the file references,
>> and then importing the repo (see Perl script below, it takes ~30sec to remove
>> the `builtin` directory from git core). Do you see any problem with this
>> approach?
> 
> A couple comments:
> 
> For purging files from a history, take a look at BFG[1] which bills
> itself as "a simpler, faster alternative to git-filter-branch for
> cleansing bad data out of your Git repository history".

Yes, BFG is great. Unfortunately, it requires Java which is not available
on every system I have to work with. I required a solution that would work
in every Git environment. Hence the Perl script :-)


> The approach of exporting to a fast-import stream, modifying the
> stream, and re-importing is quite reasonable.

Thanks for the confirmation!


> However, rather than
> re-inventing, take a look at reposurgeon[2], which allows you to do
> major surgery on fast-import streams. Not only can it purge files from
> a repository, but it can slice, dice, puree, and saute pretty much any
> attribute of a repository.

Wow. Reposurgeon looks very interesting. Thanks a lot for the pointer!

Cheers,
Lars


> [1]: https://rtyley.github.io/bfg-repo-cleaner/
> [2]: http://www.catb.org/esr/reposurgeon/



Import/Export as a fast way to purge files from Git?

2018-09-23 Thread Lars Schneider
Hi,

I recently had to purge files from large Git repos (many files, many commits). 
The usual recommendation is to use `git filter-branch --index-filter` to purge 
files. However, this is *very* slow for large repos (e.g. it takes 45min to
remove the `builtin` directory from git core). I realized that I can remove
files *way* faster by exporting the repo, removing the file references, 
and then importing the repo (see Perl script below, it takes ~30sec to remove
the `builtin` directory from git core). Do you see any problem with this 
approach?

Thank you,
Lars



#!/usr/bin/perl
#
# Purge paths from Git repositories.
#
# Usage:
# git-purge-path [path-regex1] [path-regex2] ...
#
# Examples:
#Remove the file "test.bin" from all directories:
#git-purge-path "/test.bin$"
#
#Remove all "*.bin" files from all directories:
#git-purge-path "\.bin$"
#
#Remove all files in the "/foo" directory:
#git-purge-path "^/foo/$"
#
# Attention:
# You want to run this script on a case sensitive file-system (e.g.
# ext4 on Linux). Otherwise the resulting Git repository will not
# contain changes that modify the casing of file paths.
#

use strict;
use warnings;

open( my $pipe_in, "git fast-export --progress=100 --no-data HEAD |" ) or die 
$!;
open( my $pipe_out, "| git fast-import --force --quiet" ) or die $!;

LOOP: while ( my $cmd = <$pipe_in> ) {
my $data = "";
if ( $cmd =~ /^data ([0-9]+)$/ ) {
# skip data blocks
my $skip_bytes = $1;
read($pipe_in, $data, $skip_bytes);
}
elsif ( $cmd =~ /^M [0-9]{6} [0-9a-f]{40} (.+)$/ ) {
my $pathname = $1;
foreach (@ARGV) {
next LOOP if ("/" . $pathname) =~ /$_/
}
}
print {$pipe_out} $cmd . $data;
}



Re: Find commit that referenced a blob first

2018-07-19 Thread Lars Schneider


> On Jul 19, 2018, at 11:19 PM, Stefan Beller  wrote:
> 
> On Thu, Jul 19, 2018 at 2:02 PM Lars Schneider  
> wrote:
>> 
>> Hi,
>> 
>> I have a blob hash and I would like to know what commit referenced
>> this blob first in a given Git repo.
> 
> git describe 
> 
> If the given object refers to a blob, it will be described as
> :,
> such that the blob can be found at  in the , which itself
> describes the first commit in which this blob occurs in a reverse
> revision walk from HEAD.
> 
> Since
> 644eb60bd01 (builtin/describe.c: describe a blob, 2017-11-15)
> (included first in 2.16.0
> 
> You're welcome,
> Stefan

Awesome! Thank you very much :-)

- Lars

Find commit that referenced a blob first

2018-07-19 Thread Lars Schneider
Hi,

I have a blob hash and I would like to know what commit referenced 
this blob first in a given Git repo.

I could iterate through all commits sorted by date (or generation 
number) and then recursively search in the referenced trees until 
I find my blob. I wonder, is this the most efficient way to solve 
my problem? Is there some trick/existing Git command that would 
perform this search more efficiently for large repos?

Thank you,
Lars


Re: [PATCH v1 2/2] convert: add alias support for 'working-tree-encoding' attributes

2018-07-08 Thread Lars Schneider



> On Jul 8, 2018, at 8:30 PM, larsxschnei...@gmail.com wrote:
> 
> From: Lars Schneider 
> 
> In 107642fe26 ("convert: add 'working-tree-encoding' attribute",
> 2018-04-15) we added an attribute which defines the working tree
> encoding of a file.
> 
> Some platforms might spell the name of a certain encoding differently or
> some users might want to use different encodings on different platforms.
> Add the Git config "encoding..insteadOf = " to
> support these use-cases with a user specific mapping. If the alias
> matches an existing encoding name, then the alias will take precedence.
> The alias is case insensitive.
> 
> Example:
> 
>   (in .gitattributes)
>   *.c working-tree-encoding=foo
> 
>   (in config)
>   [encoding "UTF-16"]
>   insteadOf = foo
> 
> Signed-off-by: Lars Schneider 
> ---
> Documentation/gitattributes.txt  | 19 
> convert.c| 50 
> t/t0028-working-tree-encoding.sh | 28 ++
> 3 files changed, 97 insertions(+)
> 
> diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
> index 92010b062e..3628f0e5cf 100644
> --- a/Documentation/gitattributes.txt
> +++ b/Documentation/gitattributes.txt
> @@ -366,6 +366,25 @@ command to guess the encoding:
> file foo.ps1
> 
> 
> ...
> 
>   return 0;
> @@ -1225,6 +1264,17 @@ static const char *git_path_check_encoding(struct 
> attr_check_item *check)
>   die(_("true/false are no valid working-tree-encodings"));
>   }
> 
> + /* Check if an alias was defined for the encoding in the Git config */
> + if (encoding_aliases_initialized) {
> + struct alias2enc hashkey;
> + struct alias2enc *entry;
> + hashmap_entry_init(, strihash(value));
> + hashkey.alias = value;
> + entry = hashmap_get(_map, , NULL);
> + if (entry)
> + value = entry->encoding;

Here I reuse the char* pointer from the hashmap.
The hashmap is static and no entry is ever removed.
Is this OK or should I rather create a copy of the string?

Thanks,
Lars



Re: [PATCH v1 1/2] convert: refactor conversion driver config parsing

2018-07-08 Thread Lars Schneider



> On Jul 8, 2018, at 8:30 PM, larsxschnei...@gmail.com wrote:
> 
> From: Lars Schneider 
> 
> Refactor conversion driver config parsing to ease the parsing of new
> configs in a subsequent patch.
> 
> No functional change intended.
> 
> Signed-off-by: Lars Schneider 
> ---
> convert.c | 64 +++
> 1 file changed, 32 insertions(+), 32 deletions(-)
> 
> diff --git a/convert.c b/convert.c
> index 64d0d30e08..949bc783e4 100644
> --- a/convert.c
> +++ b/convert.c
> @@ -1003,43 +1003,43 @@ static int read_convert_config(const char *var, const 
> char *value, void *cb)
>   int namelen;
>   struct convert_driver *drv;
> 
> ...
> 
> - /*
> -  * filter..smudge and filter..clean specifies
> -  * the command line:
> -  *
> -  *  command-line
> -  *
> -  * The command-line will not be interpolated in any way.
> -  */
> + /*
> +  * filter..smudge and filter..clean specifies
> +  * the command line:
> +  *
> +  *  command-line
> +  *
> +  * The command-line will not be interpolated in any way.
> +  */

I stumbled over this comment introduced in aa4ed402c9 
("Add 'filter' attribute and external filter driver definition.", 2007-04-21).

Is the middle "command-line" intentional?

- Lars

Re: Use of new .gitattributes working-tree-encoding attribute across different platform types

2018-07-02 Thread Lars Schneider
> -----Lars Schneider  wrote: -
> To: Jeff King 
> From: Lars Schneider 
> Date: 06/28/2018 18:21
> Cc: "brian m. carlson" , Steve Groeger 
> , git@vger.kernel.org
> Subject: Re: Use of new .gitattributes working-tree-encoding attribute across 
> different platform types
> 
> 
>> On Jun 28, 2018, at 4:34 PM, Jeff King  wrote:
>> 
>> On Thu, Jun 28, 2018 at 02:44:47AM +, brian m. carlson wrote:
>> 
>>> On Wed, Jun 27, 2018 at 07:54:52AM +, Steve Groeger wrote:
>>>> We have common code that is supposed to be usable across different 
>>>> platforms and hence different file encodings. With the full support of the 
>>>> working-tree-encoding in the latest version of git on all platforms, how 
>>>> do we have files converted to different encodings on different platforms?
>>>> I could not find anything that would allow us to say 'if platform = z/OS 
>>>> then encoding=EBCDIC else encoding=ASCII'.   Is there a way this can be 
>>>> done?
>>> 
>>> I don't believe there is such functionality.  Git doesn't have
>>> attributes that are conditional on the platform in that sort of way.
>>> You could use a smudge/clean filter and adjust the filter for the
>>> platform you're on, which might meet your needs.
>> 
>> We do have prior art in the line-ending code, though. There the
>> attributes say either that a file needs a specific line-ending type
>> (which is relatively rare), or that it should follow the system type,
>> which is then set separately in the config.
>> 
>> I have the impression that the working-tree-encoding stuff was made to
>> handle the first case, but not the second. It doesn't seem like an
>> outrageous thing to eventually add.
>> 
>> (Though I agree that clean/smudge filters would work, and can even
>> implement the existing working-tree-encoding feature, albeit less
>> efficiently and conveniently).
> 
> Thanks for the suggestion Peff! 
> How about this:
> 
> 1) We allow users to set the encoding "auto". Example:
> 
>   *.txt working-tree-encoding=auto
> 
> 2) We define a new variable `core.autoencoding`. By default the value is 
> UTF-8 (== no re-encoding) but user can set to any value in their Git config. 
> Example:
> 
>git config --global core.autoencoding UTF-16
> 
> All files marked with the value "auto" will use the encoding defined in
> `core.autoencoding`.
> 
> Would that work?
> 
> @steve: Would that fix your problem?


On Jul 2, 2018, at 2:13 PM, Steve Groeger  wrote:
> 
> I think this proposed solution may resolve my issue.

Thanks for the confirmation!

Brian had a good argument [1] for an even more flexible system
proposed by Peff:


1) We allow users to define custom encoding mappings in their Git config. 
Example:

git config --global core.encoding.myenc UTF-16


2) Users can reuse these mappings in ther .gitattributes files:

*.txt working-tree-encoding=myenc


Does this idea look good to everyone?

Thanks,
Lars


[1] 
https://public-inbox.org/git/20180701175657.gc7...@genre.crustytoothpaste.net/


Re: Use of new .gitattributes working-tree-encoding attribute across different platform types

2018-06-28 Thread Lars Schneider



> On Jun 28, 2018, at 4:34 PM, Jeff King  wrote:
> 
> On Thu, Jun 28, 2018 at 02:44:47AM +, brian m. carlson wrote:
> 
>> On Wed, Jun 27, 2018 at 07:54:52AM +, Steve Groeger wrote:
>>> We have common code that is supposed to be usable across different 
>>> platforms and hence different file encodings. With the full support of the 
>>> working-tree-encoding in the latest version of git on all platforms, how do 
>>> we have files converted to different encodings on different platforms?
>>> I could not find anything that would allow us to say 'if platform = z/OS 
>>> then encoding=EBCDIC else encoding=ASCII'.   Is there a way this can be 
>>> done?
>> 
>> I don't believe there is such functionality.  Git doesn't have
>> attributes that are conditional on the platform in that sort of way.
>> You could use a smudge/clean filter and adjust the filter for the
>> platform you're on, which might meet your needs.
> 
> We do have prior art in the line-ending code, though. There the
> attributes say either that a file needs a specific line-ending type
> (which is relatively rare), or that it should follow the system type,
> which is then set separately in the config.
> 
> I have the impression that the working-tree-encoding stuff was made to
> handle the first case, but not the second. It doesn't seem like an
> outrageous thing to eventually add.
> 
> (Though I agree that clean/smudge filters would work, and can even
> implement the existing working-tree-encoding feature, albeit less
> efficiently and conveniently).

Thanks for the suggestion Peff! 
How about this:

1) We allow users to set the encoding "auto". Example:

*.txt working-tree-encoding=auto

2) We define a new variable `core.autoencoding`. By default the value is 
UTF-8 (== no re-encoding) but user can set to any value in their Git config. 
Example:

git config --global core.autoencoding UTF-16

All files marked with the value "auto" will use the encoding defined in
`core.autoencoding`.

Would that work?

@steve: Would that fix your problem?

- Lars

Re: [RFC PATCH v1] http: add http.keepRejectedCredentials config

2018-06-07 Thread Lars Schneider


> On 04 Jun 2018, at 11:55, Jeff King  wrote:
> 
> On Mon, Jun 04, 2018 at 12:18:59PM -0400, Martin-Louis Bright wrote:
> 
>> Why must the credentials must be deleted after receiving the 401 (or
>> any) error? What's the rationale for this?
> 
> Because Git only tries a single credential per invocation. So if a
> helper provides one, it doesn't prompt. If you get a 401 and then the
> program aborts, invoking it again is just going to try the same
> credential over and over. Dropping the credential from the helper breaks
> out of that loop.
> 
> In fact, this patch probably should give the user some advice in that
> regard (either in the documentation, or as a warning when we skip the
> rejection). If you _do_ have a bogus credential and set the new option,
> you'd need to reject it manually (you can do it with "git credential
> reject", but it's probably easier to just unset the option temporarily
> and re-invoke the original command).

I like the advice idea very much!

How about this?

$ git fetch
hint: Git has stored invalid credentials.
hint: Reject them with 'git credential reject' or
hint: disable the Git config 'http.keepRejectedCredentials'.
remote: Invalid username or password.
fatal: Authentication failed for 'https://server.com/myrepo.git/'

I am not really sure about the grammar :-)

Thanks,
Lars


Re: [ANNOUNCE] Git v2.18.0-rc1

2018-06-04 Thread Lars Schneider


> On 04 Jun 2018, at 06:53, Junio C Hamano  wrote:
> 
> A release candidate Git v2.18.0-rc1 is now available for testing
> at the usual places.  It is comprised of 842 non-merge commits
> since v2.17.0, contributed by 65 people, 20 of which are new faces.
> 
> ...
> 
> * The new "checkout-encoding" attribute can ask Git to convert the
>   contents to the specified encoding when checking out to the working
>   tree (and the other way around when checking in).

Did you call the feature "checkout-encoding" here intentionally?
The attribute is called "working-tree-encoding" in the final and
merged round. Shouldn't we call it that way here too?

Thanks,
Lars


[RFC PATCH v1] http: add http.keepRejectedCredentials config

2018-06-04 Thread lars . schneider
From: Lars Schneider 

If a Git HTTP server responds with 401 or 407, then Git tells the
credential helper to reject and delete the credentials. In general
this is good.

However, in certain automation environments it is not desired to remove
credentials automatically. This is in particular the case if credentials
are only invalid temporarily (e.g. because of problems in the server's
authentication backend).

Therefore, add the config "http.keepRejectedCredentials" which tells
Git to keep invalid credentials if set to "true".

It was considered to disable the credential deletion in credential.c
directly. This approach was not chosen as it could be confusing to
other callers of credential_reject() if the function does not do what
its name says (e.g. in imap-send.c).

The Git-Credential-Manager-for-Windows already implements a similar
mechanism [1]. This solution aims to enable that feature for all
credential helper implementations.

[1] 
https://github.com/Microsoft/Git-Credential-Manager-for-Windows/blob/0c1af463b33b0a0142f36f99c49ca8f83e86ee43/Shared/Cli/Functions/Common.cs#L484-L504

Signed-off-by: Lars Schneider 
---

Notes:
Base Ref: master
Web-Diff: https://github.com/larsxschneider/git/commit/51993c2ff9
Checkout: git fetch https://github.com/larsxschneider/git keepcreds-v1 && 
git checkout 51993c2ff9

 Documentation/config.txt |  6 ++
 http.c   | 12 ++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ab641bf5a9..184aee8dbc 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1997,6 +1997,12 @@ http.emptyAuth::
a username in the URL, as libcurl normally requires a username for
authentication.

+http.keepRejectedCredentials::
+   Keep credentials in the credential helper that a Git server responded
+   to with 401 (unauthorized) or 407 (proxy authentication required).
+   This can be useful in automation environments where credentials might
+   become temporarily invalid. The default is `false`.
+
 http.delegation::
Control GSSAPI credential delegation. The delegation is disabled
by default in libcurl since version 7.21.7. Set parameter to tell
diff --git a/http.c b/http.c
index b4bfbceaeb..ff6932813f 100644
--- a/http.c
+++ b/http.c
@@ -138,6 +138,7 @@ static int ssl_cert_password_required;
 #ifdef LIBCURL_CAN_HANDLE_AUTH_ANY
 static unsigned long http_auth_methods = CURLAUTH_ANY;
 static int http_auth_methods_restricted;
+static int keep_rejected_credentials = 0;
 /* Modes for which empty_auth cannot actually help us. */
 static unsigned long empty_auth_useless =
CURLAUTH_BASIC
@@ -403,6 +404,11 @@ static int http_options(const char *var, const char 
*value, void *cb)
return 0;
}

+   if (!strcmp("http.keeprejectedcredentials", var)) {
+   keep_rejected_credentials = git_config_bool(var, value);
+   return 0;
+   }
+
/* Fall back on the default ones */
return git_default_config(var, value, cb);
 }
@@ -1471,7 +1477,8 @@ static int handle_curl_result(struct slot_results 
*results)
return HTTP_MISSING_TARGET;
else if (results->http_code == 401) {
if (http_auth.username && http_auth.password) {
-   credential_reject(_auth);
+   if (!keep_rejected_credentials)
+   credential_reject(_auth);
return HTTP_NOAUTH;
} else {
 #ifdef LIBCURL_CAN_HANDLE_AUTH_ANY
@@ -1485,7 +1492,8 @@ static int handle_curl_result(struct slot_results 
*results)
}
} else {
if (results->http_connectcode == 407)
-   credential_reject(_auth);
+   if (!keep_rejected_credentials)
+   credential_reject(_auth);
 #if LIBCURL_VERSION_NUM >= 0x070c00
if (!curl_errorstr[0])
strlcpy(curl_errorstr,

base-commit: c2c7d17b030646b40e6764ba34a5ebf66aee77af
--
2.17.1



Re: worktrees vs. alternates

2018-05-16 Thread Lars Schneider

> On 16 May 2018, at 11:29, Ævar Arnfjörð Bjarmason <ava...@gmail.com> wrote:
> 
> 
> On Wed, May 16 2018, Lars Schneider wrote:
> 
>> I am looking into different options to cache Git repositories on build
>> machines. The two most promising ways seem to be git-worktree [1] and
>> git-alternates [2].
>> 
>> I wonder if you see an advantage of one over the other?
>> 
>> My impression is that git-worktree supersedes git-alternates. Would
>> that be a fair statement? If yes, would it makes sense to deprecate
>> alternates for simplification?
>> 
>> [1] https://git-scm.com/docs/git-worktree
>> [2] 
>> https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates
> 
> It's not correct that worktrees supersede alternates, or the other way
> around, they're orthagonal features.
> 
> git-worktree allows you to create a new working directory connected to
> the same local object store.
> 
> Alternates allow you to declare in any given local object store, that
> your set of objects isn't complete, and you can find the rest at some
> other location, those object stores may or may not have more than one
> worktree connected to them.

OK. I just wonder in what situation I would work with an incomplete
object store. The only use case I could imagine is that two repos share
a common set of objects (most likely blobs). However, in that situation
I would keep the two independent lines of development in a single repo
with two root commits.

Would it be fair to say that "git alternates" are a good mechanism to 
cache objects across different repos? However, I would consider a cache 
hit  between different repos unlikely. In that line of thinking
"git worktree" would be a good (maybe better?) mechanism to cache objects
for a single repo?

Thanks,
Lars

worktrees vs. alternates

2018-05-16 Thread Lars Schneider
Hi,

I am looking into different options to cache Git repositories on build
machines. The two most promising ways seem to be git-worktree [1] and
git-alternates [2].

I wonder if you see an advantage of one over the other? 

My impression is that git-worktree supersedes git-alternates. Would
that be a fair statement? If yes, would it makes sense to deprecate
alternates for simplification?

Thanks,
Lars


[1] https://git-scm.com/docs/git-worktree
[2] 
https://git-scm.com/docs/gitrepository-layout#gitrepository-layout-objectsinfoalternates

Re: Optimizing writes to unchanged files during merges?

2018-04-17 Thread Lars Schneider

> On 16 Apr 2018, at 19:45, Jacob Keller <jacob.kel...@gmail.com> wrote:
> 
> On Mon, Apr 16, 2018 at 10:43 AM, Jacob Keller <jacob.kel...@gmail.com> wrote:
>> On Mon, Apr 16, 2018 at 9:07 AM, Lars Schneider
>> <larsxschnei...@gmail.com> wrote:
>>> What if Git kept a LRU list that contains file path, content hash, and
>>> mtime of any file that is removed or modified during a checkout. If a
>>> file is checked out later with the exact same path and content hash,
>>> then Git could set the mtime to the previous value. This way the
>>> compiler would not think that the content has been changed since the
>>> last rebuild.
>> 
>> That would only work until they actuall *did* a build on the second
>> branch, and upon changing back, how would this detect that it needs to
>> update mtime again? I don't think this solution really works.
>> Ultimately, the problem is that the build tool relies on the mtime to
>> determine what to rebuild. I think this would cause worse problems
>> because we *wouldn't* rebuild in the case. How is git supposed to know
>> that we rebuilt when switching branches or not?
>> 
>> Thanks,
>> Jake
> 
> I think a better solution for your problem would be to extend the
> build system you're using to avoid rebuilding when the contents
> haven't changed since last build (possibly by using hashes?). At the
> very least, I would not want this to be default, as it could possibly
> result in *no* build when there should be one, which is far more
> confusing to debug.

I am 100% with you that this is a build system issue. But changing
the build system for many teams in a large organization is really
hard. That's why I wondered if Git could help with a shortcut.
Looks like there is no shortcut (see my other reply in this thread).

Thanks
Lars

Re: Optimizing writes to unchanged files during merges?

2018-04-17 Thread Lars Schneider

> On 16 Apr 2018, at 19:04, Ævar Arnfjörð Bjarmason <ava...@gmail.com> wrote:
> 
> 
> On Mon, Apr 16 2018, Lars Schneider wrote:
> 
>>> On 16 Apr 2018, at 04:03, Linus Torvalds <torva...@linux-foundation.org> 
>>> wrote:
>>> 
>>> On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamano <gits...@pobox.com> wrote:
>>>> 
>>>> I think Elijah's corrected was_tracked() also does not care "has
>>>> this been renamed".
>>> 
>>> I'm perfectly happy with the slightly smarter patches. My patch was
>>> really just an RFC and because I had tried it out.
>>> 
>>>> One thing that makes me curious is what happens (and what we want to
>>>> happen) when such a "we already have the changes the side branch
>>>> tries to bring in" path has local (i.e. not yet in the index)
>>>> changes.  For a dirty file that trivially merges (e.g. a path we
>>>> modified since our histories forked, while the other side didn't do
>>>> anything, has local changes in the working tree), we try hard to
>>>> make the merge succeed while keeping the local changes, and we
>>>> should be able to do the same in this case, too.
>>> 
>>> I think it might be nice, but probably not really worth it.
>>> 
>>> I find the "you can merge even if some files are dirty" to be really
>>> convenient, because I often keep stupid test patches in my tree that I
>>> may not even intend to commit, and I then use the same tree for
>>> merging.
>>> 
>>> For example, I sometimes end up editing the Makefile for the release
>>> version early, but I won't *commit* that until I actually cut the
>>> release. But if I pull some branch that has also changed the Makefile,
>>> it's not worth any complexity to try to be nice about the dirty state.
>>> 
>>> If it's a file that actually *has* been changed in the branch I'm
>>> merging, and I'm more than happy to just stage the patch (or throw it
>>> away - I think it's about 50:50 for me).
>>> 
>>> So I don't think it's a big deal, and I'd rather have the merge fail
>>> very early with "that file has seen changes in the branch you are
>>> merging" than add any real complexity to the merge logic.
>> 
>> I am happy to see this discussion and the patches, because long rebuilds
>> are a constant annoyance for us. We might have been bitten by the exact
>> case discussed here, but more often, we have a slightly different
>> situation:
>> 
>> An engineer works on a task branch and runs incremental builds — all
>> is good. The engineer switches to another branch to review another
>> engineer's work. This other branch changes a low-level header file,
>> but no rebuild is triggered. The engineer switches back to the previous
>> task branch. At this point, the incremental build will rebuild
>> everything, as the compiler thinks that the low-level header file has
>> been changed (because the mtime is different).
>> 
>> Of course, this problem can be solved with a separate worktree. However,
>> our engineers forget about that sometimes, and then, they are annoyed by
>> a 4h rebuild.
>> 
>> Is this situation a problem for others too?
>> If yes, what do you think about the following approach:
>> 
>> What if Git kept a LRU list that contains file path, content hash, and
>> mtime of any file that is removed or modified during a checkout. If a
>> file is checked out later with the exact same path and content hash,
>> then Git could set the mtime to the previous value. This way the
>> compiler would not think that the content has been changed since the
>> last rebuild.
>> 
>> I think that would fix the problem that our engineers run into and also
>> the problem that Linus experienced during the merge, wouldn't it?
> 
> Could what you're describing be prototyped as a post-checkout hook that
> looks at the reflog? It sounds to me like it could, but perhaps I've
> missed some subtlety.

Yeah, probably. You don't even need the reflog I think. I just wanted
to get a sense if other people run into this problem too.


> Not re-writing out a file that hasn't changed is one thing, but I think
> for more complex behaviors (such as the "I want everything to have the
> same mtime" mentioned in another thread on-list), and this, it makes
> sense to provide some hook mechanism than have git itself do all the
> work.
> 
> I also don't see how what you're describing could be generalized, or
> even be made t

Re: Optimizing writes to unchanged files during merges?

2018-04-16 Thread Lars Schneider

> On 16 Apr 2018, at 04:03, Linus Torvalds  
> wrote:
> 
> On Sun, Apr 15, 2018 at 6:44 PM, Junio C Hamano  wrote:
>> 
>> I think Elijah's corrected was_tracked() also does not care "has
>> this been renamed".
> 
> I'm perfectly happy with the slightly smarter patches. My patch was
> really just an RFC and because I had tried it out.
> 
>> One thing that makes me curious is what happens (and what we want to
>> happen) when such a "we already have the changes the side branch
>> tries to bring in" path has local (i.e. not yet in the index)
>> changes.  For a dirty file that trivially merges (e.g. a path we
>> modified since our histories forked, while the other side didn't do
>> anything, has local changes in the working tree), we try hard to
>> make the merge succeed while keeping the local changes, and we
>> should be able to do the same in this case, too.
> 
> I think it might be nice, but probably not really worth it.
> 
> I find the "you can merge even if some files are dirty" to be really
> convenient, because I often keep stupid test patches in my tree that I
> may not even intend to commit, and I then use the same tree for
> merging.
> 
> For example, I sometimes end up editing the Makefile for the release
> version early, but I won't *commit* that until I actually cut the
> release. But if I pull some branch that has also changed the Makefile,
> it's not worth any complexity to try to be nice about the dirty state.
> 
> If it's a file that actually *has* been changed in the branch I'm
> merging, and I'm more than happy to just stage the patch (or throw it
> away - I think it's about 50:50 for me).
> 
> So I don't think it's a big deal, and I'd rather have the merge fail
> very early with "that file has seen changes in the branch you are
> merging" than add any real complexity to the merge logic.

I am happy to see this discussion and the patches, because long rebuilds 
are a constant annoyance for us. We might have been bitten by the exact 
case discussed here, but more often, we have a slightly different 
situation:

An engineer works on a task branch and runs incremental builds — all 
is good. The engineer switches to another branch to review another 
engineer's work. This other branch changes a low-level header file, 
but no rebuild is triggered. The engineer switches back to the previous 
task branch. At this point, the incremental build will rebuild 
everything, as the compiler thinks that the low-level header file has
been changed (because the mtime is different).

Of course, this problem can be solved with a separate worktree. However, 
our engineers forget about that sometimes, and then, they are annoyed by 
a 4h rebuild.

Is this situation a problem for others too?
If yes, what do you think about the following approach:

What if Git kept a LRU list that contains file path, content hash, and 
mtime of any file that is removed or modified during a checkout. If a 
file is checked out later with the exact same path and content hash, 
then Git could set the mtime to the previous value. This way the 
compiler would not think that the content has been changed since the 
last rebuild.

I think that would fix the problem that our engineers run into and also 
the problem that Linus experienced during the merge, wouldn't it?

Thanks,
Lars

[PATCH v13 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] 
https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 Documentation/config.txt |  6 
 Documentation/gitattributes.txt  |  8 +
 config.c |  5 +++
 convert.c| 77 
 convert.h|  1 +
 environment.c|  1 +
 t/t0028-working-tree-encoding.sh | 39 
 7 files changed, 137 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..7dcac9b540 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
This variable can be set to 'input',
in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+   A comma and/or whitespace separated list of encodings that Git
+   performs UTF-8 round trip checks on if they are used in an
+   `working-tree-encoding` attribute (see linkgit:gitattributes[5]).
+   The default value is `SHIFT-JIS`.
+
 core.symlinks::
If false, symbolic links are checked out as small plain files that
contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 31a4f92840..aa3deae392 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -312,6 +312,14 @@ number of pitfalls:
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
   That operation will fail and cause an error.
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.checkroundtripencoding")) {
+   check_roundtrip_encoding = xstrdup(value);
+   return 0;
+   }
+
if (!strcmp(var, "core.notesref")) {
notes_ref_name = xstrdup(value);
return 0;
diff --git a/convert.c b/convert.c
index bc35c33249..1ae6301629 100644
--- a/convert.c
+++ b/convert.c
@@ -347,6 +347,42 @@ static void trace_encoding(const char *context, const char 
*path,
strbuf_release();
 }
 
+static int check_roundtrip(const char *enc_name)
+{
+   /*
+* check_roundtrip_encoding contains a string of comma and/or
+* space separated encodings (eg. "UTF-16, ASCII, CP1125").
+* Search for the given encoding in that string.
+*/
+   const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+   const char *next;
+   int len;
+   if (!found)
+   return 0;
+   next = found + strlen(enc_name);
+   len = strlen(check_roundtrip_encoding);
+   return (found && (
+   /*
+* check that the found encoding is at the
+* beginning of check_roundtrip_encoding or
+* that it is prefixed with a space or comma
+*/
+   found == check_roundtrip_encoding || (
+   (isspace(found[-1]) || found[-1] == ',')
+   )
+   ) && (
+   /*
+* check that the found encoding is at the
+* end of check_roundtrip_encoding or
+* that it is suffixed with a space or comma
+*/
+   next == check_roundtrip_encoding + len || (
+   next < check_roundtrip_encoding + len &

[PATCH v13 09/10] convert: add tracing for 'working-tree-encoding' attribute

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 convert.c| 25 +
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index 0e7930c154..bc35c33249 100644
--- a/convert.c
+++ b/convert.c
@@ -324,6 +324,29 @@ static int validate_encoding(const char *path, const char 
*enc,
return 0;
 }
 
+static void trace_encoding(const char *context, const char *path,
+  const char *encoding, const char *buf, size_t len)
+{
+   static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+   struct strbuf trace = STRBUF_INIT;
+   int i;
+
+   strbuf_addf(, "%s (%s, considered %s):\n", context, path, 
encoding);
+   for (i = 0; i < len && buf; ++i) {
+   strbuf_addf(
+   ,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+   i,
+   (unsigned char) buf[i],
+   (buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+   ((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+   );
+   }
+   strbuf_addchars(, '\n', 1);
+
+   trace_strbuf(, );
+   strbuf_release();
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -352,6 +375,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
if (validate_encoding(path, enc, src, src_len, die_on_error))
return 0;
 
+   trace_encoding("source", path, enc, src, src_len);
dst = reencode_string_len(src, src_len, default_encoding, enc,
  _len);
if (!dst) {
@@ -369,6 +393,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
return 0;
}
}
+   trace_encoding("destination", path, default_encoding, dst, dst_len);
 
strbuf_attach(buf, dst, dst_len, dst_len + 1);
return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index a318d03232..026544ce09 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via 
gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test files' '
git config core.eol lf &&
 
-- 
2.16.2



[PATCH v13 07/10] convert: add 'working-tree-encoding' attribute

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git reencodes
the content to a canonical UTF-8 representation. On checkout Git will
reverse this operation.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 Documentation/gitattributes.txt  |  80 ++
 convert.c| 113 ++-
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 142 +++
 5 files changed, 336 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..31a4f92840 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,86 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  For example, Microsoft Visual Studio resources files (`*.rc`) or
+  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.ps1` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.ps1' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+
+*.ps1  text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.ps1' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+
+*.ps1  text working-tree-encoding=UTF-16LE eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+
+file foo.ps1
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..21d5cb60da 100644
--- a/conver

[PATCH v13 06/10] utf8: add function to detect a missing UTF-16/32 BOM

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
 Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 utf8.c | 13 +
 utf8.h | 19 +++
 2 files changed, 32 insertions(+)

diff --git a/utf8.c b/utf8.c
index 83af3a9f6b..25d366d6b3 100644
--- a/utf8.c
+++ b/utf8.c
@@ -586,6 +586,19 @@ int has_prohibited_utf_bom(const char *enc, const char 
*data, size_t len)
);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+  (same_utf_encoding(enc, "UTF-16")) &&
+  !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+  (same_utf_encoding(enc, "UTF-32")) &&
+  !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 0db1db4519..cce654a64a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,23 @@ void strbuf_utf8_align(struct strbuf *buf, align_type 
position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there in no
+ * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
+ * used in HTML5 recommends to assume little-endian to "deal with
+ * deployed content" [3].
+ *
+ * Therefore, strictly requiring a BOM seems to be the safest option for
+ * content in Git.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2



[PATCH v13 03/10] strbuf: add a case insensitive starts_with()

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Check in a case insensitive manner if one string is a prefix of another
string.

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 git-compat-util.h | 1 +
 strbuf.c  | 9 +
 2 files changed, 10 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index 68b2ad531e..95c9b34832 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, 
va_list params);
 extern void set_die_is_recursing_routine(int (*routine)(void));
 
 extern int starts_with(const char *str, const char *prefix);
+extern int istarts_with(const char *str, const char *prefix);
 
 /*
  * If the string "str" begins with the string found in "prefix", return 1.
diff --git a/strbuf.c b/strbuf.c
index b635f0bdc4..99812b8488 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix)
return 0;
 }
 
+int istarts_with(const char *str, const char *prefix)
+{
+   for (; ; str++, prefix++)
+   if (!*prefix)
+   return 1;
+   else if (tolower(*str) != tolower(*prefix))
+   return 0;
+}
+
 int skip_to_optional_arg_default(const char *str, const char *prefix,
 const char **arg, const char *def)
 {
-- 
2.16.2



[PATCH v13 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

The function same_encoding() could only recognize alternative names for
UTF-8 encodings. Teach it to recognize all kinds of alternative UTF
encoding names (e.g. utf16).

While we are at it, fix a crash that would occur if same_encoding() was
called with a NULL argument and a non-NULL argument.

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 utf8.c | 26 --
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..40f4b142d6 100644
--- a/utf8.c
+++ b/utf8.c
@@ -401,18 +401,40 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, 
int width,
strbuf_release(_dst);
 }
 
+/*
+ * Returns true (1) if the src encoding name matches the dst encoding
+ * name directly or one of its alternative names. E.g. UTF-16BE is the
+ * same as UTF16BE.
+ */
+static int same_utf_encoding(const char *src, const char *dst)
+{
+   if (istarts_with(src, "utf") && istarts_with(dst, "utf")) {
+   /* src[3] or dst[3] might be '\0' */
+   int i = (src[3] == '-' ? 4 : 3);
+   int j = (dst[3] == '-' ? 4 : 3);
+   return !strcasecmp(src+i, dst+j);
+   }
+   return 0;
+}
+
 int is_encoding_utf8(const char *name)
 {
if (!name)
return 1;
-   if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8"))
+   if (same_utf_encoding("utf-8", name))
return 1;
return 0;
 }
 
 int same_encoding(const char *src, const char *dst)
 {
-   if (is_encoding_utf8(src) && is_encoding_utf8(dst))
+   static const char utf8[] = "UTF-8";
+
+   if (!src)
+   src = utf8;
+   if (!dst)
+   dst = utf8;
+   if (same_utf_encoding(src, dst))
return 1;
return !strcasecmp(src, dst);
 }
-- 
2.16.2



[PATCH v13 01/10] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
result = xmallocz(len);
for (i = 0; i < len; i++)
result[i] = tolower(string[i]);
-   result[i] = '\0';
return result;
 }
 
-- 
2.16.2



[PATCH v13 05/10] utf8: add function to detect prohibited UTF-16/32 BOM

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 utf8.c | 26 ++
 utf8.h |  9 +
 2 files changed, 35 insertions(+)

diff --git a/utf8.c b/utf8.c
index 40f4b142d6..83af3a9f6b 100644
--- a/utf8.c
+++ b/utf8.c
@@ -560,6 +560,32 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+ const char *bom, size_t bom_len)
+{
+   return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+ (same_utf_encoding("UTF-16BE", enc) ||
+  same_utf_encoding("UTF-16LE", enc)) &&
+ (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+ (same_utf_encoding("UTF-32BE",  enc) ||
+  same_utf_encoding("UTF-32LE", enc)) &&
+ (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..0db1db4519 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int 
width,
   const char *s);
 
+/*
+ * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
+ * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
+ * The function returns true if this rule is violated.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2



[PATCH v13 00/10] convert: add support for different encodings

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Hi,

Patches 1-6,9 are preparation and helper functions.
Patch 7,8,10 are the actual change.

This series is based on v2.16.0 and Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13).

The series can be rebased without conflicts on top of v2.17.0:
https://github.com/larsxschneider/git/tree/encoding-2.17

Changes since v12:

* commit message improvement (Torsten)
* prevent undefined memcpy behavior in has_bom_prefix (Avar)
* improve error message: true/false are no valid working-tree-encodings 
(Torsten)
* fix crash in same_encoding() if only one argument is NULL (this bug
  was already present before this series, Eric)

Thanks,
Lars

  RFC: 
https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/
   v1: 
https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/
   v2: 
https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/
   v3: 
https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/
   v4: 
https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/
   v6: 
https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/
   v7: 
https://public-inbox.org/git/20180215152711.158-1-lars.schnei...@autodesk.com/
   v8: 
https://public-inbox.org/git/20180224162801.98860-1-lars.schnei...@autodesk.com/
   v9: 
https://public-inbox.org/git/20180304201418.60958-1-lars.schnei...@autodesk.com/
  v10: 
https://public-inbox.org/git/20180307173026.30058-1-lars.schnei...@autodesk.com/
  v11: 
https://public-inbox.org/git/20180309173536.62012-1-lars.schnei...@autodesk.com/
  v12: 
https://public-inbox.org/git/20180315225746.18119-1-lars.schnei...@autodesk.com/

Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/3aa98e6975
Checkout: git fetch https://github.com/larsxschneider/git encoding-v13 && git 
checkout 3aa98e6975


### Interdiff (v12..v13):

diff --git a/convert.c b/convert.c
index 2a002af66d..1ae6301629 100644
--- a/convert.c
+++ b/convert.c
@@ -1222,7 +1222,7 @@ static const char *git_path_check_encoding(struct 
attr_check_item *check)
return NULL;

if (ATTR_TRUE(value) || ATTR_FALSE(value)) {
-   die(_("working-tree-encoding attribute requires a value"));
+   die(_("true/false are no valid working-tree-encodings"));
}

/* Don't encode to the default encoding */
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 884f0878b1..12b8eb963a 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -152,7 +152,7 @@ test_expect_success 'check unsupported encodings' '
echo "*.set text working-tree-encoding" >.gitattributes &&
printf "set" >t.set &&
test_must_fail git add t.set 2>err.out &&
-   test_i18ngrep "working-tree-encoding attribute requires a value" 
err.out &&
+   test_i18ngrep "true/false are no valid working-tree-encodings" err.out 
&&

echo "*.unset text -working-tree-encoding" >.gitattributes &&
printf "unset" >t.unset &&
diff --git a/utf8.c b/utf8.c
index 2d8821d36e..25d366d6b3 100644
--- a/utf8.c
+++ b/utf8.c
@@ -428,8 +428,12 @@ int is_encoding_utf8(const char *name)

 int same_encoding(const char *src, const char *dst)
 {
-   if (is_encoding_utf8(src) && is_encoding_utf8(dst))
-   return 1;
+   static const char utf8[] = "UTF-8";
+
+   if (!src)
+   src = utf8;
+   if (!dst)
+   dst = utf8;
if (same_utf_encoding(src, dst))
return 1;
return !strcasecmp(src, dst);
@@ -559,7 +563,7 @@ char *reencode_string_len(const char *in, int insz,
 static int has_bom_prefix(const char *data, size_t len,
  const char *bom, size_t bom_len)
 {
-   return (len >= bom_len) && !memcmp(data, bom, bom_len);
+   return data && bom && (len >= bom_len) && !memcmp(data, bom, bom_len);
 }

 static const char utf16_be_bom[] = {0xFE, 0xFF};


### Patches

Lars Schneider (10):
  strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
  strbuf: add xstrdup_toupper()
  strbuf: add a case insensitive starts_with()
  utf8: teach same_encoding() alternative UTF encoding names
  utf8: add function to detect prohibited UTF-16/32 BOM
  utf8: add function to detect a missing UTF-16/32 BOM
  convert: add 'working-tree-encoding' attribute
  convert: check for detectable errors in UTF encodings
  convert: add tracing for 'working-tree-encoding' attribute
  convert: add round trip ch

[PATCH v13 02/10] strbuf: add xstrdup_toupper()

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 strbuf.c | 12 
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+   char *result;
+   size_t len, i;
+
+   len = strlen(string);
+   result = xmallocz(len);
+   for (i = 0; i < len; i++)
+   result[i] = toupper(string[i]);
+   return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.2



[PATCH v13 08/10] convert: check for detectable errors in UTF encodings

2018-04-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Check that new content is valid with respect to the user defined
'working-tree-encoding' attribute.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 convert.c| 61 +++
 t/t0028-working-tree-encoding.sh | 62 
 2 files changed, 123 insertions(+)

diff --git a/convert.c b/convert.c
index 21d5cb60da..0e7930c154 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,64 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 
 }
 
+static int validate_encoding(const char *path, const char *enc,
+ const char *data, size_t len, int die_on_error)
+{
+   /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
+   if (istarts_with(enc, "UTF")) {
+   /*
+* Check for detectable errors in UTF encodings
+*/
+   if (has_prohibited_utf_bom(enc, data, len)) {
+   const char *error_msg = _(
+   "BOM is prohibited in '%s' if encoded as %s");
+   /*
+* This advice is shown for UTF-??BE and UTF-??LE 
encodings.
+* We cut off the last two characters of the encoding 
name
+* to generate the encoding name suitable for BOMs.
+*/
+   const char *advise_msg = _(
+   "The file '%s' contains a byte order "
+   "mark (BOM). Please use UTF-%s as "
+   "working-tree-encoding.");
+   const char *stripped = NULL;
+   char *upper = xstrdup_toupper(enc);
+   upper[strlen(upper)-2] = '\0';
+   if (!skip_prefix(upper, "UTF-", ))
+   skip_prefix(stripped, "UTF", );
+   advise(advise_msg, path, stripped);
+   free(upper);
+   if (die_on_error)
+   die(error_msg, path, enc);
+   else {
+   return error(error_msg, path, enc);
+   }
+
+   } else if (is_missing_required_utf_bom(enc, data, len)) {
+   const char *error_msg = _(
+   "BOM is required in '%s' if encoded as %s");
+   const char *advise_msg = _(
+   "The file '%s' is missing a byte order "
+   "mark (BOM). Please use UTF-%sBE or UTF-%sLE "
+   "(depending on the byte order) as "
+   "working-tree-encoding.");
+   const char *stripped = NULL;
+   char *upper = xstrdup_toupper(enc);
+   if (!skip_prefix(upper, "UTF-", ))
+   skip_prefix(stripped, "UTF", );
+   advise(advise_msg, path, stripped, stripped);
+   free(upper);
+   if (die_on_error)
+   die(error_msg, path, enc);
+   else {
+   return error(error_msg, path, enc);
+   }
+   }
+
+   }
+   return 0;
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -291,6 +349,9 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
if (!buf && !src)
return 1;
 
+   if (validate_encoding(path, enc, src, src_len, die_on_error))
+   return 0;
+
dst = reencode_string_len(src, src_len, default_encoding, enc,
  _len);
if (!dst) {
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 8e574ccdd8..a318d03232 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -62,6 +62,52 @@ test_expect_success 'check $GIT_DIR/info/attributes support' 
'
 
 for i in 16 32
 do
+   test_expect_success "check prohibited UTF-${i} BOM" '
+   test_when_finished "git reset --hard HEAD" &&
+
+   echo "*.utf${i}be text working-tree-encoding=utf-${i}be" 
>>.gitattributes &&
+   echo "*.utf${i}le text working-tree-encoding=utf-${i}LE" 
>>.gitattributes &&
+
+   # Here we add a UTF-16 (resp. UTF-32) files with BOM 
(big/little-endian)
+   # but we tell Git to treat it as 

Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-04-15 Thread Lars Schneider

> On 05 Apr 2018, at 18:41, Torsten Bögershausen <tbo...@web.de> wrote:
> 
> On 01.04.18 15:24, Lars Schneider wrote:
>>> TRUE or false are values, but just wrong ones.
>>> If this test is removed, the user will see "failed to encode "TRUE" to 
>>> "UTF-8",
>>> which should give enough information to fix it.
>> 
>> I see your point. However, I would like to stop the processing right
>> there for these invalid values. How about 
>> 
>>  error(_("true/false are no valid working-tree-encodings"));
>> 
>> I think that is the most straight forward/helpful error message
>> for the enduser (I consider the term "boolean" but dismissed it
>> as potentially confusing to folks not familiar with the term).
>> 
>> OK with you?
> 
> Yes.

Great!


> Another thing that came up recently, independent of your series:
> 
> What should happen if a user specifies "UTF-8" and the file
> has an UTF-8 encoded BOM ?
> I ask because I stumbled over such a file coming from a Windows
> which the java compiler under Linux didn't accept.
> 
> And because some tools love to put an UTF-8 encoded BOM
> into text files.
> 
> The clearest thing would be to extend the BOM check in 5/9
> to cover UTF-32, UTF-16 and UTF-8.
> 
> Are there any plans to do so?

If `working-tree-encoding` is not defined or defined as UTF-8,
then we would return from encode_to_git() early. That means we
would never run validate_encoding() which would check the BOM.

However, adding the UTF-8 BOM would still make sense. This way
Git could scream if a user set `working-tree-encoding` to UTF-16
but the file is really UTF-8 encoded.


> And thanks for the work.

Thanks :-)


- Lars

Re: [PATCH v2 1/5] core.aheadbehind: add new config setting

2018-04-03 Thread Lars Schneider

> On 04 Jan 2018, at 20:26, Jeff King  wrote:
> 
> On Wed, Dec 27, 2017 at 09:41:30AM -0800, Junio C Hamano wrote:
> 
>> Jeff King  writes:
>> 
>>> I, too, had a funny feeling about calling this "core". But I didn't have
>>> a better name, as I'm not sure what other place we have for config
>>> options that cross many command boundaries. "diff" and "status" don't
>>> seem quite right to me. While you can argue they are subsystems, it
>>> seems too easy for users to confuse them with the commands of the same
>>> names.
>>> 
>>> Maybe there should be a "ui.*" config hierarchy for these kinds of
>>> cross-command interface options?
>> 
>> I had an impression that ui.* was primarily pretty-printing,
>> colouring and things of such nature.
> 
> I didn't think we had a "ui.*" so far. We have "color.ui" and
> "column.ui", but I think that's it.
> 
> At any rate, my intent was to consider this a "ui" issue, in that we are
> deciding how the ahead/behind hints should be shown to the user.
> 
>> I do not think it is such a
>> bad idea to honor a status.frotz variable that affects how (e.g. to
>> what degree of detailedness) status on frotz are reported in Git
>> subcommands other than 'git status' if they report the same sort of
>> information on 'frotz' that 'git status' makes.
> 
> Is ahead/behind uniquely attached to git-status? IOW, could this be called
> "branch.aheadbehind" and git-status respects it? It seems like putting
> it in status introduces a weird asymmetry.
> 
> I buy the argument more that "status" here is not "this is a git-status
> config option", but "this config section encompasses various things
> about the status of a repository reported by many commands". But then
> it's kind of funny to have many of the existing options there that
> really are specific to git-status.
> 
> In can be both of those things, of course, but then it becomes less
> clear to the user which config options affect which command.
> 
> I dunno. It is probably not _that_ big a deal, and I can live with it
> wherever. But Git has a reputation for having inconsistencies and weird
> asymmetries in its UI, so I like to give some thought to squashing them
> preemptively.

What is the state of this series? I can't find it in git/git nor in 
git-for-windows/git. I think Stolee mentioned the config in
his Git Merge talk [1] and I was about to test it/roll it out :-)

- Lars


[1] https://youtu.be/oOMzi983Qmw



Re: [PATCH v12 00/10] convert: add support for different encodings

2018-04-03 Thread Lars Schneider

> On 02 Apr 2018, at 20:31, Lars Schneider <larsxschnei...@gmail.com> wrote:
> 
> 
>> On 29 Mar 2018, at 20:37, Junio C Hamano <gits...@pobox.com> wrote:
>> 
>> lars.schnei...@autodesk.com writes:
>> 
>>> From: Lars Schneider <larsxschnei...@gmail.com>
>>> 
>>> Patches 1-6,9 are preparation and helper functions. Patch 4 is new.
>>> Patch 7,8,10 are the actual change.
>>> 
>>> This series depends on Torsten's 8462ff43e4 (convert_to_git():
>>> safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
>>> already in master.
>> 
>> I didn't see any further review comments on this round, and as far
>> as I can tell from my reading, these patches looked more-or-less
>> ready.  
>> 
>> Except for 04/10 that had a few messages around "who should be
>> responsible for handling the 'NULL is for the default UTF-8'?", that
>> is.
>> 
>> So, what's the doneness of this thing?
> 
> Almost. I'll send a new round tomorrow. I hope to make this the final
> round.

@Junio: Can you remind me of your preference? Should I rebase my
series to 2.17.0 or keep the base to make it easier for you to 
check the interdiff?

Thanks,
Lars


Re: [PATCH v12 00/10] convert: add support for different encodings

2018-04-02 Thread Lars Schneider

> On 29 Mar 2018, at 20:37, Junio C Hamano <gits...@pobox.com> wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> From: Lars Schneider <larsxschnei...@gmail.com>
>> 
>> Patches 1-6,9 are preparation and helper functions. Patch 4 is new.
>> Patch 7,8,10 are the actual change.
>> 
>> This series depends on Torsten's 8462ff43e4 (convert_to_git():
>> safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
>> already in master.
> 
> I didn't see any further review comments on this round, and as far
> as I can tell from my reading, these patches looked more-or-less
> ready.  
> 
> Except for 04/10 that had a few messages around "who should be
> responsible for handling the 'NULL is for the default UTF-8'?", that
> is.
> 
> So, what's the doneness of this thing?

Almost. I'll send a new round tomorrow. I hope to make this the final
round.

- Lars



Re: [GSoC] [PATCH] travis-ci: added clang static analysis

2018-04-01 Thread Lars Schneider

> On 13 Mar 2018, at 18:45, Siddhartha Mishra <sidm1...@gmail.com> wrote:
> 
> On Mon, Mar 12, 2018 at 3:49 PM, Lars Schneider
> <larsxschnei...@gmail.com> wrote:
>> Hi,
>> 
>> That looks interesting but I agree with Dscho that we should not limit
>> this to master/maint.
>> 
>> I assume you did run this on TravisCI already? Can you share a link?
>> I assume you did find errors? Can we fix them or are there too many?
>> If there are existing errors, how do we define a "successful" build?
>> 
>> Thanks for working on this,
>> Lars
>> 
> 
> Thanks for the reply,
> 
> I assume there will be false positives in the code which we can't fix
> by making small modifications to the code as recommended in the FAQ
> (https://clang-analyzer.llvm.org/faq.html). According to the FAQ,
> there is no solid mechanism for suppressing a specific warning, so are
> options are limited. Some of the things which might help reduce the
> noise are:
> 
> 1) To add specific tags in our source code to tell the analyzer to
> ignore the code. This is probably a bad idea since it is intrusive and
> forces changes to the actual source code which only affect one task.
> 
> 2) Count the number of bugs in the previous pushed build and fail the
> build if the number of bugs increases. It doesn't help remove the
> noise from the error log but it does tell you if you've added more
> bugs. However if you add a bug and remove one, it'll pass the job and
> might mislead you into thinking that the code is correct.
> 
> 3) Write a script to check the diff of the error log from that of the
> previous pushed build(ignoring the line numbers). I haven't thought
> about how exactly it would be implemented so I'm not commenting on it.
> 
> Is there a better solution that I'm missing or should I try coming up
> with a script to come up the diff?

That's a good summary and I don't see a better solution. While (3)
sounds nice, I think (2) is the fastest/most pragmatic solution.

We already use the Travis cache [1]. You could use that mechanism to 
store a file with the latest number of bugs in the cache directory 
$HOME/travis-cache

If the "number of bugs" file does not exist, then create it and don't
complain. If the file exists and the previous number of bugs is higher
or equal, then don't complain either. If the file exists and the previous
number of bugs is lower, then let the build fail.

Do you think that could work?

Cheers,
Lars 

[1] https://docs.travis-ci.com/user/caching/


> 
> Thanks for the time,
> Siddhartha
> 
> On Mon, Mar 12, 2018 at 3:49 PM, Lars Schneider
> <larsxschnei...@gmail.com> wrote:
>> Hi,
>> 
>> That looks interesting but I agree with Dscho that we should not limit
>> this to master/maint.
>> 
>> I assume you did run this on TravisCI already? Can you share a link?
>> I assume you did find errors? Can we fix them or are there too many?
>> If there are existing errors, how do we define a "successful" build?
>> 
>> Thanks for working on this,
>> Lars
>> 
>>> On 05 Mar 2018, at 21:04, SiddharthaMishra <sidm1...@gmail.com> wrote:
>>> 
>>> Added a job to run clang static code analysis on the master and maint branch
>>> 
>>> Signed-off-by: SiddharthaMishra <sidm1...@gmail.com>
>>> ---
>>> .travis.yml   | 17 -
>>> ci/run-static-analysis.sh |  9 -
>>> 2 files changed, 24 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/.travis.yml b/.travis.yml
>>> index 4684b3f4f..9b891d182 100644
>>> --- a/.travis.yml
>>> +++ b/.travis.yml
>>> @@ -48,7 +48,7 @@ matrix:
>>>  before_install:
>>>  before_script:
>>>  script: ci/run-linux32-docker.sh
>>> -- env: jobname=StaticAnalysis
>>> +- env: jobname=CocciStaticAnalysis
>>>  os: linux
>>>  compiler:
>>>  addons:
>>> @@ -59,6 +59,21 @@ matrix:
>>>  before_script:
>>>  script: ci/run-static-analysis.sh
>>>  after_failure:
>>> +- if: branch IN (master, maint)
>>> +  env: jobname=ClangStaticAnalysis
>>> +  os: linux
>>> +  compiler:
>>> +  add_ons:
>>> +apt:
>>> +  sources:
>>> +  - ubuntu-toolchain-r-test
>>> +  - llvm-toolchain-trusty
>>> +  packages:
>>> +  - clang
>>> +  before_install:
>>> +  before_script:
>>> +  script: ci/run-static-analysis.sh
>>> +  after_failure:
>>>- env: jobname=Documentation
>>>  os: linux
>>>  compiler:
>>> diff --git a/ci/run-static-analysis.sh b/ci/run-static-analysis.sh
>>> index fe4ee4e06..6ae032f54 100755
>>> --- a/ci/run-static-analysis.sh
>>> +++ b/ci/run-static-analysis.sh
>>> @@ -5,6 +5,13 @@
>>> 
>>> . ${0%/*}/lib-travisci.sh
>>> 
>>> -make coccicheck
>>> +case "$jobname" in
>>> +ClangStaticAnalysis)
>>> + scan-build -analyze-headers --status-bugs make
>>> + ;;
>>> +CocciStaticAnalysis)
>>> + make coccicheck
>>> + ;;
>>> +esac
>>> 
>>> save_good_tree
>>> --
>>> 2.16.2.248.ge2408a6f7.dirty
>>> 
>> 



Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-04-01 Thread Lars Schneider

> On 16 Mar 2018, at 19:19, Eric Sunshine  wrote:
> 
> On Fri, Mar 16, 2018 at 1:50 PM, Junio C Hamano  wrote:
>> Eric Sunshine  writes:
>>> However, I'm having a tough time imagining cases in which callers
>>> would want same_encoding() to return true if both arguments are NULL,
>>> but outright crash if only one is NULL (which is the behavior even
>>> before this patch). In other words, same_encoding() takes advantage of
>>> is_encoding_utf8() for its convenience, not for its NULL-handling.
>>> Given that view, the two explicit is_encoding_utf8() calls in
>>> same_encoding() seem redundant once the same_utf_encoding() call is
>>> added.
>> 
>> So... does that mean we'd want something like this, or do you have
>> something else in mind?
>> 
>>int same_encoding(const char *src, const char *dst)
>>{
>>static const char utf8[] = "UTF-8";
>> 
>>if (!src)
>>src = utf8;
>>if (!dst)
>>dst = utf8;
>>if (same_utf_encoding(src, dst))
>>return 1;
>>return !strcasecmp(src, dst);
>>}
> 
> I am not proposing anything like that for this patch or patch series.
> I'm merely asking why, after this patch, same_encoding() still
> contains the (in my mind) now-unneeded conditional:
> 
>if (is_encoding_utf8(src) && is_encoding_utf8(dst))
>return 1;

My main motivation was to keep the existing behavior "as-is"
to avoid any regressions.

However, I agree with your critic of the inconsistencies. 

Therefore, I will use Junio's suggestion above as it makes 
the intented behaivior clear.

Thanks,
Lars


Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-04-01 Thread Lars Schneider

> On 18 Mar 2018, at 08:24, Torsten Bögershausen <tbo...@web.de> wrote:
> 
> Some comments inline
> 
> On Fri, Mar 09, 2018 at 06:35:32PM +0100, lars.schnei...@autodesk.com wrote:
>> From: Lars Schneider <larsxschnei...@gmail.com>
>> 
>> Git recognizes files encoded with ASCII or one of its supersets (e.g.
>> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
>> interpreted as binary and consequently built-in Git text processing
>> tools (e.g. 'git diff') as well as most Git web front ends do not
>> visualize the content.
>> 
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
> 
> Minor comment:
> "Git converts the content"
> Everywhere else (?) "encodes or reencodes" is used.
> "Git reencodes the content" may be more consistent.

OK, will change.


>> 
>> +static const char *default_encoding = "UTF-8";
>> +
>> +static int encode_to_git(const char *path, const char *src, size_t src_len,
>> + struct strbuf *buf, const char *enc, int conv_flags)
>> +{
>> +char *dst;
>> +int dst_len;
>> +int die_on_error = conv_flags & CONV_WRITE_OBJECT;
>> +
>> +/*
>> + * No encoding is specified or there is nothing to encode.
>> + * Tell the caller that the content was not modified.
>> + */
>> +if (!enc || (src && !src_len))
>> +return 0;
> 
> (This may have been discussed before.
> As we checked (enc != NULL) I think we can add here:)
>   if (is_encoding_utf8(enc))
>   return 0;

This should be covered in git_path_check_encoding(),
introduced in v12:

/* Don't encode to the default encoding */
if (same_encoding(value, default_encoding))
return NULL;

In that function the encoding of a certain file is read from
the .gitattributes. If the encoding matches the compile-time
defined default encoding (= UTF-8), then the encoding is set
to NULL.


>> 
>> +
>> +static int encode_to_worktree(const char *path, const char *src, size_t 
>> src_len,
>> +  struct strbuf *buf, const char *enc)
>> +{
>> +char *dst;
>> +int dst_len;
>> +
>> +/*
>> + * No encoding is specified or there is nothing to encode.
>> + * Tell the caller that the content was not modified.
>> + */
>> +if (!enc || (src && !src_len))
>> +return 0;
> 
> Same as above:
>   if (is_encoding_utf8(enc))
>   return 0;
> 
>> +
>> +dst = reencode_string_len(src, src_len, enc, default_encoding,
>> +  _len);
>> +if (!dst) {
>> +error("failed to encode '%s' from %s to %s",
>> +path, default_encoding, enc);
>> +return 0;
>> +}
>> +
>> +strbuf_attach(buf, dst, dst_len, dst_len + 1);
>> +return 1;
>> +}
>> +
>> static int crlf_to_git(const struct index_state *istate,
>> const char *path, const char *src, size_t len,
>> struct strbuf *buf,
>> @@ -978,6 +1051,25 @@ static int ident_to_worktree(const char *path, const 
>> char *src, size_t len,
>>  return 1;
>> }
>> 
>> +static const char *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +const char *value = check->value;
>> +
>> +if (ATTR_UNSET(value) || !strlen(value))
>> +return NULL;
>> +
> 
> 
>> +if (ATTR_TRUE(value) || ATTR_FALSE(value)) {
>> +error(_("working-tree-encoding attribute requires a value"));
>> +return NULL;
>> +}
> 
> TRUE or false are values, but just wrong ones.
> If this test is removed, the user will see "failed to encode "TRUE" to 
> "UTF-8",
> which should give enough information to fix it.

I see your point. However, I would like to stop the processing right
there for these invalid values. How about 

  error(_("true/false are no valid working-tree-encodings"));

I think that is the most straight forward/helpful error message
for the enduser (I consider the term "boolean" but dismissed it
as potentially confusing to folks not familiar with the term).

OK with you?

> 
>> +
>> +/* Don't encode to the default encoding */
>> +if (!strcasecmp(value, default_encoding))
>> +return NULL;
> Same as above ?:
>   if (is_encoding_utf8(value))
>   return 0;

Yes, that was fixed in v12 as mentioned above :-)

- Lars

Re: What's cooking in git.git (Mar 2018, #05; Wed, 28)

2018-04-01 Thread Lars Schneider

> On 30 Mar 2018, at 12:32, Lars Schneider <larsxschnei...@gmail.com> wrote:
> 
> 
>> On 30 Mar 2018, at 11:24, Ævar Arnfjörð Bjarmason <ava...@gmail.com> wrote:
>> 
>> 
>> On Wed, Mar 28 2018, Junio C. Hamano wrote:
>> 
>>> * ls/checkout-encoding (2018-03-16) 10 commits
>>> - convert: add round trip check based on 'core.checkRoundtripEncoding'
>>> - convert: add tracing for 'working-tree-encoding' attribute
>>> - convert: check for detectable errors in UTF encodings
>>> - convert: add 'working-tree-encoding' attribute
>>> - utf8: add function to detect a missing UTF-16/32 BOM
>>> - utf8: add function to detect prohibited UTF-16/32 BOM
>>> - utf8: teach same_encoding() alternative UTF encoding names
>>> - strbuf: add a case insensitive starts_with()
>>> - strbuf: add xstrdup_toupper()
>>> - strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
>>> 
>>> The new "checkout-encoding" attribute can ask Git to convert the
>>> contents to the specified encoding when checking out to the working
>>> tree (and the other way around when checking in).
>> 
>> There's an issue in ca16fc3635 ("convert: check for detectable errors in
>> UTF encodings", 2018-03-15) flagged by Coverity:
>> https://public-inbox.org/git/CAGZ79kbAOcwaRzjuMtZ_HVsYvUr_7UAPbOcnrmPgsdE19q=p...@mail.gmail.com/
> 
> Thanks a lot for pointing me at this!
> I'll prepare a new round soonish.


The report says:

  >>> CID 1433528:  Null pointer dereferences  (FORWARD_NULL)
  >>> Passing null pointer "src" to "validate_encoding", which dereferences 
it.

  411 if (validate_encoding(path, enc, src, src_len, die_on_error))

However, validate_encoding() does not dereference it. It just passes the
pointer to has_prohibited_utf_bom() and is_missing_required_utf_bom().
These functions just pass the pointer to has_bom_prefix().

In has_bom_prefix() we pass the pointer to memcmp() which is undefined
for null pointers. I think that is what Coverity is complaining about,
right?


- Lars

Re: What's cooking in git.git (Mar 2018, #05; Wed, 28)

2018-03-30 Thread Lars Schneider

> On 30 Mar 2018, at 11:24, Ævar Arnfjörð Bjarmason  wrote:
> 
> 
> On Wed, Mar 28 2018, Junio C. Hamano wrote:
> 
>> * ls/checkout-encoding (2018-03-16) 10 commits
>> - convert: add round trip check based on 'core.checkRoundtripEncoding'
>> - convert: add tracing for 'working-tree-encoding' attribute
>> - convert: check for detectable errors in UTF encodings
>> - convert: add 'working-tree-encoding' attribute
>> - utf8: add function to detect a missing UTF-16/32 BOM
>> - utf8: add function to detect prohibited UTF-16/32 BOM
>> - utf8: teach same_encoding() alternative UTF encoding names
>> - strbuf: add a case insensitive starts_with()
>> - strbuf: add xstrdup_toupper()
>> - strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
>> 
>> The new "checkout-encoding" attribute can ask Git to convert the
>> contents to the specified encoding when checking out to the working
>> tree (and the other way around when checking in).
> 
> There's an issue in ca16fc3635 ("convert: check for detectable errors in
> UTF encodings", 2018-03-15) flagged by Coverity:
> https://public-inbox.org/git/CAGZ79kbAOcwaRzjuMtZ_HVsYvUr_7UAPbOcnrmPgsdE19q=p...@mail.gmail.com/

Thanks a lot for pointing me at this!
I'll prepare a new round soonish.

- Lars

Re: [PATCH v2] travis-ci: enable more warnings on travis linux-gcc job

2018-03-17 Thread Lars Schneider

> On 17 Mar 2018, at 09:01, Duy Nguyen  wrote:
> 
> On Fri, Mar 16, 2018 at 10:22 PM, Jeff King  wrote:
>>> diff --git a/ci/run-build-and-tests.sh b/ci/run-build-and-tests.sh
>>> index 3735ce413f..f6f346c468 100755
>>> --- a/ci/run-build-and-tests.sh
>>> +++ b/ci/run-build-and-tests.sh
>>> @@ -7,6 +7,22 @@
>>> 
>>> ln -s "$cache_dir/.prove" t/.prove
>>> 
>>> +if [ "$jobname" = linux-gcc ]; then
>>> + gcc-6 --version
>>> + cat >config.mak <<-EOF
>>> + CC=gcc-6
>>> + CFLAGS = -g -O2 -Wall
>>> + CFLAGS += -Wextra
>>> + CFLAGS += -Wmissing-prototypes
>>> + CFLAGS += -Wno-empty-body
>>> + CFLAGS += -Wno-maybe-uninitialized
>>> + CFLAGS += -Wno-missing-field-initializers
>>> + CFLAGS += -Wno-sign-compare
>>> + CFLAGS += -Wno-unused-function
>>> + CFLAGS += -Wno-unused-parameter
>>> + EOF
>>> +fi
>> 
>> Why isn't this just turning on DEVELOPER=1 if we know we have a capable
>> compiler?
> 
> DEVELOPER=1 is always set even before this patch. It's set and
> exported in lib-travisci.sh.

I interpreted Peff's comment like this:

If DEVELOPER=1 is set and we detect a gcc-6 in the makefile, 
then we could set your additional flags in the makefile.

This way every developer with a new compiler would run these
flags locally (if DEVELOPER=1 is set).

- Lars


Re: [PATCH v6 00/14] Serialized Git Commit Graph

2018-03-16 Thread Lars Schneider

> On 14 Mar 2018, at 21:43, Junio C Hamano  wrote:
> 
> Derrick Stolee  writes:
> 
>> This v6 includes feedback around csum-file.c and the rename of hashclose()
>> to finalize_hashfile(). These are the first two commits of the series, so
>> they could be pulled out independently.
>> 
>> The only other change since v5 is that I re-ran the performance numbers
>> in "commit: integrate commit graph with commit parsing".
> 
> Thanks.
> 
>> Hopefully this version is ready to merge. I have several follow-up topics
>> in mind to submit soon after, including:
> 
> A few patches add trailing blank lines and other whitespace
> breakages, which will stop my "git merge" later to 'next' and down,
> as I have a pre-commit hook to catch them.

@stolee: 

I run "git --no-pager diff --check $BASE_HASH...$HEAD_HASH" to detect
these kinds of things. I run this as part of my "prepare patch" [1] script
which is inspired by a similar script originally written by Dscho.

Do you think it would make sense to mention (or even
recommend) such a script in your awesome GfW CONTRIBUTING.md?


- Lars


[1] 
https://github.com/larsxschneider/git-list-helper/blob/master/prepare-patch.sh#L71




Re: [PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-03-15 Thread Lars Schneider

> On 16 Mar 2018, at 00:25, Eric Sunshine <sunsh...@sunshineco.com> wrote:
> 
> On Thu, Mar 15, 2018 at 6:57 PM,  <lars.schnei...@autodesk.com> wrote:
>> The function same_encoding() checked only for alternative UTF-8 encoding
>> names. Teach it to check for all kinds of alternative UTF encoding
>> names.
>> 
>> Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
>> ---
>> diff --git a/utf8.c b/utf8.c
>> @@ -401,11 +401,27 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int 
>> pos, int width,
>> +static int same_utf_encoding(const char *src, const char *dst)
>> +{
>> +   if (istarts_with(src, "utf") && istarts_with(dst, "utf")) {
>> +   /* src[3] or dst[3] might be '\0' */
>> +   int i = (src[3] == '-' ? 4 : 3);
>> +   int j = (dst[3] == '-' ? 4 : 3);
>> +   return !strcasecmp(src+i, dst+j);
>> +   }
>> +   return 0;
>> +}
> 
> Alternate implementation without magic numbers:
> 
>if (iskip_prefix(src, "utf", ) &&
>iskip_prefix(dst, "utf", )) {
>if (*src == '-')
>src++;
>if (*dst == '-')
>dst++;
>return !strcasecmp(src, dst);
>}
>return 0;
> 
> ... assuming you add an iskip_prefix() function (patterned after 
> skip_prefix()).

If a reroll is necessary, then I can do this.


>> int is_encoding_utf8(const char *name)
>> {
>>if (!name)
>>return 1;
>> -   if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8"))
>> +   if (same_utf_encoding("utf-8", name))
>>return 1;
>>return 0;
>> }
>> @@ -414,6 +430,8 @@ int same_encoding(const char *src, const char *dst)
>> {
>>if (is_encoding_utf8(src) && is_encoding_utf8(dst))
>>return 1;
>> +   if (same_utf_encoding(src, dst))
>> +   return 1;
>>return !strcasecmp(src, dst);
>> }
> 
> This seems odd. I would have expected the newly-added generalized
> conditional to replace the original UTF-8-specific conditional, not
> supplement it. That is, shouldn't the entire function body be:
> 
>if (same_utf_encoding(src, dst))
>return 1;
>return !strcasecmp(src, dst);

No, because is_encoding_utf8() returns "true" (=1) if the encoding
is NULL. That is not the case for UTF-16 et al. The caller of
same_encoding() might expect that behavior.

I could have moved the "UTF-8" == NULL assumption into 
same_utf_encoding() but that did not feel right.

Does this make sense?

- Lars

Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)

2018-03-15 Thread Lars Schneider

> On 15 Mar 2018, at 20:18, Lars Schneider <larsxschnei...@gmail.com> wrote:
> 
> 
>> On 15 Mar 2018, at 02:34, Junio C Hamano <gits...@pobox.com> wrote:
>> 
>> ...
>> 
>> * ls/checkout-encoding (2018-03-09) 10 commits
>> - convert: add round trip check based on 'core.checkRoundtripEncoding'
>> - convert: add tracing for 'working-tree-encoding' attribute
>> - convert: advise canonical UTF encoding names
>> - convert: check for detectable errors in UTF encodings
>> - convert: add 'working-tree-encoding' attribute
>> - utf8: add function to detect a missing UTF-16/32 BOM
>> - utf8: add function to detect prohibited UTF-16/32 BOM
>> - strbuf: add a case insensitive starts_with()
>> - strbuf: add xstrdup_toupper()
>> - strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
>> 
>> The new "checkout-encoding" attribute can ask Git to convert the
>> contents to the specified encoding when checking out to the working
>> tree (and the other way around when checking in).
>> 
>> Expecting a reroll.
>> cf. <66370a41-a048-44e7-9bf8-4631c50aa...@gmail.com>
>> Modulo minor design decision corrections, the series is almost there.
> 
> If I hurry up (IOW: send a reroll tonight), would this topic
> have a chance for 2.17-rc1 or is it too late?

I just send out a reroll in case the series is still relevant
for 2.17-rc1.

https://public-inbox.org/git/20180315225746.18119-1-lars.schnei...@autodesk.com/

Thanks,
Lars

[PATCH v12 08/10] convert: check for detectable errors in UTF encodings

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Check that new content is valid with respect to the user defined
'working-tree-encoding' attribute.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 convert.c| 61 +++
 t/t0028-working-tree-encoding.sh | 62 
 2 files changed, 123 insertions(+)

diff --git a/convert.c b/convert.c
index 85e49741af..3cab4fa907 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,64 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 
 }
 
+static int validate_encoding(const char *path, const char *enc,
+ const char *data, size_t len, int die_on_error)
+{
+   /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
+   if (istarts_with(enc, "UTF")) {
+   /*
+* Check for detectable errors in UTF encodings
+*/
+   if (has_prohibited_utf_bom(enc, data, len)) {
+   const char *error_msg = _(
+   "BOM is prohibited in '%s' if encoded as %s");
+   /*
+* This advice is shown for UTF-??BE and UTF-??LE 
encodings.
+* We cut off the last two characters of the encoding 
name
+* to generate the encoding name suitable for BOMs.
+*/
+   const char *advise_msg = _(
+   "The file '%s' contains a byte order "
+   "mark (BOM). Please use UTF-%s as "
+   "working-tree-encoding.");
+   const char *stripped = NULL;
+   char *upper = xstrdup_toupper(enc);
+   upper[strlen(upper)-2] = '\0';
+   if (!skip_prefix(upper, "UTF-", ))
+   skip_prefix(stripped, "UTF", );
+   advise(advise_msg, path, stripped);
+   free(upper);
+   if (die_on_error)
+   die(error_msg, path, enc);
+   else {
+   return error(error_msg, path, enc);
+   }
+
+   } else if (is_missing_required_utf_bom(enc, data, len)) {
+   const char *error_msg = _(
+   "BOM is required in '%s' if encoded as %s");
+   const char *advise_msg = _(
+   "The file '%s' is missing a byte order "
+   "mark (BOM). Please use UTF-%sBE or UTF-%sLE "
+   "(depending on the byte order) as "
+   "working-tree-encoding.");
+   const char *stripped = NULL;
+   char *upper = xstrdup_toupper(enc);
+   if (!skip_prefix(upper, "UTF-", ))
+   skip_prefix(stripped, "UTF", );
+   advise(advise_msg, path, stripped, stripped);
+   free(upper);
+   if (die_on_error)
+   die(error_msg, path, enc);
+   else {
+   return error(error_msg, path, enc);
+   }
+   }
+
+   }
+   return 0;
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -291,6 +349,9 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
if (!buf && !src)
return 1;
 
+   if (validate_encoding(path, enc, src, src_len, die_on_error))
+   return 0;
+
dst = reencode_string_len(src, src_len, default_encoding, enc,
  _len);
if (!dst) {
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index d67dbde1d4..1bb528b339 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -62,6 +62,52 @@ test_expect_success 'check $GIT_DIR/info/attributes support' 
'
 
 for i in 16 32
 do
+   test_expect_success "check prohibited UTF-${i} BOM" '
+   test_when_finished "git reset --hard HEAD" &&
+
+   echo "*.utf${i}be text working-tree-encoding=utf-${i}be" 
>>.gitattributes &&
+   echo "*.utf${i}le text working-tree-encoding=utf-${i}LE" 
>>.gitattributes &&
+
+   # Here we add a UTF-16 (resp. UTF-32) files with BOM 
(big/little-endian)
+   # but we tell Git to treat it as 

[PATCH v12 09/10] convert: add tracing for 'working-tree-encoding' attribute

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 convert.c| 25 +
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index 3cab4fa907..ba6f2019a3 100644
--- a/convert.c
+++ b/convert.c
@@ -324,6 +324,29 @@ static int validate_encoding(const char *path, const char 
*enc,
return 0;
 }
 
+static void trace_encoding(const char *context, const char *path,
+  const char *encoding, const char *buf, size_t len)
+{
+   static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+   struct strbuf trace = STRBUF_INIT;
+   int i;
+
+   strbuf_addf(, "%s (%s, considered %s):\n", context, path, 
encoding);
+   for (i = 0; i < len && buf; ++i) {
+   strbuf_addf(
+   ,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+   i,
+   (unsigned char) buf[i],
+   (buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+   ((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+   );
+   }
+   strbuf_addchars(, '\n', 1);
+
+   trace_strbuf(, );
+   strbuf_release();
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -352,6 +375,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
if (validate_encoding(path, enc, src, src_len, die_on_error))
return 0;
 
+   trace_encoding("source", path, enc, src, src_len);
dst = reencode_string_len(src, src_len, default_encoding, enc,
  _len);
if (!dst) {
@@ -369,6 +393,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
return 0;
}
}
+   trace_encoding("destination", path, default_encoding, dst, dst_len);
 
strbuf_attach(buf, dst, dst_len, dst_len + 1);
return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 1bb528b339..2ff7541b34 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via 
gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test files' '
git config core.eol lf &&
 
-- 
2.16.2



[PATCH v12 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] 
https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 Documentation/config.txt |  6 
 Documentation/gitattributes.txt  |  8 +
 config.c |  5 +++
 convert.c| 77 
 convert.h|  1 +
 environment.c|  1 +
 t/t0028-working-tree-encoding.sh | 39 
 7 files changed, 137 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..7dcac9b540 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
This variable can be set to 'input',
in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+   A comma and/or whitespace separated list of encodings that Git
+   performs UTF-8 round trip checks on if they are used in an
+   `working-tree-encoding` attribute (see linkgit:gitattributes[5]).
+   The default value is `SHIFT-JIS`.
+
 core.symlinks::
If false, symbolic links are checked out as small plain files that
contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 31a4f92840..aa3deae392 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -312,6 +312,14 @@ number of pitfalls:
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
   That operation will fail and cause an error.
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.checkroundtripencoding")) {
+   check_roundtrip_encoding = xstrdup(value);
+   return 0;
+   }
+
if (!strcmp(var, "core.notesref")) {
notes_ref_name = xstrdup(value);
return 0;
diff --git a/convert.c b/convert.c
index ba6f2019a3..2a002af66d 100644
--- a/convert.c
+++ b/convert.c
@@ -347,6 +347,42 @@ static void trace_encoding(const char *context, const char 
*path,
strbuf_release();
 }
 
+static int check_roundtrip(const char *enc_name)
+{
+   /*
+* check_roundtrip_encoding contains a string of comma and/or
+* space separated encodings (eg. "UTF-16, ASCII, CP1125").
+* Search for the given encoding in that string.
+*/
+   const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+   const char *next;
+   int len;
+   if (!found)
+   return 0;
+   next = found + strlen(enc_name);
+   len = strlen(check_roundtrip_encoding);
+   return (found && (
+   /*
+* check that the found encoding is at the
+* beginning of check_roundtrip_encoding or
+* that it is prefixed with a space or comma
+*/
+   found == check_roundtrip_encoding || (
+   (isspace(found[-1]) || found[-1] == ',')
+   )
+   ) && (
+   /*
+* check that the found encoding is at the
+* end of check_roundtrip_encoding or
+* that it is suffixed with a space or comma
+*/
+   next == check_roundtrip_encoding + len || (
+   next < check_roundtrip_encoding + len &

[PATCH v12 04/10] utf8: teach same_encoding() alternative UTF encoding names

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

The function same_encoding() checked only for alternative UTF-8 encoding
names. Teach it to check for all kinds of alternative UTF encoding
names.

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 utf8.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..c30daf4d34 100644
--- a/utf8.c
+++ b/utf8.c
@@ -401,11 +401,27 @@ void strbuf_utf8_replace(struct strbuf *sb_src, int pos, 
int width,
strbuf_release(_dst);
 }
 
+/*
+ * Returns true (1) if the src encoding name matches the dst encoding
+ * name directly or one of its alternative names. E.g. UTF-16BE is the
+ * same as UTF16BE.
+ */
+static int same_utf_encoding(const char *src, const char *dst)
+{
+   if (istarts_with(src, "utf") && istarts_with(dst, "utf")) {
+   /* src[3] or dst[3] might be '\0' */
+   int i = (src[3] == '-' ? 4 : 3);
+   int j = (dst[3] == '-' ? 4 : 3);
+   return !strcasecmp(src+i, dst+j);
+   }
+   return 0;
+}
+
 int is_encoding_utf8(const char *name)
 {
if (!name)
return 1;
-   if (!strcasecmp(name, "utf-8") || !strcasecmp(name, "utf8"))
+   if (same_utf_encoding("utf-8", name))
return 1;
return 0;
 }
@@ -414,6 +430,8 @@ int same_encoding(const char *src, const char *dst)
 {
if (is_encoding_utf8(src) && is_encoding_utf8(dst))
return 1;
+   if (same_utf_encoding(src, dst))
+   return 1;
return !strcasecmp(src, dst);
 }
 
-- 
2.16.2



[PATCH v12 07/10] convert: add 'working-tree-encoding' attribute

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 Documentation/gitattributes.txt  |  80 ++
 convert.c| 113 ++-
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 142 +++
 5 files changed, 336 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..31a4f92840 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,86 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  For example, Microsoft Visual Studio resources files (`*.rc`) or
+  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.ps1` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.ps1' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+
+*.ps1  text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.ps1' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+
+*.ps1  text working-tree-encoding=UTF-16LE eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+
+file foo.ps1
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..85e49741af 100644
--- a/conver

[PATCH v12 06/10] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
 Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 utf8.c | 13 +
 utf8.h | 19 +++
 2 files changed, 32 insertions(+)

diff --git a/utf8.c b/utf8.c
index d16dc1f244..2d8821d36e 100644
--- a/utf8.c
+++ b/utf8.c
@@ -582,6 +582,19 @@ int has_prohibited_utf_bom(const char *enc, const char 
*data, size_t len)
);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+  (same_utf_encoding(enc, "UTF-16")) &&
+  !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+  (same_utf_encoding(enc, "UTF-32")) &&
+  !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 0db1db4519..cce654a64a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,23 @@ void strbuf_utf8_align(struct strbuf *buf, align_type 
position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there in no
+ * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
+ * used in HTML5 recommends to assume little-endian to "deal with
+ * deployed content" [3].
+ *
+ * Therefore, strictly requiring a BOM seems to be the safest option for
+ * content in Git.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2



[PATCH v12 02/10] strbuf: add xstrdup_toupper()

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 strbuf.c | 12 
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+   char *result;
+   size_t len, i;
+
+   len = strlen(string);
+   result = xmallocz(len);
+   for (i = 0; i < len; i++)
+   result[i] = toupper(string[i]);
+   return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.2



[PATCH v12 00/10] convert: add support for different encodings

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Hi,

Patches 1-6,9 are preparation and helper functions. Patch 4 is new.
Patch 7,8,10 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.

Changes since v11:

* die if w-t-e is configured with a true/false (=undefined!)
  value (Junio)
* improve same_encoding to detect all alternatives for
  UTF encodings (new commit, Junio)
* squash in "advise canonical UTF encoding names" and
  remove commit (Junio)
* fix erroneous # in comment (Junio)
* force segv for non-UTF encodings in validate_encoding() (Junio)

Thanks,
Lars


  RFC: 
https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/
   v1: 
https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/
   v2: 
https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/
   v3: 
https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/
   v4: 
https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/
   v6: 
https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/
   v7: 
https://public-inbox.org/git/20180215152711.158-1-lars.schnei...@autodesk.com/
   v8: 
https://public-inbox.org/git/20180224162801.98860-1-lars.schnei...@autodesk.com/
   v9: 
https://public-inbox.org/git/20180304201418.60958-1-lars.schnei...@autodesk.com/
  v10: 
https://public-inbox.org/git/20180307173026.30058-1-lars.schnei...@autodesk.com/
  v11: 
https://public-inbox.org/git/20180309173536.62012-1-lars.schnei...@autodesk.com/

Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/0daedbbd76
Checkout: git fetch https://github.com/larsxschneider/git encoding-v12 && git 
checkout 0daedbbd76


### Interdiff (v11..v12):

diff --git a/convert.c b/convert.c
index c2d24882c1..2a002af66d 100644
--- a/convert.c
+++ b/convert.c
@@ -280,13 +280,13 @@ static int validate_encoding(const char *path, const char 
*enc,
/*
 * This advice is shown for UTF-??BE and UTF-??LE 
encodings.
 * We cut off the last two characters of the encoding 
name
-# to generate the encoding name suitable for BOMs.
+* to generate the encoding name suitable for BOMs.
 */
const char *advise_msg = _(
"The file '%s' contains a byte order "
"mark (BOM). Please use UTF-%s as "
"working-tree-encoding.");
-   const char *stripped = "";
+   const char *stripped = NULL;
char *upper = xstrdup_toupper(enc);
upper[strlen(upper)-2] = '\0';
if (!skip_prefix(upper, "UTF-", ))
@@ -307,7 +307,7 @@ static int validate_encoding(const char *path, const char 
*enc,
"mark (BOM). Please use UTF-%sBE or UTF-%sLE "
"(depending on the byte order) as "
"working-tree-encoding.");
-   const char *stripped = "";
+   const char *stripped = NULL;
char *upper = xstrdup_toupper(enc);
if (!skip_prefix(upper, "UTF-", ))
skip_prefix(stripped, "UTF", );
@@ -1222,12 +1222,11 @@ static const char *git_path_check_encoding(struct 
attr_check_item *check)
return NULL;

if (ATTR_TRUE(value) || ATTR_FALSE(value)) {
-   error(_("working-tree-encoding attribute requires a value"));
-   return NULL;
+   die(_("working-tree-encoding attribute requires a value"));
}

/* Don't encode to the default encoding */
-   if (is_encoding_utf8(value) && is_encoding_utf8(default_encoding))
+   if (same_encoding(value, default_encoding))
return NULL;

return value;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 07089bba2e..884f0878b1 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -149,25 +149,23 @@ done
 test_expect_success 'check unsupported encodings' '
test_when_finished "git reset --hard HEAD" &&

-   echo "*.set text working-tree-encoding" >>.gitattributes &&
+   echo "*.set text working-tree-encoding" >.gitattributes &&
printf "set" >t.set &&
-   git add t.set 2>er

[PATCH v12 05/10] utf8: add function to detect prohibited UTF-16/32 BOM

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 utf8.c | 26 ++
 utf8.h |  9 +
 2 files changed, 35 insertions(+)

diff --git a/utf8.c b/utf8.c
index c30daf4d34..d16dc1f244 100644
--- a/utf8.c
+++ b/utf8.c
@@ -556,6 +556,32 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+ const char *bom, size_t bom_len)
+{
+   return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+ (same_utf_encoding("UTF-16BE", enc) ||
+  same_utf_encoding("UTF-16LE", enc)) &&
+ (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+ (same_utf_encoding("UTF-32BE",  enc) ||
+  same_utf_encoding("UTF-32LE", enc)) &&
+ (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..0db1db4519 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int 
width,
   const char *s);
 
+/*
+ * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
+ * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
+ * The function returns true if this rule is violated.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2



[PATCH v12 01/10] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
result = xmallocz(len);
for (i = 0; i < len; i++)
result[i] = tolower(string[i]);
-   result[i] = '\0';
return result;
 }
 
-- 
2.16.2



[PATCH v12 03/10] strbuf: add a case insensitive starts_with()

2018-03-15 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Check in a case insensitive manner if one string is a prefix of another
string.

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 git-compat-util.h | 1 +
 strbuf.c  | 9 +
 2 files changed, 10 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index 68b2ad531e..95c9b34832 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, 
va_list params);
 extern void set_die_is_recursing_routine(int (*routine)(void));
 
 extern int starts_with(const char *str, const char *prefix);
+extern int istarts_with(const char *str, const char *prefix);
 
 /*
  * If the string "str" begins with the string found in "prefix", return 1.
diff --git a/strbuf.c b/strbuf.c
index b635f0bdc4..99812b8488 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix)
return 0;
 }
 
+int istarts_with(const char *str, const char *prefix)
+{
+   for (; ; str++, prefix++)
+   if (!*prefix)
+   return 1;
+   else if (tolower(*str) != tolower(*prefix))
+   return 0;
+}
+
 int skip_to_optional_arg_default(const char *str, const char *prefix,
 const char **arg, const char *def)
 {
-- 
2.16.2



Re: [PATCH v11 08/10] convert: advise canonical UTF encoding names

2018-03-15 Thread Lars Schneider

> On 09 Mar 2018, at 20:11, Junio C Hamano <gits...@pobox.com> wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> From: Lars Schneider <larsxschnei...@gmail.com>
>> 
>> The canonical name of an UTF encoding has the format UTF, dash, number,
>> and an optionally byte order in upper case (e.g. UTF-8 or UTF-16BE).
>> Some iconv versions support alternative names without a dash or with
>> lower case characters.
>> 
>> To avoid problems between different iconv version always suggest the
>> canonical UTF names in advise messages.
>> 
>> Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
>> ---
> 
> I think it is probably better to squash this to earlier step,
> i.e. jumping straight to the endgame solution.

ok!


>> diff --git a/convert.c b/convert.c
>> index b80d666a6b..9a3ae7cce1 100644
>> --- a/convert.c
>> +++ b/convert.c
>> @@ -279,12 +279,20 @@ static int validate_encoding(const char *path, const 
>> char *enc,
>>  "BOM is prohibited in '%s' if encoded as %s");
>>  /*
>>   * This advice is shown for UTF-??BE and UTF-??LE 
>> encodings.
>> + * We cut off the last two characters of the encoding 
>> name
>> + # to generate the encoding name suitable for BOMs.
>>   */
> 
> I somehow thought that I saw "s/#/*/" in somebody's response during
> the previous round?

Oops. Will fix!


>>  const char *advise_msg = _(
>>  "The file '%s' contains a byte order "
>> -"mark (BOM). Please use %.6s as "
>> +"mark (BOM). Please use UTF-%s as "
>>  "working-tree-encoding.");
>> -advise(advise_msg, path, enc);
>> +const char *stripped = "";
>> +char *upper = xstrdup_toupper(enc);
>> +upper[strlen(upper)-2] = '\0';
>> +if (!skip_prefix(upper, "UTF-", ))
>> +skip_prefix(stripped, "UTF", );
>> +advise(advise_msg, path, stripped);
>> +free(upper);
> 
> If this codepath is ever entered with "enc" that does not begin with
> "UTF" (e.g. "Shift_JIS", which is impossible in the current code,
> but I'll talk about future-proofing here), then neither of these
> skip_prefix will trigger, and then you'd end up suggesting to use
> "UTF-" that is nonsense.  Perhaps initialize stripped to NULL and
> force advise to segv to catch such a programmer error?

Agreed!


Thanks,
Lars



Re: [PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-03-15 Thread Lars Schneider

> On 09 Mar 2018, at 20:10, Junio C Hamano  wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> +static const char *default_encoding = "UTF-8";
>> +
>> ...
>> +static const char *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +const char *value = check->value;
>> +
>> +if (ATTR_UNSET(value) || !strlen(value))
>> +return NULL;
>> +
>> +if (ATTR_TRUE(value) || ATTR_FALSE(value)) {
>> +error(_("working-tree-encoding attribute requires a value"));
>> +return NULL;
>> +}
> 
> Hmph, so we decide to be loud but otherwise ignore an undefined
> configuration?  Shouldn't we rather die instead to avoid touching
> the user data in unexpected ways?

OK.


>> +
>> +/* Don't encode to the default encoding */
>> +if (!strcasecmp(value, default_encoding))
>> +return NULL;
> 
> Is this an optimization to avoid "recode one encoding to the same
> encoding" no-op overhead?

Correct.

>  We already have the optimization in the
> same spirit in may existing codepaths that has nothing to do with
> w-t-e, and I think we should share the code.  Two pieces of thought
> comes to mind.
> 
> One is a lot smaller in scale: Is same_encoding() sufficient for
> this callsite instead of strcasecmp()?

Yes!


> The other one is a lot bigger: Looking at all the existing callers
> of same_encoding() that call reencode_string() when it returns false,
> would it make sense to drop same_encoding() and move the optimization
> to reencode_string() instead?
> 
> I suspect that the answer to the smaller one is "yes, and even if
> not, it should be easy to enhance/extend same_encoding() to make it
> do what we want it to, and such a change will benefit even existing
> callers."  The answer to the larger one is likely "the optimization
> is not about skipping only reencode_string() call but other things
> are subtly different among callers of same_encoding(), so such a
> refactoring would not be all that useful."

I agree. reencode_string() would need to signal 3 cases:
1. reencode performed
2. reencode not necessary
3. reencode failed

We could model "reencode not necessary" as "char *in == char *return".
However, I think this should be tackled in a separate series.

Thanks
Lars


Re: What's cooking in git.git (Mar 2018, #03; Wed, 14)

2018-03-15 Thread Lars Schneider

> On 15 Mar 2018, at 02:34, Junio C Hamano  wrote:
> 
> ...
> 
> * ls/checkout-encoding (2018-03-09) 10 commits
> - convert: add round trip check based on 'core.checkRoundtripEncoding'
> - convert: add tracing for 'working-tree-encoding' attribute
> - convert: advise canonical UTF encoding names
> - convert: check for detectable errors in UTF encodings
> - convert: add 'working-tree-encoding' attribute
> - utf8: add function to detect a missing UTF-16/32 BOM
> - utf8: add function to detect prohibited UTF-16/32 BOM
> - strbuf: add a case insensitive starts_with()
> - strbuf: add xstrdup_toupper()
> - strbuf: remove unnecessary NUL assignment in xstrdup_tolower()
> 
> The new "checkout-encoding" attribute can ask Git to convert the
> contents to the specified encoding when checking out to the working
> tree (and the other way around when checking in).
> 
> Expecting a reroll.
> cf. <66370a41-a048-44e7-9bf8-4631c50aa...@gmail.com>
> Modulo minor design decision corrections, the series is almost there.

If I hurry up (IOW: send a reroll tonight), would this topic
have a chance for 2.17-rc1 or is it too late?

Thanks,
Lars




Re: How to debug a "git merge"?

2018-03-15 Thread Lars Schneider

> On 14 Mar 2018, at 23:20, Jeff King <p...@peff.net> wrote:
> 
> On Wed, Mar 14, 2018 at 05:56:04PM +0100, Lars Schneider wrote:
> 
>> I am investigating a Git merge (a86dd40fe) in which an older version of 
>> a file won over the newer version. I try to understand why this is the 
>> case. I can reproduce the merge with the following commands:
>> $ git checkout -b test a02fa3303
>> $ GIT_MERGE_VERBOSITY=5 git merge --verbose c1b82995c
>> 
>> The merge actually generates a merge conflict but not for my
>> problematic file. The common ancestor of the two parents (merge base) 
>> is b91161554.
>> 
>> The merge graph is not pretty (the committers don't have a clean 
>> branching scheme) but I cannot spot a problem between the merge commit
>> and the common ancestor:
>> $ git log --graph --oneline a86dd40fe
>> 
>> Can you give me a hint how to debug this merge further? How can I 
>> understand why Git picked a certain version of a file in a merge?
> 
> Maybe a stupid question, but: did you make sure that the merge does
> indeed pick the wrong version of the file? The other option is that
> somebody mistakenly did a "checkout --ours" or similar while resolving
> the conflict.

No stupid question at all. That's exactly what they did and I did not
realize it! Thank you!

Next time I won't stumble over this. I wonder if this is a common enough
problem to do something about it? For instance what if `git log` (or just
`git show`) has an option `--verify-merges` or `--reenact-merges` or 
something? This option would perform a "default recursive merge" and 
show the diff between the actual merge and the default merge?

In the most common case there is no diff. If there are merge conflicts
then we would just show the conflicting files. If there is no merge
conflict for a file *but* a difference then we would show it. I think
this would have helped me to realize this kind of problem earlier.

Would that option make sense to you?

Thanks,
Lars


Re: How to debug a "git merge"?

2018-03-14 Thread Lars Schneider

> On 14 Mar 2018, at 18:02, Derrick Stolee <sto...@gmail.com> wrote:
> 
> On 3/14/2018 12:56 PM, Lars Schneider wrote:
>> Hi,
>> 
>> I am investigating a Git merge (a86dd40fe) in which an older version of
>> a file won over the newer version. I try to understand why this is the
>> case. I can reproduce the merge with the following commands:
>> $ git checkout -b test a02fa3303
>> $ GIT_MERGE_VERBOSITY=5 git merge --verbose c1b82995c
>> 
>> The merge actually generates a merge conflict but not for my
>> problematic file. The common ancestor of the two parents (merge base)
>> is b91161554.
>> 
>> The merge graph is not pretty (the committers don't have a clean
>> branching scheme) but I cannot spot a problem between the merge commit
>> and the common ancestor:
>> $ git log --graph --oneline a86dd40fe
> 
> Have you tried `git log --graph --oneline --simplify-merges -- path` to see 
> what changes and merges involved the file? I find that view to be very 
> helpful (while the default history simplification can hide things). In 
> particular, if there was a change that was reverted in one side and not 
> another, we could find out.

Thanks for this tip! Unfortunately, this only confirms my current view:

### First parent
$ git log --graph --oneline --simplify-merges a02fa3303 -- path/to/problem
* 4e47a10c7 <-- old version
* 01f01f61c 

### Second parent
$ git log --graph --oneline --simplify-merges c1b82995c -- path/to/problem
* 590e52ed1 <-- new version
* 8e598828d 
* ad4e9034b 
* 4e47a10c7 
* 01f01f61c 

### Merge
$ git log --graph --oneline --simplify-merges a86dd40fe -- path/to/problem
*   a86dd40fe <-- old version ?!?! That's the problem!
|\
| * 590e52ed1 <-- new version
| * 8e598828d
| * ad4e9034b
|/
* 4e47a10c7 <-- old version
* 01f01f61c


> You could also use the "A...B" to check your two commits for merging, and 
> maybe add "--boundary".

$ git diff --boundary a02fa3303...c1b82995c -- path/to/problem

This looks like the correct diff. The "new version" is mark as +/add/green in 
the diff.

Does this make any sense to you?

Thank you,
Lars

How to debug a "git merge"?

2018-03-14 Thread Lars Schneider
Hi,

I am investigating a Git merge (a86dd40fe) in which an older version of 
a file won over the newer version. I try to understand why this is the 
case. I can reproduce the merge with the following commands:
$ git checkout -b test a02fa3303
$ GIT_MERGE_VERBOSITY=5 git merge --verbose c1b82995c

The merge actually generates a merge conflict but not for my
problematic file. The common ancestor of the two parents (merge base) 
is b91161554.

The merge graph is not pretty (the committers don't have a clean 
branching scheme) but I cannot spot a problem between the merge commit
and the common ancestor:
$ git log --graph --oneline a86dd40fe

Can you give me a hint how to debug this merge further? How can I 
understand why Git picked a certain version of a file in a merge?

I am using Git 2.16.2 on Linux.

Thanks,
Lars


Re: [git-sizer] Implications of a large commit object

2018-03-14 Thread Lars Schneider

> On 14 Mar 2018, at 09:33, Michael Haggerty <mhag...@alum.mit.edu> wrote:
> 
> On Wed, Mar 14, 2018 at 9:14 AM, Lars Schneider
> <larsxschnei...@gmail.com> wrote:
>> I am using Michael's fantastic Git repo analyzer tool "git-sizer" [*]
>> and it detected a very large commit of 7.33 MiB in my repo (see chart
>> below).
>> 
>> This large commit is expected. I've imported that repo from another
>> version control system but excluded all binary files (e.g. images) and
>> some 3rd party components as their history is not important [**]. I've
>> reintroduced these files in the head commit again. This is where the
>> large commit came from.
>> 
>> This repo is not used in production yet but I wonder if this kind of
>> approach can cause trouble down the line? Are there any relevant
>> implication of a single large commit like this in history?
>> [...]
>> 
>> ###
>> ## git-sizer output
>> 
>> [...]
>> | Name | Value | Level of concern   |
>> |  | - | -- |
>> [...]
>> | Biggest objects  |   ||
>> | * Commits|   ||
>> |   * Maximum size [1] |  7.33 MiB | !! |
>> [...]
> 
> The "commit size" that is being referred to here is the size of the
> actual commit object; i.e., the author name, parent commits, etc plus
> the log message. So a huge commit probably means that you have a huge
> log message. This has nothing to do with the number or sizes of the
> files added by the commit.
> 
> Maybe your migration tool created a huge commit message, for example
> listing each of the files that was changed.


D'oh! Of course. I was so focused on that commit with the large number of
files that I missed that. Looking at the reference [1] reveals the
problem. Sorry for wasting your time!


> AFAIK this won't cause Git itself any problems, but it's likely to be
> inconvenient. For example, when you type `git log` and 7 million
> characters page by. Or when you use some GUI tool to view your history
> and it performs badly because it wasn't built to handle such enormous
> commit messages.


Thank you,
Lars


[git-sizer] Implications of a large commit object

2018-03-14 Thread Lars Schneider
Hi,

I am using Michael's fantastic Git repo analyzer tool "git-sizer" [*]
and it detected a very large commit of 7.33 MiB in my repo (see chart 
below).

This large commit is expected. I've imported that repo from another
version control system but excluded all binary files (e.g. images) and
some 3rd party components as their history is not important [**]. I've 
reintroduced these files in the head commit again. This is where the 
large commit came from.

This repo is not used in production yet but I wonder if this kind of
approach can cause trouble down the line? Are there any relevant
implication of a single large commit like this in history? 

Thanks,
Lars


 [*] https://github.com/github/git-sizer
[**] I know some of this stuff shouldn't be in the repo in the first 
 place, but I am constrained in the things I can change.


###
## git-sizer output

Processing blobs: 543782
Processing trees: 517104
Processing commits: 43365
Matching commits to trees: 43365
Processing annotated tags: 3
Processing references: 123
| Name | Value | Level of concern   |
|  | - | -- |
| Overall repository size  |   ||
| * Blobs  |   ||
|   * Total size   |  18.8 GiB | ** |
|  |   ||
| Biggest objects  |   ||
| * Commits|   ||
|   * Maximum size [1] |  7.33 MiB | !! |
| * Trees  |   ||
|   * Maximum entries  [2] |  6.84 k   | ** |
|  |   ||
| History structure|   ||
| * Maximum tag depth  [3] | 1 | *  |
|  |   ||
| Biggest checkouts|   ||
| * Number of directories  [4] |  21.9 k   | ** |
| * Maximum path depth [4] |18 | *  |
| * Maximum path length[5] |   225 B   | ** |
| * Number of files[4] |   256 k   | *  |
| * Total size of files[6] |  2.08 GiB | ** |





Re: [GSoC] [PATCH] travis-ci: added clang static analysis

2018-03-12 Thread Lars Schneider
Hi,

That looks interesting but I agree with Dscho that we should not limit
this to master/maint.

I assume you did run this on TravisCI already? Can you share a link?
I assume you did find errors? Can we fix them or are there too many?
If there are existing errors, how do we define a "successful" build?

Thanks for working on this,
Lars

> On 05 Mar 2018, at 21:04, SiddharthaMishra  wrote:
> 
> Added a job to run clang static code analysis on the master and maint branch
> 
> Signed-off-by: SiddharthaMishra 
> ---
> .travis.yml   | 17 -
> ci/run-static-analysis.sh |  9 -
> 2 files changed, 24 insertions(+), 2 deletions(-)
> 
> diff --git a/.travis.yml b/.travis.yml
> index 4684b3f4f..9b891d182 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -48,7 +48,7 @@ matrix:
>   before_install:
>   before_script:
>   script: ci/run-linux32-docker.sh
> -- env: jobname=StaticAnalysis
> +- env: jobname=CocciStaticAnalysis
>   os: linux
>   compiler:
>   addons:
> @@ -59,6 +59,21 @@ matrix:
>   before_script:
>   script: ci/run-static-analysis.sh
>   after_failure:
> +- if: branch IN (master, maint)
> +  env: jobname=ClangStaticAnalysis
> +  os: linux
> +  compiler:
> +  add_ons:
> +apt:
> +  sources:
> +  - ubuntu-toolchain-r-test
> +  - llvm-toolchain-trusty
> +  packages:
> +  - clang
> +  before_install:
> +  before_script:
> +  script: ci/run-static-analysis.sh
> +  after_failure:
> - env: jobname=Documentation
>   os: linux
>   compiler:
> diff --git a/ci/run-static-analysis.sh b/ci/run-static-analysis.sh
> index fe4ee4e06..6ae032f54 100755
> --- a/ci/run-static-analysis.sh
> +++ b/ci/run-static-analysis.sh
> @@ -5,6 +5,13 @@
> 
> . ${0%/*}/lib-travisci.sh
> 
> -make coccicheck
> +case "$jobname" in
> +ClangStaticAnalysis)
> + scan-build -analyze-headers --status-bugs make
> + ;;
> +CocciStaticAnalysis)
> + make coccicheck
> + ;;
> +esac
> 
> save_good_tree
> -- 
> 2.16.2.248.ge2408a6f7.dirty
> 



Re: [GSoC][PATCH] git-ci: use pylint to analyze the git-p4 code

2018-03-12 Thread Lars Schneider
Hi Viet,

> On 12 Mar 2018, at 03:20, Viet Hung Tran  wrote:
> 
> This is my submission as a microproject for the Google Summer of code.
> I apologize for not setting the [GSoC] in my previous email
> at <20180312020855.7950-1-viethtran1...@gmail.com>.
> Please ignore it.

No worries :-)

> Add a new job named Pylint to .travis.yml in order to analyze git-p4 Python 
> code.
> Although Travis CI have yet to implement continuous integration for multiple
> languages. Python package can be installed using apt packages. From there, 
> pylint can be
> installed using pip and used to analyze git-p4.py file.

That looks interesting!
I assume you did run this on TravisCI already? Can you share a link?
I assume you did find errors? Can we fix them or are there too many?
If there are existing errors, how do we define a "successful" build?

Thanks for working on this,
Lars

> 
> Signed-off-by: Viet Hung Tran 
> ---
> .travis.yml | 10 ++
> 1 file changed, 10 insertions(+)
> 
> diff --git a/.travis.yml b/.travis.yml
> index 5f5ee4f3b..581d75319 100644
> --- a/.travis.yml
> +++ b/.travis.yml
> @@ -46,6 +46,16 @@ matrix:
> - docker
>   before_install:
>   script: ci/run-linux32-docker.sh
> +- env: jobname=Pylint
> +  compiler:
> +  addons:
> +apt:
> +  packages:
> +  - python
> +  before_install:
> +  install: pip install --user pylint
> +  script: pylint git-p4.py
> +  after_failure:
> - env: jobname=StaticAnalysis
>   os: linux
>   compiler:
> -- 
> 2.16.2.440.gc6284da
> 



Re: [PATCH v11 07/10] convert: check for detectable errors in UTF encodings

2018-03-09 Thread Lars Schneider

> On 09 Mar 2018, at 20:00, Junio C Hamano  wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> +const char *advise_msg = _(
>> +"The file '%s' contains a byte order "
>> +"mark (BOM). Please use %.6s as "
>> +"working-tree-encoding.");
> 
> I know that this will go away in a later step, but why ".6"?

I deleted the original comment in the rebase, sorry:

/*
 * This advice is shown for UTF-??BE and UTF-??LE
 * encodings. We truncate the encoding name to 6
 * chars with %.6s to cut off the last two "byte
 * order" characters.
 */

- Lars


[PATCH v11 04/10] utf8: add function to detect prohibited UTF-16/32 BOM

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 utf8.c | 26 ++
 utf8.h |  9 +
 2 files changed, 35 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..e4b99580f0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,32 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+ const char *bom, size_t bom_len)
+{
+   return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+ (!strcasecmp(enc, "UTF-16BE") || !strcasecmp(enc, "UTF-16LE") ||
+  !strcasecmp(enc, "UTF16BE") || !strcasecmp(enc, "UTF16LE")) &&
+ (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+ (!strcasecmp(enc, "UTF-32BE") || !strcasecmp(enc, "UTF-32LE") ||
+  !strcasecmp(enc, "UTF32BE") || !strcasecmp(enc, "UTF32LE")) &&
+ (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..0db1db4519 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int 
width,
   const char *s);
 
+/*
+ * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
+ * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
+ * The function returns true if this rule is violated.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2



[PATCH v11 07/10] convert: check for detectable errors in UTF encodings

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Check that new content is valid with respect to the user defined
'working-tree-encoding' attribute.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 convert.c| 48 ++
 t/t0028-working-tree-encoding.sh | 56 
 2 files changed, 104 insertions(+)

diff --git a/convert.c b/convert.c
index aa59ecfe49..b80d666a6b 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,51 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 
 }
 
+static int validate_encoding(const char *path, const char *enc,
+ const char *data, size_t len, int die_on_error)
+{
+   /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
+   if (istarts_with(enc, "UTF")) {
+   /*
+* Check for detectable errors in UTF encodings
+*/
+   if (has_prohibited_utf_bom(enc, data, len)) {
+   const char *error_msg = _(
+   "BOM is prohibited in '%s' if encoded as %s");
+   /*
+* This advice is shown for UTF-??BE and UTF-??LE 
encodings.
+*/
+   const char *advise_msg = _(
+   "The file '%s' contains a byte order "
+   "mark (BOM). Please use %.6s as "
+   "working-tree-encoding.");
+   advise(advise_msg, path, enc);
+   if (die_on_error)
+   die(error_msg, path, enc);
+   else {
+   return error(error_msg, path, enc);
+   }
+
+   } else if (is_missing_required_utf_bom(enc, data, len)) {
+   const char *error_msg = _(
+   "BOM is required in '%s' if encoded as %s");
+   const char *advise_msg = _(
+   "The file '%s' is missing a byte order "
+   "mark (BOM). Please use %sBE or %sLE "
+   "(depending on the byte order) as "
+   "working-tree-encoding.");
+   advise(advise_msg, path, enc, enc);
+   if (die_on_error)
+   die(error_msg, path, enc);
+   else {
+   return error(error_msg, path, enc);
+   }
+   }
+
+   }
+   return 0;
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -291,6 +336,9 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
if (!buf && !src)
return 1;
 
+   if (validate_encoding(path, enc, src, src_len, die_on_error))
+   return 0;
+
dst = reencode_string_len(src, src_len, default_encoding, enc,
  _len);
if (!dst) {
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e492945a01..e8408dfe5c 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' 
'
 
 for i in 16 32
 do
+   test_expect_success "check prohibited UTF-${i} BOM" '
+   test_when_finished "git reset --hard HEAD" &&
+
+   echo "*.utf${i}be text working-tree-encoding=utf-${i}be" 
>>.gitattributes &&
+   echo "*.utf${i}le text working-tree-encoding=utf-${i}LE" 
>>.gitattributes &&
+
+   # Here we add a UTF-16 (resp. UTF-32) files with BOM 
(big/little-endian)
+   # but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. 
UTF-32).
+   # In these cases the BOM is prohibited.
+   cp bebom.utf${i}be.raw bebom.utf${i}be &&
+   test_must_fail git add bebom.utf${i}be 2>err.out &&
+   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out 
&&
+
+   cp lebom.utf${i}le.raw lebom.utf${i}be &&
+   test_must_fail git add lebom.utf${i}be 2>err.out &&
+   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out 
&&
+
+   cp bebom.utf${i}be.raw bebom.utf${i}le &&
+   test_must_fail git add bebom.utf${i}le 2>err.out &&
+   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out 
&&
+

[PATCH v11 09/10] convert: add tracing for 'working-tree-encoding' attribute

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 convert.c| 25 +
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index 9a3ae7cce1..d739078016 100644
--- a/convert.c
+++ b/convert.c
@@ -324,6 +324,29 @@ static int validate_encoding(const char *path, const char 
*enc,
return 0;
 }
 
+static void trace_encoding(const char *context, const char *path,
+  const char *encoding, const char *buf, size_t len)
+{
+   static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+   struct strbuf trace = STRBUF_INIT;
+   int i;
+
+   strbuf_addf(, "%s (%s, considered %s):\n", context, path, 
encoding);
+   for (i = 0; i < len && buf; ++i) {
+   strbuf_addf(
+   ,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+   i,
+   (unsigned char) buf[i],
+   (buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+   ((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+   );
+   }
+   strbuf_addchars(, '\n', 1);
+
+   trace_strbuf(, );
+   strbuf_release();
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -352,6 +375,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
if (validate_encoding(path, enc, src, src_len, die_on_error))
return 0;
 
+   trace_encoding("source", path, enc, src, src_len);
dst = reencode_string_len(src, src_len, default_encoding, enc,
  _len);
if (!dst) {
@@ -369,6 +393,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
return 0;
}
}
+   trace_encoding("destination", path, default_encoding, dst, dst_len);
 
strbuf_attach(buf, dst, dst_len, dst_len + 1);
return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 1bee7b9f71..f68e282c5e 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via 
gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test files' '
git config core.eol lf &&
 
-- 
2.16.2



[PATCH v11 06/10] convert: add 'working-tree-encoding' attribute

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 Documentation/gitattributes.txt  |  80 ++
 convert.c| 114 ++-
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 144 +++
 5 files changed, 339 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..31a4f92840 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,86 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  For example, Microsoft Visual Studio resources files (`*.rc`) or
+  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.ps1` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.ps1' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+
+*.ps1  text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.ps1' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+
+*.ps1  text working-tree-encoding=UTF-16LE eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+
+file foo.ps1
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..aa59ecfe49 100644
--- a/conver

[PATCH v11 03/10] strbuf: add a case insensitive starts_with()

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Check in a case insensitive manner if one string is a prefix of another
string.

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 git-compat-util.h | 1 +
 strbuf.c  | 9 +
 2 files changed, 10 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index 68b2ad531e..95c9b34832 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, 
va_list params);
 extern void set_die_is_recursing_routine(int (*routine)(void));
 
 extern int starts_with(const char *str, const char *prefix);
+extern int istarts_with(const char *str, const char *prefix);
 
 /*
  * If the string "str" begins with the string found in "prefix", return 1.
diff --git a/strbuf.c b/strbuf.c
index b635f0bdc4..99812b8488 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix)
return 0;
 }
 
+int istarts_with(const char *str, const char *prefix)
+{
+   for (; ; str++, prefix++)
+   if (!*prefix)
+   return 1;
+   else if (tolower(*str) != tolower(*prefix))
+   return 0;
+}
+
 int skip_to_optional_arg_default(const char *str, const char *prefix,
 const char **arg, const char *def)
 {
-- 
2.16.2



[PATCH v11 05/10] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
 Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 utf8.c | 13 +
 utf8.h | 19 +++
 2 files changed, 32 insertions(+)

diff --git a/utf8.c b/utf8.c
index e4b99580f0..81c6678df1 100644
--- a/utf8.c
+++ b/utf8.c
@@ -564,6 +564,19 @@ int has_prohibited_utf_bom(const char *enc, const char 
*data, size_t len)
);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+  (!strcasecmp(enc, "UTF-16") || !strcasecmp(enc, "UTF16")) &&
+  !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+  (!strcasecmp(enc, "UTF-32") || !strcasecmp(enc, "UTF32")) &&
+  !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 0db1db4519..cce654a64a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,23 @@ void strbuf_utf8_align(struct strbuf *buf, align_type 
position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there in no
+ * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
+ * used in HTML5 recommends to assume little-endian to "deal with
+ * deployed content" [3].
+ *
+ * Therefore, strictly requiring a BOM seems to be the safest option for
+ * content in Git.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2



[PATCH v11 01/10] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
result = xmallocz(len);
for (i = 0; i < len; i++)
result[i] = tolower(string[i]);
-   result[i] = '\0';
return result;
 }
 
-- 
2.16.2



[PATCH v11 08/10] convert: advise canonical UTF encoding names

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

The canonical name of an UTF encoding has the format UTF, dash, number,
and an optionally byte order in upper case (e.g. UTF-8 or UTF-16BE).
Some iconv versions support alternative names without a dash or with
lower case characters.

To avoid problems between different iconv version always suggest the
canonical UTF names in advise messages.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 convert.c| 21 +
 t/t0028-working-tree-encoding.sh | 10 --
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/convert.c b/convert.c
index b80d666a6b..9a3ae7cce1 100644
--- a/convert.c
+++ b/convert.c
@@ -279,12 +279,20 @@ static int validate_encoding(const char *path, const char 
*enc,
"BOM is prohibited in '%s' if encoded as %s");
/*
 * This advice is shown for UTF-??BE and UTF-??LE 
encodings.
+* We cut off the last two characters of the encoding 
name
+# to generate the encoding name suitable for BOMs.
 */
const char *advise_msg = _(
"The file '%s' contains a byte order "
-   "mark (BOM). Please use %.6s as "
+   "mark (BOM). Please use UTF-%s as "
"working-tree-encoding.");
-   advise(advise_msg, path, enc);
+   const char *stripped = "";
+   char *upper = xstrdup_toupper(enc);
+   upper[strlen(upper)-2] = '\0';
+   if (!skip_prefix(upper, "UTF-", ))
+   skip_prefix(stripped, "UTF", );
+   advise(advise_msg, path, stripped);
+   free(upper);
if (die_on_error)
die(error_msg, path, enc);
else {
@@ -296,10 +304,15 @@ static int validate_encoding(const char *path, const char 
*enc,
"BOM is required in '%s' if encoded as %s");
const char *advise_msg = _(
"The file '%s' is missing a byte order "
-   "mark (BOM). Please use %sBE or %sLE "
+   "mark (BOM). Please use UTF-%sBE or UTF-%sLE "
"(depending on the byte order) as "
"working-tree-encoding.");
-   advise(advise_msg, path, enc, enc);
+   const char *stripped = "";
+   char *upper = xstrdup_toupper(enc);
+   if (!skip_prefix(upper, "UTF-", ))
+   skip_prefix(stripped, "UTF", );
+   advise(advise_msg, path, stripped, stripped);
+   free(upper);
if (die_on_error)
die(error_msg, path, enc);
else {
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index e8408dfe5c..1bee7b9f71 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -74,18 +74,22 @@ do
cp bebom.utf${i}be.raw bebom.utf${i}be &&
test_must_fail git add bebom.utf${i}be 2>err.out &&
test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out 
&&
+   test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out &&
 
cp lebom.utf${i}le.raw lebom.utf${i}be &&
test_must_fail git add lebom.utf${i}be 2>err.out &&
test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out 
&&
+   test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out &&
 
cp bebom.utf${i}be.raw bebom.utf${i}le &&
test_must_fail git add bebom.utf${i}le 2>err.out &&
test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out 
&&
+   test_i18ngrep "use UTF-${i} as working-tree-encoding" err.out &&
 
cp lebom.utf${i}le.raw lebom.utf${i}le &&
test_must_fail git add lebom.utf${i}le 2>err.out &&
-   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out
+   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}LE" err.out 
&&
+   test_i18ngrep "use

[PATCH v11 10/10] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] 
https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 Documentation/config.txt |  6 +++
 Documentation/gitattributes.txt  |  8 
 config.c |  5 +++
 convert.c| 79 +++-
 convert.h|  1 +
 environment.c|  1 +
 t/t0028-working-tree-encoding.sh | 39 
 7 files changed, 138 insertions(+), 1 deletion(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..7dcac9b540 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
This variable can be set to 'input',
in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+   A comma and/or whitespace separated list of encodings that Git
+   performs UTF-8 round trip checks on if they are used in an
+   `working-tree-encoding` attribute (see linkgit:gitattributes[5]).
+   The default value is `SHIFT-JIS`.
+
 core.symlinks::
If false, symbolic links are checked out as small plain files that
contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 31a4f92840..aa3deae392 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -312,6 +312,14 @@ number of pitfalls:
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
   That operation will fail and cause an error.
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.checkroundtripencoding")) {
+   check_roundtrip_encoding = xstrdup(value);
+   return 0;
+   }
+
if (!strcmp(var, "core.notesref")) {
notes_ref_name = xstrdup(value);
return 0;
diff --git a/convert.c b/convert.c
index d739078016..c2d24882c1 100644
--- a/convert.c
+++ b/convert.c
@@ -347,6 +347,42 @@ static void trace_encoding(const char *context, const char 
*path,
strbuf_release();
 }
 
+static int check_roundtrip(const char *enc_name)
+{
+   /*
+* check_roundtrip_encoding contains a string of comma and/or
+* space separated encodings (eg. "UTF-16, ASCII, CP1125").
+* Search for the given encoding in that string.
+*/
+   const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+   const char *next;
+   int len;
+   if (!found)
+   return 0;
+   next = found + strlen(enc_name);
+   len = strlen(check_roundtrip_encoding);
+   return (found && (
+   /*
+* check that the found encoding is at the
+* beginning of check_roundtrip_encoding or
+* that it is prefixed with a space or comma
+*/
+   found == check_roundtrip_encoding || (
+   (isspace(found[-1]) || found[-1] == ',')
+   )
+   ) && (
+   /*
+* check that the found encoding is at the
+* end of check_roundtrip_encoding or
+* that it is suffixed with a space or comma
+*/
+   next == check_roundtrip_encoding + len || (
+   next < check_roundtrip_enco

[PATCH v11 02/10] strbuf: add xstrdup_toupper()

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 strbuf.c | 12 
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+   char *result;
+   size_t len, i;
+
+   len = strlen(string);
+   result = xmallocz(len);
+   for (i = 0; i < len; i++)
+   result[i] = toupper(string[i]);
+   return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.2



[PATCH v11 00/10] convert: add support for different encodings

2018-03-09 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Hi,

Patches 1-5,9 are preparation and helper functions.
Patch 6-8,10 are the actual change. Patch 8 is new.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.

Changes since v10:

* rename startscase_with() to istarts_with() (Duy)
* validate_encoding() advises the canonical form of the UTF
  encoding name to the user (Junio)
  --> I added it this as a separate commit that you could be dropped
  if desired by the reviewers.
* fix documentation for roundtrip check (Junio)
* use isspace() to check whitespace/tab delimiter
  in core.checkRoundtripEncoding (Junio)
* remove dead code in roundtrip check (Junio)
* fix invalid # in comment (Eric)
* detect UTF8 and UTF-8 as default encoding (Eric)
* make asterisk stick to the variable, not type (Junio)
* print an error if "w-t-e" does not have a proper value (Junio)
  --> BTW: I noticed that the attribute is not set to "git_attr__false"
  even if I define "-working-tree-encoding". I haven't investigated
  further yet. Might that be a bug? If yes, then this should be
  addresses in a separate patch series.

Thanks,
Lars


  RFC: 
https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/
   v1: 
https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/
   v2: 
https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/
   v3: 
https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/
   v4: 
https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/
   v6: 
https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/
   v7: 
https://public-inbox.org/git/20180215152711.158-1-lars.schnei...@autodesk.com/
   v8: 
https://public-inbox.org/git/20180224162801.98860-1-lars.schnei...@autodesk.com/
   v9: 
https://public-inbox.org/git/20180304201418.60958-1-lars.schnei...@autodesk.com/
  v10: 
https://public-inbox.org/git/20180307173026.30058-1-lars.schnei...@autodesk.com/

Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/afc02ce2e0
Checkout: git fetch https://github.com/larsxschneider/git encoding-v11 && git 
checkout afc02ce2e0


### Interdiff (v10..v11):

diff --git a/Documentation/config.txt b/Documentation/config.txt
index d7a56054a5..7dcac9b540 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -531,10 +531,10 @@ core.autocrlf::
in which case no output conversion is performed.

 core.checkRoundtripEncoding::
-   A comma separated list of encodings that Git performs UTF-8 round
-   trip checks on if they are used in an `working-tree-encoding`
-   attribute (see linkgit:gitattributes[5]). The default value is
-   `SHIFT-JIS`.
+   A comma and/or whitespace separated list of encodings that Git
+   performs UTF-8 round trip checks on if they are used in an
+   `working-tree-encoding` attribute (see linkgit:gitattributes[5]).
+   The default value is `SHIFT-JIS`.

 core.symlinks::
If false, symbolic links are checked out as small plain files that
diff --git a/convert.c b/convert.c
index e861f1abbc..c2d24882c1 100644
--- a/convert.c
+++ b/convert.c
@@ -270,7 +270,7 @@ static int validate_encoding(const char *path, const char 
*enc,
  const char *data, size_t len, int die_on_error)
 {
/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
-   if (startscase_with(enc, "UTF")) {
+   if (istarts_with(enc, "UTF")) {
/*
 * Check for detectable errors in UTF encodings
 */
@@ -284,12 +284,15 @@ static int validate_encoding(const char *path, const char 
*enc,
 */
const char *advise_msg = _(
"The file '%s' contains a byte order "
-   "mark (BOM). Please use %s as "
+   "mark (BOM). Please use UTF-%s as "
"working-tree-encoding.");
-   char *upper_enc = xstrdup_toupper(enc);
-   upper_enc[strlen(upper_enc)-2] = '\0';
-   advise(advise_msg, path, upper_enc);
-   free(upper_enc);
+   const char *stripped = "";
+   char *upper = xstrdup_toupper(enc);
+   upper[strlen(upper)-2] = '\0';
+   if (!skip_prefix(upper, "UTF-", ))
+   skip_prefix(stripped, "UTF", );
+   advise(advise_msg, path, stripped);
+   free(upper);
  

Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-09 Thread Lars Schneider

> On 07 Mar 2018, at 19:04, Eric Sunshine <sunsh...@sunshineco.com> wrote:
> 
> On Wed, Mar 7, 2018 at 12:30 PM,  <lars.schnei...@autodesk.com> wrote:
>> Check that new content is valid with respect to the user defined
>> 'working-tree-encoding' attribute.
>> 
>> Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct 
>> text_stat *stats,
>> +static int validate_encoding(const char *path, const char *enc,
>> + const char *data, size_t len, int die_on_error)
>> +{
>> +   /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
>> +   if (startscase_with(enc, "UTF")) {
>> +   /*
>> +* Check for detectable errors in UTF encodings
>> +*/
>> +   if (has_prohibited_utf_bom(enc, data, len)) {
>> +   const char *error_msg = _(
>> +   "BOM is prohibited in '%s' if encoded as 
>> %s");
>> +   /*
>> +* This advice is shown for UTF-??BE and UTF-??LE 
>> encodings.
>> +* We cut off the last two characters of the 
>> encoding name
>> +# to generate the encoding name suitable for BOMs.
> 
> s/#/*/

Of course!


>> diff --git a/t/t0028-working-tree-encoding.sh 
>> b/t/t0028-working-tree-encoding.sh
>> @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes 
>> support' '
>> for i in 16 32
>> do
>> +   test_expect_success "check prohibited UTF-${i} BOM" '
>> +   test_when_finished "git reset --hard HEAD" &&
>> +
>> +   echo "*.utf${i}be text working-tree-encoding=utf-${i}be" 
>> >>.gitattributes &&
>> +   echo "*.utf${i}le text working-tree-encoding=utf-${i}le" 
>> >>.gitattributes &&
> 
> v10 is checking only hyphenated lowercase encoding name; earlier
> versions checked uppercase. For better coverage, it would be nice to
> check several combinations: all uppercase, all lowercase, mixed case,
> hyphenated, not hyphenated.
> 
> I'm not suggesting running all the tests repeatedly but rather just
> varying the format of the encoding name in these tests you're adding.
> For instance, the above could instead be:
> 
>echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes 
> &&
>echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes &&
> 
> or something.

The casing is a good idea - I will do that. I don't want to do "hyphenated, not 
hyphenated" as this would make the tests fail on macOS (and I believe on 
Windows).


Thanks,
Lars

Re: [PATCH v10 3/9] strbuf: add a case insensitive starts_with()

2018-03-09 Thread Lars Schneider

> On 09 Mar 2018, at 00:12, Junio C Hamano  wrote:
> 
> Duy Nguyen  writes:
> 
>>> extern int starts_with(const char *str, const char *prefix);
>>> +extern int startscase_with(const char *str, const char *prefix);
>> 
>> This name is a bit hard to read. Boost [1] goes with istarts_with. I
>> wonder if it's better. If not I guess either starts_with_case or
>> starts_case_with will improve readability.
> 
> starts_with_case() sounds quite strange even though
> starts_with_icase() may make it clear that it is "a variant of
> starts_with() function that ignores case".  I dunno.
> dir.c::cmp_icase() takes the _icase suffix for not quite the way
> that is consistent with that understanding, though.

I think following the boost lib makes most sense. Therefore,
I would like to go with "istarts_with". OK with you?

- Lars


Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Lars Schneider

> On 07 Mar 2018, at 23:57, Junio C Hamano <gits...@pobox.com> wrote:
> 
> Lars Schneider <larsxschnei...@gmail.com> writes:
> 
>> At this point I thought it would make sense to make the advised
>> encoding name uppercase in both situations. OK with you?
> 
> In the endgame, if upcased and properly dashed form is always used,
> that would be good (if we are enforcing the policy, which I am not
> onboard 100% but it's your code and I do not care too strongly about
> it).  I do not see much point in an interim step that only upcases
> without doing the dash insertion.

I would like to advise the dashed form as this seems to be the
canonical form and it avoids cross platform issues. My macOS
iconv does not support the form without dashes.

Would this approach work for you?

const char *advise_msg = _(
"The file '%s' contains a byte order "
"mark (BOM). Please use UTF-%s as "
"working-tree-encoding.");
const char *stripped;
char *upper = xstrdup_toupper(enc);
upper[strlen(upper)-2] = '\0';
skip_prefix(upper, "UTF-", ) ||
skip_prefix(stripped, "UTF", );
advise(advise_msg, path, stripped);

Thanks,
Lars


Re: [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-03-07 Thread Lars Schneider

> On 07 Mar 2018, at 23:52, Junio C Hamano <gits...@pobox.com> wrote:
> 
> Lars Schneider <larsxschnei...@gmail.com> writes:
> 
>> I don't think HT makes too much sense. However, isspace() is nice
>> and I will use it. Being more permissive on the inputs should hurt.
> 
> You are being incoherent in these three sentences.  If you want to
> be more strict and only allow SP, then isspace() is already too
> broad, as it does allow HT (and even CR and LF).

I meant "Being more permissive on the inputs shouldN'T hurt." :-)


> I do think HT makes just as much sense as SP, though, so if you use
> isspace(), that would be perfectly fine.

OK!

- Lars



Re: [PATCH v10 6/9] convert: add 'working-tree-encoding' attribute

2018-03-07 Thread Lars Schneider

> On 07 Mar 2018, at 18:54, Eric Sunshine <sunsh...@sunshineco.com> wrote:
> 
> On Wed, Mar 7, 2018 at 12:30 PM,  <lars.schnei...@autodesk.com> wrote:
>> [...]
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>> reverse the conversion.
>> 
>> Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
>> ---
>> Documentation/gitattributes.txt  |  80 +++
>> diff --git a/convert.c b/convert.c
>> @@ -265,6 +266,78 @@ static int will_convert_lf_to_crlf(size_t len, struct 
>> text_stat *stats,
>> +static const char *default_encoding = "UTF-8";
>> @@ -978,6 +1051,21 @@ static int ident_to_worktree(const char *path, const 
>> char *src, size_t len,
>> +static const char *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +   const char *value = check->value;
>> +
>> +   if (ATTR_TRUE(value) || ATTR_FALSE(value) || ATTR_UNSET(value) ||
>> +   !strlen(value))
>> +   return NULL;
>> +
>> +   /* Don't encode to the default encoding */
>> +   if (!strcasecmp(value, default_encoding))
>> +   return NULL;
> 
> As of v10, the rest of the code accepts encoding names "UTF-xx" and
> "UTFxx" (case insensitive), however, this check recognizes only
> "UTF-8" (case insensitive). For consistency, one would expect this
> also to recognize "UTF8" (case insensitive).

Nice catch. What do you think about this solution using is_encoding_utf8()
from utf.c?

if (is_encoding_utf8(value) && is_encoding_utf8(default_encoding)) 

- Lars

Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Lars Schneider

> On 07 Mar 2018, at 23:32, Junio C Hamano <gits...@pobox.com> wrote:
> 
> Lars Schneider <larsxschnei...@gmail.com> writes:
> 
>> I also would have liked to advise "UTF-16" instead of "UTF16" as
>> you suggested. However, that required a few more lines and I wanted
>> to keep the change to a minimum. I feel this could be added in a
>> follow up patch.
> 
> I'd say the whole upcase thing belongs to such a follow-up patch if
> that is the case.
> 
>>> On the other hand, if we are not enforcing such a policy decision
>>> but merely explaining a way to work around this check, then it may
>>> be better to give a variant with the smaller difference from the
>>> original (i.e. without up-casing).
>> 
>> See example mentioned above: "Utf-16". How would you handle that?
> 
> Dropping LE suffix from "Utf-16LE" or "Utf16LE" would yield "Utf-16"
> or "Utf16" if the advise message does not force policy, or "UTF-16"
> in the canoical form if it does.  Is there a problem?

In the case of has_prohibited_utf_bom() you are right as we are 
dropping the BE/LE suffix in the advise. However, look at the 
is_missing_required_utf_bom() advise. Here we *add* BE/LE.

At this point I thought it would make sense to make the advised
encoding name uppercase in both situations. OK with you?

- Lars






Re: [PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-03-07 Thread Lars Schneider

> On 07 Mar 2018, at 20:59, Junio C Hamano  wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> +static int check_roundtrip(const char* enc_name)
> 
> The asterisk sticks to the variable, not type.

Argh. I need to put this check into Travis CI ;-)


>> +{
>> +/*
>> + * check_roundtrip_encoding contains a string of space and/or
>> + * comma separated encodings (eg. "UTF-16, ASCII, CP1125").
>> + * Search for the given encoding in that string.
>> + */
>> +const char *found = strcasestr(check_roundtrip_encoding, enc_name);
>> +const char *next;
>> +int len;
>> +if (!found)
>> +return 0;
>> +next = found + strlen(enc_name);
>> +len = strlen(check_roundtrip_encoding);
>> +return (found && (
>> +/*
>> + * check that the found encoding is at the
>> + * beginning of check_roundtrip_encoding or
>> + * that it is prefixed with a space or comma
>> + */
>> +found == check_roundtrip_encoding || (
>> +found > check_roundtrip_encoding &&
>> +(*(found-1) == ' ' || *(found-1) == ',')
>> +)
> 
> The second line is unneeded, as we know a non-NULL found either
> points at check_roundtrip_encoding or somewhere to the right, and
> the first test already checked the "points exactly at" case.

Eric objected that too [1], but noted the documentation value:

Although the 'found > check_roundtrip_encoding' expression is
effectively dead code in this case, it does document that the
programmer didn't just blindly use '*(found-1)' without taking
boundary conditions into consideration.

(It's dead code because, at this point, we know that strcasestr()
didn't return NULL and we know that 'found' doesn't equal
'check_roundtrip_encoding', therefore it _must_ be greater than
'check_roundtrip_encoding'.)

[1] 
https://public-inbox.org/git/CAPig+cR81J3fTOtrgAumAs=rc5hqyffsmeb-ru-yf_ahfub...@mail.gmail.com/

Although the line is unnecessary, I felt it is safer/easier to 
understand and maintain. Since both of you tripped over it, I will
remove it though.


> This is defined to be a comma separated list, so it is unnecessary
> to accept  == <"FOO, SHIFT-JIS, BAR", "SHIFT-JIS">; if you
> allow SP, perhaps "isspace(found[-1]) || found[-1] == ','" to also
> allow HT may also be appropriate.

I don't think HT makes too much sense. However, isspace() is nice
and I will use it. Being more permissive on the inputs should hurt.


>  I think "comma or whitespace
> separated list" is fine; in any case, the comment near the beginning
> of this function does not match new text in Documentation/config.txt
> added by this patch.

Nice catch. Will fix.


Thanks,
Lars



Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread Lars Schneider

> On 07 Mar 2018, at 20:49, Junio C Hamano  wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> +static int validate_encoding(const char *path, const char *enc,
>> +  const char *data, size_t len, int die_on_error)
>> +{
>> +/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
>> +if (startscase_with(enc, "UTF")) {
>> +/*
>> + * Check for detectable errors in UTF encodings
>> + */
>> +if (has_prohibited_utf_bom(enc, data, len)) {
>> +const char *error_msg = _(
>> +"BOM is prohibited in '%s' if encoded as %s");
>> +/*
>> + * This advice is shown for UTF-??BE and UTF-??LE 
>> encodings.
>> + * We cut off the last two characters of the encoding 
>> name
>> + # to generate the encoding name suitable for BOMs.
>> + */
> 
> Yuck.  The code pretends to abstract away the details in a helper
> has_prohibited_x() yet the caller still knows quite a lot.

True, but has_prohibited_x() cannot create a proper error/advise
message unless we give it more parameters (e.g. path name).
Therefore, I don't see a better way right now.


>> +const char *advise_msg = _(
>> +"The file '%s' contains a byte order "
>> +"mark (BOM). Please use %s as "
>> +"working-tree-encoding.");
>> +char *upper_enc = xstrdup_toupper(enc);
>> +upper_enc[strlen(upper_enc)-2] = '\0';
>> +advise(advise_msg, path, upper_enc);
>> +free(upper_enc);
> 
> I think this up-casing is more problematic than without, not from
> the point of view of the internal code, but from the point of view
> of the end user experience.  When the user writes utf16le or
> utf-16le and the data does not trigger the BOM check, we are likely
> to successfully convert it.  I do not see the merit of suggesting
> UTF16 or UTF-16 in such a case, over telling them to just drop the
> byte-order suffix from the encoding names (i.e. utf16 or utf-16).
> 
> If you are trying to force/nudge people in the direction of
> canonical way of spelling things (which may not be a bad idea), then
> "utf16le" as the original input would want to result in "UTF-16"
> with dash in the advise, no?

Correct. In the error messages I kept the encoding name "as-is" and
only in the advise message I used the uppercase variant to steer
people into the canonical direction. My initial reason for this was
that in is_missing_required_utf_bom() we add "BE/LE" to the encoding
in the advise message. Let's say the user used "Utf-16" as encoding.
 Should "BE/LE" be upper case or lower case? To avoid that question 
I made it always upper case.

I also would have liked to advise "UTF-16" instead of "UTF16" as
you suggested. However, that required a few more lines and I wanted
to keep the change to a minimum. I feel this could be added in a
follow up patch.


> On the other hand, if we are not enforcing such a policy decision
> but merely explaining a way to work around this check, then it may
> be better to give a variant with the smaller difference from the
> original (i.e. without up-casing).

See example mentioned above: "Utf-16". How would you handle that?


Thanks,
Lars



[PATCH v10 8/9] convert: add tracing for 'working-tree-encoding' attribute

2018-03-07 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 convert.c| 25 +
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index f19a15dd02..5776dfbc99 100644
--- a/convert.c
+++ b/convert.c
@@ -318,6 +318,29 @@ static int validate_encoding(const char *path, const char 
*enc,
return 0;
 }
 
+static void trace_encoding(const char *context, const char *path,
+  const char *encoding, const char *buf, size_t len)
+{
+   static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+   struct strbuf trace = STRBUF_INIT;
+   int i;
+
+   strbuf_addf(, "%s (%s, considered %s):\n", context, path, 
encoding);
+   for (i = 0; i < len && buf; ++i) {
+   strbuf_addf(
+   ,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+   i,
+   (unsigned char) buf[i],
+   (buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+   ((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+   );
+   }
+   strbuf_addchars(, '\n', 1);
+
+   trace_strbuf(, );
+   strbuf_release();
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -346,6 +369,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
if (validate_encoding(path, enc, src, src_len, die_on_error))
return 0;
 
+   trace_encoding("source", path, enc, src, src_len);
dst = reencode_string_len(src, src_len, default_encoding, enc,
  _len);
if (!dst) {
@@ -363,6 +387,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
return 0;
}
}
+   trace_encoding("destination", path, default_encoding, dst, dst_len);
 
strbuf_attach(buf, dst, dst_len, dst_len + 1);
return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index aa05f82166..6f3a82f61b 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via 
gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test files' '
git config core.eol lf &&
 
-- 
2.16.2



[PATCH v10 6/9] convert: add 'working-tree-encoding' attribute

2018-03-07 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Git recognizes files encoded with ASCII or one of its supersets (e.g.
UTF-8 or ISO-8859-1) as text files. All other encodings are usually
interpreted as binary and consequently built-in Git text processing
tools (e.g. 'git diff') as well as most Git web front ends do not
visualize the content.

Add an attribute to tell Git what encoding the user has defined for a
given file. If the content is added to the index, then Git converts the
content to a canonical UTF-8 representation. On checkout Git will
reverse the conversion.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 Documentation/gitattributes.txt  |  80 +++
 convert.c| 110 +++-
 convert.h|   1 +
 sha1_file.c  |   2 +-
 t/t0028-working-tree-encoding.sh | 133 +++
 5 files changed, 324 insertions(+), 2 deletions(-)
 create mode 100755 t/t0028-working-tree-encoding.sh

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 30687de81a..31a4f92840 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -272,6 +272,86 @@ few exceptions.  Even though...
   catch potential problems early, safety triggers.
 
 
+`working-tree-encoding`
+^^^
+
+Git recognizes files encoded in ASCII or one of its supersets (e.g.
+UTF-8, ISO-8859-1, ...) as text files. Files encoded in certain other
+encodings (e.g. UTF-16) are interpreted as binary and consequently
+built-in Git text processing tools (e.g. 'git diff') as well as most Git
+web front ends do not visualize the contents of these files by default.
+
+In these cases you can tell Git the encoding of a file in the working
+directory with the `working-tree-encoding` attribute. If a file with this
+attribute is added to Git, then Git reencodes the content from the
+specified encoding to UTF-8. Finally, Git stores the UTF-8 encoded
+content in its internal data structure (called "the index"). On checkout
+the content is reencoded back to the specified encoding.
+
+Please note that using the `working-tree-encoding` attribute may have a
+number of pitfalls:
+
+- Alternative Git implementations (e.g. JGit or libgit2) and older Git
+  versions (as of March 2018) do not support the `working-tree-encoding`
+  attribute. If you decide to use the `working-tree-encoding` attribute
+  in your repository, then it is strongly recommended to ensure that all
+  clients working with the repository support it.
+
+  For example, Microsoft Visual Studio resources files (`*.rc`) or
+  PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
+  If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
+  a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
+  stored as UTF-8 internally. A client without `working-tree-encoding`
+  support will checkout `foo.ps1` as UTF-8 encoded file. This will
+  typically cause trouble for the users of this file.
+
+  If a Git client, that does not support the `working-tree-encoding`
+  attribute, adds a new file `bar.ps1`, then `bar.ps1` will be
+  stored "as-is" internally (in this example probably as UTF-16).
+  A client with `working-tree-encoding` support will interpret the
+  internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
+  That operation will fail and cause an error.
+
+- Reencoding content requires resources that might slow down certain
+  Git operations (e.g 'git checkout' or 'git add').
+
+Use the `working-tree-encoding` attribute only if you cannot store a file
+in UTF-8 encoding and if you want Git to be able to process the content
+as text.
+
+As an example, use the following attributes if your '*.ps1' files are
+UTF-16 encoded with byte order mark (BOM) and you want Git to perform
+automatic line ending conversion based on your platform.
+
+
+*.ps1  text working-tree-encoding=UTF-16
+
+
+Use the following attributes if your '*.ps1' files are UTF-16 little
+endian encoded without BOM and you want Git to use Windows line endings
+in the working directory. Please note, it is highly recommended to
+explicitly define the line endings with `eol` if the `working-tree-encoding`
+attribute is used to avoid ambiguity.
+
+
+*.ps1  text working-tree-encoding=UTF-16LE eol=CRLF
+
+
+You can get a list of all available encodings on your platform with the
+following command:
+
+
+iconv --list
+
+
+If you do not know the encoding of a file, then you can use the `file`
+command to guess the encoding:
+
+
+file foo.ps1
+
+
+
 `ident`
 ^^^
 
diff --git a/convert.c b/convert.c
index b976eb968c..80549ff2b5 100644
--- a/conver

[PATCH v10 5/9] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-07 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

If the endianness is not defined in the encoding name, then let's
be strict and require a BOM to avoid any encoding confusion. The
is_missing_required_utf_bom() function returns true if a required BOM
is missing.

The Unicode standard instructs to assume big-endian if there in no BOM
for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard used
in HTML5 recommends to assume little-endian to "deal with deployed
content" [3]. Strictly requiring a BOM seems to be the safest option
for content in Git.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#gen6
[2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
 Section 3.10, D98, page 132
[3] https://encoding.spec.whatwg.org/#utf-16le

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 utf8.c | 13 +
 utf8.h | 19 +++
 2 files changed, 32 insertions(+)

diff --git a/utf8.c b/utf8.c
index e4b99580f0..81c6678df1 100644
--- a/utf8.c
+++ b/utf8.c
@@ -564,6 +564,19 @@ int has_prohibited_utf_bom(const char *enc, const char 
*data, size_t len)
);
 }
 
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+  (!strcasecmp(enc, "UTF-16") || !strcasecmp(enc, "UTF16")) &&
+  !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+  (!strcasecmp(enc, "UTF-32") || !strcasecmp(enc, "UTF32")) &&
+  !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 0db1db4519..cce654a64a 100644
--- a/utf8.h
+++ b/utf8.h
@@ -79,4 +79,23 @@ void strbuf_utf8_align(struct strbuf *buf, align_type 
position, unsigned int wid
  */
 int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
 
+/*
+ * If the endianness is not defined in the encoding name, then we
+ * require a BOM. The function returns true if a required BOM is missing.
+ *
+ * The Unicode standard instructs to assume big-endian if there in no
+ * BOM for UTF-16/32 [1][2]. However, the W3C/WHATWG encoding standard
+ * used in HTML5 recommends to assume little-endian to "deal with
+ * deployed content" [3].
+ *
+ * Therefore, strictly requiring a BOM seems to be the safest option for
+ * content in Git.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#gen6
+ * [2] http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf
+ * Section 3.10, D98, page 132
+ * [3] https://encoding.spec.whatwg.org/#utf-16le
+ */
+int is_missing_required_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2



[PATCH v10 4/9] utf8: add function to detect prohibited UTF-16/32 BOM

2018-03-07 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE
or UTF-32LE a BOM must not be used [1]. The function returns true if
this is the case.

This function is used in a subsequent commit.

[1] http://unicode.org/faq/utf_bom.html#bom10

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 utf8.c | 26 ++
 utf8.h |  9 +
 2 files changed, 35 insertions(+)

diff --git a/utf8.c b/utf8.c
index 2c27ce0137..e4b99580f0 100644
--- a/utf8.c
+++ b/utf8.c
@@ -538,6 +538,32 @@ char *reencode_string_len(const char *in, int insz,
 }
 #endif
 
+static int has_bom_prefix(const char *data, size_t len,
+ const char *bom, size_t bom_len)
+{
+   return (len >= bom_len) && !memcmp(data, bom, bom_len);
+}
+
+static const char utf16_be_bom[] = {0xFE, 0xFF};
+static const char utf16_le_bom[] = {0xFF, 0xFE};
+static const char utf32_be_bom[] = {0x00, 0x00, 0xFE, 0xFF};
+static const char utf32_le_bom[] = {0xFF, 0xFE, 0x00, 0x00};
+
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len)
+{
+   return (
+ (!strcasecmp(enc, "UTF-16BE") || !strcasecmp(enc, "UTF-16LE") ||
+  !strcasecmp(enc, "UTF16BE") || !strcasecmp(enc, "UTF16LE")) &&
+ (has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
+  has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
+   ) || (
+ (!strcasecmp(enc, "UTF-32BE") || !strcasecmp(enc, "UTF-32LE") ||
+  !strcasecmp(enc, "UTF32BE") || !strcasecmp(enc, "UTF32LE")) &&
+ (has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
+  has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
+   );
+}
+
 /*
  * Returns first character length in bytes for multi-byte `text` according to
  * `encoding`.
diff --git a/utf8.h b/utf8.h
index 6bbcf31a83..0db1db4519 100644
--- a/utf8.h
+++ b/utf8.h
@@ -70,4 +70,13 @@ typedef enum {
 void strbuf_utf8_align(struct strbuf *buf, align_type position, unsigned int 
width,
   const char *s);
 
+/*
+ * If a data stream is declared as UTF-16BE or UTF-16LE, then a UTF-16
+ * BOM must not be used [1]. The same applies for the UTF-32 equivalents.
+ * The function returns true if this rule is violated.
+ *
+ * [1] http://unicode.org/faq/utf_bom.html#bom10
+ */
+int has_prohibited_utf_bom(const char *enc, const char *data, size_t len);
+
 #endif
-- 
2.16.2



[PATCH v10 0/9] convert: add support for different encodings

2018-03-07 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Hi,

Patches 1-5,8 are preparation and helper functions. Patch 3 is new.
Patch 6,7,9 are the actual change.

This series depends on Torsten's 8462ff43e4 (convert_to_git():
safe_crlf/checksafe becomes int conv_flags, 2018-01-13) which is
already in master.

Changes since v9:

* make has_bom_prefix() / is_missing_required_utf_bom() more lenient in
  what they accept (ignore casing, accept UTF?? and UTF-?? , Junio)
* replace memcmp() which does not check the length of the strings with
  a case insensitive variant of starts_with() (Junio)
* do not convert encoding names to uppercase
  (this fixes a leak introduced in the last iteration, Eric)
* do not cleanup test files that the test did not create (Eric)
* do not cleanup err.out files in tests (Eric)

I did not address Eric's feedback to make validate_encoding()
cleaner [1] as I want to stabilize the series and Eric wrote
that we can clean this up later:
http://public-inbox.org/git/capig+csoka-ybtybz42jgqtych7ldwntoeovdzfg0_64o9q...@mail.gmail.com

Thanks,
Lars


  RFC: 
https://public-inbox.org/git/bdb9b884-6d17-4be3-a83c-f67e2afa2...@gmail.com/
   v1: 
https://public-inbox.org/git/20171211155023.1405-1-lars.schnei...@autodesk.com/
   v2: 
https://public-inbox.org/git/2017122915.39680-1-lars.schnei...@autodesk.com/
   v3: 
https://public-inbox.org/git/20180106004808.77513-1-lars.schnei...@autodesk.com/
   v4: 
https://public-inbox.org/git/20180120152418.52859-1-lars.schnei...@autodesk.com/
   v5: https://public-inbox.org/git/20180129201855.9182-1-tbo...@web.de/
   v6: 
https://public-inbox.org/git/20180209132830.55385-1-lars.schnei...@autodesk.com/
   v7: 
https://public-inbox.org/git/20180215152711.158-1-lars.schnei...@autodesk.com/
   v8: 
https://public-inbox.org/git/20180224162801.98860-1-lars.schnei...@autodesk.com/
   v9: 
https://public-inbox.org/git/20180304201418.60958-1-lars.schnei...@autodesk.com/



Base Ref:
Web-Diff: https://github.com/larsxschneider/git/commit/a602b8dcef
Checkout: git fetch https://github.com/larsxschneider/git encoding-v10 && git 
checkout a602b8dcef


### Interdiff (v9..v10):

diff --git a/convert.c b/convert.c
index 6cbb2b2618..e861f1abbc 100644
--- a/convert.c
+++ b/convert.c
@@ -269,7 +269,8 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 static int validate_encoding(const char *path, const char *enc,
  const char *data, size_t len, int die_on_error)
 {
-   if (!memcmp("UTF-", enc, 4)) {
+   /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
+   if (startscase_with(enc, "UTF")) {
/*
 * Check for detectable errors in UTF encodings
 */
@@ -277,16 +278,18 @@ static int validate_encoding(const char *path, const char 
*enc,
const char *error_msg = _(
"BOM is prohibited in '%s' if encoded as %s");
/*
-* This advice is shown for UTF-??BE and UTF-??LE
-* encodings. We truncate the encoding name to 6
-* chars with %.6s to cut off the last two "byte
-* order" characters.
+* This advice is shown for UTF-??BE and UTF-??LE 
encodings.
+* We cut off the last two characters of the encoding 
name
+# to generate the encoding name suitable for BOMs.
 */
const char *advise_msg = _(
"The file '%s' contains a byte order "
-   "mark (BOM). Please use %.6s as "
+   "mark (BOM). Please use %s as "
"working-tree-encoding.");
-   advise(advise_msg, path, enc);
+   char *upper_enc = xstrdup_toupper(enc);
+   upper_enc[strlen(upper_enc)-2] = '\0';
+   advise(advise_msg, path, upper_enc);
+   free(upper_enc);
if (die_on_error)
die(error_msg, path, enc);
else {
@@ -301,7 +304,9 @@ static int validate_encoding(const char *path, const char 
*enc,
"mark (BOM). Please use %sBE or %sLE "
"(depending on the byte order) as "
"working-tree-encoding.");
-   advise(advise_msg, path, enc, enc);
+   char *upper_enc = xstrdup_toupper(enc);
+   advise(advise_msg, path, upper_enc, upper_enc);
+   free(upper_enc);
if (die_on_error)
die(error_msg, path, en

[PATCH v10 1/9] strbuf: remove unnecessary NUL assignment in xstrdup_tolower()

2018-03-07 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Since 3733e69464 (use xmallocz to avoid size arithmetic, 2016-02-22) we
allocate the buffer for the lower case string with xmallocz(). This
already ensures a NUL at the end of the allocated buffer.

Remove the unnecessary assignment.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 strbuf.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/strbuf.c b/strbuf.c
index 1df674e919..55b7daeb35 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -781,7 +781,6 @@ char *xstrdup_tolower(const char *string)
result = xmallocz(len);
for (i = 0; i < len; i++)
result[i] = tolower(string[i]);
-   result[i] = '\0';
return result;
 }
 
-- 
2.16.2



[PATCH v10 7/9] convert: check for detectable errors in UTF encodings

2018-03-07 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Check that new content is valid with respect to the user defined
'working-tree-encoding' attribute.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 convert.c| 55 +++
 t/t0028-working-tree-encoding.sh | 56 
 2 files changed, 111 insertions(+)

diff --git a/convert.c b/convert.c
index 80549ff2b5..f19a15dd02 100644
--- a/convert.c
+++ b/convert.c
@@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct 
text_stat *stats,
 
 }
 
+static int validate_encoding(const char *path, const char *enc,
+ const char *data, size_t len, int die_on_error)
+{
+   /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
+   if (startscase_with(enc, "UTF")) {
+   /*
+* Check for detectable errors in UTF encodings
+*/
+   if (has_prohibited_utf_bom(enc, data, len)) {
+   const char *error_msg = _(
+   "BOM is prohibited in '%s' if encoded as %s");
+   /*
+* This advice is shown for UTF-??BE and UTF-??LE 
encodings.
+* We cut off the last two characters of the encoding 
name
+# to generate the encoding name suitable for BOMs.
+*/
+   const char *advise_msg = _(
+   "The file '%s' contains a byte order "
+   "mark (BOM). Please use %s as "
+   "working-tree-encoding.");
+   char *upper_enc = xstrdup_toupper(enc);
+   upper_enc[strlen(upper_enc)-2] = '\0';
+   advise(advise_msg, path, upper_enc);
+   free(upper_enc);
+   if (die_on_error)
+   die(error_msg, path, enc);
+   else {
+   return error(error_msg, path, enc);
+   }
+
+   } else if (is_missing_required_utf_bom(enc, data, len)) {
+   const char *error_msg = _(
+   "BOM is required in '%s' if encoded as %s");
+   const char *advise_msg = _(
+   "The file '%s' is missing a byte order "
+   "mark (BOM). Please use %sBE or %sLE "
+   "(depending on the byte order) as "
+   "working-tree-encoding.");
+   char *upper_enc = xstrdup_toupper(enc);
+   advise(advise_msg, path, upper_enc, upper_enc);
+   free(upper_enc);
+   if (die_on_error)
+   die(error_msg, path, enc);
+   else {
+   return error(error_msg, path, enc);
+   }
+   }
+
+   }
+   return 0;
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -291,6 +343,9 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
if (!buf && !src)
return 1;
 
+   if (validate_encoding(path, enc, src, src_len, die_on_error))
+   return 0;
+
dst = reencode_string_len(src, src_len, default_encoding, enc,
  _len);
if (!dst) {
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 89b4dbee1d..aa05f82166 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' 
'
 
 for i in 16 32
 do
+   test_expect_success "check prohibited UTF-${i} BOM" '
+   test_when_finished "git reset --hard HEAD" &&
+
+   echo "*.utf${i}be text working-tree-encoding=utf-${i}be" 
>>.gitattributes &&
+   echo "*.utf${i}le text working-tree-encoding=utf-${i}le" 
>>.gitattributes &&
+
+   # Here we add a UTF-16 (resp. UTF-32) files with BOM 
(big/little-endian)
+   # but we tell Git to treat it as UTF-16BE/UTF-16LE (resp. 
UTF-32).
+   # In these cases the BOM is prohibited.
+   cp bebom.utf${i}be.raw bebom.utf${i}be &&
+   test_must_fail git add bebom.utf${i}be 2>err.out &&
+   test_i18ngrep "fatal: BOM is prohibited .* utf-${i}be" err.out 
&&
+
+   cp leb

[PATCH v10 9/9] convert: add round trip check based on 'core.checkRoundtripEncoding'

2018-03-07 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

UTF supports lossless conversion round tripping and conversions between
UTF and other encodings are mostly round trip safe as Unicode aims to be
a superset of all other character encodings. However, certain encodings
(e.g. SHIFT-JIS) are known to have round trip issues [1].

Add 'core.checkRoundtripEncoding', which contains a comma separated
list of encodings, to define for what encodings Git should check the
conversion round trip if they are used in the 'working-tree-encoding'
attribute.

Set SHIFT-JIS as default value for 'core.checkRoundtripEncoding'.

[1] 
https://support.microsoft.com/en-us/help/170559/prb-conversion-problem-between-shift-jis-and-unicode

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 Documentation/config.txt |  6 
 Documentation/gitattributes.txt  |  8 +
 config.c |  5 +++
 convert.c| 78 
 convert.h|  1 +
 environment.c|  1 +
 t/t0028-working-tree-encoding.sh | 39 
 7 files changed, 138 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0e25b2c92b..d7a56054a5 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -530,6 +530,12 @@ core.autocrlf::
This variable can be set to 'input',
in which case no output conversion is performed.
 
+core.checkRoundtripEncoding::
+   A comma separated list of encodings that Git performs UTF-8 round
+   trip checks on if they are used in an `working-tree-encoding`
+   attribute (see linkgit:gitattributes[5]). The default value is
+   `SHIFT-JIS`.
+
 core.symlinks::
If false, symbolic links are checked out as small plain files that
contain the link text. linkgit:git-update-index[1] and
diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index 31a4f92840..aa3deae392 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -312,6 +312,14 @@ number of pitfalls:
   internal contents as UTF-8 and try to convert it to UTF-16 on checkout.
   That operation will fail and cause an error.
 
+- Reencoding content to non-UTF encodings can cause errors as the
+  conversion might not be UTF-8 round trip safe. If you suspect your
+  encoding to not be round trip safe, then add it to
+  `core.checkRoundtripEncoding` to make Git check the round trip
+  encoding (see linkgit:git-config[1]). SHIFT-JIS (Japanese character
+  set) is known to have round trip issues with UTF-8 and is checked by
+  default.
+
 - Reencoding content requires resources that might slow down certain
   Git operations (e.g 'git checkout' or 'git add').
 
diff --git a/config.c b/config.c
index 1f003fbb90..d0ada9fcd4 100644
--- a/config.c
+++ b/config.c
@@ -1172,6 +1172,11 @@ static int git_default_core_config(const char *var, 
const char *value)
return 0;
}
 
+   if (!strcmp(var, "core.checkroundtripencoding")) {
+   check_roundtrip_encoding = xstrdup(value);
+   return 0;
+   }
+
if (!strcmp(var, "core.notesref")) {
notes_ref_name = xstrdup(value);
return 0;
diff --git a/convert.c b/convert.c
index 5776dfbc99..e861f1abbc 100644
--- a/convert.c
+++ b/convert.c
@@ -341,6 +341,43 @@ static void trace_encoding(const char *context, const char 
*path,
strbuf_release();
 }
 
+static int check_roundtrip(const char* enc_name)
+{
+   /*
+* check_roundtrip_encoding contains a string of space and/or
+* comma separated encodings (eg. "UTF-16, ASCII, CP1125").
+* Search for the given encoding in that string.
+*/
+   const char *found = strcasestr(check_roundtrip_encoding, enc_name);
+   const char *next;
+   int len;
+   if (!found)
+   return 0;
+   next = found + strlen(enc_name);
+   len = strlen(check_roundtrip_encoding);
+   return (found && (
+   /*
+* check that the found encoding is at the
+* beginning of check_roundtrip_encoding or
+* that it is prefixed with a space or comma
+*/
+   found == check_roundtrip_encoding || (
+   found > check_roundtrip_encoding &&
+   (*(found-1) == ' ' || *(found-1) == ',')
+   )
+   ) && (
+   /*
+* check that the found encoding is at the
+* end of check_roundtrip_encoding or
+* that it is suffixed with a space or comma
+*/
+   next == check_roundtrip_encoding + len || (

[PATCH v10 3/9] strbuf: add a case insensitive starts_with()

2018-03-07 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Check in a case insensitive manner if one string is a prefix of another
string.

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 git-compat-util.h | 1 +
 strbuf.c  | 9 +
 2 files changed, 10 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index 68b2ad531e..f648da0c11 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -455,6 +455,7 @@ extern void (*get_warn_routine(void))(const char *warn, 
va_list params);
 extern void set_die_is_recursing_routine(int (*routine)(void));
 
 extern int starts_with(const char *str, const char *prefix);
+extern int startscase_with(const char *str, const char *prefix);
 
 /*
  * If the string "str" begins with the string found in "prefix", return 1.
diff --git a/strbuf.c b/strbuf.c
index b635f0bdc4..5779a2d591 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -11,6 +11,15 @@ int starts_with(const char *str, const char *prefix)
return 0;
 }
 
+int startscase_with(const char *str, const char *prefix)
+{
+   for (; ; str++, prefix++)
+   if (!*prefix)
+   return 1;
+   else if (tolower(*str) != tolower(*prefix))
+   return 0;
+}
+
 int skip_to_optional_arg_default(const char *str, const char *prefix,
 const char **arg, const char *def)
 {
-- 
2.16.2



[PATCH v10 2/9] strbuf: add xstrdup_toupper()

2018-03-07 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Create a copy of an existing string and make all characters upper case.
Similar xstrdup_tolower().

This function is used in a subsequent commit.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 strbuf.c | 12 
 strbuf.h |  1 +
 2 files changed, 13 insertions(+)

diff --git a/strbuf.c b/strbuf.c
index 55b7daeb35..b635f0bdc4 100644
--- a/strbuf.c
+++ b/strbuf.c
@@ -784,6 +784,18 @@ char *xstrdup_tolower(const char *string)
return result;
 }
 
+char *xstrdup_toupper(const char *string)
+{
+   char *result;
+   size_t len, i;
+
+   len = strlen(string);
+   result = xmallocz(len);
+   for (i = 0; i < len; i++)
+   result[i] = toupper(string[i]);
+   return result;
+}
+
 char *xstrvfmt(const char *fmt, va_list ap)
 {
struct strbuf buf = STRBUF_INIT;
diff --git a/strbuf.h b/strbuf.h
index 14c8c10d66..df7ced53ed 100644
--- a/strbuf.h
+++ b/strbuf.h
@@ -607,6 +607,7 @@ __attribute__((format (printf,2,3)))
 extern int fprintf_ln(FILE *fp, const char *fmt, ...);
 
 char *xstrdup_tolower(const char *);
+char *xstrdup_toupper(const char *);
 
 /**
  * Create a newly allocated string using printf format. You can do this easily
-- 
2.16.2



Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-06 Thread Lars Schneider

> On 07 Mar 2018, at 00:07, Junio C Hamano <gits...@pobox.com> wrote:
> 
> Junio C Hamano <gits...@pobox.com> writes:
> 
>> Lars Schneider <larsxschnei...@gmail.com> writes:
>> 
>>>> Also "UTF16" or other spelling
>>>> the platform may support but this code fails to recognise will go
>>>> unchecked.
>>> 
>>> That is true. However, I would assume all iconv implementations use the
>>> same encoding names for UTF encodings, no? That means UTF16 would never be
>>> valid. Would you agree?
>> 
>> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
>> I am not sure what you mean by "Would you agree?"  People can say in
>> their .gitattributes file that this path is in "UTF16" without dash
>> and that is what will be fed to this code, no?
> 
> FWIW, "iconv -f utf8 -t utf16" seems not to complain, so I am
> reasonably sure that people expect downcased ones to work as well.

Sure! That's why I normalized it to upper case:
https://public-inbox.org/git/CAPig+cQE0pKs-AMvh4GndyCXBMnx=70jppdm6k4jjte-74f...@mail.gmail.com/

After thinking about it I wonder if we should barf on "utf16" without
dash. Your Linux iconv would handle this correctly. My macOS iconv would not.
That means the repo would checkout correctly on your machine but not on mine.

What do you think?

- Lars



Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-06 Thread Lars Schneider

> On 06 Mar 2018, at 23:53, Junio C Hamano <gits...@pobox.com> wrote:
> 
> Lars Schneider <larsxschnei...@gmail.com> writes:
> 
>>> Also "UTF16" or other spelling
>>> the platform may support but this code fails to recognise will go
>>> unchecked.
>> 
>> That is true. However, I would assume all iconv implementations use the
>> same encoding names for UTF encodings, no? That means UTF16 would never be
>> valid. Would you agree?
> 
> After seeing "UTF16" and others in "iconv -l | grep -i utf" output,
> I am not sure what you mean by "Would you agree?"  People can say in
> their .gitattributes file that this path is in "UTF16" without dash
> and that is what will be fed to this coe, no?

On macOS I don't see UTF16... but I just checked on my Linux box and
there it is. Consequently, we need to check both versions: with and
without dash.

I'll reroll ASAP (I try to do it first thing in the morning).

Thanks,
Lars


Re: [PATCH v9 4/8] utf8: add function to detect a missing UTF-16/32 BOM

2018-03-06 Thread Lars Schneider

> On 06 Mar 2018, at 21:50, Junio C Hamano  wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> +int is_missing_required_utf_bom(const char *enc, const char *data, size_t 
>> len)
>> +{
>> +return (
>> +   !strcmp(enc, "UTF-16") &&
>> +   !(has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
>> + has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom)))
>> +) || (
>> +   !strcmp(enc, "UTF-32") &&
>> +   !(has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
>> + has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom)))
>> +);
>> +}
> 
> These strcmp() calls seem inconsistent with the principle embodied
> by utf8.c::fallback_encoding(), i.e. "be lenient to what we accept",
> and make the interface uneven. I am wondering if we also want to
> complain when the user gave us "utf16" and there is no byte order
> mark in the contents, for example?

Well, if I use stricmp() then I don't need to call and cleanup
xstrdup_toupper() as discussed with Eric [1]. Is there a case
insensitive starts_with() method?

[1] 
https://public-inbox.org/git/CAPig+cQE0pKs-AMvh4GndyCXBMnx=70jppdm6k4jjte-74f...@mail.gmail.com/


>  Also "UTF16" or other spelling
> the platform may support but this code fails to recognise will go
> unchecked.

That is true. However, I would assume all iconv implementations use the
same encoding names for UTF encodings, no? That means UTF16 would never be
valid. Would you agree?

- Lars

Re: [PATCH v9 5/8] convert: add 'working-tree-encoding' attribute

2018-03-06 Thread Lars Schneider

> On 06 Mar 2018, at 21:42, Eric Sunshine <sunsh...@sunshineco.com> wrote:
> 
> On Sun, Mar 4, 2018 at 3:14 PM,  <lars.schnei...@autodesk.com> wrote:
>> Git recognizes files encoded with ASCII or one of its supersets (e.g.
>> UTF-8 or ISO-8859-1) as text files. All other encodings are usually
>> interpreted as binary and consequently built-in Git text processing
>> tools (e.g. 'git diff') as well as most Git web front ends do not
>> visualize the content.
>> [...]
>> Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -978,6 +1051,25 @@ static int ident_to_worktree(const char *path, const 
>> char *src, size_t len,
>> +static const char *git_path_check_encoding(struct attr_check_item *check)
>> +{
>> +   [...]
>> +   /*
>> +* Ensure encoding names are always upper case (e.g. UTF-8) to
>> +* simplify subsequent string comparisons.
>> +*/
>> +   return xstrdup_toupper(value);
> 
> xstrdup_toupper() allocates memory...
> 
>> +}
>> @@ -1033,6 +1125,7 @@ struct conv_attrs {
>>enum crlf_action attr_action; /* What attr says */
>>enum crlf_action crlf_action; /* When no attr is set, use 
>> core.autocrlf */
>>int ident;
>> +   const char *working_tree_encoding; /* Supported encoding or default 
>> encoding if NULL */
> 
> ...which is assigned to 'const char *'...
> 
>> };
>> @@ -1064,6 +1158,7 @@ static void convert_attrs(struct conv_attrs *ca, const 
>> char *path)
>>else if (eol_attr == EOL_CRLF)
>>ca->crlf_action = CRLF_TEXT_CRLF;
>>}
>> +   ca->working_tree_encoding = git_path_check_encoding(ccheck + 
>> 5);
> 
> ...by this code, and eventually leaked.
> 
> It's too bad it isn't cleaned up (freed), but looking at the callers,
> fixing this leak would be mildly noisy (though not particularly
> invasive). How much do we care about this leak?

Hmm. You are right. That was previously handled by the encoding struct 
linked list that I removed in this iteration. I forgot about that aspect :/
I don't like it leaking. I think I would like to reintroduce the linked
list. This way every encoding is only once in memory. What do you think?


>>} else {
>>ca->drv = NULL;
>>ca->crlf_action = CRLF_UNDEFINED;
>> diff --git a/t/t0028-working-tree-encoding.sh 
>> b/t/t0028-working-tree-encoding.sh
>> @@ -0,0 +1,135 @@
>> +test_expect_success 'check $GIT_DIR/info/attributes support' '
>> +   test_when_finished "rm -f test.utf8.raw test.utf32.raw 
>> test.utf32.git" &&
> 
> It seems weird to be cleaning up files this test didn't create
> (test.utf8.raw and test.utf32.raw).

Agreed.


>> +   test_when_finished "git reset --hard HEAD" &&
>> +
>> +   echo "*.utf32 text working-tree-encoding=utf-32" 
>> >.git/info/attributes &&
>> +   git add test.utf32 &&
>> +
>> +   git cat-file -p :test.utf32 >test.utf32.git &&
>> +   test_cmp_bin test.utf8.raw test.utf32.git
>> +'
>> +
>> +test_expect_success 'check unsupported encodings' '
>> +   test_when_finished "rm -f err.out" &&
>> +   test_when_finished "git reset --hard HEAD" &&
> 
> Resetting to HEAD here is an important cleanup action, but tests don't
> usually clean up files such as 'err.out' since such detritus doesn't
> usually impact subsequent tests negatively. (Just an observation; no
> re-roll needed.)

OK. I'll fix it if I reroll.

- Lars

Re: [PATCH v9 6/8] convert: check for detectable errors in UTF encodings

2018-03-06 Thread Lars Schneider

> On 06 Mar 2018, at 02:23, Junio C Hamano <gits...@pobox.com> wrote:
> 
> Lars Schneider <larsxschnei...@gmail.com> writes:
> 
>>> On 05 Mar 2018, at 22:50, Junio C Hamano <gits...@pobox.com> wrote:
>>> 
>>> lars.schnei...@autodesk.com writes:
>>> 
>>>> +static int validate_encoding(const char *path, const char *enc,
>>>> +const char *data, size_t len, int die_on_error)
>>>> +{
>>>> +  if (!memcmp("UTF-", enc, 4)) {
>>> 
>>> Does the caller already know that enc is sufficiently long that
>>> using memcmp is safe?
>> 
>> No :-(
>> 
>> Would you be willing to squash that in?
>> 
>>if (strlen(enc) > 4 && !memcmp("UTF-", enc, 4)) {
>> 
>> I deliberately used "> 4" as plain "UTF-" is not even valid.
> 
> I'd rather not.  The code does not have to even look at 6th and
> later bytes in the enc[] even if it wanted to reject "UTF-" followed
> by nothing, but use of strlen() forces it to look at everything.
> 
> Stepping back, shouldn't
> 
>   if (starts_with(enc, "UTF-") 
> 
> be sufficient?  If you really care about the case where "UTF-" alone
> comes here, you could write
> 
>   if (starts_with(enc, "UTF-") && enc[4])
> 
> but I do not think "&& enc[4]" is even needed.  The functions called
> from this block would not consider "UTF-" alone as something valid
> anyway, so with that "&& enf[4]" we would be piling more code only
> for invalid/rare case.

Agreed, "if (starts_with(enc, "UTF-"))" is sufficient. Can you squash
that in? Thanks for pointing me to starts_with() as I forgot about this
function!

- Lars


Re: [PATCH v9 6/8] convert: check for detectable errors in UTF encodings

2018-03-05 Thread Lars Schneider

> On 05 Mar 2018, at 22:50, Junio C Hamano  wrote:
> 
> lars.schnei...@autodesk.com writes:
> 
>> +static int validate_encoding(const char *path, const char *enc,
>> +  const char *data, size_t len, int die_on_error)
>> +{
>> +if (!memcmp("UTF-", enc, 4)) {
> 
> Does the caller already know that enc is sufficiently long that
> using memcmp is safe?

No :-(

Would you be willing to squash that in?

if (strlen(enc) > 4 && !memcmp("UTF-", enc, 4)) {

I deliberately used "> 4" as plain "UTF-" is not even valid.


Thanks for spotting this,
Lars


Re: Contributor Summit planning

2018-03-05 Thread Lars Schneider

> On 03 Mar 2018, at 11:39, Jeff King  wrote:
> 
> On Sat, Mar 03, 2018 at 05:30:10AM -0500, Jeff King wrote:
> 
>> As in past years, I plan to run it like an unconference. Attendees are
>> expected to bring topics for group discussion. Short presentations are
>> also welcome. We'll put the topics on a whiteboard in the morning, and
>> pick whichever ones people are interested in.
>> 
>> Feel free to reply to this thread if you want to make plans or discuss
>> any proposed topics before the summit. Input or questions from
>> non-attendees is welcome here.
> 
> I'll plan to offer two topics:
> 
> - a round-up of the current state and past year's activities of Git as
>   a member project of Software Freedom Conservancy
> 
> - some updates on the state of the git-scm.com since my report last
>   year
> 
> As with last year, I'll try to send a written report to the list for
> those who aren't at the summit in person.

Thanks for starting this. I would like to propose the following topics:

- hooks: Discuss a proposal for multiple local Git hooks of the same type.

- error reporting: Git is distributed and therefore lots of errors are only
  reported locally. That makes it hard for administrators in larger 
  companies to see trouble. Would it make sense to add a config option that 
  would push recent errors along with "git push" to the server?

- fuzzing: Would it make sense to register Git to Google's OSS fuzzing
  program https://github.com/google/oss-fuzz ?


- Lars


[PATCH v9 7/8] convert: add tracing for 'working-tree-encoding' attribute

2018-03-04 Thread lars . schneider
From: Lars Schneider <larsxschnei...@gmail.com>

Add the GIT_TRACE_WORKING_TREE_ENCODING environment variable to enable
tracing for content that is reencoded with the 'working-tree-encoding'
attribute. This is useful to debug encoding issues.

Signed-off-by: Lars Schneider <larsxschnei...@gmail.com>
---
 convert.c| 25 +
 t/t0028-working-tree-encoding.sh |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/convert.c b/convert.c
index 9647b06679..eec34a94b9 100644
--- a/convert.c
+++ b/convert.c
@@ -313,6 +313,29 @@ static int validate_encoding(const char *path, const char 
*enc,
return 0;
 }
 
+static void trace_encoding(const char *context, const char *path,
+  const char *encoding, const char *buf, size_t len)
+{
+   static struct trace_key coe = TRACE_KEY_INIT(WORKING_TREE_ENCODING);
+   struct strbuf trace = STRBUF_INIT;
+   int i;
+
+   strbuf_addf(, "%s (%s, considered %s):\n", context, path, 
encoding);
+   for (i = 0; i < len && buf; ++i) {
+   strbuf_addf(
+   ,"| \e[2m%2i:\e[0m %2x \e[2m%c\e[0m%c",
+   i,
+   (unsigned char) buf[i],
+   (buf[i] > 32 && buf[i] < 127 ? buf[i] : ' '),
+   ((i+1) % 8 && (i+1) < len ? ' ' : '\n')
+   );
+   }
+   strbuf_addchars(, '\n', 1);
+
+   trace_strbuf(, );
+   strbuf_release();
+}
+
 static const char *default_encoding = "UTF-8";
 
 static int encode_to_git(const char *path, const char *src, size_t src_len,
@@ -341,6 +364,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
if (validate_encoding(path, enc, src, src_len, die_on_error))
return 0;
 
+   trace_encoding("source", path, enc, src, src_len);
dst = reencode_string_len(src, src_len, default_encoding, enc,
  _len);
if (!dst) {
@@ -358,6 +382,7 @@ static int encode_to_git(const char *path, const char *src, 
size_t src_len,
return 0;
}
}
+   trace_encoding("destination", path, default_encoding, dst, dst_len);
 
strbuf_attach(buf, dst, dst_len, dst_len + 1);
return 1;
diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
index 5c7e36a164..bdc487b44f 100755
--- a/t/t0028-working-tree-encoding.sh
+++ b/t/t0028-working-tree-encoding.sh
@@ -4,6 +4,8 @@ test_description='working-tree-encoding conversion via 
gitattributes'
 
 . ./test-lib.sh
 
+GIT_TRACE_WORKING_TREE_ENCODING=1 && export GIT_TRACE_WORKING_TREE_ENCODING
+
 test_expect_success 'setup test files' '
git config core.eol lf &&
 
-- 
2.16.2



  1   2   3   4   5   6   7   8   9   10   >