Bug? Git won't apply a split hunk that went through a text editor

2018-08-09 Thread Philip White
Hello, git enthusiasts,

I’d like to report what I suspect is a bug in Git, tested in 2.18 and 2.14. 
(I’d be delighted to be corrected if it is my own misunderstanding.) I’m 
reporting it here based on guidance from https://git-scm.com/community.

I created a minimal testcase with a detailed README here: 
https://github.com/philipmw/git-bugreport-2018-hunk-fail

Overview of the bug:

When interactively selecting hunks to apply using `git checkout -p`,
git will correctly apply an unmodified hunk, but will refuse to apply
a hunk that went through a text editor (the "e" command), even when I
made no changes in the text editor.
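For completeness, the symptom can be scripted roughly like this (a sketch, not the linked testcase: it assumes a POSIX shell and uses GIT_EDITOR=true to stand in for an editor that exits without changing the hunk; file names are made up):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git config user.email a@example.com
git config user.name A
printf 'line1\nline2\n' >file
git add file && git commit -qm initial
printf 'changed\nline2\n' >file
# Answer "e" to the hunk prompt; GIT_EDITOR=true opens the hunk for
# editing and exits without touching it. Per the report, git then
# refuses to apply the (unmodified) hunk.
GIT_EDITOR=true git checkout -p -- file <<EOF || echo 'hunk rejected'
e
EOF
cat file
```

Whether the final checkout succeeds depends on whether the git version in use has the reported bug.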

Thanks for any advice or attention you can give to this matter.

— 
Philip

Re: 'git submodule update' ignores [http] config

2018-08-09 Thread Jonathan Nieder
+cc: Stefan, who has been looking at fetch --recurse-submodules recently
Hi,

Jonathon Reinhart wrote:

> I've narrowed it down to an observation that the [http] config seems
> to be ignored by 'git submodule update'. Shouldn't those options be
> respected by submodules?
>
> Given a .git/config file like this:
>
> 
> [fetch]
> recurseSubmodules = false
> [http "https://gitlab.example.com"]
> sslCAInfo = 
> C:\\Users\\gitlab-runner\\builds\\deadbeef\\0\\somegroup\\someproj.git\\CA_SERVER_TLS_CA_FILE
[...]
> C:\Users\jreinhart\testrepo>set GIT_CURL_VERBOSE=1
> C:\Users\jreinhart\testrepo>git fetch
[...]
> *   CAfile: 
> C:\Users\gitlab-runner\builds\deadbeef\0\somegroup\someproj.git\CA_SERVER_TLS_CA_FILE
[...]
> C:\Users\jreinhart\testrepo>git checkout master
> C:\Users\jreinhart\testrepo>git submodule update --init
[...]
> *   CAfile: C:/Program Files/Git/mingw64/ssl/certs/ca-bundle.crt
[...]
> Note that the CAfile reverted to its default instead of using the same
> one from the `git fetch`.

Interesting.

The context is that "git submodule update" is simply running commands
like "git fetch" inside the submodules, and the repository-local
config of the superproject does not apply there.

In the long run, commands like "git fetch --recurse-submodules" may
change to use a single process.  It's possible that some of the
repository-local configuration of the superproject would apply at that
point, though the inconsistency would be confusing, so probably not
these particular settings.  Anyway, that's a faraway future; today,
"git fetch --recurse-submodules" is also running "git fetch" commands
inside the submodules, and the repository-local config of the
superproject does not apply there.

Would it work for you to put this configuration in the global config
file ("git config --global --edit")?  That way, it would be used by
all repositories.  If you want it only to apply within the testrepo
directory, you can use conditional includes --- something like:

  in $HOME/.config/git/testrepo-ca:

  [http "https://gitlab.example.com"]
	sslCAInfo = ...

  in $HOME/.config/git/config:

  [includeIf "gitdir/i:~/testrepo/**"]
	path = testrepo-ca
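To sanity-check that the conditional include is taking effect, something along these lines should print the value coming from the included file (a sketch with made-up paths, reusing the URL from the example above; the temporary HOME just keeps it self-contained):

```shell
set -e
tmp=$(mktemp -d)
export HOME="$tmp" XDG_CONFIG_HOME="$tmp/.config" GIT_CONFIG_NOSYSTEM=1
unset GIT_CONFIG_GLOBAL || true
mkdir -p "$XDG_CONFIG_HOME/git"
git init -q "$HOME/testrepo"
cat >"$XDG_CONFIG_HOME/git/testrepo-ca" <<'EOF'
[http "https://gitlab.example.com"]
	sslCAInfo = /path/to/CA_SERVER_TLS_CA_FILE
EOF
cat >"$XDG_CONFIG_HOME/git/config" <<'EOF'
[includeIf "gitdir/i:~/testrepo/**"]
	path = testrepo-ca
EOF
# Inside the repository the conditional include applies:
git -C "$HOME/testrepo" config --show-origin \
	http.https://gitlab.example.com.sslCAInfo
```

Outside `~/testrepo`, the same `git config` query would find nothing, which is the point of the conditional include.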

Thanks and hope that helps,
Jonathan


Re: [PATCH 1/1] verify-tag/verify-commit should exit unsuccessfully when signature is not trusted

2018-08-09 Thread brian m. carlson
On Thu, Aug 09, 2018 at 11:40:27AM -0700, Junio C Hamano wrote:
> -- >8 --
> Subject: [PATCH] gpg-interface: propagate exit status from gpg back to the 
> callers
> 
> When gpg-interface API unified support for signature verification
> codepaths for signed tags and signed commits in mid 2015 at around
> v2.6.0-rc0~114, we accidentally loosened the GPG signature
> verification.
> 
> Before that change, signed commits were verified by looking for
> "G"ood signature from GPG, while ignoring the exit status of "gpg
> --verify" process, while signed tags were verified by simply passing
> the exit status of "gpg --verify" through.  The unified code we
> currently have ignores the exit status of "gpg --verify" and returns
> successful verification when the signature matches an unexpired key
> regardless of the trust placed on the key (i.e. in addition to "G"ood
> ones, we accept "U"ntrusted ones).
> 
> Make these commands signal failure with their exit status when
> underlying "gpg --verify" (or the custom command specified by
> "gpg.program" configuration variable) does so.  This essentially
> changes their behaviour in a backward incompatible way to reject
> signatures that have been made with untrusted keys even if they
> correctly verify, as that is how "gpg --verify" behaves.
> 
> Note that the code still overrides a zero exit status obtained from
> "gpg" (or gpg.program) if the output does not say the signature is
> good or computes correctly but made with untrusted keys, to catch
> a poorly written wrapper around "gpg" the user may give us.
> 
> We could exclude "U"ntrusted support from this fallback code, but
> that would be making two backward incompatible changes in a single
> commit, so let's avoid that for now.  A follow-up change could do so
> if desired.

This looks great to me.  Thanks.
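With that change, the exit status becomes something scripts can branch on directly. A small sketch (the tag name is made up; an empty repository just demonstrates the failure path):

```shell
tmp=$(mktemp -d) && cd "$tmp"
git init -q
# Any failure -- missing tag, bad signature, or (after this change)
# a signature from an untrusted key -- yields a non-zero exit status.
if git verify-tag v1.0 2>/dev/null; then
	echo "signature good and trusted"
else
	echo "verification failed"
fi
```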
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204




Re: What's the use case for committing both the freshly created file and its exclusion in .gitignore?

2018-08-09 Thread Jonathan Nieder
Hi,

Jeff King wrote:
> Bartosz Konikiewicz wrote:

>> Steps to reproduce:
>>
>> 1. Create a new file.
>> 2. Stage the file.
>> 3. Add the file to .gitignore.
>> 4. Stage the .gitignore.
>> 5. Commit changes.
[...]
> As far as I know, that is not an intentionally supported workflow. It is
> merely the result that .gitignore is only considered when adding new
> files to the index, not when committing nor when updating the entry for
> an existing file.

I am not sure I agree with "not intentionally supported".  It's a
little closer to "logical consequence of some intentional features",
because:

> If you are asking as a more general case: why do we not complain about
> .gitignore for files the index already knows about, then I think that is
> useful. It lets you override the .gitignore _once_ when adding the file
> initially, and then you don't have to deal with it again (and keep in
> mind that the pattern excluding it may be broad, like "*.o", or even
> just "*", so simply deleting it from the .gitignore is not an option).

This workflow is very common.
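The override-once flow looks roughly like this (file names made up):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email a@example.com
git config user.name A
echo '*.o' >.gitignore
echo 'contents' >precious.o
git add -f precious.o      # override the "*.o" pattern just this once
git add .gitignore
git commit -qm 'track one .o file despite .gitignore'
echo 'new contents' >precious.o
git add precious.o         # no -f needed: the file is already tracked
git status --porcelain precious.o
```

The last `git add` succeeds without `-f` because .gitignore is only consulted when adding files the index does not already know about.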

> You could probably accomplish this these days by using a negative
> pattern in your .gitignore file. But I think the behavior in question
> may predate negative patterns (but I didn't dig). It's also a bit
> simpler to use in practice, IMHO.

Agreed about simpler, even though it's not part of any of my own
habits.

In retrospect, despite the precedent of cvsignore, calling the file
.gitignore may not have been a great idea.  Some other name that
conveys .git-prevent-me-from-accidentally-adding-these-files would
make the behavior less surprising to new users.

"git help gitignore" has some notes about this.  If you have ideas
about moments in interactive use where we could print some messages to
make the behavior less surprising, that would be very welcome.

Thanks,
Jonathan


Re: [PATCH 5/5] rev-list: handle missing tree objects properly

2018-08-09 Thread Jonathan Tan
> @@ -209,7 +210,8 @@ static inline void finish_object__ma(struct object *obj)
>*/
>   switch (arg_missing_action) {
>   case MA_ERROR:
> - die("missing blob object '%s'", oid_to_hex(&obj->oid));
> + die("missing %s object '%s'",
> + type_name(obj->type), oid_to_hex(&obj->oid));
>   return;
>  
>   case MA_ALLOW_ANY:
> @@ -222,8 +224,8 @@ static inline void finish_object__ma(struct object *obj)
>   case MA_ALLOW_PROMISOR:
>   if (is_promisor_object(&obj->oid))
>   return;
> - die("unexpected missing blob object '%s'",
> - oid_to_hex(&obj->oid));
> + die("unexpected missing %s object '%s'",
> + type_name(obj->type), oid_to_hex(&obj->oid));
>   return;

Once again, I'll do a fuller review tomorrow.

These are fine (obj->type is populated), because the types of objects
are known during traversal.

> - if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid)) {
> + if (!has_object_file(&obj->oid)) {
>   finish_object__ma(obj);
>   return 1;

And this is also fine, because finish_object__ma can now handle any
object type.

> + revs.show_missing_trees = 1;

(and elsewhere)

Could we just show missing trees all the time? We do that for blobs and
already rely on the caller (eventually, show_object() in
builtin/rev-list.c) to determine whether the object actually exists or
not; we could do the same for trees. This allows us to not include this
extra knob.
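For blobs, the decision is indeed already left to the `--missing` machinery; a sketch of that existing behavior (deleting a loose object by hand to fake a missing one; repository contents made up):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email a@example.com
git config user.name A
echo data >file && git add file && git commit -qm init
blob=$(git rev-parse HEAD:file)
# Remove the loose blob to simulate a missing object.
d=$(printf %s "$blob" | cut -c1-2)
f=$(printf %s "$blob" | cut -c3-)
rm -f ".git/objects/$d/$f"
# rev-list notices the missing blob and reports it with a "?" prefix
# instead of dying, because --missing=print was requested.
git rev-list --objects --missing=print HEAD
```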

> - if (parse_tree_gently(tree, gently) < 0) {
> + parse_result = parse_tree_gently(tree, gently);
> + if (parse_result < 0 && !revs->show_missing_trees) {
>   if (revs->ignore_missing_links)
>   return;
>  
> @@ -182,7 +185,8 @@ static void process_tree(struct traversal_context *ctx,
>   if (base->len)
>   strbuf_addch(base, '/');
>  
> - process_tree_contents(ctx, tree, base);
> + if (parse_result >= 0)
> + process_tree_contents(ctx, tree, base);
>  
>   if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_fn) {
>   r = ctx->filter_fn(LOFS_END_TREE, obj,

Is it possible to call the appropriate callbacks and then return
immediately, instead of going through the whole function checking
parse_result when necessary? When doing the latter, the reader needs to
keep on checking if each function still works if the tree is
unparseable.


Re: [PATCH 2/5] list-objects-filter: implement filter only:commits

2018-08-09 Thread Jonathan Tan
> Teach list-objects the "only:commits" filter which allows for filtering
> out all non-commit and non-annotated tag objects (unless other objects
> are explicitly specified by the user). The purpose of this patch is to
> allow smaller partial clones.
> 
> The name of this filter - only:commits - is a bit inaccurate because it
> still allows annotated tags to pass through. I chose it because it was
> the only concise name I could think of that was pretty descriptive. I
> considered and decided against "tree:none" because the code and
> documentation for filters seems to lack the concept of "you're filtering
> this, so we'll implicitly filter all referents of this." So "tree:none"
> is vague, since some may think it filters blobs too, while some may not.
> "only:commits" is specific and makes it easier to match it to a
> potential use case.

I'll do a fuller review tomorrow, but here are my initial thoughts.

I'm undecided about whether "only:commits" or "tree:none" is better -
one argument in favor of the latter is that blobs are not of much use
without any trees referring to them, so it makes sense that omitting
trees means omitting blobs. But that requires some thought and is not
immediately obvious.

>  /*
> - * A filter for list-objects to omit ALL blobs from the traversal.
> - * And to OPTIONALLY collect a list of the omitted OIDs.
> + * A filter for list-objects to omit ALL blobs from the traversal, and 
> possibly
> + * trees as well.
> + * Can OPTIONALLY collect a list of the omitted OIDs.
>   */
> -struct filter_blobs_none_data {
> +struct filter_none_of_type_data {
> + unsigned omit_trees : 1;
>   struct oidset *omits;
>  };

I know that it's documented above that blobs are always omitted, but
maybe it's worth it to add a comment /* blobs are always omitted */.

> - case LOFS_BEGIN_TREE:
> - assert(obj->type == OBJ_TREE);
> - /* always include all tree objects */
> - return LOFR_MARK_SEEN | LOFR_DO_SHOW;
> -
>   case LOFS_END_TREE:
>   assert(obj->type == OBJ_TREE);
>   return LOFR_ZERO;
>  
> + case LOFS_BEGIN_TREE:
> + assert(obj->type == OBJ_TREE);
> + if (!filter_data->omit_trees)
> + return LOFR_MARK_SEEN | LOFR_DO_SHOW;
> +
>   case LOFS_BLOB:
> - assert(obj->type == OBJ_BLOB);
>   assert((obj->flags & SEEN) == 0);

Moving the case LOFS_BEGIN_TREE and removing the assert is unnecessary,
I think.

Also, there's fallthrough. If that's on purpose, add /* fallthrough */,
although I think that it complicates the code unnecessarily here.

> +test_expect_success 'verify only:commits packfile has no blobs or trees' '
> + git -C r1 pack-objects --rev --stdout --filter=only:commits 
> >commitsonly.pack <<-EOF &&
> + HEAD
> + EOF
> + git -C r1 index-pack ../commitsonly.pack &&
> + git -C r1 verify-pack -v ../commitsonly.pack \
> + | grep -E "tree|blob" \
> + | sort >observed &&
> + test_line_count = 0 observed
> +'

Bash pipes conceal return codes. Here it's OK, but it might be better to
write the verify-pack on its own line and then '! grep -E "tree|blob"' -
you don't need to sort or test_line_count.
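The pipeline problem is easy to demonstrate outside the test suite: a POSIX shell reports only the last command's status.

```shell
# The failing `false` is invisible to the caller: the pipeline's
# status is that of `cat`, the last command.
false | cat
echo "pipeline exit: $?"    # prints 0
```

Hence the suggestion: run the git command on its own line redirecting to a file, then assert on the file with a negated grep, so a failing git invocation fails the test.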

> +test_expect_success 'grab tree directly when using only:commits' '
> + # We should get the tree specified directly but not its blobs or 
> subtrees.
> + git -C r1 pack-objects --rev --stdout --filter=only:commits 
> >commitsonly.pack <<-EOF &&
> + HEAD:
> + EOF
> + git -C r1 index-pack ../commitsonly.pack &&
> + git -C r1 verify-pack -v ../commitsonly.pack \
> + | grep -E "tree|blob" \
> + | sort >observed &&
> + test_line_count = 1 observed
> +'

Similar comment as above, except you can redirect the output of grep to
a file, then test_line_count on that file. No need for sort.


[PATCH 2/2] WIP range-diff: print some statistics about the range

2018-08-09 Thread Stefan Beller
Signed-off-by: Stefan Beller 
---
 range-diff.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/range-diff.c b/range-diff.c
index a977289b7dc..fbabce5f91f 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -264,6 +264,8 @@ static void get_correspondences(struct string_list *a, 
struct string_list *b,
free(b2a);
 }
 
+int completely_different, slightly_different, same;
+
 static void output_pair_header(struct diff_options *diffopt,
   int patch_no_width,
   struct strbuf *buf,
@@ -288,15 +290,19 @@ static void output_pair_header(struct diff_options 
*diffopt,
if (!b_util) {
color = color_old;
status = '<';
+   slightly_different++;
} else if (!a_util) {
color = color_new;
status = '>';
+   completely_different++;
} else if (strcmp(a_util->patch, b_util->patch)) {
color = color_commit;
status = '!';
+   slightly_different++;
} else {
color = color_commit;
status = '=';
+   same++;
}
 
strbuf_reset(buf);
@@ -439,12 +445,15 @@ int show_range_diff(const char *range1, const char 
*range2,
res = error(_("could not parse log for '%s'"), range2);
 
diffopt->color_moved = COLOR_MOVED_DEFAULT;
+
if (!res) {
		find_exact_matches(&branch1, &branch2);
		get_correspondences(&branch1, &branch2, creation_factor);
		output(&branch1, &branch2, diffopt);
}
 
+   printf("patch ranges are %d same, %d slightly different and %d completely different\n", same, slightly_different, completely_different);
+
	string_list_clear(&branch1, 1);
	string_list_clear(&branch2, 1);
 
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH 0/2] Getting data on different diff algorithms WAS: Documentation/diff-options: explain different diff algorithms

2018-08-09 Thread Stefan Beller
I instrumented range-diff to use different diff algorithms for the patches
(I chose default-myers and patience) and then ran it on the same range,
via ./git-range-diff v2.10.0..HEAD v2.10.0..HEAD,
and I found 5245 same, 304 slightly different and 4 completely different
patches in that range.

Looking at the interdiff is not very pleasing even when reading it with
coloring and move detection.

Manually looking at them, I found the patience diff easier to review.

Comparing the 'default' diff algorithm to 'minimal', I'll see 
 5491 same, 58 slightly different and 0 completely different patches.
 
Comparing 'default' to 'histogram', I'll see
 5255 same, 294 slightly different and 8 completely different patches.
 
Comparing 'histogram' to 'patience', I'll see 
 5337 same, 212 slightly different and 10 completely different patches.

This is all to just put data out there, on how much difference to expect from
the diff algorithms. I have not yet dug into the quality of the diffs.
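Without the WIP patches, a rough per-commit version of the same comparison can be done with stock git options (repository contents made up; identical output counts as "same" in the statistics above):

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email a@example.com
git config user.name A
printf 'a\nb\nc\n' >f && git add f && git commit -qm one
printf 'a\nB\nb\nc\n' >f && git add f && git commit -qm two
# Render the same change with two different diff algorithms.
git log -1 -p --diff-algorithm=myers >myers.diff
git log -1 -p --diff-algorithm=patience >patience.diff
if diff -q myers.diff patience.diff >/dev/null; then
	echo same
else
	echo different
fi
```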

Stefan


Stefan Beller (2):
  WIP: range-diff: take extra arguments for different diffs.
  WIP range-diff: print some statistics about the range

 range-diff.c | 29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH 1/2] WIP: range-diff: take extra arguments for different diffs.

2018-08-09 Thread Stefan Beller
We can use the range-diff on the same range to examine differences in the
diff algorithm.

Signed-off-by: Stefan Beller 
---
 range-diff.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/range-diff.c b/range-diff.c
index 347b4a79f25..a977289b7dc 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -28,7 +28,8 @@ struct patch_util {
  * Reads the patches into a string list, with the `util` field being populated
  * as struct object_id (will need to be free()d).
  */
-static int read_patches(const char *range, struct string_list *list)
+static int read_patches(const char *range, struct string_list *list,
+   struct argv_array *extra_log_args)
 {
struct child_process cp = CHILD_PROCESS_INIT;
FILE *in;
@@ -36,7 +37,12 @@ static int read_patches(const char *range, struct 
string_list *list)
struct patch_util *util = NULL;
int in_header = 1;
 
-   argv_array_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges",
+   argv_array_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges", NULL);
+
+   if (extra_log_args)
+   argv_array_pushv(&cp.args, extra_log_args->argv);
+
+   argv_array_pushl(&cp.args,
"--reverse", "--date-order", "--decorate=no",
"--no-abbrev-commit", range,
NULL);
@@ -419,14 +425,20 @@ int show_range_diff(const char *range1, const char 
*range2,
 {
int res = 0;
 
+   struct argv_array extra = ARGV_ARRAY_INIT;
+
struct string_list branch1 = STRING_LIST_INIT_DUP;
struct string_list branch2 = STRING_LIST_INIT_DUP;
 
-   if (read_patches(range1, &branch1))
+   argv_array_push(&extra, "--diff-algorithm=patience");
+   argv_array_push(&extra, "--indent-heuristic");
+
+   if (read_patches(range1, &branch1, NULL))
res = error(_("could not parse log for '%s'"), range1);
-   if (!res && read_patches(range2, &branch2))
+   if (!res && read_patches(range2, &branch2, &extra))
res = error(_("could not parse log for '%s'"), range2);
 
+   diffopt->color_moved = COLOR_MOVED_DEFAULT;
if (!res) {
		find_exact_matches(&branch1, &branch2);
		get_correspondences(&branch1, &branch2, creation_factor);
-- 
2.18.0.597.ga71716f1ad-goog



Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Elijah Newren
On Thu, Aug 9, 2018 at 2:59 PM Jeff King  wrote:
> On Thu, Aug 09, 2018 at 02:53:42PM -0700, Elijah Newren wrote:
>
> > On Thu, Aug 9, 2018 at 2:44 PM Jeff King  wrote:
> > > > The error message isn't quite as good, but does the user really need
> > > > all the names of the file?  If so, we gave them enough information to
> > > > figure it out, and this is a really unusual case anyway, right?
> > > > Besides, now we're back to linear performance
> > >
> > > Well, it's still quadratic when they run O(n) iterations of "git
> > > ls-files -s | grep $colliding_oid". You've just pushed the second linear
> > > search onto the user. ;)
> >
> > Wouldn't that be their own fault for not running
> >   git ls-files -s | grep -e $colliding_oid_1 ... -e $colliding_oid_n | sort 
> > -k 2
> > ?   ;-)
>
> Man, this thread is the gift that keeps on giving. :)
>
> That's still quadratic, isn't it? You've just hidden the second
> dimension in the single grep call.

It may depend on the implementation within grep.  If I had remembered
to pass -F, and if wikipedia's claims[1] about the Aho-Corasick
algorithm being the basis of the original Unix command fgrep, and its
claims that this algorithm is "linear in the length of the strings
plus the length of the searched text plus the number of output
matches", hold, and I didn't just misunderstand something there, then
it looks like it could be linear.

[1] https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

Of course, the grep implementation could also do something stupid and
provide dramatically worse behavior than running the command N times
specifying one pattern each time.[2]

[2] http://savannah.gnu.org/bugs/?16305

> Now since these are all going to be constant strings, in theory an
> intelligent grep could stick them all in a search trie, and match each
> line with complexity k, the length of the matched strings. And since
> k=40, that's technically still linear overall.

Looks like Aho-Corasick uses "a finite-state machine that resembles a
trie", so definitely along those lines.
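Concretely, the multi-pattern fixed-string search under discussion looks like this (file names and contents made up):

```shell
tmp=$(mktemp -d) && cd "$tmp"
# One pass over the input, many fixed patterns: with -F, GNU grep can
# use an Aho-Corasick-style automaton, so the cost is roughly linear
# in the input size plus the total pattern size.
printf 'deadbeef path/a\ncafebabe path/b\nfeedface path/c\n' >ls-files.out
printf 'deadbeef\nfeedface\n' >oids
grep -F -f oids ls-files.out
```

This matches the two lines containing the listed OIDs in a single pass, rather than one grep invocation per OID.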


'git submodule update' ignores [http] config

2018-08-09 Thread Jonathon Reinhart
I've been trying to track down an issue with the GitLab CI Runner:
https://gitlab.com/gitlab-org/gitlab-runner/issues/3497

Note that I'm using "git version 2.7.2.windows.1".

I've narrowed it down to an observation that the [http] config seems
to be ignored by 'git submodule update'. Shouldn't those options be
respected by submodules?

Given a .git/config file like this:


[fetch]
recurseSubmodules = false
[http "https://gitlab.example.com"]
sslCAInfo =
C:\\Users\\gitlab-runner\\builds\\deadbeef\\0\\somegroup\\someproj.git\\CA_SERVER_TLS_CA_FILE
[core]
...
[remote "origin"]
url = 
https://jreinhart:@gitlab.example.com/somegroup/someproj.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
[submodule "some-lib"]
url = 
https://jreinhart:@gitlab.example.com/somegroup/some-lib.git


...I see the following results:


C:\Users\jreinhart\testrepo>set GIT_CURL_VERBOSE=1
C:\Users\jreinhart\testrepo>git fetch
...
* Connected to gitlab.example.com (x.x.x.x) port 443 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: 
C:\Users\gitlab-runner\builds\deadbeef\0\somegroup\someproj.git\CA_SERVER_TLS_CA_FILE
  CApath: none
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  
*SSL certificate verify ok.
...
C:\Users\jreinhart\testrepo>git checkout master
C:\Users\jreinhart\testrepo>git submodule update --init
...
* Connected to gitlab.example.com (x.x.x.x) port 443 (#0)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: C:/Program Files/Git/mingw64/ssl/certs/ca-bundle.crt
  CApath: none
* SSL certificate problem: unable to get local issuer certificate
...


Note that the CAfile reverted to its default instead of using the same
one from the `git fetch`.


Thanks in advance,

Jonathon Reinhart


[PATCH 1/5] revision: invert meaning of the USER_GIVEN flag

2018-08-09 Thread Matthew DeVore
Abandon the previous approach of mutating all new objects implicitly in
add_pending_object by inverting the meaning of the bit (it is now
NOT_USER_GIVEN) and only setting the flag when we need to.

This more accurately tracks if a tree was provided directly by the user.
Without this patch, the root tree of all commits were erroneously
considered to be USER_GIVEN, which meant they cannot be filtered. This
distinction is important in the next patch.

Signed-off-by: Matthew DeVore 
---
 list-objects.c | 31 ++-
 revision.c |  1 -
 revision.h | 10 +++---
 3 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index c99c47ac1..482044bda 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -48,7 +48,7 @@ static void process_blob(struct rev_info *revs,
 
pathlen = path->len;
strbuf_addstr(path, name);
-   if (!(obj->flags & USER_GIVEN) && filter_fn)
+   if ((obj->flags & NOT_USER_GIVEN) && filter_fn)
r = filter_fn(LOFS_BLOB, obj,
  path->buf, &path->buf[pathlen],
  filter_data);
@@ -133,7 +133,7 @@ static void process_tree(struct rev_info *revs,
}
 
strbuf_addstr(base, name);
-   if (!(obj->flags & USER_GIVEN) && filter_fn)
+   if ((obj->flags & NOT_USER_GIVEN) && filter_fn)
r = filter_fn(LOFS_BEGIN_TREE, obj,
  base->buf, &base->buf[baselen],
  filter_data);
@@ -156,23 +156,25 @@ static void process_tree(struct rev_info *revs,
continue;
}
 
-   if (S_ISDIR(entry.mode))
-   process_tree(revs,
-lookup_tree(the_repository, entry.oid),
-show, base, entry.path,
+   if (S_ISDIR(entry.mode)) {
+   struct tree *t = lookup_tree(the_repository, entry.oid);
+   t->object.flags |= NOT_USER_GIVEN;
+   process_tree(revs, t, show, base, entry.path,
 cb_data, filter_fn, filter_data);
+   }
else if (S_ISGITLINK(entry.mode))
process_gitlink(revs, entry.oid->hash,
show, base, entry.path,
cb_data);
-   else
-   process_blob(revs,
-lookup_blob(the_repository, entry.oid),
-show, base, entry.path,
+   else {
+   struct blob *b = lookup_blob(the_repository, entry.oid);
+   b->object.flags |= NOT_USER_GIVEN;
+   process_blob(revs, b, show, base, entry.path,
 cb_data, filter_fn, filter_data);
+   }
}
 
-   if (!(obj->flags & USER_GIVEN) && filter_fn) {
+   if ((obj->flags & NOT_USER_GIVEN) && filter_fn) {
r = filter_fn(LOFS_END_TREE, obj,
  base->buf, &base->buf[baselen],
  filter_data);
@@ -301,8 +303,11 @@ static void do_traverse(struct rev_info *revs,
 * an uninteresting boundary commit may not have its tree
 * parsed yet, but we are not going to show them anyway
 */
-   if (get_commit_tree(commit))
-   add_pending_tree(revs, get_commit_tree(commit));
+   if (get_commit_tree(commit)) {
+   struct tree *tree = get_commit_tree(commit);
+   tree->object.flags |= NOT_USER_GIVEN;
+   add_pending_tree(revs, tree);
+   }
show_commit(commit, show_data);
 
if (revs->tree_blobs_in_commit_order)
diff --git a/revision.c b/revision.c
index 062749437..6d355b43c 100644
--- a/revision.c
+++ b/revision.c
@@ -175,7 +175,6 @@ static void add_pending_object_with_path(struct rev_info 
*revs,
    strbuf_release(&buf);
return; /* do not add the commit itself */
}
-   obj->flags |= USER_GIVEN;
    add_object_array_with_path(obj, name, &revs->pending, mode, path);
 }
 
diff --git a/revision.h b/revision.h
index c599c34da..cd6b62313 100644
--- a/revision.h
+++ b/revision.h
@@ -8,7 +8,11 @@
 #include "diff.h"
 #include "commit-slab-decl.h"
 
-/* Remember to update object flag allocation in object.h */
+/* Remember to update object flag allocation in object.h
+ * NEEDSWORK: NOT_USER_GIVEN doesn't apply to commits because we only support
+ * filtering trees and blobs, but it may be useful to support filtering commits
+ * in the future.
+ */
 #define SEEN   (1u<<0)
 #define UNINTERESTING   (1u<<1)
 #define TREESAME   (1u<<2)
@@ -20,9 +24,9 @@
 #define SYMMETRIC_LEFT (1u<<8)
 #define 

[PATCH 4/5] list-objects: refactor to process_tree_contents

2018-08-09 Thread Matthew DeVore
This will be used in a follow-up patch to reduce indentation needed when
invoking the logic conditionally. i.e. rather than:

if (foo) {
while (...) {
/* this is very indented */
}
}

we will have:

if (foo)
process_tree_contents(...);

Signed-off-by: Matthew DeVore 
---
 list-objects.c | 73 +-
 1 file changed, 43 insertions(+), 30 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index fa34fbf58..7ecdb95ce 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -94,6 +94,48 @@ static void process_gitlink(struct traversal_context *ctx,
/* Nothing to do */
 }
 
+static void process_tree(struct traversal_context *ctx,
+struct tree *tree,
+struct strbuf *base,
+const char *name);
+
+static void process_tree_contents(struct traversal_context *ctx,
+ struct tree *tree,
+ struct strbuf *base)
+{
+   struct tree_desc desc;
+   struct name_entry entry;
+   enum interesting match = ctx->revs->diffopt.pathspec.nr == 0 ?
+   all_entries_interesting : entry_not_interesting;
+
+   init_tree_desc(&desc, tree->buffer, tree->size);
+
+   while (tree_entry(&desc, &entry)) {
+   if (match != all_entries_interesting) {
+   match = tree_entry_interesting(&entry, base, 0,
+  &ctx->revs->diffopt.pathspec);
+   if (match == all_entries_not_interesting)
+   break;
+   if (match == entry_not_interesting)
+   continue;
+   }
+
+   if (S_ISDIR(entry.mode)) {
+   struct tree *t = lookup_tree(the_repository, entry.oid);
+   t->object.flags |= NOT_USER_GIVEN;
+   process_tree(ctx, t, base, entry.path);
+   }
+   else if (S_ISGITLINK(entry.mode))
+   process_gitlink(ctx, entry.oid->hash,
+   base, entry.path);
+   else {
+   struct blob *b = lookup_blob(the_repository, entry.oid);
+   b->object.flags |= NOT_USER_GIVEN;
+   process_blob(ctx, b, base, entry.path);
+   }
+   }
+}
+
 static void process_tree(struct traversal_context *ctx,
 struct tree *tree,
 struct strbuf *base,
@@ -101,10 +143,6 @@ static void process_tree(struct traversal_context *ctx,
 {
    struct object *obj = &tree->object;
struct rev_info *revs = ctx->revs;
-   struct tree_desc desc;
-   struct name_entry entry;
-   enum interesting match = revs->diffopt.pathspec.nr == 0 ?
-   all_entries_interesting: entry_not_interesting;
int baselen = base->len;
enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
int gently = revs->ignore_missing_links ||
@@ -144,32 +182,7 @@ static void process_tree(struct traversal_context *ctx,
if (base->len)
strbuf_addch(base, '/');
 
-   init_tree_desc(&desc, tree->buffer, tree->size);
-
-   while (tree_entry(&desc, &entry)) {
-   if (match != all_entries_interesting) {
-   match = tree_entry_interesting(&entry, base, 0,
-  &revs->diffopt.pathspec);
-   if (match == all_entries_not_interesting)
-   break;
-   if (match == entry_not_interesting)
-   continue;
-   }
-
-   if (S_ISDIR(entry.mode)) {
-   struct tree *t = lookup_tree(the_repository, entry.oid);
-   t->object.flags |= NOT_USER_GIVEN;
-   process_tree(ctx, t, base, entry.path);
-   }
-   else if (S_ISGITLINK(entry.mode))
-   process_gitlink(ctx, entry.oid->hash,
-   base, entry.path);
-   else {
-   struct blob *b = lookup_blob(the_repository, entry.oid);
-   b->object.flags |= NOT_USER_GIVEN;
-   process_blob(ctx, b, base, entry.path);
-   }
-   }
+   process_tree_contents(ctx, tree, base);
 
if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_fn) {
r = ctx->filter_fn(LOFS_END_TREE, obj,
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH 2/5] list-objects-filter: implement filter only:commits

2018-08-09 Thread Matthew DeVore
Teach list-objects the "only:commits" filter which allows for filtering
out all non-commit and non-annotated tag objects (unless other objects
are explicitly specified by the user). The purpose of this patch is to
allow smaller partial clones.

The name of this filter - only:commits - is a bit inaccurate because it
still allows annotated tags to pass through. I chose it because it was
the only concise name I could think of that was pretty descriptive. I
considered and decided against "tree:none" because the code and
documentation for filters seems to lack the concept of "you're filtering
this, so we'll implicitly filter all referents of this." So "tree:none"
is vague, since some may think it filters blobs too, while some may not.
"only:commits" is specific and makes it easier to match it to a
potential use case.

Signed-off-by: Matthew DeVore 
---
 Documentation/rev-list-options.txt |  2 ++
 list-objects-filter-options.c  |  4 +++
 list-objects-filter-options.h  |  1 +
 list-objects-filter.c  | 43 ++
 t/t5317-pack-objects-filter-objects.sh | 30 ++
 t/t6112-rev-list-filters-objects.sh| 13 
 6 files changed, 80 insertions(+), 13 deletions(-)

diff --git a/Documentation/rev-list-options.txt 
b/Documentation/rev-list-options.txt
index 7b273635d..3a60a490a 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -743,6 +743,8 @@ specification contained in .
A debug option to help with future "partial clone" development.
This option specifies how missing objects are handled.
 +
+The form '--filter=only:commits' omits all blobs and trees.
++
 The form '--missing=error' requests that rev-list stop with an error if
 a missing object is encountered.  This is the default action.
 +
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index c0e2bd6a0..ae508 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -69,6 +69,10 @@ static int gently_parse_list_objects_filter(
filter_options->choice = LOFC_SPARSE_PATH;
filter_options->sparse_path_value = strdup(v0);
return 0;
+
+   } else if (!strcmp(arg, "only:commits")) {
+   filter_options->choice = LOFC_ONLY_COMMITS;
+   return 0;
}
 
if (errbuf) {
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
index a61f8..a68df42c8 100644
--- a/list-objects-filter-options.h
+++ b/list-objects-filter-options.h
@@ -12,6 +12,7 @@ enum list_objects_filter_choice {
LOFC_BLOB_LIMIT,
LOFC_SPARSE_OID,
LOFC_SPARSE_PATH,
+   LOFC_ONLY_COMMITS,
LOFC__COUNT /* must be last */
 };
 
diff --git a/list-objects-filter.c b/list-objects-filter.c
index a0ba78b20..f0a064b4b 100644
--- a/list-objects-filter.c
+++ b/list-objects-filter.c
@@ -26,38 +26,39 @@
 #define FILTER_SHOWN_BUT_REVISIT (1<<21)
 
 /*
- * A filter for list-objects to omit ALL blobs from the traversal.
- * And to OPTIONALLY collect a list of the omitted OIDs.
+ * A filter for list-objects to omit ALL blobs from the traversal, and possibly
+ * trees as well.
+ * Can OPTIONALLY collect a list of the omitted OIDs.
  */
-struct filter_blobs_none_data {
+struct filter_none_of_type_data {
+   unsigned omit_trees : 1;
struct oidset *omits;
 };
 
-static enum list_objects_filter_result filter_blobs_none(
+static enum list_objects_filter_result filter_none_of_type(
enum list_objects_filter_situation filter_situation,
struct object *obj,
const char *pathname,
const char *filename,
void *filter_data_)
 {
-   struct filter_blobs_none_data *filter_data = filter_data_;
+   struct filter_none_of_type_data *filter_data = filter_data_;
 
switch (filter_situation) {
default:
die("unknown filter_situation");
return LOFR_ZERO;
 
-   case LOFS_BEGIN_TREE:
-   assert(obj->type == OBJ_TREE);
-   /* always include all tree objects */
-   return LOFR_MARK_SEEN | LOFR_DO_SHOW;
-
case LOFS_END_TREE:
assert(obj->type == OBJ_TREE);
return LOFR_ZERO;
 
+   case LOFS_BEGIN_TREE:
+   assert(obj->type == OBJ_TREE);
+   if (!filter_data->omit_trees)
+   return LOFR_MARK_SEEN | LOFR_DO_SHOW;
+
case LOFS_BLOB:
-   assert(obj->type == OBJ_BLOB);
assert((obj->flags & SEEN) == 0);
 
if (filter_data->omits)
@@ -72,10 +73,25 @@ static void *filter_blobs_none__init(
filter_object_fn *filter_fn,
filter_free_fn *filter_free_fn)
 {
-   struct filter_blobs_none_data *d = xcalloc(1, sizeof(*d));
+   struct filter_none_of_type_data *d = xcalloc(1, sizeof(*d));
+   d->omits = omitted;
+
+   *filter_fn = filter_none_of_type;
+ 

[RFC PATCH 0/5] filter: support for excluding all trees and blobs

2018-08-09 Thread Matthew DeVore
This patch series does two things:

  a. (patches 1-2) introduces "--filter=only:commits" which filters trees and 
blobs
  b. (patches 3-5) better support for promisor trees in the rev-list command

The intention is to enable initial partial clones to be very tiny by only
including commits. Patches 3-5 are necessary because, even though it has already
been possible to have partial clones with trees missing (there are tests for 
it),
there have not been any filters which support this yet, so it seemed necessary
to make rev-list handle this case better.

Thank you,

Matthew DeVore (5):
  revision: invert meaning of the USER_GIVEN flag
  list-objects-filter: implement filter only:commits
  list-objects: store common func args in struct
  list-objects: refactor to process_tree_contents
  rev-list: handle missing tree objects properly

 Documentation/rev-list-options.txt |   2 +
 builtin/rev-list.c |  12 +-
 list-objects-filter-options.c  |   4 +
 list-objects-filter-options.h  |   1 +
 list-objects-filter.c  |  43 +++--
 list-objects.c | 226 +
 revision.c |   1 -
 revision.h |  11 +-
 t/t5317-pack-objects-filter-objects.sh |  30 
 t/t5616-partial-clone.sh   |  27 +++
 t/t6112-rev-list-filters-objects.sh|  13 ++
 11 files changed, 242 insertions(+), 128 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog



[PATCH 5/5] rev-list: handle missing tree objects properly

2018-08-09 Thread Matthew DeVore
Previously, we assumed only blob objects could be missing. This patch
makes rev-list handle missing trees like missing blobs: a missing tree
now causes an error when --missing says an error should be raised, and
its hash is printed even when the tree object itself is absent.

Signed-off-by: Matthew DeVore 
---
 builtin/rev-list.c   | 12 
 list-objects.c   |  8 ++--
 revision.h   |  1 +
 t/t5616-partial-clone.sh | 27 +++
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 5b07f3f4a..c870d4fe6 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -6,6 +6,7 @@
 #include "list-objects.h"
 #include "list-objects-filter.h"
 #include "list-objects-filter-options.h"
+#include "object.h"
 #include "object-store.h"
 #include "pack.h"
 #include "pack-bitmap.h"
@@ -209,7 +210,8 @@ static inline void finish_object__ma(struct object *obj)
 */
switch (arg_missing_action) {
case MA_ERROR:
-   die("missing blob object '%s'", oid_to_hex(&obj->oid));
+   die("missing %s object '%s'",
+   type_name(obj->type), oid_to_hex(&obj->oid));
return;
 
case MA_ALLOW_ANY:
@@ -222,8 +224,8 @@ static inline void finish_object__ma(struct object *obj)
case MA_ALLOW_PROMISOR:
if (is_promisor_object(&obj->oid))
return;
-   die("unexpected missing blob object '%s'",
-   oid_to_hex(&obj->oid));
+   die("unexpected missing %s object '%s'",
+   type_name(obj->type), oid_to_hex(&obj->oid));
return;
 
default:
@@ -235,7 +237,7 @@ static inline void finish_object__ma(struct object *obj)
 static int finish_object(struct object *obj, const char *name, void *cb_data)
 {
struct rev_list_info *info = cb_data;
-   if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid)) {
+   if (!has_object_file(&obj->oid)) {
finish_object__ma(obj);
return 1;
}
@@ -373,6 +375,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
init_revisions(&revs, prefix);
revs.abbrev = DEFAULT_ABBREV;
revs.commit_format = CMIT_FMT_UNSPECIFIED;
+   revs.show_missing_trees = 1;
 
/*
 * Scan the argument list before invoking setup_revisions(), so that we
@@ -389,6 +392,7 @@ int cmd_rev_list(int argc, const char **argv, const char 
*prefix)
if (!strcmp(arg, "--exclude-promisor-objects")) {
fetch_if_missing = 0;
revs.exclude_promisor_objects = 1;
+   revs.show_missing_trees = 0;
break;
}
}
diff --git a/list-objects.c b/list-objects.c
index 7ecdb95ce..b0291c45a 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -146,7 +146,9 @@ static void process_tree(struct traversal_context *ctx,
int baselen = base->len;
enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
int gently = revs->ignore_missing_links ||
+revs->show_missing_trees ||
 revs->exclude_promisor_objects;
+   int parse_result;
 
if (!revs->tree_objects)
return;
@@ -154,7 +156,8 @@ static void process_tree(struct traversal_context *ctx,
die("bad tree object");
if (obj->flags & (UNINTERESTING | SEEN))
return;
-   if (parse_tree_gently(tree, gently) < 0) {
+   parse_result = parse_tree_gently(tree, gently);
+   if (parse_result < 0 && !revs->show_missing_trees) {
if (revs->ignore_missing_links)
return;
 
@@ -182,7 +185,8 @@ static void process_tree(struct traversal_context *ctx,
if (base->len)
strbuf_addch(base, '/');
 
-   process_tree_contents(ctx, tree, base);
+   if (parse_result >= 0)
+   process_tree_contents(ctx, tree, base);
 
if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_fn) {
r = ctx->filter_fn(LOFS_END_TREE, obj,
diff --git a/revision.h b/revision.h
index cd6b62313..34ff99f05 100644
--- a/revision.h
+++ b/revision.h
@@ -128,6 +128,7 @@ struct rev_info {
first_parent_only:1,
line_level_traverse:1,
tree_blobs_in_commit_order:1,
+   show_missing_trees:1,
 
/* for internal use only */
exclude_promisor_objects:1;
diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index bbbe7537d..8a0ca0a74 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -170,6 +170,33 @@ test_expect_success 'partial clone fetches blobs pointed 
to by refs even if norm
git -C dst fsck
 '
 
+test_expect_success 'can use only:commits to filter partial clone' '
+   rm -rf dst &&
+ 

[PATCH 3/5] list-objects: store common func args in struct

2018-08-09 Thread Matthew DeVore
This will make utility functions easier to create, as done by the next
patch.

Signed-off-by: Matthew DeVore 
---
 list-objects.c | 152 +++--
 1 file changed, 71 insertions(+), 81 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index 482044bda..fa34fbf58 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -12,20 +12,25 @@
 #include "packfile.h"
 #include "object-store.h"
 
-static void process_blob(struct rev_info *revs,
+struct traversal_context {
+   struct rev_info *revs;
+   show_object_fn show_object;
+   show_commit_fn show_commit;
+   void *show_data;
+   filter_object_fn filter_fn;
+   void *filter_data;
+};
+
+static void process_blob(struct traversal_context *ctx,
 struct blob *blob,
-show_object_fn show,
 struct strbuf *path,
-const char *name,
-void *cb_data,
-filter_object_fn filter_fn,
-void *filter_data)
+const char *name)
 {
struct object *obj = &blob->object;
size_t pathlen;
enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
 
-   if (!revs->blob_objects)
+   if (!ctx->revs->blob_objects)
return;
if (!obj)
die("bad blob object");
@@ -41,21 +46,21 @@ static void process_blob(struct rev_info *revs,
 * may cause the actual filter to report an incomplete list
 * of missing objects.
 */
-   if (revs->exclude_promisor_objects &&
+   if (ctx->revs->exclude_promisor_objects &&
!has_object_file(&obj->oid) &&
is_promisor_object(&obj->oid))
return;
 
pathlen = path->len;
strbuf_addstr(path, name);
-   if ((obj->flags & NOT_USER_GIVEN) && filter_fn)
-   r = filter_fn(LOFS_BLOB, obj,
- path->buf, &path->buf[pathlen],
- filter_data);
+   if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_fn)
+   r = ctx->filter_fn(LOFS_BLOB, obj,
+  path->buf, &path->buf[pathlen],
+  ctx->filter_data);
if (r & LOFR_MARK_SEEN)
obj->flags |= SEEN;
if (r & LOFR_DO_SHOW)
-   show(obj, path->buf, cb_data);
+   ctx->show_object(obj, path->buf, ctx->show_data);
strbuf_setlen(path, pathlen);
 }
 
@@ -81,26 +86,21 @@ static void process_blob(struct rev_info *revs,
  * the link, and how to do it. Whether it necessarily makes
  * any sense what-so-ever to ever do that is another issue.
  */
-static void process_gitlink(struct rev_info *revs,
+static void process_gitlink(struct traversal_context *ctx,
const unsigned char *sha1,
-   show_object_fn show,
struct strbuf *path,
-   const char *name,
-   void *cb_data)
+   const char *name)
 {
/* Nothing to do */
 }
 
-static void process_tree(struct rev_info *revs,
+static void process_tree(struct traversal_context *ctx,
 struct tree *tree,
-show_object_fn show,
 struct strbuf *base,
-const char *name,
-void *cb_data,
-filter_object_fn filter_fn,
-void *filter_data)
+const char *name)
 {
struct object *obj = &tree->object;
+   struct rev_info *revs = ctx->revs;
struct tree_desc desc;
struct name_entry entry;
enum interesting match = revs->diffopt.pathspec.nr == 0 ?
@@ -133,14 +133,14 @@ static void process_tree(struct rev_info *revs,
}
 
strbuf_addstr(base, name);
-   if ((obj->flags & NOT_USER_GIVEN) && filter_fn)
-   r = filter_fn(LOFS_BEGIN_TREE, obj,
- base->buf, &base->buf[baselen],
- filter_data);
+   if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_fn)
+   r = ctx->filter_fn(LOFS_BEGIN_TREE, obj,
+  base->buf, &base->buf[baselen],
+  ctx->filter_data);
if (r & LOFR_MARK_SEEN)
obj->flags |= SEEN;
if (r & LOFR_DO_SHOW)
-   show(obj, base->buf, cb_data);
+   ctx->show_object(obj, base->buf, ctx->show_data);
if (base->len)
strbuf_addch(base, '/');
 
@@ -159,29 +159,26 @@ static void process_tree(struct rev_info *revs,
if (S_ISDIR(entry.mode)) {
struct tree *t = lookup_tree(the_repository, entry.oid);
t->object.flags |= NOT_USER_GIVEN;
-   process_tree(revs, t, show, base, 

Re: [PATCH 02/10] string-list.h: add string_list_pop function.

2018-08-09 Thread Stefan Beller
On Thu, Aug 9, 2018 at 2:56 PM Jeff King  wrote:

>
> I think you could have helpers to spell the two lines above even more
> nicely:
>
>   while (list->nr) {
> work_on(list_top(list));
> list_pop(list); /* note this doesn't return anything! */
>   }
>
> But yes, it's not possible with the current functions.

I like this one most, and as we manage our own memory via the
alloc variable we also would not see a lot of memory churn for constant
push/pop traffic.

> You can also use a list.h linked-list. Then removal from the list and
> freeing are two separate operations (but it exercises your malloc a lot
> more if you're constantly pushing and popping).

For that I'd have to define my own type derived from list.h to carry
the string and the util pointer, which looks very similar to the string_list
we already have from a users POV.

> > > Where that falls down is if you really need work_on() to put more items
> > > on the stack, but only after you've removed the current top. But then
> > > writing it out may still be nicer, because it makes it clear you have to
> > > do:
> > >
> > >   const char *cur_string = xstrdup(list->items[list->nr-1].string);
> >
> > Another way would be to use
> >
> >   string_list_pop(&list, &string_dst, &util_dst);
> > i.e.
> >   /* Returns 0 if the dst was filled */
> >   int (struct string_list *, char **, void**)
> >
> > as then we do not expose the internals and would not have issues
> > with reallocs.
>
> Yes, I almost suggested that, but there's the question of memory
> ownership of string_dst. Does it need freed or not? Is that answer
> dependent on the strdup_strings flag?

Sure. But as the caller, you should know?
You constructed that string_list.


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Junio C Hamano
Elijah Newren  writes:

> A possibly crazy idea: Don't bother reporting the other filename; just
> report the OID instead.
>
> "Error: Foo.txt cannot be checked out because another file with hash
>  is in the way."  Maybe even add a hint for the user: "Run
> `git ls-files -s` to see all files and their hash".

Once we start using OID to talk to humans, we are already lost.  At
that point, it would be 1000% better to

 - not check out Foo.txt to unlink and overwrite; instead leave the
   content that is already in the working tree as-is; and

 - report that Foo.txt was not checked out as something else that
   also claims to deserve that path was checked out already.

Then the user can inspect Foo.txt and realize it is actually the
contents that should be in foo.txt or whatever.




Re: [RFC PATCH] packfile: iterate packed objects in pack order

2018-08-09 Thread Jonathan Tan
On Wed, Aug 8, 2018 at 4:25 PM, Jeff King  wrote:
> Even if you just use the oid to do a separate lookup in the object
> database, there's still a benefit in accessing the objects in pack
> order.

You're probably right, but I don't immediately see what the benefit is.

On a not completely unrelated note, I just realized that in my patch,
i should be pack_nr (ordinal in pack order) and pack_nr should be
index_nr (ordinal in index order, i.e. alphabetical order). If you run
with this, feel free to write your own patch and maybe I'll learn how
accessing objects in pack order benefits looking up the object
database through the commit message.

> I can try to pick this up and carry the cat-file bits to completion if
> you want, but probably not until tomorrow or Friday.

Thanks - feel free to do this.


Re: [PATCH 04/10] submodule.c: convert submodule_move_head new argument to object id

2018-08-09 Thread Junio C Hamano
Stefan Beller  writes:

> All callers use oid_to_hex to convert the desired oid to a string before
> calling submodule_move_head. Defer the conversion to the
> submodule_move_head as it will turn out to be useful in a bit.
>
> Signed-off-by: Stefan Beller 
> ---
>  entry.c|  6 +++---
>  submodule.c| 12 ++--
>  submodule.h|  2 +-
>  unpack-trees.c | 13 +
>  4 files changed, 15 insertions(+), 18 deletions(-)
>
> diff --git a/entry.c b/entry.c
> index b5d1d3cf231..4b34dfd30df 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -358,7 +358,7 @@ static int write_entry(struct cache_entry *ce,
>   sub = submodule_from_ce(ce);
>   if (sub)
>   return submodule_move_head(ce->name,
> - NULL, oid_to_hex(&ce->oid),
> + NULL, &ce->oid,
>   state->force ? SUBMODULE_MOVE_HEAD_FORCE : 0);
>   break;

Nice.



Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 02:53:42PM -0700, Elijah Newren wrote:

> On Thu, Aug 9, 2018 at 2:44 PM Jeff King  wrote:
> > > The error message isn't quite as good, but does the user really need
> > > all the names of the file?  If so, we gave them enough information to
> > > figure it out, and this is a really unusual case anyway, right?
> > > Besides, now we're back to linear performance
> >
> > Well, it's still quadratic when they run O(n) iterations of "git
> > ls-files -s | grep $colliding_oid". You've just pushed the second linear
> > search onto the user. ;)
> 
> Wouldn't that be their own fault for not running
>   git ls-files -s | grep -e $colliding_oid_1 ... -e $colliding_oid_n | sort 
> -k 2
> ?   ;-)

Man, this thread is the gift that keeps on giving. :)

That's still quadratic, isn't it? You've just hidden the second
dimension in the single grep call.

Now since these are all going to be constant strings, in theory an
intelligent grep could stick them all in a search trie, and match each
line with complexity k, the length of the matched strings. And since
k=40, that's technically still linear overall.

-Peff


Re: [PATCH 02/10] string-list.h: add string_list_pop function.

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 02:52:29PM -0700, Stefan Beller wrote:

> > In many cases you can just do:
> >
> >   while (list->nr) {
> > work_on(list->items[list->nr - 1]);
> > list_remove(list, list->nr - 1);
> >   }
> >
> > and then all of those memory ownership issues like:
> 
> [...]
> >
> > just go away. :)
> 
> The only complication here is the lack of list_remove(index),
> we do have list_remove(string), which internally searches the
> item and removes it. Hence I did not want to use it.

Heh, I almost dug into that more.

I think you could have helpers to spell the two lines above even more
nicely:

  while (list->nr) {
work_on(list_top(list));
list_pop(list); /* note this doesn't return anything! */
  }

But yes, it's not possible with the current functions.

> Another idea I had was to keep the list immutable (except amending,
> just like a constitution ;-) and store an index of how far we got in that
> list already. That wastes memory for keeping entries around, but is safe
> for memory due to its nature.

You can also use a list.h linked-list. Then removal from the list and
freeing are two separate operations (but it exercises your malloc a lot
more if you're constantly pushing and popping).

> > Where that falls down is if you really need work_on() to put more items
> > on the stack, but only after you've removed the current top. But then
> > writing it out may still be nicer, because it makes it clear you have to
> > do:
> >
> >   const char *cur_string = xstrdup(list->items[list->nr-1].string);
> 
> Another way would be to use
> 
>   string_list_pop(&list, &string_dst, &util_dst);
> i.e.
>   /* Returns 0 if the dst was filled */
>   int (struct string_list *, char **, void**)
> 
> as then we do not expose the internals and would not have issues
> with reallocs.

Yes, I almost suggested that, but there's the question of memory
ownership of string_dst. Does it need freed or not? Is that answer
dependent on the strdup_strings flag?

> > if you want the data to live past the removal.
> 
> In the code proposed there are no additions (hence no reallocs)
> and the need for the data is short lived.
> 
> But I can see how the design was just fitting my purpose and
> we could come up with some better API.

Yeah, I didn't actually dig into your use case. I just want to make sure
we don't add a crappy function to our API. ;)

-Peff


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Elijah Newren
On Thu, Aug 9, 2018 at 2:44 PM Jeff King  wrote:
> > The error message isn't quite as good, but does the user really need
> > all the names of the file?  If so, we gave them enough information to
> > figure it out, and this is a really unusual case anyway, right?
> > Besides, now we're back to linear performance
>
> Well, it's still quadratic when they run O(n) iterations of "git
> ls-files -s | grep $colliding_oid". You've just pushed the second linear
> search onto the user. ;)

Wouldn't that be their own fault for not running
  git ls-files -s | grep -e $colliding_oid_1 ... -e $colliding_oid_n | sort -k 2
?   ;-)


Re: [PATCH 02/10] string-list.h: add string_list_pop function.

2018-08-09 Thread Stefan Beller
On Thu, Aug 9, 2018 at 2:41 PM Jeff King  wrote:

> >
> >   while (list->nr)
> >   work_on(list_pop(list));
> >
> > which is not so bad.
>
> In many cases you can just do:
>
>   while (list->nr) {
> work_on(list->items[list->nr - 1]);
> list_remove(list, list->nr - 1);
>   }
>
> and then all of those memory ownership issues like:

[...]
>
> just go away. :)

The only complication here is the lack of list_remove(index),
we do have list_remove(string), which internally searches the
item and removes it. Hence I did not want to use it.

Another idea I had was to keep the list immutable (except amending,
just like a constitution ;-) and store an index of how far we got in that
list already. That wastes memory for keeping entries around, but is safe
for memory due to its nature.

> Where that falls down is if you really need work_on() to put more items
> on the stack, but only after you've removed the current top. But then
> writing it out may still be nicer, because it makes it clear you have to
> do:
>
>   const char *cur_string = xstrdup(list->items[list->nr-1].string);

Another way would be to use

  string_list_pop(&list, &string_dst, &util_dst);
i.e.
  /* Returns 0 if the dst was filled */
  int (struct string_list *, char **, void**)

as then we do not expose the internals and would not have issues
with reallocs.

> if you want the data to live past the removal.

In the code proposed there are no additions (hence no reallocs)
and the need for the data is short lived.

But I can see how the design was just fitting my purpose and
we could come up with some better API.

Thanks,
Stefan


Re: [PATCH 03/10] sha1-array: provide oid_array_remove_if

2018-08-09 Thread Junio C Hamano
Jeff King  writes:

> Even with keeping the order this can be done in a single linear pass.
> See filter_string_list() for an example.

Heh, I just wasted a few minutes saying the same; I should have
pointed him at these two lines ;-)


Re: [PATCH 03/10] sha1-array: provide oid_array_remove_if

2018-08-09 Thread Junio C Hamano
Stefan Beller  writes:

> Signed-off-by: Stefan Beller 
> ---
>  sha1-array.c | 39 +++
>  sha1-array.h |  3 +++
>  2 files changed, 42 insertions(+)
>
> diff --git a/sha1-array.c b/sha1-array.c
> index 265941fbf40..10eb08b425e 100644
> --- a/sha1-array.c
> +++ b/sha1-array.c
> @@ -77,3 +77,42 @@ int oid_array_for_each_unique(struct oid_array *array,
>   }
>   return 0;
>  }
> +
> +int oid_array_remove_if(struct oid_array *array,
> + for_each_oid_fn fn,
> + void *data)
> +{
> + int i, j;
> + char *to_remove = xcalloc(array->nr, sizeof(char));
> +
> + /* No oid_array_sort() here! See the api-oid-array.txt docs! */
> +
> + for (i = 0; i < array->nr; i++) {
> + int ret = fn(array->oid + i, data);
> + if (ret)
> + to_remove[i] = 1;
> + }

Doing the same without this secondary array and loop, i.e.

for (src = dst = 0; src < array->nr; src++) {
if (!fn(&array->oid[src], cbdata)) {
if (dst < src)
oidcpy(&array->oid[dst], &array->oid[src]);
dst++;
}
}
array->nr = dst;

would be no less efficient.  The only reason why you might want to
measure move-spans with a secondary array and a preliminary counting
loop like your version does is that moving a contiguous area of memory
may be more efficient than moving single oid-sized chunks, but as far
as I can tell you are not doing that "optimization", either.

I doubt that remove_if() is particularly a good name.  A version of
this function, for which polarity of fn() is reversed, can be called
"grep", or "filter", I think, and would be more understandable.



Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 02:40:58PM -0700, Elijah Newren wrote:

> > I worry that the false positives make this a non-starter.  I mean, if
> > clone creates files 'A' and 'B' (both equal) and then tries to create
> > 'b', would the collision code reports that 'b' collided with 'A' because
> > that was the first OID match?  Ideally with this scheme we'd have to
> > search the entire index prior to 'b' and then report that 'b' collided
> > with either 'A' or 'B'.  Neither message instills confidence.  And
> > there's no way to prefer answer 'B' over 'A' without using knowledge
> > of the FS name mangling/aliasing rules -- unless we want to just assume
> > ignore-case for this iteration.
> 
> A possibly crazy idea: Don't bother reporting the other filename; just
> report the OID instead.
> 
> "Error: Foo.txt cannot be checked out because another file with hash
>  is in the way."  Maybe even add a hint for the user: "Run
> `git ls-files -s` to see all files and their hash".
> 
> Whatever the exact wording for the error message, just create a nice
> post on stackoverflow.com explaining the various weird filesystems out
> there (VFAT, NTFS, HFS, APFS, etc) and how they cause differing
> filenames to be written to the same location.  Have a bunch of folks
> vote it up so it has some nice search-engine juice.

Actually, I kind of like the simplicity of that. It puts the human brain
in the loop.

> The error message isn't quite as good, but does the user really need
> all the names of the file?  If so, we gave them enough information to
> figure it out, and this is a really unusual case anyway, right?
> Besides, now we're back to linear performance

Well, it's still quadratic when they run O(n) iterations of "git
ls-files -s | grep $colliding_oid". You've just pushed the second linear
search onto the user. ;)

-Peff


Re: [PATCH 02/10] string-list.h: add string_list_pop function.

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 02:29:32PM -0700, Junio C Hamano wrote:

> >> +struct string_list_item *string_list_pop(struct string_list *list)
> >> +{
> >> +   if (list->nr == 0)
> >> +   return 0;
> >
> > return NULL, not 0.
> 
> It is OK to return NULL, which may make the caller a bit simpler,
> i.e.
> 
>   while (item = list_pop(list))
>   work_on(item);
> 
> but if we consider it a good discipline for the caller to see if
> there are still things on the stack before attempting to pop, then
> instead of returning NULL, we can have BUG("") there, which requires
> the caller to be more like this:
> 
>   while (list->nr)
>   work_on(list_pop(list));
> 
> which is not so bad.

In many cases you can just do:

  while (list->nr) {
work_on(list->items[list->nr - 1]);
list_remove(list, list->nr - 1);
  }

and then all of those memory ownership issues like:

> > The memory ownership is now with the caller. That is, the caller needs
> > to check/know `list->strdup_strings` and know `free_util` to be able to
> > properly free all memory.
> 
> > OTOH, the pointer returned by this function is only guaranteed to be
> > valid until you start inserting into the list (well, you can do one
> > insertion per pop without worrying, but that's quite detailed
> > implementation knowledge).
> 
> Yes, it is a more grave limitation when using string_list as a
> stack.  A single realloc() and you are dead X-<.

just go away. :)

Where that falls down is if you really need work_on() to put more items
on the stack, but only after you've removed the current top. But then
writing it out may still be nicer, because it makes it clear you have to
do:

  const char *cur_string = xstrdup(list->items[list->nr-1].string);

if you want the data to live past the removal.

-Peff


Re: [PATCH 01/10] string_list: print_string_list to use trace_printf

2018-08-09 Thread Stefan Beller
On Thu, Aug 9, 2018 at 2:16 PM Junio C Hamano  wrote:
>
> Stefan Beller  writes:
>
> > It is a debugging aid, so it should print to the debugging channel.
>
> Who says?

The comment in the header file?

  /**
   * Dump a string_list to stdout, useful mainly for debugging
   * purposes. It can take an optional header argument and it writes out
   * the string-pointer pairs of the string_list, each one in its own
   * line.
   */

>  In-tree code may say so, and I do not think any in-flight
> topic up to 'pu' uses this to produce non-debugging output, but I do
> not think it is healthy attitude to assume that you can take over an
> existing function and change what it does unilaterally.

That is reasonable, as usual, and given the recent fallout of the
object store refactoring I have these concerns on my mind, too.

But for this instance, I do not see any risk of accidental collisions with
other series in flight:

$ git log -S print_string_list origin/pu --oneline
4f665f2cf33 string-list.h: move documentation from Documentation/api/
into header
d10cb3dfab9 help.c: rename function "pretty_print_string_list"
c455c87c5cd Rename path_list to string_list
1eb056905a2 include $PATH in generating list of commands for "help -a"
70827b15bfb Split up builtin commands into separate files from git.c
27dedf0c3b7 (tag: v1.0rc3, tag: v0.99.9j) GIT 0.99.9j aka 1.0rc3
8e49d503882 C implementation of the 'git' program, take two.

> As there is no in-tree or in-flight user, I think it makes sense if
> the proposed change were to rename it to trace_string_list().  If
> there weren't any print_string_list() and we were adding a debugging
> aid to use trace_printf() to dump these, that would be the name we
> would use anyway.

Good call. I'll rename and send separately (or might even drop this
patch completely; this patch is not used in the patches to follow;
it is more a "FYI: here is a thing that I found helpful for debugging,
unlike the existing function, which I found unhelpful and actively
harmful")


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Elijah Newren
On Thu, Aug 9, 2018 at 2:14 PM Jeff Hostetler  wrote:
> On 8/9/2018 10:23 AM, Jeff King wrote:
> > On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:
> >> If we found that there is something when we tried to write out
> >> "Foo.txt", if we open "Foo.txt" on the working tree and hash-object
> >> it, we should find the matching blob somewhere in the index _before_
> >> "Foo.txt".  On a case-insensitive filesytem, it may well be
> >> "foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
> >> only differ in case.
> >
> > Clever. You might still run into false positives when there is
> > duplicated content in the repository (especially, say, zero-length
> > files).  But the fact that you only do the hashing on known duplicates
> > helps with that.
>
> I worry that the false positives make this a non-starter.  I mean, if
> clone creates files 'A' and 'B' (both equal) and then tries to create
> 'b', would the collision code reports that 'b' collided with 'A' because
> that was the first OID match?  Ideally with this scheme we'd have to
> search the entire index prior to 'b' and then report that 'b' collided
> with either 'A' or 'B'.  Neither message instills confidence.  And
> there's no way to prefer answer 'B' over 'A' without using knowledge
> of the FS name mangling/aliasing rules -- unless we want to just assume
> ignore-case for this iteration.

A possibly crazy idea: Don't bother reporting the other filename; just
report the OID instead.

"Error: Foo.txt cannot be checked out because another file with hash
 is in the way."  Maybe even add a hint for the user: "Run
`git ls-files -s` to see all files and their hash".

Whatever the exact wording for the error message, just create a nice
post on stackoverflow.com explaining the various weird filesystems out
there (VFAT, NTFS, HFS, APFS, etc) and how they cause differing
filenames to be written to the same location.  Have a bunch of folks
vote it up so it has some nice search-engine juice.


The error message isn't quite as good, but does the user really need
all the names of the file?  If so, we gave them enough information to
figure it out, and this is a really unusual case anyway, right?
Besides, now we're back to linear performance.


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 05:14:16PM -0400, Jeff Hostetler wrote:

> > Clever. You might still run into false positives when there is
> > duplicated content in the repository (especially, say, zero-length
> > files).  But the fact that you only do the hashing on known duplicates
> > helps with that.
> > 
> 
> I worry that the false positives make this a non-starter.  I mean, if
> clone creates files 'A' and 'B' (both equal) and then tries to create
> 'b', would the collision code report that 'b' collided with 'A' because
> that was the first OID match?  Ideally with this scheme we'd have to
> search the entire index prior to 'b' and then report that 'b' collided
> with either 'A' or 'B'.  Neither message instills confidence.  And
> there's no way to prefer answer 'B' over 'A' without using knowledge
> of the FS name mangling/aliasing rules -- unless we want to just assume
> ignore-case for this iteration.

Yeah. If we can get usable unique ids (inode or otherwise) of some form
on each system, I think I prefer that. It's much easier to reason about.

> Sorry to be paranoid, but I have an index with 3.5M entries, the word
> "quadratic" rings all kinds of alarm bells for me.  :-)
> 
> Granted, we expect the number of collisions to be small, but searching
> back for each collision over the already-populated portion of the index
> could be expensive.

Heh. Yeah, it's really O(n*m), and we expect a small "m". But of course
the two are equal in the worst case.

-Peff
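The inode-based identity check Peff prefers could look something like the sketch below. This is illustrative only, not git's actual code: it assumes a POSIX `lstat()` and uses device+inode as the "usable unique id". On a case-insensitive filesystem, two names that the FS aliases to one file would compare equal here, without git needing to know the mangling rules.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create an empty file at "path" (truncating any existing one). */
static void touch_file(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd >= 0)
		close(fd);
}

/*
 * Return 1 if the two paths are aliases for the same on-disk file
 * (same device and inode), 0 otherwise.  Unlike the rehash-the-blob
 * approach, equal-content-but-distinct files are never confused.
 */
static int same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	if (lstat(a, &sa) || lstat(b, &sb))
		return 0;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

In a checkout loop one would record `(st_dev, st_ino)` for each entry as it is written; when a collision is detected, a scan over the recorded keys names the true partner rather than the first equal-content file.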


Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 02:08:56PM -0700, Junio C Hamano wrote:

> > Correct.  But I am more worried about the "mixed/overwriting"
> > breakage, if there is one; it means we may need to be prepared for
> > systems that lack an O_APPEND that works correctly.  I initially just
> > assumed that it was what Dscho was seeing, but after re-reading his
> > message, I am not sure anymore.
> >
> > I think the "do not trace the other side" approach you suggest for
> > these tests that only care about one side is more appropriate
> > solution for this particular case.  We then do not have to worry
> > about overwriting or output from both sides mixed randomly.
> 
> A concluding sentence I forgot to add, after saying "this is simpler
> and better to fix test breakage", was
> 
>   But if we really are seeing O_APPEND breakage, a mandatory
>   locking mechanism like this one might be necessary to work
>   around it (I seriously hope we do not have to, though).
> 
> Sorry for an additional noise.

Yeah, my earlier email said "if we just care about this test, then...".

But if we do care about making this multiple-tracing case work all the
time, then we'd need some solution. If there's something that can work
like O_APPEND, even if we have to make some system-specific call, I'd
prefer that to locking.

I did some searching for atomic append on Windows and got mixed reports
on whether their FILE_APPEND_DATA does what we want or not. Knowing
nothing about msys, I'd _assume_ that O_APPEND would get translated to
that flag, but I got lost trying to find it in
git-for-windows/msys2-runtime.

Wouldn't it be wonderful if the solution were as simple as making sure
that flag was plumbed through? I seriously doubt it will be that easy,
though. :)

-Peff
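The POSIX side of what Peff describes can be sketched as follows: open with `O_APPEND`, format the whole record into one buffer, and emit it with a single `write()`, so concurrent tracers interleave at line granularity (assuming a correctly working `O_APPEND` and no short writes). This is a simplified stand-in for git's trace code, not the real thing.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Append one fully formatted line to the trace file.  O_APPEND makes
 * the kernel position every write() at end-of-file atomically, and a
 * single write() per line keeps records from interleaving mid-line.
 * Returns 0 on success.
 */
static int trace_line(const char *path, const char *line)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	ssize_t len, ret;

	if (fd < 0)
		return -1;
	len = strlen(line);
	ret = write(fd, line, len);
	close(fd);
	return ret == len ? 0 : -1;
}

/* Read the whole file back into buf (demo helper); returns length. */
static ssize_t read_all(const char *path, char *buf, size_t max)
{
	int fd = open(path, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = read(fd, buf, max - 1);
	close(fd);
	if (n >= 0)
		buf[n] = '\0';
	return n;
}
```

Whether the msys/Windows layer translates `O_APPEND` into an equivalently atomic flag is exactly the open question in the thread.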


Re: [PATCH 02/10] string-list.h: add string_list_pop function.

2018-08-09 Thread Junio C Hamano
Martin Ågren  writes:

> On 9 August 2018 at 00:17, Stefan Beller  wrote:
>> A string list can be used as a stack, but should we? A later patch shows
>> how useful this will be.
>>
>> Signed-off-by: Stefan Beller 
>> ---
>>  string-list.c | 8 
>>  string-list.h | 6 ++
>>  2 files changed, 14 insertions(+)
>>
>> diff --git a/string-list.c b/string-list.c
>> index 9f651bb4294..ea80afc8a0c 100644
>> --- a/string-list.c
>> +++ b/string-list.c
>> @@ -80,6 +80,14 @@ void string_list_remove(struct string_list *list, const char *string,
>> }
>>  }
>>
>> +struct string_list_item *string_list_pop(struct string_list *list)
>> +{
>> +   if (list->nr == 0)
>> +   return 0;
>
> return NULL, not 0.

It is OK to return NULL, which may make the caller a bit simpler,
i.e.

while ((item = list_pop(list)))
work_on(item);

but if we consider it a good discipline for the caller to see if
there are still things on the stack before attempting to pop, then
instead of returning NULL, we can have BUG("") there, which requires
the caller to be more like this:

while (list->nr)
work_on(list_pop(list));

which is not so bad.

>> +/**
>> + * Returns the last item inserted and removes it from the list.
>> + * If the list is empty, return NULL.
>> + */
>> +struct string_list_item *string_list_pop(struct string_list *list);
>> +
>
> The memory ownership is now with the caller. That is, the caller needs
> to check/know `list->strdup_strings` and know `free_util` to be able to
> properly free all memory.

> OTOH, the pointer returned by this function is only guaranteed to be
> valid until you start inserting into the list (well, you can do one
> insertion per pop without worrying, but that's quite detailed
> implementation knowledge).

Yes, it is a more grave limitation when using string_list as a
stack.  A single realloc() and you are dead X-<.
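The hazard Junio and Martin describe can be shown with a minimal stand-in for `string_list` (this is not git's implementation, just enough structure to illustrate): `pop` returns a pointer into the backing array, so it stays valid only until a later insertion grows (reallocates) that array. The safe pattern is to copy the fields you need out of the popped item before pushing again.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A stripped-down stand-in for string_list. */
struct item { const char *string; };
struct list { struct item *items; size_t nr, alloc; };

static void list_push(struct list *l, const char *s)
{
	if (l->nr == l->alloc) {
		l->alloc = l->alloc ? 2 * l->alloc : 1;
		l->items = realloc(l->items, l->alloc * sizeof(*l->items));
	}
	l->items[l->nr++].string = s;
}

/*
 * Returns a pointer INTO l->items: valid only until the next push
 * that grows the array -- a single realloc() and it dangles.
 */
static struct item *list_pop(struct list *l)
{
	if (!l->nr)
		return NULL;
	return &l->items[--l->nr];
}

/* Safe pattern: copy what you need out of the item immediately. */
static const char *pop_string(struct list *l)
{
	struct item *it = list_pop(l);
	return it ? it->string : NULL;
}
```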



Re: [RFC/WIP PATCH 1/1] merge-recursive: make file/directory conflicts easier to resolve

2018-08-09 Thread Elijah Newren
On Thu, Aug 9, 2018 at 1:54 PM Junio C Hamano  wrote:
> Elijah Newren  writes:
>
> >> Of course, "git rm" and "git mv" must work sensibly, if we want this
> >> change to be truly helpful--but if not, they need to be fixed ;-)
> >
> > That actually brings up an interesting question.  Right now, if given
> > a path that appears in the index but at a stage greater than 0, git mv
> > will fail with "not under version control".  Obviously, that error
> > message is a lie in such a case, but what should it do?
> >
> > (Alternatively, if there is only one entry with stage greater than 0
> > and it has no other conflicts, one could envision git mv doing the
> > rename and dropping to stage 0 at the same time, but that sounds a bit
> > dangerous to me.)
>
> I do not particularly think it is "dangerous".  In fact, that sort
> of behaviour was what I had in mind when I said "work sensibly".
>
> When resolving a conflict that they added a new path at stage #3 to
> remove that path, I can say "git rm $that_path", which removes all
> stages of that path and makes the index closer to the next commit.
> Or I may decide to keep that path by "git add $that_path", which
> adds that path at stage #0.  I think "git mv $that_path $over_there"
> that drops that path at stage #3 to stage #0 of another path would
> be in line with these two.

This argument makes sense to me *IF* there's no possibility for
internal textual conflicts.  But if there are textual conflicts, I
don't see how it works.  So either I'm misunderstanding part of what
you're suggesting or you may have overlooked that case.

Let's say we did want to drop to stage #0 when the user runs git mv.
I'm assuming you agree that'd be bad to do if there were still
conflict markers left in that file (which can happen when the file of
a D/F conflict came from a renamed file that had edits on both sides
of history, for example).  That suggests we have to open and parse the
file and look for conflict markers before dropping to stage #0 and
only proceeding when none are found.  That feels a bit magical; this
auto-resolving-upon-mv seems, to me, likely to surprise someone.  In
particular, I'm imagining a scenario where someone edits some file
enough to remove conflict markers but isn't satisfied that everything
is semantically resolved yet, then runs git mv on the file, then
starts working on other files, and then tries to come back to the
original file only to discover that they can't find it in the list of
unmerged files because we marked it as resolved for them.

Am I missing something here?
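The "open and parse the file and look for conflict markers" step Elijah mentions could be as simple as the following sketch. It is illustrative only: git's real logic also has to respect configurable conflict-marker sizes, and the function name here is hypothetical.

```c
#include <assert.h>
#include <string.h>

/*
 * Return 1 if the buffer contains a line that looks like a leftover
 * merge conflict marker (standard 7-character markers only).
 */
static int has_conflict_markers(const char *buf)
{
	const char *line = buf;

	while (line) {
		if (!strncmp(line, "<<<<<<< ", 8) ||
		    !strncmp(line, "=======", 7) ||
		    !strncmp(line, ">>>>>>> ", 8))
			return 1;
		line = strchr(line, '\n');
		if (line)
			line++;	/* advance past the newline */
	}
	return 0;
}
```

Even with such a check, the scenario above stands: a file free of markers is not necessarily semantically resolved, which is why auto-dropping to stage #0 is debatable.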


Re: [PATCH 2/2] submodule: munge paths to submodule git directories

2018-08-09 Thread Jeff King
On Wed, Aug 08, 2018 at 03:33:23PM -0700, Brandon Williams wrote:

> Commit 0383bbb901 (submodule-config: verify submodule names as paths,
> 2018-04-30) introduced some checks to ensure that submodule names don't
> include directory traversal components (e.g. "../").
> 
> This addresses the vulnerability identified in 0383bbb901 but the root
> cause is that we use submodule names to construct paths to the
> submodule's git directory.  What we really should do is munge the
> submodule name before using it to construct a path.
> 
> Teach "submodule_name_to_gitdir()" to munge a submodule's name (by url
> encoding it) before using it to build a path to the submodule's gitdir.

I like this approach very much, and I think using url encoding is much
better than an opaque hash (purely because it makes debugging and
inspection saner).

Two thoughts, though:

> + modules_len = buf->len;
>   strbuf_addstr(buf, submodule_name);
> +
> + /*
> +  * If the submodule gitdir already exists using the old-fashioned
> +  * location (which uses the submodule name as-is, without munging it)
> +  * then return that.
> +  */
> + if (!access(buf->buf, F_OK))
> + return;

I think this backwards-compatibility is necessary to avoid pain. But
until it goes away, I don't think this is helping the vulnerability from
0383bbb901. Because there the issue was that the submodule name pointed
back into the working tree, so this access() would find the untrusted
working tree code and say "ah, an old-fashioned name!".

In theory a fresh clone could set a config option for "I only use
new-style modules". And there could even be a conversion program
that moves the modules as appropriate, fixes up the .git files in the
working tree, and then sets that config.

In fact, I think that config option _could_ be done by bumping
core.repositoryformatversion and then setting extensions.submodulenames
to "url" or something. Then you could never run into the confusing case
where you have a clone done by a new version of git (using new-style
names), but using an old-style version gets confused because it can't
find the module directories (instead, it would barf and say "I don't
know about that extension").

I don't know if any of that is worth it, though. We already fixed the
problem from 0383bbb901. There may be a _different_ "break out of the
modules directory" vulnerability, but since we disallow ".." it's hard
to see what it would be (the best I could come up with is maybe pointing
one module into the interior of another module, but I think you'd have
trouble overwriting anything useful).

And while an old-style version of Git being confused might be annoying,
I suspect that bumping the repository version would be even _more_
annoying, because it would hit every command, not just ones that try to
touch those submodules.

> diff --git a/t/t7400-submodule-basic.sh b/t/t7400-submodule-basic.sh
> index 2c2c97e144..963693332c 100755
> --- a/t/t7400-submodule-basic.sh
> +++ b/t/t7400-submodule-basic.sh
> @@ -933,7 +933,7 @@ test_expect_success 'recursive relative submodules stay relative' '
>   cd clone2 &&
>   git submodule update --init --recursive &&
>   echo "gitdir: ../.git/modules/sub3" >./sub3/.git_expect &&
> - echo "gitdir: ../../../.git/modules/sub3/modules/dirdir/subsub" >./sub3/dirdir/subsub/.git_expect
> + echo "gitdir: ../../../.git/modules/sub3/modules/dirdir%2fsubsub" >./sub3/dirdir/subsub/.git_expect

One interesting thing about url-encoding is that it's not one-to-one.
This case could also be %2F, which is a different file (on a
case-sensitive filesystem). I think "%20" and "+" are similarly
interchangeable.

If we were decoding the filenames, that's fine. The round-trip is
lossless.

But that's not quite how the new code behaves. We encode the input and
then check to see if it matches an encoding we previously performed. So
if our urlencode routines ever change, this will subtly break.
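Peff's point can be demonstrated concretely: "%2f" and "%2F" decode to the same byte, so *decoding* what is on disk is lossless, while *re-encoding* and comparing depends on the exact casing the encoder happened to emit. A minimal decoder sketch (assumes well-formed input and a large-enough destination; not git's code):

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Decode %XX escapes from src into dest; returns dest. */
static char *url_decode(char *dest, const char *src)
{
	char *d = dest;

	while (*src) {
		if (*src == '%' && isxdigit((unsigned char)src[1]) &&
		    isxdigit((unsigned char)src[2])) {
			unsigned int v;
			sscanf(src + 1, "%2x", &v);
			*d++ = (char)v;
			src += 3;
		} else {
			*d++ = *src++;
		}
	}
	*d = '\0';
	return dest;
}
```

Both "dirdir%2fsubsub" and "dirdir%2Fsubsub" decode to "dirdir/subsub", which is why option 3 (decoding what is actually in modules/) is robust against encoder changes in a way that encode-then-compare is not.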

I don't know how much it's worth caring about. We're not that likely to
change the routines ourself (though certainly a third-party
implementation would need to know our exact url-encoding decisions).

Some possible actions:

 0. Do nothing, and cross our fingers. ;)

 1. Don't use strbuf_addstr_urlencode(), but rather our own munging
function which we know will remain stable (or alternatively, a flag
to strbuf_addstr_urlencode to get the consistent behavior).

 2. Make sure we have tests which cover this, so at least somebody
changing the urlencode decisions will see a breakage. Your test here
covers the upper/lowercase one, but we might want one that covers
"+". (There may be more ambiguous cases, but those are the ones I
know about).

 3. Rather than check for the existence of names, decode what's actually
in the modules/ directory to create an in-memory index of names.

I hesitate to suggest that, 

Re: [PATCH 01/10] string_list: print_string_list to use trace_printf

2018-08-09 Thread Junio C Hamano
Stefan Beller  writes:

> It is a debugging aid, so it should print to the debugging channel.

Who says?  In-tree code may say so, and I do not think any in-flight
topic up to 'pu' uses this to produce non-debugging output, but I do
not think it is a healthy attitude to assume that you can take over an
existing function and change what it does unilaterally.

As there is no in-tree or in-flight user, I think it makes sense if
the proposed change were to rename it to trace_string_list().  If
there weren't any print_string_list() and we were adding a debugging
aid to use trace_printf() to dump these, that would be the name we
would use anyway.

> Signed-off-by: Stefan Beller 
> ---
>  string-list.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/string-list.c b/string-list.c
> index 771c4550980..9f651bb4294 100644
> --- a/string-list.c
> +++ b/string-list.c
> @@ -200,9 +200,9 @@ void print_string_list(const struct string_list *p, const char *text)
>  {
>   int i;
>   if ( text )
> - printf("%s\n", text);
> + trace_printf("%s\n", text);
>   for (i = 0; i < p->nr; i++)
> - printf("%s:%p\n", p->items[i].string, p->items[i].util);
> + trace_printf("%s:%p\n", p->items[i].string, p->items[i].util);
>  }
>  
>  struct string_list_item *string_list_append_nodup(struct string_list *list,


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Jeff Hostetler




On 8/9/2018 10:23 AM, Jeff King wrote:

On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:


If we have an equivalence-class hashmap and feed it inodes (or again,
some system equivalent) as the keys, we should get buckets of
collisions.


I guess one way to get "some system equivalent" that can be used as
the last resort, when there absolutely is no inum equivalent, is to
rehash the working tree file that shouldn't be there when we detect
a collision.

If we found that there is something when we tried to write out
"Foo.txt", if we open "Foo.txt" on the working tree and hash-object
it, we should find the matching blob somewhere in the index _before_
"Foo.txt".  On a case-insensitive filesystem, it may well be
"foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
only differ in case.


Clever. You might still run into false positives when there is
duplicated content in the repository (especially, say, zero-length
files).  But the fact that you only do the hashing on known duplicates
helps with that.



I worry that the false positives make this a non-starter.  I mean, if
clone creates files 'A' and 'B' (both equal) and then tries to create
'b', would the collision code report that 'b' collided with 'A' because
that was the first OID match?  Ideally with this scheme we'd have to
search the entire index prior to 'b' and then report that 'b' collided
with either 'A' or 'B'.  Neither message instills confidence.  And
there's no way to prefer answer 'B' over 'A' without using knowledge
of the FS name mangling/aliasing rules -- unless we want to just assume
ignore-case for this iteration.


One of the things I did like about the equivalence-class approach is
that it can be done in a single linear pass in the worst case. Whereas
anything that searches when we see a collision is quite likely to be
quadratic. But as I said before, it may not be worth worrying too much
about that for an error code path where we expect the number of
collisions to be small.

-Peff



Sorry to be paranoid, but I have an index with 3.5M entries, the word
"quadratic" rings all kinds of alarm bells for me.  :-)

Granted, we expect the number of collisions to be small, but searching
back for each collision over the already-populated portion of the index
could be expensive.

Jeff



Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE

2018-08-09 Thread Junio C Hamano
Junio C Hamano  writes:

> Jeff King  writes:
>
>> Are you sure that it's not well-defined? We open the path with O_APPEND,
>> which means every write() will be atomically positioned at the end of
>> file. So we would never lose or overwrite data.
>>
>> We do our own buffering in a strbuf, writing the result out in a single
>> write() call (modulo the OS returning a short write, but that should not
>> generally happen when writing short strings to a file). So we should get
>> individual trace lines as atomic units.
>>
>> The order of lines from the two processes is undefined, of course.
>
> Correct.  But I am more worried about the "mixed/overwriting"
> breakage, if there is one; it means we may need to be prepared for
> systems that lack an O_APPEND that works correctly.  I initially just
> assumed that it was what Dscho was seeing, but after re-reading his
> message, I am not sure anymore.
>
> I think the "do not trace the other side" approach you suggest for
> these tests that only care about one side is more appropriate
> solution for this particular case.  We then do not have to worry
> about overwriting or output from both sides mixed randomly.

A concluding sentence I forgot to add, after saying "this is simpler
and better to fix test breakage", was

But if we really are seeing O_APPEND breakage, a mandatory
locking mechanism like this one might be necessary to work
around it (I seriously hope we do not have to, though).

Sorry for an additional noise.


Re: [GSoC][PATCH v7 05/26] stash: convert apply to builtin

2018-08-09 Thread Junio C Hamano
Paul-Sebastian Ungureanu  writes:

>> Good to see that the right way to forward a patch from another
>> person is used, but is this a GSoC project?
>
> Yes, it is. I forgot to add the [GSoC] tag in the last series of patches.

The reason I asked was because IIRC GSoC was not supposed to be a team
effort but a "summer job" by individual students.


Re: [RFC/WIP PATCH 1/1] merge-recursive: make file/directory conflicts easier to resolve

2018-08-09 Thread Junio C Hamano
Elijah Newren  writes:

>> Of course, "git rm" and "git mv" must work sensibly, if we want this
>> change to be truly helpful--but if not, they need to be fixed ;-)
>
> That actually brings up an interesting question.  Right now, if given
> a path that appears in the index but at a stage greater than 0, git mv
> will fail with "not under version control".  Obviously, that error
> message is a lie in such a case, but what should it do?
>
> (Alternatively, if there is only one entry with stage greater than 0
> and it has no other conflicts, one could envision git mv doing the
> rename and dropping to stage 0 at the same time, but that sounds a bit
> dangerous to me.)

I do not particularly think it is "dangerous".  In fact, that sort
of behaviour was what I had in mind when I said "work sensibly".

When resolving a conflict that they added a new path at stage #3 to
remove that path, I can say "git rm $that_path", which removes all
stages of that path and makes the index closer to the next commit.
Or I may decide to keep that path by "git add $that_path", which
adds that path at stage #0.  I think "git mv $that_path $over_there"
that drops that path at stage #3 to stage #0 of another path would
be in line with these two.



Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE

2018-08-09 Thread Junio C Hamano
Jeff King  writes:

> Are you sure that it's not well-defined? We open the path with O_APPEND,
> which means every write() will be atomically positioned at the end of
> file. So we would never lose or overwrite data.
>
> We do our own buffering in a strbuf, writing the result out in a single
> write() call (modulo the OS returning a short write, but that should not
> generally happen when writing short strings to a file). So we should get
> individual trace lines as atomic units.
>
> The order of lines from the two processes is undefined, of course.

Correct.  But I am more worried about the "mixed/overwriting"
breakage, if there is one; it means we may need to be prepared for
systems that lack an O_APPEND that works correctly.  I initially just
assumed that it was what Dscho was seeing, but after re-reading his
message, I am not sure anymore.

I think the "do not trace the other side" approach you suggest for
these tests that only care about one side is more appropriate
solution for this particular case.  We then do not have to worry
about overwriting or output from both sides mixed randomly.

> diff --git a/t/t5552-skipping-fetch-negotiator.sh b/t/t5552-skipping-fetch-negotiator.sh
> index 3b60bd44e0..5ad5bece55 100755
> --- a/t/t5552-skipping-fetch-negotiator.sh
> +++ b/t/t5552-skipping-fetch-negotiator.sh
> @@ -28,6 +28,19 @@ have_not_sent () {
>   done
>  }
>  
> +# trace_fetch <client> <server> [args]
> +#
> +# Trace the packet output of fetch, but make sure we disable the variable
> +# in the child upload-pack, so we don't combine the results in the same file.
> +trace_fetch () {
> + client=$1; shift
> + server=$1; shift
> + GIT_TRACE_PACKET="$(pwd)/trace" \
> + git -C "$client" fetch \
> +   --upload-pack 'unset GIT_TRACE_PACKET; git-upload-pack' \
> +   "$server" "$@"
> +}
> +
>  test_expect_success 'commits with no parents are sent regardless of skip distance' '
>   git init server &&
>   test_commit -C server to_fetch &&
> @@ -42,7 +55,7 @@ test_expect_success 'commits with no parents are sent regardless of skip distanc
>   # "c1" has no parent, it is still sent as "have" even though it would
>   # normally be skipped.
>   test_config -C client fetch.negotiationalgorithm skipping &&
> - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" &&
> + trace_fetch client "$(pwd)/server" &&
>   have_sent c7 c5 c2 c1 &&
>   have_not_sent c6 c4 c3
>  '
> @@ -88,7 +101,7 @@ test_expect_success 'when two skips collide, favor the larger one' '
>   # the next "have" sent will be "c1" (from "c6" skip 4) and not "c4"
>   # (from "c5side" skip 1).
>   test_config -C client fetch.negotiationalgorithm skipping &&
> - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" &&
> + trace_fetch client "$(pwd)/server" &&
>   have_sent c5side c11 c9 c6 c1 &&
>   have_not_sent c10 c8 c7 c5 c4 c3 c2
>  '
> @@ -114,7 +127,7 @@ test_expect_success 'use ref advertisement to filter out commits' '
>   # not need to send any ancestors of "c3", but we still need to send "c3"
>   # itself.
>   test_config -C client fetch.negotiationalgorithm skipping &&
> - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch origin to_fetch &&
> + trace_fetch client origin to_fetch &&
>   have_sent c5 c4^ c2side &&
>   have_not_sent c4 c4^^ c4^^^
>  '
> @@ -144,7 +157,7 @@ test_expect_success 'handle clock skew' '
>   # and sent, because (due to clock skew) its only parent has already been
>   # popped off the priority queue.
>   test_config -C client fetch.negotiationalgorithm skipping &&
> - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" &&
> + trace_fetch client "$(pwd)/server" &&
>   have_sent c2 c1 old4 old2 old1 &&
>   have_not_sent old3
>  '
> @@ -176,7 +189,7 @@ test_expect_success 'do not send "have" with ancestors of commits that server AC
>   test_commit -C server commit-on-b1 &&
>  
>   test_config -C client fetch.negotiationalgorithm skipping &&
> - GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" to_fetch &&
> + trace_fetch client "$(pwd)/server" to_fetch &&
>   grep "  fetch" trace &&
>  
>   # fetch-pack sends 2 requests each containing 16 "have" lines before


Re: [GSoC][PATCH v7 05/26] stash: convert apply to builtin

2018-08-09 Thread Paul-Sebastian Ungureanu

Hello,

On 08.08.2018 23:18, Junio C Hamano wrote:

Paul-Sebastian Ungureanu  writes:


From: Joel Teichroeb 

Add a builtin helper for performing stash commands. Converting
all at once proved hard to review, so starting with just apply
lets conversion get started without the other commands being
finished.

The helper is being implemented as a drop in replacement for
stash so that when it is complete it can simply be renamed and
the shell script deleted.

Delete the contents of the apply_stash shell function and replace
it with a call to stash--helper apply until pop is also
converted.

Signed-off-by: Joel Teichroeb 
Signed-off-by: Paul-Sebastian Ungureanu 


Good to see that the right way to forward a patch from another
person is used, but is this a GSoC project?


Yes, it is. I forgot to add the [GSoC] tag in the last series of patches.


diff --git a/builtin/stash--helper.c b/builtin/stash--helper.c
new file mode 100644
index 0..ef6a9d30d
--- /dev/null
+++ b/builtin/stash--helper.c
@@ -0,0 +1,452 @@
+#include "builtin.h"
+#include "config.h"
+#include "parse-options.h"
+#include "refs.h"
+#include "lockfile.h"
+#include "cache-tree.h"
+#include "unpack-trees.h"
+#include "merge-recursive.h"
+#include "argv-array.h"
+#include "run-command.h"
+#include "dir.h"
+#include "rerere.h"


Wow, "apply" is a biggie, as you'd pretty much have to do
everything, like merging and updating the index and asking rerere to
auto-resolve.  Quite a many include files.


+static const char * const git_stash_helper_usage[] = {
+   N_("git stash--helper apply [--index] [-q|--quiet] [<stash>]"),
+   NULL
+};
+
+static const char * const git_stash_helper_apply_usage[] = {
+   N_("git stash--helper apply [--index] [-q|--quiet] [<stash>]"),
+   NULL
+};
+ ...
+static void assert_stash_like(struct stash_info *info, const char * revision)


This inherits an unfortunate name from the scripted version (it does
more than asserting), but it is OK to keep the original name,
especially in this early step in the series.

Lose the SP in "* revision"; the asterisk sticks to the variable/parameter name.


Will do so.


+{
+   if (get_oidf(&info->b_commit, "%s^1", revision) ||
+   get_oidf(&info->w_tree, "%s:", revision) ||
+   get_oidf(&info->b_tree, "%s^1:", revision) ||
+   get_oidf(&info->i_tree, "%s^2:", revision)) {
+   free_stash_info(info);
+   error(_("'%s' is not a stash-like commit"), revision);
+   exit(128);
+   }
+}



+static int reset_tree(struct object_id *i_tree, int update, int reset)
+{
...
+}


Kind-a surprising that there is no helper function to do this
already.  The implementation looked OK, though.


+static int apply_cached(struct strbuf *out)
+{
+   struct child_process cp = CHILD_PROCESS_INIT;
+
+   /*
+* Apply currently only reads either from stdin or a file, thus
+* apply_all_patches would have to be updated to optionally take a
+* buffer.
+*/
+   cp.git_cmd = 1;
+   argv_array_pushl(&cp.args, "apply", "--cached", NULL);
+   return pipe_command(&cp, out->buf, out->len, NULL, 0, NULL, 0);
+}


Applying and updating the index is more resource intensive than
spawning a process, and not having to worry about the process dying
is a benefit, so overall, making this into an internal call would be
a lot lower priority, I would guess.


Indeed. In the last but one patch [1], I tried to convert all of the 
"apply" processes. At the moment, `apply_all_patches()` cannot take a 
buffer. A solution for this was to write the buffer into a file and pass 
the name of that file to the function. Of course, this might not be a 
bright idea and for this reason I am not sure if that patch is worth it.


[1]
https://public-inbox.org/git/56500d98f9d5daaa5f21a43767885baede86e3a0.1533753605.git.ungureanupaulsebast...@gmail.com/


+static int reset_head(const char *prefix)
+{


This is resetting the index to the HEAD, right?  reset_head sounds
as if it takes a commit-ish and moves HEAD there.



Yes, it resets the index to the HEAD.

Best,
Paul


Re: What's the use case for committing both the freshly created file and it's exclusion in .gitignore?

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 01:04:43PM +0200, Bartosz Konikiewicz wrote:

> Hi there!
> 
> I hope that the subject of my message (i.e. the question) is
> exhaustive enough, so I'll just stick to reproducing my issue.
> 
> Steps to reproduce:
> 
> 1. Create a new file.
> 2. Stage the file.
> 3. Add the file to .gitignore.
> 4. Stage the .gitignore.
> 5. Commit changes.
> 
> I imagined that the file would now be removed from the stage (because
> it's ignored now and not yet committed) but it isn't. Where would this
> behavior be desirable? I know that a 'git add' command can be
> invoked with an '-f' flag, which would yield the same result, but I
> can't come up with an explanation of where it can be applied.

As far as I know, that is not an intentionally supported workflow. It is
merely the result that .gitignore is only considered when adding new
files to the index, not when committing nor when updating the entry for
an existing file.

If you are asking as a more general case: why do we not complain about
.gitignore for files the index already knows about, then I think that is
useful. It lets you override the .gitignore _once_ when adding the file
initially, and then you don't have to deal with it again (and keep in
mind that the pattern excluding it may be broad, like "*.o", or even
just "*", so simply deleting it from the .gitignore is not an option).

You could probably accomplish this these days by using a negative
pattern in your .gitignore file. But I think the behavior in question
may predate negative patterns (but I didn't dig). It's also a bit
simpler to use in practice, IMHO.

-Peff


Re: [PATCH] Documentation/diff-options: explain different diff algorithms

2018-08-09 Thread Stefan Beller
On Mon, Aug 6, 2018 at 4:18 PM Jonathan Nieder  wrote:

> > +DIFF ALGORITHMS
> > +---
>
> Please add some introductory words about what the headings refer to.

ok.

>
> > +the shortest output.
>
> Trivia: the `minimal` variant of Myers doesn't guarantee shortest
> output, either: what it minimizes is the number of lines marked as
> added or removed.  If you want to minimize context lines too, then
> that would be a new variant. ;-)

... and take line length into account. ;-)

It minimizes the edit distance in terms of lines, i.e. in a context-less diff
we get the lowest number of lines possible.

> > +This algorithm finds the longest common substring and recursively
> > +diffs the content before and after the longest common substring.
>
> optional: may be worth a short aside in the text about the distinction
> between LCS and LCS. ;-)
>
> It would be especially useful here, since the alphabet used in these
> strings is *lines* instead of characters, so the first-time reader
> could probably use some help in building their intuition.

That makes sense.

>
> > +This is often the fastest, but in corner cases (when there are
> > +many common substrings of the same length) it produces bad
>
> Can you clarify what "bad" means?  E.g. would "unexpected", or "poorly
> aligned", match what you mean?

I'll just go with unexpected.

> > +results as seen in:
> > +
> > + seq 1 100 >one
> > + echo 99 > two
> > + seq 1 2 98 >>two
> > + git diff --no-index --histogram one two
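The quoted example can be run directly; a runnable variant (assuming a stock `git`; exact hunk shapes vary slightly across versions) that contrasts the default Myers output with histogram:

```shell
# Build the two files from the commit-message example and diff them
# with both algorithms; histogram picks a poor alignment here.
dir=$(mktemp -d) && cd "$dir"
seq 1 100 >one
echo 99 >two
seq 1 2 98 >>two
git diff --no-index --stat one two || true            # default (Myers)
git diff --no-index --histogram --stat one two || true
```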


Re: [PATCH 1/1] verify-tag/verify-commit should exit unsuccessfully when signature is not trusted

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 11:40:27AM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > I guess leaving it serves as a sort of cross-check if gpg would return a
> > zero exit code but indicate in the status result that the signature was
> > not good. Sort of a belt-and-suspenders, I guess (which might not be
> > that implausible if we think about somebody wrapping gpg with a sloppy
> > bit of shell code that loses the exit code -- it's their fault, but it
> > might be nice for us to err on the conservative side).
> 
> OK, this time a real log message.
> 
> -- >8 --
> Subject: [PATCH] gpg-interface: propagate exit status from gpg back to the 
> callers
> [...]

Thanks, the explanation and the patch look good to me.

I'm on the fence over whether a follow-up patch to take away the "U" is
worth it. In practice the code should never trigger either way, since
gpg would already have exited non-zero in such a case.

-Peff


Re: [PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 10:35:24AM -0700, Johannes Schindelin via GitGitGadget 
wrote:

> The culprit is that two processes try simultaneously to write to the same
> file specified via GIT_TRACE_PACKET, and it is not well defined how that
> should work, even on Linux.

Are you sure that it's not well-defined? We open the path with O_APPEND,
which means every write() will be atomically positioned at the end of
file. So we would never lose or overwrite data.

We do our own buffering in a strbuf, writing the result out in a single
write() call (modulo the OS returning a short write, but that should not
generally happen when writing short strings to a file). So we should get
individual trace lines as atomic units.

The order of lines from the two processes is undefined, of course.
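The O_APPEND behavior is easy to observe from the shell, since `>>` opens with O_APPEND and a short `printf` is a single write() (a generic sketch, not git-specific):

```shell
# Two batches of background writers appending to one file; with
# O_APPEND each write() is positioned at EOF, so no line is lost
# or torn -- only the interleaving order is undefined.
f=$(mktemp)
for i in $(seq 50); do printf 'one\n' >>"$f" & done
for i in $(seq 50); do printf 'two\n' >>"$f" & done
wait
wc -l <"$f"                       # 100 lines
! grep -qvE '^(one|two)$' "$f"    # no torn or merged lines
```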

> This patch series addresses that by locking the trace fd. I chose to use 
> flock() instead of fcntl() because the Win32 API LockFileEx()
> [https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-lockfileex]
>  
> (which does exactly what I want in this context) has much more similar
> semantics to the former than the latter.

I must admit I'm not excited to see us adding extra locking and
complexity when it is not necessarily needed. Is this a feature we
actually care about delivering for normal users of GIT_TRACE, or do we
just care about solving the test flakiness here?

Because we should be able to do the latter with the patch below.

I also wonder if we should be clearing GIT_TRACE as part of clearing
repo-env. It would do what we want automatically in this case, though
there are definitely cases where I prefer the current behavior (because
I really do want to see the upload-pack side of it). Of course I could
always use "--upload-pack 'GIT_TRACE_PACKET=... upload-pack'", so this
is really just a default.

diff --git a/t/t5552-skipping-fetch-negotiator.sh 
b/t/t5552-skipping-fetch-negotiator.sh
index 3b60bd44e0..5ad5bece55 100755
--- a/t/t5552-skipping-fetch-negotiator.sh
+++ b/t/t5552-skipping-fetch-negotiator.sh
@@ -28,6 +28,19 @@ have_not_sent () {
done
 }
 
+# trace_fetch   [args]
+#
+# Trace the packet output of fetch, but make sure we disable the variable
+# in the child upload-pack, so we don't combine the results in the same file.
+trace_fetch () {
+   client=$1; shift
+   server=$1; shift
+   GIT_TRACE_PACKET="$(pwd)/trace" \
+   git -C "$client" fetch \
+ --upload-pack 'unset GIT_TRACE_PACKET; git-upload-pack' \
+ "$server" "$@"
+}
+
 test_expect_success 'commits with no parents are sent regardless of skip 
distance' '
git init server &&
test_commit -C server to_fetch &&
@@ -42,7 +55,7 @@ test_expect_success 'commits with no parents are sent 
regardless of skip distanc
# "c1" has no parent, it is still sent as "have" even though it would
# normally be skipped.
test_config -C client fetch.negotiationalgorithm skipping &&
-   GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" &&
+   trace_fetch client "$(pwd)/server" &&
have_sent c7 c5 c2 c1 &&
have_not_sent c6 c4 c3
 '
@@ -88,7 +101,7 @@ test_expect_success 'when two skips collide, favor the 
larger one' '
# the next "have" sent will be "c1" (from "c6" skip 4) and not "c4"
# (from "c5side" skip 1).
test_config -C client fetch.negotiationalgorithm skipping &&
-   GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" &&
+   trace_fetch client "$(pwd)/server" &&
have_sent c5side c11 c9 c6 c1 &&
have_not_sent c10 c8 c7 c5 c4 c3 c2
 '
@@ -114,7 +127,7 @@ test_expect_success 'use ref advertisement to filter out 
commits' '
# not need to send any ancestors of "c3", but we still need to send "c3"
# itself.
test_config -C client fetch.negotiationalgorithm skipping &&
-   GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch origin to_fetch &&
+   trace_fetch client origin to_fetch &&
have_sent c5 c4^ c2side &&
have_not_sent c4 c4^^ c4^^^
 '
@@ -144,7 +157,7 @@ test_expect_success 'handle clock skew' '
# and sent, because (due to clock skew) its only parent has already been
# popped off the priority queue.
test_config -C client fetch.negotiationalgorithm skipping &&
-   GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" &&
+   trace_fetch client "$(pwd)/server" &&
have_sent c2 c1 old4 old2 old1 &&
have_not_sent old3
 '
@@ -176,7 +189,7 @@ test_expect_success 'do not send "have" with ancestors of 
commits that server AC
test_commit -C server commit-on-b1 &&
 
test_config -C client fetch.negotiationalgorithm skipping &&
-   GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "$(pwd)/server" 
to_fetch &&
+   trace_fetch client "$(pwd)/server" to_fetch &&
grep "  fetch" trace &&
 
# fetch-pack sends 2 requests each containing 16 "have" 

Re: [PATCH v3 0/8] Add delta islands support

2018-08-09 Thread Christian Couder
On Thu, Aug 9, 2018 at 5:55 PM, Christian Couder
 wrote:

> The following changes have been made since the previous iteration:
>
> * suggested by Duy: move the code computing the write order for a
>   layer to a new separate function called compute_layer_order() in
>   builtin/pack-objects.c in new patch 3/8
>
>   I think that indeed this makes the following patch (4/8) shorter and
>   easier to understand as well as the overall result nicer.
>
> * suggested by Duy and Peff: rework the way the 'tree_depth' field is
>   moved from 'struct object_entry' to 'struct packing_data' in
>   pack-object.h in patch 7/8
>
> * suggested by Duy and Peff: move field 'layer' from
>   'struct object_entry' to 'struct packing_data' in pack-object.h in
>   new patch 8/8

I forgot to tell about the following change that is also part of the v3:

* suggested by Junio and Peff: increase the 'regmatch_t matches[]'
array from 8 to 16, advertise 14 group captures max, and add a warning
if the final element gets filled in in delta-islands.c in patch 2/8


Re: [RFC/WIP PATCH 1/1] merge-recursive: make file/directory conflicts easier to resolve

2018-08-09 Thread Elijah Newren
On Thu, Aug 9, 2018 at 10:36 AM Junio C Hamano  wrote:
> Elijah Newren  writes:
>
> > File/directory conflicts are somewhat difficult to resolve; by way of
> > contrast, renames are much easier to handle.  For file/directory
> > conflicts, we already have to record the file under a different name in
> > the working copy due to the directory being in the way; if we also record
> > the file under a different name in the index then it simplifies matters
> > for the user, and ensures that 'git reset --hard' and 'git merge --abort'
> > will clean up files created by the merge operation.
>
> Yeah, and then our file "path" renamed to "path~2" to make room for
> directory "path" they introduced, can be relocated to its final
> place in the merge resolution, e.g. "git mv path~2 path/ours" or
> "git mv path path.theirs && git mv path~2 path".

Yes.  :-)

> Of course, "git rm" and "git mv" must work sensibly, if we want this
> change to be truly helpful--but if not, they need to be fixed ;-)

That actually brings up an interesting question.  Right now, if given
a path that appears in the index but at a stage greater than 0, git mv
will fail with "not under version control".  Obviously, that error
message is a lie in such a case, but what should it do?  Print a more
accurate error message ("Refusing to rename file until you remove
conflicts and git add it"?)  Or, what I'd prefer, rename the entry (or
entries) and keep the higher stage number(s) -- is there an unseen
danger with doing this?

(Alternatively, if there is only one entry with stage greater than 0
and it has no other conflicts, one could envision git mv doing the
rename and dropping to stage 0 at the same time, but that sounds a bit
dangerous to me.)


Re: [PATCH] Documentation/diff-options: explain different diff algorithms

2018-08-09 Thread Stefan Beller
On Tue, Aug 7, 2018 at 8:56 AM Junio C Hamano  wrote:
>
> Jonathan Nieder  writes:
>
> > Both don't seem quite right, since they have an extra space before the
> > period.  The git-diff(1) seems especially not quite right --- does it
> > intend to say something like "See the DIFF ALGORITHMS section for more
> > discussion"?
>
> Good suggestion and doing so would nicely allow side-stepping the
> space before the dot issue, i.e. "See the DIFF ALGO section (in
> git-diff(1)) for more discussion".   I like it.

Me, too.

> >> +`Histogram`
> >> +
> >> +This algorithm finds the longest common substring and recursively
> >> +diffs the content before and after the longest common substring.
>
> As a first-time reader, I felt that it is a bit uneven that there is
> no attribution for only this item, unlike the description for Myers
> and Patience.

Yeah. That is because I spent too much time thinking how this algo is
flawed (in its design as the comments in the initial commit in JGit
indicate that design error when reading closely) when I was trying to
understand it deeply.

Maybe I should emphasize the trade off for performance gain
more as Shawn really cared about performance. After all it offers
the best performance in *many* common cases, but has a
"long tail" of really unfortunate results, too.


Re: [PATCH 03/10] sha1-array: provide oid_array_remove_if

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 10:25:52AM -0700, Stefan Beller wrote:

> On Thu, Aug 9, 2018 at 12:39 AM Martin Ågren  wrote:
> >
> > On 9 August 2018 at 00:17, Stefan Beller  wrote:
> > > +int oid_array_remove_if(struct oid_array *array,
> > > +   for_each_oid_fn fn,
> > > +   void *data)
> > > +{
> > > +   int i, j;
> > > +   char *to_remove = xcalloc(array->nr, sizeof(char));
> >
> > Do you really need this scratch space?
> 
> I don't think so, when we reorder the items while iterating over them.

Even with keeping the order this can be done in a single linear pass.
See filter_string_list() for an example.

The one twist here is that you cannot:

  oidcpy(&array->oid[i], &array->oid[j]);

when i==j, because of memcpy restrictions. With the current
implementation it would suffice to use struct assignment (and really,
every oidcpy() could just be struct assignment these days). But you
could also just do:

  if (i != j)
    oidcpy(&array->oid[i], &array->oid[j]);

> > I can't entirely follow this index-fiddling, but then I haven't had my
> > morning coffee yet, so please forgive me if this is nonsense. Would it
> > suffice to let i point out where to place items (starting at the first
> > item not to keep) and j where to take them from (i.e., the items to
> > keep, after the initial run)?
> 
> I thought this is what happens, just after the actual loop of calls.

I think the point is that we can just maintain those meanings during the
single walk through the array. The result is simpler to read and more
efficient.

-Peff


Re: [PATCH 1/4] Introduce a function to lock/unlock file descriptors when appending

2018-08-09 Thread Junio C Hamano
"Johannes Schindelin via GitGitGadget" 
writes:

> From: Johannes Schindelin 
>
> This function will be used to make write accesses in trace_write() a bit
> safer.
> ...
> To set a precedent for a better approach, let's introduce a proper
> abstraction: a function that says in its name precisely what Git
> wants it to do (as opposed to *how* it does it on Linux):
> lock_or_unlock_fd_for_appending().
>
> The next commit will provide a Windows-specific implementation of this
> function/functionality.
>
> Signed-off-by: Johannes Schindelin 
>
> squash! Introduce a function to lock/unlock file descriptors when appending

If we can keep the custom, narrow and easy-to-port API (like this
patch introduces) focused and resist feature creep over time, then
it would be worth spending effort to come up with such a custom
helper that is easy to port.  So I agree with the approach in
general but I tend to think "set a precedent for a better approach"
is a way-too-early and wishful verdict.  We do not know if we can
really keep that custom API easy-to-port-and-maintain yet.

In short, even though I agree with the approach, most of the
verbiage above is unnecessary and mere distraction.

> ---
>  git-compat-util.h |  2 ++
>  wrapper.c | 14 ++
>  2 files changed, 16 insertions(+)
>
> diff --git a/git-compat-util.h b/git-compat-util.h
> index 9a64998b2..13b83bade 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -1202,6 +1202,8 @@ struct tm *git_gmtime_r(const time_t *, struct tm *);
>  #define getc_unlocked(fh) getc(fh)
>  #endif
>  
> +extern int lock_or_unlock_fd_for_appending(int fd, int lock_it);
> +
>  /*
>   * Our code often opens a path to an optional file, to work on its
>   * contents when we can successfully open it.  We can ignore a failure
> diff --git a/wrapper.c b/wrapper.c
> index e4fa9d84c..6c2116272 100644
> --- a/wrapper.c
> +++ b/wrapper.c
> @@ -690,3 +690,17 @@ int xgethostname(char *buf, size_t len)
>   buf[len - 1] = 0;
>   return ret;
>  }
> +
> +#ifndef GIT_WINDOWS_NATIVE
> +int lock_or_unlock_fd_for_appending(int fd, int lock_it)
> +{
> + struct flock flock;
> +
> + flock.l_type = lock_it ? F_WRLCK : F_UNLCK;
> + flock.l_whence = SEEK_SET;
> + flock.l_start = 0;
> + flock.l_len = 0xffffffff; /* arbitrary number of bytes */

If this can be an arbitrary range, do we need to cover this many (or
only this few, depending on your point of view) bytes?

Would it be sufficient to cover just the first one byte instead?  Or
perhaps give l_len==0 to cover all no matter how large the file
grows, which sounds like a better range specification.

> + return fcntl(fd, F_SETLKW, &flock);
> +}
> +#endif


Re: [RFC/WIP PATCH 0/3] Modifications to handling of non-textual file merge conflicts

2018-08-09 Thread Elijah Newren
On Thu, Aug 9, 2018 at 10:52 AM Junio C Hamano  wrote:
> Elijah Newren  writes:
>
> > 1) Representative example: A modify/delete conflict; the path in question
> > in the working tree would have conflict information at the top of the file
> > followed by the normal file contents; thus it could be of the form:
> >
> >  HEAD
> > Conflict hint: This block of text was not part of the original
> > branch; it serves instead to hint about non-textual conflicts:
> >   MODIFY/DELETE: path foo modified in HEAD and deleted in BRANCH
> > 
> > Conflict hint: This block of text was not part of the original
> > branch; it serves instead to hint about non-textual conflicts:
> >   MODIFY/DELETE: path foo modified in HEAD and deleted in BRANCH
> >  BRANCH
> > Lorem ipsum dolor sit amet, consectetuer sadipscing elitr,
> > sed diam nonumy eirmod tempor invidunt ut labore et dolore
> > magna aliquyam erat, sed diam voluptua. At vero eos et
> > accusam et justo duo dolores et ea rebum. Stet clita kasd
> > gubergren, no sea takimata sanctus est Lorem ipsum dolor
> > sit amet.
> >
> > Alternative ideas for handling the explanatory text here are welcome.
>
> In a modify/delete conflict, we currently do not leave any in-file
> clue, so smudging the modified side like this might be a change that
helps those who "grep -e '<<<<<<<'" to find the set of paths that
> need to be examined.

Yes, that's one of the things I'm hoping for with this change.

>  I personally do not feel it would be all that
> useful, as "ls-files -u" is how I'd learn about these paths.

"ls-files -u" is really nice; I love it and try to show it to others,
but keep getting surprised by how many people are surprised to learn
of its existence.

Further, "ls-files -u" may be insufficient even for those who know
about it, for two reasons (and admittedly, I care a lot more about the
second than the first):

* There are more conflict types than the number of permutations of
stages.  For example, how do you know if a file was from a
rename/delete or a modify/delete conflict?  And if from a
rename/delete, how do you determine the original filename?  The index
doesn't store this information.  Granted those two conflict types are
at least similar, but other stage combinations might be more
confusing.  For example, if all three stages are present, is the
conflict in question an edit/edit, a rename/add/delete, or a D/F
conflict where the file came from a rename?

* Future feature: Thomas Rast' proposed remerge-diff[1] (which I want
to resurrect and remove the edge/corner cases from).  This worked by
creating what I call an auto-merged tree, where you just accept all
conflicts and commit.  Then you diff the merge commit to what the
auto-merge tree was to see how the user edited conflicts to create the
merge commit.  Problem is, for the auto-merge tree we won't have an
index anymore so how do we represent conflicts that aren't edit/edit
of normal text files?  My proposal here is an answer for that; I'm
open to others, but it's the best I came up with.

[1] https://public-inbox.org/git/cover.1409860234.git...@thomasrast.ch/

(I'll also note here that others have requested the ability to create
an as-merged-as-possibly tree including conflicts for other purposes;
see "API suggestion" of
https://public-inbox.org/git/cafacib-2fxivxtvwcbvafy3-br7y70hmihfzot0vfk6ju0d...@mail.gmail.com/
and my response noting the various difficulties with non-textual
conflicts.  So it may not be just the remerge-diff ability here.)

> What I would want to see when faced to a modify/delete conflict is
> how the modification side changed the contents, as the change, or
> its moral equivalent, would need to be ported to other locations in
> the context of the deleting side.  But I am not sure if it makes
> sense to attempt to somehow include "diff HEAD...MERGE_HEAD" (or the
> other way around) in the file to show the contents change on the
> modification side.

That's a good point; I'm not sure how to include that either.
Something to think about.

> > 2) Representative example: A binary edit/edit conflict.  In this case,
> > it would be inappropriate to put the conflict markers inside the
> > binary file.  Instead, we create another file (e.g. path~CONFLICTS)
> > and put conflict markers in it:
> >
> >  HEAD
> > Conflict hint: This block of text was not part of the original
> > branch; it serves instead to hint about non-textual conflicts:
> >   BINARY conflict: path foo modified in both branches
> > 
> > Conflict hint: This block of text was not part of the original
> > branch; it serves instead to hint about non-textual conflicts:
> >   BINARY conflict: path foo modified in both branches
> >  BRANCH
> >
> > This file would also be added to the index at stage 1 (so that 'git merge
> > --abort' would clean this file out instead of leaving it around untracked,

Re: [PATCH 1/1] verify-tag/verify-commit should exit unsuccessfully when signature is not trusted

2018-08-09 Thread Junio C Hamano
Jeff King  writes:

> I guess leaving it serves as a sort of cross-check if gpg would return a
> zero exit code but indicate in the status result that the signature was
> not good. Sort of a belt-and-suspenders, I guess (which might not be
> that implausible if we think about somebody wrapping gpg with a sloppy
> bit of shell code that loses the exit code -- it's their fault, but it
> might be nice for us to err on the conservative side).

OK, this time a real log message.

-- >8 --
Subject: [PATCH] gpg-interface: propagate exit status from gpg back to the 
callers

When gpg-interface API unified support for signature verification
codepaths for signed tags and signed commits in mid 2015 at around
v2.6.0-rc0~114, we accidentally loosened the GPG signature
verification.

Before that change, signed commits were verified by looking for
"G"ood signature from GPG, while ignoring the exit status of "gpg
--verify" process, while signed tags were verified by simply passing
the exit status of "gpg --verify" through.  The unified code we
currently have ignores the exit status of "gpg --verify" and returns
successful verification when the signature matches an unexpired key
regardless of the trust placed on the key (i.e. in addition to "G"ood
ones, we accept "U"ntrusted ones).

Make these commands signal failure with their exit status when
underlying "gpg --verify" (or the custom command specified by
"gpg.program" configuration variable) does so.  This essentially
changes their behaviour in a backward incompatible way to reject
signatures that have been made with untrusted keys even if they
correctly verify, as that is how "gpg --verify" behaves.

Note that the code still overrides a zero exit status obtained from
"gpg" (or gpg.program) if the output does not say the signature is
good or computes correctly but made with untrusted keys, to catch
a poorly written wrapper around "gpg" the user may give us.

We could exclude "U"ntrusted support from this fallback code, but
that would be making two backward incompatible changes in a single
commit, so let's avoid that for now.  A follow-up change could do so
if desired.

Helped-by: Vojtech Myslivec 
Helped-by: brian m. carlson 
Helped-by: Jeff King 
Signed-off-by: Junio C Hamano 
---
 gpg-interface.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gpg-interface.c b/gpg-interface.c
index 0647bd6348..b5e64c55e2 100644
--- a/gpg-interface.c
+++ b/gpg-interface.c
@@ -81,12 +81,13 @@ int check_signature(const char *payload, size_t plen, const 
char *signature,
	sigc->gpg_output = strbuf_detach(&gpg_output, NULL);
	sigc->gpg_status = strbuf_detach(&gpg_status, NULL);
parse_gpg_output(sigc);
+   status |= sigc->result != 'G' && sigc->result != 'U';
 
  out:
	strbuf_release(&gpg_status);
	strbuf_release(&gpg_output);
 
-   return sigc->result != 'G' && sigc->result != 'U';
+   return !!status;
 }
 
 void print_signature_buffer(const struct signature_check *sigc, unsigned flags)
-- 
2.18.0-547-g1d89318c48
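The "sloppy wrapper" failure mode mentioned above is easy to demonstrate: any gpg.program wrapper that ends in a pipeline reports the last command's status, not gpg's. A sketch (the wrapper is hypothetical, not a real tool; `false` stands in for a failing gpg):

```shell
# Hypothetical gpg.program wrapper: logging via a pipeline makes the
# wrapper's exit status that of "cat", not of the (failing) command.
wrapper=$(mktemp)
cat >"$wrapper" <<'EOF'
#!/bin/sh
false "$@" | cat    # stand-in for: gpg "$@" 2>&1 | tee -a gpg.log
EOF
chmod +x "$wrapper"
"$wrapper"
echo "wrapper exit: $?"    # prints "wrapper exit: 0" despite the failure
```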



Re: [PATCH v2 2/2] repack: repack promisor objects if -a or -A is set

2018-08-09 Thread Jonathan Tan
On Thu, Aug 9, 2018 at 10:05 AM, Junio C Hamano  wrote:
> Hmm, it is clever to auto-start the pack-objects (and notice there
> wasn't anything needing to pack).  Two things that are worth noting
> are:
>
>  - The code takes advantage of the fact that cmd.in being left as -1
>is a sign that start_command() was never called (well, it could
>be that it was called but failed to open pipe---but then we would
>have died inside write_oid(), so we can ignore that case).  This
>is not relying on its implementation detail---cmd.in would be
>replaced by a real fd to the pipe if start_command() was called.
>
>  - We know that pack-objects reads all its standard input through to
>the EOF before emitting a single byte to its standard output, and
>that is why we can use bidi pipe and not worry about deadlocking
>caused by not reading from cmd.out at all (before closing cmd.in,
>that is).
>
> So I kind of like the cleverness of the design of this code.

Thanks for checking this.

> For now, as we do not know what is the appropriate thing to leave in
> the file, empty is fine, but perhaps in the future we may want to
> carry forward the info from the existing .promisor file?  What does
> it initially contain?  Transport invoked with from_promisor bit
> seems to call index-pack with "--promisor" option with no argument,
> so this is probably in line with that.  Do we in the future need to
> teach "--promisor" option to pack-objects we invoke here, which will
> be passed to the index-pack it internally performs, so that we do
> not have to do this by hand here?

There might be more than one .promisor file that we are repacking, so
we would also have to deal with the order. It might be better to
document that the contents of .promisor files are lost upon repack.

> In any case, don't we need to update the doc for the "--promisor"
> option of "index-pack"?  The same comment for the new options added
> in the same topic, e.g. "--from-promisor" and "--no-dependents"
> options added to "fetch-pack".  It seems that the topic that ended
> at 0c16cd499d was done rather hastily and under-documented?

Yes, you're right - I'll make a patch documenting these.

>> @@ -440,6 +513,8 @@ int cmd_repack(int argc, const char **argv, const char 
>> *prefix)
>>
>>   /* End of pack replacement. */
>>
>> + reprepare_packed_git(the_repository);
>> +
>
> I do not think this call hurts, but why does this become suddenly
> necessary with _this_ patch?  Is it an unrelated bugfix?

Hmm...I thought I mentioned somewhere that we need
reprepare_packed_git now because for_each_packed_object calls
prepare_packed_git, but apparently I didn't. I would add the following
to the commit message (if you'd rather not do it yourself, let me know
and I'll send a v3 with the new commit message):

Because for_each_packed_object() calls prepare_packed_git(), a call to
reprepare_packed_git() now has to be inserted after all the packfile
manipulation - if not, old information about packed objects would
remain in the_repository->objects->packed_git.


Re: [RFC/WIP PATCH 0/3] Modifications to handling of non-textual file merge conflicts

2018-08-09 Thread Junio C Hamano
Elijah Newren  writes:

> 1) Representative example: A modify/delete conflict; the path in question
> in the working tree would have conflict information at the top of the file
> followed by the normal file contents; thus it could be of the form:
>
>  HEAD
> Conflict hint: This block of text was not part of the original
> branch; it serves instead to hint about non-textual conflicts:
>   MODIFY/DELETE: path foo modified in HEAD and deleted in BRANCH
> 
> Conflict hint: This block of text was not part of the original
> branch; it serves instead to hint about non-textual conflicts:
>   MODIFY/DELETE: path foo modified in HEAD and deleted in BRANCH
>  BRANCH
> Lorem ipsum dolor sit amet, consectetuer sadipscing elitr,
> sed diam nonumy eirmod tempor invidunt ut labore et dolore
> magna aliquyam erat, sed diam voluptua. At vero eos et
> accusam et justo duo dolores et ea rebum. Stet clita kasd
> gubergren, no sea takimata sanctus est Lorem ipsum dolor
> sit amet.
>
> Alternative ideas for handling the explanatory text here are welcome.

In a modify/delete conflict, we currently do not leave any in-file
clue, so smudging the modified side like this might be a change that
helps those who "grep -e '<<<<<<<'" to find the set of paths that
need to be examined.  I personally do not feel it would be all that
useful, as "ls-files -u" is how I'd learn about these paths.

What I would want to see when faced to a modify/delete conflict is
how the modification side changed the contents, as the change, or
its moral equivalent, would need to be ported to other locations in
the context of the deleting side.  But I am not sure if it makes
sense to attempt to somehow include "diff HEAD...MERGE_HEAD" (or the
other way around) in the file to show the contents change on the
modification side.

> 2) Representative example: A binary edit/edit conflict.  In this case,
> it would be inappropriate to put the conflict markers inside the
> binary file.  Instead, we create another file (e.g. path~CONFLICTS)
> and put conflict markers in it:
>
>  HEAD
> Conflict hint: This block of text was not part of the original
> branch; it serves instead to hint about non-textual conflicts:
>   BINARY conflict: path foo modified in both branches
> 
> Conflict hint: This block of text was not part of the original
> branch; it serves instead to hint about non-textual conflicts:
>   BINARY conflict: path foo modified in both branches
>  BRANCH
>
> This file would also be added to the index at stage 1 (so that 'git merge
> --abort' would clean this file out instead of leaving it around untracked,
> and also because 'git status' would report "deleted in both" which seems
> reasonable).
>
> This type of example could apply for each of the following types of
> conflicts:
>   * binary edit/edit
>   * any of the conflicts from type 1 when binary files are involved
>   * symlink edit/edit (or add/add)
>   * symlink/submodule
>   * symlink/directory
>   * directory/submodule
>   * submodule/submodule
>
> It could also apply to the following new corner case conflict types from
> directory rename detection:
>   * N-way colliding paths (N>=2) due to directory renames
>   * directory rename split; half renamed to one directory and half to another

Hmph, I am starting to wonder if it may be easier to access if
instead you did not touch any working tree file to do any of the
above, and instead write a single file in $GIT_DIR/ to explain what
kind of conflicts these paths are involved in.  That would probably
give a better and easier-to-read summary than "ls-files -u" output.

Or do we have _enough_ information in the "ls-files -u" already to
infer "Ah, we are in symlink edit/edit conflict.", etc.?  If so,
perhaps "git status" can be extended to show what kind of conflict
these paths are in by reading the higher-stage index entries (and
lack of stages thereof, when dealing with a conflict with deletion
involved)?
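What `git ls-files -u` actually records can be checked with a plain edit/edit conflict; a sketch (assuming a stock `git`) showing that the output gives mode/oid/stage per entry, but nothing that names the conflict *type*:

```shell
# Build an edit/edit conflict and inspect the stage 1/2/3 entries.
dir=$(mktemp -d) && cd "$dir"
git init -q .
git config user.name t && git config user.email t@example.com
echo base >f && git add f && git commit -qm base
main=$(git symbolic-ref --short HEAD)
git checkout -qb side
echo side >f && git commit -aqm side
git checkout -q "$main"
echo ours >f && git commit -aqm ours
git merge side || true           # conflict leaves f unmerged
git ls-files -u                  # one line per stage: base, ours, theirs
```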


Re: [PATCH 10/10] fetch: retry fetching submodules if sha1 were not fetched

2018-08-09 Thread Stefan Beller
On Thu, Aug 9, 2018 at 12:50 AM Martin Ågren  wrote:
>
> On 9 August 2018 at 00:17, Stefan Beller  wrote:
> > Currently when git-fetch is asked to recurse into submodules, it dispatches
> > a plain "git-fetch -C " (and some submodule related options
> > such as prefix and recusing strategy, but) without any information of the
> > remote or the tip that should be fetched.
> >
> > This works surprisingly well in some workflows (such as using submodules
> > as a third party library), while not so well in other scenarios, such
> > as in a Gerrit topic-based workflow, that can tie together changes
> > (potentially across repositories) on the server side. One of the parts
> > of such a Gerrit workflow is to download a change when wanting to examine
> > it, and you'd want to have its submodule changes that are in the same
> > topic downloaded as well. However these submodule changes reside in their
> > own repository in their on ref (refs/changes/).
>
> s/on/own/
>
> > Retry fetching a submodule if the object id that the superproject points
> > to, cannot be found.
> >
> > Note: This is an RFC and doesn't support fetching to FETCH_HEAD yet, but
> > only into a local branch. To make fetching into FETCH_HEAD work, we need
> > some refactoring in builtin/fetch.c to adjust the calls to
> > 'check_for_new_submodule_commits'.
> >
> > Signed-off-by: Stefan Beller 
> > ---
>
> > diff --git a/submodule.c b/submodule.c
> > index ec7ea6f8c2d..6cbd0b1a470 100644
> > --- a/submodule.c
> > +++ b/submodule.c
> > @@ -1127,6 +1127,7 @@ struct submodule_parallel_fetch {
> > int result;
> >
> > struct string_list changed_submodule_names;
> > +   struct string_list retry;
> >  };
> >  #define SPF_INIT {0, ARGV_ARRAY_INIT, NULL, NULL, 0, 0, 0, 0, 
> > STRING_LIST_INIT_DUP }
>
> `retry` will effectively be `STRING_LIST_INIT_NODUP`, but making that
> explicit would be better and the next addition to the struct would be
> easier to get right.
>
> > +retry_next:
> > +   retry_it = string_list_pop(&spf->retry);
> > +   if (retry_it) {
> > +   struct strbuf submodule_prefix = STRBUF_INIT;
> > +   const struct submodule *sub =
> > +   submodule_from_name(spf->r,
> > +   &null_oid,
> > +   retry_it->string);
> > +
> > +   child_process_init(cp);
> > +   cp->dir = get_submodule_git_dir(spf->r, sub->path);
> > +   if (!cp->dir)
> > +   goto retry_next;
>
> So here you just drop the string list item. Since it's NODUP, and since
> the `util` pointers are owned elsewhere(?), this seems fine. Other uses
> of `string_list_pop()` might not be so straightforward.
>
> Just a thought, but rather than pop+if+goto, maybe
>
> while ((retry_it = string_list_pop(&spf->retry))) {
> ...
> if (!cp->dir) continue;
> ...
> return 1;
> }

I really want to keep the retry list short and pruned, as this
function is called O(n) times with n the number of submodules
and the retry list will also be up to n.
And with that we'd run O(n^2) times into "if (!..) continue;".

When we use the 'pop-no-work items' logic, then we're still in O(n).

> I haven't commented on any of the submodule stuff, which is probably
> where you'd be most interested in comments. I don't use submodules, nor
> do I know the code that runs them.. I guess my comments are more "if
> those who know something about submodules find this series worthwhile,
> you might want to consider my comments as well".

Thanks for your comments! I'll try to think of another way to
represent this more easily in code.

Thanks,
Stefan


Re: [RFC/WIP PATCH 1/1] merge-recursive: make file/directory conflicts easier to resolve

2018-08-09 Thread Junio C Hamano
Elijah Newren  writes:

> File/directory conflicts are somewhat difficult to resolve; by way of
> contrast, renames are much easier to handle.  For file/directory
> conflicts, we already have to record the file under a different name in
> the working copy due to the directory being in the way; if we also record
> the file under a different name in the index then it simplifies matters
> for the user, and ensures that 'git reset --hard' and 'git merge --abort'
> will clean up files created by the merge operation.

Yeah, and then our file "path" renamed to "path~2" to make room for
directory "path" they introduced, can be relocated to its final
place in the merge resolution, e.g. "git mv path~2 path/ours" or
"git mv path path.theirs && git mv path~2 path".

Of course, "git rm" and "git mv" must work sensibly, if we want this
change to be truly helpful--but if not, they need to be fixed ;-)



[PATCH 2/4] mingw: implement lock_or_unlock_fd_for_appending()

2018-08-09 Thread Johannes Schindelin via GitGitGadget
From: Johannes Schindelin 

For some reason, t/t5552-skipping-fetch-negotiator.sh fails particularly
often on Windows due to racy tracing where the `git upload-pack` and the
`git fetch` processes compete for the same file.

We just introduced a remedy that uses fcntl(), but Windows does not have
fcntl(). So let's implement an alternative.

Signed-off-by: Johannes Schindelin 
---
 compat/mingw.c   | 19 +++
 compat/mingw.h   |  3 +++
 config.mak.uname |  3 +++
 3 files changed, 25 insertions(+)

diff --git a/compat/mingw.c b/compat/mingw.c
index 6ded1c859..6da9ce861 100644
--- a/compat/mingw.c
+++ b/compat/mingw.c
@@ -514,6 +514,25 @@ int mingw_chmod(const char *filename, int mode)
return _wchmod(wfilename, mode);
 }
 
+int mingw_lock_or_unlock_fd_for_appending(int fd, int lock_it)
+{
+   HANDLE handle = (HANDLE)_get_osfhandle(fd);
+   OVERLAPPED overlapped = { 0 };
+   DWORD err;
+
+   if (!lock_it && UnlockFileEx(handle, 0, -1, 0, &overlapped))
+   return 0;
+   if (lock_it &&
+   LockFileEx(handle, LOCKFILE_EXCLUSIVE_LOCK, 0, -1, 0, &overlapped))
+   return 0;
+
+   err = GetLastError();
+   /* LockFileEx() cannot lock pipes */
+   errno = err == ERROR_INVALID_FUNCTION ?
+   EBADF : err_win_to_posix(GetLastError());
+   return -1;
+}
+
 /*
  * The unit of FILETIME is 100-nanoseconds since January 1, 1601, UTC.
  * Returns the 100-nanoseconds ("hekto nanoseconds") since the epoch.
diff --git a/compat/mingw.h b/compat/mingw.h
index 571019d0b..0f76d89a9 100644
--- a/compat/mingw.h
+++ b/compat/mingw.h
@@ -397,6 +397,9 @@ HANDLE winansi_get_osfhandle(int fd);
  * git specific compatibility
  */
 
+int mingw_lock_or_unlock_fd_for_appending(int fd, int lock_it);
+#define lock_or_unlock_fd_for_appending mingw_lock_or_unlock_fd_for_appending
+
 #define has_dos_drive_prefix(path) \
(isalpha(*(path)) && (path)[1] == ':' ? 2 : 0)
 int mingw_skip_dos_drive_prefix(char **path);
diff --git a/config.mak.uname b/config.mak.uname
index 684fc5bf0..159b7da81 100644
--- a/config.mak.uname
+++ b/config.mak.uname
@@ -376,6 +376,7 @@ ifeq ($(uname_S),Windows)
NO_POSIX_GOODIES = UnfortunatelyYes
NATIVE_CRLF = YesPlease
DEFAULT_HELP_FORMAT = html
+   HAVE_FLOCK = YesWeEmulate
 
CC = compat/vcbuild/scripts/clink.pl
AR = compat/vcbuild/scripts/lib.pl
@@ -523,6 +524,8 @@ ifneq (,$(findstring MINGW,$(uname_S)))
NO_INET_NTOP = YesPlease
NO_POSIX_GOODIES = UnfortunatelyYes
DEFAULT_HELP_FORMAT = html
+   HAVE_FLOCK = YesWeEmulate
+
COMPAT_CFLAGS += -DNOGDI -Icompat -Icompat/win32
COMPAT_CFLAGS += -DSTRIP_EXTENSION=\".exe\"
COMPAT_OBJS += compat/mingw.o compat/winansi.o \
-- 
gitgitgadget



[PATCH 1/4] Introduce a function to lock/unlock file descriptors when appending

2018-08-09 Thread Johannes Schindelin via GitGitGadget
From: Johannes Schindelin 

This function will be used to make write accesses in trace_write() a bit
safer.

Note: this patch takes a very different approach for cross-platform
support than Git is historically taking: the original approach is to
first implement everything on Linux, using the functions available on
Linux, and then trying to emulate these functions on platforms that do
not have those functions such as Windows.

This leads to all kinds of undesirable quirks in Git's source code (and
performance characteristics) because of the lack of a proper abstraction
layer: instead of declaring functions for the functionality we
*actually* need, we abuse POSIX functions to say what we need, even if
those functions serve much broader purposes (and do not make at all
clear what the caller wants of them).

For example, when Git wants to determine whether a path refers to a
symbolic link, it calls lstat(). That POSIX function has to be emulated
on Windows, painstakingly filling all the information lstat() would,
only for the caller to throw away everything except that one bit of
information, and all of the time figuring out the mtime/atime/ctime and
file size and whatnot was spent in vain.

If we were to follow that approach, we would extend the fcntl()
emulation in compat/mingw.c after this commit, to support the function
added in this commit.

But fcntl() allows much more versatile file region locking than we
actually need, so by necessity the fcntl() emulation would be quite
complex: to support `l_whence = SEEK_CUR` (which we would use, if it
did not require so much book-keeping due to our writing between locking
and unlocking the file), we would have to call `SetFilePointerEx()`
(after obtaining the `HANDLE` on which all Win32 API functions work
instead of the integer file descriptors used by all POSIX functions),
multiplying the number of Win32 API round-trips. Of course, we could
implement an incomplete emulation of fcntl()'s F_WRLCK, but that would
be unsatisfying.

An alternative approach would be to use the `flock()` function, whose
semantics are much closer to existing Win32 API. But `flock()` is not
even a POSIX standard, so we would have to implement emulation of
`flock()` for *other* platforms. And it would again be the wrong thing
to do, as we would once again fail to have an abstraction that clearly
says what *exact* functionality we want to use.

To set a precedent for a better approach, let's introduce a proper
abstraction: a function that says in its name precisely what Git
wants it to do (as opposed to *how* it does it on Linux):
lock_or_unlock_fd_for_appending().

The next commit will provide a Windows-specific implementation of this
function/functionality.

Signed-off-by: Johannes Schindelin 

squash! Introduce a function to lock/unlock file descriptors when appending
---
 git-compat-util.h |  2 ++
 wrapper.c | 14 ++
 2 files changed, 16 insertions(+)

diff --git a/git-compat-util.h b/git-compat-util.h
index 9a64998b2..13b83bade 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1202,6 +1202,8 @@ struct tm *git_gmtime_r(const time_t *, struct tm *);
 #define getc_unlocked(fh) getc(fh)
 #endif
 
+extern int lock_or_unlock_fd_for_appending(int fd, int lock_it);
+
 /*
  * Our code often opens a path to an optional file, to work on its
  * contents when we can successfully open it.  We can ignore a failure
diff --git a/wrapper.c b/wrapper.c
index e4fa9d84c..6c2116272 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -690,3 +690,17 @@ int xgethostname(char *buf, size_t len)
buf[len - 1] = 0;
return ret;
 }
+
+#ifndef GIT_WINDOWS_NATIVE
+int lock_or_unlock_fd_for_appending(int fd, int lock_it)
+{
+   struct flock flock;
+
+   flock.l_type = lock_it ? F_WRLCK : F_UNLCK;
+   flock.l_whence = SEEK_SET;
+   flock.l_start = 0;
+   flock.l_len = 0xFFFFFFFF; /* arbitrary number of bytes */
+
+   return fcntl(fd, F_SETLKW, &flock);
+}
+#endif
-- 
gitgitgadget



[PATCH 3/4] trace: lock the trace file to avoid racy trace_write() calls

2018-08-09 Thread Johannes Schindelin via GitGitGadget
From: Johannes Schindelin 

When multiple processes try to write to the same file, it is not
guaranteed that everything works as expected: those writes can overlap,
and in the worst case even lose messages.

This happens in t/t5552-skipping-fetch-negotiator.sh, where we abuse the
`GIT_TRACE` facility to write traces of two concurrent processes (`git
fetch` and `git upload-pack`) to the same file, and then verify that the
trace contains certain expected breadcrumbs.

To remedy this, let's lock the file descriptors for exclusive writing,
using the function we just introduced in the previous commit.

Signed-off-by: Johannes Schindelin 
---
 trace.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/trace.c b/trace.c
index fc623e91f..6f97dde27 100644
--- a/trace.c
+++ b/trace.c
@@ -114,11 +114,20 @@ static int prepare_trace_line(const char *file, int line,
 
 static void trace_write(struct trace_key *key, const void *buf, unsigned len)
 {
-   if (write_in_full(get_trace_fd(key), buf, len) < 0) {
+   int fd = get_trace_fd(key), locked;
+
+   locked = !lock_or_unlock_fd_for_appending(fd, 1);
+   if (!locked && errno != EBADF)
+   warning("unable to lock file descriptor for %s: %s",
+   key->key, strerror(errno));
+   if (write_in_full(fd, buf, len) < 0) {
warning("unable to write trace for %s: %s",
key->key, strerror(errno));
trace_disable(key);
}
+   if (locked && lock_or_unlock_fd_for_appending(fd, 0) < 0)
+   warning("failed to remove lock on fd for %s: %s",
+   key->key, strerror(errno));
 }
 
 void trace_verbatim(struct trace_key *key, const void *buf, unsigned len)
-- 
gitgitgadget



[PATCH 0/4] t5552: fix flakiness by introducing proper locking for GIT_TRACE

2018-08-09 Thread Johannes Schindelin via GitGitGadget
I reported a couple of times that t5552 is not passing reliably. It has now
reached next, and will no doubt infect master soon.

Turns out that it is not a Windows-specific issue, even if it occurs a lot 
more often on Windows than elsewhere.

The culprit is that two processes try simultaneously to write to the same
file specified via GIT_TRACE_PACKET, and it is not well defined how that
should work, even on Linux.

This patch series addresses that by locking the trace fd. I chose to use 
flock() instead of fcntl() because the Win32 API LockFileEx()
[https://docs.microsoft.com/en-us/windows/desktop/api/fileapi/nf-fileapi-lockfileex]
 
(which does exactly what I want in this context) has much more similar
semantics to the former than the latter.

Of course, I have to admit that I am not super solid on flock() semantics,
and I also do not know which conditional blocks in config.mak.uname should
grow a HAVE_FLOCK = YesWeDo line, still. Reviewers knowledgeable in flock() 
semantics: I would be much indebted if you helped me there. Also: is it safe
to call flock() on file descriptors referring not to files, but, say, pipes
or an interactive terminal?

Johannes Schindelin (4):
  Introduce a function to lock/unlock file descriptors when appending
  mingw: implement lock_or_unlock_fd_for_appending()
  trace: lock the trace file to avoid racy trace_write() calls
  trace: verify that locking works

 Makefile   |   1 +
 compat/mingw.c |  19 ++
 compat/mingw.h |   3 +
 config.mak.uname   |   3 +
 git-compat-util.h  |   2 +
 t/helper/test-tool.c   |   1 +
 t/helper/test-tool.h   |   1 +
 t/helper/test-trace.c  | 130 +
 t/t0070-fundamental.sh |   6 ++
 trace.c|  11 +++-
 wrapper.c  |  14 +
 11 files changed, 190 insertions(+), 1 deletion(-)
 create mode 100644 t/helper/test-trace.c


base-commit: 42cc7485a2ec49ecc440c921d2eb0cae4da80549
Published-As: 
https://github.com/gitgitgadget/git/releases/tags/pr-17%2Fdscho%2Ffetch-negotiator-skipping-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git 
pr-17/dscho/fetch-negotiator-skipping-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/17
-- 
gitgitgadget


[PATCH 4/4] trace: verify that locking works

2018-08-09 Thread Johannes Schindelin via GitGitGadget
From: Johannes Schindelin 

Recently, t5552 introduced a pattern where two processes try to write to
the same GIT_TRACE file in parallel. This is not safe, as the two
processes fighting over who gets to append to the file can cause garbled
lines and may even result in data loss on Windows (where buffers are
written to before they are flushed).

To remedy this, we introduced the lock_or_unlock_fd_for_appending()
function. And to make sure that this works, this commit introduces a
regression test.

Signed-off-by: Johannes Schindelin 
---
 Makefile   |   1 +
 t/helper/test-tool.c   |   1 +
 t/helper/test-tool.h   |   1 +
 t/helper/test-trace.c  | 130 +
 t/t0070-fundamental.sh |   6 ++
 5 files changed, 139 insertions(+)
 create mode 100644 t/helper/test-trace.c

diff --git a/Makefile b/Makefile
index 617475622..2e3fb5b8d 100644
--- a/Makefile
+++ b/Makefile
@@ -729,6 +729,7 @@ TEST_BUILTINS_OBJS += test-strcmp-offset.o
 TEST_BUILTINS_OBJS += test-string-list.o
 TEST_BUILTINS_OBJS += test-submodule-config.o
 TEST_BUILTINS_OBJS += test-subprocess.o
+TEST_BUILTINS_OBJS += test-trace.o
 TEST_BUILTINS_OBJS += test-urlmatch-normalization.o
 TEST_BUILTINS_OBJS += test-wildmatch.o
 TEST_BUILTINS_OBJS += test-write-cache.o
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index 805a45de9..7adce872b 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -39,6 +39,7 @@ static struct test_cmd cmds[] = {
{ "string-list", cmd__string_list },
{ "submodule-config", cmd__submodule_config },
{ "subprocess", cmd__subprocess },
+   { "trace", cmd__trace },
{ "urlmatch-normalization", cmd__urlmatch_normalization },
{ "wildmatch", cmd__wildmatch },
{ "write-cache", cmd__write_cache },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 7116ddfb9..c462ac924 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -33,6 +33,7 @@ int cmd__strcmp_offset(int argc, const char **argv);
 int cmd__string_list(int argc, const char **argv);
 int cmd__submodule_config(int argc, const char **argv);
 int cmd__subprocess(int argc, const char **argv);
+int cmd__trace(int argc, const char **argv);
 int cmd__urlmatch_normalization(int argc, const char **argv);
 int cmd__wildmatch(int argc, const char **argv);
 int cmd__write_cache(int argc, const char **argv);
diff --git a/t/helper/test-trace.c b/t/helper/test-trace.c
new file mode 100644
index 0..1cc88b030
--- /dev/null
+++ b/t/helper/test-trace.c
@@ -0,0 +1,130 @@
+#include "test-tool.h"
+#include "cache.h"
+#include "run-command.h"
+
+static struct child_process children[2] = {
+   CHILD_PROCESS_INIT,
+   CHILD_PROCESS_INIT,
+};
+
+#define SAY(child, what) \
+   if (write_in_full(children[child].in, \
+ what "\n", strlen(what) + 1) < 0) \
+   die("Failed to tell child process #%d to %s", child, what)
+
+#define LISTEN(child, what) \
+   if (strbuf_getwholeline_fd(&buf, children[child].out, '\n') < 0) \
+   die("Child process #%d failed to acknowledge %s", child, what)
+
+#define ACK(what) \
+   if (write_in_full(1, what ": ACK\n", strlen(what) + 6) < 0) \
+   die_errno("'%s': %s ACK", child_name, what)
+
+static void contention_orchestrator(const char *argv0)
+{
+   struct strbuf buf = STRBUF_INIT;
+   int i;
+
+   /* Spawn two children and simulate write contention */
+   trace_printf("start");
+
+   for (i = 0; i < 2; i++) {
+   strbuf_reset(&buf);
+   strbuf_addf(&buf, "child #%d", i);
+   argv_array_pushl(&children[i].args,
+   argv0, "trace", "lock", buf.buf, NULL);
+   children[i].in = children[i].out = -1;
+   if (start_command(&children[i]) < 0)
+   die("Could not spawn child process #%d", i);
+   }
+
+   SAY(1, "lock");
+   LISTEN(1, "lock");
+
+   SAY(0, "trace delayed");
+   SAY(1, "trace while-locked");
+   LISTEN(1, "trace");
+
+   SAY(1, "unlock");
+   LISTEN(1, "unlock");
+   LISTEN(0, "trace");
+
+   SAY(0, "quit");
+   SAY(1, "quit");
+
+   if (finish_command(&children[0]) < 0 ||
+   finish_command(&children[1]) < 0)
+   die("Child process failed to finish");
+
+   strbuf_release(&buf);
+}
+
+static void child(const char *child_name)
+{
+   struct strbuf buf = STRBUF_INIT;
+   int fd, locked = 0;
+   const char *p;
+
+   /* This is the child process */
+   trace_printf("child start: '%s'", child_name);
+   fd = trace_default_key.fd;
+   if (fd <= 0)
+   die("child process: not tracing...");
+   while (!strbuf_getwholeline_fd(&buf, 0, '\n')) {
+   strbuf_rtrim(&buf);
+   if (!strcmp("lock", buf.buf)) {
+   if (lock_or_unlock_fd_for_appending(fd, 1) < 0 &&
+   errno != EBADF)
+   

Re: [PATCH 03/10] sha1-array: provide oid_array_remove_if

2018-08-09 Thread Stefan Beller
On Thu, Aug 9, 2018 at 12:39 AM Martin Ågren  wrote:
>
> On 9 August 2018 at 00:17, Stefan Beller  wrote:
> > +int oid_array_remove_if(struct oid_array *array,
> > +   for_each_oid_fn fn,
> > +   void *data)
> > +{
> > +   int i, j;
> > +   char *to_remove = xcalloc(array->nr, sizeof(char));
>
> Do you really need this scratch space?

I don't think so, when we reorder the items while iterating over them.

I though reordering them later would be easier, but I am not sure anymore.

>
> > +   /* No oid_array_sort() here! See the api-oid-array.txt docs! */
> > +
> > +   for (i = 0; i < array->nr; i++) {
> > +   int ret = fn(array->oid + i, data);
> > +   if (ret)
> > +   to_remove[i] = 1;
> > +   }
> > +
> > +   i = 0, j = 0;
> > +   while (i < array->nr && j < array->nr) {
> > +   while (i < array->nr && !to_remove[i])
> > +   i++;
> > +   /* i at first marked for deletion or out */
> > +   if (j < i)
> > +   j = i;
> > +   while (j < array->nr && to_remove[j])
> > +   j++;
> > +   /* j > i; j at first valid after first deletion range or 
> > out */
> > +   if (i < array->nr && j < array->nr)
> > +   oidcpy(&array->oid[i], &array->oid[j]);
> > +   else if (i >= array->nr)
> > +   assert(j >= array->nr);
> > +   /* no pruning happened, keep original array->nr */
> > +   else if (j >= array->nr)
> > +   array->nr = i;
> > +   }
> > +
> > +   free(to_remove);
> > +
> > +   return 0;
> > +}
>
> I can't entirely follow this index-fiddling, but then I haven't had my
> morning coffee yet, so please forgive me if this is nonsense. Would it
> suffice to let i point out where to place items (starting at the first
> item not to keep) and j where to take them from (i.e., the items to
> keep, after the initial run)?

I thought this is what happens, just after the actual loop of calls.

> > +int oid_array_remove_if(struct oid_array *array,
> > +   for_each_oid_fn fn,
> > +   void *data);
>
> Maybe some documentation here, but this seems to be following suit. ;-)

Worth mentioning: the order is kept stable. (c.f. shrink_potential_moved_blocks
in diff.c which also "compacts an array", but without stable order).

Thanks for the review. I'll try to rewrite this to be more legible.

Thanks,
Stefan


Re: [PATCH 1/1] verify-tag/verify-commit should exit unsuccessfully when signature is not trusted

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 08:30:25AM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > There was a patch at the start of this thread, but it specifically
> > checks for "sigc->result == U".  That's probably OK, since I think it
> > restores the behavior in earlier versions of Git. But I wonder if we
> > should simply be storing the fact that gpg exited non-zero and relaying
> > that. That would fix this problem and truly make the rule "if gpg
> > reported an error, we propagate that".
> 
> Yeah, I like that.  Something like this, perhaps?  Points to note:
> 
>  * status gets the return value from verify_signed_buffer(), which
>essentially is what wait_or_whine() gives us for the "gpg
>--verify" process.
> 
>  * Even if status says "failed", we still need to parse the output
>to set sigc->result.  We used to use sigc->result as the sole
>source of our return value, but now we turn 'status' into 'bad'
>(i.e. non-zero) after parsing and finding it is not mechanically
>good (which is the same criteria as we have always used before).
>An already bad status is left as bad.
> 
>  * And we return 'status'.

Yeah, this is exactly what I had in mind. And the size of the code
change is much smaller than I feared. The case that I thought might be
complicated is still reading the output after we've seen the non-zero
status, but the existing "if (status && !gpg_output.len)" covers that.

> If we choose to blindly trust the exit status of "gpg --verify" and
> not interpret the result ourselves, we can lose the "smudge status
> to be bad if not G/U" bit, which I offhand do not think makes much
> difference either way.  I just left it there because showing what
> can be removed and saying it can be dropped is easier than showing
> the result of removal and saying it can be added--simply because I
> need to describe "it" if I go the latter route.

I guess leaving it serves as a sort of cross-check if gpg would return a
zero exit code but indicate in the status result that the signature was
not good. Sort of a belt-and-suspenders, I guess (which might not be
that implausible if we think about somebody wrapping gpg with a sloppy
bit of shell code that loses the exit code -- it's their fault, but it
might be nice for us to err on the conservative side).

Probably it should go back to just "result != G" then, though (thus
bringing the whole conversation full circle :) ).

I could live with or without it, though.

-Peff


Re: Help with "fatal: unable to read ...." error during GC?

2018-08-09 Thread Jeff King
On Wed, Aug 08, 2018 at 10:45:49PM -0400, Paul Smith wrote:

> On Wed, 2018-08-08 at 14:24 -0400, Jeff King wrote:
> > If so, can you try running it under gdb and getting a stack trace?
> > Something like:
> > 
> >   gdb git
> >   [and then inside gdb...]
> >   set args pack-objects --all --reflog --indexed-objects foo </dev/null
> >   break die
> >   run
> >   bt
> > 
> > That might give us a clue where the broken object reference is coming
> 
> Here we go.  I can rebuild with -Og or -O0 if more detailed debugging
> is needed; most everything appears to be optimized out:

No, I think this is enough to give a general sense of the problem
location.

> Compressing objects: 100% (10/10), done.
> Writing objects:  54% (274416/508176)   
> Thread 1 "git" hit Breakpoint 1, die (err=err@entry=0x5a373a "unable to read %s") at usage.c:119
> 119 {
> (gdb) bt
> #0  die (err=err@entry=0x5a373a "unable to read %s") at usage.c:119
> #1  0x004563f3 in get_delta (entry=<optimized out>) at builtin/pack-objects.c:143
> #2  write_no_reuse_object () at builtin/pack-objects.c:308
> #3  0x00456592 in write_reuse_object (usable_delta=<optimized out>, limit=<optimized out>, entry=<optimized out>, f=<optimized out>) at builtin/pack-objects.c:516
> #4  write_object (write_offset=<optimized out>, entry=0x7fffc9a8d940, f=0x198fb70) at builtin/pack-objects.c:518
> #5  write_one () at builtin/pack-objects.c:576
> #6  0x004592f0 in write_pack_file () at builtin/pack-objects.c:849
> #7  cmd_pack_objects (argc=<optimized out>, argv=<optimized out>, prefix=<optimized out>) at builtin/pack-objects.c:3354
> #8  0x00404f06 in run_builtin (argv=<optimized out>, argc=<optimized out>, p=<optimized out>) at git.c:417
> #9  handle_builtin (argc=<optimized out>, argv=<optimized out>) at git.c:632
> #10 0x00405f21 in run_argv (argv=0x7fffe210, argcp=0x7fffe21c) at git.c:761
> #11 cmd_main (argc=<optimized out>, argc@entry=6, argv=<optimized out>, argv@entry=0x7fffe448) at git.c:761
> #12 0x00404b15 in main (argc=6, argv=0x7fffe448) at common-main.c:45

So that's quite unexpected. I assumed we'd have hit this problem while
deciding _which_ objects to write. But we get all the way to the point
of writing out the result before we notice it's missing.

I don't think I've run such a case before, but I wonder if "pack-objects
--all" is too lax about adding missing blobs during its object traversal
(especially during the "unreachable but recent" part of the traversal
that I mentioned, which should silently omit missing objects). I played
around with recreating this situation, though, and I don't think it's
possible to cause the results you're seeing. We come up with a list of
recent objects, but we only use it as a look-up index for discarding
too-old objects. So:

  - it wouldn't ever cause us to choose to write an object into a pack,
which is what you're seeing

  - we'd never consider a missing object; it's a pure lookup table, and
the actual list of objects we consider is found by walking the set
of packs

So that's probably a dead end.

What I really wonder is where we found out about that object name in the
first place. Can you instrument your Git build like this:

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 71056d8294..5ff6de5ddf 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1112,6 +1112,13 @@ static int add_object_entry(const struct object_id *oid, 
enum object_type type,
struct packed_git *found_pack = NULL;
off_t found_offset = 0;
uint32_t index_pos;
+   static const struct object_id funny_oid = {
+   "\xc1\x04\xb8\xfb\x36\x31\xb5\xc5\x46\x95"
+   "\x20\x6b\x2f\x73\x31\x0c\x02\x3c\x99\x63"
+   };
+
+   if (!oidcmp(oid, &funny_oid))
+   warning("found funny oid");
 
display_progress(progress_state, ++nr_seen);
 

and similarly get a backtrace when we hit that warning()? (Or if you're
a gdb expert, you could probably use a conditional breakpoint, but I
find just modifying the source easier).

-Peff


Re: [PATCH v2 2/2] repack: repack promisor objects if -a or -A is set

2018-08-09 Thread Junio C Hamano
Jonathan Tan  writes:

> +/*
> + * Write oid to the given struct child_process's stdin, starting it first if
> + * necessary.
> + */
> +static int write_oid(const struct object_id *oid, struct packed_git *pack,
> +  uint32_t pos, void *data)
> +{
> + struct child_process *cmd = data;
> +
> + if (cmd->in == -1) {
> + if (start_command(cmd))
> + die("Could not start pack-objects to repack promisor 
> objects");
> + }
> +
> + xwrite(cmd->in, oid_to_hex(oid), GIT_SHA1_HEXSZ);
> + xwrite(cmd->in, "\n", 1);
> + return 0;
> +}
> +
> +static void repack_promisor_objects(const struct pack_objects_args *args,
> + struct string_list *names)
> +{
> + struct child_process cmd = CHILD_PROCESS_INIT;
> + FILE *out;
> + struct strbuf line = STRBUF_INIT;
> +
> + prepare_pack_objects(&cmd, args);
> + cmd.in = -1;
> +
> + /*
> +  * NEEDSWORK: Giving pack-objects only the OIDs without any ordering
> +  * hints may result in suboptimal deltas in the resulting pack. See if
> +  * the OIDs can be sent with fake paths such that pack-objects can use a
> +  * {type -> existing pack order} ordering when computing deltas instead
> +  * of a {type -> size} ordering, which may produce better deltas.
> +  */
> + for_each_packed_object(write_oid, &cmd,
> +FOR_EACH_OBJECT_PROMISOR_ONLY);
> +
> + if (cmd.in == -1)
> + /* No packed objects; cmd was never started */
> + return;
> + close(cmd.in);
> +
> + out = xfdopen(cmd.out, "r");

Hmm, it is clever to auto-start the pack-objects (and notice there
wasn't anything needing to pack).  Two things that are worth noting
are:

 - The code takes advantage of the fact that cmd.in being left as -1
   is a sign that start_command() was never called (well, it could
   be that it was called but failed to open pipe---but then we would
   have died inside write_oid(), so we can ignore that case).  This
   is not relying on its implementation detail---cmd.in would be
   replaced by a real fd to the pipe if start_command() was called.

 - We know that pack-objects reads all its standard input through to
   the EOF before emitting a single byte to its standard output, and
   that is why we can use bidi pipe and not worry about deadlocking
   caused by not reading from cmd.out at all (before closing cmd.in,
   that is).

So I kind of like the cleverness of the design of this code.

> + while (strbuf_getline_lf(&line, out) != EOF) {
> + char *promisor_name;
> + int fd;
> + if (line.len != 40)
> + die("repack: Expecting 40 character sha1 lines only 
> from pack-objects.");
> + string_list_append(names, line.buf);
> +
> + /*
> +  * pack-objects creates the .pack and .idx files, but not the
> +  * .promisor file. Create the .promisor file, which is empty.
> +  */
> + promisor_name = mkpathdup("%s-%s.promisor", packtmp,
> +   line.buf);
> + fd = open(promisor_name, O_CREAT|O_EXCL|O_WRONLY, 0600);
> + if (fd < 0)
> + die_errno("unable to create '%s'", promisor_name);
> + close(fd);
> + free(promisor_name);

For now, as we do not know what is the appropriate thing to leave in
the file, empty is fine, but perhaps in the future we may want to
carry forward the info from the existing .promisor file?  What does
it initially contain?  Transport invoked with from_promisor bit
seems to call index-pack with "--promisor" option with no argument,
so this is probably in line with that.  Do we in the future need to
teach "--promisor" option to pack-objects we invoke here, which will
be passed to the index-pack it internally performs, so that we do
not have to do this by hand here?

In any case, don't we need to update the doc for the "--promisor"
option of "index-pack"?  The same comment for the new options added
in the same topic, e.g. "--from-promisor" and "--no-dependents"
options added to "fetch-pack".  It seems that the topic that ended
at 0c16cd499d was done rather hastily and under-documented?

> @@ -440,6 +513,8 @@ int cmd_repack(int argc, const char **argv, const char 
> *prefix)
>  
>   /* End of pack replacement. */
>  
> + reprepare_packed_git(the_repository);
> +

I do not think this call hurts, but why does this become suddenly
necessary with _this_ patch?  Is it an unrelated bugfix?

>   if (delete_redundant) {
>   int opts = 0;
>   string_list_sort(&names);

Thanks.


Re: [GSoC][PATCH v5 11/20] rebase -i: rewrite complete_action() in C

2018-08-09 Thread Alban Gruin
On 09/08/2018 at 16:22, Phillip Wood wrote:
>>   +int complete_action(struct replay_opts *opts, unsigned flags,
>> +    const char *shortrevisions, const char *onto_name,
>> +    const char *onto, const char *orig_head, const char *cmd,
>> +    unsigned autosquash)
>> +{
>> +    const char *shortonto, *todo_file = rebase_path_todo();
>> +    struct todo_list todo_list = TODO_LIST_INIT;
>> +    struct strbuf *buf = &(todo_list.buf);
>> +    struct object_id oid;
>> +    struct stat st;
>> +
>> +    get_oid(onto, &oid);
> 
> Is onto guaranteed to exist? If not the return value of get_oid() needs
> to be checked. Otherwise a comment or an assertion might be useful here.
> 

Yes, either the user defines it manually (with --onto=), or it is
automatically determined by git-rebase.sh.

>> +    shortonto = find_unique_abbrev(&oid, DEFAULT_ABBREV);
>> +
>> +    if (!lstat(todo_file, &st) && st.st_size == 0 &&
>> +    write_message("noop\n", 5, todo_file, 0))
>> +    return error_errno(_("could not write '%s'"), todo_file);
>> +
>> +    if (autosquash && rearrange_squash())
>> +    return 1;
> 
> git functions normally return -1 for an error as the 'return error(...)'
> does above, is there a reason for a different return value here?
> 

No, I will fix this.

>> +
>> +    if (cmd && *cmd)
>> +    sequencer_add_exec_commands(cmd);
>> +
>> +    if (strbuf_read_file(buf, todo_file, 0) < 0)
>> +    return error_errno(_("could not read '%s'."), todo_file);
>> +
>> +    if (parse_insn_buffer(buf->buf, &todo_list)) {
> 
> I was worried when I saw this because buf is an alias of todo_list.buf
> and I thought passing the same buffer in twice would end badly. However
> parse_insn_buffer() has a confusing signature, it expects the caller to
> have filled todo_list.buf with the buffer to be parsed and to pass a
> pointer to the same buffer. I think this should be cleaned up at some
> point but it is outside the scope of this series.
> 
>> +    todo_list_release(&todo_list);
>> +    return error(_("unusable todo list: '%s'"), todo_file);
>> +    }
>> +
>> +    if (count_commands(&todo_list) == 0) {
>> +    apply_autostash(opts);
>> +    sequencer_remove_state(opts);
>> +    todo_list_release(&todo_list);
>> +
>> +    fputs("Nothing to do\n", stderr);
> 
> The capital 'N' is a faithful copy of the current message in rebase.sh.
> However it might be worth changing it to 'n' to match the normal style
> of git error messages starting with a lowercase letter.
> 
>> +    return 1;
> 
> It might be better to do 'return error(...)' instead
> 

This will require a test change – not a big deal, of course.  It’s
perhaps a good idea to mark this string for translation, too.

>> +    }
>> +
>> +    strbuf_addch(buf, '\n');
>> +    strbuf_commented_addf(buf, Q_("Rebase %s onto %s (%d command)",
>> +  "Rebase %s onto %s (%d commands)",
>> +  count_commands(&todo_list)),
>> +  shortrevisions, shortonto, count_commands(&todo_list));
>> +    append_todo_help(0, flags & TODO_LIST_KEEP_EMPTY, buf);
>> +
>> +    if (write_message(buf->buf, buf->len, todo_file, 0)) {
>> +    todo_list_release(&todo_list);
>> +    return error_errno(_("could not write '%s'"), todo_file);
>> +    }
>> +
>> +    copy_file(rebase_path_todo_backup(), todo_file, 0666);
>> +    transform_todos(flags | TODO_LIST_SHORTEN_IDS);
> 
> Both of the above lines can fail, so the return values need checking I
> think.
> 
>> +
>> +    strbuf_reset(buf);
>> +
>> +    if (launch_sequence_editor(todo_file, buf, NULL)) {
>> +    apply_autostash(opts);
>> +    sequencer_remove_state(opts);
>> +    todo_list_release(&todo_list);
>> +
>> +    return error(_("could not execute editor"));
> 
> I think  launch_sequence_editor() will have printed an error message
> already, so this is unnecessary.
> 

And the error would be more specific, true.

Thanks!
Alban



Re: [GSoC][PATCH v5 04/20] rebase -i: rewrite the edit-todo functionality in C

2018-08-09 Thread Phillip Wood

Hi Alban

On 09/08/18 16:35, Phillip Wood wrote:

Hi Alban

On 09/08/18 14:30, Alban Gruin wrote:

Hi Phillip,

Le 08/08/2018 à 18:04, Phillip Wood a écrit :

+int edit_todo_list(unsigned flags)
+{
+    struct strbuf buf = STRBUF_INIT;
+    const char *todo_file = rebase_path_todo();
+    FILE *todo;
+
+    if (strbuf_read_file(&buf, todo_file, 0) < 0)
+    return error_errno(_("could not read '%s'."), todo_file);
+
+    strbuf_stripspace(&buf, 1);
+    todo = fopen_or_warn(todo_file, "w");


This truncates the existing file; if there are any errors writing the
new one then the user has lost the old one. write_message() in
sequencer.c avoids this problem by writing a new file and then
renaming it if the write is successful; maybe it is worth exporting it
so it can be used elsewhere.


+    if (!todo) {
+    strbuf_release(&buf);
+    return 1;
+    }
+
+    strbuf_write(&buf, todo);
+    fclose(todo);


There needs to be some error checking after the write and the close
(using write_message() would mean you only have to check for errors in
one place)



Right.  Should I find a new name for write_message()?


That might be a good idea. I'm not sure what it should be though, maybe
write_file()? Perhaps someone else might have a better suggestion.



write_file() already exists in wrapper.c.  I wondered, as this make use
of the lockfile API, perhaps it could be moved to lockfile.{c,h}, and
renamed to something like write_file_with_lock().


Hmm, I'm not sure about that; to me the use of the lockfile api is just
a way to get the semantics of an atomic update to the file. Looking in
wrapper.c there is write_file_buf() which could maybe be changed to do
an atomic update. Looking at the current callers of that function, they
look like they would work with such a change (sadly none of them bother
to check the return value to see if the write was successful). I don't
think we need to worry about them being broken by having a stale
lockfile left over if there's an error, as I think the lockfile api
automatically cleans them up, but I'm not an expert on that api so it
would be worth checking with someone like Johannes or Jeff King (I seem
to remember him working in that area relatively recently).


On second thoughts the current function dies rather than returning an 
error so it would be a bit of work to do the conversion properly. For 
now you could just make write_message() available outside sequencer.c 
and see if anyone comes up with a better name when v6 is reviewed.


Best Wishes

Phillip


Best Wishes

Phillip



Cheers,
Alban







[PATCH v3 2/8] Add delta-islands.{c,h}

2018-08-09 Thread Christian Couder
From: Jeff King 

Hosting providers that allow users to "fork" existing
repos want those forks to share as much disk space as
possible.

Alternates are an existing solution to keep all the
objects from all the forks into a unique central repo,
but this can have some drawbacks. Especially when
packing the central repo, deltas will be created
between objects from different forks.

This can make cloning or fetching a fork much slower
and much more CPU intensive as Git might have to
compute new deltas for many objects to avoid sending
objects from a different fork.

Because the inefficiency primarily arises when an
object is deltified against another object that does
not exist in the same fork, we partition objects into
sets that appear in the same fork, and define
"delta islands". When finding a delta base, we do not
allow an object outside the same island to be
considered as its base.

So "delta islands" is a way to store objects from
different forks in the same repo and packfile without
having deltas between objects from different forks.

This patch implements the delta islands mechanism in
"delta-islands.{c,h}", but does not yet make use of it.

A few new fields are added in 'struct object_entry'
in "pack-objects.h" though.

The documentation will follow in a patch that actually
uses delta islands in "builtin/pack-objects.c".

Signed-off-by: Jeff King 
Signed-off-by: Christian Couder 
---
 Makefile|   1 +
 delta-islands.c | 497 
 delta-islands.h |  11 ++
 pack-objects.h  |   4 +
 4 files changed, 513 insertions(+)
 create mode 100644 delta-islands.c
 create mode 100644 delta-islands.h

diff --git a/Makefile b/Makefile
index bc4fc8eeab..e7994888e8 100644
--- a/Makefile
+++ b/Makefile
@@ -841,6 +841,7 @@ LIB_OBJS += csum-file.o
 LIB_OBJS += ctype.o
 LIB_OBJS += date.o
 LIB_OBJS += decorate.o
+LIB_OBJS += delta-islands.o
 LIB_OBJS += diffcore-break.o
 LIB_OBJS += diffcore-delta.o
 LIB_OBJS += diffcore-order.o
diff --git a/delta-islands.c b/delta-islands.c
new file mode 100644
index 00..448ddcbbe4
--- /dev/null
+++ b/delta-islands.c
@@ -0,0 +1,497 @@
+#include "cache.h"
+#include "attr.h"
+#include "object.h"
+#include "blob.h"
+#include "commit.h"
+#include "tag.h"
+#include "tree.h"
+#include "delta.h"
+#include "pack.h"
+#include "tree-walk.h"
+#include "diff.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "progress.h"
+#include "refs.h"
+#include "khash.h"
+#include "pack-bitmap.h"
+#include "pack-objects.h"
+#include "delta-islands.h"
+#include "sha1-array.h"
+#include "config.h"
+
+KHASH_INIT(str, const char *, void *, 1, kh_str_hash_func, kh_str_hash_equal)
+
+static khash_sha1 *island_marks;
+static unsigned island_counter;
+static unsigned island_counter_core;
+
+static kh_str_t *remote_islands;
+
+struct remote_island {
+   uint64_t hash;
+   struct oid_array oids;
+};
+
+struct island_bitmap {
+   uint32_t refcount;
+   uint32_t bits[];
+};
+
+static uint32_t island_bitmap_size;
+
+/*
+ * Allocate a new bitmap; if "old" is not NULL, the new bitmap will be a copy
+ * of "old". Otherwise, the new bitmap is empty.
+ */
+static struct island_bitmap *island_bitmap_new(const struct island_bitmap *old)
+{
+   size_t size = sizeof(struct island_bitmap) + (island_bitmap_size * 4);
+   struct island_bitmap *b = xcalloc(1, size);
+
+   if (old)
+   memcpy(b, old, size);
+
+   b->refcount = 1;
+   return b;
+}
+
+static void island_bitmap_or(struct island_bitmap *a, const struct island_bitmap *b)
+{
+   uint32_t i;
+
+   for (i = 0; i < island_bitmap_size; ++i)
+   a->bits[i] |= b->bits[i];
+}
+
+static int island_bitmap_is_subset(struct island_bitmap *self,
+   struct island_bitmap *super)
+{
+   uint32_t i;
+
+   if (self == super)
+   return 1;
+
+   for (i = 0; i < island_bitmap_size; ++i) {
+   if ((self->bits[i] & super->bits[i]) != self->bits[i])
+   return 0;
+   }
+
+   return 1;
+}
+
+#define ISLAND_BITMAP_BLOCK(x) (x / 32)
+#define ISLAND_BITMAP_MASK(x) (1 << (x % 32))
+
+static void island_bitmap_set(struct island_bitmap *self, uint32_t i)
+{
+   self->bits[ISLAND_BITMAP_BLOCK(i)] |= ISLAND_BITMAP_MASK(i);
+}
+
+static int island_bitmap_get(struct island_bitmap *self, uint32_t i)
+{
+   return (self->bits[ISLAND_BITMAP_BLOCK(i)] & ISLAND_BITMAP_MASK(i)) != 0;
+}
+
+int in_same_island(const struct object_id *trg_oid, const struct object_id *src_oid)
+{
+   khiter_t trg_pos, src_pos;
+
+   /* If we aren't using islands, assume everything goes together. */
+   if (!island_marks)
+   return 1;
+
+   /*
+* If we don't have a bitmap for the target, we can delta it
+* against anything -- it's not an important object
+*/
+   trg_pos = kh_get_sha1(island_marks, trg_oid->hash);
+   if (trg_pos >= kh_end(island_marks))
+  

[PATCH v3 6/8] t: add t5319-delta-islands.sh

2018-08-09 Thread Christian Couder
From: Jeff King 

Signed-off-by: Jeff King 
Signed-off-by: Christian Couder 
---
 t/t5319-delta-islands.sh | 143 +++
 1 file changed, 143 insertions(+)
 create mode 100755 t/t5319-delta-islands.sh

diff --git a/t/t5319-delta-islands.sh b/t/t5319-delta-islands.sh
new file mode 100755
index 00..fea92a5777
--- /dev/null
+++ b/t/t5319-delta-islands.sh
@@ -0,0 +1,143 @@
+#!/bin/sh
+
+test_description='exercise delta islands'
+. ./test-lib.sh
+
+# returns true iff $1 is a delta based on $2
+is_delta_base () {
+   delta_base=$(echo "$1" | git cat-file --batch-check='%(deltabase)') &&
+   echo >&2 "$1 has base $delta_base" &&
+   test "$delta_base" = "$2"
+}
+
+# generate a commit on branch $1 with a single file, "file", whose
+# content is mostly based on the seed $2, but with a unique bit
+# of content $3 appended. This should allow us to see whether
+# blobs of different refs delta against each other.
+commit() {
+   blob=$({ test-tool genrandom "$2" 10240 && echo "$3"; } |
+  git hash-object -w --stdin) &&
+   tree=$(printf '100644 blob %s\tfile\n' "$blob" | git mktree) &&
+   commit=$(echo "$2-$3" | git commit-tree "$tree" ${4:+-p "$4"}) &&
+   git update-ref "refs/heads/$1" "$commit" &&
+   eval "$1"'=$(git rev-parse $1:file)' &&
+   eval "echo >&2 $1=\$$1"
+}
+
+test_expect_success 'setup commits' '
+   commit one seed 1 &&
+   commit two seed 12
+'
+
+# Note: This is heavily dependent on the "prefer larger objects as base"
+# heuristic.
+test_expect_success 'vanilla repack deltas one against two' '
+   git repack -adf &&
+   is_delta_base $one $two
+'
+
+test_expect_success 'island repack with no island definition is vanilla' '
+   git repack -adfi &&
+   is_delta_base $one $two
+'
+
+test_expect_success 'island repack with no matches is vanilla' '
+   git -c "pack.island=refs/foo" repack -adfi &&
+   is_delta_base $one $two
+'
+
+test_expect_success 'separate islands disallows delta' '
+   git -c "pack.island=refs/heads/(.*)" repack -adfi &&
+   ! is_delta_base $one $two &&
+   ! is_delta_base $two $one
+'
+
+test_expect_success 'same island allows delta' '
+   git -c "pack.island=refs/heads" repack -adfi &&
+   is_delta_base $one $two
+'
+
+test_expect_success 'coalesce same-named islands' '
+   git \
+   -c "pack.island=refs/(.*)/one" \
+   -c "pack.island=refs/(.*)/two" \
+   repack -adfi &&
+   is_delta_base $one $two
+'
+
+test_expect_success 'island restrictions drop reused deltas' '
+   git repack -adfi &&
+   is_delta_base $one $two &&
+   git -c "pack.island=refs/heads/(.*)" repack -adi &&
+   ! is_delta_base $one $two &&
+   ! is_delta_base $two $one
+'
+
+test_expect_success 'island regexes are left-anchored' '
+   git -c "pack.island=heads/(.*)" repack -adfi &&
+   is_delta_base $one $two
+'
+
+test_expect_success 'island regexes follow last-one-wins scheme' '
+   git \
+   -c "pack.island=refs/heads/(.*)" \
+   -c "pack.island=refs/heads/" \
+   repack -adfi &&
+   is_delta_base $one $two
+'
+
+test_expect_success 'setup shared history' '
+   commit root shared root &&
+   commit one shared 1 root &&
+   commit two shared 12-long root
+'
+
+# We know that $two will be preferred as a base from $one,
+# because we can transform it with a pure deletion.
+#
+# We also expect $root as a delta against $two by the "longest is base" rule.
+test_expect_success 'vanilla delta goes between branches' '
+   git repack -adf &&
+   is_delta_base $one $two &&
+   is_delta_base $root $two
+'
+
+# Here we should allow $one to base itself on $root; even though
+# they are in different islands, the objects in $root are in a superset
+# of islands compared to those in $one.
+#
+# Similarly, $two can delta against $root by our rules. And unlike $one,
+# in which we are just allowing it, the island rules actually put $root
+# as a possible base for $two, which it would not otherwise be (due to the size
+# sorting).
+test_expect_success 'deltas allowed against superset islands' '
+   git -c "pack.island=refs/heads/(.*)" repack -adfi &&
+   is_delta_base $one $root &&
+   is_delta_base $two $root
+'
+
+# We are going to test the packfile order here, so we again have to make some
+# assumptions. We assume that "$root", as part of our core "one", must come
+# before "$two". This should be guaranteed by the island code. However, for
+# this test to fail without islands, we are also assuming that it would not
+# otherwise do so. This is true by the current write order, which will put
+# commits (and their contents) before their parents.
+test_expect_success 'island core places core objects first' '
+   cat >expect <<-EOF &&
+   $root
+   $two
+   EOF
+   git -c "pack.island=refs/heads/(.*)" \
+   -c 

[PATCH v3 5/8] repack: add delta-islands support

2018-08-09 Thread Christian Couder
From: Jeff King 

Implement simple support for --delta-islands option and
repack.useDeltaIslands config variable in git repack.

This allows users to setup delta islands in their config and
get the benefit of less disk usage while cloning and fetching
is still quite fast and not much more CPU intensive.

Signed-off-by: Jeff King 
Signed-off-by: Christian Couder 
---
 Documentation/config.txt | 4 
 Documentation/git-repack.txt | 5 +
 builtin/repack.c | 9 +
 3 files changed, 18 insertions(+)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 27bb77f9e7..2bd31078b2 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -3145,6 +3145,10 @@ repack.packKeptObjects::
index is being written (either via `--write-bitmap-index` or
`repack.writeBitmaps`).
 
+repack.useDeltaIslands::
+   If set to true, makes `git repack` act as if `--delta-islands`
+   was passed. Defaults to `false`.
+
 repack.writeBitmaps::
When true, git will write a bitmap index when packing all
objects to disk (e.g., when `git repack -a` is run).  This
diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index d90e7907f4..a8b2d4722f 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -155,6 +155,11 @@ depth is 4095.
being removed. In addition, any unreachable loose objects will
be packed (and their loose counterparts removed).
 
+-i::
+--delta-islands::
+   Pass the `--delta-islands` option to `git-pack-objects`, see
+   linkgit:git-pack-objects[1].
+
 Configuration
 -
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 6c636e159e..5ab9ee69e4 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -12,6 +12,7 @@
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
 static int write_bitmaps;
+static int use_delta_islands;
 static char *packdir, *packtmp;
 
 static const char *const git_repack_usage[] = {
@@ -40,6 +41,10 @@ static int repack_config(const char *var, const char *value, void *cb)
write_bitmaps = git_config_bool(var, value);
return 0;
}
+   if (!strcmp(var, "repack.usedeltaislands")) {
+   use_delta_islands = git_config_bool(var, value);
+   return 0;
+   }
return git_default_config(var, value, cb);
 }
 
@@ -194,6 +199,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
N_("pass --local to git-pack-objects")),
OPT_BOOL('b', "write-bitmap-index", &write_bitmaps,
N_("write bitmap index")),
+   OPT_BOOL('i', "delta-islands", &use_delta_islands,
+   N_("pass --delta-islands to git-pack-objects")),
OPT_STRING(0, "unpack-unreachable", &unpack_unreachable,
N_("approxidate"),
N_("with -A, do not loosen objects older than this")),
OPT_BOOL('k', "keep-unreachable", &keep_unreachable,
@@ -267,6 +274,8 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
argv_array_pushf(&cmd.args, "--no-reuse-object");
if (write_bitmaps)
argv_array_push(&cmd.args, "--write-bitmap-index");
+   if (use_delta_islands)
+   argv_array_push(&cmd.args, "--delta-islands");
 
if (pack_everything & ALL_INTO_ONE) {
get_non_kept_pack_filenames(_packs, _pack_list);
-- 
2.18.0.555.g17f9c4abba



[PATCH v3 7/8] pack-objects: move tree_depth into 'struct packing_data'

2018-08-09 Thread Christian Couder
This reduces the size of 'struct object_entry' and therefore
makes packing objects more efficient.

This also renames cmp_tree_depth() into tree_depth_compare(),
as it is more modern to have the names of compare functions
end with "compare".

Helped-by: Jeff King 
Helped-by: Duy Nguyen 
Signed-off-by: Christian Couder 
---
 builtin/pack-objects.c |  4 ++--
 delta-islands.c| 27 ++-
 pack-objects.c |  6 ++
 pack-objects.h | 21 -
 4 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 30ef48dc43..fd3e514b31 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2716,8 +2716,8 @@ static void show_object(struct object *obj, const char *name, void *data)
depth++;
 
ent = packlist_find(&to_pack, obj->oid.hash, NULL);
-   if (ent && depth > ent->tree_depth)
-   ent->tree_depth = depth;
+   if (ent && depth > oe_tree_depth(&to_pack, ent))
+   oe_set_tree_depth(&to_pack, ent, depth);
}
 }
 
diff --git a/delta-islands.c b/delta-islands.c
index 448ddcbbe4..3c8fe60801 100644
--- a/delta-islands.c
+++ b/delta-islands.c
@@ -224,17 +224,23 @@ static void mark_remote_island_1(struct remote_island *rl, int is_core_island)
island_counter++;
 }
 
-static int cmp_tree_depth(const void *va, const void *vb)
+struct tree_islands_todo {
+   struct object_entry *entry;
+   unsigned int depth;
+};
+
+static int tree_depth_compare(const void *a, const void *b)
 {
-   struct object_entry *a = *(struct object_entry **)va;
-   struct object_entry *b = *(struct object_entry **)vb;
-   return a->tree_depth - b->tree_depth;
+   const struct tree_islands_todo *todo_a = a;
+   const struct tree_islands_todo *todo_b = b;
+
+   return todo_a->depth - todo_b->depth;
 }
 
 void resolve_tree_islands(int progress, struct packing_data *to_pack)
 {
struct progress *progress_state = NULL;
-   struct object_entry **todo;
+   struct tree_islands_todo *todo;
int nr = 0;
int i;
 
@@ -250,16 +256,19 @@ void resolve_tree_islands(int progress, struct packing_data *to_pack)
 */
ALLOC_ARRAY(todo, to_pack->nr_objects);
for (i = 0; i < to_pack->nr_objects; i++) {
-   if (oe_type(&to_pack->objects[i]) == OBJ_TREE)
-   todo[nr++] = &to_pack->objects[i];
+   if (oe_type(&to_pack->objects[i]) == OBJ_TREE) {
+   todo[nr].entry = &to_pack->objects[i];
+   todo[nr].depth = oe_tree_depth(to_pack, &to_pack->objects[i]);
+   nr++;
+   }
}
-   QSORT(todo, nr, cmp_tree_depth);
+   QSORT(todo, nr, tree_depth_compare);
 
if (progress)
progress_state = start_progress(_("Propagating island marks"), nr);
 
for (i = 0; i < nr; i++) {
-   struct object_entry *ent = todo[i];
+   struct object_entry *ent = todo[i].entry;
struct island_bitmap *root_marks;
struct tree *tree;
struct tree_desc desc;
diff --git a/pack-objects.c b/pack-objects.c
index 92708522e7..30314572e6 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -160,6 +160,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
if (!pdata->in_pack_by_idx)
REALLOC_ARRAY(pdata->in_pack, pdata->nr_alloc);
+
+   if (pdata->tree_depth)
+   REALLOC_ARRAY(pdata->tree_depth, pdata->nr_alloc);
}
 
new_entry = pdata->objects + pdata->nr_objects++;
@@ -175,5 +178,8 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
if (pdata->in_pack)
pdata->in_pack[pdata->nr_objects - 1] = NULL;
 
+   if (pdata->tree_depth)
+   pdata->tree_depth[pdata->nr_objects - 1] = 0;
+
return new_entry;
 }
diff --git a/pack-objects.h b/pack-objects.h
index 8eecd67991..3cb5527eeb 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -101,7 +101,6 @@ struct object_entry {
unsigned no_try_delta:1;
unsigned in_pack_type:TYPE_BITS; /* could be delta */
 
-   unsigned int tree_depth; /* should be repositioned for packing? */
unsigned char layer;
 
unsigned preferred_base:1; /*
@@ -145,6 +144,9 @@ struct packing_data {
struct packed_git **in_pack;
 
uintmax_t oe_size_limit;
+
+   /* delta islands */
+   unsigned int *tree_depth;
 };
 
 void prepare_packing_data(struct packing_data *pdata);
@@ -350,4 +352,21 @@ static inline void oe_set_delta_size(struct packing_data *pack,
"where delta size is the same as entry size");
 }
 
+static inline unsigned int oe_tree_depth(struct packing_data *pack,
+struct object_entry *e)

[PATCH v3 8/8] pack-objects: move 'layer' into 'struct packing_data'

2018-08-09 Thread Christian Couder
This reduces the size of 'struct object_entry' from 88 bytes
to 80 and therefore makes packing objects more efficient.

For example on a Linux repo with 12M objects,
`git pack-objects --all` needs extra 96MB memory even if the
layer feature is not used.

Helped-by: Jeff King 
Helped-by: Duy Nguyen 
Signed-off-by: Christian Couder 
---
 builtin/pack-objects.c |  4 ++--
 delta-islands.c|  4 ++--
 pack-objects.c |  6 ++
 pack-objects.h | 20 ++--
 4 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index fd3e514b31..d5d91eeed5 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -611,7 +611,7 @@ static inline void add_to_write_order(struct object_entry **wo,
   unsigned int *endp,
   struct object_entry *e)
 {
-   if (e->filled || e->layer != write_layer)
+   if (e->filled || oe_layer(&to_pack, e) != write_layer)
return;
wo[(*endp)++] = e;
e->filled = 1;
@@ -714,7 +714,7 @@ static void compute_layer_order(struct object_entry **wo, unsigned int *wo_end)
 * Finally all the rest in really tight order
 */
for (i = last_untagged; i < to_pack.nr_objects; i++) {
-   if (!objects[i].filled && objects[i].layer == write_layer)
+   if (!objects[i].filled && oe_layer(&to_pack, &objects[i]) == write_layer)
add_family_to_write_order(wo, wo_end, &objects[i]);
}
 }
diff --git a/delta-islands.c b/delta-islands.c
index 3c8fe60801..92137f2eca 100644
--- a/delta-islands.c
+++ b/delta-islands.c
@@ -492,13 +492,13 @@ int compute_pack_layers(struct packing_data *to_pack)
struct object_entry *entry = &to_pack->objects[i];
khiter_t pos = kh_get_sha1(island_marks, entry->idx.oid.hash);
 
-   entry->layer = 1;
+   oe_set_layer(to_pack, entry, 1);
 
if (pos < kh_end(island_marks)) {
struct island_bitmap *bitmap = kh_value(island_marks, pos);
 
if (island_bitmap_get(bitmap, island_counter_core))
-   entry->layer = 0;
+   oe_set_layer(to_pack, entry, 0);
}
}
 
diff --git a/pack-objects.c b/pack-objects.c
index 30314572e6..98389460c2 100644
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -163,6 +163,9 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
 
if (pdata->tree_depth)
REALLOC_ARRAY(pdata->tree_depth, pdata->nr_alloc);
+
+   if (pdata->layer)
+   REALLOC_ARRAY(pdata->layer, pdata->nr_alloc);
}
 
new_entry = pdata->objects + pdata->nr_objects++;
@@ -181,5 +184,8 @@ struct object_entry *packlist_alloc(struct packing_data *pdata,
if (pdata->tree_depth)
pdata->tree_depth[pdata->nr_objects - 1] = 0;
 
+   if (pdata->layer)
+   pdata->layer[pdata->nr_objects - 1] = 0;
+
return new_entry;
 }
diff --git a/pack-objects.h b/pack-objects.h
index 3cb5527eeb..ad3c208764 100644
--- a/pack-objects.h
+++ b/pack-objects.h
@@ -101,8 +101,6 @@ struct object_entry {
unsigned no_try_delta:1;
unsigned in_pack_type:TYPE_BITS; /* could be delta */
 
-   unsigned char layer;
-
unsigned preferred_base:1; /*
* we do not pack this, but is available
* to be used as the base object to delta
@@ -147,6 +145,7 @@ struct packing_data {
 
/* delta islands */
unsigned int *tree_depth;
+   unsigned char *layer;
 };
 
 void prepare_packing_data(struct packing_data *pdata);
@@ -369,4 +368,21 @@ static inline void oe_set_tree_depth(struct packing_data *pack,
pack->tree_depth[e - pack->objects] = tree_depth;
 }
 
+static inline unsigned char oe_layer(struct packing_data *pack,
+struct object_entry *e)
+{
+   if (!pack->layer)
+   return 0;
+   return pack->layer[e - pack->objects];
+}
+
+static inline void oe_set_layer(struct packing_data *pack,
+   struct object_entry *e,
+   unsigned char layer)
+{
+   if (!pack->layer)
+   ALLOC_ARRAY(pack->layer, pack->nr_objects);
+   pack->layer[e - pack->objects] = layer;
+}
+
 #endif
-- 
2.18.0.555.g17f9c4abba



[PATCH v3 4/8] pack-objects: add delta-islands support

2018-08-09 Thread Christian Couder
From: Jeff King 

Implement support for delta islands in git pack-objects
and document how delta islands work in
"Documentation/git-pack-objects.txt" and Documentation/config.txt.

This allows users to setup delta islands in their config and
get the benefit of less disk usage while cloning and fetching
is still quite fast and not much more CPU intensive.

Signed-off-by: Jeff King 
Signed-off-by: Christian Couder 
---
 Documentation/config.txt   | 15 +
 Documentation/git-pack-objects.txt | 97 ++
 builtin/pack-objects.c | 57 +++---
 3 files changed, 161 insertions(+), 8 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 63365dcf3d..27bb77f9e7 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -2585,6 +2585,21 @@ Note that changing the compression level will not automatically recompress
 all existing objects. You can force recompression by passing the -F option
 to linkgit:git-repack[1].
 
+pack.island::
+   An extended regular expression configuring a set of delta
+   islands. See "DELTA ISLANDS" in linkgit:git-pack-objects[1]
+   for details.
+
+pack.islandCore::
+   Specify an island name which gets to have its objects be
+   packed first. This creates a kind of pseudo-pack at the front
+   of one pack, so that the objects from the specified island are
+   hopefully faster to copy into any pack that should be served
+   to a user requesting these objects. In practice this means
+   that the island specified should likely correspond to what is
+   the most commonly cloned in the repo. See also "DELTA ISLANDS"
+   in linkgit:git-pack-objects[1].
+
 pack.deltaCacheSize::
The maximum memory in bytes used for caching deltas in
linkgit:git-pack-objects[1] before writing them out to a pack.
diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index d95b472d16..40c825c381 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -289,6 +289,103 @@ Unexpected missing object will raise an error.
 --unpack-unreachable::
Keep unreachable objects in loose form. This implies `--revs`.
 
+--delta-islands::
+   Restrict delta matches based on "islands". See DELTA ISLANDS
+   below.
+
+
+DELTA ISLANDS
+-
+
+When possible, `pack-objects` tries to reuse existing on-disk deltas to
+avoid having to search for new ones on the fly. This is an important
+optimization for serving fetches, because it means the server can avoid
+inflating most objects at all and just send the bytes directly from
+disk. This optimization can't work when an object is stored as a delta
+against a base which the receiver does not have (and which we are not
+already sending). In that case the server "breaks" the delta and has to
+find a new one, which has a high CPU cost. Therefore it's important for
+performance that the set of objects in on-disk delta relationships match
+what a client would fetch.
+
+In a normal repository, this tends to work automatically. The objects
+are mostly reachable from the branches and tags, and that's what clients
+fetch. Any deltas we find on the server are likely to be between objects
+the client has or will have.
+
+But in some repository setups, you may have several related but separate
+groups of ref tips, with clients tending to fetch those groups
+independently. For example, imagine that you are hosting several "forks"
+of a repository in a single shared object store, and letting clients
+view them as separate repositories through `GIT_NAMESPACE` or separate
+repos using the alternates mechanism. A naive repack may find that the
+optimal delta for an object is against a base that is only found in
+another fork. But when a client fetches, they will not have the base
+object, and we'll have to find a new delta on the fly.
+
+A similar situation may exist if you have many refs outside of
+`refs/heads/` and `refs/tags/` that point to related objects (e.g.,
+`refs/pull` or `refs/changes` used by some hosting providers). By
+default, clients fetch only heads and tags, and deltas against objects
+found only in those other groups cannot be sent as-is.
+
+Delta islands solve this problem by allowing you to group your refs into
+distinct "islands". Pack-objects computes which objects are reachable
+from which islands, and refuses to make a delta from an object `A`
+against a base which is not present in all of `A`'s islands. This
+results in slightly larger packs (because we miss some delta
+opportunities), but guarantees that a fetch of one island will not have
+to recompute deltas on the fly due to crossing island boundaries.
+
+When repacking with delta islands the delta window tends to get
+clogged with candidates that are forbidden by the config. Repacking
+with a big --window helps (and doesn't take as long as it otherwise
+might because we can 

[PATCH v3 3/8] pack-objects: refactor code into compute_layer_order()

2018-08-09 Thread Christian Couder
In a following commit, as we will use delta islands, we will
have to compute the write order for different layers, not just
for one.

Let's prepare for that by refactoring the code that will be
used to compute the write order for a given layer into a new
compute_layer_order() function.

This will make it easier to see and understand what the
following changes are doing.

Helped-by: Duy Nguyen 
Signed-off-by: Christian Couder 
---
 builtin/pack-objects.c | 90 +++---
 1 file changed, 50 insertions(+), 40 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4391504a91..99b6329399 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -667,48 +667,15 @@ static void add_family_to_write_order(struct object_entry **wo,
add_descendants_to_write_order(wo, endp, root);
 }
 
-static struct object_entry **compute_write_order(void)
+static void compute_layer_order(struct object_entry **wo, unsigned int *wo_end)
 {
-   unsigned int i, wo_end, last_untagged;
-
-   struct object_entry **wo;
+   unsigned int i, last_untagged;
struct object_entry *objects = to_pack.objects;
 
for (i = 0; i < to_pack.nr_objects; i++) {
-   objects[i].tagged = 0;
-   objects[i].filled = 0;
-   SET_DELTA_CHILD(&objects[i], NULL);
-   SET_DELTA_SIBLING(&objects[i], NULL);
-   }
-
-   /*
-* Fully connect delta_child/delta_sibling network.
-* Make sure delta_sibling is sorted in the original
-* recency order.
-*/
-   for (i = to_pack.nr_objects; i > 0;) {
-   struct object_entry *e = &objects[--i];
-   if (!DELTA(e))
-   continue;
-   /* Mark me as the first child */
-   e->delta_sibling_idx = DELTA(e)->delta_child_idx;
-   SET_DELTA_CHILD(DELTA(e), e);
-   }
-
-   /*
-* Mark objects that are at the tip of tags.
-*/
-   for_each_tag_ref(mark_tagged, NULL);
-
-   /*
-* Give the objects in the original recency order until
-* we see a tagged tip.
-*/
-   ALLOC_ARRAY(wo, to_pack.nr_objects);
-   for (i = wo_end = 0; i < to_pack.nr_objects; i++) {
if (objects[i].tagged)
break;
-   add_to_write_order(wo, &wo_end, &objects[i]);
+   add_to_write_order(wo, wo_end, &objects[i]);
}
last_untagged = i;
 
@@ -717,7 +684,7 @@ static struct object_entry **compute_write_order(void)
 */
for (; i < to_pack.nr_objects; i++) {
if (objects[i].tagged)
-   add_to_write_order(wo, &wo_end, &objects[i]);
+   add_to_write_order(wo, wo_end, &objects[i]);
}
 
/*
@@ -727,7 +694,7 @@ static struct object_entry **compute_write_order(void)
if (oe_type(&objects[i]) != OBJ_COMMIT &&
oe_type(&objects[i]) != OBJ_TAG)
continue;
-   add_to_write_order(wo, &wo_end, &objects[i]);
+   add_to_write_order(wo, wo_end, &objects[i]);
}
 
/*
@@ -736,7 +703,7 @@ static struct object_entry **compute_write_order(void)
for (i = last_untagged; i < to_pack.nr_objects; i++) {
if (oe_type(&objects[i]) != OBJ_TREE)
continue;
-   add_to_write_order(wo, &wo_end, &objects[i]);
+   add_to_write_order(wo, wo_end, &objects[i]);
}
 
/*
@@ -744,8 +711,51 @@ static struct object_entry **compute_write_order(void)
 */
for (i = last_untagged; i < to_pack.nr_objects; i++) {
if (!objects[i].filled)
-   add_family_to_write_order(wo, &wo_end, &objects[i]);
+   add_family_to_write_order(wo, wo_end, &objects[i]);
}
+}
+
+static struct object_entry **compute_write_order(void)
+{
+   unsigned int i, wo_end;
+
+   struct object_entry **wo;
+   struct object_entry *objects = to_pack.objects;
+
+   for (i = 0; i < to_pack.nr_objects; i++) {
+   objects[i].tagged = 0;
+   objects[i].filled = 0;
+   SET_DELTA_CHILD(&objects[i], NULL);
+   SET_DELTA_SIBLING(&objects[i], NULL);
+   }
+
+   /*
+* Fully connect delta_child/delta_sibling network.
+* Make sure delta_sibling is sorted in the original
+* recency order.
+*/
+   for (i = to_pack.nr_objects; i > 0;) {
+   struct object_entry *e = &objects[--i];
+   if (!DELTA(e))
+   continue;
+   /* Mark me as the first child */
+   e->delta_sibling_idx = DELTA(e)->delta_child_idx;
+   SET_DELTA_CHILD(DELTA(e), e);
+   }
+
+   /*
+* Mark objects that are at the tip of tags.
+*/
+   for_each_tag_ref(mark_tagged, NULL);
+
+   /*
+* Give the objects in the original recency order until
+* we see a tagged tip.
+*/
+   ALLOC_ARRAY(wo, to_pack.nr_objects);

[PATCH v3 1/8] packfile: make get_delta_base() non static

2018-08-09 Thread Christian Couder
From: Jeff King 

As get_delta_base() will be used outside 'packfile.c' in
a following commit, let's make it non static and let's
declare it in 'packfile.h'.

Signed-off-by: Jeff King 
Signed-off-by: Christian Couder 
---
 packfile.c | 10 +-
 packfile.h |  7 +++
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/packfile.c b/packfile.c
index 6974903e58..04d387f312 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1037,11 +1037,11 @@ const struct packed_git *has_packed_and_bad(const unsigned char *sha1)
return NULL;
 }
 
-static off_t get_delta_base(struct packed_git *p,
-   struct pack_window **w_curs,
-   off_t *curpos,
-   enum object_type type,
-   off_t delta_obj_offset)
+off_t get_delta_base(struct packed_git *p,
+struct pack_window **w_curs,
+off_t *curpos,
+enum object_type type,
+off_t delta_obj_offset)
 {
unsigned char *base_info = use_pack(p, w_curs, *curpos, NULL);
off_t base_offset;
diff --git a/packfile.h b/packfile.h
index cc7eaffe1b..1265fd9b06 100644
--- a/packfile.h
+++ b/packfile.h
@@ -126,6 +126,13 @@ extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsig
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
 
+/*
+ * Return the offset of the object that is the delta base of the object at curpos.
+ */
+extern off_t get_delta_base(struct packed_git *p, struct pack_window **w_curs,
+   off_t *curpos, enum object_type type,
+   off_t delta_obj_offset);
+
 extern void release_pack_memory(size_t);
 
 /* global flag to enable extra checks when accessing packed objects */
-- 
2.18.0.555.g17f9c4abba



[PATCH v3 0/8] Add delta islands support

2018-08-09 Thread Christian Couder
This patch series is upstreaming work made by GitHub and available in:

https://github.com/peff/git/commits/jk/delta-islands

The above work has been already described in the following article:

https://githubengineering.com/counting-objects/

The above branch contains only one patch. In this patch series the
patch has been split into 5 patches (1/8, 2/8, 4/8, 5/8 and 6/8) with
their own commit message, and on top of that 3 new patches (3/8, 7/8
and 8/8) have been added. The new patches implement things that were
requested after the previous iterations.

I kept Peff as the author of the original 5 patches and took the
liberty to add his Signed-off-by to them.

As explained in details in the Counting Object article referenced
above, the goal of the delta island mechanism is for a hosting
provider to make it possible to have the "forks" of a repository share
as much storage as possible while preventing object packs to contain
deltas between different forks.

If deltas between different forks are not prevented, when users clone
or fetch a fork, preparing the pack that should be sent to them can be
very costly CPU wise, as objects from a different fork should not be
sent, which means that a lot of deltas might need to be computed
again (instead of reusing existing deltas).
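
To make that concrete, here is a rough sketch of how a hosting setup like the one above might configure islands with the settings this series introduces (`pack.island` and `repack.useDeltaIslands`). The `refs/virtual/<fork-id>/` layout and the exact regexes are illustrative assumptions, not part of the series:

```ini
[pack]
	# Each fork's refs live under refs/virtual/<fork-id>/ (assumed layout).
	# A capture group in the regex becomes part of the island name, so
	# every fork forms its own island and deltas never cross forks.
	island = refs/virtual/([0-9]+)/heads/
	island = refs/virtual/([0-9]+)/tags/
[repack]
	useDeltaIslands = true
```

With something like this in place, a routine `git repack -a -d` would pass the delta-island option down to pack-objects, so serving a clone of any single fork should not require recomputing deltas on the fly.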


The following changes have been made since the previous iteration:

* suggested by Duy: move the code computing the write order for a
  layer to a new separate function called compute_layer_order() in
  builtin/pack-objects.c in new patch 3/8

  I think that indeed this makes the following patch (4/8) shorter and
  easier to understand as well as the overall result nicer.

* suggested by Duy and Peff: rework the way the 'tree_depth' field is
  moved from 'struct object_entry' to 'struct packing_data' in
  pack-object.h in patch 7/8

* suggested by Duy and Peff: move field 'layer' from
  'struct object_entry' to 'struct packing_data' in pack-object.h in
  new patch 8/8

This patch series is also available on GitHub in:

https://github.com/chriscool/git/commits/delta-islands

The previous versions are available there:

V2: https://github.com/chriscool/git/commits/delta-islands19
V1: https://github.com/chriscool/git/commits/delta-islands6

V2: https://public-inbox.org/git/20180805172525.15278-1-chrisc...@tuxfamily.org/
V1: https://public-inbox.org/git/20180722054836.28935-1-chrisc...@tuxfamily.org/


Christian Couder (3):
  pack-objects: refactor code into compute_layer_order()
  pack-objects: move tree_depth into 'struct packing_data'
  pack-objects: move 'layer' into 'struct packing_data'

Jeff King (5):
  packfile: make get_delta_base() non static
  Add delta-islands.{c,h}
  pack-objects: add delta-islands support
  repack: add delta-islands support
  t: add t5319-delta-islands.sh

 Documentation/config.txt   |  19 ++
 Documentation/git-pack-objects.txt |  97 ++
 Documentation/git-repack.txt   |   5 +
 Makefile   |   1 +
 builtin/pack-objects.c | 137 +---
 builtin/repack.c   |   9 +
 delta-islands.c| 506 +
 delta-islands.h|  11 +
 pack-objects.c |  12 +
 pack-objects.h |  39 +++
 packfile.c |  10 +-
 packfile.h |   7 +
 t/t5319-delta-islands.sh   | 143 
 13 files changed, 948 insertions(+), 48 deletions(-)
 create mode 100644 delta-islands.c
 create mode 100644 delta-islands.h
 create mode 100755 t/t5319-delta-islands.sh

-- 
2.18.0.555.g17f9c4abba



Re: [PATCH v2] status: -i shorthand for --ignored command line option

2018-08-09 Thread Junio C Hamano
Nicholas Guriev  writes:

> It allows for uniformity with the --untracked-files option. Also
> the new short flag saves the number of keys to press for the
> typical git-status command.

Unlike "-u", "-i" is supported by "git commit" which shares the
underlying implementation, and that is the historical reason why we
never had "-i" shorthand, I think.  

While I do understand why sometimes "-u" is useful, especially
"-uno", to help those whose .gitignore is not managed as well as it
should be, I am not sure why a "typical git-status" invocation would
ask for "--ignored" that often to require such a short-hand.


Re: [GSoC][PATCH v5 04/20] rebase -i: rewrite the edit-todo functionality in C

2018-08-09 Thread Phillip Wood

Hi Alban

On 09/08/18 14:30, Alban Gruin wrote:

Hi Phillip,

On 08/08/2018 18:04, Phillip Wood wrote:

+int edit_todo_list(unsigned flags)
+{
+    struct strbuf buf = STRBUF_INIT;
+    const char *todo_file = rebase_path_todo();
+    FILE *todo;
+
+    if (strbuf_read_file(&buf, todo_file, 0) < 0)
+    return error_errno(_("could not read '%s'."), todo_file);
+
+    strbuf_stripspace(&buf, 1);
+    todo = fopen_or_warn(todo_file, "w");


This truncates the existing file, if there are any errors writing the
new one then the user has lost the old one. write_message() in
sequencer.c avoids this problem by writing a new file and then renaming
it if the write is successful, maybe it is worth exporting it so it can
be used elsewhere.


+    if (!todo) {
+    strbuf_release(&buf);
+    return 1;
+    }
+
+    strbuf_write(&buf, todo);
+    fclose(todo);


There needs to be some error checking after the write and the close
(using write_message() would mean you only have to check for errors in
one place)



Right.  Should I find a new name for write_message()?


That might be a good idea, I'm not sure what it should be though, maybe
write_file()?, perhaps someone else might have a better suggestion.



write_file() already exists in wrapper.c.  I wondered, as this makes use
of the lockfile API, perhaps it could be moved to lockfile.{c,h}, and
renamed to something like write_file_with_lock().


Hmm, I'm not sure about that, to me the use of the lockfile api is just 
a way to get the semantics of an atomic update to the file. Looking in 
wrapper.c there is write_file_buf() which could maybe be changed to do 
an atomic update. Looking at the current callers to that function they 
look like they would work with such a change (sadly none of them bother 
to check the return value to see if the write was successful). I don't 
think we need to worry about them being broken by having a stale 
lockfile left over if there's an error as I think the lockfile api 
automatically cleans them up but I'm not an expert on that api so it 
would be worth checking with someone like Johannes or Jeff King (I seem 
to remember him working in that area relatively recently).


Best Wishes

Phillip



Cheers,
Alban





Re: [PATCH 1/1] verify-tag/verify-commit should exit unsuccessfully when signature is not trusted

2018-08-09 Thread Junio C Hamano
Jeff King  writes:

> There was a patch at the start of this thread, but it specifically
> checks for "sigc->result == U".  That's probably OK, since I think it
> restores the behavior in earlier versions of Git. But I wonder if we
> should simply be storing the fact that gpg exited non-zero and relaying
> that. That would fix this problem and truly make the rule "if gpg
> reported an error, we propagate that".

Yeah, I like that.  Something like this, perhaps?  Points to note:

 * status gets the return value from verify_signed_buffer(), which
   essentially is what wait_or_whine() gives us for the "gpg
   --verify" process.

 * Even if status says "failed", we still need to parse the output
   to set sigc->result.  We used to use sigc->result as the sole
   source of our return value, but now we turn 'status' into 'bad'
   (i.e. non-zero) after parsing and finding it is not mechanically
   good (which is the same criteria as we have always used before).
   An already bad status is left as bad.

 * And we return 'status'.

If we choose to blindly trust the exit status of "gpg --verify" and
not interpret the result ourselves, we can lose the "smudge status
to be bad if not G/U" bit, which I offhand do not think makes much
difference either way.  I just left it there because showing what
can be removed and saying it can be dropped is easier than showing
the result of removal and saying it can be added--simply because I
need to describe "it" if I go the latter route.

 gpg-interface.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/gpg-interface.c b/gpg-interface.c
index 09ddfbc267..2e0885c511 100644
--- a/gpg-interface.c
+++ b/gpg-interface.c
@@ -67,26 +67,27 @@ static void parse_gpg_output(struct signature_check *sigc)
 int check_signature(const char *payload, size_t plen, const char *signature,
size_t slen, struct signature_check *sigc)
 {
struct strbuf gpg_output = STRBUF_INIT;
struct strbuf gpg_status = STRBUF_INIT;
int status;
 
sigc->result = 'N';
 
status = verify_signed_buffer(payload, plen, signature, slen,
  &gpg_output, &gpg_status);
if (status && !gpg_output.len)
goto out;
sigc->payload = xmemdupz(payload, plen);
sigc->gpg_output = strbuf_detach(_output, NULL);
sigc->gpg_status = strbuf_detach(_status, NULL);
parse_gpg_output(sigc);
+   status |= sigc->result != 'G' && sigc->result != 'U';
 
  out:
strbuf_release(&gpg_status);
strbuf_release(&gpg_output);
 
-   return sigc->result != 'G' && sigc->result != 'U';
+   return !!status;
 }
 
 void print_signature_buffer(const struct signature_check *sigc, unsigned flags)


Re: [PATCH v1 01/25] structured-logging: design document

2018-08-09 Thread Jeff Hostetler




On 8/3/2018 11:26 AM, Ben Peart wrote:



On 7/13/2018 12:55 PM, g...@jeffhostetler.com wrote:

From: Jeff Hostetler 

Signed-off-by: Jeff Hostetler 
---
  Documentation/technical/structured-logging.txt | 816 +
  1 file changed, 816 insertions(+)
  create mode 100644 Documentation/technical/structured-logging.txt

diff --git a/Documentation/technical/structured-logging.txt 
b/Documentation/technical/structured-logging.txt

new file mode 100644
index 000..794c614
--- /dev/null
+++ b/Documentation/technical/structured-logging.txt
@@ -0,0 +1,816 @@
+Structured Logging
+==
+
+Structured Logging (SLOG) is an optional feature to allow Git to
+generate structured log data for executed commands.  This includes
+command line arguments, command run times, error codes and messages,
+child process information, time spent in various critical functions,
+and repository data-shape information.  Data is written to a target
+log file in JSON[1,2,3] format.
+
+SLOG is disabled by default.  Several steps are required to enable it:
+
+1. Add the compile-time flag "STRUCTURED_LOGGING=1" when building git
+   to include the SLOG routines in the git executable.
+


Is the intent to remove this compile-time flag before this is merged? 
With it off by default in builds, the audience for this is limited to 
those who build their own/custom versions of git. I can see other 
organizations wanting to use this that don't have a custom fork of git 
they build and install on their users machines.


Like the GIT_TRACE mechanism today, I think this should be compiled in 
but turned off via the default settings by default.


I would like to get rid of this compile-time flag and just have it
be available to those who want to use it.  And defaulted to off.
But I wasn't sure what kind of reaction or level of interest this
feature would receive from the mailing list.



+2. Set "slog.*" config settings[5] to enable SLOG in your repo.
+
+
+Motivation
+==
+
+Git users may be faced with scenarios that are surprisingly slow or
+produce unexpected results.  And Git developers may have difficulty
+reproducing these experiences.  Structured logging allows users to
+provide developers with additional usage, performance and error data
+that can help diagnose and debug issues.
+
+Many Git hosting providers and users with many developers have bespoke
+efforts to help troubleshoot problems; for example, command wrappers,
+custom pre- and post-command hooks, and custom instrumentation of Git
+code.  These are inefficient and/or difficult to maintain.  The goal
+of SLOG is to provide this data as efficiently as possible.
+
+And having structured rather than free format log data, will help
+developers with their analysis.
+
+
+Background (Git Merge 2018 Barcelona)
+=
+
+Performance and/or error logging was discussed during the contributor's
+summit in Barcelona.  Here are the relevant notes from the meeting
+minutes[6].
+
+> Performance misc (Ævar)
+> ---
+> [...]
+>  - central error reporting for git
+>    - `git status` logging
+>    - git config that collects data, pushes to known endpoint with `git push`
+>    - pre_command and post_command hooks, for logs
+>    - `gvfs diagnose` that looks at packfiles, etc
+>    - detect BSODs, etc
+>    - Dropbox writes out json with index properties and command-line
+>    information for status/fetch/push, fork/execs external tool 
to upload

+>    - windows trace facility; would be nice to have cross-platform
+>    - would hosting providers care?
+>    - zipfile of logs to give when debugging
+>    - sanitizing data is harder
+>    - more in a company setting
+>    - fileshare to upload zipfile
+>    - most of the errors are proxy when they shouldn't, wrong proxy, proxy
+>    specific to particular URL; so upload endpoint wouldn't work
+>    - GIT_TRACE is supposed to be that (for proxy)
+>    - but we need more trace variables
+>    - series to make tracing cheaper
+>    - except that curl selects the proxy
+>    - trace should have an API, so it can call an executable
+>    - dump to .git/traces/... and everything else happens externally
+>    - tools like visual studio can't set GIT_TRACE, so
+>    - sourcetree has seen user environments where commands just take forever
+>    - third-party tools like perf/strace - could we be better leveraging those?
+>    - distribute turn-key solution to handout to collect more data?
+


While it makes sense to have clear goals in the design document, the 
motivation and background sections feel somehow out of place.  I'd 
recommend you clearly articulate the design goals and drop the 
background data that led you to the goals.


good point.  thanks.



+
+A Quick Example
+===
+
+Note: JSON pretty-printing is enabled in all of the examples shown in
+this document.  When pretty-printing is turned off, each event is
+written on a single line.

Re: [PATCH 1/1] verify-tag/verify-commit should exit unsuccessfully when signature is not trusted

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 01:43:02AM +, brian m. carlson wrote:

> On Wed, Aug 08, 2018 at 05:59:43PM -0700, Junio C Hamano wrote:
> > "brian m. carlson"  writes:
> > 
> > >> FWIW, I'm on board with returning non-zero in any case where gpg would.
> > >
> > > I think that's probably the best solution overall.
> > 
> > FWIW, I am not married to the current behaviour.  I would not be
> > surprised if it mostly came by accident and not designed.
> 
> Since apparently I was the author of the commit that changed the
> behavior originally, let me simply say that I was not aware that gpg
> signalled the correctness of a signature by its exit status when I wrote
> that patch.  If I had known that, I would have deferred to gpg in my
> change.  My goal was consistency between verify-tag and verify-commit,
> and in retrospect I probably made the wrong decision.

OK, so it seems like we're all in agreement now.

What next?

There was a patch at the start of this thread, but it specifically
checks for "sigc->result == U".  That's probably OK, since I think it
restores the behavior in earlier versions of Git. But I wonder if we
should simply be storing the fact that gpg exited non-zero and relaying
that. That would fix this problem and truly make the rule "if gpg
reported an error, we propagate that".

-Peff


Re: [PATCH 1/5] chainlint: match arbitrary here-docs tags rather than hard-coded names

2018-08-09 Thread Jeff King
On Thu, Aug 09, 2018 at 01:58:05AM -0400, Eric Sunshine wrote:

> On Wed, Aug 8, 2018 at 6:50 PM Jeff King  wrote:
> > On Tue, Aug 07, 2018 at 04:21:31AM -0400, Eric Sunshine wrote:
> > > +# Swallowing here-docs with arbitrary tags requires a bit of finesse. When a
> > > +# line such as "cat <<EOF >out" is seen, the tag is moved to the front
> > > +# of the line enclosed in angle brackets as a sentinel, giving "<EOF>cat >out".
> >
> > Gross, but OK, as long as we would not get confused by a line that
> > actually started with "<EOF>" at the start.
> 
> It can't get confused by such a line. The here-doc swallower
> prepends that when it starts the swallowing process and removes it at
> the end. Even if a line actually started with that, it would become
> "<EOF><EOF>cmd" while swallowing the here-doc, and be restored to
> "<EOF>cmd" at the end. Stripping the "<EOF>" is done non-greedily, so
> it wouldn't remove both of them. Likewise, non-greedy matching is used
> for pulling the "EOF" out of the "<...>" when trying to match against
> the terminating "EOF" line, so there can be no confusion.

Thanks. I figured you probably had thought of that, but it seemed easier
to ask than to wade through the sed code (I do feel like a bad person to
give that answer, because IMHO one of the key things that makes open
source work is a willingness to dig in yourself rather than asking; but
I am making an exception for this sed code).

> Yeah, I was going with the tighter uppercase-only which Jonathan
> suggested[1], but I guess it wouldn't hurt to re-roll to allow
> lowercase too.
>
> [...]
>
> No. I've gotten so used to \EOF in this codebase that it didn't occur
> to me to even think about 'EOF', but a re-roll could add that, as
> well.

Thanks. I could take or leave such fixes, since I think our style
discourages both, so I'll leave it up to you whether you want to pursue
them.

-Peff


Re: [PATCH v2] clone: report duplicate entries on case-insensitive filesystems

2018-08-09 Thread Jeff King
On Wed, Aug 08, 2018 at 05:41:10PM -0700, Junio C Hamano wrote:

> > If we have an equivalence-class hashmap and feed it inodes (or again,
> > some system equivalent) as the keys, we should get buckets of
> > collisions.
> 
> I guess one way to get "some system equivalent" that can be used as
> the last resort, when there absolutely is no inum equivalent, is to
> rehash the working tree file that shouldn't be there when we detect
> a collision.
> 
> If we found that there is something when we tried to write out
> "Foo.txt", if we open "Foo.txt" on the working tree and hash-object
> it, we should find the matching blob somewhere in the index _before_
> "Foo.txt".  On a case-insensitive filesytem, it may well be
> "foo.txt", but we do not even have to know "foo.txt" and "Foo.txt"
> only differ in case.

Clever. You might still run into false positives when there is
duplicated content in the repository (especially, say, zero-length
files).  But the fact that you only do the hashing on known duplicates
helps with that.

One of the things I did like about the equivalence-class approach is
that it can be done in a single linear pass in the worst case. Whereas
anything that searches when we see a collision is quite likely to be
quadratic. But as I said before, it may not be worth worrying too much
about that for an error code path where we expect the number of
collisions to be small.

-Peff


Re: [GSoC][PATCH v5 11/20] rebase -i: rewrite complete_action() in C

2018-08-09 Thread Phillip Wood

Hi Alban

Its nice to see things coming together so that the rebase happens in the 
same process as the some of the todo file preparation. I've ended up 
making quite a few comments on the new implementation, but they're all 
little things, the basic idea looks sound to me.


On 31/07/18 18:59, Alban Gruin wrote:

This rewrites complete_action() from shell to C.

A new mode is added to rebase--helper (`--complete-action`), as well as
a new flag (`--autosquash`).

Finally, complete_action() is stripped from git-rebase--interactive.sh.

The original complete_action() would return the code 2 when the todo
list contained no actions.  This was a special case for rebase -i and
-p; git-rebase.sh would then apply the autostash, delete the state
directory, and die with the message "Nothing to do".  This cleanup is
rewritten in C instead of returning 2.  As rebase -i no longer returns
2, the comment describing this behaviour in git-rebase.sh is updated to
reflect this change.

The first check might seem useless as we write "noop" to the todo list
if it is empty.  Actually, the todo list might contain commented
commands (ie. empty commits).  In this case, complete_action() won’t
write "noop", and will abort without starting the editor.

Signed-off-by: Alban Gruin 
---
No changes since v4.

  builtin/rebase--helper.c   | 12 -
  git-rebase--interactive.sh | 53 ++--
  git-rebase.sh  |  2 +-
  sequencer.c| 99 ++
  sequencer.h|  4 ++
  5 files changed, 118 insertions(+), 52 deletions(-)

diff --git a/builtin/rebase--helper.c b/builtin/rebase--helper.c
index bed3dd2b95..d7fa5a5062 100644
--- a/builtin/rebase--helper.c
+++ b/builtin/rebase--helper.c
@@ -13,13 +13,13 @@ static const char * const builtin_rebase_helper_usage[] = {
  int cmd_rebase__helper(int argc, const char **argv, const char *prefix)
  {
struct replay_opts opts = REPLAY_OPTS_INIT;
-   unsigned flags = 0, keep_empty = 0, rebase_merges = 0;
+   unsigned flags = 0, keep_empty = 0, rebase_merges = 0, autosquash = 0;
int abbreviate_commands = 0, rebase_cousins = -1;
enum {
CONTINUE = 1, ABORT, MAKE_SCRIPT, SHORTEN_OIDS, EXPAND_OIDS,
CHECK_TODO_LIST, SKIP_UNNECESSARY_PICKS, REARRANGE_SQUASH,
ADD_EXEC, APPEND_TODO_HELP, EDIT_TODO, PREPARE_BRANCH,
-   CHECKOUT_ONTO
+   CHECKOUT_ONTO, COMPLETE_ACTION
} command = 0;
struct option options[] = {
OPT_BOOL(0, "ff", &opts.allow_ff, N_("allow fast-forward")),
@@ -29,6 +29,8 @@ int cmd_rebase__helper(int argc, const char **argv, const char *prefix)
OPT_BOOL(0, "rebase-merges", &rebase_merges, N_("rebase merge commits")),
OPT_BOOL(0, "rebase-cousins", &rebase_cousins,
 N_("keep original branch points of cousins")),
+   OPT_BOOL(0, "autosquash", &autosquash,
+N_("move commits thas begin with squash!/fixup!")),


s/thas/that/


OPT__VERBOSE(&opts.verbose, N_("be verbose")),
OPT_CMDMODE(0, "continue", &command, N_("continue rebase"),
CONTINUE),
@@ -57,6 +59,8 @@ int cmd_rebase__helper(int argc, const char **argv, const char *prefix)
N_("prepare the branch to be rebased"), PREPARE_BRANCH),
OPT_CMDMODE(0, "checkout-onto", &command,
N_("checkout a commit"), CHECKOUT_ONTO),
+   OPT_CMDMODE(0, "complete-action", &command,
+   N_("complete the action"), COMPLETE_ACTION),
OPT_END()
};
  
@@ -110,5 +114,9 @@ int cmd_rebase__helper(int argc, const char **argv, const char *prefix)

return !!prepare_branch_to_be_rebased(&opts, argv[1]);
if (command == CHECKOUT_ONTO && argc == 4)
return !!checkout_onto(&opts, argv[1], argv[2], argv[3]);
+   if (command == COMPLETE_ACTION && argc == 6)
+   return !!complete_action(&opts, flags, argv[1], argv[2], argv[3],
+argv[4], argv[5], autosquash);
+
usage_with_options(builtin_rebase_helper_usage, options);
  }
diff --git a/git-rebase--interactive.sh b/git-rebase--interactive.sh
index b68f108f28..59dc4888a6 100644
--- a/git-rebase--interactive.sh
+++ b/git-rebase--interactive.sh
@@ -127,54 +127,6 @@ init_revisions_and_shortrevisions () {
fi
  }
  
-complete_action() {

-   test -s "$todo" || echo noop >> "$todo"
-   test -z "$autosquash" || git rebase--helper --rearrange-squash || exit
-   test -n "$cmd" && git rebase--helper --add-exec-commands "$cmd"
-
-   todocount=$(git stripspace --strip-comments <"$todo" | wc -l)
-   todocount=${todocount##* }
-
-cat >>"$todo" <"$todo" ||
die "$(gettext "Could not generate todo list")"
  
-	complete_action

+   exec git rebase--helper --complete-action "$shortrevisions" "$onto_name" \
+  

Re: [RFC PATCH v2 03/12] t7411: be nicer to future tests and really clean things up

2018-08-09 Thread Antonio Ospite
On Thu, 2 Aug 2018 11:15:03 -0700
Stefan Beller  wrote:

> On Thu, Aug 2, 2018 at 9:41 AM SZEDER Gábor  wrote:
> >
[...]
> > >
> > > Note that test_when_finished is not used here, both to keep the current 
> > > style
> > > and also because it does not work in sub-shells.
> >
> > That's true, but I think that this:
> >
> >   test_when_finished git -C super reset --hard HEAD~2
> >
> > at the very beginning of the test should work.
> 
> Yeah that is a better way to do it.
> Even better would be to have 2 of these for both tests 5 and 8,
> such that each of them could be skipped individually and any following
> tests still work fine.
>

Test 6 also relies on the error introduced in test 5.

So the options would be either to remove one commit at the time in
test 6 and 8 (with a comment in test 6 to note that the commit is from
the previous test), or to remove both the commits in test 8. I am going
to go with the former, using test_when_finished.

> > >  t/t7411-submodule-config.sh | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/t/t7411-submodule-config.sh b/t/t7411-submodule-config.sh
> > > index 0bde5850ac..248da0bc4f 100755
> > > --- a/t/t7411-submodule-config.sh
> > > +++ b/t/t7411-submodule-config.sh
> > > @@ -135,7 +135,9 @@ test_expect_success 'error in history in 
> > > fetchrecursesubmodule lets continue' '
> > >   HEAD submodule \
> > >   >actual &&
> > >   test_cmp expect_error actual  &&
> > > - git reset --hard HEAD^
> > > + # Remove both the commits which add errors to .gitmodules,
> > > + # the one from this test and the one from a previous test.
> > > + git reset --hard HEAD~2
> 
> I am a bit hesitant to removing the commits though, as it is expected to have
> potentially broken history and submodules still working.
>

The commits which are removed only affected .gitmodules, no "submodule
init" nor "submodule update" is ever called after they are added, so I
don't know what problems there could be.

Would a revert be any different?

> The config --unset already fixes the gitmodules file,
> so I think we can rather do
> 
> git commit -a -m 'now the .gitmodules file is fixed at HEAD \
> but has a messy history'
> 
> But as I have only read up to here, not knowing what the future tests will
> bring this is all speculation at this point.

IIUC the "config --unset" is used to cause the error, not to fix it, I
am not sure I understand this point.

Thanks,
   Antonio

-- 
Antonio Ospite
https://ao2.it
https://twitter.com/ao2it

A: Because it messes up the order in which people normally read text.
   See http://en.wikipedia.org/wiki/Posting_style
Q: Why is top-posting such a bad thing?


Re: [GSoC][PATCH v5 04/20] rebase -i: rewrite the edit-todo functionality in C

2018-08-09 Thread Alban Gruin
Hi Phillip,

Le 08/08/2018 à 18:04, Phillip Wood a écrit :
 +int edit_todo_list(unsigned flags)
 +{
 +    struct strbuf buf = STRBUF_INIT;
 +    const char *todo_file = rebase_path_todo();
 +    FILE *todo;
 +
 +    if (strbuf_read_file(&buf, todo_file, 0) < 0)
 +    return error_errno(_("could not read '%s'."), todo_file);
 +
 +    strbuf_stripspace(&buf, 1);
 +    todo = fopen_or_warn(todo_file, "w");
>>>
>>> This truncates the existing file, if there are any errors writing the
>>> new one then the user has lost the old one. write_message() in
>>> sequencer.c avoids this problem by writing a new file and then renaming
>>> it if the write is successful, maybe it is worth exporting it so it can
>>> be used elsewhere.
>>>
 +    if (!todo) {
 +    strbuf_release(&buf);
 +    return 1;
 +    }
 +
 +    strbuf_write(&buf, todo);
 +    fclose(todo);
>>>
>>> There needs to be some error checking after the write and the close
>>> (using write_message() would mean you only have to check for errors in
>>> one place)
>>>
>>
>> Right.  Should I find a new name for write_message()?
> 
> That might be a good idea, I'm not sure what it should be though, maybe
> write_file()?, perhaps someone else might have a better suggestion.
> 

write_file() already exists in wrapper.c.  I wondered, as this makes use
of the lockfile API, whether it could be moved to lockfile.{c,h} and
renamed to something like write_file_with_lock().
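The write-then-rename pattern under discussion can be sketched in shell terms (file names below are illustrative; write_message() does this in C, taking a lock on the target file first):

```shell
#!/bin/sh
# Write the new todo list to a temporary file and rename it into place only
# on success, so a failed write never truncates the existing file.
dir=$(mktemp -d) && cd "$dir" || exit 1
printf 'pick 1234567 old subject\n' >git-rebase-todo

new_content='pick 1234567 edited subject'
if printf '%s\n' "$new_content" >git-rebase-todo.new; then
	mv git-rebase-todo.new git-rebase-todo   # rename(2) is all-or-nothing
else
	rm -f git-rebase-todo.new                # original file left intact
fi
cat git-rebase-todo
```

The key property is that readers of git-rebase-todo only ever see the old complete contents or the new complete contents, never a truncated file.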

Cheers,
Alban



Re: [PATCH v2 2/2] rebase --exec: make it work with --rebase-merges

2018-08-09 Thread Johannes Schindelin
Hi Phillip,

On Thu, 9 Aug 2018, Phillip Wood wrote:

> On 09/08/18 10:22, Johannes Schindelin wrote:
> > 
> > On Mon, 6 Aug 2018, Phillip Wood wrote:
> > 
> >> On 06/08/18 10:52, Johannes Schindelin via GitGitGadget wrote:
> >>>
> >>> + else if (is_fixup(command)) {
> >>> + insert = i + 1;
> >>> + continue;
> >>> + }
> >>> + strbuf_insert(buf,
> >>> +   todo_list.items[insert].offset_in_buf +
> >>> +   offset, commands, commands_len);
> >>>   offset += commands_len;
> >>> + insert = -1;
> >>>   }
> >>> - first = 0;
> >>> +
> >>> + if (command == TODO_PICK || command == TODO_MERGE)
> >>> + insert = i + 1;
> >>>}
> >>>   
> >>>   /* append final <commands> */
> >>> - strbuf_add(buf, commands, commands_len);
> >>> + if (insert >= 0 || !offset)
> >>> + strbuf_add(buf, commands, commands_len);
> >>
> >> Having read your other message about this patch I think if you wanted to 
> >> fix
> >> the position of the final exec in the case where the todo list ends with a
> >> comment you could do something like
> >>
> >>if (insert >= 0)
> >>strbuf_insert(buf,
> >>  todo_list.items[insert].offset_in_buf +
> >>  offset, commands, commands_len);
> >>else
> >>strbuf_add(buf, commands, commands_len);
> > 
> > That does not really work, as `insert` can point *after* the last line, in
> > which case `todo_list.items[insert]` is undefined (and in the worst case,
> > causes a segmentation fault).
> 
> Ah, I'd missed that, does changing the conditions to
> if (insert >= 0 && insert < todo_list.nr) and
> else if (insert >= 0 || !offset) work?

That's pretty exactly what I did ;-)

Ciao,
Dscho


Re: pk/rebase-in-c, was Re: What's cooking in git.git (Aug 2018, #01; Thu, 2)

2018-08-09 Thread Johannes Schindelin
Hi Junio,

On Sat, 4 Aug 2018, Junio C Hamano wrote:

> Johannes Schindelin  writes:
> 
> > On Thu, 2 Aug 2018, Junio C Hamano wrote:
> >
> >> * pk/rebase-in-c (2018-07-30) 3 commits
> >>  - builtin/rebase: support running "git rebase <upstream>"
> >>  - rebase: refactor common shell functions into their own file
> >>  - rebase: start implementing it as a builtin
> >> 
> >>  Rewrite of the "rebase" machinery in C.
> >> 
> >>  Will merge to 'next'.
> >
> > Please hold. I found several bugs in the third patch, and it will need to
> > be fixed before sending another iteration.
> 
> Thanks.  I think the author already corrected/stopped me on this one
> and it is now marked as "hold" in my working draft.

Indeed, looking at public-inbox, I even see it. The strange thing is that
I never received this email in my inbox. Apparently GMX is letting me
down, as I have identified at least a dozen mails that made it to the Git
mailing list but not into my inbox, some even with me in the Cc: line.

Troublesome.

Ciao,
Dscho


Re: [RFC PATCH] line-log: clarify [a,b) notation for ranges

2018-08-09 Thread Johannes Schindelin
Hi Eric,

On Tue, 7 Aug 2018, Eric Sunshine wrote:

> On Tue, Aug 7, 2018 at 9:54 AM Andrei Rybak  wrote:
> > line-log.[ch] use left-closed, right-open interval logic. Change comment
> > and debug output to square brackets+parentheses notation to help
> > developers avoid off-by-one errors.
> > ---
> 
> This seems sensible. There might be some reviewers who suggest
> different notation since "[...)" is not universal (see [1]), but I
> think this is fine.

Indeed. When I started out studying mathematics, I learned the notation
[...[ (which makes a lot more sense, if you think about it).

Besides, it's not like we start from a fresh slate. Git is already over a
decade old. Our commit ranges are "half open", i.e. the exact thing that
is described here. There's gotta be some precedent in the documentation,
and introducing something willfully inconsistent is probably a pretty bad
idea.

Ciao,
Dscho


What's the use case for committing both the freshly created file and its exclusion in .gitignore?

2018-08-09 Thread Bartosz Konikiewicz
Hi there!

I hope that the subject of my message (i.e. the question) is
exhaustive enough, so I'll just stick to reproducing my issue.

Steps to reproduce:

1. Create a new file.
2. Stage the file.
3. Add the file to .gitignore.
4. Stage the .gitignore.
5. Commit changes.

I imagined that the file would now be removed from the stage (because
it's ignored now and not yet committed), but it isn't. Where would this
behavior be desirable? I know that 'git add' can be invoked with an
'-f' flag, which would yield the same result, but I can't come up with
an example of where it would be applied.


Re: [RFC PATCH v2 01/12] submodule: add a print_config_from_gitmodules() helper

2018-08-09 Thread Antonio Ospite
On Thu, 2 Aug 2018 11:05:02 -0700
Stefan Beller  wrote:

> On Thu, Aug 2, 2018 at 6:47 AM Antonio Ospite  wrote:
> >
[...]
> > +extern int print_config_from_gitmodules(const char *key);
> 
> The only real pushback for this patch I'd have is lack of documentation
> in public functions, though this is pretty self explanatory; so I'd be fine
> for lacking the docs here.
> 
> In case a resend is needed, please drop the extern keyword here.
> 

I'll drop the extern keyword also for the public function added in
patch 02 then.

Thanks,
   Antonio

-- 
Antonio Ospite
https://ao2.it
https://twitter.com/ao2it

A: Because it messes up the order in which people normally read text.
   See http://en.wikipedia.org/wiki/Posting_style
Q: Why is top-posting such a bad thing?


Re: [PATCH v4 2/2] sequencer: fix quoting in write_author_script

2018-08-09 Thread Phillip Wood
On 08/08/18 10:39, Eric Sunshine wrote:
> On Tue, Aug 7, 2018 at 5:35 AM Phillip Wood  wrote:
>> Single quotes should be escaped as \' not \\'. The bad quoting breaks
>> the interactive version of 'rebase --root' (which is used when there
>> is no '--onto' even if the user does not specify --interactive) for
>> authors that contain "'" as sq_dequote() called by read_author_ident()
>> errors out on the bad quoting.
>> [...]
>> Signed-off-by: Phillip Wood 
>> ---
>> diff --git a/sequencer.c b/sequencer.c
>> @@ -636,42 +636,64 @@ static int write_author_script(const char *message)
>>  static int read_env_script(struct argv_array *env)
>>  {
>> if (strbuf_read_file(&script, rebase_path_author_script(), 256) <= 0)
>> return -1;
> 
> Erm, again, not introduced by this patch, but this is leaking 'script'
> in the error path. You had plugged this leak in the previous round but
> that fix got lost when you reverted to this simpler approach. Not
> critical, though; the leak probably ought to be fixed by a separate
> patch anyhow (which doesn't necessarily need to be part of this
> series).

I'm hoping this will go away when I work on unifying the code to read
the author-script with am.

>> +   /* write_author_script() used to quote incorrectly */
>> +   sq_bug = quoting_is_broken(script.buf, script.len);
>> for (p = script.buf; *p; p++)
>> -   if (skip_prefix(p, "'''", (const char **)&p2))
>> +   if (sq_bug && skip_prefix(p, "'''", &p2))
>> +   strbuf_splice(&script, p - script.buf, p2 - p, "'", 1);
>> +   else if (skip_prefix(p, "'\\''", &p2))
>> strbuf_splice(&script, p - script.buf, p2 - p, "'", 1);
> 
> The two strbuf_splice() invocations are identical, so an alternate way
> of expressing this would be:
> 
>  if ((sq_bug && skip_prefix(p, "'''", &p2)) ||
> skip_prefix(p, "'\\''", &p2))
> strbuf_splice(&script, p - script.buf, p2 - p, "'", 1);
> 
> Not necessarily clearer, and certainly not worth a re-roll.
> 
>> diff --git a/t/t3404-rebase-interactive.sh b/t/t3404-rebase-interactive.sh
>> @@ -1382,9 +1382,21 @@ test_expect_success 'rebase -i --gpg-sign=<key-id> overrides commit.gpgSign' '
>>  test_expect_success 'valid author header after --root swap' '
>> rebase_setup_and_clean author-header no-conflict-branch &&
>> set_fake_editor &&
>> -   FAKE_LINES="2 1" git rebase -i --root &&
>> -   git cat-file commit HEAD^ >out &&
>> -   grep "^author ..*> [0-9][0-9]* [-+][0-9][0-9][0-9][0-9]$" out
>> +   git commit --amend --author="Au ${SQ}thor " --no-edit &&
>> +   git cat-file commit HEAD | grep ^author >expected &&
>> +   FAKE_LINES="5 1" git rebase -i --root &&
>> +   git cat-file commit HEAD^ | grep ^author >actual &&
>> +   test_cmp expected actual
>> +'
> 
> It probably would have been clearer to change to this test to use
> test_cmp() instead of 'grep' in a separate patch since it's not
> directly related to the fixes in this patch, and then to do the
> "commit --amend" in this patch. However, probably not worth a re-roll.

In hindsight that would have been clearer, but hopefully this version
isn't too confusing

>> +test_expect_success 'valid author header when author contains single quote' '
>> +   rebase_setup_and_clean author-header no-conflict-branch &&
>> +   set_fake_editor &&
>> +   git commit --amend --author="Au ${SQ}thor " --no-edit &&
>> +   git cat-file commit HEAD | grep ^author >expected &&
>> +   FAKE_LINES="2" git rebase -i HEAD~2 &&
>> +   git cat-file commit HEAD | grep ^author >actual &&
>> +   test_cmp expected actual
>>  '
> 
> This test is so similar to the one above that it is tempting to try to
> refactor the code so the tests can share implementation, however, the
> end result would probably be less readable, so they're probably fine
> as-is.

Yes, I think it could look messy

Thank you very much for all your help and comments on this series

Best Wishes

Phillip



Re: [PATCH v4 2/2] sequencer: fix quoting in write_author_script

2018-08-09 Thread Phillip Wood
Hi Eric

On 08/08/18 09:43, Eric Sunshine wrote:
> On Tue, Aug 7, 2018 at 9:54 AM Phillip Wood  wrote:
>> On 07/08/18 11:23, Eric Sunshine wrote:
>>> On Tue, Aug 7, 2018 at 5:35 AM Phillip Wood  
>>> wrote:
 +   if (n > 0 && s[n] != '\'')
 +   return 1;
>>>
>>> To be "technically correct", I think the condition in the 'if'
>>> statement should be ">=". It should never happen in practice that the
>>> entire content of the file is a single character followed by zero or
>>> more newlines, but using the proper condition ">=" would save future
>>> readers of this code a "huh?" moment.
>>
>> I'm not sure it is that simple. If the script consists solely of a
>> single quote then we should return 1, if it is a single character that
>> is not "'" then it should return 0. Currently it returns 0 in both those
>> cases so is technically broken when the script is "'". If it used ">="
>> instead then I think it would return 0 when it should return 1 and vice
>> versa. As you say this shouldn't happen in practice.
> 
> It shouldn't happen in practice, but if a human (power user) edits
> this file, we shouldn't discount the possibility. However, I'm not so
> concerned about quoting_is_broken() returning a meaningful answer in
> this case since we have much bigger problems if we get such a file.
> (Indeed, what answer could quoting_is_broken() return which would be
> useful or meaningful for such a malformed file?)
> 
> What does concern me is that read_env_script() doesn't seem to care
> about such a malformed file; it doesn't do any validation at all.
> Contrast that with read_author_ident() which is pretty strict about
> the content it expects to find in the file. So, it might make sense to
> upgrade read_env_script() to do some sort of validation on each line
> (though that shouldn't be in this patch, and doesn't even need to be
> in this series). For instance, it could check that each line looks
> something like what would be matched by this regex: /[A-Z0-9_]+='.+'/

I think it should just share some code with get_author_ident() that uses
sq_dequote(), that's for another day though.

> (And, no, I'm not saying that regex should be used for validation; I'm
> just using it as an example.)
> 
> As for '>' vs. '>=', it caused more than a slight hiccup when I was
> scanning the code, and I worry that future readers could be similarly
> impacted.

I don't have a strong opinion either way, if Junio wants to fix it up
I'd be happy with that, but I'm not keen on another iteration just for that.

Best Wishes

Phillip




Re: [PATCH v4 2/2] sequencer: fix quoting in write_author_script

2018-08-09 Thread Phillip Wood
Hi Junio
On 08/08/18 17:01, Junio C Hamano wrote:
> Eric Sunshine  writes:
> 
>> What does concern me is that read_env_script() doesn't seem to care
>> about such a malformed file; it doesn't do any validation at all.
>> Contrast that with read_author_ident() which is pretty strict about
>> the content it expects to find in the file. So, it might make sense to
>> upgrade read_env_script() to do some sort of validation on each line
>> (though that shouldn't be in this patch, and doesn't even need to be
>> in this series).
> 
> I do not think it is within the scope of these bugfix patches, but I
> tend to agree that in the longer term it would be a good idea to
> unify these two helpers that read exactly the same file stored at
> rebase_path_author_script(), and make the result stricter, rather
> than tightening two helpers independently.

That's my longer term goal, ideally sharing code with am.

Best Wishes

Phillip


Re: [PATCH v2 2/2] rebase --exec: make it work with --rebase-merges

2018-08-09 Thread Phillip Wood
On 09/08/18 10:22, Johannes Schindelin wrote:
> Hi Phillip,
> 
> On Mon, 6 Aug 2018, Phillip Wood wrote:
> 
>> On 06/08/18 10:52, Johannes Schindelin via GitGitGadget wrote:
>>>
>>> From: Johannes Schindelin 
>>>
>>> The idea of `--exec` is to append an `exec` call after each `pick`.
>>>
>>> Since the introduction of fixup!/squash! commits, this idea was extended
>>> to apply to "pick, possibly followed by a fixup/squash chain", i.e. an
>>> exec would not be inserted between a `pick` and any of its corresponding
>>> `fixup` or `squash` lines.
>>>
>>> The current implementation uses a dirty trick to achieve that: it
>>> assumes that there are only pick/fixup/squash commands, and then
>>> *inserts* the `exec` lines before any `pick` but the first, and appends
>>> a final one.
>>>
>>> With the todo lists generated by `git rebase --rebase-merges`, this
>>> simple implementation shows its problems: it produces the exact wrong
>>> thing when there are `label`, `reset` and `merge` commands.
>>>
>>> Let's change the implementation to do exactly what we want: look for
>>> `pick` lines, skip any fixup/squash chains, and then insert the `exec`
>>> line. Lather, rinse, repeat.
>>>
>>> While at it, also add `exec` lines after `merge` commands, because they
>>> are similar in spirit to `pick` commands: they add new commits.
>>>
>>> Signed-off-by: Johannes Schindelin 
>>> ---
>>>   sequencer.c  | 37 +++--
>>>   t/t3430-rebase-merges.sh |  2 +-
>>>   2 files changed, 28 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/sequencer.c b/sequencer.c
>>> index 31038472f..ed2e694ff 100644
>>> --- a/sequencer.c
>>> +++ b/sequencer.c
>>> @@ -4244,10 +4244,9 @@ int sequencer_add_exec_commands(const char *commands)
>>>   {
>>>const char *todo_file = rebase_path_todo();
>>>struct todo_list todo_list = TODO_LIST_INIT;
>>> -   struct todo_item *item;
> >>>struct strbuf *buf = &todo_list.buf;
>>>size_t offset = 0, commands_len = strlen(commands);
>>> -   int i, first;
>>> +   int i, insert;
>>>   
> >>>if (strbuf_read_file(&todo_list.buf, todo_file, 0) < 0)
>>> return error(_("could not read '%s'."), todo_file);
>>> @@ -4257,19 +4256,37 @@ int sequencer_add_exec_commands(const char
>>> *commands)
>>> return error(_("unusable todo list: '%s'"), todo_file);
>>>}
>>>   - first = 1;
> >>> -   /* insert <commands> before every pick except the first one */
>>> -   for (item = todo_list.items, i = 0; i < todo_list.nr; i++, item++) {
>>> -   if (item->command == TODO_PICK && !first) {
>>> -   strbuf_insert(buf, item->offset_in_buf + offset,
>>> - commands, commands_len);
>>> +   /*
> >>> +* Insert <commands> after every pick. Here, fixup/squash chains
>>> +* are considered part of the pick, so we insert the commands *after*
>>> +* those chains if there are any.
>>> +*/
>>> +   insert = -1;
>>> +   for (i = 0; i < todo_list.nr; i++) {
>>> +   enum todo_command command = todo_list.items[i].command;
>>> +
>>> +   if (insert >= 0) {
>>> +   /* skip fixup/squash chains */
>>> +   if (command == TODO_COMMENT)
>>> +   continue;
>>
>> insert is not updated so if the next command is not a fixup the exec
>> line will be inserted before the comment.
> 
> Yes, this is very much on purpose. Take this todo list, for example:
> 
>   pick 123456 this patch
>   # pick 987654 this was an empty commit
> 
> You definitely do not want the `exec` to appear after that commented-out
> empty commit.

Yes, I like it, I was just thinking out loud.

>>> +   else if (is_fixup(command)) {
>>> +   insert = i + 1;
>>> +   continue;
>>> +   }
>>> +   strbuf_insert(buf,
>>> + todo_list.items[insert].offset_in_buf +
>>> + offset, commands, commands_len);
>>> offset += commands_len;
>>> +   insert = -1;
>>> }
>>> -   first = 0;
>>> +
>>> +   if (command == TODO_PICK || command == TODO_MERGE)
>>> +   insert = i + 1;
>>>}
>>>   
> >>> /* append final <commands> */
>>> -   strbuf_add(buf, commands, commands_len);
>>> +   if (insert >= 0 || !offset)
>>> +   strbuf_add(buf, commands, commands_len);
>>
>> Having read your other message about this patch I think if you wanted to fix
>> the position of the final exec in the case where the todo list ends with a
>> comment you could do something like
>>
>>  if (insert >= 0)
>>  strbuf_insert(buf,
>>todo_list.items[insert].offset_in_buf +
>>offset, commands, commands_len);
>>  else
>>  strbuf_add(buf, commands, commands_len);
> 
> That does not really work, as `insert` can point *after* the last line, in
> which case `todo_list.items[insert]` is undefined (and in the worst case,
> causes a segmentation fault).

[PATCH v3 0/2] Make git rebase work with --rebase-merges and --exec

2018-08-09 Thread Johannes Schindelin via GitGitGadget
It was reported via IRC that the exec lines are inserted in the wrong spots
when using --rebase-merges.

The reason is that we used a simple, incorrect implementation that happened
to work as long as the generated todo list only contains pick, fixup and 
squash commands, which is not the case with --rebase-merges.

Fix this issue by using a correct implementation instead, that even takes
into account merge commands in the --rebase-merges mode.

Changes since v1:

 * Replaced the "look-ahead" design by a "keep looking" one: instead of
   having a nested loop that looks for the end of the fixup/squash chain, we
   continue the loop, delaying the insertion until we know where the
   fixup/squash chain ends, if any.
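The "keep looking" strategy can be sketched outside of C on a mock todo list (the exec command and commit subjects below are made up; the real implementation lives in sequencer_add_exec_commands()):

```shell
#!/bin/sh
# After a pick or merge, an exec line is "pending"; fixup/squash lines
# extend the chain; the pending exec is flushed before the next unrelated
# command, or at end of file.
cat >todo <<\EOF
pick 1111111 A
fixup 2222222 fixup! A
merge -C 3333333 topic
reset onto
pick 4444444 B
EOF

{
	pending=0
	while IFS= read -r line; do
		cmd=${line%% *}
		case $cmd in
		fixup|squash)
			;;	# still inside the chain, keep looking
		*)
			if [ $pending -eq 1 ]; then
				echo "exec make test"
				pending=0
			fi
			;;
		esac
		echo "$line"
		case $cmd in
		pick|merge)
			pending=1
			;;
		esac
	done <todo
	if [ $pending -eq 1 ]; then
		echo "exec make test"	# final exec after the last chain
	fi
} >out
cat out
```

On this input the exec lines land after the fixup chain, after the merge, and after the final pick, which is the behaviour the patch series implements.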

Johannes Schindelin (2):
  t3430: demonstrate what -r, --autosquash & --exec should do
  rebase --exec: make it work with --rebase-merges

 sequencer.c  | 42 +---
 t/t3430-rebase-merges.sh | 17 
 2 files changed, 48 insertions(+), 11 deletions(-)


base-commit: 1d89318c48d233d52f1db230cf622935ac3c69fa
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-13%2Fdscho%2Frebase-merges-and-exec-commands-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-13/dscho/rebase-merges-and-exec-commands-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/13

Range-diff vs v2:

 1:  1d82eb450 = 1:  1d82eb450 t3430: demonstrate what -r, --autosquash & 
--exec should do
 2:  7ca441a89 ! 2:  b436f67ba rebase --exec: make it work with --rebase-merges
 @@ -22,6 +22,11 @@
  `pick` lines, skip any fixup/squash chains, and then insert the `exec`
  line. Lather, rinse, repeat.
  
 +Note: we take pains to insert *before* comment lines whenever 
possible,
 +as empty commits are represented by commented-out pick lines (and we
 +want to insert a preceding pick's exec line *before* such a line, not
 +afterward).
 +
  While at it, also add `exec` lines after `merge` commands, because 
they
  are similar in spirit to `pick` commands: they add new commits.
  
 @@ -81,9 +86,13 @@
  + insert = i + 1;
}
   
 -  /* append final  */
 +- /* append final <commands> */
  - strbuf_add(buf, commands, commands_len);
 -+ if (insert >= 0 || !offset)
 ++ /* insert or append final <commands> */
 ++ if (insert >= 0 && insert < todo_list.nr)
 ++ strbuf_insert(buf, todo_list.items[insert].offset_in_buf +
 ++   offset, commands, commands_len);
 ++ else if (insert >= 0 || !offset)
  + strbuf_add(buf, commands, commands_len);
   
i = write_message(buf->buf, buf->len, todo_file, 0);

-- 
gitgitgadget


[PATCH v3 1/2] t3430: demonstrate what -r, --autosquash & --exec should do

2018-08-09 Thread Johannes Schindelin via GitGitGadget
From: Johannes Schindelin 

The --exec option's implementation is not really well-prepared for
--rebase-merges. Demonstrate this.

Signed-off-by: Johannes Schindelin 
---
 t/t3430-rebase-merges.sh | 17 +
 1 file changed, 17 insertions(+)

diff --git a/t/t3430-rebase-merges.sh b/t/t3430-rebase-merges.sh
index 9e6229727..0bf5eaa37 100755
--- a/t/t3430-rebase-merges.sh
+++ b/t/t3430-rebase-merges.sh
@@ -363,4 +363,21 @@ test_expect_success 'octopus merges' '
EOF
 '
 
+test_expect_failure 'with --autosquash and --exec' '
+   git checkout -b with-exec H &&
+   echo Booh >B.t &&
+   test_tick &&
+   git commit --fixup B B.t &&
+   write_script show.sh <<-\EOF &&
+   subject="$(git show -s --format=%s HEAD)"
+   content="$(git diff HEAD^! | tail -n 1)"
+   echo "$subject: $content"
+   EOF
+   test_tick &&
+   git rebase -ir --autosquash --exec ./show.sh A >actual &&
+   grep "B: +Booh" actual &&
+   grep "E: +Booh" actual &&
+   grep "G: +G" actual
+'
+
 test_done
-- 
gitgitgadget



[PATCH v3 2/2] rebase --exec: make it work with --rebase-merges

2018-08-09 Thread Johannes Schindelin via GitGitGadget
From: Johannes Schindelin 

The idea of `--exec` is to append an `exec` call after each `pick`.

Since the introduction of fixup!/squash! commits, this idea was extended
to apply to "pick, possibly followed by a fixup/squash chain", i.e. an
exec would not be inserted between a `pick` and any of its corresponding
`fixup` or `squash` lines.

The current implementation uses a dirty trick to achieve that: it
assumes that there are only pick/fixup/squash commands, and then
*inserts* the `exec` lines before any `pick` but the first, and appends
a final one.

With the todo lists generated by `git rebase --rebase-merges`, this
simple implementation shows its problems: it produces the exact wrong
thing when there are `label`, `reset` and `merge` commands.

Let's change the implementation to do exactly what we want: look for
`pick` lines, skip any fixup/squash chains, and then insert the `exec`
line. Lather, rinse, repeat.

Note: we take pains to insert *before* comment lines whenever possible,
as empty commits are represented by commented-out pick lines (and we
want to insert a preceding pick's exec line *before* such a line, not
afterward).

While at it, also add `exec` lines after `merge` commands, because they
are similar in spirit to `pick` commands: they add new commits.

Signed-off-by: Johannes Schindelin 
---
 sequencer.c  | 42 +---
 t/t3430-rebase-merges.sh |  2 +-
 2 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/sequencer.c b/sequencer.c
index 31038472f..278d34ce9 100644
--- a/sequencer.c
+++ b/sequencer.c
@@ -4244,10 +4244,9 @@ int sequencer_add_exec_commands(const char *commands)
 {
const char *todo_file = rebase_path_todo();
struct todo_list todo_list = TODO_LIST_INIT;
-   struct todo_item *item;
struct strbuf *buf = &todo_list.buf;
size_t offset = 0, commands_len = strlen(commands);
-   int i, first;
+   int i, insert;
 
if (strbuf_read_file(&todo_list.buf, todo_file, 0) < 0)
return error(_("could not read '%s'."), todo_file);
@@ -4257,19 +4256,40 @@ int sequencer_add_exec_commands(const char *commands)
return error(_("unusable todo list: '%s'"), todo_file);
}
 
-   first = 1;
-   /* insert <commands> before every pick except the first one */
-   for (item = todo_list.items, i = 0; i < todo_list.nr; i++, item++) {
-   if (item->command == TODO_PICK && !first) {
-   strbuf_insert(buf, item->offset_in_buf + offset,
- commands, commands_len);
+   /*
+* Insert <commands> after every pick. Here, fixup/squash chains
+* are considered part of the pick, so we insert the commands *after*
+* those chains if there are any.
+*/
+   insert = -1;
+   for (i = 0; i < todo_list.nr; i++) {
+   enum todo_command command = todo_list.items[i].command;
+
+   if (insert >= 0) {
+   /* skip fixup/squash chains */
+   if (command == TODO_COMMENT)
+   continue;
+   else if (is_fixup(command)) {
+   insert = i + 1;
+   continue;
+   }
+   strbuf_insert(buf,
+ todo_list.items[insert].offset_in_buf +
+ offset, commands, commands_len);
offset += commands_len;
+   insert = -1;
}
-   first = 0;
+
+   if (command == TODO_PICK || command == TODO_MERGE)
+   insert = i + 1;
}
 
-   /* append final <commands> */
-   strbuf_add(buf, commands, commands_len);
+   /* insert or append final <commands> */
+   if (insert >= 0 && insert < todo_list.nr)
+   strbuf_insert(buf, todo_list.items[insert].offset_in_buf +
+ offset, commands, commands_len);
+   else if (insert >= 0 || !offset)
+   strbuf_add(buf, commands, commands_len);
 
i = write_message(buf->buf, buf->len, todo_file, 0);
todo_list_release(_list);
diff --git a/t/t3430-rebase-merges.sh b/t/t3430-rebase-merges.sh
index 0bf5eaa37..90ae613e2 100755
--- a/t/t3430-rebase-merges.sh
+++ b/t/t3430-rebase-merges.sh
@@ -363,7 +363,7 @@ test_expect_success 'octopus merges' '
EOF
 '
 
-test_expect_failure 'with --autosquash and --exec' '
+test_expect_success 'with --autosquash and --exec' '
git checkout -b with-exec H &&
echo Booh >B.t &&
test_tick &&
-- 
gitgitgadget
