Re: [PATCH 3/2] batch check whether submodule needs pushing into one call

2016-09-19 Thread Heiko Voigt
On Fri, Sep 16, 2016 at 11:13:09AM -0700, Junio C Hamano wrote:
> Heiko Voigt  writes:
> > The most exact solution would be to use all actual remote refs available
> > (not sure if we have them at this point in the process?) another
> > solution would be to still append the --remotes= option as a
> > fallback to reduce the revisions.
> 
> I'd say --remotes= is the least problematic thing to do.

Ok then lets drop my last patch and keep it the way it was. Because if
the remote sha1 differs we probably do not have it locally anyway. The
only case this does not catch is when the user specifies a remote URL.
But that just means we will iterate over all revisions instead of a
reduced set, which makes the check slower but still correct. As one can
see from my measurements that should not be that bad anymore.

Cheers Heiko


Re: [PATCH 3/2] batch check whether submodule needs pushing into one call

2016-09-19 Thread Heiko Voigt
On Fri, Sep 16, 2016 at 10:59:37AM -0700, Junio C Hamano wrote:
> Heiko Voigt  writes:
> > +static void check_has_hash(const unsigned char sha1[20], void *data)
> > +{
> > +   int *has_hash = (int *) data;
> > +
> > +   if (!lookup_commit_reference(sha1))
> > +   *has_hash = 0;
> > +}
> > +
> > +static int submodule_has_hashes(const char *path, struct sha1_array 
> > *hashes)
> > +{
> > +   int has_hash = 1;
> > +
> > +   if (add_submodule_odb(path))
> > +   return 0;
> > +
> > +   sha1_array_for_each_unique(hashes, check_has_hash, _hash);
> > +   return has_hash;
> > +}
> > +
> > +static int submodule_needs_pushing(const char *path, struct sha1_array 
> > *hashes)
> > +{
> > +   if (!submodule_has_hashes(path, hashes))
> > return 0;
> 
> I think you meant well, but this optimization is wrong.  A mere
> presence of an object does not mean that the current tip can reach
> that object.  Imagine you pushed commit A earlier to them at the
> tip, then pushed commit A~ to them at the tip, which is the current
> state of the remote of the submodule, and since them they may have
> GC'ed.  They no longer have the commit A.
> 
> For that matter, because you are doing this check by pretending as
> if all the submodule objects are in the object store of the current
> superproject you are working in, and saying "it exists there in the
> submodule repository" when the only thing you know is it exists in
> an object store of either the submodule repository, the superproject
> repository, or any of the other submodule repositories, you really
> cannot tell much from a mere presence of an object.  Not just the
> remote of the submodule repository you are interested in, but the
> submodule repository you are interested in itself, may not have that
> object.
> 
> Drop the previous two helper functions and this short-cut.

This is nothing I added. This came from the original code which I simply
extended to check for all sha1's. The diff is a little bit misleading. This
would be the local diff:

-   if (add_submodule_odb(path) || !lookup_commit_reference(sha1))
+   if (!submodule_has_hashes(path, hashes))
return 0;

I think that this is more a shortcut for: If we can not find the sha1, we do
not need to bother spawning a process for the real check. We default to the
answer no in that case.

It looks like the reason for the sha1 check is to avoid error output from the
spawned 'rev-list' process. The error output might be confusing to the user in
case the sha1 does not exist in the submodule. We should probably die here
since we were not able to check a submodule that has a diff in the revisions we
would push.

After thinking about this further: I think it is actually bad to proceed here,
as the current revisions (just in the superproject) would still be pushed and
the user possibly gets an inconsistent state on the remote side which he tried
to avoid with check/on-demand-push enabled.

So in short this deserves some further love. Will have a look into adding
another patch for this if we agree on my suggestion to die above.

Cheers Heiko


Re: [PATCH 3/2] batch check whether submodule needs pushing into one call

2016-09-16 Thread Junio C Hamano
Heiko Voigt  writes:

> On Fri, Sep 16, 2016 at 11:40:19AM +0200, Heiko Voigt wrote:
>> > By the way, with the two new patches, 'pu' seems to start failing
>> > some tests, e.g. 5533 5404 5405.
>> 
>> Ah ok I did only test on master, will look into those.
>
> Ok I had a look into these and the reason t5533 fails is because on pu
> --recurse-submodules is enabled by default and I missed the case when
> overwriting a ref. In that case we get the sha1 from the remote side as
> old. So we could catch that and fall back to all revisions there, but...
>
> ... tl;dr: The solution to use the old revisions from the remote side
> was too simple and does not make matters better but actually worse for
> some typical usecases. Its only in the last patch.

You may not even have the old one in your copy of the remote
repository if you haven't fetched from them and you are forcing your
push.  "rev-list  --not " may fail in such a case,
not producing the list of new commits.  You'd need to exclude old ones
you learned over the wire that you do not yet have locally.

> The most exact solution would be to use all actual remote refs available
> (not sure if we have them at this point in the process?) another
> solution would be to still append the --remotes= option as a
> fallback to reduce the revisions.

I'd say --remotes= is the least problematic thing to do.



Re: [PATCH 3/2] batch check whether submodule needs pushing into one call

2016-09-16 Thread Junio C Hamano
Heiko Voigt  writes:

> +static void append_hash_to_argv(const unsigned char sha1[20], void *data)
>  {
> - if (add_submodule_odb(path) || !lookup_commit_reference(sha1))
> + struct argv_array *argv = (struct argv_array *) data;
> + argv_array_push(argv, sha1_to_hex(sha1));
> +}

Hmph, why do I think I've seen this before in the previous patch?

... scans through this patch and finds that a similar one is
removed ;-)

OK.  This makes sense.

> +static void check_has_hash(const unsigned char sha1[20], void *data)
> +{
> + int *has_hash = (int *) data;
> +
> + if (!lookup_commit_reference(sha1))
> + *has_hash = 0;
> +}
> +
> +static int submodule_has_hashes(const char *path, struct sha1_array *hashes)
> +{
> + int has_hash = 1;
> +
> + if (add_submodule_odb(path))
> + return 0;
> +
> + sha1_array_for_each_unique(hashes, check_has_hash, _hash);
> + return has_hash;
> +}
> +
> +static int submodule_needs_pushing(const char *path, struct sha1_array 
> *hashes)
> +{
> + if (!submodule_has_hashes(path, hashes))
>   return 0;

I think you meant well, but this optimization is wrong.  A mere
presence of an object does not mean that the current tip can reach
that object.  Imagine you pushed commit A earlier to them at the
tip, then pushed commit A~ to them at the tip, which is the current
state of the remote of the submodule, and since them they may have
GC'ed.  They no longer have the commit A.

For that matter, because you are doing this check by pretending as
if all the submodule objects are in the object store of the current
superproject you are working in, and saying "it exists there in the
submodule repository" when the only thing you know is it exists in
an object store of either the submodule repository, the superproject
repository, or any of the other submodule repositories, you really
cannot tell much from a mere presence of an object.  Not just the
remote of the submodule repository you are interested in, but the
submodule repository you are interested in itself, may not have that
object.

Drop the previous two helper functions and this short-cut.

>   if (for_each_remote_ref_submodule(path, has_remote, NULL) > 0) {
>   struct child_process cp = CHILD_PROCESS_INIT;
> - const char *argv[] = {"rev-list", NULL, "--not", "--remotes", 
> "-n", "1" , NULL};
> +
> + argv_array_push(, "rev-list");
> + sha1_array_for_each_unique(hashes, append_hash_to_argv, 
> );
> + argv_array_pushl(, "--not", "--remotes", "-n", "1" , 
> NULL);
> +
>   struct strbuf buf = STRBUF_INIT;
>   int needs_pushing = 0;
>  
> - argv[1] = sha1_to_hex(sha1);
> - cp.argv = argv;
>   prepare_submodule_repo_env(_array);
>   cp.git_cmd = 1;
>   cp.no_stdin = 1;
>   cp.out = -1;
>   cp.dir = path;
>   if (start_command())
> - die("Could not run 'git rev-list %s --not --remotes -n 
> 1' command in submodule %s",
> - sha1_to_hex(sha1), path);
> + die("Could not run 'git rev-list  --not 
> --remotes -n 1' command in submodule %s",
> + path);
>   if (strbuf_read(, cp.out, 41))
>   needs_pushing = 1;
>   finish_command();
> @@ -601,21 +628,6 @@ static void find_unpushed_submodule_commits(struct 
> commit *commit,
>   diff_tree_combined_merge(commit, 1, );
>  }

Good.  This is the optimization I alluded to in the review of the
first one in the series.


Re: [PATCH 3/2] batch check whether submodule needs pushing into one call

2016-09-16 Thread Heiko Voigt
On Fri, Sep 16, 2016 at 11:40:19AM +0200, Heiko Voigt wrote:
> > By the way, with the two new patches, 'pu' seems to start failing
> > some tests, e.g. 5533 5404 5405.
> 
> Ah ok I did only test on master, will look into those.

Ok I had a look into these and the reason t5533 fails is because on pu
--recurse-submodules is enabled by default and I missed the case when
overwriting a ref. In that case we get the sha1 from the remote side as
old. So we could catch that and fall back to all revisions there, but...

... tl;dr: The solution to use the old revisions from the remote side
was too simple and does not make matters better but actually worse for
some typical usecases. Its only in the last patch.

... that lead me to further thinking about the previous solution (using
the locally cached remote refs) which might actually be a good default
for the non-fastforward cases like creating new ref or overwriting a
ref.

My current patch would handle the --mirror case nicer, since it gets a
lot of old revs to reduce the revisions to check. For the typical one
branch push it would most likely be worse. Except when the user is
updating (fast-forwarding) the remote ref we would scan all revs of a
ref until the root because we do not get enough valid revs that already
exist on the remote.

The most exact solution would be to use all actual remote refs available
(not sure if we have them at this point in the process?) another
solution would be to still append the --remotes= option as a
fallback to reduce the revisions.

What do others think? Will leave this for a little bit further thinking.
Its just the last patch ("use actual start hashes for submodule push
check instead of local refs") that needs to go back to the drawing
board.

Cheers Heiko


Re: [PATCH 3/2] batch check whether submodule needs pushing into one call

2016-09-16 Thread Heiko Voigt
On Thu, Sep 15, 2016 at 02:08:58PM -0700, Junio C Hamano wrote:
> Heiko Voigt  writes:
> 
> > if (for_each_remote_ref_submodule(path, has_remote, NULL) > 0) {
> > struct child_process cp = CHILD_PROCESS_INIT;
> > -   const char *argv[] = {"rev-list", NULL, "--not", "--remotes", 
> > "-n", "1" , NULL};
> > +
> > +   argv_array_push(, "rev-list");
> > +   sha1_array_for_each_unique(hashes, append_hash_to_argv, 
> > );
> > +   argv_array_pushl(, "--not", "--remotes", "-n", "1" , 
> > NULL);
> > +
> > struct strbuf buf = STRBUF_INIT;
> > int needs_pushing = 0;
> 
> These two become decl-after-stmt; move your new lines a bit lower,
> perhaps?

Thanks, missed those. Will do.

> > -   argv[1] = sha1_to_hex(sha1);
> > -   cp.argv = argv;
> > prepare_submodule_repo_env(_array);
> 
> By the way, with the two new patches, 'pu' seems to start failing
> some tests, e.g. 5533 5404 5405.

Ah ok I did only test on master, will look into those.

Cheers Heiko


Re: [PATCH 3/2] batch check whether submodule needs pushing into one call

2016-09-15 Thread Junio C Hamano
Heiko Voigt  writes:

>   if (for_each_remote_ref_submodule(path, has_remote, NULL) > 0) {
>   struct child_process cp = CHILD_PROCESS_INIT;
> - const char *argv[] = {"rev-list", NULL, "--not", "--remotes", 
> "-n", "1" , NULL};
> +
> + argv_array_push(, "rev-list");
> + sha1_array_for_each_unique(hashes, append_hash_to_argv, 
> );
> + argv_array_pushl(, "--not", "--remotes", "-n", "1" , 
> NULL);
> +
>   struct strbuf buf = STRBUF_INIT;
>   int needs_pushing = 0;

These two become decl-after-stmt; move your new lines a bit lower,
perhaps?

> - argv[1] = sha1_to_hex(sha1);
> - cp.argv = argv;
>   prepare_submodule_repo_env(_array);

By the way, with the two new patches, 'pu' seems to start failing
some tests, e.g. 5533 5404 5405.



[PATCH 3/2] batch check whether submodule needs pushing into one call

2016-09-15 Thread Heiko Voigt
We run a command for each sha1 change in a submodule. This is
unnecessary since we can simply batch all sha1's we want to check into
one command. Lets do it so we can speedup the check when many submodule
changes are in need of checking.

Signed-off-by: Heiko Voigt 
---
On Wed, Sep 14, 2016 at 03:30:53PM -0700, Junio C Hamano wrote:
> Heiko Voigt  writes:
> 
> > Sorry about the late reply. I was not able to process emails until now.
> > Here are two patches that should help to improve the situation and batch
> > up some processing. This one is for repositories with submodules, so
> > that they do not iterate over the same submodule twice with the same
> > hash.
> >
> > The second one will be the one people without submodules are interested
> > in.
> 
> Thanks.  Will take a look at later as I'm already deep in today's
> integration cycle.  Very much appreciated.

No problem. While I am at it: Here are actually another two patches that
should make life of submodule users easier (push times of big pushes).

In Numbers with the qt5[1] superproject and all submodules initialized.
The same --mirror test as before with the git repository:

# Without patch:

book:qt5 hvoigt (5.6)$
rm -rf ~/Downloads/git-test && mkdir ~/Downloads/git-test &&
   (cd ~/Downloads/git-test && git init) &&
   time git push --mirror --recurse-submodules=check ~/Downloads/git-test

real4m0.881s
user3m30.139s
sys 0m22.329s

Without --recurse-submodules=check

real0m0.251s
user0m0.218s
sys 0m0.082s


# With patch:

real0m1.167s
user0m0.846s
sys 0m0.262s

real0m1.110s
user0m0.815s
sys 0m0.247s

real0m1.111s
user0m0.818s
sys 0m0.251s

Without --recurse-submodules=check

real0m0.294s
user0m0.221s
sys 0m0.104s

real0m0.248s
user0m0.216s
sys 0m0.080s

real0m0.247s
user0m0.212s
sys 0m0.082s

[1] git://code.qt.io/qt/qt5.git

 submodule.c | 75 -
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/submodule.c b/submodule.c
index a15e346..28bb74e 100644
--- a/submodule.c
+++ b/submodule.c
@@ -522,27 +522,54 @@ static int has_remote(const char *refname, const struct 
object_id *oid,
return 1;
 }
 
-static int submodule_needs_pushing(const char *path, const unsigned char 
sha1[20])
+static void append_hash_to_argv(const unsigned char sha1[20], void *data)
 {
-   if (add_submodule_odb(path) || !lookup_commit_reference(sha1))
+   struct argv_array *argv = (struct argv_array *) data;
+   argv_array_push(argv, sha1_to_hex(sha1));
+}
+
+static void check_has_hash(const unsigned char sha1[20], void *data)
+{
+   int *has_hash = (int *) data;
+
+   if (!lookup_commit_reference(sha1))
+   *has_hash = 0;
+}
+
+static int submodule_has_hashes(const char *path, struct sha1_array *hashes)
+{
+   int has_hash = 1;
+
+   if (add_submodule_odb(path))
+   return 0;
+
+   sha1_array_for_each_unique(hashes, check_has_hash, _hash);
+   return has_hash;
+}
+
+static int submodule_needs_pushing(const char *path, struct sha1_array *hashes)
+{
+   if (!submodule_has_hashes(path, hashes))
return 0;
 
if (for_each_remote_ref_submodule(path, has_remote, NULL) > 0) {
struct child_process cp = CHILD_PROCESS_INIT;
-   const char *argv[] = {"rev-list", NULL, "--not", "--remotes", 
"-n", "1" , NULL};
+
+   argv_array_push(, "rev-list");
+   sha1_array_for_each_unique(hashes, append_hash_to_argv, 
);
+   argv_array_pushl(, "--not", "--remotes", "-n", "1" , 
NULL);
+
struct strbuf buf = STRBUF_INIT;
int needs_pushing = 0;
 
-   argv[1] = sha1_to_hex(sha1);
-   cp.argv = argv;
prepare_submodule_repo_env(_array);
cp.git_cmd = 1;
cp.no_stdin = 1;
cp.out = -1;
cp.dir = path;
if (start_command())
-   die("Could not run 'git rev-list %s --not --remotes -n 
1' command in submodule %s",
-   sha1_to_hex(sha1), path);
+   die("Could not run 'git rev-list  --not 
--remotes -n 1' command in submodule %s",
+   path);
if (strbuf_read(, cp.out, 41))
needs_pushing = 1;
finish_command();
@@ -601,21 +628,6 @@ static void find_unpushed_submodule_commits(struct commit 
*commit,
diff_tree_combined_merge(commit, 1, );
 }
 
-struct collect_submodule_from_sha1s_data {
-   char *submodule_path;
-   struct string_list *needs_pushing;
-};
-
-static void collect_submodules_from_sha1s(const unsigned char sha1[20],
-   void *data)
-{
-   struct collect_submodule_from_sha1s_data *me =
-   (struct