Re: [PATCHv2] push: change submodule default to check

2016-08-24 Thread Jeff King
On Wed, Aug 24, 2016 at 03:37:29PM -0700, Stefan Beller wrote:

> > That sounds massively ... broken.  So before even thinking about
> > flipping it to default, this needs to be fixed first.
> 
> I agree. That sounds bad.
> 
> However having the --auto-check feels like papering over the
> actual problem which to me sounds like a design problem.
> However this may be a viable short term solution.

Sort of...

I may not have been clear, but there are really a few things going on.

One is that the design of find_unpushed_submodules() is just brain-dead.
It does one traversal per updated ref, which means a from-scratch mirror
is O(nr_of_refs * nr_of_commits). That's just silly, and can easily be
fixed behind the scenes to be O(nr_of_commits).

And I _suspect_ it is what made Junio's earlier push so awful; he
probably pushed up the same commits as part of many different branches,
so he did the same diffs over and over.

So clearing that up seems like an obvious first step, and dulls the pain
to "if submodule recursion is on, the worst case is that you walk all
the new objects you are sending". That's still _a_ traversal, but it's
one we have to do anyway in pack-objects, so it's the same order of
magnitude as the rest of the push[1].

Then you've got two cases: the repo is using submodules at all, or they
are not. The former is an easy case, if we can identify it; we can avoid
the traversal at all, and people who do not use submodules are not
regressed at all.

That leaves people who _are_ using submodules with paying the extra
traversal cost. Not great, but only really a pain when you have a really
big chunk of history to push. It may be lost in the noise for such a
push in more normal circumstances (where bandwidth to push up the
objects dominates, though it is unfortunate that we do not even start
utilizing the bandwidth, the critical resource, until we are done with
the submodule check).

[1] Of course with reachability bitmaps the pack-objects traversal goes
away, but the same cannot be accomplished here (because they do not
store the gitlink sha1s at all, because they do not imply
reachability).

> We need to answer the question: Which submodule commits
> are referenced by a given set of superproject commits.
> 
> This question is advancing a very similar question that we'd
> have to ask in git-gc. In gc we would end up without having to
> worry about a specific set, but rather the all reachable commits
> of the superproject are in the given set.
> 
> So we could solve two issues at the same time if we had a quick
> way to answer this question quickly.
> [...]

I snipped here because your solutions sound complicated (which isn't to
say they're wrong, but that I am not willing to give them a lot of
thought at this time in the evening ;) ).

One opposite approach which appeals to me is not to remove the need for
the traversal, but to make it much faster. E.g., by storing commits in a
form that can be traversed more quickly, and possibly keeping a bitmap
cache of which paths are touched in each commit (I have posted patches
to the list for the former, but have only been considering experimenting
with the latter).

That's _also_ complicated, but it applies to way more things. Including
normal log traversals, path-limiting for diffs, the "counting" traversal
done by pack-object, etc.

And while it is complicated in some ways, it's conceptually simple at
the git data model layer. It's returning the same old answers about
commits and trees, just faster.

Anyway, food for thought.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] push: change submodule default to check

2016-08-24 Thread Stefan Beller
On Wed, Aug 24, 2016 at 12:37 PM, Junio C Hamano  wrote:
> Jeff King  writes:
>
> This seems to be dropped from the list, probably due to no "To:"
> header in the original, which led to "no", "To-header" "on" and
> "input <" on YOUR recipient list, so I am quoting it in full without
> trimming.
>
>> On Wed, Aug 24, 2016 at 10:30:17AM -0700, Stefan Beller wrote:
>>
>>> When working with submodules, it is easy to forget to push the submodules.
>>> The setting 'check', which checks if any existing submodule is present on
>>> at least one remote of the submodule remotes, is designed to prevent this
>>> mistake.
>>>
>>> Flipping the default to check for submodules is safer than the current
>>> default of ignoring submodules while pushing.
>>
>> It is safer, and that's good. But it's also slower, because it requires
>> an extra traversal of all of the pushed commits. And now people will
>> have to pay the price even if they are not using submodules at all.
>>
>> For instance, try this from a checkout of linux.git:
>>
>>   for i in no check; do
>>   rm -rf dst.git
>>   git init --bare dst.git
>>   echo "==> Pushing with submodules=$i"
>>   time git push --recurse-submodules=$i dst.git HEAD
>>   done
>>
>> The second case takes 30-40 seconds longer. This is a full push of
>> history, so it's an extreme case[1], but it's still rather unfortunate.
>>
>> Can we tie this default to some sign that submodules are actually in
>> use? I don't think the presence of .gitmodules is perfect (because you
>> might be in a bare repo, for example, and have just fetched some other
>> history you are relaying), but it may be a good compromise.  I'm
>> envisioning something like "--recurse-submodules=auto-check" which
>> auto-detects common situations (e.g., presence of .gitmodules or
>> .git/modules checkouts) and enables "check", and then setting the
>> default to that in the long run.
>>
>> -Peff
>>
>> [1] Actually, there is another much worse case lurking there. Try:
>>
>>   git push --recurse-submodules=check --mirror dst.git
>>
>> from the kernel. I didn't let it finish, but I'd estimate it would
>> take on the order of 5 hours. The problem is that push feeds each
>> updated ref tip to find_unpushed_submodules(), so we end up walking
>> over the same history over and over.
>>
>> I think it should feed all of the "before" and "after" ref tips it
>> proposes to update to a _single_ revision traversal.
>
> That sounds massively ... broken.  So before even thinking about
> flipping it to default, this needs to be fixed first.
>

I agree. That sounds bad.

However having the --auto-check feels like papering over the
actual problem which to me sounds like a design problem.
However this may be a viable short term solution.

About the design issue:
We need to answer the question: Which submodule commits
are referenced by a given set of superproject commits.

This question is advancing a very similar question that we'd
have to ask in git-gc. In gc we would end up without having to
worry about a specific set, but rather the all reachable commits
of the superproject are in the given set.

So we could solve two issues at the same time if we had a quick
way to answer this question quickly.

And for that I considered introducing a ref in the submodule
(e.g.) refs/superproject/gc_protector, which records both the
superprojects commit as well as the submodule commit
in case the superproject changed the submodule pointer.

I have some rough patches for this, but I did not consider sending
that to the list as the patches are still rough and not quite
fully thought out on the design level.

--
Actually screw this. After an office discussion with Jonathan (cc'd):

1)  We need to have an "alternate refs namespace", which contains
secondary data (as this can be derived from the primary data with
lots of compute). So maybe we need a file similar to the packed refs
file, "superproject-refs" that behaves as if it is storing refs,
but that file
is safe to delete as we know all its contents can be recreated.

2) In this new alternate refs space, we can have
refs/superproject/ in the submodule with the sha1 of the
superproject that points to a commit which is a dirty merge of all
submodule commits, that have no other parents.

e.g.
In the superproject we might have a history of:

  A <- B <- C

with A being origin/master and C our local master.

In the submodule we have:

  D <- E <- F
\
  G

F is the master branch in the submodule, G is a dangling commit.

Now we could have the following git links:

A -> D
B -> G
C -> F

Then the refs/superproject/
would be a merge of F and G.

When pushing in the superproject (A..C) we would need to make sure
the submodule is updated as well, i.e. we'd look at
refs/superproject/ to know we need to advance the remote
submodule history to contain at least G and F.

Then we can do a standard rev 

Re: [PATCHv2] push: change submodule default to check

2016-08-24 Thread Junio C Hamano
Junio C Hamano  writes:

> Jeff King  writes:
>
>> For instance, try this from a checkout of linux.git:
>>
>>   for i in no check; do
>>  rm -rf dst.git
>>  git init --bare dst.git
>>  echo "==> Pushing with submodules=$i"
>>  time git push --recurse-submodules=$i dst.git HEAD
>>   done
>>
>> The second case takes 30-40 seconds longer. This is a full push of
>> history, so it's an extreme case[1], but it's still rather unfortunate.

This actually bit me just now.  github.com/gitster/git.git is
mirror pushed, and it would seem to take forever, so I killed it,
and flipped the configuration variable off.  Which means the feature
won't ever get any testing from me in real life.

People, git.git is not a large project in any sense of the word.

Why wasn't it discovered that "push.recursesubmodules = check" is
UNUSABLE since it was introduced is simply beyond me.

I am mad (which is unusual for me, isn't it? -- I've seen somebody
else saying "I am mad", but I do not think I ever said it here).
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] push: change submodule default to check

2016-08-24 Thread Junio C Hamano
Jeff King  writes:

This seems to be dropped from the list, probably due to no "To:"
header in the original, which led to "no", "To-header" "on" and
"input <" on YOUR recipient list, so I am quoting it in full without
trimming.

> On Wed, Aug 24, 2016 at 10:30:17AM -0700, Stefan Beller wrote:
>
>> When working with submodules, it is easy to forget to push the submodules.
>> The setting 'check', which checks if any existing submodule is present on
>> at least one remote of the submodule remotes, is designed to prevent this
>> mistake.
>> 
>> Flipping the default to check for submodules is safer than the current
>> default of ignoring submodules while pushing.
>
> It is safer, and that's good. But it's also slower, because it requires
> an extra traversal of all of the pushed commits. And now people will
> have to pay the price even if they are not using submodules at all.
>
> For instance, try this from a checkout of linux.git:
>
>   for i in no check; do
>   rm -rf dst.git
>   git init --bare dst.git
>   echo "==> Pushing with submodules=$i"
>   time git push --recurse-submodules=$i dst.git HEAD
>   done
>
> The second case takes 30-40 seconds longer. This is a full push of
> history, so it's an extreme case[1], but it's still rather unfortunate.
>
> Can we tie this default to some sign that submodules are actually in
> use? I don't think the presence of .gitmodules is perfect (because you
> might be in a bare repo, for example, and have just fetched some other
> history you are relaying), but it may be a good compromise.  I'm
> envisioning something like "--recurse-submodules=auto-check" which
> auto-detects common situations (e.g., presence of .gitmodules or
> .git/modules checkouts) and enables "check", and then setting the
> default to that in the long run.
>
> -Peff
>
> [1] Actually, there is another much worse case lurking there. Try:
>
>   git push --recurse-submodules=check --mirror dst.git
>
> from the kernel. I didn't let it finish, but I'd estimate it would
> take on the order of 5 hours. The problem is that push feeds each
> updated ref tip to find_unpushed_submodules(), so we end up walking
> over the same history over and over.
>
> I think it should feed all of the "before" and "after" ref tips it
> proposes to update to a _single_ revision traversal.

That sounds massively ... broken.  So before even thinking about
flipping it to default, this needs to be fixed first.

> I also notice that it uses "--remote=...", which is weird, because
> push knows exactly what it proposes to update, which may be ahead of
> where our refs/remotes/* cache is. Not to mention that we may be
> pushing to a remote for which we do not keep tracking refs at all!
>
> So I'd actually suspect that with your patch, a bare URL like:
>
>   git push https://github.com/peff/linux.git
>
> would do the full 40-second walk, even if I was only pushing up one
> or two objects.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] push: change submodule default to check

2016-08-24 Thread Junio C Hamano
Stefan Beller  writes:

> When working with submodules, it is easy to forget to push the submodules.
> The setting 'check', which checks if any existing submodule is present on
> at least one remote of the submodule remotes, is designed to prevent this
> mistake.
>
> Flipping the default to check for submodules is safer than the current
> default of ignoring submodules while pushing.
>
> Signed-off-by: Stefan Beller 
> ---
>
> Slightly reworded commit message than in v1,
> (https://public-inbox.org/git/20160817204848.8983-1-sbel...@google.com/)
> The patch itself is however the same.
>
> I just push it out now with a new commit message, such that we can easier
> pick it up later for Git 3.0, when changes that change default make more 
> sense.
>
> As said in an earlier message, you could however also argue that this is
> fixing a bug in your workflow, so it might be worth fixing before 3.0
> as well. I dunno.

With this change suddenly landing on the version of Git a user just
updated, the only possible negative effect would be that somebody
who did not configure push.recurseSubmodules suddenly starts getting
an error.  What would the error message say?

What I want you to think about and explain in the justification is
if it gives the user enough information to decide the best course of
action to adjust to the new world order, when the user's expectation
has been that the superproject gets pushed independent from its
submodules.  If the existing error message gives enough information,
explains why Git suddenly started refusing the push is generally a
good thing for the user, tells some minority users (in whose
workflow it is perfectly safe to push out the toplevel independently
from the submodule) how to turn it back to the old behaviour clearly
enough, then this single-liner may be good enough.  Otherwise we may
need a new logic to allow us to see if recurse_submodules that is
set to RECURSE_SUBMODULES_CHECK is because the user explicitly
configured, or because the user did not have any configuration, and
issue a different message in the latter case.



>  builtin/push.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/builtin/push.c b/builtin/push.c
> index 3bb9d6b..479150a 100644
> --- a/builtin/push.c
> +++ b/builtin/push.c
> @@ -22,7 +22,7 @@ static int deleterefs;
>  static const char *receivepack;
>  static int verbosity;
>  static int progress = -1;
> -static int recurse_submodules = RECURSE_SUBMODULES_DEFAULT;
> +static int recurse_submodules = RECURSE_SUBMODULES_CHECK;
>  static enum transport_family family;
>  
>  static struct push_cas_option cas;
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCHv2] push: change submodule default to check

2016-08-24 Thread Stefan Beller
When working with submodules, it is easy to forget to push the submodules.
The setting 'check', which checks if any existing submodule is present on
at least one remote of the submodule remotes, is designed to prevent this
mistake.

Flipping the default to check for submodules is safer than the current
default of ignoring submodules while pushing.

Signed-off-by: Stefan Beller 
---

Slightly reworded commit message than in v1,
(https://public-inbox.org/git/20160817204848.8983-1-sbel...@google.com/)
The patch itself is however the same.

I just push it out now with a new commit message, such that we can easier
pick it up later for Git 3.0, when changes that change default make more sense.

As said in an earlier message, you could however also argue that this is
fixing a bug in your workflow, so it might be worth fixing before 3.0
as well. I dunno.

Thanks,
Stefan

 builtin/push.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/push.c b/builtin/push.c
index 3bb9d6b..479150a 100644
--- a/builtin/push.c
+++ b/builtin/push.c
@@ -22,7 +22,7 @@ static int deleterefs;
 static const char *receivepack;
 static int verbosity;
 static int progress = -1;
-static int recurse_submodules = RECURSE_SUBMODULES_DEFAULT;
+static int recurse_submodules = RECURSE_SUBMODULES_CHECK;
 static enum transport_family family;
 
 static struct push_cas_option cas;
-- 
2.10.0.rc1.1.g1ceb01a

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html