Re: [PATCHv2] push: change submodule default to check
On Wed, Aug 24, 2016 at 03:37:29PM -0700, Stefan Beller wrote: > > That sounds massively ... broken. So before even thinking about > > flipping it to default, this needs to be fixed first. > > I agree. That sounds bad. > > However having the --auto-check feels like papering over the > actual problem which to me sounds like a design problem. > However this may be a viable short term solution. Sort of... I may not have been clear, but there are really a few things going on. One is that the design of find_unpushed_submodules() is just brain-dead. It does one traversal per updated ref, which means a from-scratch mirror is O(nr_of_refs * nr_of_commits). That's just silly, and can easily be fixed behind the scenes to be O(nr_of_commits). And I _suspect_ it is what made Junio's earlier push so awful; he probably pushed up the same commits as part of many different branches, so he did the same diffs over and over. So clearing that up seems like an obvious first step, and dulls the pain to "if submodule recursion is on, the worst case is that you walk all the new objects you are sending". That's still _a_ traversal, but it's one we have to do anyway in pack-objects, so it's the same order of magnitude as the rest of the push[1]. Then you've got two cases: the repo is using submodules at all, or they are not. The former is an easy case, if we can identify it; we can avoid the traversal at all, and people who do not use submodules are not regressed at all. That leaves people who _are_ using submodules with paying the extra traversal cost. Not great, but only really a pain when you have a really big chunk of history to push. It may be lost in the noise for such a push in more normal circumstances (where bandwidth to push up the objects dominates, though it is unfortunate that we do not even start utilizing the bandwidth, the critical resource, until we are done with the submodule check). [1] Of course with reachability bitmaps the pack-objects traversal goes away, but the same cannot be accomplished here (because they do not store the gitlink sha1s at all, because they do not imply reachability). > We need to answer the question: Which submodule commits > are referenced by a given set of superproject commits. > > This question is advancing a very similar question that we'd > have to ask in git-gc. In gc we would end up without having to > worry about a specific set, but rather the all reachable commits > of the superproject are in the given set. > > So we could solve two issues at the same time if we had a quick > way to answer this question quickly. > [...] I snipped here because your solutions sound complicated (which isn't to say they're wrong, but that I am not willing to give them a lot of thought at this time in the evening ;) ). One opposite approach which appeals to me is not to remove the need for the traversal, but to make it much faster. E.g., by storing commits in a form that can be traversed more quickly, and possibly keeping a bitmap cache of which paths are touched in each commit (I have posted patches to the list for the former, but have only been considering experimenting with the latter). That's _also_ complicated, but it applies to way more things. Including normal log traversals, path-limiting for diffs, the "counting" traversal done by pack-object, etc. And while it is complicated in some ways, it's conceptually simple at the git data model layer. It's returning the same old answers about commits and trees, just faster. Anyway, food for thought. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] push: change submodule default to check
On Wed, Aug 24, 2016 at 12:37 PM, Junio C Hamanowrote: > Jeff King writes: > > This seems to be dropped from the list, probably due to no "To:" > header in the original, which led to "no", "To-header" "on" and > "input <" on YOUR recipient list, so I am quoting it in full without > trimming. > >> On Wed, Aug 24, 2016 at 10:30:17AM -0700, Stefan Beller wrote: >> >>> When working with submodules, it is easy to forget to push the submodules. >>> The setting 'check', which checks if any existing submodule is present on >>> at least one remote of the submodule remotes, is designed to prevent this >>> mistake. >>> >>> Flipping the default to check for submodules is safer than the current >>> default of ignoring submodules while pushing. >> >> It is safer, and that's good. But it's also slower, because it requires >> an extra traversal of all of the pushed commits. And now people will >> have to pay the price even if they are not using submodules at all. >> >> For instance, try this from a checkout of linux.git: >> >> for i in no check; do >> rm -rf dst.git >> git init --bare dst.git >> echo "==> Pushing with submodules=$i" >> time git push --recurse-submodules=$i dst.git HEAD >> done >> >> The second case takes 30-40 seconds longer. This is a full push of >> history, so it's an extreme case[1], but it's still rather unfortunate. >> >> Can we tie this default to some sign that submodules are actually in >> use? I don't think the presence of .gitmodules is perfect (because you >> might be in a bare repo, for example, and have just fetched some other >> history you are relaying), but it may be a good compromise. I'm >> envisioning something like "--recurse-submodules=auto-check" which >> auto-detects common situations (e.g., presence of .gitmodules or >> .git/modules checkouts) and enables "check", and then setting the >> default to that in the long run. >> >> -Peff >> >> [1] Actually, there is another much worse case lurking there. Try: >> >> git push --recurse-submodules=check --mirror dst.git >> >> from the kernel. I didn't let it finish, but I'd estimate it would >> take on the order of 5 hours. The problem is that push feeds each >> updated ref tip to find_unpushed_submodules(), so we end up walking >> over the same history over and over. >> >> I think it should feed all of the "before" and "after" ref tips it >> proposes to update to a _single_ revision traversal. > > That sounds massively ... broken. So before even thinking about > flipping it to default, this needs to be fixed first. > I agree. That sounds bad. However having the --auto-check feels like papering over the actual problem which to me sounds like a design problem. However this may be a viable short term solution. About the design issue: We need to answer the question: Which submodule commits are referenced by a given set of superproject commits. This question is advancing a very similar question that we'd have to ask in git-gc. In gc we would end up without having to worry about a specific set, but rather the all reachable commits of the superproject are in the given set. So we could solve two issues at the same time if we had a quick way to answer this question quickly. And for that I considered introducing a ref in the submodule (e.g.) refs/superproject/gc_protector, which records both the superprojects commit as well as the submodule commit in case the superproject changed the submodule pointer. I have some rough patches for this, but I did not consider sending that to the list as the patches are still rough and not quite fully thought out on the design level. -- Actually screw this. After an office discussion with Jonathan (cc'd): 1) We need to have an "alternate refs namespace", which contains secondary data (as this can be derived from the primary data with lots of compute). So maybe we need a file similar to the packed refs file, "superproject-refs" that behaves as if it is storing refs, but that file is safe to delete as we know all its contents can be recreated. 2) In this new alternate refs space, we can have refs/superproject/ in the submodule with the sha1 of the superproject that points to a commit which is a dirty merge of all submodule commits, that have no other parents. e.g. In the superproject we might have a history of: A <- B <- C with A being origin/master and C our local master. In the submodule we have: D <- E <- F \ G F is the master branch in the submodule, G is a dangling commit. Now we could have the following git links: A -> D B -> G C -> F Then the refs/superproject/ would be a merge of F and G. When pushing in the superproject (A..C) we would need to make sure the submodule is updated as well, i.e. we'd look at refs/superproject/ to know we need to advance the remote submodule history to contain at least G and F. Then we can do a standard rev
Re: [PATCHv2] push: change submodule default to check
Junio C Hamanowrites: > Jeff King writes: > >> For instance, try this from a checkout of linux.git: >> >> for i in no check; do >> rm -rf dst.git >> git init --bare dst.git >> echo "==> Pushing with submodules=$i" >> time git push --recurse-submodules=$i dst.git HEAD >> done >> >> The second case takes 30-40 seconds longer. This is a full push of >> history, so it's an extreme case[1], but it's still rather unfortunate. This actually bit me just now. github.com/gitster/git.git is mirror pushed, and it would seem to take forever, so I killed it, and flipped the configuration variable off. Which means the feature won't ever get any testing from me in real life. People, git.git is not a large project in any sense of the word. Why wasn't it discovered that "push.recursesubmodules = check" is UNUSABLE since it was introduced is simply beyond me. I am mad (which is unusual for me, isn't it? -- I've seen somebody else saying "I am mad", but I do not think I ever said it here). -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] push: change submodule default to check
Jeff Kingwrites: This seems to be dropped from the list, probably due to no "To:" header in the original, which led to "no", "To-header" "on" and "input <" on YOUR recipient list, so I am quoting it in full without trimming. > On Wed, Aug 24, 2016 at 10:30:17AM -0700, Stefan Beller wrote: > >> When working with submodules, it is easy to forget to push the submodules. >> The setting 'check', which checks if any existing submodule is present on >> at least one remote of the submodule remotes, is designed to prevent this >> mistake. >> >> Flipping the default to check for submodules is safer than the current >> default of ignoring submodules while pushing. > > It is safer, and that's good. But it's also slower, because it requires > an extra traversal of all of the pushed commits. And now people will > have to pay the price even if they are not using submodules at all. > > For instance, try this from a checkout of linux.git: > > for i in no check; do > rm -rf dst.git > git init --bare dst.git > echo "==> Pushing with submodules=$i" > time git push --recurse-submodules=$i dst.git HEAD > done > > The second case takes 30-40 seconds longer. This is a full push of > history, so it's an extreme case[1], but it's still rather unfortunate. > > Can we tie this default to some sign that submodules are actually in > use? I don't think the presence of .gitmodules is perfect (because you > might be in a bare repo, for example, and have just fetched some other > history you are relaying), but it may be a good compromise. I'm > envisioning something like "--recurse-submodules=auto-check" which > auto-detects common situations (e.g., presence of .gitmodules or > .git/modules checkouts) and enables "check", and then setting the > default to that in the long run. > > -Peff > > [1] Actually, there is another much worse case lurking there. Try: > > git push --recurse-submodules=check --mirror dst.git > > from the kernel. I didn't let it finish, but I'd estimate it would > take on the order of 5 hours. The problem is that push feeds each > updated ref tip to find_unpushed_submodules(), so we end up walking > over the same history over and over. > > I think it should feed all of the "before" and "after" ref tips it > proposes to update to a _single_ revision traversal. That sounds massively ... broken. So before even thinking about flipping it to default, this needs to be fixed first. > I also notice that it uses "--remote=...", which is weird, because > push knows exactly what it proposes to update, which may be ahead of > where our refs/remotes/* cache is. Not to mention that we may be > pushing to a remote for which we do not keep tracking refs at all! > > So I'd actually suspect that with your patch, a bare URL like: > > git push https://github.com/peff/linux.git > > would do the full 40-second walk, even if I was only pushing up one > or two objects. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCHv2] push: change submodule default to check
Stefan Bellerwrites: > When working with submodules, it is easy to forget to push the submodules. > The setting 'check', which checks if any existing submodule is present on > at least one remote of the submodule remotes, is designed to prevent this > mistake. > > Flipping the default to check for submodules is safer than the current > default of ignoring submodules while pushing. > > Signed-off-by: Stefan Beller > --- > > Slightly reworded commit message than in v1, > (https://public-inbox.org/git/20160817204848.8983-1-sbel...@google.com/) > The patch itself is however the same. > > I just push it out now with a new commit message, such that we can easier > pick it up later for Git 3.0, when changes that change default make more > sense. > > As said in an earlier message, you could however also argue that this is > fixing a bug in your workflow, so it might be worth fixing before 3.0 > as well. I dunno. With this change suddenly landing on the version of Git a user just updated, the only possible negative effect would be that somebody who did not configure push.recurseSubmodules suddenly starts getting an error. What would the error message say? What I want you to think about and explain in the justification is if it gives the user enough information to decide the best course of action to adjust to the new world order, when the user's expectation has been that the superproject gets pushed independent from its submodules. If the existing error message gives enough information, explains why Git suddenly started refusing the push is generally a good thing for the user, tells some minority users (in whose workflow it is perfectly safe to push out the toplevel independently from the submodule) how to turn it back to the old behaviour clearly enough, then this single-liner may be good enough. Otherwise we may need a new logic to allow us to see if recurse_submodules that is set to RECURSE_SUBMODULES_CHECK is because the user explicitly configured, or because the user did not have any configuration, and issue a different message in the latter case. > builtin/push.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/builtin/push.c b/builtin/push.c > index 3bb9d6b..479150a 100644 > --- a/builtin/push.c > +++ b/builtin/push.c > @@ -22,7 +22,7 @@ static int deleterefs; > static const char *receivepack; > static int verbosity; > static int progress = -1; > -static int recurse_submodules = RECURSE_SUBMODULES_DEFAULT; > +static int recurse_submodules = RECURSE_SUBMODULES_CHECK; > static enum transport_family family; > > static struct push_cas_option cas; -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCHv2] push: change submodule default to check
When working with submodules, it is easy to forget to push the submodules. The setting 'check', which checks if any existing submodule is present on at least one remote of the submodule remotes, is designed to prevent this mistake. Flipping the default to check for submodules is safer than the current default of ignoring submodules while pushing. Signed-off-by: Stefan Beller--- Slightly reworded commit message than in v1, (https://public-inbox.org/git/20160817204848.8983-1-sbel...@google.com/) The patch itself is however the same. I just push it out now with a new commit message, such that we can easier pick it up later for Git 3.0, when changes that change default make more sense. As said in an earlier message, you could however also argue that this is fixing a bug in your workflow, so it might be worth fixing before 3.0 as well. I dunno. Thanks, Stefan builtin/push.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/builtin/push.c b/builtin/push.c index 3bb9d6b..479150a 100644 --- a/builtin/push.c +++ b/builtin/push.c @@ -22,7 +22,7 @@ static int deleterefs; static const char *receivepack; static int verbosity; static int progress = -1; -static int recurse_submodules = RECURSE_SUBMODULES_DEFAULT; +static int recurse_submodules = RECURSE_SUBMODULES_CHECK; static enum transport_family family; static struct push_cas_option cas; -- 2.10.0.rc1.1.g1ceb01a -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html