Re: How to still kill git fetch with too many refs
On Mon, Jul 1, 2013 at 10:01 PM, Jeff King wrote:
> On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:
>
>> I replicated your test setup, and the problem is that we have many
>> common objects on both sides during the ref negotiation. So we end up in
>> rev_list_push for each one, which has the same O(n^2) behavior.
>> Switching it to just sort at the end is not trivial; we first insert all
>> of the objects, but then we actually walk the parents, pushing onto the
>> list as we go. So I think we'd want a better data structure (like a
>> priority queue).
>
> Like the patch below,

Looks obviously correct to me since I've got essentially the same patch
sitting in my local repo. :b

I've got the patch coming that fixes the same problem on the push side
of things and provides the same order of magnitude improvement.

-Brandon
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How to still kill git fetch with too many refs
On Tuesday, July 02, 2013 03:24:14 am Michael Haggerty wrote:
> > git rev-list HEAD | for nn in $(seq 0 100) ; do for c
> > in $(seq 0 1) ; do read sha ; echo $sha
> > refs/c/$nn/$c$nn ; done ; done > .git/packed-refs
>
> I believe this generates a packed-refs file that is not
> sorted lexicographically by refname, whereas all
> Git-generated packed-refs files are sorted.

Yes, you are indeed correct. I was attempting to be too clever with my
sharding, I guess. Thanks.

> There are
> some optimizations in refs.c for adding references in
> order that might therefore be circumvented by your
> unsorted file. Please try sorting the file by refname
> and see if that helps. (You can do so by deleting one
> of the packed references; then git will sort the
> remainder while rewriting the file.)

A simple "git pack-refs" seems to clean it up.

The original test did complete in ~77mins last night. A rerun with a
sorted file takes ~61mins.

-Martin

PS: This test was performed with git version 1.8.2.1 on linux
2.6.32-37-generic #81-Ubuntu SMP

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
Re: How to still kill git fetch with too many refs
On 07/02/2013 05:02 AM, Martin Fick wrote:
> I have often reported problems with git fetch when there are
> many refs in a repo, and I have been pleasantly surprised
> how many problems I reported were so quickly fixed. :) With
> time, others have created various synthetic test cases to
> ensure that git can handle many many refs. A simple
> synthetic test case with 1M refs all pointing to the same
> sha1 seems to be easily handled by git these days. However,
> in our experience with our internal git repo, we still have
> performance issues related to having too many refs, in our
> kernel/msm instance we have around 400K.
>
> When I tried the simple synthetic test case and could not
> reproduce bad results, so I tried something just a little
> more complex and was able to get atrocious results!!!
> Basically, I generate a packed-refs files with many refs
> which each point to a different sha1. To get a list of
> valid but unique sha1s for the repo, I simply used rev-list.
> The result, a copy of linus' repo with a million unique
> valid refs and a git fetch of a single updated ref taking a
> very long time (55mins and it did not complete yet). Note,
> with 100K refs it completes in about 2m40s. It is likely
> not linear since 2m40s * 10 would be ~26m (but the
> difference could also just be how the data in the sha1s are
> ordered).
>
> Here is my small reproducible test case for this issue:
>
> git clone
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> cp -rp linux linux.1Mrefs-revlist
>
> cd linux
> echo "Hello" > hello ; git add hello ; git ci -a -m 'hello'
> cd ..
>
> cd linux.1Mrefs-revlist
> git rev-list HEAD | for nn in $(seq 0 100) ; do for c in
> $(seq 0 1) ; do read sha ; echo $sha refs/c/$nn/$c$nn ;
> done ; done > .git/packed-refs

I believe this generates a packed-refs file that is not sorted
lexicographically by refname, whereas all Git-generated packed-refs
files are sorted.

There are some optimizations in refs.c for adding references in order
that might therefore be circumvented by your unsorted file. Please try
sorting the file by refname and see if that helps. (You can do so by
deleting one of the packed references; then git will sort the
remainder while rewriting the file.)

Michael

--
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/
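Michael's point about packed-refs ordering can be sketched in a few lines. This is illustrative Python, not git code; the function names are invented, and the sketch simplifies the file format (it ignores the peeled `^` lines that must stay immediately after their tag's entry):

```python
def packed_refs_sorted(lines):
    """Check that the ref lines of a packed-refs file are sorted by refname."""
    refnames = [l.split(' ', 1)[1].rstrip('\n')
                for l in lines
                if l[0] not in '#^']   # skip the header and peeled '^' lines
    return refnames == sorted(refnames)

def sort_packed_refs(lines):
    """Rewrite ref lines in refname order, keeping header lines first.
    Simplified: drops peeled '^' lines instead of moving them with their tag."""
    headers = [l for l in lines if l.startswith('#')]
    refs = [l for l in lines if l[0] not in '#^']
    return headers + sorted(refs, key=lambda l: l.split(' ', 1)[1])

lines = [
    "# pack-refs with: peeled\n",
    "1111111111111111111111111111111111111111 refs/c/9/z\n",
    "2222222222222222222222222222222222222222 refs/c/1/a\n",
]
assert not packed_refs_sorted(lines)            # rev-list order, not refname order
assert packed_refs_sorted(sort_packed_refs(lines))
```

In practice there is no need for any of this: as noted in the follow-ups, running `git pack-refs` rewrites the file in sorted order.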
Re: How to still kill git fetch with too many refs
On Mon, Jul 01, 2013 at 10:19:51PM -0700, Junio C Hamano wrote:
> > Like the patch below, which is built on top of next (which has Junio's
> > prio_queue implementation), and has both the priority queue fix for
> > rev_list_push and the mark_complete sort-at-the-end fix.
>
> Wow, I saw "160 lines" in my MUA which scared me a bit until I
> opened it to realize 40% is discussion and most of the remaining
> lines are context around single liners.
>
> It just looks too easy/simple, but the result looks correct, at
> least from a cursory read.
>
> Good job ;-)

Thanks. :) I'm splitting it out into readable patches now.

At first I was made a bit nervous by the "popping" behavior I described
as "oddity #2" earlier. But the more I look at it, the more I am
convinced it is simply a bug that we happen to fix along the way.

Patches in a few minutes.

-Peff
Re: How to still kill git fetch with too many refs
Jeff King writes:

> On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:
>
>> I replicated your test setup, and the problem is that we have many
>> common objects on both sides during the ref negotiation. So we end up in
>> rev_list_push for each one, which has the same O(n^2) behavior.
>> Switching it to just sort at the end is not trivial; we first insert all
>> of the objects, but then we actually walk the parents, pushing onto the
>> list as we go. So I think we'd want a better data structure (like a
>> priority queue).
>
> Like the patch below, which is built on top of next (which has Junio's
> prio_queue implementation), and has both the priority queue fix for
> rev_list_push and the mark_complete sort-at-the-end fix.

Wow, I saw "160 lines" in my MUA which scared me a bit until I
opened it to realize 40% is discussion and most of the remaining
lines are context around single liners.

It just looks too easy/simple, but the result looks correct, at
least from a cursory read.

Good job ;-)
Re: How to still kill git fetch with too many refs
On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:

> I replicated your test setup, and the problem is that we have many
> common objects on both sides during the ref negotiation. So we end up in
> rev_list_push for each one, which has the same O(n^2) behavior.
> Switching it to just sort at the end is not trivial; we first insert all
> of the objects, but then we actually walk the parents, pushing onto the
> list as we go. So I think we'd want a better data structure (like a
> priority queue).

Like the patch below, which is built on top of next (which has Junio's
prio_queue implementation), and has both the priority queue fix for
rev_list_push and the mark_complete sort-at-the-end fix.

With this, your test case completes in a few seconds (no clue if it
would ever have finished otherwise; I never let it run for more than a
minute or two, but I did confirm that it was spending all of its time
in commit_list_insert_by_date).

I didn't look too hard at the actual semantics, so there might be a way
to not mark so many commits in the first place. This is just a mindless
replacement of an O(n^2) list insertion with the queue. I did note two
oddities, though:

  1. In the original, we never free the commit_list pointers when we
     pop them (or when we clear the list in find_common). So I think
     the existing code was leaking memory, and my version does not.

  2. In the original version of get_rev, we "peek" at the head commit
     of rev_list:

        commit = rev_list->item;

     Then we look at the parents of that commit, and possibly push
     them onto rev_list:

        while (parents) {
                if (!(parents->item->object.flags & SEEN))
                        rev_list_push(parents->item, mark);
                ...
        }

     and then finally, "pop" the commit off the list:

        rev_list = rev_list->next;

     But what guarantee do we have that the commit we just processed
     is still at the front of the list? If the timestamps are
     monotonic, then the parent must come after the descendant and
     therefore will not have gotten pushed onto the front of the list.
     But in the face of clock skew, this is not the case, and we will
     have just skipped over the parent with our "pop". So unless I am
     missing something very clever, I think this was a bug (albeit a
     difficult one to have matter). And it's also fixed by using
     prio_queue (which does the peek/pop together at the top of the
     function).

diff --git a/commit.c b/commit.c
index 521e49c..ebc0eea 100644
--- a/commit.c
+++ b/commit.c
@@ -581,7 +581,7 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
 	return 0;
 }
 
-static int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
 	const struct commit *a = a_, *b = b_;
 	/* newer commits with larger date first */
diff --git a/commit.h b/commit.h
index 4d452dc..18a5234 100644
--- a/commit.h
+++ b/commit.h
@@ -254,4 +254,6 @@ extern void check_commit_signature(const struct commit* commit, struct signature
  */
 extern void check_commit_signature(const struct commit* commit, struct signature_check *sigc);
 
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+
 #endif /* COMMIT_H */
diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..6684348 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -11,6 +11,7 @@
 #include "run-command.h"
 #include "transport.h"
 #include "version.h"
+#include "prio-queue.h"
 
 static int transfer_unpack_limit = -1;
 static int fetch_unpack_limit = -1;
@@ -37,7 +38,7 @@ static int marked;
  */
 #define MAX_IN_VAIN 256
 
-static struct commit_list *rev_list;
+static struct prio_queue rev_list = { compare_commits_by_commit_date };
 static int non_common_revs, multi_ack, use_sideband, allow_tip_sha1_in_want;
 
 static void rev_list_push(struct commit *commit, int mark)
@@ -49,7 +50,7 @@ static void rev_list_push(struct commit *commit, int mark)
 		if (parse_commit(commit))
 			return;
 
-		commit_list_insert_by_date(commit, &rev_list);
+		prio_queue_put(&rev_list, commit);
 
 		if (!(commit->object.flags & COMMON))
 			non_common_revs++;
@@ -122,10 +123,10 @@ static const unsigned char *get_rev(void)
 	unsigned int mark;
 	struct commit_list *parents;
 
-	if (rev_list == NULL || non_common_revs == 0)
+	if (rev_list.nr == 0 || non_common_revs == 0)
 		return NULL;
 
-	commit = rev_list->item;
+	commit = prio_queue_get(&rev_list);
 	if (!commit->object.parsed)
 		parse_commit(commit);
 	parents = commit->parents;
@@ -152,8 +153,6 @@ st
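For readers unfamiliar with the prio-queue API used in the patch above: git's prio_queue (added in b4b594a) is a binary heap ordered by a caller-supplied comparator. The following is a rough Python model of the put/get calls, purely for illustration; the real implementation is C, in prio-queue.c, and differs in detail:

```python
class PrioQueue:
    """Toy model of git's prio_queue: a binary heap ordered by a
    comparator cmp(a, b), where a negative result means 'a comes first'."""

    def __init__(self, cmp):
        self.cmp = cmp
        self.array = []

    @property
    def nr(self):                      # like the .nr member tested in get_rev
        return len(self.array)

    def put(self, item):               # like prio_queue_put: O(log n) sift-up
        a = self.array
        a.append(item)
        i = len(a) - 1
        while i and self.cmp(a[i], a[(i - 1) // 2]) < 0:
            a[i], a[(i - 1) // 2] = a[(i - 1) // 2], a[i]
            i = (i - 1) // 2

    def get(self):                     # like prio_queue_get: pop top, sift-down
        a = self.array
        if not a:
            return None
        top, last = a[0], a.pop()
        if a:
            a[0] = last
            i = 0
            while True:
                child = 2 * i + 1
                if child + 1 < len(a) and self.cmp(a[child + 1], a[child]) < 0:
                    child += 1
                if child >= len(a) or self.cmp(a[i], a[child]) <= 0:
                    break
                a[i], a[child] = a[child], a[i]
                i = child
        return top

# Newer commits (larger dates) pop first, matching the ordering of
# compare_commits_by_commit_date; here the items are bare timestamps.
q = PrioQueue(lambda a, b: b - a)
for date in [5, 1, 9, 3, 9]:
    q.put(date)
out = []
while q.nr:
    out.append(q.get())
assert out == [9, 9, 5, 3, 1]
```

Note how get() does the peek and the pop in one step at the top of the loop, which is what closes the clock-skew window described as oddity #2.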
Re: How to still kill git fetch with too many refs
On Tue, Jul 02, 2013 at 12:07:58AM -0400, Jeff King wrote:

> And yet another alternative would be to keep the list unsorted during
> the mark_complete calls, and then sort it at the end. Like this:

Amusingly, I seem to have posted the exact same patch last year:

  http://article.gmane.org/gmane.comp.version-control.git/194939

but never polished it up. It does cut a little time off of the initial
ref-loading that fetch-pack does (especially when we do not end up
transferring any objects at all). But it does not solve your problem.

I replicated your test setup, and the problem is that we have many
common objects on both sides during the ref negotiation. So we end up in
rev_list_push for each one, which has the same O(n^2) behavior.
Switching it to just sort at the end is not trivial; we first insert all
of the objects, but then we actually walk the parents, pushing onto the
list as we go. So I think we'd want a better data structure (like a
priority queue).

-Peff
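The quadratic cost described above comes from commit_list_insert_by_date scanning the date-sorted list on every rev_list_push. The contrast with a heap can be sketched as follows (illustrative Python, not git's C; the "commits" are bare timestamps):

```python
import heapq
import random

def push_sorted(lst, date):
    # O(n) per insert: scan the date-sorted list for the right spot,
    # like commit_list_insert_by_date (newest, i.e. largest, date first).
    i = 0
    while i < len(lst) and lst[i] >= date:
        i += 1
    lst.insert(i, date)

def push_heap(heap, date):
    # O(log n) per insert, like prio_queue_put; dates are negated so
    # Python's min-heap pops the newest date first.
    heapq.heappush(heap, -date)

random.seed(1)
dates = [random.randrange(10**9) for _ in range(500)]

lst, heap = [], []
for d in dates:
    push_sorted(lst, d)   # n inserts -> O(n^2) total
    push_heap(heap, d)    # n inserts -> O(n log n) total

# Both structures hand commits back newest-first.
popped = [-heapq.heappop(heap) for _ in range(len(heap))]
assert lst == popped == sorted(dates, reverse=True)
```

With a million refs the difference between n^2 and n log n list operations is several orders of magnitude, which matches the minutes-versus-seconds timings reported in this thread.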
Re: How to still kill git fetch with too many refs
On Mon, Jul 01, 2013 at 09:02:31PM -0600, Martin Fick wrote:

> A simple synthetic test case with 1M refs all pointing to the same
> sha1 seems to be easily handled by git these days. However, in our
> experience with our internal git repo, we still have performance
> issues related to having too many refs, in our kernel/msm instance we
> have around 400K.

I'm not too surprised. There's O(n^2) behavior in fetch-pack's
mark_complete function, as it adds each of the N ref tips to a
commit_list, sorting by date (so each insertion is O(n)).

I posted two alternative patches in May of 2011. The first simply
avoids adding duplicate objects, which is simple and covers many
real-world cases (e.g., an "alternates" repository which has a bunch
of copies of the same tags, one per fork). The second one switches the
commit_list out for a heap-based priority queue.

We ended up taking the first (as ea5f220), since it was trivial and
obviously correct, but didn't bother with the second since:

  1. There had been no real-world reports of it.

  2. While in theory a priority queue implementation could be used in
     other spots, too, it ended up being a pain to use, as most of the
     callers wanted list-like splicing.

You can see the original here:

  http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005

Though it probably doesn't apply cleanly anymore. However, I've kept
it rebased over the years at:

  git://github.com/peff/git.git jk/fast-commit-list

Junio recently added a priority queue implementation in b4b594a
(prio-queue: priority queue of pointers to structs, 2013-06-06), which
is currently in next. So a modern version of that series would build
on top of that, rather than my priority queue.

And yet another alternative would be to keep the list unsorted during
the mark_complete calls, and then sort it at the end. Like this:

diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..4df8abd 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
 		struct commit *commit = (struct commit *)o;
 		if (!(commit->object.flags & COMPLETE)) {
 			commit->object.flags |= COMPLETE;
-			commit_list_insert_by_date(commit, &complete);
+			commit_list_insert(commit, &complete);
 		}
 	}
 	return 0;
@@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args,
 	if (!args->depth) {
 		for_each_ref(mark_complete, NULL);
 		for_each_alternate_ref(mark_alternate_complete, NULL);
+		commit_list_sort_by_date(&complete);
 		if (cutoff)
 			mark_recent_complete_commits(args, cutoff);
 	}

(If you're wondering why we didn't do this trivial bit at the time, it
was because back then we did not yet have René's nice linked-list
mergesort that backs commit_list_sort_by_date.)

> The result, a copy of linus' repo with a million unique
> valid refs and a git fetch of a single updated ref taking a
> very long time (55mins and it did not complete yet). Note,
> with 100K refs it completes in about 2m40s. It is likely
> not linear since 2m40s * 10 would be ~26m (but the
> difference could also just be how the data in the sha1s are
> ordered).

That sounds like the O(n^2) problem. My timings back then with 100K
refs were 1-2 minutes. Does the patch above fix it for you?

-Peff
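The two mark_complete strategies in the patch above produce the same final list: inserting each ref tip into a date-sorted list as you go, versus appending cheaply and sorting once at the end. A sketch of that equivalence (illustrative Python; git's version sorts a C commit_list with a linked-list mergesort, not an array):

```python
import random

def mark_complete_sorted_insert(tips):
    # Old behavior: insert every ref tip into a date-sorted list,
    # O(n) per insert, O(n^2) total for n tips.
    complete = []
    for date in tips:
        i = 0
        while i < len(complete) and complete[i] >= date:
            i += 1
        complete.insert(i, date)
    return complete

def mark_complete_sort_at_end(tips):
    # Patched behavior: O(1) unsorted inserts, then one O(n log n) sort,
    # like commit_list_insert followed by commit_list_sort_by_date.
    complete = list(tips)
    complete.sort(reverse=True)   # newest (largest) date first
    return complete

random.seed(2)
tips = [random.randrange(10**9) for _ in range(1000)]
assert mark_complete_sorted_insert(tips) == mark_complete_sort_at_end(tips)
```

Because the fetch-pack code only consumes the completed list after all tips have been added, deferring the sort changes nothing observable; it only removes the quadratic insertion cost.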
How to still kill git fetch with too many refs
I have often reported problems with git fetch when there are many refs
in a repo, and I have been pleasantly surprised how many problems I
reported were so quickly fixed. :) With time, others have created
various synthetic test cases to ensure that git can handle many many
refs. A simple synthetic test case with 1M refs all pointing to the
same sha1 seems to be easily handled by git these days. However, in
our experience with our internal git repo, we still have performance
issues related to having too many refs; in our kernel/msm instance we
have around 400K.

When I tried the simple synthetic test case, I could not reproduce bad
results, so I tried something just a little more complex and was able
to get atrocious results!!! Basically, I generate a packed-refs file
with many refs which each point to a different sha1. To get a list of
valid but unique sha1s for the repo, I simply used rev-list. The
result: a copy of linus' repo with a million unique valid refs, and a
git fetch of a single updated ref taking a very long time (55mins and
it did not complete yet). Note, with 100K refs it completes in about
2m40s. It is likely not linear, since 2m40s * 10 would be ~26m (but
the difference could also just be how the data in the sha1s are
ordered).

Here is my small reproducible test case for this issue:

  git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
  cp -rp linux linux.1Mrefs-revlist

  cd linux
  echo "Hello" > hello ; git add hello ; git ci -a -m 'hello'
  cd ..

  cd linux.1Mrefs-revlist
  git rev-list HEAD | for nn in $(seq 0 100) ; do for c in $(seq 0 1) ; do read sha ; echo $sha refs/c/$nn/$c$nn ; done ; done > .git/packed-refs

  time git fetch file:///$(dirname $PWD)/linux refs/heads/master

Any insights as to why it is so slow, and how we could possibly speed
it up?

Thanks,

-Martin

PS: My tests were performed with git version 1.8.2.1 on linux
2.6.32-37-generic #81-Ubuntu SMP

--
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation