Re: How to still kill git fetch with too many refs

2013-07-02 Thread Brandon Casey
On Mon, Jul 1, 2013 at 10:01 PM, Jeff King  wrote:
> On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:
>
>> I replicated your test setup, and the problem is that we have many
>> common objects on both sides during the ref negotiation. So we end up in
>> rev_list_push for each one, which has the same O(n^2) behavior.
>> Switching it to just sort at the end is not trivial; we first insert all
>> of the objects, but then we actually walk the parents, pushing onto the
>> list as we go. So I think we'd want a better data structure (like a
>> priority queue).
>
> Like the patch below,



Looks obviously correct to me since I've got essentially the same
patch sitting in my local repo. :b

I've got the patch coming that fixes the same problem on the push side
of things and provides the same order of magnitude improvement.

-Brandon
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to still kill git fetch with too many refs

2013-07-02 Thread Martin Fick
On Tuesday, July 02, 2013 03:24:14 am Michael Haggerty wrote:
> > git rev-list HEAD | for nn in $(seq 0 100) ; do for c
> > in $(seq 0 1) ; do  read sha ; echo $sha
> > refs/c/$nn/$c$nn ; done ; done > .git/packed-refs
> 
> I believe this generates a packed-refs file that is not
> sorted lexicographically by refname, whereas all
> Git-generated packed-refs files are sorted.  


Yes, you are indeed correct.  I was attempting to be too 
clever with my sharding, I guess.  Thanks.

> There are
> some optimizations in refs.c for adding references in
> order that might therefore be circumvented by your
> unsorted file.  Please try sorting the file by refname
> and see if that helps.  (You can do so by deleting one
> of the packed references; then git will sort the
> remainder while rewriting the file.)

A simple git pack-refs seems to clean it up.

The original test did complete in ~77mins last night.  A 
rerun with a sorted file takes ~61mins.

-Martin


PS: This test was performed with git version 1.8.2.1 on 
linux 2.6.32-37-generic #81-Ubuntu SMP 

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation
 


Re: How to still kill git fetch with too many refs

2013-07-02 Thread Michael Haggerty
On 07/02/2013 05:02 AM, Martin Fick wrote:
> I have often reported problems with git fetch when there are 
> many refs in a repo, and I have been pleasantly surprised 
> how many problems I reported were so quickly fixed. :) With 
> time, others have created various synthetic test cases to 
> ensure that git can handle many many refs.  A simple 
> synthetic test case with 1M refs all pointing to the same 
> sha1 seems to be easily handled by git these days.  However, 
> in our experience with our internal git repo, we still have 
> performance issues related to having too many refs, in our 
> kernel/msm instance we have around 400K.
> 
> When I tried the simple synthetic test case, I could not 
> reproduce bad results, so I tried something just a little 
> more complex and was able to get atrocious results!!! 
> Basically, I generate a packed-refs file with many refs 
> which each point to a different sha1.  To get a list of 
> valid but unique sha1s for the repo, I simply used rev-list.  
> The result, a copy of linus' repo with a million unique 
> valid refs and a git fetch of a single updated ref taking a 
> very long time (55mins and it had not completed yet).  Note, 
> with 100K refs it completes in about 2m40s.  It is likely 
> not linear since 2m40s * 10 would be ~26m (but the 
> difference could also just be how the data in the sha1s are 
> ordered).
> 
> 
> Here is my small reproducible test case for this issue:
> 
> git clone 
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> cp -rp linux linux.1Mrefs-revlist
> 
> cd linux
echo "Hello" > hello ; git add hello ; git commit -a -m 'hello'
> cd ..
> 
> cd linux.1Mrefs-revlist
> git rev-list HEAD | for nn in $(seq 0 100) ; do for c in 
> $(seq 0 1) ; do  read sha ; echo $sha refs/c/$nn/$c$nn ; 
> done ; done > .git/packed-refs

I believe this generates a packed-refs file that is not sorted
lexicographically by refname, whereas all Git-generated packed-refs
files are sorted.  There are some optimizations in refs.c for adding
references in order that might therefore be circumvented by your
unsorted file.  Please try sorting the file by refname and see if that
helps.  (You can do so by deleting one of the packed references; then
git will sort the remainder while rewriting the file.)
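To see what "in order" buys the reader: when every refname compares greater than its predecessor, entries can be appended as-is with no re-sort, while one out-of-order entry forces extra work. The check below is a toy illustration in C; the function name and layout are mine, not refs.c's actual code:

```c
#include <string.h>

/* Toy sortedness check for a list of refnames, illustrating the
 * property a Git-generated packed-refs file guarantees: each name
 * is lexicographically >= the one before it. */
static int is_sorted(const char **refs, int n)
{
	int i;
	for (i = 1; i < n; i++)
		if (strcmp(refs[i - 1], refs[i]) > 0)
			return 0;	/* out of order: re-sort needed */
	return 1;
}
```

The sharded generator above interleaves `refs/c/0/...` with `refs/c/1/...` and so on, which is exactly the out-of-order case this check rejects.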

Michael

-- 
Michael Haggerty
mhag...@alum.mit.edu
http://softwareswirl.blogspot.com/


Re: How to still kill git fetch with too many refs

2013-07-01 Thread Jeff King
On Mon, Jul 01, 2013 at 10:19:51PM -0700, Junio C Hamano wrote:

> > Like the patch below, which is built on top of next (which has Junio's
> > prio_queue implementation), and has both the priority queue fix for
> > rev_list_push and the mark_complete sort-at-the-end fix.
> 
> Wow, I saw "160 lines" in my MUA which scared me a bit until I
> opened it to realize 40% is discussion and most of the remaining
> lines are context around single liners.
> 
> It just looks too easy/simple, but the result looks correct, at
> least from a cursory read.
> 
> Good job ;-)

Thanks. :)

I'm splitting it out into readable patches now. At first I was made a
bit nervous by the "popping" behavior I described as "oddity #2"
earlier.  But the more I look at it, the more I am convinced it is
simply a bug that we can happen to fix along the way.

Patches in a few minutes.

-Peff


Re: How to still kill git fetch with too many refs

2013-07-01 Thread Junio C Hamano
Jeff King  writes:

> On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:
>
>> I replicated your test setup, and the problem is that we have many
>> common objects on both sides during the ref negotiation. So we end up in
>> rev_list_push for each one, which has the same O(n^2) behavior.
>> Switching it to just sort at the end is not trivial; we first insert all
>> of the objects, but then we actually walk the parents, pushing onto the
>> list as we go. So I think we'd want a better data structure (like a
>> priority queue).
>
> Like the patch below, which is built on top of next (which has Junio's
> prio_queue implementation), and has both the priority queue fix for
> rev_list_push and the mark_complete sort-at-the-end fix.

Wow, I saw "160 lines" in my MUA which scared me a bit until I
opened it to realize 40% is discussion and most of the remaining
lines are context around single liners.

It just looks too easy/simple, but the result looks correct, at
least from a cursory read.

Good job ;-)


Re: How to still kill git fetch with too many refs

2013-07-01 Thread Jeff King
On Tue, Jul 02, 2013 at 12:41:51AM -0400, Jeff King wrote:

> I replicated your test setup, and the problem is that we have many
> common objects on both sides during the ref negotiation. So we end up in
> rev_list_push for each one, which has the same O(n^2) behavior.
> Switching it to just sort at the end is not trivial; we first insert all
> of the objects, but then we actually walk the parents, pushing onto the
> list as we go. So I think we'd want a better data structure (like a
> priority queue).

Like the patch below, which is built on top of next (which has Junio's
prio_queue implementation), and has both the priority queue fix for
rev_list_push and the mark_complete sort-at-the-end fix. With this, your
test case completes in a few seconds (no clue if it would ever have
finished otherwise; I never let it run for more than a minute or two,
but I did confirm that it was spending all of its time in
commit_list_insert_by_date).

I didn't look too hard at the actual semantics, so there might be a way
to not mark so many commits in the first place. This is just a mindless
replacement of an O(n^2) list insertion with the queue.

I did note two oddities, though:

  1. In the original, we never free the commit_list pointers when we pop
     them (or when we clear the list in find_common). So I think the
 existing code was leaking memory, and my version does not.

  2. In the original version of get_rev, we "peek" at the head commit of
 rev_list:

   commit = rev_list->item;

 Then we look at the parents of that commit, and possibly push them
 onto rev_list:

   while (parents) {
  if (!(parents->item->object.flags & SEEN))
  rev_list_push(parents->item, mark);
  ...
   }

 and then finally, "pop" the commit off the list:

   rev_list = rev_list->next;

 But what guarantee do we have that the commit we just processed is
 still at the front of the list? If the timestamps are monotonic,
 then the parent must come after the descendent and therefore will
 not have gotten pushed onto the front of the list. But in the face
 of clock skew, this is not the case, and we will have just skipped
 over the parent with our "pop".

 So unless I am missing something very clever, I think this was a
 bug (albeit a difficult one to have matter). And it's also fixed by
 using prio_queue (which does the peek/pop together at the top of
 the function).
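To make that peek/pop-together behavior concrete, here is a toy date-keyed max-heap in the spirit of prio_queue (names and layout invented for illustration, not git's actual API). The get operation always returns the newest pending entry, even when a parent's timestamp is skewed newer than its child's:

```c
#include <stdlib.h>

/* Toy binary max-heap of commit dates, mirroring the shape of the
 * prio_queue idea: put is O(log n), get returns the maximum. */
struct dheap {
	unsigned long *v;
	int nr, alloc;
};

static void dheap_put(struct dheap *h, unsigned long date)
{
	int i;
	if (h->nr == h->alloc) {
		h->alloc = h->alloc ? 2 * h->alloc : 16;
		h->v = realloc(h->v, h->alloc * sizeof(*h->v));
	}
	i = h->nr++;
	h->v[i] = date;
	while (i > 0 && h->v[(i - 1) / 2] < h->v[i]) {	/* sift up */
		unsigned long t = h->v[i];
		h->v[i] = h->v[(i - 1) / 2];
		h->v[(i - 1) / 2] = t;
		i = (i - 1) / 2;
	}
}

static unsigned long dheap_get(struct dheap *h)
{
	unsigned long top = h->v[0];
	int i = 0;
	h->v[0] = h->v[--h->nr];
	for (;;) {					/* sift down */
		int l = 2 * i + 1, r = l + 1, big = i;
		unsigned long t;
		if (l < h->nr && h->v[l] > h->v[big]) big = l;
		if (r < h->nr && h->v[r] > h->v[big]) big = r;
		if (big == i) break;
		t = h->v[i]; h->v[i] = h->v[big]; h->v[big] = t;
		i = big;
	}
	return top;
}
```

Because get re-selects the maximum every time, a parent pushed with a newer-than-child date simply surfaces first instead of being skipped, which is the bug the list-based peek-then-pop could hit.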

diff --git a/commit.c b/commit.c
index 521e49c..ebc0eea 100644
--- a/commit.c
+++ b/commit.c
@@ -581,7 +581,7 @@ static int compare_commits_by_author_date(const void *a_, const void *b_,
return 0;
 }
 
-static int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused)
 {
const struct commit *a = a_, *b = b_;
/* newer commits with larger date first */
diff --git a/commit.h b/commit.h
index 4d452dc..18a5234 100644
--- a/commit.h
+++ b/commit.h
@@ -254,4 +254,6 @@ extern void check_commit_signature(const struct commit* commit, struct signature
  */
 extern void check_commit_signature(const struct commit* commit, struct signature_check *sigc);
 
+int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);
+
 #endif /* COMMIT_H */
diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..6684348 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -11,6 +11,7 @@
 #include "run-command.h"
 #include "transport.h"
 #include "version.h"
+#include "prio-queue.h"
 
 static int transfer_unpack_limit = -1;
 static int fetch_unpack_limit = -1;
@@ -37,7 +38,7 @@ static int marked;
  */
 #define MAX_IN_VAIN 256
 
-static struct commit_list *rev_list;
+static struct prio_queue rev_list = { compare_commits_by_commit_date };
 static int non_common_revs, multi_ack, use_sideband, allow_tip_sha1_in_want;
 
 static void rev_list_push(struct commit *commit, int mark)
@@ -49,7 +50,7 @@ static void rev_list_push(struct commit *commit, int mark)
if (parse_commit(commit))
return;
 
-   commit_list_insert_by_date(commit, &rev_list);
+   prio_queue_put(&rev_list, commit);
 
if (!(commit->object.flags & COMMON))
non_common_revs++;
@@ -122,10 +123,10 @@ static const unsigned char *get_rev(void)
unsigned int mark;
struct commit_list *parents;
 
-   if (rev_list == NULL || non_common_revs == 0)
+   if (rev_list.nr == 0 || non_common_revs == 0)
return NULL;
 
-   commit = rev_list->item;
+   commit = prio_queue_get(&rev_list);
if (!commit->object.parsed)
parse_commit(commit);
parents = commit->parents;
@@ -152,8 +153,6 @@ st

Re: How to still kill git fetch with too many refs

2013-07-01 Thread Jeff King
On Tue, Jul 02, 2013 at 12:07:58AM -0400, Jeff King wrote:

> And yet another alternative would be to keep the list unsorted during
> the mark_complete calls, and then sort it at the end. Like this:

Amusingly, I seem to have posted the exact same patch last year:

  http://article.gmane.org/gmane.comp.version-control.git/194939

but never polished it up. It does cut a little time off of the initial
ref-loading that fetch-pack does (especially when we do not end up
transferring any objects at all). But it does not solve your problem.

I replicated your test setup, and the problem is that we have many
common objects on both sides during the ref negotiation. So we end up in
rev_list_push for each one, which has the same O(n^2) behavior.
Switching it to just sort at the end is not trivial; we first insert all
of the objects, but then we actually walk the parents, pushing onto the
list as we go. So I think we'd want a better data structure (like a
priority queue).

-Peff


Re: How to still kill git fetch with too many refs

2013-07-01 Thread Jeff King
On Mon, Jul 01, 2013 at 09:02:31PM -0600, Martin Fick wrote:

> A simple synthetic test case with 1M refs all pointing to the same
> sha1 seems to be easily handled by git these days.  However, in our
> experience with our internal git repo, we still have performance
> issues related to having too many refs, in our kernel/msm instance we
> have around 400K.

I'm not too surprised. There's O(n^2) behavior in fetch-pack's
mark_complete function, as it adds each of the N ref tips to a
commit_list, sorting by date (so each insertion is O(n)).
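For readers following along, the quadratic pattern looks like this in miniature. This is a simplified stand-in for commit_list_insert_by_date with invented names, not git's actual structs:

```c
#include <stdlib.h>

/* Toy model of a commit_list: singly linked, kept sorted by date,
 * newest first. */
struct clist {
	unsigned long date;
	struct clist *next;
};

/* Insert by walking from the head to the first entry older than the
 * new one.  Each insert is O(n), so inserting n ref tips this way
 * costs O(n^2) overall -- the behavior described above. */
static struct clist *insert_by_date(unsigned long date, struct clist **list)
{
	struct clist **pp = list;
	struct clist *e;
	while (*pp && (*pp)->date > date)	/* skip newer entries */
		pp = &(*pp)->next;
	e = malloc(sizeof(*e));
	e->date = date;
	e->next = *pp;
	*pp = e;
	return e;
}
```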

I posted two alternative patches in May of 2011.  The first simply
avoids adding duplicate objects, which is simple and covers many
real-world cases (e.g., an "alternates" repository which has a bunch of
copies of the same tags, one per fork). The second one switches the
commit_list out for a heap-based priority queue.

We ended up taking the first (as ea5f220), since it was trivial and
obviously correct, but didn't bother with the second since:

  1. There had been no real-world reports of it.

  2. While in theory a priority queue implementation could be used in
 other spots, too, it ended up being a pain to use, as most of
 the callers wanted list-like splicing.

You can see the original here:

  http://thread.gmane.org/gmane.comp.version-control.git/174003/focus=174005

Though it probably doesn't apply cleanly anymore.  However, I've kept it
rebased over the years at:

  git://github.com/peff/git.git jk/fast-commit-list

Junio recently added a priority queue implementation in b4b594a
(prio-queue: priority queue of pointers to structs, 2013-06-06), which
is currently in next. So a modern version of that series would build on
top of that, rather than my priority queue.

And yet another alternative would be to keep the list unsorted during
the mark_complete calls, and then sort it at the end. Like this:

diff --git a/fetch-pack.c b/fetch-pack.c
index abe5ffb..4df8abd 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -505,7 +505,7 @@ static int mark_complete(const char *refname, const unsigned char *sha1, int fla
struct commit *commit = (struct commit *)o;
if (!(commit->object.flags & COMPLETE)) {
commit->object.flags |= COMPLETE;
-   commit_list_insert_by_date(commit, &complete);
+   commit_list_insert(commit, &complete);
}
}
return 0;
@@ -622,6 +622,7 @@ static int everything_local(struct fetch_pack_args *args,
if (!args->depth) {
for_each_ref(mark_complete, NULL);
for_each_alternate_ref(mark_alternate_complete, NULL);
+   commit_list_sort_by_date(&complete);
if (cutoff)
mark_recent_complete_commits(args, cutoff);
}

(If you're wondering why we didn't do this trivial bit at the time, it
was because back then we did not yet have René's nice linked-list
mergesort that backs commit_list_sort_by_date).
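The pattern in that patch, collect cheaply and sort once at the end, can be shown in miniature. This is a toy stand-in using qsort on an array of dates, not commit_list_sort_by_date itself:

```c
#include <stdlib.h>

/* Newest-first comparator for commit dates. */
static int by_date_desc(const void *a, const void *b)
{
	unsigned long da = *(const unsigned long *)a;
	unsigned long db = *(const unsigned long *)b;
	return (da < db) - (da > db);
}

/* Appending each entry is O(1); one final sort is O(n log n).  That
 * replaces n O(n) ordered insertions, i.e. the O(n^2) total above. */
static void sort_dates(unsigned long *dates, size_t n)
{
	qsort(dates, n, sizeof(*dates), by_date_desc);
}
```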

> The result, a copy of linus' repo with a million unique 
> valid refs and a git fetch of a single updated ref taking a 
> very long time (55mins and it did not complete yet).  Note, 
> with 100K refs it completes in about 2m40s.  It is likely 
> not linear since 2m40s * 10 would be ~26m (but the 
> difference could also just be how the data in the sha1s are 
> ordered).

That sounds like the O(n^2) problem. My timings back then with 100K refs
were 1-2 minutes. Does the patch above fix it for you?

-Peff


How to still kill git fetch with too many refs

2013-07-01 Thread Martin Fick
I have often reported problems with git fetch when there are 
many refs in a repo, and I have been pleasantly surprised 
how many problems I reported were so quickly fixed. :) With 
time, others have created various synthetic test cases to 
ensure that git can handle many many refs.  A simple 
synthetic test case with 1M refs all pointing to the same 
sha1 seems to be easily handled by git these days.  However, 
in our experience with our internal git repo, we still have 
performance issues related to having too many refs, in our 
kernel/msm instance we have around 400K.

When I tried the simple synthetic test case, I could not 
reproduce bad results, so I tried something just a little 
more complex and was able to get atrocious results!!! 
Basically, I generate a packed-refs file with many refs 
which each point to a different sha1.  To get a list of 
valid but unique sha1s for the repo, I simply used rev-list.  
The result, a copy of linus' repo with a million unique 
valid refs and a git fetch of a single updated ref taking a 
very long time (55mins and it had not completed yet).  Note, 
with 100K refs it completes in about 2m40s.  It is likely 
not linear since 2m40s * 10 would be ~26m (but the 
difference could also just be how the data in the sha1s are 
ordered).


Here is my small reproducible test case for this issue:

git clone 
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cp -rp linux linux.1Mrefs-revlist

cd linux
echo "Hello" > hello ; git add hello ; git commit -a -m 'hello'
cd ..

cd linux.1Mrefs-revlist
git rev-list HEAD | for nn in $(seq 0 100) ; do for c in 
$(seq 0 1) ; do  read sha ; echo $sha refs/c/$nn/$c$nn ; 
done ; done > .git/packed-refs

time git fetch file:///$(dirname $PWD)/linux 
refs/heads/master

Any insights as to why it is so slow, and how we could 
possibly speed it up?

Thanks,

-Martin

PS: My tests were performed with git version 1.8.2.1 on 
linux 2.6.32-37-generic #81-Ubuntu SMP 

