Re: [PATCH] remote.c: avoid O(n^2) behavior in match_push_refs by using string_list

2013-07-03 Thread Jeff King
On Tue, Jul 02, 2013 at 04:53:48PM -0700, Brandon Casey wrote:

 From: Brandon Casey draf...@gmail.com
 
 When pushing, each ref in the local repository must be paired with a
 ref advertised by the remote server.  Currently, this is performed by
 first applying the refspec to the local ref to transform the local ref
 into the name of the remote ref, and then performing a linear search
 through the list of remote refs to see if the remote ref was advertised
 by the remote system.
 
 This has O(n) complexity and makes match_push_refs() be an O(n^2)
 operation.

Just to be sure I understand correctly, is this actually O(m*n) where
m is the number of local refs and n is the number of remote refs?

For a repository that repeatedly pushes everything it has to the remote,
we end up with m=n, but it would not necessarily be the case if you are
pushing a subset of your refs. But even pushing a small number of refs
into a repository with a very large number of refs would be
unnecessarily slow, as we would do several O(n) lookups which could be
O(log n). So it may speed things up even in the case of a normal-sized
repo pushing to a large one.

 Dry-run push of a repository with 121913 refs:
 
 before after
 real1m40.582s  0m0.804s
 user1m39.914s  0m0.515s
 sys 0m0.125s   0m0.106s

Very nice. :)

 Signed-off-by: Brandon Casey draf...@gmail.com
 ---
  remote.c | 26 --
  1 file changed, 24 insertions(+), 2 deletions(-)

Patch itself looks good to me, although...

 @@ -1362,6 +1378,8 @@ int match_push_refs(struct ref *src, struct ref **dst,
   free(dst_name);
   }
  
 + string_list_clear(ref_list, 0);
 +
   if (flags  MATCH_REFS_FOLLOW_TAGS)
   add_missing_tags(src, dst, dst_tail);
  
 @@ -1376,11 +1394,15 @@ int match_push_refs(struct ref *src, struct ref **dst,
  
   src_name = get_ref_match(rs, nr_refspec, ref, 
 send_mirror, FROM_DST, NULL);
   if (src_name) {
 - if (!find_ref_by_name(src, src_name))
 + if (!ref_list.nr)
 + prepare_searchable_ref_list(src,
 + ref_list);
 + if (!string_list_has_string(ref_list, 
 src_name))

This hunk threw me for a bit, as it looked like we were lazily
initializing ref_list in case we had not done so earlier. But we would
have cleared it mid-way through the function (in the hunk above), and it
is only that we are reusing the same ref_list for two different
purposes.

I do not feel strongly about it, but it might be a little more obvious
to just declare a new variable in the block, like:

diff --git a/remote.c b/remote.c
index 75255af..53bef82 100644
--- a/remote.c
+++ b/remote.c
@@ -1399,6 +1399,7 @@ int match_push_refs(struct ref *src, struct ref **dst,
add_missing_tags(src, dst, dst_tail);
 
if (send_prune) {
+   struct string_list src_ref_index = STRING_LIST_INIT_NODUP;
/* check for missing refs on the remote */
for (ref = *dst; ref; ref = ref-next) {
char *src_name;
@@ -1409,15 +1410,15 @@ int match_push_refs(struct ref *src, struct ref **dst,
 
src_name = get_ref_match(rs, nr_refspec, ref, 
send_mirror, FROM_DST, NULL);
if (src_name) {
-   if (!ref_list.nr)
+   if (!src_ref_index.nr)
prepare_searchable_ref_list(src,
-   ref_list);
-   if (!string_list_has_string(ref_list, 
src_name))
+   src_ref_index);
+   if (!string_list_has_string(src_ref_index, 
src_name))
ref-peer_ref = alloc_delete_ref();
free(src_name);
}
}
-   string_list_clear(ref_list, 0);
+   string_list_clear(src_ref_index, 0);
}
if (errs)
return -1;

And similarly maybe call the outer ref_list dst_ref_index or something.
I also note that we don't do the lazy-prepare for the other loop. I
guess that is because we assume that src is always non-NULL?

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] remote.c: avoid O(n^2) behavior in match_push_refs by using string_list

2013-07-03 Thread Brandon Casey
On Tue, Jul 2, 2013 at 11:23 PM, Jeff King p...@peff.net wrote:
 On Tue, Jul 02, 2013 at 04:53:48PM -0700, Brandon Casey wrote:

 From: Brandon Casey draf...@gmail.com

 When pushing, each ref in the local repository must be paired with a
 ref advertised by the remote server.  Currently, this is performed by
 first applying the refspec to the local ref to transform the local ref
 into the name of the remote ref, and then performing a linear search
 through the list of remote refs to see if the remote ref was advertised
 by the remote system.

 This has O(n) complexity and makes match_push_refs() be an O(n^2)
 operation.

 Just to be sure I understand correctly, is this actually O(m*n) where
 m is the number of local refs and n is the number of remote refs?

Yes, O(m*n) is more correct.  I think you understand completely.

 For a repository that repeatedly pushes everything it has to the remote,
 we end up with m=n, but it would not necessarily be the case if you are
 pushing a subset of your refs. But even pushing a small number of refs
 into a repository with a very large number of refs would be
 unnecessarily slow, as we would do several O(n) lookups which could be
 O(log n). So it may speed things up even in the case of a normal-sized
 repo pushing to a large one.

Right.  For repos with few refs on either side, I don't think there
will be any measurable difference.  When pushing a single ref to a
repo with a very large number of refs, we will see a very small net
loss for the time required to prepare the string list (which grows
linearly with the number of remote refs).  After 2 or 3 refs, we
should see a net gain.

So we're really just improving our worst case performance here.

 Dry-run push of a repository with 121913 refs:

 before after
 real1m40.582s  0m0.804s
 user1m39.914s  0m0.515s
 sys 0m0.125s   0m0.106s

 Very nice. :)

 Signed-off-by: Brandon Casey draf...@gmail.com
 ---
  remote.c | 26 --
  1 file changed, 24 insertions(+), 2 deletions(-)

 Patch itself looks good to me, although...

 @@ -1362,6 +1378,8 @@ int match_push_refs(struct ref *src, struct ref **dst,
   free(dst_name);
   }

 + string_list_clear(ref_list, 0);
 +
   if (flags  MATCH_REFS_FOLLOW_TAGS)
   add_missing_tags(src, dst, dst_tail);

 @@ -1376,11 +1394,15 @@ int match_push_refs(struct ref *src, struct ref 
 **dst,

   src_name = get_ref_match(rs, nr_refspec, ref, 
 send_mirror, FROM_DST, NULL);
   if (src_name) {
 - if (!find_ref_by_name(src, src_name))
 + if (!ref_list.nr)
 + prepare_searchable_ref_list(src,
 + ref_list);
 + if (!string_list_has_string(ref_list, 
 src_name))

 This hunk threw me for a bit, as it looked like we were lazily
 initializing ref_list in case we had not done so earlier. But we would
 have cleared it mid-way through the function (in the hunk above), and it
 is only that we are reusing the same ref_list for two different
 purposes.

 I do not feel strongly about it, but it might be a little more obvious
 to just declare a new variable in the block, like:

 diff --git a/remote.c b/remote.c
 index 75255af..53bef82 100644
 --- a/remote.c
 +++ b/remote.c
 @@ -1399,6 +1399,7 @@ int match_push_refs(struct ref *src, struct ref **dst,
 add_missing_tags(src, dst, dst_tail);

 if (send_prune) {
 +   struct string_list src_ref_index = STRING_LIST_INIT_NODUP;
 /* check for missing refs on the remote */
 for (ref = *dst; ref; ref = ref-next) {
 char *src_name;
 @@ -1409,15 +1410,15 @@ int match_push_refs(struct ref *src, struct ref **dst,

 src_name = get_ref_match(rs, nr_refspec, ref, 
 send_mirror, FROM_DST, NULL);
 if (src_name) {
 -   if (!ref_list.nr)
 +   if (!src_ref_index.nr)
 prepare_searchable_ref_list(src,
 -   ref_list);
 -   if (!string_list_has_string(ref_list, 
 src_name))
 +   src_ref_index);
 +   if (!string_list_has_string(src_ref_index, 
 src_name))
 ref-peer_ref = alloc_delete_ref();
 free(src_name);
 }
 }
 -   string_list_clear(ref_list, 0);
 +   string_list_clear(src_ref_index, 0);
 }
 if (errs)
 return -1;

 And similarly maybe call the outer ref_list dst_ref_index or something.

That looks/sounds reasonable.

 I also note that we don't do the lazy-prepare for the 

Re: [PATCH] remote.c: avoid O(n^2) behavior in match_push_refs by using string_list

2013-07-03 Thread Junio C Hamano
Brandon Casey draf...@gmail.com writes:

 Right.  For repos with few refs on either side, I don't think there
 will be any measurable difference.  When pushing a single ref to a
 repo with a very large number of refs, we will see a very small net
 loss for the time required to prepare the string list (which grows
 linearly with the number of remote refs).  After 2 or 3 refs, we
 should see a net gain.

 So we're really just improving our worst case performance here.

... by penalizing the common case by how much?  If it is not too
much, then this obviously would be a good change.

 ...  But, I don't see a down side to doing the lazy prepare in
 the other loop too, and in fact, it looks like we may be able to avoid
 building the string list when only explicit refspecs are used.  So,
 yeah, we should lazy build in both loops.

OK, so will see a reroll sometime?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] remote.c: avoid O(n^2) behavior in match_push_refs by using string_list

2013-07-03 Thread Jeff King
On Wed, Jul 03, 2013 at 11:40:12AM -0700, Junio C Hamano wrote:

 Brandon Casey draf...@gmail.com writes:
 
  Right.  For repos with few refs on either side, I don't think there
  will be any measurable difference.  When pushing a single ref to a
  repo with a very large number of refs, we will see a very small net
  loss for the time required to prepare the string list (which grows
  linearly with the number of remote refs).  After 2 or 3 refs, we
  should see a net gain.
 
  So we're really just improving our worst case performance here.
 
 ... by penalizing the common case by how much?  If it is not too
 much, then this obviously would be a good change.

I don't think by much. If we have m local refs to push and n remote
refs, right now we do O(m*n) work (m linear searches of the remote
namespace). With Brandon's patch, we do O(n log n) to build the index,
plus O(m log n) for lookups.

So our break-even point is basically m = log n, and for m smaller than
that, we do more work building the index. Your absolute biggest
difference would be pushing a single ref to a repository with a very
large number of refs.

Here are the timings before and after Brandon's patch for pushing a
no-op single ref from a normal repo to one with 370K refs (the same
pathological repo from the upload-pack tests). Times are
best-of-five.

 before after
 real0m1.087s   0m1.156s
 user0m1.344s   0m1.412s
 sys 0m0.288s   0m0.284s

So it's measurable, but even on a pathological worst-case, we're talking
about 6% slowdown.

You could try to guess about when to build the index based on the size
of m and n, but I suspect you'd waste more time calculating whether
to build the index than you would simply building it in most cases.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] remote.c: avoid O(n^2) behavior in match_push_refs by using string_list

2013-07-03 Thread Brandon Casey
On 7/3/2013 11:40 AM, Junio C Hamano wrote:
 Brandon Casey draf...@gmail.com writes:
 
 Right.  For repos with few refs on either side, I don't think there
 will be any measurable difference.  When pushing a single ref to a
 repo with a very large number of refs, we will see a very small net
 loss for the time required to prepare the string list (which grows
 linearly with the number of remote refs).  After 2 or 3 refs, we
 should see a net gain.

 So we're really just improving our worst case performance here.
 
 ... by penalizing the common case by how much?  If it is not too
 much, then this obviously would be a good change.

For something the size of the git repo, 5 branches, and pushing with
matching refspecs, I can't measure any difference.  The fastest time I
record with or without this patch is the same:

   $ time git push -n
   real0m0.178s
   user0m0.020s
   sys 0m0.008s

Ditto, when only pushing a single branch.  Preparing the string list for
a repo with a normal number of refs has very little overhead.

When the remote side has very many refs, then there is a small penalty
when the local side is pushing very few refs.  But still, the penalty is
small.

My measurements for pushing from a repo with a single local branch into
my 10+ ref repo showed 10% hit and the numbers were in the tens of
milliseconds.

beforeafter
real0m0.525s  0m0.566s
user0m0.243s  0m0.279s
sys 0m0.075s  0m0.099s

 ...  But, I don't see a down side to doing the lazy prepare in
 the other loop too, and in fact, it looks like we may be able to avoid
 building the string list when only explicit refspecs are used.  So,
 yeah, we should lazy build in both loops.
 
 OK, so will see a reroll sometime?

Yeah, I'll reroll.

-Brandon

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] remote.c: avoid O(n^2) behavior in match_push_refs by using string_list

2013-07-03 Thread Brandon Casey
On 7/3/2013 12:00 PM, Jeff King wrote:
 On Wed, Jul 03, 2013 at 11:40:12AM -0700, Junio C Hamano wrote:
 
 Brandon Casey draf...@gmail.com writes:

 Right.  For repos with few refs on either side, I don't think there
 will be any measurable difference.  When pushing a single ref to a
 repo with a very large number of refs, we will see a very small net
 loss for the time required to prepare the string list (which grows
 linearly with the number of remote refs).  After 2 or 3 refs, we
 should see a net gain.

 So we're really just improving our worst case performance here.

 ... by penalizing the common case by how much?  If it is not too
 much, then this obviously would be a good change.
 
 I don't think by much. If we have m local refs to push and n remote
 refs, right now we do O(m*n) work (m linear searches of the remote
 namespace). With Brandon's patch, we do O(n log n) to build the index,

Whoops, yes, n log n, not linear as I misspoke.

 plus O(m log n) for lookups.
 
 So our break-even point is basically m = log n, and for m smaller than
 that, we do more work building the index. Your absolute biggest
 difference would be pushing a single ref to a repository with a very
 large number of refs.
 
 Here are the timings before and after Brandon's patch for pushing a
 no-op single ref from a normal repo to one with 370K refs (the same
 pathological repo from the upload-pack tests). Times are
 best-of-five.
 
  before after
  real0m1.087s   0m1.156s
  user0m1.344s   0m1.412s
  sys 0m0.288s   0m0.284s
 
 So it's measurable, but even on a pathological worst-case, we're talking
 about 6% slowdown.

That agrees with what I've observed.

 You could try to guess about when to build the index based on the size
 of m and n, but I suspect you'd waste more time calculating whether
 to build the index than you would simply building it in most cases.

I agree, I don't think it's worth trying to guess when to build an index
and when to just perform linear searches.  If building the payload for
each element in the index was more expensive than just assigning to a
pointer, than it could be worth it, but we're not, so I don't think it
is worth it.

-Brandon

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] remote.c: avoid O(n^2) behavior in match_push_refs by using string_list

2013-07-03 Thread Junio C Hamano
Brandon Casey bca...@nvidia.com writes:

 ... by penalizing the common case by how much?  If it is not too
 much, then this obviously would be a good change.

 For something the size of the git repo, 5 branches, and pushing with
 matching refspecs, I can't measure any difference.  The fastest time I
 record with or without this patch is the same:

$ time git push -n
real0m0.178s
user0m0.020s
sys 0m0.008s

 Ditto, when only pushing a single branch.  Preparing the string list for
 a repo with a normal number of refs has very little overhead.

My repository git.git and Linus's kernel are not normal.  It did
not matter so far to have O(n*m) when pushing to our histories.

The case that matters is for somebody to be pushing one (or a few)
refs against a repository with many many refs, like pushing a review
request to Gerrit instance, which I think Martin has in mind.


--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html