Re: [PATCH 2/2] commit: use a priority queue in merge base functions
On Wed, Aug 29, 2012 at 05:05:40PM -0400, Jeff King wrote: You would want this on top: [...] but t6024 still fails (it clearly is finding a different merge base than the test expects). I'll trace through it, but it will have to be later tonight. The problem in t6024 is caused by the fact that the commit timestamps for every commit are identical. The linear commit_list has the property that we always insert a new commit at the end of a chain of commits with the same timestamp. So it works as a stable priority queue in the sense that items with the same priority are returned in insertion order. But the heap-based priority queue does not. Nor can it do so without extra storage requirements, as heaps are inherently unstable. The simplest way to make it stable is to add an insertion counter to the comparison function. The patch below does this, and it resolves the issue. It does waste one int per queue element. I think you could also make a priority queue based on a binary search tree that would be stable. It's slightly less efficient to create an initial queue (you can heapify in O(n), but building the tree takes O(n lg n)). But we do not care about that, as we always build the queue by inserting elements, anyway. The other downside of using a tree is that you would want a self-balancing tree for good performance (especially since we tend to insert commits in sorted order), which increases the code complexity. Anyway, since this isn't yielding any performance benefit, I'm not going to go down that route. But stability of the queue is something that we need to consider if we ever do replace commit_list with a different data structure. Here's the patch to make the existing priority queue stable (by wasting space) in case we ever want to use it. -Peff diff --git a/commit.c b/commit.c index 8259871..a99d909 100644 --- a/commit.c +++ b/commit.c @@ -593,7 +593,7 @@ static int interesting(struct queue *q) { int i; for (i = 0; i q-nr; i++) { - struct commit *commit = q-items[i]; + struct commit *commit = q-items[i].item; if (commit-object.flags STALE) continue; return 1; diff --git a/queue.c b/queue.c index 7be6b86..1bdd948 100644 --- a/queue.c +++ b/queue.c @@ -3,18 +3,28 @@ static void queue_heapify_up(struct queue *pq) static inline void queue_swap(struct queue *pq, unsigned i, unsigned j) { - void *tmp = pq-items[i]; + struct queue_item tmp = pq-items[i]; pq-items[i] = pq-items[j]; pq-items[j] = tmp; } +static inline int queue_cmp(struct queue *pq, unsigned i, unsigned j) +{ + int cmp = pq-cmp(pq-items[i].item, pq-items[j].item); + if (cmp) + return cmp; + if (pq-items[i].counter pq-items[j].counter) + return 1; + return -1; +} + static void queue_heapify_up(struct queue *pq) { unsigned i = pq-nr - 1; while (i 0) { int parent = (i-1)/2; - if (pq-cmp(pq-items[i], pq-items[parent]) = 0) + if (queue_cmp(pq, i, parent) = 0) return; queue_swap(pq, i, parent); @@ -25,7 +35,9 @@ void queue_insert(struct queue *pq, void *item) void queue_insert(struct queue *pq, void *item) { ALLOC_GROW(pq-items, pq-nr + 1, pq-alloc); - pq-items[pq-nr++] = item; + pq-items[pq-nr].item = item; + pq-items[pq-nr].counter = pq-counter++; + pq-nr++; queue_heapify_up(pq); } @@ -35,11 +47,9 @@ static void queue_heapify_down(struct queue *pq) while (1) { int largest = i, left = 2*i + 1, right = 2*i + 2; - if (left pq-nr - pq-cmp(pq-items[left], pq-items[largest]) 0) + if (left pq-nr queue_cmp(pq, left, largest) 0) largest = left; - if (right pq-nr - pq-cmp(pq-items[right], pq-items[largest]) 0) + if (right pq-nr queue_cmp(pq, right, largest) 0) largest = right; if (largest == i) @@ -52,7 +62,7 @@ void *queue_peek(struct queue *pq) void *queue_peek(struct queue *pq) { - return pq-nr ? pq-items[0] : NULL; + return pq-nr ? pq-items[0].item : NULL; } void *queue_pop(struct queue *pq) @@ -61,7 +71,7 @@ void *queue_pop(struct queue *pq) if (!pq-nr) return NULL; - ret = pq-items[0]; + ret = pq-items[0].item; pq-items[0] = pq-items[--pq-nr]; queue_heapify_down(pq); diff --git a/queue.h b/queue.h index cc471b5..a70f7d7 100644 --- a/queue.h +++ b/queue.h @@ -3,11 +3,17 @@ struct queue { typedef int (*queue_comparison_func_t)(const void *, const void *); +struct queue_item { + void *item; + unsigned counter; +}; + struct queue { queue_comparison_func_t cmp; - void **items; + struct queue_item *items;
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
On Thu, Aug 30, 2012 at 08:54:21AM -0400, Jeff King wrote: On Wed, Aug 29, 2012 at 05:05:40PM -0400, Jeff King wrote: You would want this on top: [...] but t6024 still fails (it clearly is finding a different merge base than the test expects). I'll trace through it, but it will have to be later tonight. The problem in t6024 is caused by the fact that the commit timestamps for every commit are identical. So I was able to have my queue behave just like commit_list by fixing the stability issue. But I still have no clue what is going on in t6024. It does this for each commit it makes: [...] GIT_AUTHOR_DATE=2006-12-12 23:00:00 git commit -m 1 a1 [...] GIT_AUTHOR_DATE=2006-12-12 23:00:01 git commit -m A a1 [...] which is just bizarre. At first I thought it was buggy, and that it really wanted to be setting COMMITTER_DATE (in which case it should really just be using test_tick, anyway). But if you do that, the test fails (even using a regular commit_list)! So is the test buggy? Or are the identical commit timestamps part of the intended effect? I can't see how that would be, since: 1. You would need to set COMMITTER_DATE for that anyway, as you are otherwise creating a race condition. 2. Why would you set AUTHOR_DATE? It's not used by the merge code at all. The script originally comes from here: http://thread.gmane.org/gmane.comp.version-control.git/33566/focus=33852 and the discussion implies that the AUTHOR_DATEs were added to avoid a race condition with the timestamps. But why would that ever have worked? Confused... -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
On Thu, Aug 30, 2012 at 09:03:27AM -0400, Jeff King wrote: So I was able to have my queue behave just like commit_list by fixing the stability issue. But I still have no clue what is going on in t6024. It does this for each commit it makes: [...] GIT_AUTHOR_DATE=2006-12-12 23:00:00 git commit -m 1 a1 [...] GIT_AUTHOR_DATE=2006-12-12 23:00:01 git commit -m A a1 [...] which is just bizarre. At first I thought it was buggy, and that it really wanted to be setting COMMITTER_DATE (in which case it should really just be using test_tick, anyway). But if you do that, the test fails (even using a regular commit_list)! So is the test buggy? Or are the identical commit timestamps part of the intended effect? I can't see how that would be, since: 1. You would need to set COMMITTER_DATE for that anyway, as you are otherwise creating a race condition. 2. Why would you set AUTHOR_DATE? It's not used by the merge code at all. The script originally comes from here: http://thread.gmane.org/gmane.comp.version-control.git/33566/focus=33852 and the discussion implies that the AUTHOR_DATEs were added to avoid a race condition with the timestamps. But why would that ever have worked? That thread mentions that this is to fix Shawn's bug, which I think is this: http://article.gmane.org/gmane.comp.version-control.git/33559 IOW, the real thing that t6024 is trying to test is that we handle the fake empty tree properly. And the AUTHOR_DATEs probably were just there to try to fix the race condition, and they should really just be test_ticks (and I can't see how they ever would have helped anything; I suspect they were a placebo inserted at the same time as another change, and got credited with fixing the race). That still leaves the question of why the test fails when the commits get distinct timestamps. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
Jeff King p...@peff.net writes: Anyway, since this isn't yielding any performance benefit, I'm not going to go down that route. But stability of the queue is something that we need to consider if we ever do replace commit_list with a different data structure. Here's the patch to make the existing priority queue stable (by wasting space) in case we ever want to use it. Thanks for being thorough and showing a good analysis. If we want stability, the space to hold insertion counter is not wasted, by the way. I actually think the generic queue implementation may want to handle elements that are not just single pointers. Just like qsort() is not about sorting an array of pointers but it is about sorting an array of elements of caller specified size, perhaps queue can hold elements of size the caller specified (set in stone when the queue is initialized). Then, a caller that wants a stable priority queue of commits can tell the queue to manage struct { struct commit *c; int gen; }, use the gen field for stability in its comparison callback, etc., while a caller that does not need stability can tell the queue to manage struct commit *. That way, the generic queue layer does not have to worry about wasting the insertion counter space, no? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
On Thu, Aug 30, 2012 at 09:23:25AM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: Anyway, since this isn't yielding any performance benefit, I'm not going to go down that route. But stability of the queue is something that we need to consider if we ever do replace commit_list with a different data structure. Here's the patch to make the existing priority queue stable (by wasting space) in case we ever want to use it. Thanks for being thorough and showing a good analysis. If we want stability, the space to hold insertion counter is not wasted, by the way. It is if there is another data structure that will do the same job with the same performance characteristics and without using that space. But I'm not even sure it is an issue in practice. There are ~320K objects in linux-2.6. Even if we inserted _all_ of them at once into a queue (which we don't; it's usually more like a couple dozen, with a few pathological cases being equal to the number of refs we have, which is usually in the hundreds), then we are talking about an extra 2.5M on a 64-bit system. Compared to dropping an O(n^2) operation to O(lg n), that's probably not even going to be noticeable. I actually think the generic queue implementation may want to handle elements that are not just single pointers. Just like qsort() is not about sorting an array of pointers but it is about sorting an array of elements of caller specified size, perhaps queue can hold elements of size the caller specified (set in stone when the queue is initialized). Then, a caller that wants a stable priority queue of commits can tell the queue to manage struct { struct commit *c; int gen; }, use the gen field for stability in its comparison callback, etc., while a caller that does not need stability can tell the queue to manage struct commit *. That way, the generic queue layer does not have to worry about wasting the insertion counter space, no? Yeah, that would be more generic. One wonky thing I didn't point out in my implementation is this: +static inline int queue_cmp(struct queue *pq, unsigned i, unsigned j) +{ + int cmp = pq-cmp(pq-items[i].item, pq-items[j].item); + if (cmp) + return cmp; + if (pq-items[i].counter pq-items[j].counter) + return 1; + return -1; +} Notice how the counter comparison is backwards to the regular comparison. That's because the queue is actually a max-queue (I did it that way since we are indexing on timestamp, and want to go in reverse chronological order). But the counter part wants to prioritize minimums, so you have to reverse it. You could also implement a min-queue and ask the caller to reverse their comparison function. But really, a caller could theoretically want to prioritize in either direction (e.g., giving a LIFO behavior to elements with the same priority). Pulling the counter out of the queue would allow that. The reason I didn't, though, is that it would make the interface a huge pain if the caller had to do this: struct stable_commit_queue_element qe; qe.commit = commit; qe.counter = counter++; /* who is holding the global counter? */ queue_push(q, qe); instead of: queue_push(q, commit); I suspect you could build a queue object, and then implement a stable queue on top of it with some clever use of function pointers for the comparison function. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
On Thu, Aug 30, 2012 at 09:33:48AM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: The script originally comes from here: http://thread.gmane.org/gmane.comp.version-control.git/33566/focus=33852 and the discussion implies that the AUTHOR_DATEs were added to avoid a race condition with the timestamps. But why would that ever have worked? I do not see how AUTHOR_DATE would affect anything there, either, other than to give reprodusible object names. The test only sets committer-date upfront, so without setting author-date, you would still get different object names on commits. Which suggests me that there may be something that tiebreaks based on object names? Hmm. I wouldn't think so. The order should come from timestamps, with ties broken by order of insertion, which in turn comes from traversal, which depends only on parent order. We do check some raw sha1s in the expected output, but it is only for blobs, not commits. I guess it could be some weirdness inside merge-recursive, though, and not part of the merge-base computation. So I really don't know how AUTHOR_DATE would change anything. And indeed, after removing them it still passes on my machine. However, I think I may understand why it fails if you tweak the committer dates. When my unstable-queue was used, I noticed that the merge bases returned were the same (as you would expect), but that they came in a different order. Which makes sense, as the order of multiple bases would not necessarily be deterministic (they do not have an ancestry relationship, or they would not be merge bases). So the issue is that when you do a recursive merge with multiple bases, the order in which you visit the recursive bases is going to impact the exact conflicts you see. In theory, after the merge is done, you're going to be at the same state, but the conflicts you see along the way will be different. And it is this conflicted state that the test is looking at. The current test expects a particular order of merge bases based on the same-second commit timestamps. There is no race condition because of the setting of GIT_COMMITTER_DATE at the beginning (and _that_ is the real thing that fixed the race conditions Dscho saw in the thread above; the AUTHOR_DATE was just a red herring). So the test is not broken or racy, which is good. It is just testing something that is somewhat of an implementation detail. We could switch it to use test_tick, and then adjust the expected output to look for the expected conflict that git happens to generate in that case. But that is no better than the current behavior. But I'm not sure there is a way to test what it wants to test (that we hit a conflict that involves one of the recursive merge bases) without relying on the implementation detail. So I'm inclined to just leave it in place. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
Jeff King p...@peff.net writes: Compared to dropping an O(n^2) operation to O(lg n), that's probably not even going to be noticeable. Yeah, that is something we would want to use when we eventually update the main revision traverser, not this codepath localized to the merge-base. Thanks. I like (and agree with) everything you wrote in this message. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
Jeff King p...@peff.net writes: So the issue is that when you do a recursive merge with multiple bases, the order in which you visit the recursive bases is going to impact the exact conflicts you see. Yeah, that explains it. So the test is not broken or racy, which is good. It is just testing something that is somewhat of an implementation detail. We could switch it to use test_tick, and then adjust the expected output to look for the expected conflict that git happens to generate in that case. But that is no better than the current behavior. True. But I'm not sure there is a way to test what it wants to test (that we hit a conflict that involves one of the recursive merge bases) without relying on the implementation detail. So I'm inclined to just leave it in place. OK. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
Jeff King p...@peff.net writes: The merge-base functions internally keep a commit lists and insert by date, which causes a linear search of the commit list for each insertion. Let's use a priority queue instead. Unfortunately, from my experiments, this didn't actually cause any speedup. Signed-off-by: Jeff King p...@peff.net --- I'd probably split this into a few commits if we were really going to apply it, but since it doesn't actually make anything faster, I doubt the code churn is worth it. Thanks. This seems to break t6010-merge-base.sh for me, though... commit.c | 78 ++-- 1 file changed, 47 insertions(+), 31 deletions(-) diff --git a/commit.c b/commit.c index 380f4ec..c64ef94 100644 --- a/commit.c +++ b/commit.c @@ -7,6 +7,7 @@ #include revision.h #include notes.h #include gpg-interface.h +#include queue.h int save_commit_buffer = 1; @@ -360,6 +361,15 @@ struct commit_list *commit_list_insert(struct commit *item, struct commit_list * return new_list; } +static void commit_list_append(struct commit *item, struct commit_list ***tail) +{ + struct commit_list *new_list = xmalloc(sizeof(*new_list)); + new_list-item = item; + new_list-next = NULL; + **tail = new_list; + *tail = new_list-next; +} + unsigned commit_list_count(const struct commit_list *l) { unsigned c = 0; @@ -422,6 +432,16 @@ struct commit *pop_most_recent_commit(struct commit_list **list, return ret; } +static int commit_datecmp(const void *va, const void *vb) +{ + const struct commit *a = va, *b = vb; + if (a-date b-date) + return -1; + else if (a-date b-date) + return 1; + return 0; +} + void clear_commit_marks(struct commit *commit, unsigned int mark) { while (commit) { @@ -569,22 +589,23 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT); -static struct commit *interesting(struct commit_list *list) +static int interesting(struct queue *q) { - while (list) { - struct commit *commit = list-item; - list = list-next; + int i; + for (i = 0; i q-nr; i++) { + struct commit *commit = q-items[i]; if (commit-object.flags STALE) continue; - return commit; + return 1; } - return NULL; + return 0; } static struct commit_list *merge_bases_many(struct commit *one, int n, struct commit **twos) { - struct commit_list *list = NULL; - struct commit_list *result = NULL; + struct queue result = { commit_datecmp }; + struct queue list = { commit_datecmp }; + struct commit_list *ret = NULL, **tail = ret; int i; for (i = 0; i n; i++) { @@ -593,7 +614,7 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co * We do not mark this even with RESULT so we do not * have to clean it up. */ - return commit_list_insert(one, result); + return commit_list_insert(one, ret); } if (parse_commit(one)) @@ -604,28 +625,24 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co } one-object.flags |= PARENT1; - commit_list_insert_by_date(one, list); + queue_insert(list, one); for (i = 0; i n; i++) { twos[i]-object.flags |= PARENT2; - commit_list_insert_by_date(twos[i], list); + queue_insert(list, twos[i]); } - while (interesting(list)) { + while (interesting(list)) { struct commit *commit; struct commit_list *parents; - struct commit_list *next; int flags; - commit = list-item; - next = list-next; - free(list); - list = next; + commit = queue_pop(list); flags = commit-object.flags (PARENT1 | PARENT2 | STALE); if (flags == (PARENT1 | PARENT2)) { if (!(commit-object.flags RESULT)) { commit-object.flags |= RESULT; - commit_list_insert_by_date(commit, result); + queue_insert(result, commit); } /* Mark parents of a found merge stale */ flags |= STALE; @@ -639,21 +656,16 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co if (parse_commit(p)) return NULL; p-object.flags |= flags; -
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
On Wed, Aug 29, 2012 at 09:36:43AM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: The merge-base functions internally keep a commit lists and insert by date, which causes a linear search of the commit list for each insertion. Let's use a priority queue instead. Unfortunately, from my experiments, this didn't actually cause any speedup. Signed-off-by: Jeff King p...@peff.net --- I'd probably split this into a few commits if we were really going to apply it, but since it doesn't actually make anything faster, I doubt the code churn is worth it. Thanks. This seems to break t6010-merge-base.sh for me, though... Interesting. It works fine here, even under --valgrind. Did you apply the patches directly on top of 1251cc7? Not that it matters _too_ much if we are just going to scrap it anyway, but maybe it is an indication that I screwed up something that could impact the timing (I did check that the timed merge-base calculations on linux-2.6 yielded the same results though). -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
On Wed, Aug 29, 2012 at 04:53:32PM -0400, Jeff King wrote: Thanks. This seems to break t6010-merge-base.sh for me, though... Interesting. It works fine here, even under --valgrind. Did you apply the patches directly on top of 1251cc7? Hmm, this does seem to break t6024 for me, though. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
On Wed, Aug 29, 2012 at 04:55:25PM -0400, Jeff King wrote: On Wed, Aug 29, 2012 at 04:53:32PM -0400, Jeff King wrote: Thanks. This seems to break t6010-merge-base.sh for me, though... Interesting. It works fine here, even under --valgrind. Did you apply the patches directly on top of 1251cc7? Hmm, this does seem to break t6024 for me, though. Probably because: /* Clean up the result to remove stale ones */ - free_commit_list(list); - list = result; result = NULL; - while (list) { - struct commit_list *next = list-next; - if (!(list-item-object.flags STALE)) - commit_list_insert_by_date(list-item, result); - free(list); - list = next; - } - return result; + while (result.nr) + commit_list_append(queue_pop(result), tail); + queue_clear(result); + queue_clear(list); + return ret; I forgot to to port the STALE flag handling here. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
On Wed, Aug 29, 2012 at 05:00:32PM -0400, Jeff King wrote: Hmm, this does seem to break t6024 for me, though. Probably because: /* Clean up the result to remove stale ones */ - free_commit_list(list); - list = result; result = NULL; - while (list) { - struct commit_list *next = list-next; - if (!(list-item-object.flags STALE)) - commit_list_insert_by_date(list-item, result); - free(list); - list = next; - } - return result; + while (result.nr) + commit_list_append(queue_pop(result), tail); + queue_clear(result); + queue_clear(list); + return ret; I forgot to to port the STALE flag handling here. You would want this on top: diff --git a/commit.c b/commit.c index c64ef94..8259871 100644 --- a/commit.c +++ b/commit.c @@ -661,8 +661,11 @@ static struct commit_list *merge_bases_many(struct commit *one, int n, struct co } /* Clean up the result to remove stale ones */ - while (result.nr) - commit_list_append(queue_pop(result), tail); + while (result.nr) { + struct commit *commit = queue_pop(result); + if (!(commit-object.flags STALE)) + commit_list_append(commit, tail); + } queue_clear(result); queue_clear(list); return ret; but t6024 still fails (it clearly is finding a different merge base than the test expects). I'll trace through it, but it will have to be later tonight. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] commit: use a priority queue in merge base functions
Jeff King p...@peff.net writes: +while (result.nr) +commit_list_append(queue_pop(result), tail); +queue_clear(result); +queue_clear(list); +return ret; I forgot to to port the STALE flag handling here. Figures.. Thanks. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html