Re: [PATCH 4/4] gc --aggressive: three phase repacking
On Tue, Mar 18, 2014 at 12:00 PM, Duy Nguyen pclo...@gmail.com wrote: size time old aggr. 36MB 5m51 new aggr. 37MB 6m13 repack -adf 48MB 1m12 I am not clear on what these times mean. It looks like the new code is slower _and_ bigger. Can you explain them? That's right :) The upside is faster operations, which is complely missed here. The good thing from those numbers is pack size does not increase much (the upper limit would be repack -adf with default settings). Something is not right. The performance numbers are against me. Still looking.. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
On Wed, Mar 19, 2014 at 9:14 PM, Stefan Beller stefanbel...@gmail.com wrote: Maybe we should introduce another option --pack-for-archive which features the characteristics of the old --aggressive. And the new --aggressive would be a tradeoff between space and time? Thinking further we could add a couple of options there like --developer-friendly which makes blaming fast --hosting-friendly which makes it most efficient when upstream --archival-friendly (the old --aggressive) --... --aggressive-mode=mode would be a good option to select those friendly modes instead of a bunch of new options. Well, maby gc --mode is enough, we already have two modes: normal and aggressive (or although maybe it should be renamed archive). -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
On Tue, Mar 18, 2014 at 12:16 PM, Duy Nguyen pclo...@gmail.com wrote: But I think it's orthogonal to gc --aggressive improvement. There's another reason that improving gc may be a good idea (or not). It depends on how other git implementations handle long delta chains. If they hate long delta chains like current git too, then more reason to make gc less aggressive. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
Jeff King p...@peff.net writes: On Tue, Mar 18, 2014 at 12:00:48PM +0700, Duy Nguyen wrote: On Tue, Mar 18, 2014 at 11:50 AM, Jeff King p...@peff.net wrote: On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote: As explained in the previous commit, current aggressive settings --depth=250 --window=250 could slow down repository access significantly. Notice that people usually work on recent history only, we could keep recent history more loosely packed, so that repo access is fast most of the time while the pack file remains small. One thing I have not seen is real-world timings showing the slowdown based on --depth. Did I miss them, or are we just making assumptions based on one old case from 2009 (that, AFAIK does not have real numbers, just speculation)? Has anyone measured the effect of bumping the delta cache size (and its hash implementation)? David tested it with git-blame [1]. I should probably run some tests too (I don't remember if I tested some operations last time). http://thread.gmane.org/gmane.comp.version-control.git/242277/focus=242435 Ah, thanks. I do remember that thread now. It looks like David's last word is that he gets a significant performance from bumping the delta base cache size (and number of buckets). Increasing number of buckets was having comparatively minor effects (that was the suggestion I started with), actually _degrading_ performance rather soon. The delta base cache size was much more noticeable. I had prepared a patch serious increasing it. The reason I have not submitted it yet is that I have not found a compelling real-world test case _apart_ from the fast git-blame that is still missing implementation of -M and -C options. There should be other commands digging through large amounts of old history, but I did not really find something benchmarking convincingly. Either most stuff is inefficient anyway, or the access order is better-behaved, causing fewer unwanted cache flushes. Access order in the optimized git-blame case is basically done with a reverse commit-time based priority queue leading to a breadth-first strategy. It still beats unsorted access solidly in its timing. Don't think I compared depth-first results (inversing the priority queue sorting condition) with regard to cache results, but it's bad for interactive use as it tends to leave some recent history unblamed for a long time while digging up stuff in the remote past. Moderate cache size increases seem like a better strategy, and the default size of 16M does not make a lot of sense with modern computers. In particular since the history digging is rarely competing with other memory intensive operations at the same time. And that matches the timings I just did. I suspect there are still pathological cases that could behave worse, but it really sounds like we should be looking into improving that cache as a first step. I can put up a patch. My git-blame experiments used 128M, and the patch proposes a more conservative 64M. I don't actually have made experiments for the 64M setting, though. The current default is 16M. -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
Jeff King p...@peff.net writes: On Tue, Mar 18, 2014 at 12:50:50AM -0400, Jeff King wrote: On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote: As explained in the previous commit, current aggressive settings --depth=250 --window=250 could slow down repository access significantly. Notice that people usually work on recent history only, we could keep recent history more loosely packed, so that repo access is fast most of the time while the pack file remains small. One thing I have not seen is real-world timings showing the slowdown based on --depth. Did I miss them, or are we just making assumptions based on one old case from 2009 (that, AFAIK does not have real numbers, just speculation)? Has anyone measured the effect of bumping the delta cache size (and its hash implementation)? Just as a very quick, rough data point, here are before-and-after timings for the patch below doing git rev-list --objects --all on my linux.git, which is a mix of --aggressive and normal packing (I didn't do a repack -f, but it's partially what I've downloaded from k.org and what I've repacked in various experiments over the past few months). [before] real0m28.824s user0m28.620s sys 0m0.232s [after] real0m21.694s user0m21.544s sys 0m0.172s The numbers below are completely pulled out of a hat, so we can perhaps do even better. But I think it shows that there is room for improvement in the delta base cache. --- diff --git a/environment.c b/environment.c index c3c8606..73ed670 100644 --- a/environment.c +++ b/environment.c @@ -37,7 +37,7 @@ int core_compression_seen; int fsync_object_files; size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE; size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT; -size_t delta_base_cache_limit = 16 * 1024 * 1024; +size_t delta_base_cache_limit = 128 * 1024 * 1024; You need to change a file in Documentation as well. Can offer a patch. unsigned long big_file_threshold = 512 * 1024 * 1024; const char *pager_program; int pager_use_color = 1; diff --git a/sha1_file.c b/sha1_file.c index b37c6f6..a9ab8e3 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1944,7 +1944,7 @@ static void *unpack_compressed_entry(struct packed_git *p, return buffer; } -#define MAX_DELTA_CACHE (256) +#define MAX_DELTA_CACHE (1024) This one really needs experimentation. I found that increases here lead to performance degradation rather soon, probably because of decreased memory locality without significant reduction in cache collisions. Not sure whether it's worth touching at all. -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
Duy Nguyen pclo...@gmail.com writes: On Tue, Mar 18, 2014 at 12:16 PM, Duy Nguyen pclo...@gmail.com wrote: But I think it's orthogonal to gc --aggressive improvement. There's another reason that improving gc may be a good idea (or not). It depends on how other git implementations handle long delta chains. Other git implementations including current installations that have a half-life of at last a year. -- David Kastrup -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
Nguyễn Thái Ngọc Duy pclo...@gmail.com writes: As explained in the previous commit,... [PATCH 3/4] becomes a commit with an empty log message for some reason. Have you tried running am -s on it? -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
On Tue, Mar 18, 2014 at 5:12 AM, Junio C Hamano gits...@pobox.com wrote: Nguyễn Thái Ngọc Duy pclo...@gmail.com writes: As explained in the previous commit,... [PATCH 3/4] becomes a commit with an empty log message for some reason. Have you tried running am -s on it? am -s worked fine. am -s --scissors did not (because of my use of scissors in the commit message). Not sure what happened on your side.. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
Duy Nguyen pclo...@gmail.com writes: On Tue, Mar 18, 2014 at 5:12 AM, Junio C Hamano gits...@pobox.com wrote: Nguyễn Thái Ngọc Duy pclo...@gmail.com writes: As explained in the previous commit,... [PATCH 3/4] becomes a commit with an empty log message for some reason. Have you tried running am -s on it? am -s worked fine. am -s --scissors did not (because of my use of scissors in the commit message). Not sure what happened on your side.. Yeah, I meant am -c. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote: As explained in the previous commit, current aggressive settings --depth=250 --window=250 could slow down repository access significantly. Notice that people usually work on recent history only, we could keep recent history more loosely packed, so that repo access is fast most of the time while the pack file remains small. One thing I have not seen is real-world timings showing the slowdown based on --depth. Did I miss them, or are we just making assumptions based on one old case from 2009 (that, AFAIK does not have real numbers, just speculation)? Has anyone measured the effect of bumping the delta cache size (and its hash implementation)? git.git is not a great repo to test it because its size is modest but so are my laptop's cpu and memory, so here are the timings and pack sizes size time old aggr. 36MB 5m51 new aggr. 37MB 6m13 repack -adf 48MB 1m12 I am not clear on what these times mean. It looks like the new code is slower _and_ bigger. Can you explain them? -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
On Tue, Mar 18, 2014 at 11:50 AM, Jeff King p...@peff.net wrote: On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote: As explained in the previous commit, current aggressive settings --depth=250 --window=250 could slow down repository access significantly. Notice that people usually work on recent history only, we could keep recent history more loosely packed, so that repo access is fast most of the time while the pack file remains small. One thing I have not seen is real-world timings showing the slowdown based on --depth. Did I miss them, or are we just making assumptions based on one old case from 2009 (that, AFAIK does not have real numbers, just speculation)? Has anyone measured the effect of bumping the delta cache size (and its hash implementation)? David tested it with git-blame [1]. I should probably run some tests too (I don't remember if I tested some operations last time). http://thread.gmane.org/gmane.comp.version-control.git/242277/focus=242435 git.git is not a great repo to test it because its size is modest but so are my laptop's cpu and memory, so here are the timings and pack sizes size time old aggr. 36MB 5m51 new aggr. 37MB 6m13 repack -adf 48MB 1m12 I am not clear on what these times mean. It looks like the new code is slower _and_ bigger. Can you explain them? That's right :) The upside is faster operations, which is complely missed here. The good thing from those numbers is pack size does not increase much (the upper limit would be repack -adf with default settings). -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
On Tue, Mar 18, 2014 at 12:50:50AM -0400, Jeff King wrote: On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote: As explained in the previous commit, current aggressive settings --depth=250 --window=250 could slow down repository access significantly. Notice that people usually work on recent history only, we could keep recent history more loosely packed, so that repo access is fast most of the time while the pack file remains small. One thing I have not seen is real-world timings showing the slowdown based on --depth. Did I miss them, or are we just making assumptions based on one old case from 2009 (that, AFAIK does not have real numbers, just speculation)? Has anyone measured the effect of bumping the delta cache size (and its hash implementation)? Just as a very quick, rough data point, here are before-and-after timings for the patch below doing git rev-list --objects --all on my linux.git, which is a mix of --aggressive and normal packing (I didn't do a repack -f, but it's partially what I've downloaded from k.org and what I've repacked in various experiments over the past few months). [before] real0m28.824s user0m28.620s sys 0m0.232s [after] real0m21.694s user0m21.544s sys 0m0.172s The numbers below are completely pulled out of a hat, so we can perhaps do even better. But I think it shows that there is room for improvement in the delta base cache. --- diff --git a/environment.c b/environment.c index c3c8606..73ed670 100644 --- a/environment.c +++ b/environment.c @@ -37,7 +37,7 @@ int core_compression_seen; int fsync_object_files; size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE; size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT; -size_t delta_base_cache_limit = 16 * 1024 * 1024; +size_t delta_base_cache_limit = 128 * 1024 * 1024; unsigned long big_file_threshold = 512 * 1024 * 1024; const char *pager_program; int pager_use_color = 1; diff --git a/sha1_file.c b/sha1_file.c index b37c6f6..a9ab8e3 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -1944,7 +1944,7 @@ static void *unpack_compressed_entry(struct packed_git *p, return buffer; } -#define MAX_DELTA_CACHE (256) +#define MAX_DELTA_CACHE (1024) static size_t delta_base_cached; -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
On Tue, Mar 18, 2014 at 12:00:48PM +0700, Duy Nguyen wrote: On Tue, Mar 18, 2014 at 11:50 AM, Jeff King p...@peff.net wrote: On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote: As explained in the previous commit, current aggressive settings --depth=250 --window=250 could slow down repository access significantly. Notice that people usually work on recent history only, we could keep recent history more loosely packed, so that repo access is fast most of the time while the pack file remains small. One thing I have not seen is real-world timings showing the slowdown based on --depth. Did I miss them, or are we just making assumptions based on one old case from 2009 (that, AFAIK does not have real numbers, just speculation)? Has anyone measured the effect of bumping the delta cache size (and its hash implementation)? David tested it with git-blame [1]. I should probably run some tests too (I don't remember if I tested some operations last time). http://thread.gmane.org/gmane.comp.version-control.git/242277/focus=242435 Ah, thanks. I do remember that thread now. It looks like David's last word is that he gets a significant performance from bumping the delta base cache size (and number of buckets). And that matches the timings I just did. I suspect there are still pathological cases that could behave worse, but it really sounds like we should be looking into improving that cache as a first step. -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 4/4] gc --aggressive: three phase repacking
On Tue, Mar 18, 2014 at 12:07 PM, Jeff King p...@peff.net wrote: [before] real0m28.824s user0m28.620s sys 0m0.232s [after] real0m21.694s user0m21.544s sys 0m0.172s The numbers below are completely pulled out of a hat, so we can perhaps do even better. But I think it shows that there is room for improvement in the delta base cache. There is definitely room for improvement in delta base cache: increasing cache size, better eviction strategies.. But I think it's orthogonal to gc --aggressive improvement. From what Linus described ---aggressive is more for archival purposes but I think people have been using it for upstream repos as well. Better packed repos could speed up current clients immediately (although this argument is somewhat flawed, server updates are usually at slower pace than clients..). -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] gc --aggressive: three phase repacking
As explained in the previous commit, current aggressive settings --depth=250 --window=250 could slow down repository access significantly. Notice that people usually work on recent history only, we could keep recent history more loosely packed, so that repo access is fast most of the time while the pack file remains small. Three more configuration variables are used to make that happen. The first one, gc.aggressiveCommitLimits covers the old history part, which will be tightly packed. The remaining part will be packed with gc.lessAggresiveWindow and gc.lessAggressiveDepth. If gc.aggressiveCommitLimits is empty, everything will be tightly packed as before. The repack process becomes: - repack -adf on old history (e.g. the default --before=1.year.ago) mark to keep that pack - repack the second time with lessAggressive settings, the kept pack should be left untouched. - remove .keep file and repack the final time, reusing all deltas This process costs more time, but produce a more effecient pack. It is assumed that people who do gc --aggressive do not do this often and do not mind if it takes a bit longer. git.git is not a great repo to test it because its size is modest but so are my laptop's cpu and memory, so here are the timings and pack sizes size time old aggr. 36MB 5m51 new aggr. 37MB 6m13 repack -adf 48MB 1m12 Signed-off-by: Nguyễn Thái Ngọc Duy pclo...@gmail.com --- Documentation/config.txt | 19 + builtin/gc.c | 109 +-- 2 files changed, 124 insertions(+), 4 deletions(-) diff --git a/Documentation/config.txt b/Documentation/config.txt index 5ce7f9a..47979dc 100644 --- a/Documentation/config.txt +++ b/Documentation/config.txt @@ -1161,6 +1161,25 @@ gc.aggressiveWindow:: algorithm used by 'git gc --aggressive'. This defaults to 250. +gc.aggressiveCommitLimits:: + This one parameter to linkgit:git-rev-list[1] to select + commits that are repacked with gc.aggressiveDepth and + gc.aggressiveWindow, while the remaining commits are repacked + with gc.lessAggressiveDepth and gc.lessAggressiveWindow. ++ +If this is an empty string, everything will be repacked with +gc.aggressiveWindow and gc.aggressiveDepth. + +gc.lessAggressiveDepth:: + The depth parameter used in the delta compression + algorithm used by 'git gc --aggressive' when + gc.aggressiveCommitLimits is set. This defaults to 50. + +gc.lessAggressiveWindow:: + The window size parameter used in the delta compression + algorithm used by 'git gc --aggressive' when + gc.aggressiveCommitLimits is set. This defaults to 250. + gc.auto:: When there are approximately more than this many loose objects in the repository, `git gc --auto` will pack them. diff --git a/builtin/gc.c b/builtin/gc.c index 72aa912..37fc603 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -28,10 +28,14 @@ static const char * const builtin_gc_usage[] = { static int pack_refs = 1; static int aggressive_depth = 250; static int aggressive_window = 250; +static const char *aggressive_rev_list = --before=1.year.ago; +static int less_aggressive_depth = 50; +static int less_aggressive_window = 250; static int gc_auto_threshold = 6700; static int gc_auto_pack_limit = 50; static int detach_auto = 1; static const char *prune_expire = 2.weeks.ago; +static int delta_base_offset = 1; static struct argv_array pack_refs_cmd = ARGV_ARRAY_INIT; static struct argv_array reflog = ARGV_ARRAY_INIT; @@ -39,10 +43,13 @@ static struct argv_array repack = ARGV_ARRAY_INIT; static struct argv_array prune = ARGV_ARRAY_INIT; static struct argv_array rerere = ARGV_ARRAY_INIT; +static char *keep_file; static char *pidfile; static void remove_pidfile(void) { + if (keep_file) + unlink_or_warn(keep_file); if (pidfile) unlink(pidfile); } @@ -54,6 +61,63 @@ static void remove_pidfile_on_signal(int signo) raise(signo); } +static void pack_old_history(int quiet) +{ + struct child_process pack_objects; + struct child_process rev_list; + struct argv_array av_po = ARGV_ARRAY_INIT; + struct argv_array av_rl = ARGV_ARRAY_INIT; + char sha1[41]; + + argv_array_pushl(av_rl, rev-list, --all, --objects, +--reflog, NULL); + argv_array_push(av_rl, aggressive_rev_list); + + memset(rev_list, 0, sizeof(rev_list)); + rev_list.no_stdin = 1; + rev_list.out = -1; + rev_list.git_cmd = 1; + rev_list.argv = av_rl.argv; + + if (start_command(rev_list)) + die(_(gc: unable to fork git-rev-list)); + + argv_array_pushl(av_po, pack-objects, --keep-true-parents, +--honor-pack-keep, --non-empty, --no-reuse-delta, +--keep, --local, NULL); + if (delta_base_offset) + argv_array_push(av_po,