Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-19 Thread Duy Nguyen
On Tue, Mar 18, 2014 at 12:00 PM, Duy Nguyen pclo...@gmail.com wrote:
 size  time
 old aggr.   36MB  5m51
 new aggr.   37MB  6m13
 repack -adf 48MB  1m12

 I am not clear on what these times mean. It looks like the new code is
 slower _and_ bigger. Can you explain them?

 That's right :) The upside is faster operations, which is complely
 missed here. The good thing from those numbers is pack size does not
 increase much (the upper limit would be repack -adf with default
 settings).

Something is not right. The performance numbers are against me. Still looking..
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-19 Thread Duy Nguyen
On Wed, Mar 19, 2014 at 9:14 PM, Stefan Beller stefanbel...@gmail.com wrote:
 Maybe we should introduce another option --pack-for-archive
 which features the characteristics of the old --aggressive.
 And the new --aggressive would be a tradeoff between space and
 time?
 Thinking further we could add a couple of options there like
  --developer-friendly which makes blaming fast
  --hosting-friendly which makes it most efficient when upstream
  --archival-friendly (the old --aggressive)
  --...

--aggressive-mode=mode would be a good option to select those
friendly modes instead of a bunch of new options. Well, maby gc
--mode is enough, we already have two modes: normal and aggressive
(or although maybe it should be renamed archive).
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-18 Thread Duy Nguyen
On Tue, Mar 18, 2014 at 12:16 PM, Duy Nguyen pclo...@gmail.com wrote:
 But I think it's orthogonal to gc --aggressive improvement.

There's another reason that improving gc may be a good idea (or not).
It depends on how other git implementations handle long delta chains.
If they hate long delta chains like current git too, then more reason
to make gc less aggressive.
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-18 Thread David Kastrup
Jeff King p...@peff.net writes:

 On Tue, Mar 18, 2014 at 12:00:48PM +0700, Duy Nguyen wrote:

 On Tue, Mar 18, 2014 at 11:50 AM, Jeff King p...@peff.net wrote:
  On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote:
 
  As explained in the previous commit, current aggressive settings
  --depth=250 --window=250 could slow down repository access
  significantly. Notice that people usually work on recent history only,
  we could keep recent history more loosely packed, so that repo access
  is fast most of the time while the pack file remains small.
 
  One thing I have not seen is real-world timings showing the slowdown
  based on --depth. Did I miss them, or are we just making assumptions
  based on one old case from 2009 (that, AFAIK does not have real numbers,
  just speculation)? Has anyone measured the effect of bumping the delta
  cache size (and its hash implementation)?
 
 David tested it with git-blame [1]. I should probably run some tests
 too (I don't remember if I tested some operations last time).
 
 http://thread.gmane.org/gmane.comp.version-control.git/242277/focus=242435

 Ah, thanks. I do remember that thread now.

 It looks like David's last word is that he gets a significant
 performance from bumping the delta base cache size (and number of
 buckets).

Increasing number of buckets was having comparatively minor effects
(that was the suggestion I started with), actually _degrading_
performance rather soon.  The delta base cache size was much more
noticeable.  I had prepared a patch serious increasing it.  The reason
I have not submitted it yet is that I have not found a compelling
real-world test case _apart_ from the fast git-blame that is still
missing implementation of -M and -C options.

There should be other commands digging through large amounts of old
history, but I did not really find something benchmarking convincingly.
Either most stuff is inefficient anyway, or the access order is
better-behaved, causing fewer unwanted cache flushes.

Access order in the optimized git-blame case is basically done with a
reverse commit-time based priority queue leading to a breadth-first
strategy.  It still beats unsorted access solidly in its timing.  Don't
think I compared depth-first results (inversing the priority queue
sorting condition) with regard to cache results, but it's bad for
interactive use as it tends to leave some recent history unblamed for a
long time while digging up stuff in the remote past.

Moderate cache size increases seem like a better strategy, and the
default size of 16M does not make a lot of sense with modern computers.
In particular since the history digging is rarely competing with other
memory intensive operations at the same time.

 And that matches the timings I just did. I suspect there are still
 pathological cases that could behave worse, but it really sounds like
 we should be looking into improving that cache as a first step.

I can put up a patch.  My git-blame experiments used 128M, and the patch
proposes a more conservative 64M.  I don't actually have made
experiments for the 64M setting, though.  The current default is 16M.

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-18 Thread David Kastrup
Jeff King p...@peff.net writes:

 On Tue, Mar 18, 2014 at 12:50:50AM -0400, Jeff King wrote:

 On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote:
 
  As explained in the previous commit, current aggressive settings
  --depth=250 --window=250 could slow down repository access
  significantly. Notice that people usually work on recent history only,
  we could keep recent history more loosely packed, so that repo access
  is fast most of the time while the pack file remains small.
 
 One thing I have not seen is real-world timings showing the slowdown
 based on --depth. Did I miss them, or are we just making assumptions
 based on one old case from 2009 (that, AFAIK does not have real numbers,
 just speculation)? Has anyone measured the effect of bumping the delta
 cache size (and its hash implementation)?

 Just as a very quick, rough data point, here are before-and-after
 timings for the patch below doing git rev-list --objects --all on my
 linux.git, which is a mix of --aggressive and normal packing (I didn't
 do a repack -f, but it's partially what I've downloaded from k.org and
 what I've repacked in various experiments over the past few months).

   [before]
   real0m28.824s
   user0m28.620s
   sys 0m0.232s

   [after]
   real0m21.694s
   user0m21.544s
   sys 0m0.172s

 The numbers below are completely pulled out of a hat, so we can perhaps
 do even better. But I think it shows that there is room for improvement
 in the delta base cache.

 ---
 diff --git a/environment.c b/environment.c
 index c3c8606..73ed670 100644
 --- a/environment.c
 +++ b/environment.c
 @@ -37,7 +37,7 @@ int core_compression_seen;
  int fsync_object_files;
  size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
  size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 -size_t delta_base_cache_limit = 16 * 1024 * 1024;
 +size_t delta_base_cache_limit = 128 * 1024 * 1024;

You need to change a file in Documentation as well.  Can offer a patch.

  unsigned long big_file_threshold = 512 * 1024 * 1024;
  const char *pager_program;
  int pager_use_color = 1;
 diff --git a/sha1_file.c b/sha1_file.c
 index b37c6f6..a9ab8e3 100644
 --- a/sha1_file.c
 +++ b/sha1_file.c
 @@ -1944,7 +1944,7 @@ static void *unpack_compressed_entry(struct packed_git 
 *p,
   return buffer;
  }
  
 -#define MAX_DELTA_CACHE (256)
 +#define MAX_DELTA_CACHE (1024)

This one really needs experimentation.  I found that increases here lead
to performance degradation rather soon, probably because of decreased
memory locality without significant reduction in cache collisions.  Not
sure whether it's worth touching at all.

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-18 Thread David Kastrup
Duy Nguyen pclo...@gmail.com writes:

 On Tue, Mar 18, 2014 at 12:16 PM, Duy Nguyen pclo...@gmail.com wrote:
 But I think it's orthogonal to gc --aggressive improvement.

 There's another reason that improving gc may be a good idea (or not).
 It depends on how other git implementations handle long delta chains.

Other git implementations including current installations that have a
half-life of at last a year.

-- 
David Kastrup
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-17 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy  pclo...@gmail.com writes:

 As explained in the previous commit,...

[PATCH 3/4] becomes a commit with an empty log message for some
reason.  Have you tried running am -s on it?
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-17 Thread Duy Nguyen
On Tue, Mar 18, 2014 at 5:12 AM, Junio C Hamano gits...@pobox.com wrote:
 Nguyễn Thái Ngọc Duy  pclo...@gmail.com writes:

 As explained in the previous commit,...

 [PATCH 3/4] becomes a commit with an empty log message for some
 reason.  Have you tried running am -s on it?

am -s worked fine. am -s --scissors did not (because of my use of
scissors in the commit message). Not sure what happened on your side..
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-17 Thread Junio C Hamano
Duy Nguyen pclo...@gmail.com writes:

 On Tue, Mar 18, 2014 at 5:12 AM, Junio C Hamano gits...@pobox.com wrote:
 Nguyễn Thái Ngọc Duy  pclo...@gmail.com writes:

 As explained in the previous commit,...

 [PATCH 3/4] becomes a commit with an empty log message for some
 reason.  Have you tried running am -s on it?

 am -s worked fine. am -s --scissors did not (because of my use of
 scissors in the commit message). Not sure what happened on your side..

Yeah, I meant am -c.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-17 Thread Jeff King
On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote:

 As explained in the previous commit, current aggressive settings
 --depth=250 --window=250 could slow down repository access
 significantly. Notice that people usually work on recent history only,
 we could keep recent history more loosely packed, so that repo access
 is fast most of the time while the pack file remains small.

One thing I have not seen is real-world timings showing the slowdown
based on --depth. Did I miss them, or are we just making assumptions
based on one old case from 2009 (that, AFAIK does not have real numbers,
just speculation)? Has anyone measured the effect of bumping the delta
cache size (and its hash implementation)?

 git.git is not a great repo to test it because its size is modest but
 so are my laptop's cpu and memory, so here are the timings and pack
 sizes
 
 size  time
 old aggr.   36MB  5m51
 new aggr.   37MB  6m13
 repack -adf 48MB  1m12

I am not clear on what these times mean. It looks like the new code is
slower _and_ bigger. Can you explain them?

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-17 Thread Duy Nguyen
On Tue, Mar 18, 2014 at 11:50 AM, Jeff King p...@peff.net wrote:
 On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote:

 As explained in the previous commit, current aggressive settings
 --depth=250 --window=250 could slow down repository access
 significantly. Notice that people usually work on recent history only,
 we could keep recent history more loosely packed, so that repo access
 is fast most of the time while the pack file remains small.

 One thing I have not seen is real-world timings showing the slowdown
 based on --depth. Did I miss them, or are we just making assumptions
 based on one old case from 2009 (that, AFAIK does not have real numbers,
 just speculation)? Has anyone measured the effect of bumping the delta
 cache size (and its hash implementation)?

David tested it with git-blame [1]. I should probably run some tests
too (I don't remember if I tested some operations last time).

http://thread.gmane.org/gmane.comp.version-control.git/242277/focus=242435

 git.git is not a great repo to test it because its size is modest but
 so are my laptop's cpu and memory, so here are the timings and pack
 sizes

 size  time
 old aggr.   36MB  5m51
 new aggr.   37MB  6m13
 repack -adf 48MB  1m12

 I am not clear on what these times mean. It looks like the new code is
 slower _and_ bigger. Can you explain them?

That's right :) The upside is faster operations, which is complely
missed here. The good thing from those numbers is pack size does not
increase much (the upper limit would be repack -adf with default
settings).
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-17 Thread Jeff King
On Tue, Mar 18, 2014 at 12:50:50AM -0400, Jeff King wrote:

 On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote:
 
  As explained in the previous commit, current aggressive settings
  --depth=250 --window=250 could slow down repository access
  significantly. Notice that people usually work on recent history only,
  we could keep recent history more loosely packed, so that repo access
  is fast most of the time while the pack file remains small.
 
 One thing I have not seen is real-world timings showing the slowdown
 based on --depth. Did I miss them, or are we just making assumptions
 based on one old case from 2009 (that, AFAIK does not have real numbers,
 just speculation)? Has anyone measured the effect of bumping the delta
 cache size (and its hash implementation)?

Just as a very quick, rough data point, here are before-and-after
timings for the patch below doing git rev-list --objects --all on my
linux.git, which is a mix of --aggressive and normal packing (I didn't
do a repack -f, but it's partially what I've downloaded from k.org and
what I've repacked in various experiments over the past few months).

  [before]
  real0m28.824s
  user0m28.620s
  sys 0m0.232s

  [after]
  real0m21.694s
  user0m21.544s
  sys 0m0.172s

The numbers below are completely pulled out of a hat, so we can perhaps
do even better. But I think it shows that there is room for improvement
in the delta base cache.

---
diff --git a/environment.c b/environment.c
index c3c8606..73ed670 100644
--- a/environment.c
+++ b/environment.c
@@ -37,7 +37,7 @@ int core_compression_seen;
 int fsync_object_files;
 size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
-size_t delta_base_cache_limit = 16 * 1024 * 1024;
+size_t delta_base_cache_limit = 128 * 1024 * 1024;
 unsigned long big_file_threshold = 512 * 1024 * 1024;
 const char *pager_program;
 int pager_use_color = 1;
diff --git a/sha1_file.c b/sha1_file.c
index b37c6f6..a9ab8e3 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1944,7 +1944,7 @@ static void *unpack_compressed_entry(struct packed_git *p,
return buffer;
 }
 
-#define MAX_DELTA_CACHE (256)
+#define MAX_DELTA_CACHE (1024)
 
 static size_t delta_base_cached;
 
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-17 Thread Jeff King
On Tue, Mar 18, 2014 at 12:00:48PM +0700, Duy Nguyen wrote:

 On Tue, Mar 18, 2014 at 11:50 AM, Jeff King p...@peff.net wrote:
  On Sun, Mar 16, 2014 at 08:35:04PM +0700, Nguyễn Thái Ngọc Duy wrote:
 
  As explained in the previous commit, current aggressive settings
  --depth=250 --window=250 could slow down repository access
  significantly. Notice that people usually work on recent history only,
  we could keep recent history more loosely packed, so that repo access
  is fast most of the time while the pack file remains small.
 
  One thing I have not seen is real-world timings showing the slowdown
  based on --depth. Did I miss them, or are we just making assumptions
  based on one old case from 2009 (that, AFAIK does not have real numbers,
  just speculation)? Has anyone measured the effect of bumping the delta
  cache size (and its hash implementation)?
 
 David tested it with git-blame [1]. I should probably run some tests
 too (I don't remember if I tested some operations last time).
 
 http://thread.gmane.org/gmane.comp.version-control.git/242277/focus=242435

Ah, thanks. I do remember that thread now.

It looks like David's last word is that he gets a significant
performance from bumping the delta base cache size (and number of
buckets). And that matches the timings I just did. I suspect there are
still pathological cases that could behave worse, but it really sounds
like we should be looking into improving that cache as a first step.

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/4] gc --aggressive: three phase repacking

2014-03-17 Thread Duy Nguyen
On Tue, Mar 18, 2014 at 12:07 PM, Jeff King p...@peff.net wrote:
   [before]
   real0m28.824s
   user0m28.620s
   sys 0m0.232s

   [after]
   real0m21.694s
   user0m21.544s
   sys 0m0.172s

 The numbers below are completely pulled out of a hat, so we can perhaps
 do even better. But I think it shows that there is room for improvement
 in the delta base cache.

There is definitely room for improvement in delta base cache:
increasing cache size, better eviction strategies.. But I think it's
orthogonal to gc --aggressive improvement. From what Linus described
---aggressive is more for archival purposes but I think people have
been using it for upstream repos as well. Better packed repos could
speed up current clients immediately (although this argument is
somewhat flawed, server updates are usually at slower pace than
clients..).
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] gc --aggressive: three phase repacking

2014-03-16 Thread Nguyễn Thái Ngọc Duy
As explained in the previous commit, current aggressive settings
--depth=250 --window=250 could slow down repository access
significantly. Notice that people usually work on recent history only,
we could keep recent history more loosely packed, so that repo access
is fast most of the time while the pack file remains small.

Three more configuration variables are used to make that happen. The
first one, gc.aggressiveCommitLimits covers the old history part,
which will be tightly packed. The remaining part will be packed with
gc.lessAggresiveWindow and gc.lessAggressiveDepth. If
gc.aggressiveCommitLimits is empty, everything will be tightly packed
as before.

The repack process becomes:

 - repack -adf on old history (e.g. the default --before=1.year.ago)
   mark to keep that pack

 - repack the second time with lessAggressive settings, the kept pack
   should be left untouched.

 - remove .keep file and repack the final time, reusing all deltas

This process costs more time, but produce a more effecient pack. It is
assumed that people who do gc --aggressive do not do this often and
do not mind if it takes a bit longer.

git.git is not a great repo to test it because its size is modest but
so are my laptop's cpu and memory, so here are the timings and pack
sizes

size  time
old aggr.   36MB  5m51
new aggr.   37MB  6m13
repack -adf 48MB  1m12

Signed-off-by: Nguyễn Thái Ngọc Duy pclo...@gmail.com
---
 Documentation/config.txt |  19 +
 builtin/gc.c | 109 +--
 2 files changed, 124 insertions(+), 4 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 5ce7f9a..47979dc 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1161,6 +1161,25 @@ gc.aggressiveWindow::
algorithm used by 'git gc --aggressive'.  This defaults
to 250.
 
+gc.aggressiveCommitLimits::
+   This one parameter to linkgit:git-rev-list[1] to select
+   commits that are repacked with gc.aggressiveDepth and
+   gc.aggressiveWindow, while the remaining commits are repacked
+   with gc.lessAggressiveDepth and gc.lessAggressiveWindow.
++
+If this is an empty string, everything will be repacked with
+gc.aggressiveWindow and gc.aggressiveDepth.
+
+gc.lessAggressiveDepth::
+   The depth parameter used in the delta compression
+   algorithm used by 'git gc --aggressive' when
+   gc.aggressiveCommitLimits is set.  This defaults to 50.
+
+gc.lessAggressiveWindow::
+   The window size parameter used in the delta compression
+   algorithm used by 'git gc --aggressive' when
+   gc.aggressiveCommitLimits is set.  This defaults to 250.
+
 gc.auto::
When there are approximately more than this many loose
objects in the repository, `git gc --auto` will pack them.
diff --git a/builtin/gc.c b/builtin/gc.c
index 72aa912..37fc603 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -28,10 +28,14 @@ static const char * const builtin_gc_usage[] = {
 static int pack_refs = 1;
 static int aggressive_depth = 250;
 static int aggressive_window = 250;
+static const char *aggressive_rev_list = --before=1.year.ago;
+static int less_aggressive_depth = 50;
+static int less_aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
 static int detach_auto = 1;
 static const char *prune_expire = 2.weeks.ago;
+static int delta_base_offset = 1;
 
 static struct argv_array pack_refs_cmd = ARGV_ARRAY_INIT;
 static struct argv_array reflog = ARGV_ARRAY_INIT;
@@ -39,10 +43,13 @@ static struct argv_array repack = ARGV_ARRAY_INIT;
 static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
 
+static char *keep_file;
 static char *pidfile;
 
 static void remove_pidfile(void)
 {
+   if (keep_file)
+   unlink_or_warn(keep_file);
if (pidfile)
unlink(pidfile);
 }
@@ -54,6 +61,63 @@ static void remove_pidfile_on_signal(int signo)
raise(signo);
 }
 
+static void pack_old_history(int quiet)
+{
+   struct child_process pack_objects;
+   struct child_process rev_list;
+   struct argv_array av_po = ARGV_ARRAY_INIT;
+   struct argv_array av_rl = ARGV_ARRAY_INIT;
+   char sha1[41];
+
+   argv_array_pushl(av_rl, rev-list, --all, --objects,
+--reflog, NULL);
+   argv_array_push(av_rl, aggressive_rev_list);
+
+   memset(rev_list, 0, sizeof(rev_list));
+   rev_list.no_stdin = 1;
+   rev_list.out = -1;
+   rev_list.git_cmd = 1;
+   rev_list.argv = av_rl.argv;
+
+   if (start_command(rev_list))
+   die(_(gc: unable to fork git-rev-list));
+
+   argv_array_pushl(av_po, pack-objects, --keep-true-parents,
+--honor-pack-keep, --non-empty, --no-reuse-delta,
+--keep, --local, NULL);
+   if (delta_base_offset)
+   argv_array_push(av_po,