Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-02-02  Shawn Pearce
On Fri, Feb 1, 2013 at 1:14 AM, Jeff King p...@peff.net wrote:
 On Thu, Jan 31, 2013 at 08:47:37AM -0800, Shawn O. Pearce wrote:

   - System resource cost we incur by having to keep 50 file
 descriptors open and maintaining 50 mmap windows will reduce by
 50 fold.
 
  I wonder how measurable that is (and if it matters on Linux versus less
  efficient platforms).

 It does matter. We know it has a negative impact on JGit even on Linux
 for example. You don't want 300 packs in a repository. 50 might be
 tolerable. 300 is not.

 I'd love to see numbers if you have them. It's not that I don't believe
 it is slower, but knowing _how much_ is important when thinking about
 what kind of performance increase we are looking to get (which in turn
 impacts how much effort to put into the repacking).

Never done a formal experiment. Just working from memory: 4 years
and 3 desks ago I got called in because one of our Git servers was
struggling to keep up with user git fetch commands. Turns out the
repository had O(200) pack files. A git gc that normally took only about
5 minutes took a helluva lot longer, like an hour or more.

The problem happened because the server was saving every push to a
pack and never exploding incoming packs to loose objects. This meant
the thin packs from the wire got delta bases appended to them.
pack-objects was looking at O(50) different alternatives for most
objects when trying to decide which one it should copy into the output
pack... for either a fetch/clone client, or the gc I was trying to run
to fix the repository.


Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-02-01  Jeff King
On Thu, Jan 31, 2013 at 08:47:37AM -0800, Shawn O. Pearce wrote:

   - System resource cost we incur by having to keep 50 file
 descriptors open and maintaining 50 mmap windows will reduce by
 50 fold.
 
  I wonder how measurable that is (and if it matters on Linux versus less
  efficient platforms).
 
 It does matter. We know it has a negative impact on JGit even on Linux
 for example. You don't want 300 packs in a repository. 50 might be
 tolerable. 300 is not.

I'd love to see numbers if you have them. It's not that I don't believe
it is slower, but knowing _how much_ is important when thinking about
what kind of performance increase we are looking to get (which in turn
impacts how much effort to put into the repacking).

-Peff


Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-01-31  Shawn Pearce
On Tue, Jan 29, 2013 at 1:19 PM, Jeff King p...@peff.net wrote:
 On Tue, Jan 29, 2013 at 07:58:01AM -0800, Junio C Hamano wrote:

 The point is not about space.  Disk is cheap, and it is not making
 it any worse than what happens to your target audience, that is, a
 fetch-only repository with only gc --auto in it, where nobody
 passes -f to repack to cause recomputation of deltas.

 What I was trying to seek was a way to reduce the runtime penalty we
 pay every time we run git in such a repository.

  - Object look-up cost will become log2(50*n) from 50*log2(n), which
is about 50/log2(50) improvement;

 Yes and no. Our heuristic is to look at the last-used pack for an
 object. So assuming we have locality of requests, we should quite often
 get lucky and find the object in the first log2 search. Even if we
 don't assume locality, a situation with one large pack and a few small
 packs will have the large one as last used more often than the others,
 and it will also have the looked-for object more often than the others.

Opening all of those files does impact performance. It depends on how
slow your open(2) syscall is. I know that on Mac OS X it's not the
fastest function we get from the C library. Performing ~40 opens to
look through the most recent pack files before finally finding the real
pack that contains the tag you asked `git show` for isn't that quick.
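
A rough way to put a number on the open() overhead, as a sketch rather than a
real benchmark (the file count, file size, and paths below are invented), is
to time open()/close() across a directory holding many small files:

    # Illustrative micro-benchmark: approximate the per-request cost of
    # probing ~40 pack files by timing open()+close() on 40 dummy files.
    import os, tempfile, time

    def time_opens(paths, rounds=100):
        start = time.perf_counter()
        for _ in range(rounds):
            for p in paths:
                os.close(os.open(p, os.O_RDONLY))
        return (time.perf_counter() - start) / rounds

    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(40):
            p = os.path.join(d, "pack-%04d.idx" % i)
            with open(p, "wb") as f:
                f.write(b"\0" * 1024)
            paths.append(p)
        print("40 open+close calls: %.3f ms" % (time_opens(paths) * 1000.0))

On a warm local page cache these calls are cheap; on a slow open(2)
implementation or a network filesystem they are not, which is where the
difference shows up.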

Some of us also use Git on filesystems that are network based, and
slow compared to local disk Linux ext2/3/4 with gobs of free RAM.

 So I can see how it is something we could potentially optimize, but I
 could also see it being surprisingly not a big deal. I'd be very
 interested to see real measurements, even of something as simple as a
 master index which can reference multiple packfiles.

I actually tried this many, many years ago. There are threads in the
archive about it. It's slower. We ruled it out.

  - System resource cost we incur by having to keep 50 file
descriptors open and maintaining 50 mmap windows will reduce by
50 fold.

 I wonder how measurable that is (and if it matters on Linux versus less
 efficient platforms).

It does matter. We know it has a negative impact on JGit even on Linux
for example. You don't want 300 packs in a repository. 50 might be
tolerable. 300 is not.


Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-01-29  Shawn Pearce
On Sat, Jan 26, 2013 at 10:32 PM, Junio C Hamano gits...@pobox.com wrote:
 Jeff King p...@peff.net writes:

 This is a repost from here:

   http://thread.gmane.org/gmane.comp.version-control.git/211176

 which got no response initially. Basically the issue is that read-only
 repos (e.g., a CI server) whose workflow is something like:

   git fetch $some_branch &&
   git checkout -f $some_branch &&
   make test

 will never run git-gc, and will accumulate a bunch of small packs and
 loose objects, leading to poor performance.
...
 I also wonder if we would be helped by another repack mode that
 coalesces small packs into a single one with minimum overhead, and
 run that often from gc --auto, so that we do not end up having to
 have 50 packfiles.

Yes. This does help.

 When we have 2 or more small and young packs, we could:

  - iterate over idx files for these packs to enumerate the objects
to be packed, replacing read_object_list_from_stdin() step;

  - always choose to copy the data we have in these existing packs,
instead of doing a full prepare_pack(); and

  - use the order the objects appear in the original packs, bypassing
compute_write_order().

Hmm, sounds familiar. Seems like it's what we do in JGit for Android. :-)


Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-01-29  Jeff King
On Sat, Jan 26, 2013 at 10:32:42PM -0800, Junio C Hamano wrote:

 Both make sense to me.
 
 I also wonder if we would be helped by another repack mode that
 coalesces small packs into a single one with minimum overhead, and
 run that often from gc --auto, so that we do not end up having to
 have 50 packfiles.
 
 When we have 2 or more small and young packs, we could:
 
  - iterate over idx files for these packs to enumerate the objects
to be packed, replacing read_object_list_from_stdin() step;
 
  - always choose to copy the data we have in these existing packs,
instead of doing a full prepare_pack(); and
 
  - use the order the objects appear in the original packs, bypassing
compute_write_order().

I'm not sure. If I understand you correctly, it would basically just be
concatenating packs without trying to do delta compression between the
objects which are ending up in the same pack. So it would save us from
having to do (up to) 50 binary searches to find an object in a pack, but
would not actually save us much space.

I would be interested to see the timing on how quick it is compared to a
real repack, as the I/O that happens during a repack is non-trivial
(although if you are leaving aside the big main pack, then it is
probably not bad).

But how do these somewhat mediocre concatenated packs get turned into
real packs? Pack-objects does not consider deltas between objects in the
same pack. And when would you decide to make a real pack? How do you
know you have 50 young and small packs, and not 50 mediocre coalesced
packs?

-Peff


Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-01-29  Duy Nguyen
On Sun, Jan 27, 2013 at 1:32 PM, Junio C Hamano gits...@pobox.com wrote:
 I also wonder if we would be helped by another repack mode that
 coalesces small packs into a single one with minimum overhead, and
 run that often from gc --auto, so that we do not end up having to
 have 50 packfiles.

 When we have 2 or more small and young packs, we could:

  - iterate over idx files for these packs to enumerate the objects
to be packed, replacing read_object_list_from_stdin() step;

  - always choose to copy the data we have in these existing packs,
instead of doing a full prepare_pack(); and

  - use the order the objects appear in the original packs, bypassing
compute_write_order().

Isn't it easier and cheaper to create the master index, something
like bup does?
-- 
Duy


Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-01-29  Martin Fick


Jeff King p...@peff.net wrote:

On Sat, Jan 26, 2013 at 10:32:42PM -0800, Junio C Hamano wrote:

 Both make sense to me.
 
 I also wonder if we would be helped by another repack mode that
 coalesces small packs into a single one with minimum overhead, and
 run that often from gc --auto, so that we do not end up having to
 have 50 packfiles.
 
 When we have 2 or more small and young packs, we could:
 
  - iterate over idx files for these packs to enumerate the objects
to be packed, replacing read_object_list_from_stdin() step;
 
  - always choose to copy the data we have in these existing packs,
instead of doing a full prepare_pack(); and
 
  - use the order the objects appear in the original packs, bypassing
compute_write_order().

I'm not sure. If I understand you correctly, it would basically just be
concatenating packs without trying to do delta compression between the
objects which are ending up in the same pack. So it would save us from
having to do (up to) 50 binary searches to find an object in a pack, but
would not actually save us much space.

I would be interested to see the timing on how quick it is compared to a
real repack, as the I/O that happens during a repack is non-trivial
(although if you are leaving aside the big main pack, then it is
probably not bad).

But how do these somewhat mediocre concatenated packs get turned into
real packs? Pack-objects does not consider deltas between objects in the
same pack. And when would you decide to make a real pack? How do you
know you have 50 young and small packs, and not 50 mediocre coalesced
packs?


If we are reconsidering repacking strategies, I would like to propose an
approach that might be a more general improvement to repacking and would
help in more situations.

You could roll together any packs which are close in size, say within 50% of
each other.  With this strategy you will end up with pack files which are
spread out by size exponentially.  I implemented this strategy on top of the
current gc script using keep files, and it works fairly well:

https://gerrit-review.googlesource.com/#/c/35215/3/contrib/git-exproll.sh

This saves some time, but mostly it saves I/O when repacking regularly.  I
suspect that if this strategy were used in core git, further optimizations
could be made to also reduce the repack time, but I don't know enough about
repacking to say.  We run it nightly on our servers, both write and read-only
mirrors.  We currently use a ratio of 5 to drastically reduce large repack
file rollovers.
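
As a toy sketch of that rollup rule (this is not the actual git-exproll.sh
logic; the sizes and ratio below are invented), grouping pack sizes so that
only packs within a given size ratio of each other get rolled together could
look like this:

    # Group sorted pack sizes; any group with more than one member would be
    # rolled into a single pack (e.g. by putting .keep files on everything
    # else), leaving the survivors spread out roughly exponentially by size.
    def plan_rollups(sizes, ratio=5.0):
        groups, current = [], []
        for size in sorted(sizes):
            if current and size > current[0] * ratio:
                groups.append(current)
                current = []
            current.append(size)
        if current:
            groups.append(current)
        return groups

    print(plan_rollups([1, 2, 3, 20, 30, 5000], ratio=5.0))
    # [[1, 2, 3], [20, 30], [5000]]: the small packs coalesce, the big one is left alone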

-Martin



Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-01-29  Junio C Hamano
Jeff King p...@peff.net writes:

 I also wonder if we would be helped by another repack mode that
 coalesces small packs into a single one with minimum overhead, and
 run that often from gc --auto, so that we do not end up having to
 have 50 packfiles.
 ...

 I'm not sure. If I understand you correctly, it would basically just be
 concatenating packs without trying to do delta compression between the
 objects which are ending up in the same pack. So it would save us from
 having to do (up to) 50 binary searches to find an object in a pack, but
 would not actually save us much space.

The point is not about space.  Disk is cheap, and it is not making
it any worse than what happens to your target audience, that is, a
fetch-only repository with only gc --auto in it, where nobody
passes -f to repack to cause recomputation of deltas.

What I was trying to seek was a way to reduce the runtime penalty we
pay every time we run git in such a repository.

 - Object look-up cost will become log2(50*n) from 50*log2(n), which
   is about 50/log2(50) improvement;

 - System resource cost we incur by having to keep 50 file
   descriptors open and maintaining 50 mmap windows will reduce by
   50 fold.

 - Anything else I missed?

 I would be interested to see the timing on how quick it is compared to a
 real repack,...

Yes, that is what I meant by "wonder if we would be helped by" ;-)

 But how do these somewhat mediocre concatenated packs get turned into
 real packs?

How do they get processed in fetch-only repositories that
sometimes run gc --auto today?  By running repack -a -d -f
occasionally, perhaps?

At some point, you would need to run a repack that involves a real
object-graph traversal that feeds you the paths for objects to
obtain a reasonably compressed pack.  We can inspect existing packs
and guess a rough traversal order for commits, but for trees and
blobs, you cannot unify existing delta chains from multiple packs
effectively with data in the pack alone.



Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-01-29  Jeff King
On Tue, Jan 29, 2013 at 07:58:01AM -0800, Junio C Hamano wrote:

 The point is not about space.  Disk is cheap, and it is not making
 it any worse than what happens to your target audience, that is, a
 fetch-only repository with only gc --auto in it, where nobody
 passes -f to repack to cause recomputation of deltas.
 
 What I was trying to seek was a way to reduce the runtime penalty we
 pay every time we run git in such a repository.
 
  - Object look-up cost will become log2(50*n) from 50*log2(n), which
is about 50/log2(50) improvement;

Yes and no. Our heuristic is to look at the last-used pack for an
object. So assuming we have locality of requests, we should quite often
get lucky and find the object in the first log2 search. Even if we
don't assume locality, a situation with one large pack and a few small
packs will have the large one as last used more often than the others,
and it will also have the looked-for object more often than the others.

So I can see how it is something we could potentially optimize, but I
could also see it being surprisingly not a big deal. I'd be very
interested to see real measurements, even of something as simple as a
master index which can reference multiple packfiles.
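
One cheap way to get a first-order feel for this, without touching real pack
files and ignoring the last-used-pack heuristic described above, is to time
binary searches over 50 synthetic sorted "pack indexes" versus one merged
index of the same object ids (the object counts here are invented):

    import bisect, random, time

    random.seed(1)
    packs = [sorted(random.getrandbits(64) for _ in range(10000))
             for _ in range(50)]                  # 50 packs of 10k "objects"
    merged = sorted(x for p in packs for x in p)  # one master index
    wanted = [random.choice(random.choice(packs)) for _ in range(20000)]

    def find_in_packs(oid):
        for p in packs:                           # up to 50 binary searches
            i = bisect.bisect_left(p, oid)
            if i < len(p) and p[i] == oid:
                return True
        return False

    t0 = time.perf_counter()
    for oid in wanted:
        find_in_packs(oid)
    t1 = time.perf_counter()
    for oid in wanted:
        bisect.bisect_left(merged, oid)
    t2 = time.perf_counter()
    print("per-pack: %.2fs  merged: %.2fs" % (t1 - t0, t2 - t1))

This leaves out everything that matters in practice (mmap windows, page
cache, the idx fan-out table), so it is an argument for measuring rather than
a measurement.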

  - System resource cost we incur by having to keep 50 file
descriptors open and maintaining 50 mmap windows will reduce by
50 fold.

I wonder how measurable that is (and if it matters on Linux versus less
efficient platforms).

  I would be interested to see the timing on how quick it is compared to a
  real repack,...
 
 Yes, that is what I meant by "wonder if we would be helped by" ;-)

There is only one way to find out... :)

Maybe I am blessed with nice machines, but I have mostly found the
repack process not to be that big a deal these days (especially with
threaded delta compression).

  But how do these somewhat mediocre concatenated packs get turned into
  real packs?
 
 How do they get processed in fetch-only repositories that
 sometimes run gc --auto today?  By running repack -a -d -f
 occasionally, perhaps?

Do we run repack -adf regularly? The usual git gc procedure will not
use -f, and without that, we will not even consider making deltas
between objects that were formerly in different packs (but now are in
the same pack).

So you are avoiding doing medium-effort packs (repack -ad) in favor of
doing potentially quick packs, but occasionally doing a big-effort pack
(repack -adf). It may be reasonable advice to repack -adf
occasionally, but I suspect most people are not doing it regularly (if
only because git gc does not do it by default).

-Peff


Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-01-29  Junio C Hamano
Jeff King p...@peff.net writes:

  But how do these somewhat mediocre concatenated packs get turned into
  real packs?
 
 How do they get processed in fetch-only repositories that
 sometimes run gc --auto today?  By running repack -a -d -f
 occasionally, perhaps?

 Do we run repack -adf regularly? The usual git gc procedure will not
 use -f, and without that, we will not even consider making deltas
 between objects that were formerly in different packs (but now are in
 the same pack).

Correct.  It is not a new problem, and while I think it would need
some solution, the "coalesce 50 small young packs into one" idea is not
an attempt to address it.

 So you are avoiding doing medium-effort packs (repack -ad) in favor of
 doing potentially quick packs, but occasionally doing a big-effort pack
 (repack -adf).

So I think the "but occasionally" part is not correct.  Either way,
the packs use suboptimal deltas, and you eventually have to repack with
-f, whether you coalesce 50 packs into 1 often or keep paying the
cost of having 50 packs for longer.

The trade-off is purely between one potentially quick coalescing
per fetch in a fetch-only repository, versus every use of the data in
the fetch-only repository (what do they do?  build?  serve gitweb
locally?) having to deal with 25 packs on average, while we still need
to pay the medium repack cost every 50 fetches.


Re: [PATCH 0/2] optimizing pack access on read only fetch repos

2013-01-26  Junio C Hamano
Jeff King p...@peff.net writes:

 This is a repost from here:

   http://thread.gmane.org/gmane.comp.version-control.git/211176

 which got no response initially. Basically the issue is that read-only
 repos (e.g., a CI server) whose workflow is something like:

   git fetch $some_branch &&
   git checkout -f $some_branch &&
   make test

 will never run git-gc, and will accumulate a bunch of small packs and
 loose objects, leading to poor performance.

 Patch 1 runs gc --auto on fetch, which I think is sane to do.

 Patch 2 optimizes our pack dir re-scanning for fetch-pack (which, unlike
 the rest of git, should expect to be missing lots of objects, since we
 are deciding what to fetch).

 I think 1 is a no-brainer. If your repo is packed, patch 2 matters less,
 but it still seems like a sensible optimization to me.

   [1/2]: fetch: run gc --auto after fetching
   [2/2]: fetch-pack: avoid repeatedly re-scanning pack directory

 -Peff

Both make sense to me.

I also wonder if we would be helped by another repack mode that
coalesces small packs into a single one with minimum overhead, and
run that often from gc --auto, so that we do not end up having to
have 50 packfiles.

When we have 2 or more small and young packs, we could:

 - iterate over idx files for these packs to enumerate the objects
   to be packed, replacing read_object_list_from_stdin() step;

 - always choose to copy the data we have in these existing packs,
   instead of doing a full prepare_pack(); and

 - use the order the objects appear in the original packs, bypassing
   compute_write_order().

The procedure cannot be a straight byte-for-byte copy, because some
objects may appear in multiple packs, and extra copies of the same
object have to be excised from the result.  OFS_DELTA offsets need
to be adjusted for objects that appear later in the output, and for
objects that were deltified against such an object and recorded their
base in OFS_DELTA format.

But other than such OFS_DELTA adjustments, it feels like such an
"only coalesce multiple packs into one" mode should be fairly quick.
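
As a toy model of that "only coalesce" mode (my sketch, not git's pack-objects
code; the objects are synthetic (id, size, base_id) records rather than real
pack entries), the bookkeeping would look roughly like this: walk the packs in
their original order, skip extra copies of any id already emitted, and
recompute each OFS_DELTA back-pointer against the new layout:

    def coalesce(packs):
        new_offset, order, offsets, base_of = 0, [], {}, {}
        for pack in packs:
            for oid, size, base_id in pack:
                if oid in offsets:        # extra copy of the same object: excise it
                    continue
                offsets[oid] = new_offset
                base_of[oid] = base_id
                order.append(oid)
                new_offset += size
        # OFS_DELTA records how far back the base sits; recompute it per object.
        return [(oid, offsets[oid],
                 offsets[oid] - offsets[base_of[oid]] if base_of[oid] else None)
                for oid in order]

    pack_a = [("base1", 100, None), ("delta1", 20, "base1")]
    pack_b = [("base1", 100, None), ("delta2", 30, "base1")]  # re-sends the base
    for oid, offset, back in coalesce([pack_a, pack_b]):
        print(oid, offset, back)   # back is None for non-delta objects

REF_DELTA entries, which name their base by object id instead of by offset,
would not need this rewriting at all, only the duplicate removal.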