Re: [PATCH v5] pack-objects mru

2016-08-11 Thread Jeff King
On Thu, Aug 11, 2016 at 08:11:33AM -0700, Junio C Hamano wrote:

> Jeff King  writes:
> 
> > So considering "--depth" as a space-saving measure for --aggressive does
> > not seem that effective. But it feels weird to quietly drop actions
> > people might have done with previous aggressive runs.
> 
> That argument cuts both ways, doesn't it?
> 
> If the user explicitly asks to use lower "--depth" from the command
> line when the second repack runs, the intention is clear: the
> existing pack may use delta chains that are too long and is
> detrimental to the run-time performance, and the user wants to
> correct it by repacking with shorter delta chain.
> 
> Should the act of letting "gc --auto" use lower "--depth", by not
> configuring to always use deeper chain, be interpreted the same way?
> I am not sure.  The old packing with large --depth is something the
> user did long time ago, and the decision the user made not to use
> large depth always is also something the user did long time ago.  I
> do not think it is so cut-and-dried which one of the two conflicting
> wishes we should honor when running the second repack, especially
> when it is run unattended like "gc --auto" does.

Good points. Explicitly saying "repack --depth=..." carries a lot more
weight to me than "git gc --auto" randomly kicking in, as far as knowing
that what the user actually wants. My patch doesn't differentiate, of
course, but I think it could.

The other problem with my patch is the fact that we don't do a good job
of finding new, in-limit deltas for the ones we discard. If you want to
do that, you really need to "git repack -f" (at least with the current
code). At which point we do not reuse the on-disk deltas at all, and the
problem is moot (you could also interpret the fact that the user did
_not_ pass "-f" as "you want to reuse deltas, which means you want to
reuse even long chains", but as you've argued above, you can make a lot
of guesses about the user's intention from what they did or did not
say).

So if we were to go this route, I don't think my patch is quite
sufficient; we'd want something else on top to do a better job of
finding replacement deltas.

Regarding my "does not seem that effective" above, I think we should
drop the aggressive depth to 50, and I just posted a patch with
reasoning and numbers:

  
http://public-inbox.org/git/20160811161309.ecmebaafcz6rk...@sigill.intra.peff.net/

That's maybe orthogonal, but it does remove the weird "gc --aggressive
followed by gc --auto produces a bad pack" issue, because unless you are
doing something clever, the depth will always be 50 (modulo people who
did an aggressive pack with an older version of git :-/ ).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5] pack-objects mru

2016-08-11 Thread Junio C Hamano
Jeff King  writes:

> So considering "--depth" as a space-saving measure for --aggressive does
> not seem that effective. But it feels weird to quietly drop actions
> people might have done with previous aggressive runs.

That argument cuts both ways, doesn't it?

If the user explicitly asks to use lower "--depth" from the command
line when the second repack runs, the intention is clear: the
existing pack may use delta chains that are too long and is
detrimental to the run-time performance, and the user wants to
correct it by repacking with shorter delta chain.

Should the act of letting "gc --auto" use lower "--depth", by not
configuring to always use deeper chain, be interpreted the same way?
I am not sure.  The old packing with large --depth is something the
user did long time ago, and the decision the user made not to use
large depth always is also something the user did long time ago.  I
do not think it is so cut-and-dried which one of the two conflicting
wishes we should honor when running the second repack, especially
when it is run unattended like "gc --auto" does.



--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v5] pack-objects mru

2016-08-11 Thread Jeff King
On Thu, Aug 11, 2016 at 05:20:30AM -0400, Jeff King wrote:

> Here it is. It ended up needing a few preparatory patches.
> 
>   [1/4]: provide an initializer for "struct object_info"
>   [2/4]: sha1_file: make packed_object_info public
>   [3/4]: pack-objects: break delta cycles before delta-search phase
>   [4/4]: pack-objects: use mru list when iterating over packs
> 
> I had originally intended to include an extra patch to handle the
> --depth limits better. But after writing it, I'm not sure it's actually
> a good idea. I'll post it as an addendum with more discussion.

And here's the depth patch. It does work, as you can see by running the
snippet at the bottom of the commit message.

But I began to wonder if it's actually a desirable thing. For instance,
if you do:

  git gc --aggressive
  ... time passes ...
  git gc

the first gc may generate chains up to 250 objects. When the second gc
runs (and you may not even run it yourself; it might be "gc --auto"!),
it will generally reuse most of your existing deltas, even though the
default depth is only 50.

But with the patch below, it will drop deltas from the middle of those
long chains, undoing the prior --aggressive results. Worse, we don't
find new deltas for those objects, because it falls afoul of the "they
are in the same pack, so don't bother looking for a delta" rule.

An --aggressive repack of my git.git is 52MB. Repacking that with the
patch below and "--depth=50" bumps it to 55MB. Dumping the "do not
bother" condition in try_delta() drops that to 54MB.

So it _is_ worse for the space to drop those high-depth deltas. Even if
we fixed the "do not bother" (e.g., by recording a bit that says "even
though these are in the same pack, try anyway, because we had to break
the delta for other reasons"), it's still a loss.

OTOH, I am not altogether convinced that the tradeoff of a giant --depth
is worth. I'm looking only at the space here, but deeper delta chains
cost more CPU to access (especially if they start to exceed the delta
cache size). And the space savings aren't that amazing. Doing a "git
repack -adf --depth=50 --window=250" (i.e., if aggressive had just
tweaked the window size and not the depth in the first place), the
result is only 53MB.

So considering "--depth" as a space-saving measure for --aggressive does
not seem that effective. But it feels weird to quietly drop actions
people might have done with previous aggressive runs.

-- >8 --
Subject: [PATCH] pack-objects: enforce --depth limit in reused deltas

Since 898b14c (pack-objects: rework check_delta_limit usage,
2007-04-16), we check the delta depth limit only when
figuring out whether we should make a new delta. We don't
consider it at all when reusing deltas, which means that
packing once with --depth=50, and then against with
--depth=10, the second pack my still contain chains larger
than 10.

This is probably not a huge deal, as it is limited to
whatever chains you happened to create in a previous run.
But as we start allowing cross-pack delta reuse in a future
commit, this maximum will rise to the number of packs times
the per-pack depth (in the worst case; on average, it will
likely be much smaller).

We can easily detect this as part of the existing search for
cycles, since we visit every node in a depth-first way. That
lets us compute the depth of any node based on the depth of
its base, because we know the base is DFS_DONE by the time
we look at it (modulo any cycles in the graph, but we know
there cannot be any because we break them as we see them).

There is some subtlety worth mentioning, though. We record
the depth of each object as we compute it. It might seem
like we could save the per-object storage space by just
keeping track of the depth of our traversal (i.e., have
break_delta_chains() report how deep it went). But we may
visit an object through multiple delta paths, and on
subsequent paths we want to know its depth immediately,
without having to walk back down to its final base (doing so
would make our graph walk quadratic rather than linear).

Likewise, one could try to record the depth not from the
base, but from our starting point (i.e., start
recursion_depth at 0, and pass "recursion_depth + 1" to each
invocation of break_delta_chains()). And then when
recursion_depth gets too big, we know that we must cut the
delta chain.  But that technique is wrong if we do not visit
the nodes in topological order. In a chain A->B->C, it
if we visit "C", then "B", then "A", we will never recurse
deeper than 1 link (because we see at each node that we have
already visited it).

Unfortunately there is no automated test, because it's hard
to convince pack-objects to reliably produce delta chains.
Naively, it would seem that a sequence of ever-increasing
blobs would work. E.g., something like:

  for i in 1 2 3 4 5; do
  test-genrandom $i 4096 >>file
  git add file
  git commit -m $i
  done

where a reasonable set of deltas would use "1:file" as the
b

[PATCH v5] pack-objects mru

2016-08-11 Thread Jeff King
On Thu, Aug 11, 2016 at 02:57:51AM -0400, Jeff King wrote:

> > One thing to be careful of is that there are more things this
> > drop_reused_delta() should be doing. But I looked through the rest of
> > check_object() and could not find anything else.
> 
> Argh, I spoke too soon.
> 
> It is true that the size lookup is the only part of check_object()
> we skip if we are reusing the delta. But what I didn't notice is that
> when we choose to reuse a delta, we overwrite entry->type (the real
> type!) with the in_pack_type (OFS_DELTA, etc). We need to undo that so
> that later stages see the real type.
> 
> I'm not sure how my existing tests worked (I confirmed that they do
> indeed break the delta). It may be that only some code paths actually
> care about the real type. But when playing with the depth-limit (which
> uses the same drop_reused_delta() helper), I managed to make some pretty
> broken packs.
> 
> So please disregard the v4 patch I just sent; I haven't triggered it,
> but I imagine it has the same problem, and I just didn't manage to
> trigger it.
> 
> I'll clean that up and send out a new series.

Here it is. It ended up needing a few preparatory patches.

  [1/4]: provide an initializer for "struct object_info"
  [2/4]: sha1_file: make packed_object_info public
  [3/4]: pack-objects: break delta cycles before delta-search phase
  [4/4]: pack-objects: use mru list when iterating over packs

I had originally intended to include an extra patch to handle the
--depth limits better. But after writing it, I'm not sure it's actually
a good idea. I'll post it as an addendum with more discussion.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html