Re: [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries

2013-08-21 Thread Jeff King
On Wed, Aug 21, 2013 at 04:20:07PM -0700, Junio C Hamano wrote:

> I do not understand the last bit.  If two copies of an object exist
> but you have only one slot for the object in the index, and another
> object names it as its base with ref-delta, then reconstituting it
> should work just fine---whichever representation of the base object
> is recorded in the .idx, that first needs to be reconstituted before
> the delta is applied to it, and both copies should yield identical
> contents for the delta base object, no?
> 
> In any case, ejecting one from the pack .idx would not help in the
> presense of either broken or malicious packer that reuses delta too
> aggressively.  Suppose you have objects A and B and somehow manage
> to create a cycle of deltas, A names B as its delta base and B names
> A as its delta base.  The packer may notice its mistake and then add
> another copy of A as a base object.  The pack contains two copies of
> A (one is delta based on B, the other is full) and B (delta against
> A).

Yes, that is the potentially problematic scenario...

> If B refers to the copy of A that is delta against B using ofs-delta,
> fixing the .idx file will have no effect.  read_packed_sha1(B) will
> read the delta data of B, finds the offset to start reading the data
> for A which was excised from the .idx and unless that codepath is
> updated to consult revindex (i.e. you mark one copy of A in the .pack
> as bad, and when B refers to that bad copy of A via ofs-delta, you
> check the offset against revindex to get the object name of A and go
> to the good copy of A), you will never finish reading B because
> reading the bad copy of A will lead you to first reconstitute B.

Yes. There are two levels of brokenness here.

One is that the pack has a delta base offset for B that leads to the
"wrong" A, creating a cycle. This is an utterly broken pack and cannot
be fixed automatically (you do not even know the name of the cycled A
because you cannot constitute it to find its name and realize that it
has a duplicate!). But we should notice this during index-pack, because
we cannot reconstruct the object.

Another situation is that your delta base points to the "right" A. You
can reconstruct either A or B, because although there is a duplicate,
there are no cycles in the delta chain (i.e., the chain does not care
about object name, only offset, and offsets point only one way).

So we do not have a problem at all with reconstructing the object data.
We do have two ways we might access A from the same pack. For regular
object lookup, that is probably not a big deal. For operations like
repacking that look more closely at the on-disk representation, I do not
know. We may mark A as "has a delta" or "does not have a delta" at one
point in the code, but then later find the other copy of A which does
not match. I did not trace through all of pack-objects to find whether
that is possible, or what bugs it might cause. But indexing only one
copy as the "master" means that we will always get a consistent view of
the object (from that particular pack, at least).

Now consider REF_DELTA. We lookup the base object by name, so we may get
a different answer depending on how it is indexed. E.g., index-pack
keeps an internal hash table, whereas later callers will use the .idx
file. When looking at the .idx file, we may use a regular binary search,
or sha1_entry_pos. The exact order of the search may even change from
git version to git version.

So you may have a situation where index-pack finds the "right" A and
properly reconstructs the file while creating the .idx, and thinks all
is well. But later lookups may find the "wrong" A, and fail. In other
words, we cannot trust that data that makes it through index-pack is not
corrupted (and it is index-pack's output that lets receive-pack commit
to the client that we have their objects, so we want to be reasonably
sure we have uncorrupted copies at that point).

Choosing a single "master" A to go in the .idx means we will get a
consistent view of A once the .idx is generated. But it may not be the
right one, and it may not be the one we used to check that we can
resolve A and B in the first place.

The right fix for that situation is to keep both entries in the .idx,
detect delta cycles, then look up extra copies of A, but being aware
that you already found one copy in the .idx and that sha1_entry_pos (or
the vanilla binary search) should return any other copies, if they
exist. I do not think we detect delta cycles at all (though I didn't
check), but certainly we do not have a way to tell sha1_entry_pos any
different information so that it will find the "other" A.

So I think it is a recoverable state, but it is a non-trivial bit of
code for something that should never happen. Rejecting on push/fetch via
index-pack is simple and easy, and we can deal with the fallout if and
when it ever actually happens.

> > I think this line of "fixing" should probably be scrapped.
> 
> I tend 

Re: [DO NOT APPLY PATCH 4/4] index-pack: optionally skip duplicate packfile entries

2013-08-21 Thread Junio C Hamano
Jeff King  writes:

> When we are building the pack index, we can notice that
> there are duplicate objects, pick one "winner" instance, and
> mention the object only once in the index (mapped to the
> winner's offset).
>
> This has the effect that the duplicate packfile entries are
> never found by lookup. The data still exists in the
> packfile, though, and can be used as a delta base if delta
> base offsets are used. If delta refs are used, then it is
> possible that some deltas may be broken.

I do not understand the last bit.  If two copies of an object exist
but you have only one slot for the object in the index, and another
object names it as its base with ref-delta, then reconstituting it
should work just fine---whichever representation of the base object
is recorded in the .idx, that first needs to be reconstituted before
the delta is applied to it, and both copies should yield identical
contents for the delta base object, no?

In any case, ejecting one from the pack .idx would not help in the
presense of either broken or malicious packer that reuses delta too
aggressively.  Suppose you have objects A and B and somehow manage
to create a cycle of deltas, A names B as its delta base and B names
A as its delta base.  The packer may notice its mistake and then add
another copy of A as a base object.  The pack contains two copies of
A (one is delta based on B, the other is full) and B (delta against
A).

If B refers to the copy of A that is delta against B using ofs-delta,
fixing the .idx file will have no effect.  read_packed_sha1(B) will
read the delta data of B, finds the offset to start reading the data
for A which was excised from the .idx and unless that codepath is
updated to consult revindex (i.e. you mark one copy of A in the .pack
as bad, and when B refers to that bad copy of A via ofs-delta, you
check the offset against revindex to get the object name of A and go
to the good copy of A), you will never finish reading B because
reading the bad copy of A will lead you to first reconstitute B.

> I think this line of "fixing" should probably be scrapped.

I tend to agree.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html