Re: Zero padded file modes...

2013-09-05 Thread Jeff King
On Thu, Sep 05, 2013 at 01:13:40PM -0400, John Szakmeister wrote:

> > Yep. These were mostly caused by a bug in Grit that is long-fixed.  But
> > the objects remain in many histories. It would have painful to rewrite
> > them back then, and it would be even more painful now.
> 
> I guess there's still the other side of the question though.  Are
> these repositories busted in the sense that something no longer works?

No, as far as I know, everything still works fine. However, some diffs
may be suboptimal, because we may have two different sha1s for the same
subtree (so we may descend into the tree unnecessarily only to find that
they are equivalent). And by the same token, any scripts doing
non-recursive diffs may erroneously mark the trees as differing, even
though they do not contain any differing files.

But neither is a big problem in practice. If you had two clients in
active use which were flip-flopping a sub-tree back and forth between
representations, it would be a problem. But we are talking about a few
isolated incidents far back in history.

> I doesn't appear to be the case, but I've not used it extensively say
> I can't say for certain one way or another.  In the sense that the
> content is not strictly compliant, transfer.fsckObjects did its job,
> but I wonder if fsck needs to be a little more tolerant now (at least
> with respect to transfer objects)?

Fsck actually treats this as a warning, not an error. It is
transfer.fsckObjects (via "index-pack --strict") that actually treats
warnings as errors.

It's possible that this should be loosened to allow through problems
marked as FSCK_WARN (with a message, kind of like...a warning). Though
it may also make sense to revisit some of the classifications in fsck
(e.g., many of the warnings are indicative of seriously broken objects).

GitHub uses transfer.fsckObjects, rejecting all warnings[1]. In practice
it is not usually a big deal, as people are happy to fix up their
objects _before_ they get widely published. The biggest push-back we get
is when somebody tries to re-push history they got from another GitHub
repo, and then says "But why are you complaining? You served this crappy
broken history?" And it's a fair point. If you are forking (but not
joining the existing fork network) of an existing project with
irregularities in the history, it's not really an option to simply
rewrite the history you are basing on.

-Peff

[1] Actually, we do let through 0-padded modes with a warning,
explicitly because of the problem mentioned above.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Zero padded file modes...

2013-09-05 Thread Jeff King
On Thu, Sep 05, 2013 at 01:09:34PM -0400, Nicolas Pitre wrote:

> On Thu, 5 Sep 2013, Jeff King wrote:
> 
> > There are basically two solutions:
> > 
> >   1. Add a single-bit flag for "I am 0-padded in the real data". We
> >  could probably even squeeze it into the same integer.
> > 
> >   2. Have a "classic" section of the pack that stores the raw object
> >  bytes. For objects which do not match our expectations, store them
> >  raw instead of in v4 format. They will not get the benefit of v4
> >  optimizations, but if they are the minority of objects, that will
> >  only end up with a slight slow-down.
> 
> That is basically what I just suggested.  But instead of a special 
> section, simply using a special object type number would do it.

Yeah, I think we are in agreement. I only suggested a separate section
because I hadn't carefully read the v4 patches yet, and didn't know if
there was room in the normal sequence. A special object number seems
much more elegant.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Zero padded file modes...

2013-09-05 Thread Nicolas Pitre
On Thu, 5 Sep 2013, Jeff King wrote:

> On Thu, Sep 05, 2013 at 11:18:24PM +0700, Nguyen Thai Ngoc Duy wrote:
> 
> > > There are basically two solutions:
> > >
> > >   1. Add a single-bit flag for "I am 0-padded in the real data". We
> > >  could probably even squeeze it into the same integer.
> > >
> > >   2. Have a "classic" section of the pack that stores the raw object
> > >  bytes. For objects which do not match our expectations, store them
> > >  raw instead of in v4 format. They will not get the benefit of v4
> > >  optimizations, but if they are the minority of objects, that will
> > >  only end up with a slight slow-down.
> > 
> > 3. Detect this situation and fall back to v2.
> > 
> > 4. Update v4 to allow storing raw tree entries mixing with v4-encoded
> > tree entries. This is something between (1) and (2)
> 
> I wouldn't want to do (3). At some point pack v4 may become the standard
> format, but there will be some repositories which will never be allowed
> to adopt it.
> 
> For (4), yes, that could work. But like (1), it only solves problems in
> tree entries. What happens if we have a quirky commit object that needs
> the same treatment (e.g., a timezone that does not fit into the commit
> name dictionary properly)?
> 
> > I think (4) fits better in v4 design and probably not hard to do. Nico
> > recently added a code to embed a tree entry inline, but the mode must
> > be encoded (and can't contain leading zeros). We could have another
> > code to store mode in ascii. This also makes me wonder if we might
> > have similar problems with timezones, which are also specially encoded
> > in v4..
> 
> Yeah, that might be more elegant.
> 
> > (3) is probably easiest. We need to scan through all tree entries
> > first when creating v4 anyway. If we detect any anomalies, just switch
> > back to v2 generation. The user will be force to rewrite history in
> > order to take full advantage of v4 (they can have a pack of weird
> > trees in v2 and the rest in v4 pack, but that's not optimal).
> 
> Splitting across two packs isn't great, though. What if v4 eventually
> becomes the normal on-the-wire format? I'd rather have some method for
> just embedding what are essentially v2 objects into the v4 pack, which
> would give us future room for handling these sorts of things.
> 
> But like I said, I haven't looked closely yet, so maybe there are
> complications with that. In the meantime, I'll defer to the judgement of
> people who know what they are talking about. :)

None of the above is particularly appealing to me.

Pack v4 has to enforce some standardization in the object encoding to be 
efficient.  Some compromizes have been applied to accommodate the fixing 
of a thin pack, although I was initially tempted to simply dodge the 
issue and allow thin packs in a repository.

On this particular mode issue, I remember making a fuss at the time when 
this was discovered because the github implementation did generate such 
tree objects at the time.

So instead of compromizing the pack v4 object encoding further, I'd 
simply suggest adding a special object type which is in fact simply the 
pack v2 representation i.e. the canonical object version, deflated.  
Right now pack v4 encodes only 5 object types: commit, tree, blob, delta 
and tag.  Only the commit and tree objects have their representation 
transcoded.  So that means we only need to add native_commit and 
native_tree object types.

Then, anything that doesn't fit the strict expectation for transcoding a 
tree or a commit object is simply included as is without transcoding 
just like in pack v2.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Zero padded file modes...

2013-09-05 Thread Nicolas Pitre
On Thu, 5 Sep 2013, Jeff King wrote:

> There are basically two solutions:
> 
>   1. Add a single-bit flag for "I am 0-padded in the real data". We
>  could probably even squeeze it into the same integer.
> 
>   2. Have a "classic" section of the pack that stores the raw object
>  bytes. For objects which do not match our expectations, store them
>  raw instead of in v4 format. They will not get the benefit of v4
>  optimizations, but if they are the minority of objects, that will
>  only end up with a slight slow-down.

That is basically what I just suggested.  But instead of a special 
section, simply using a special object type number would do it.

I'm even wondering if that couldn't be used for fixing a thin pack 
instead of the special provision I just added last night.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Zero padded file modes...

2013-09-05 Thread John Szakmeister
On Thu, Sep 5, 2013 at 11:36 AM, Jeff King  wrote:
> On Thu, Sep 05, 2013 at 10:00:39AM -0400, John Szakmeister wrote:
>
>> I went to clone a repository from GitHub today and discovered
>> something interesting:
>>
>> :: git clone https://github.com/liebke/incanter.git
>> Cloning into 'incanter'...
>> remote: Counting objects: 10457, done.
>> remote: Compressing objects: 100% (3018/3018), done.
>> error: object 4946e1ba09ba5655202a7a5d81ae106b08411061:contains
>> zero-padded file modes
>> fatal: Error in object
>> fatal: index-pack failed
>
> Yep. These were mostly caused by a bug in Grit that is long-fixed.  But
> the objects remain in many histories. It would have painful to rewrite
> them back then, and it would be even more painful now.

I guess there's still the other side of the question though.  Are
these repositories busted in the sense that something no longer works?
 I doesn't appear to be the case, but I've not used it extensively say
I can't say for certain one way or another.  In the sense that the
content is not strictly compliant, transfer.fsckObjects did its job,
but I wonder if fsck needs to be a little more tolerant now (at least
with respect to transfer objects)?

I can certainly cope with the issue--it's not a problem for me to flip
the flag on the command line.  I think it'd be nice to have
transer.fsckObjects be the default at some point, considering how
little people run fsck otherwise and how long these sorts of issues go
undiscovered.  Issues like the above seem to stand in the way of that
happening though.

-John
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Zero padded file modes...

2013-09-05 Thread Jeff King
On Thu, Sep 05, 2013 at 11:18:24PM +0700, Nguyen Thai Ngoc Duy wrote:

> > There are basically two solutions:
> >
> >   1. Add a single-bit flag for "I am 0-padded in the real data". We
> >  could probably even squeeze it into the same integer.
> >
> >   2. Have a "classic" section of the pack that stores the raw object
> >  bytes. For objects which do not match our expectations, store them
> >  raw instead of in v4 format. They will not get the benefit of v4
> >  optimizations, but if they are the minority of objects, that will
> >  only end up with a slight slow-down.
> 
> 3. Detect this situation and fall back to v2.
> 
> 4. Update v4 to allow storing raw tree entries mixing with v4-encoded
> tree entries. This is something between (1) and (2)

I wouldn't want to do (3). At some point pack v4 may become the standard
format, but there will be some repositories which will never be allowed
to adopt it.

For (4), yes, that could work. But like (1), it only solves problems in
tree entries. What happens if we have a quirky commit object that needs
the same treatment (e.g., a timezone that does not fit into the commit
name dictionary properly)?

> I think (4) fits better in v4 design and probably not hard to do. Nico
> recently added a code to embed a tree entry inline, but the mode must
> be encoded (and can't contain leading zeros). We could have another
> code to store mode in ascii. This also makes me wonder if we might
> have similar problems with timezones, which are also specially encoded
> in v4..

Yeah, that might be more elegant.

> (3) is probably easiest. We need to scan through all tree entries
> first when creating v4 anyway. If we detect any anomalies, just switch
> back to v2 generation. The user will be force to rewrite history in
> order to take full advantage of v4 (they can have a pack of weird
> trees in v2 and the rest in v4 pack, but that's not optimal).

Splitting across two packs isn't great, though. What if v4 eventually
becomes the normal on-the-wire format? I'd rather have some method for
just embedding what are essentially v2 objects into the v4 pack, which
would give us future room for handling these sorts of things.

But like I said, I haven't looked closely yet, so maybe there are
complications with that. In the meantime, I'll defer to the judgement of
people who know what they are talking about. :)

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Zero padded file modes...

2013-09-05 Thread A Large Angry SCM



On 09/05/2013 11:36 AM, Jeff King wrote:
[...]


I haven't looked carefully at the pack v4 patches yet, but I suspect
that yes, it's still a problem. The premise of pack v4 is that we can do
better by not storing the raw git object bytes, but rather storing
specialized representations of the various components. For example, by
using an integer to store the mode rather than the ascii representation.
But that representation does not represent the "oops, I have a 0-padded
mode" quirk. And we have to be able to recover the original object, byte
for byte, from the v4 representation (to verify sha1, or to generate a
loose object or v2 pack).

There are basically two solutions:

   1. Add a single-bit flag for "I am 0-padded in the real data". We
  could probably even squeeze it into the same integer.

   2. Have a "classic" section of the pack that stores the raw object
  bytes. For objects which do not match our expectations, store them
  raw instead of in v4 format. They will not get the benefit of v4
  optimizations, but if they are the minority of objects, that will
  only end up with a slight slow-down.

As I said, I have not looked carefully at the v4 patches, so maybe they
handle this case already. But of the two solutions, I prefer (2). Doing
(1) can solve _this_ problem, but it complicates the format, and does
nothing for any future compatibility issues. Whereas (2) is easy to
implement, since it is basically just pack v2 (and implementations would
need a pack v2 reader anyway).


3. Keep those objects in v2 packs instead of the v4 pack. Transfers 
would have to be v3 or multi-pack transfers would need to be supported.


4. Don't use v4 packs with projects that have "crufty" objects. Projects 
with such objects may choose to pay the "cost" to "upgrade" to v4 
compatibility.


There's nothing that requires the next pack format to support all of the 
broken stuff that's happened over the years.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Zero padded file modes...

2013-09-05 Thread Duy Nguyen
On Thu, Sep 5, 2013 at 10:36 PM, Jeff King  wrote:
>> > This is going to screw up pack v4 (yes, someday I'll have the
>> > time to make it real).
>>
>> I don't know if this is still true, but given that patches are
>> being sent out about it, I thought it relevant.
>
> I haven't looked carefully at the pack v4 patches yet, but I suspect
> that yes, it's still a problem. The premise of pack v4 is that we can do
> better by not storing the raw git object bytes, but rather storing
> specialized representations of the various components. For example, by
> using an integer to store the mode rather than the ascii representation.
> But that representation does not represent the "oops, I have a 0-padded
> mode" quirk. And we have to be able to recover the original object, byte
> for byte, from the v4 representation (to verify sha1, or to generate a
> loose object or v2 pack).
>
> There are basically two solutions:
>
>   1. Add a single-bit flag for "I am 0-padded in the real data". We
>  could probably even squeeze it into the same integer.
>
>   2. Have a "classic" section of the pack that stores the raw object
>  bytes. For objects which do not match our expectations, store them
>  raw instead of in v4 format. They will not get the benefit of v4
>  optimizations, but if they are the minority of objects, that will
>  only end up with a slight slow-down.

3. Detect this situation and fall back to v2.

4. Update v4 to allow storing raw tree entries mixing with v4-encoded
tree entries. This is something between (1) and (2)

> As I said, I have not looked carefully at the v4 patches, so maybe they
> handle this case already. But of the two solutions, I prefer (2). Doing
> (1) can solve _this_ problem, but it complicates the format, and does
> nothing for any future compatibility issues. Whereas (2) is easy to
> implement, since it is basically just pack v2 (and implementations would
> need a pack v2 reader anyway).

I think (4) fits better in v4 design and probably not hard to do. Nico
recently added a code to embed a tree entry inline, but the mode must
be encoded (and can't contain leading zeros). We could have another
code to store mode in ascii. This also makes me wonder if we might
have similar problems with timezones, which are also specially encoded
in v4..

(3) is probably easiest. We need to scan through all tree entries
first when creating v4 anyway. If we detect any anomalies, just switch
back to v2 generation. The user will be force to rewrite history in
order to take full advantage of v4 (they can have a pack of weird
trees in v2 and the rest in v4 pack, but that's not optimal).
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Zero padded file modes...

2013-09-05 Thread Jeff King
On Thu, Sep 05, 2013 at 10:00:39AM -0400, John Szakmeister wrote:

> I went to clone a repository from GitHub today and discovered
> something interesting:
> 
> :: git clone https://github.com/liebke/incanter.git
> Cloning into 'incanter'...
> remote: Counting objects: 10457, done.
> remote: Compressing objects: 100% (3018/3018), done.
> error: object 4946e1ba09ba5655202a7a5d81ae106b08411061:contains
> zero-padded file modes
> fatal: Error in object
> fatal: index-pack failed

Yep. These were mostly caused by a bug in Grit that is long-fixed.  But
the objects remain in many histories. It would have painful to rewrite
them back then, and it would be even more painful now.

> > This is going to screw up pack v4 (yes, someday I'll have the
> > time to make it real).
> 
> I don't know if this is still true, but given that patches are
> being sent out about it, I thought it relevant.

I haven't looked carefully at the pack v4 patches yet, but I suspect
that yes, it's still a problem. The premise of pack v4 is that we can do
better by not storing the raw git object bytes, but rather storing
specialized representations of the various components. For example, by
using an integer to store the mode rather than the ascii representation.
But that representation does not represent the "oops, I have a 0-padded
mode" quirk. And we have to be able to recover the original object, byte
for byte, from the v4 representation (to verify sha1, or to generate a
loose object or v2 pack).

There are basically two solutions:

  1. Add a single-bit flag for "I am 0-padded in the real data". We
 could probably even squeeze it into the same integer.

  2. Have a "classic" section of the pack that stores the raw object
 bytes. For objects which do not match our expectations, store them
 raw instead of in v4 format. They will not get the benefit of v4
 optimizations, but if they are the minority of objects, that will
 only end up with a slight slow-down.

As I said, I have not looked carefully at the v4 patches, so maybe they
handle this case already. But of the two solutions, I prefer (2). Doing
(1) can solve _this_ problem, but it complicates the format, and does
nothing for any future compatibility issues. Whereas (2) is easy to
implement, since it is basically just pack v2 (and implementations would
need a pack v2 reader anyway).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html