Re: Zero padded file modes...
On Thu, Sep 05, 2013 at 01:13:40PM -0400, John Szakmeister wrote: > > Yep. These were mostly caused by a bug in Grit that is long-fixed. But > > the objects remain in many histories. It would have painful to rewrite > > them back then, and it would be even more painful now. > > I guess there's still the other side of the question though. Are > these repositories busted in the sense that something no longer works? No, as far as I know, everything still works fine. However, some diffs may be suboptimal, because we may have two different sha1s for the same subtree (so we may descend into the tree unnecessarily only to find that they are equivalent). And by the same token, any scripts doing non-recursive diffs may erroneously mark the trees as differing, even though they do not contain any differing files. But neither is a big problem in practice. If you had two clients in active use which were flip-flopping a sub-tree back and forth between representations, it would be a problem. But we are talking about a few isolated incidents far back in history. > I doesn't appear to be the case, but I've not used it extensively say > I can't say for certain one way or another. In the sense that the > content is not strictly compliant, transfer.fsckObjects did its job, > but I wonder if fsck needs to be a little more tolerant now (at least > with respect to transfer objects)? Fsck actually treats this as a warning, not an error. It is transfer.fsckObjects (via "index-pack --strict") that actually treats warnings as errors. It's possible that this should be loosened to allow through problems marked as FSCK_WARN (with a message, kind of like...a warning). Though it may also make sense to revisit some of the classifications in fsck (e.g., many of the warnings are indicative of seriously broken objects). GitHub uses transfer.fsckObjects, rejecting all warnings[1]. In practice it is not usually a big deal, as people are happy to fix up their objects _before_ they get widely published. The biggest push-back we get is when somebody tries to re-push history they got from another GitHub repo, and then says "But why are you complaining? You served this crappy broken history?" And it's a fair point. If you are forking (but not joining the existing fork network) of an existing project with irregularities in the history, it's not really an option to simply rewrite the history you are basing on. -Peff [1] Actually, we do let through 0-padded modes with a warning, explicitly because of the problem mentioned above. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Zero padded file modes...
On Thu, Sep 05, 2013 at 01:09:34PM -0400, Nicolas Pitre wrote: > On Thu, 5 Sep 2013, Jeff King wrote: > > > There are basically two solutions: > > > > 1. Add a single-bit flag for "I am 0-padded in the real data". We > > could probably even squeeze it into the same integer. > > > > 2. Have a "classic" section of the pack that stores the raw object > > bytes. For objects which do not match our expectations, store them > > raw instead of in v4 format. They will not get the benefit of v4 > > optimizations, but if they are the minority of objects, that will > > only end up with a slight slow-down. > > That is basically what I just suggested. But instead of a special > section, simply using a special object type number would do it. Yeah, I think we are in agreement. I only suggested a separate section because I hadn't carefully read the v4 patches yet, and didn't know if there was room in the normal sequence. A special object number seems much more elegant. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Zero padded file modes...
On Thu, 5 Sep 2013, Jeff King wrote: > On Thu, Sep 05, 2013 at 11:18:24PM +0700, Nguyen Thai Ngoc Duy wrote: > > > > There are basically two solutions: > > > > > > 1. Add a single-bit flag for "I am 0-padded in the real data". We > > > could probably even squeeze it into the same integer. > > > > > > 2. Have a "classic" section of the pack that stores the raw object > > > bytes. For objects which do not match our expectations, store them > > > raw instead of in v4 format. They will not get the benefit of v4 > > > optimizations, but if they are the minority of objects, that will > > > only end up with a slight slow-down. > > > > 3. Detect this situation and fall back to v2. > > > > 4. Update v4 to allow storing raw tree entries mixing with v4-encoded > > tree entries. This is something between (1) and (2) > > I wouldn't want to do (3). At some point pack v4 may become the standard > format, but there will be some repositories which will never be allowed > to adopt it. > > For (4), yes, that could work. But like (1), it only solves problems in > tree entries. What happens if we have a quirky commit object that needs > the same treatment (e.g., a timezone that does not fit into the commit > name dictionary properly)? > > > I think (4) fits better in v4 design and probably not hard to do. Nico > > recently added a code to embed a tree entry inline, but the mode must > > be encoded (and can't contain leading zeros). We could have another > > code to store mode in ascii. This also makes me wonder if we might > > have similar problems with timezones, which are also specially encoded > > in v4.. > > Yeah, that might be more elegant. > > > (3) is probably easiest. We need to scan through all tree entries > > first when creating v4 anyway. If we detect any anomalies, just switch > > back to v2 generation. The user will be force to rewrite history in > > order to take full advantage of v4 (they can have a pack of weird > > trees in v2 and the rest in v4 pack, but that's not optimal). > > Splitting across two packs isn't great, though. What if v4 eventually > becomes the normal on-the-wire format? I'd rather have some method for > just embedding what are essentially v2 objects into the v4 pack, which > would give us future room for handling these sorts of things. > > But like I said, I haven't looked closely yet, so maybe there are > complications with that. In the meantime, I'll defer to the judgement of > people who know what they are talking about. :) None of the above is particularly appealing to me. Pack v4 has to enforce some standardization in the object encoding to be efficient. Some compromizes have been applied to accommodate the fixing of a thin pack, although I was initially tempted to simply dodge the issue and allow thin packs in a repository. On this particular mode issue, I remember making a fuss at the time when this was discovered because the github implementation did generate such tree objects at the time. So instead of compromizing the pack v4 object encoding further, I'd simply suggest adding a special object type which is in fact simply the pack v2 representation i.e. the canonical object version, deflated. Right now pack v4 encodes only 5 object types: commit, tree, blob, delta and tag. Only the commit and tree objects have their representation transcoded. So that means we only need to add native_commit and native_tree object types. Then, anything that doesn't fit the strict expectation for transcoding a tree or a commit object is simply included as is without transcoding just like in pack v2. Nicolas -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Zero padded file modes...
On Thu, 5 Sep 2013, Jeff King wrote: > There are basically two solutions: > > 1. Add a single-bit flag for "I am 0-padded in the real data". We > could probably even squeeze it into the same integer. > > 2. Have a "classic" section of the pack that stores the raw object > bytes. For objects which do not match our expectations, store them > raw instead of in v4 format. They will not get the benefit of v4 > optimizations, but if they are the minority of objects, that will > only end up with a slight slow-down. That is basically what I just suggested. But instead of a special section, simply using a special object type number would do it. I'm even wondering if that couldn't be used for fixing a thin pack instead of the special provision I just added last night. Nicolas -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Zero padded file modes...
On Thu, Sep 5, 2013 at 11:36 AM, Jeff King wrote: > On Thu, Sep 05, 2013 at 10:00:39AM -0400, John Szakmeister wrote: > >> I went to clone a repository from GitHub today and discovered >> something interesting: >> >> :: git clone https://github.com/liebke/incanter.git >> Cloning into 'incanter'... >> remote: Counting objects: 10457, done. >> remote: Compressing objects: 100% (3018/3018), done. >> error: object 4946e1ba09ba5655202a7a5d81ae106b08411061:contains >> zero-padded file modes >> fatal: Error in object >> fatal: index-pack failed > > Yep. These were mostly caused by a bug in Grit that is long-fixed. But > the objects remain in many histories. It would have painful to rewrite > them back then, and it would be even more painful now. I guess there's still the other side of the question though. Are these repositories busted in the sense that something no longer works? I doesn't appear to be the case, but I've not used it extensively say I can't say for certain one way or another. In the sense that the content is not strictly compliant, transfer.fsckObjects did its job, but I wonder if fsck needs to be a little more tolerant now (at least with respect to transfer objects)? I can certainly cope with the issue--it's not a problem for me to flip the flag on the command line. I think it'd be nice to have transer.fsckObjects be the default at some point, considering how little people run fsck otherwise and how long these sorts of issues go undiscovered. Issues like the above seem to stand in the way of that happening though. -John -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Zero padded file modes...
On Thu, Sep 05, 2013 at 11:18:24PM +0700, Nguyen Thai Ngoc Duy wrote: > > There are basically two solutions: > > > > 1. Add a single-bit flag for "I am 0-padded in the real data". We > > could probably even squeeze it into the same integer. > > > > 2. Have a "classic" section of the pack that stores the raw object > > bytes. For objects which do not match our expectations, store them > > raw instead of in v4 format. They will not get the benefit of v4 > > optimizations, but if they are the minority of objects, that will > > only end up with a slight slow-down. > > 3. Detect this situation and fall back to v2. > > 4. Update v4 to allow storing raw tree entries mixing with v4-encoded > tree entries. This is something between (1) and (2) I wouldn't want to do (3). At some point pack v4 may become the standard format, but there will be some repositories which will never be allowed to adopt it. For (4), yes, that could work. But like (1), it only solves problems in tree entries. What happens if we have a quirky commit object that needs the same treatment (e.g., a timezone that does not fit into the commit name dictionary properly)? > I think (4) fits better in v4 design and probably not hard to do. Nico > recently added a code to embed a tree entry inline, but the mode must > be encoded (and can't contain leading zeros). We could have another > code to store mode in ascii. This also makes me wonder if we might > have similar problems with timezones, which are also specially encoded > in v4.. Yeah, that might be more elegant. > (3) is probably easiest. We need to scan through all tree entries > first when creating v4 anyway. If we detect any anomalies, just switch > back to v2 generation. The user will be force to rewrite history in > order to take full advantage of v4 (they can have a pack of weird > trees in v2 and the rest in v4 pack, but that's not optimal). Splitting across two packs isn't great, though. What if v4 eventually becomes the normal on-the-wire format? I'd rather have some method for just embedding what are essentially v2 objects into the v4 pack, which would give us future room for handling these sorts of things. But like I said, I haven't looked closely yet, so maybe there are complications with that. In the meantime, I'll defer to the judgement of people who know what they are talking about. :) -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Zero padded file modes...
On 09/05/2013 11:36 AM, Jeff King wrote: [...] I haven't looked carefully at the pack v4 patches yet, but I suspect that yes, it's still a problem. The premise of pack v4 is that we can do better by not storing the raw git object bytes, but rather storing specialized representations of the various components. For example, by using an integer to store the mode rather than the ascii representation. But that representation does not represent the "oops, I have a 0-padded mode" quirk. And we have to be able to recover the original object, byte for byte, from the v4 representation (to verify sha1, or to generate a loose object or v2 pack). There are basically two solutions: 1. Add a single-bit flag for "I am 0-padded in the real data". We could probably even squeeze it into the same integer. 2. Have a "classic" section of the pack that stores the raw object bytes. For objects which do not match our expectations, store them raw instead of in v4 format. They will not get the benefit of v4 optimizations, but if they are the minority of objects, that will only end up with a slight slow-down. As I said, I have not looked carefully at the v4 patches, so maybe they handle this case already. But of the two solutions, I prefer (2). Doing (1) can solve _this_ problem, but it complicates the format, and does nothing for any future compatibility issues. Whereas (2) is easy to implement, since it is basically just pack v2 (and implementations would need a pack v2 reader anyway). 3. Keep those objects in v2 packs instead of the v4 pack. Transfers would have to be v3 or multi-pack transfers would need to be supported. 4. Don't use v4 packs with projects that have "crufty" objects. Projects with such objects may choose to pay the "cost" to "upgrade" to v4 compatibility. There's nothing that requires the next pack format to support all of the broken stuff that's happened over the years. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Zero padded file modes...
On Thu, Sep 5, 2013 at 10:36 PM, Jeff King wrote: >> > This is going to screw up pack v4 (yes, someday I'll have the >> > time to make it real). >> >> I don't know if this is still true, but given that patches are >> being sent out about it, I thought it relevant. > > I haven't looked carefully at the pack v4 patches yet, but I suspect > that yes, it's still a problem. The premise of pack v4 is that we can do > better by not storing the raw git object bytes, but rather storing > specialized representations of the various components. For example, by > using an integer to store the mode rather than the ascii representation. > But that representation does not represent the "oops, I have a 0-padded > mode" quirk. And we have to be able to recover the original object, byte > for byte, from the v4 representation (to verify sha1, or to generate a > loose object or v2 pack). > > There are basically two solutions: > > 1. Add a single-bit flag for "I am 0-padded in the real data". We > could probably even squeeze it into the same integer. > > 2. Have a "classic" section of the pack that stores the raw object > bytes. For objects which do not match our expectations, store them > raw instead of in v4 format. They will not get the benefit of v4 > optimizations, but if they are the minority of objects, that will > only end up with a slight slow-down. 3. Detect this situation and fall back to v2. 4. Update v4 to allow storing raw tree entries mixing with v4-encoded tree entries. This is something between (1) and (2) > As I said, I have not looked carefully at the v4 patches, so maybe they > handle this case already. But of the two solutions, I prefer (2). Doing > (1) can solve _this_ problem, but it complicates the format, and does > nothing for any future compatibility issues. Whereas (2) is easy to > implement, since it is basically just pack v2 (and implementations would > need a pack v2 reader anyway). I think (4) fits better in v4 design and probably not hard to do. Nico recently added a code to embed a tree entry inline, but the mode must be encoded (and can't contain leading zeros). We could have another code to store mode in ascii. This also makes me wonder if we might have similar problems with timezones, which are also specially encoded in v4.. (3) is probably easiest. We need to scan through all tree entries first when creating v4 anyway. If we detect any anomalies, just switch back to v2 generation. The user will be force to rewrite history in order to take full advantage of v4 (they can have a pack of weird trees in v2 and the rest in v4 pack, but that's not optimal). -- Duy -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Zero padded file modes...
On Thu, Sep 05, 2013 at 10:00:39AM -0400, John Szakmeister wrote: > I went to clone a repository from GitHub today and discovered > something interesting: > > :: git clone https://github.com/liebke/incanter.git > Cloning into 'incanter'... > remote: Counting objects: 10457, done. > remote: Compressing objects: 100% (3018/3018), done. > error: object 4946e1ba09ba5655202a7a5d81ae106b08411061:contains > zero-padded file modes > fatal: Error in object > fatal: index-pack failed Yep. These were mostly caused by a bug in Grit that is long-fixed. But the objects remain in many histories. It would have painful to rewrite them back then, and it would be even more painful now. > > This is going to screw up pack v4 (yes, someday I'll have the > > time to make it real). > > I don't know if this is still true, but given that patches are > being sent out about it, I thought it relevant. I haven't looked carefully at the pack v4 patches yet, but I suspect that yes, it's still a problem. The premise of pack v4 is that we can do better by not storing the raw git object bytes, but rather storing specialized representations of the various components. For example, by using an integer to store the mode rather than the ascii representation. But that representation does not represent the "oops, I have a 0-padded mode" quirk. And we have to be able to recover the original object, byte for byte, from the v4 representation (to verify sha1, or to generate a loose object or v2 pack). There are basically two solutions: 1. Add a single-bit flag for "I am 0-padded in the real data". We could probably even squeeze it into the same integer. 2. Have a "classic" section of the pack that stores the raw object bytes. For objects which do not match our expectations, store them raw instead of in v4 format. They will not get the benefit of v4 optimizations, but if they are the minority of objects, that will only end up with a slight slow-down. As I said, I have not looked carefully at the v4 patches, so maybe they handle this case already. But of the two solutions, I prefer (2). Doing (1) can solve _this_ problem, but it complicates the format, and does nothing for any future compatibility issues. Whereas (2) is easy to implement, since it is basically just pack v2 (and implementations would need a pack v2 reader anyway). -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html