Re: missing objects -- prevention
On Sun, Jan 13, 2013 at 06:26:53AM +0530, Sitaram Chamarty wrote:

> > Right, I meant if you have receive.fsckObjects on. It won't help this
> > situation at all, as we already do a connectivity check separate from
> > the fsck. But I do recommend it in general, just because it helps catch
> > bad objects before they get disseminated to a wider audience (at which
> > point it is often infeasible to rewind history). And it has found git
> > bugs (e.g., null sha1s in tree entries).
>
> I will add this. Any idea if there's a significant performance hit?

Not usually; we are already resolving all of the sent deltas as a
precaution, anyway. I do notice after a push to GitHub there is
sometimes a second or two of pause from the server before the push
status is shown. But I haven't narrowed it down to fsck (versus
connectivity check, versus our post-receive hook). So you may want to
keep an eye on the effects (and if you have numbers, please share :) ).

> That's always the hard part. System admins (at the Unix level) insist
> there's nothing wrong and no disk errors and so on... that is why I
> was interested in network errors causing problems and so on.

Yeah, I feel bad saying "well, this repo is totally corrupted, but it
couldn't possibly be git's fault, because that's not what its failure
modes look like". But luckily our Ops people are very understanding, and
most of the problems I have seen have turned out to be fs corruption
after all (the pack-refs thing is the big exception).

> Thanks once again for your patient replies!

No problem. There aren't many people dealing with large-scale
server-side issues, so it's something that doesn't come up much on the
list. I'm happy to talk about it.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
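To make the distinction in the message above concrete, here is a toy Python model of why a per-object fsck cannot notice a missing object while a connectivity check can. It is only a sketch, not git's code: the dictionary-based object store and the names `fsck_object` and `is_connected` are assumptions made for illustration.

```python
NULL_SHA1 = "0" * 40


def fsck_object(obj):
    """Per-object check: looks only at the contents of the object itself."""
    if obj["type"] == "tree":
        # A null sha1 in a tree entry is malformed data that this check can flag.
        return all(sha != NULL_SHA1 for sha in obj["entries"].values())
    return True


def is_connected(store, tip):
    """Walk everything reachable from tip; fail if any object is absent."""
    todo, seen = [tip], set()
    while todo:
        sha = todo.pop()
        if sha in seen:
            continue
        seen.add(sha)
        obj = store.get(sha)
        if obj is None:
            return False  # a referenced object is missing from the store
        if obj["type"] == "commit":
            todo.extend(obj["parents"])
            todo.append(obj["tree"])
        elif obj["type"] == "tree":
            todo.extend(obj["entries"].values())
    return True


# Commit Y builds on commit X, but X is absent from the store.
store = {
    "Y": {"type": "commit", "parents": ["X"], "tree": "T"},
    "T": {"type": "tree", "entries": {"README": "B"}},
    "B": {"type": "blob"},
}
every_object_passes_fsck = all(fsck_object(o) for o in store.values())
history_is_connected = is_connected(store, "Y")

# A tree entry carrying a null sha1 is caught by the per-object check.
bad_tree = {"type": "tree", "entries": {"file": NULL_SHA1}}
bad_tree_passes_fsck = fsck_object(bad_tree)
```

In this model every stored object is individually well-formed, so the per-object check passes, and only the reachability walk notices that X is gone.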
Re: missing objects -- prevention
On Sat, Jan 12, 2013 at 6:43 PM, Jeff King wrote:
> On Sat, Jan 12, 2013 at 06:39:52AM +0530, Sitaram Chamarty wrote:
>
>> >   1. The repo has a ref R pointing at commit X.
>> >
>> >   2. A user starts a push to another ref, Q, of commit Y that builds on
>> >      X. Git advertises ref R, so the sender knows they do not need to
>> >      send X, but only Y. The user then proceeds to send the packfile
>> >      (which might take a very long time).
>> >
>> >   3. Meanwhile, another user deletes ref R. X becomes unreferenced.
>>
>> The gitolite logs show that no deletion of refs has happened.
>
> To be pedantic, step 3 could also be rewinding R to a commit before X.
> Anything that causes X to become unreferenced.

Right, but there were no rewinds either; I should have mentioned that.
(Gitolite log files mark rewinds and deletes specially, so they're easy
to search. There were two attempted rewinds but they failed the gitolite
update hook so -- while the new objects would have landed in the object
store -- the old ones were not dereferenced).

>> > There is a race with simultaneously deleting and packing refs. It
>> > doesn't cause object db corruption, but it will cause refs to "rewind"
>> > back to their packed versions. I have seen that one in practice (though
>> > relatively rare). I fixed it in b3f1280, which is not yet in any
>> > released version.
>>
>> This is for the packed-refs file right? And it could result in a ref
>> getting deleted right?
>
> Yes, if the ref was not previously packed, it could result in the ref
> being deleted entirely.
>
>> I said above that the gitolite logs say no ref was deleted. What if
>> the ref "deletion" happened because of this race, making the rest of
>> your 4-step scenario above possible?
>
> It's possible. I do want to highlight how unlikely it is, though.

Agreed.

>> > up in the middle, or fsck rejects the pack). We have historically left
>>
>> fsck... you mean if I had 'receive.fsckObjects' true, right? I don't.
>> Should I? Would it help this overall situation? As I understand it,
>> that's only about the internals of each object to check corruption, and
>> cannot detect a *missing* object on the local object store.
>
> Right, I meant if you have receive.fsckObjects on. It won't help this
> situation at all, as we already do a connectivity check separate from
> the fsck. But I do recommend it in general, just because it helps catch
> bad objects before they get disseminated to a wider audience (at which
> point it is often infeasible to rewind history). And it has found git
> bugs (e.g., null sha1s in tree entries).

I will add this. Any idea if there's a significant performance hit?

>> > At GitHub, we've taken to just cleaning them up aggressively (I think
>> > after an hour), though I am tempted to put in an optional signal/atexit
>>
>> OK; I'll do the same then. I suppose a cron job is the best way; I
>> didn't find any config for expiring these files.
>
> If you run "git prune --expire=1.hour.ago", it should prune stale
> tmp_pack_* files more than an hour old. But you may not be comfortable
> with such a short expiration for the objects themselves. :)
>
>> Thanks again for your help. I'm going to treat it (for now) as a
>> disk/fs error after hearing from you about the other possibility I
>> mentioned above, although I find it hard to believe one repo can be
>> hit by *two* races occurring together!
>
> Yeah, the race seems pretty unlikely (though it could be just the one
> race with a rewind). As I said, I haven't actually ever seen it in
> practice. In my experience, though, disk/fs issues do not manifest as
> just missing objects, but as corrupted packfiles (e.g., the packfile
> directory entry ends up pointing to the wrong inode, which is easy to
> see because the inode's content is actually a reflog). And then of
> course with the packfile unreadable, you have missing objects. But YMMV,
> depending on the fs and what's happened to the machine to cause the fs
> problem.

That's always the hard part. System admins (at the Unix level) insist
there's nothing wrong and no disk errors and so on... that is why I
was interested in network errors causing problems and so on.

Anyway, now that I know the tmp_pack_* files are caused more by failed
pushes than by failed auto-gc, at least I can deal with the immediate
problem easily!

Thanks once again for your patient replies!

sitaram

--
Sitaram
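For anyone who would rather look before pruning, the following Python sketch lists tmp_pack_* files older than a cutoff without deleting anything. The function name `stale_tmp_packs` is hypothetical, and scanning both `objects/` and `objects/pack/` is an assumption about where these temporary files end up; verify it against your own repositories before relying on it.

```python
import time
from pathlib import Path


def stale_tmp_packs(git_dir, max_age_seconds=3600, now=None):
    """List tmp_pack_* files older than max_age_seconds; nothing is deleted."""
    now = time.time() if now is None else now
    objects = Path(git_dir) / "objects"
    stale = []
    # Assumed locations for leftover temporary pack files.
    for directory in (objects, objects / "pack"):
        if not directory.is_dir():
            continue
        for path in directory.glob("tmp_pack_*"):
            # Age is judged by last-modification time.
            if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
                stale.append(path)
    return sorted(stale)
```

Calling `stale_tmp_packs("/path/to/repo.git")` from a cron-driven script would report the candidates, leaving the removal itself to "git prune --expire=..." as suggested in the thread.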
Re: missing objects -- prevention
On Sat, Jan 12, 2013 at 06:39:52AM +0530, Sitaram Chamarty wrote:

> >   1. The repo has a ref R pointing at commit X.
> >
> >   2. A user starts a push to another ref, Q, of commit Y that builds on
> >      X. Git advertises ref R, so the sender knows they do not need to
> >      send X, but only Y. The user then proceeds to send the packfile
> >      (which might take a very long time).
> >
> >   3. Meanwhile, another user deletes ref R. X becomes unreferenced.
>
> The gitolite logs show that no deletion of refs has happened.

To be pedantic, step 3 could also be rewinding R to a commit before X.
Anything that causes X to become unreferenced.

> > There is a race with simultaneously deleting and packing refs. It
> > doesn't cause object db corruption, but it will cause refs to "rewind"
> > back to their packed versions. I have seen that one in practice (though
> > relatively rare). I fixed it in b3f1280, which is not yet in any
> > released version.
>
> This is for the packed-refs file right? And it could result in a ref
> getting deleted right?

Yes, if the ref was not previously packed, it could result in the ref
being deleted entirely.

> I said above that the gitolite logs say no ref was deleted. What if
> the ref "deletion" happened because of this race, making the rest of
> your 4-step scenario above possible?

It's possible. I do want to highlight how unlikely it is, though.

> > up in the middle, or fsck rejects the pack). We have historically left
>
> fsck... you mean if I had 'receive.fsckObjects' true, right? I don't.
> Should I? Would it help this overall situation? As I understand it,
> that's only about the internals of each object to check corruption, and
> cannot detect a *missing* object on the local object store.

Right, I meant if you have receive.fsckObjects on. It won't help this
situation at all, as we already do a connectivity check separate from
the fsck. But I do recommend it in general, just because it helps catch
bad objects before they get disseminated to a wider audience (at which
point it is often infeasible to rewind history). And it has found git
bugs (e.g., null sha1s in tree entries).

> > At GitHub, we've taken to just cleaning them up aggressively (I think
> > after an hour), though I am tempted to put in an optional signal/atexit
>
> OK; I'll do the same then. I suppose a cron job is the best way; I
> didn't find any config for expiring these files.

If you run "git prune --expire=1.hour.ago", it should prune stale
tmp_pack_* files more than an hour old. But you may not be comfortable
with such a short expiration for the objects themselves. :)

> Thanks again for your help. I'm going to treat it (for now) as a
> disk/fs error after hearing from you about the other possibility I
> mentioned above, although I find it hard to believe one repo can be
> hit by *two* races occurring together!

Yeah, the race seems pretty unlikely (though it could be just the one
race with a rewind). As I said, I haven't actually ever seen it in
practice. In my experience, though, disk/fs issues do not manifest as
just missing objects, but as corrupted packfiles (e.g., the packfile
directory entry ends up pointing to the wrong inode, which is easy to
see because the inode's content is actually a reflog). And then of
course with the packfile unreadable, you have missing objects. But YMMV,
depending on the fs and what's happened to the machine to cause the fs
problem.

-Peff
Re: missing objects -- prevention
Thanks for the very detailed answer.

On Fri, Jan 11, 2013 at 10:12 PM, Jeff King wrote:
> On Fri, Jan 11, 2013 at 04:40:38PM +0530, Sitaram Chamarty wrote:
>
>> I find a lot of info on how to recover from and/or repair a repo that
>> has missing (or corrupted) objects.
>>
>> What I need is info on common reasons (other than disk errors -- we've
>> checked for those) for such errors to occur, any preventive measures
>> we can take, and so on.
>
> I don't think any race can cause corruption of the object or packfiles
> because of the way they are written. At GitHub, every case of file-level
> corruption we've seen has been a filesystem issue.
>
> So I think the main systemic/race issue to worry about is missing
> objects. And since git only deletes objects during a prune (assuming you
> are using git-gc or "repack -A" so that repack cannot drop objects), I
> think prune is the only thing to watch out for.

No one runs anything manually under normal conditions. If there's any
gc happening, it's gc --auto.

> The --expire time saves us from the obvious races where you write object
> X but have not yet referenced it, and a simultaneous prune wants to
> delete it. However, it's possible that you have an old object that is
> unreferenced, but would become referenced as a result of an in-progress
> operation. For example, commit X is unreferenced and ready to be pruned,
> you build commit Y on top of it, but before you write the ref, git-prune
> removes X.
>
> The server-side version of that would happen via receive-pack, and is
> even more unlikely, because X would have to be referenced initially for
> us to advertise it. So it's something like:
>
>   1. The repo has a ref R pointing at commit X.
>
>   2. A user starts a push to another ref, Q, of commit Y that builds on
>      X. Git advertises ref R, so the sender knows they do not need to
>      send X, but only Y. The user then proceeds to send the packfile
>      (which might take a very long time).
>
>   3. Meanwhile, another user deletes ref R. X becomes unreferenced.

The gitolite logs show that no deletion of refs has happened.

>   4. After step 3 but before step 2 has finished, somebody runs prune
>      (this might sound unlikely, but if you kick off a "gc" job after
>      each push, or after N pushes, it's not so unlikely). It sees that
>      X is unreferenced, and it may very well be older than the --expire
>      setting. Prune deletes X.
>
>   5. The packfile in (2) arrives, and receive-pack attempts to update
>      the refs.
>
> So it's even a bit more unlikely than the local case, because
> receive-pack would not otherwise build on dangling objects. You have
> to race steps (2) and (3) just to create the situation.
>
> Then we have an extra protection in the form of
> check_everything_connected, which receive-pack runs before writing the
> refs into place. So if step 4 happens while the packfile is being sent
> (which is the most likely case, since it is the longest stretch of
> receive-pack's time), we would still catch it there and reject the push
> (annoying to the user, but the repo remains consistent).
>
> However, that's not foolproof. We might hit step 4 after we've checked
> that everything is connected but right before we write the ref. In which
> case we drop X, which has just become referenced, and we have a missing
> object.
>
> So I think it's possible. But I have never actually seen it in practice,
> and come up with this scenario only by brainstorming "what could go
> wrong" scenarios.
>
> This could be mitigated if there was a "proposed refs" storage.
> Receive-pack would write a note saying "consider Y for pruning purposes,
> but it's not really referenced yet", check connectivity for Y against
> the current refs, and then eventually write Y to its real ref (or reject
> it if there are problems). Prune would either run before the "proposed"
> note is written, which would mean it deletes X, but the connectivity
> check fails. Or it would run after, in which case it would leave X
> alone.
>
>> For example, can *any* type of network error or race condition cause
>> this? (Say, can one push write an object, then fail an update
>> check, and a later push succeed and race against a gc that removes
>> the unreachable object?) Or... the repo is pretty large -- about 6-7
>> GB, so could size cause a race that would not show up on a smaller
>> repo?
>
> The above is the only open issue I know about. I don't think it is
> dependent on repo size, but the window is widened for a really large
> push, because rev-list takes longer to run. It does not widen if you
> have receive.fsckobjects set, because that happens before we do the
> connectivity check (and the connectivity check is run in a sub-process,
> so the race timer starts when we exec rev-list, which may open and mmap
> packfiles or otherwise cache the presence of X in memory).
>
>> Anything else I can watch out for or caution the team about?
>
> That's the only ope
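The five-step scenario quoted in this message, together with the connectivity check and the "proposed refs" idea, can be replayed as a small deterministic Python model. This is a sketch under loud assumptions, not git's implementation: objects are dictionary keys mapping to their parents, the --expire cutoff is reduced to a fixed set of "old" objects, and the names `reachable`, `prune`, `connected`, and `push_y` are hypothetical.

```python
def reachable(store, tips):
    """Return every present object reachable from the given tips."""
    seen, todo = set(), list(tips)
    while todo:
        sha = todo.pop()
        if sha in store and sha not in seen:
            seen.add(sha)
            todo.extend(store[sha])
    return seen


def prune(store, old, tips):
    """Delete objects that are both old and unreachable from tips."""
    keep = reachable(store, tips)
    for sha in list(store):
        if sha in old and sha not in keep:
            del store[sha]


def connected(store, tip):
    """Stand-in for check_everything_connected: is tip's history all present?"""
    seen, todo = set(), [tip]
    while todo:
        sha = todo.pop()
        if sha not in store:
            return False
        if sha not in seen:
            seen.add(sha)
            todo.extend(store[sha])
    return True


def push_y(prune_point, proposed_refs=False):
    """Replay steps 1-5 with prune at 'during_transfer' or 'after_check'."""
    store = {"X": []}          # object -> list of parents
    old = {"X"}                # objects older than the --expire cutoff
    refs = {"R": "X"}          # step 1: R points at X
    # Step 2: R is advertised, so the sender transmits only Y.
    del refs["R"]              # step 3: X becomes unreferenced
    if prune_point == "during_transfer":
        prune(store, old, refs.values())            # step 4, early
    store["Y"] = ["X"]         # step 5: the pack holding Y arrives
    # Hypothetical "proposed refs" note, written before the check.
    proposed = ["Y"] if proposed_refs else []
    if not connected(store, "Y"):
        return "rejected", store
    if prune_point == "after_check":
        prune(store, old, list(refs.values()) + proposed)  # step 4, late
    refs["Q"] = "Y"
    return "accepted", store
```

In this model, `push_y("during_transfer")` is rejected by the connectivity check, `push_y("after_check")` is accepted but leaves Y pointing at a missing X, and `push_y("after_check", proposed_refs=True)` keeps X because prune treats the proposed tip as a root.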
Re: missing objects -- prevention
On Fri, Jan 11, 2013 at 04:40:38PM +0530, Sitaram Chamarty wrote:

> I find a lot of info on how to recover from and/or repair a repo that
> has missing (or corrupted) objects.
>
> What I need is info on common reasons (other than disk errors -- we've
> checked for those) for such errors to occur, any preventive measures
> we can take, and so on.

I don't think any race can cause corruption of the object or packfiles
because of the way they are written. At GitHub, every case of file-level
corruption we've seen has been a filesystem issue.

So I think the main systemic/race issue to worry about is missing
objects. And since git only deletes objects during a prune (assuming you
are using git-gc or "repack -A" so that repack cannot drop objects), I
think prune is the only thing to watch out for.

The --expire time saves us from the obvious races where you write object
X but have not yet referenced it, and a simultaneous prune wants to
delete it. However, it's possible that you have an old object that is
unreferenced, but would become referenced as a result of an in-progress
operation. For example, commit X is unreferenced and ready to be pruned,
you build commit Y on top of it, but before you write the ref, git-prune
removes X.

The server-side version of that would happen via receive-pack, and is
even more unlikely, because X would have to be referenced initially for
us to advertise it. So it's something like:

  1. The repo has a ref R pointing at commit X.

  2. A user starts a push to another ref, Q, of commit Y that builds on
     X. Git advertises ref R, so the sender knows they do not need to
     send X, but only Y. The user then proceeds to send the packfile
     (which might take a very long time).

  3. Meanwhile, another user deletes ref R. X becomes unreferenced.

  4. After step 3 but before step 2 has finished, somebody runs prune
     (this might sound unlikely, but if you kick off a "gc" job after
     each push, or after N pushes, it's not so unlikely). It sees that
     X is unreferenced, and it may very well be older than the --expire
     setting. Prune deletes X.

  5. The packfile in (2) arrives, and receive-pack attempts to update
     the refs.

So it's even a bit more unlikely than the local case, because
receive-pack would not otherwise build on dangling objects. You have
to race steps (2) and (3) just to create the situation.

Then we have an extra protection in the form of
check_everything_connected, which receive-pack runs before writing the
refs into place. So if step 4 happens while the packfile is being sent
(which is the most likely case, since it is the longest stretch of
receive-pack's time), we would still catch it there and reject the push
(annoying to the user, but the repo remains consistent).

However, that's not foolproof. We might hit step 4 after we've checked
that everything is connected but right before we write the ref. In which
case we drop X, which has just become referenced, and we have a missing
object.

So I think it's possible. But I have never actually seen it in practice,
and come up with this scenario only by brainstorming "what could go
wrong" scenarios.

This could be mitigated if there was a "proposed refs" storage.
Receive-pack would write a note saying "consider Y for pruning purposes,
but it's not really referenced yet", check connectivity for Y against
the current refs, and then eventually write Y to its real ref (or reject
it if there are problems). Prune would either run before the "proposed"
note is written, which would mean it deletes X, but the connectivity
check fails. Or it would run after, in which case it would leave X
alone.

> For example, can *any* type of network error or race condition cause
> this? (Say, can one push write an object, then fail an update
> check, and a later push succeed and race against a gc that removes
> the unreachable object?) Or... the repo is pretty large -- about 6-7
> GB, so could size cause a race that would not show up on a smaller
> repo?

The above is the only open issue I know about. I don't think it is
dependent on repo size, but the window is widened for a really large
push, because rev-list takes longer to run. It does not widen if you
have receive.fsckobjects set, because that happens before we do the
connectivity check (and the connectivity check is run in a sub-process,
so the race timer starts when we exec rev-list, which may open and mmap
packfiles or otherwise cache the presence of X in memory).

> Anything else I can watch out for or caution the team about?

That's the only open issue I know about for missing objects.

There is a race with simultaneously deleting and packing refs. It
doesn't cause object db corruption, but it will cause refs to "rewind"
back to their packed versions. I have seen that one in practice (though
relatively rare). I fixed it in b3f1280, which is not yet in any
released version.

> The symptom is usually a disk space crunch caused by tmp_pack_* files
>