Re: [RFC] git gc "--prune=now" semantics considered harmful
On Fri, Jun 1, 2018 at 2:04 AM Jeff King wrote: > > We'd also accept relative times like "5.minutes.ago" (in fact, the > default is a relative 2.weeks.ago, though it's long enough that the > difference between "2 weeks" and "2 weeks plus 5 minutes" may not matter > much). So we probably ought to just normalize _everything_ without even > bothering to match "now". It's a noop for non-relative times, but that's > OK. Heh. That's what I tried to do at first. It's surprisingly hard. You can't normalize it as a date, because we have a few times that aren't expressible as dates because they are just the maximum value (ie "all"). And then I tried to just express it just as a standard numerical value, which we accept on input. But we *only* accept that if it's more than eight digits. And regardless, you need to special-case "now", since parse_expiry_date() special cases it. Or you'd need to just make another version of parse_expiry_date() entirely. End result: I only special-cased "now". > There are still possibilities for a race, even with the grace period. Agreed. But you *really* have to work at it. Linus
Re: [RFC] git gc "--prune=now" semantics considered harmful
On Sun, May 27, 2018 at 08:31:14AM +0900, Junio C Hamano wrote: > > So I actually would much prefer that foir git gc, "--prune=now" means > > > > (a) "now" > > > > (b) now at the _start_ of the "git gc" operation, not the time at > > the _end_ of the operation when we've already spent a minute or > > two doing repacking and are now doing the final pruning. > > > > anyway, with that explanation in mind, I'm appending a patch that is > > pretty small and does that. It's a bit hacky, but I think it still makes > > sense. > > > > Comments? > > Closing the possiblity of racing a running "gc" and new object > creation like the above generally makes sense, I would think, > whether the creation is due to 'pull/fetch', 'add', or even 'push'. I think Linus's suggestion is an obvious improvement. It does shorten the window for confusing things to happen, and I think it makes things much easier to reason about if all parts of the gc are using the same timestamp. Regarding the implementation: > > - if (prune_expire && parse_expiry_date(prune_expire, )) > > - die(_("failed to parse prune expiry value %s"), prune_expire); > > + if (prune_expire) { > > + if (!strcmp(prune_expire, "now")) > > + prune_expire = show_date(time(NULL), 0, > > DATE_MODE(ISO8601)); > > + if (parse_expiry_date(prune_expire, )) > > + die(_("failed to parse prune expiry value %s"), > > prune_expire); > > + } We'd also accept relative times like "5.minutes.ago" (in fact, the default is a relative 2.weeks.ago, though it's long enough that the difference between "2 weeks" and "2 weeks plus 5 minutes" may not matter much). So we probably ought to just normalize _everything_ without even bothering to match "now". It's a noop for non-relative times, but that's OK. > I however have to wonder if there are opposite "oops" end-user > operation we also need to worry about, i.e. we are doing a large-ish > fetch, and get bored and run a gc fron another terminal. Perhaps > *that* is a bit too stupid to worry about? Auto-gc deliberately > does not use 'now' because it wants to leave a grace period to avoid > exactly that kind of race. There are still possibilities for a race, even with the grace period. You can have an unreferenced 2-week-old object sitting on disk, and somebody can choose to reference it at the same time as we are pruning it. My freshness patches from a few years ago made things a bit better: - when we optimize out the write of an existing object, we now at least update its timestamp - we consider non-fresh objects reachable from fresh ones to also be fresh But fundamentally none of this is atomic. You can have an old tree, and while you're pruning somebody writes a new commit referencing it and sticks that in a ref. It's more common if your grace period is "now", but it can still happen with any grace period. -Peff
Re: [RFC] git gc "--prune=now" semantics considered harmful
On Sat, May 26, 2018 at 4:31 PM Junio C Hamanowrote: > *That* is something I don't do. After all, I am fully aware that I > have started end-of-day ritual by that time, so I won't even look at > a new patch (or a pull request for that matter). Sounds like you're more organized about the end-of-day ritual than I am. For me the gc is not quite so structured. > I however have to wonder if there are opposite "oops" end-user > operation we also need to worry about, i.e. we are doing a large-ish > fetch, and get bored and run a gc fron another terminal. Perhaps > *that* is a bit too stupid to worry about? Auto-gc deliberately > does not use 'now' because it wants to leave a grace period to avoid > exactly that kind of race. For me, a "pull" never takes that long. Sure, any manual merging and the writing of the commit message might take a while, but it's "foreground" activity for me, I'd not start a gc in the middle of it. So at least to me, doing "git fsck --full" and "git gc --prune=now" are somewhat special because they take a while and tend to be background things that I "start and forget" about (the same way I sometimes start and forget a kernel build). Which is why that current "git gc --prune=now" behavior seems a bit dangerous. Linus
Re: [RFC] git gc "--prune=now" semantics considered harmful
Linus Torvaldswrites: > Soes my use pattern of "git gc --prune=now" make sense? Maybe not. But > it's what I've gotten used to, and it's at least not entirely insane. FWIW, my end-of-day ritual is to do repack -a -d -f with a large window and a small depth, followed by prune, which boils down to about the same. So I'd hope that is not entirely insane. I however do not think I bother with an explicit --expire=now when running the prune, though. In any case, that makes two of us, and the suggested patch protects only one of the two ;-) > But at least once now, I've done that "git gc" at the end of the day, and > a new pull request comes in, so I do the "git pull" without even thinking > about the fact that "git gc" is still running. *That* is something I don't do. After all, I am fully aware that I have started end-of-day ritual by that time, so I won't even look at a new patch (or a pull request for that matter). > So I actually would much prefer that foir git gc, "--prune=now" means > > (a) "now" > > (b) now at the _start_ of the "git gc" operation, not the time at > the _end_ of the operation when we've already spent a minute or > two doing repacking and are now doing the final pruning. > > anyway, with that explanation in mind, I'm appending a patch that is > pretty small and does that. It's a bit hacky, but I think it still makes > sense. > > Comments? Closing the possiblity of racing a running "gc" and new object creation like the above generally makes sense, I would think, whether the creation is due to 'pull/fetch', 'add', or even 'push'. I however have to wonder if there are opposite "oops" end-user operation we also need to worry about, i.e. we are doing a large-ish fetch, and get bored and run a gc fron another terminal. Perhaps *that* is a bit too stupid to worry about? Auto-gc deliberately does not use 'now' because it wants to leave a grace period to avoid exactly that kind of race. > diff --git a/builtin/gc.c b/builtin/gc.c > index c4777b244..98368c8b5 100644 > --- a/builtin/gc.c > +++ b/builtin/gc.c > @@ -535,8 +535,12 @@ int cmd_gc(int argc, const char **argv, const char > *prefix) > if (argc > 0) > usage_with_options(builtin_gc_usage, builtin_gc_options); > > - if (prune_expire && parse_expiry_date(prune_expire, )) > - die(_("failed to parse prune expiry value %s"), prune_expire); > + if (prune_expire) { > + if (!strcmp(prune_expire, "now")) > + prune_expire = show_date(time(NULL), 0, > DATE_MODE(ISO8601)); > + if (parse_expiry_date(prune_expire, )) > + die(_("failed to parse prune expiry value %s"), > prune_expire); > + } > > if (aggressive) { > argv_array_push(, "-f");
[RFC] git gc "--prune=now" semantics considered harmful
So this is a RFC patch, I'm not sure how much people really care, but I find the current behavior of "git gc --prune=now" to be unnecessarily dangerous. There's two issues with it: (a) parse_expiry_date() considers "now" to be special, and it actually doesn't mean "now" at all, it means "everything". (b) the date parsing isn't actually done "now", it's done *after* gc has already run, and we run "git prune --expire". So even if (a) wasn't true, "--prune=now" wouldn't actually mean "now" when the user expects it to happen, but "after doing repacking". I actually think that the "parse_expiry_date()" behavior makes sense within the context of "git prune --expire", so I'm not really complaining about (a) per se. I just think that what makes sense within the context of "git prune" does *not* necessarily make sense within the context of "git gc". Why do I care? I end up doing lots of random things in my local repository, and I also end up wanting to keep my repository fairly clean, so I tend to do "git gc --prune=now" to just make sure everything is packed and I've gotten rid of all the temporary stuff that so often happens when doing lots of git merges (which is what I do). You won't see those temporary objects for the usual trivial merges, but whenever you have a real recursive merge with automated conflict resolution, there will be things like those temporary merge-only objects for the 3-way base merge state. Soes my use pattern of "git gc --prune=now" make sense? Maybe not. But it's what I've gotten used to, and it's at least not entirely insane. But at least once now, I've done that "git gc" at the end of the day, and a new pull request comes in, so I do the "git pull" without even thinking about the fact that "git gc" is still running. And then the "--prune=now" behavior is actually really pretty dangerous. Because it will prune *all* unreachable objects, even if they are only *currently* unreachable because they are in the process of being unpacked by the concurrent "git fetch" (and I didn't check - I might just have been unlocky, bit I think "git prune" ignores FETCH_HEAD). So I actually would much prefer that foir git gc, "--prune=now" means (a) "now" (b) now at the _start_ of the "git gc" operation, not the time at the _end_ of the operation when we've already spent a minute or two doing repacking and are now doing the final pruning. anyway, with that explanation in mind, I'm appending a patch that is pretty small and does that. It's a bit hacky, but I think it still makes sense. Comments? Note that this really isn't likely very noticeable on most projects. When I do "git gc" on a fairly well-packed repo of git itself, it takes under 4s for me. So the window for that whole "do git pull at the same time" is simply not much of an issue. For the kernel, "git gc" takes a minute and a half on the same machine (again, this is already a packed repo, it can be worse). So there's a much bigger window there to do something stupid, Linus --- builtin/gc.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/builtin/gc.c b/builtin/gc.c index c4777b244..98368c8b5 100644 --- a/builtin/gc.c +++ b/builtin/gc.c @@ -535,8 +535,12 @@ int cmd_gc(int argc, const char **argv, const char *prefix) if (argc > 0) usage_with_options(builtin_gc_usage, builtin_gc_options); - if (prune_expire && parse_expiry_date(prune_expire, )) - die(_("failed to parse prune expiry value %s"), prune_expire); + if (prune_expire) { + if (!strcmp(prune_expire, "now")) + prune_expire = show_date(time(NULL), 0, DATE_MODE(ISO8601)); + if (parse_expiry_date(prune_expire, )) + die(_("failed to parse prune expiry value %s"), prune_expire); + } if (aggressive) { argv_array_push(, "-f");