Re: [PATCH 1/1] delete multiple tags in a single transaction
On Thu, Aug 08, 2019 at 04:43:16PM -0700, Phil Hord wrote: > > I also get really slow times on a repo with ~20,000 tags (though order > > ~3 minutes rather than ~30, probably due to having an SSD on this > > machine) -- but ONLY IF the refs are packed first (git pack-refs > > --all). If the refs are loose, it's relatively quick to delete a > > dozen thousand or so tags (order of a few seconds). It might be worth > > mentioning in the commit message that this only makes a significant > > difference in the case where the refs are packed. > > I'm also using an SSD but I still see about 10 tags per second being > deleted with the current code (and packed-refs). I see that I'm > CPU-bound, so I guess most of the time is spent searching through > .git/packed-refs. Probably it will run faster as it progresses. I > guess the 18,000 branches in my repo keep me on the wrong end of O(N). Right, deleting individually from packed-refs is inherently quadratic, because each deletion has to rewrite the entire file. So if you delete all (or the majority of them), that's O(n^2) individual entry writes. The loose case is just touching the filesystem for each entry (and the refs code is smart enough not to bother rewriting packed-refs if the entry isn't present there). That _can_ be slow if you have a lot of entries in the same directory (because some filesystems are particularly bad at this). So the actual backing storage speed isn't really that important. All the time goes to copying the same packed-refs entries over and over, whether they hit the disk or not. Your solution (using a single transaction) is definitely the right one (and probably should apply to "branch -d", too). That's what we did long ago for update-ref, and I think nobody ever really noticed for the porcelain commands because they don't tend to be used for such bulk changes. > But it should have occurred to me while I was in the code that there > is a different path for unpacked refs which could explain my previous > speeds. I didn't think I had any unpacked refs, though, since every > time I look in .git/refs for what I want, I find it relatively empty. > I see 'git pack-refs --help' says that new refs should show up loose, > but I can't say that has happened for me. Maybe a new clone uses > packed-refs for *everything* and only newly fetched things are loose. > Is that it? I guess since I seldom fetch tags after the first clone, > it makes sense they would all be packed. Right, a fresh clone always writes all of its entries as packed refs. It used to be done by hand, but it happens in a special "initial transaction" method these days, since 58f233ce1e (initial_ref_transaction_commit(): function for initial ref creation, 2015-06-22). > > In constrast, it appears that `git update-ref --stdin` is fast > > regardless of whether the refs are packed, e.g. > >git tag -l feature/* | sed -e 's%^%delete refs/tags/%' | git > > update-ref --stdin > > finishes quickly (order of a few seconds). > > Nice! That trick is going in my wiki for devs to use on their VMs. > Thanks for that. Please do encourage people to use for-each-ref instead of the "tag -l" porcelain, as the latter is subject to change. My usual bulk deletion command is: git for-each-ref --format='delete %(refname)' refs/tags/feature/ | git update-ref --stdin -Peff
Re: [PATCH 1/1] delete multiple tags in a single transaction
On Thu, Aug 8, 2019 at 12:39 PM Junio C Hamano wrote: > > Phil Hord writes: > > > From: Phil Hord > > > > 'git tag -d' accepts one or more tag refs to delete, but each deletion > > is done by calling `delete_ref` on each argv. This is painfully slow > > when removing from packed refs. Use delete_refs instead so all the > > removals can be done inside a single transaction with a single write. > > > > I have a repo with 24,000 tags, most of which are not useful to any > > developers. Having this many refs slows down many operations that > > would otherwise be very fast. Removing these tags when they've been > > accidentally fetched again takes about 30 minutes using delete_ref. > > > > git tag -l feature/* | xargs git tag -d > > > > Removing the same tags using delete_refs takes less than 5 seconds. > > Makes sense. As mentioned elsewhere in the thread already, > a batched update-ref would open the packed-refs ony once because > everything is done in a single transaction, so presumably a pipeline > like this > > git tag -l feature/* | > sed -e 's|^|delete refs/tags/|' | > git update-ref --stdin > > may work well, and "git tag -d" that gets these refs on the command > line should be capable of doing the same. > > > -static int delete_tag(const char *name, const char *ref, > > - const struct object_id *oid, const void *cb_data) > > +struct tag_args { > > + char *oid_abbrev; > > + char *refname; > > +}; > > + > > +static int make_string_list(const char *name, const char *ref, > > + const struct object_id *oid, void *cb_data) > > Please think about a few more minutes before naming a function like > this, and make it a habit for your future patches. > > We can see that the callback is used to insert more strings into a > string list, but the type (i.e. string_list) used to represent the > set is not all that important. What is more important is why you > are building that set for, and saying what is in the set (as opposed > to saying that the container happens to be a string_list) would be a > good first step. > > I presume that you are enumerating the tags to be deleted, together > with the data necessary for you to report the deletion of the tags? Hm. collect_tags? collect_tags_to_delete? It's true I didn't put enought thought into that. I was experimenting a bit here and was surprised how little code I ended up needing. > > { > > - if (delete_ref(NULL, ref, oid, 0)) > > - return 1; > > - printf(_("Deleted tag '%s' (was %s)\n"), name, > > -find_unique_abbrev(oid, DEFAULT_ABBREV)); > > + struct string_list *ref_list = cb_data; > > + struct tag_args *info = xmalloc(sizeof(struct tag_args)); > > + > > + string_list_append(ref_list, ref); > > + > > + info->oid_abbrev = xstrdup(find_unique_abbrev(oid, DEFAULT_ABBREV)); > > + info->refname = xstrdup(name); > > + ref_list->items[ref_list->nr - 1].util = info; > > return 0; > > } > > > > +static int delete_tags(const char **argv) > > +{ > > + int result; > > + struct string_list ref_list = STRING_LIST_INIT_DUP; > > + struct string_list_item *ref_list_item; > > + > > + result = for_each_tag_name(argv, make_string_list, (void *) > > &ref_list); > > + if (!result) > > + result = delete_refs(NULL, &ref_list, REF_NO_DEREF); > > + > > + for_each_string_list_item(ref_list_item, &ref_list) { > > + struct tag_args * info = ref_list_item->util; > > + if (!result) > > + printf(_("Deleted tag '%s' (was %s)\n"), > > info->refname, > > + info->oid_abbrev); > > + free(info->oid_abbrev); > > + free(info->refname); > > + free(info); > > It is not performance critical, but info->refname is computable from > ref_list_item->string, isn't it? Oh, I guess it is. It's a fixed offset into the string, after all. Thanks. I did look for a way to avoid the struct noise. Just not well. > I am just wondering if we can do > this without having to allocate the .util field for each of 20,000 > tags. We still need to remember oid (or oid_abbrev, but if I were > writing this, I'd record the full oid in .util and make the code > that prints call find_unique_abbrev() on it), so I guess we cannot > really leave .util NULL. My original patch did this (.util = oid). But then I needed a name. I'll go back to keeping the oid. Much cleaner. > > > + } > > + string_list_clear(&ref_list, 0); > > + return result; > > We used to return the returned value from for_each_tag_name() that > repeatedly called delete_tag(). > > Now we return value from delete_refs(). Are our caller(s) OK with > the values that may come back from that function? Can delete_refs() > return a value that is not appropriate to be returned from > cmd_tag(), for example a negative value? Yes it does. Will fix. > > > +} > > + > >
Re: [PATCH 1/1] delete multiple tags in a single transaction
On Thu, Aug 8, 2019 at 11:15 AM Elijah Newren wrote: > > On Wed, Aug 7, 2019 at 9:11 PM Phil Hord wrote: > > > > From: Phil Hord > > > > 'git tag -d' accepts one or more tag refs to delete, but each deletion > > is done by calling `delete_ref` on each argv. This is painfully slow > > when removing from packed refs. Use delete_refs instead so all the > > removals can be done inside a single transaction with a single write. > > Nice, thanks for working on this. > > > I have a repo with 24,000 tags, most of which are not useful to any > > developers. Having this many refs slows down many operations that > > would otherwise be very fast. Removing these tags when they've been > > accidentally fetched again takes about 30 minutes using delete_ref. > > I also get really slow times on a repo with ~20,000 tags (though order > ~3 minutes rather than ~30, probably due to having an SSD on this > machine) -- but ONLY IF the refs are packed first (git pack-refs > --all). If the refs are loose, it's relatively quick to delete a > dozen thousand or so tags (order of a few seconds). It might be worth > mentioning in the commit message that this only makes a significant > difference in the case where the refs are packed. I'm also using an SSD but I still see about 10 tags per second being deleted with the current code (and packed-refs). I see that I'm CPU-bound, so I guess most of the time is spent searching through .git/packed-refs. Probably it will run faster as it progresses. I guess the 18,000 branches in my repo keep me on the wrong end of O(N). My VM is on an all-flash storage array, but I can't say much about its write throughput since it's one VM among many. Previously I thought I saw a significant speedup between v2.7.4 (on my development vm) and v2.22.0 (on my laptop). But this week I saw it was slow again on my laptop. I looked for the regression but didn't find anyone touching that code. Then I wrote this patch. But it should have occurred to me while I was in the code that there is a different path for unpacked refs which could explain my previous speeds. I didn't think I had any unpacked refs, though, since every time I look in .git/refs for what I want, I find it relatively empty. I see 'git pack-refs --help' says that new refs should show up loose, but I can't say that has happened for me. Maybe a new clone uses packed-refs for *everything* and only newly fetched things are loose. Is that it? I guess since I seldom fetch tags after the first clone, it makes sense they would all be packed. > > git tag -l feature/* | xargs git tag -d > > > > Removing the same tags using delete_refs takes less than 5 seconds. > > It appears this same bug also affects `git branch -d` when deleting > lots of branches (or remote tracking branches) and they are all > packed; could you apply the same fix there? Will do. > In constrast, it appears that `git update-ref --stdin` is fast > regardless of whether the refs are packed, e.g. >git tag -l feature/* | sed -e 's%^%delete refs/tags/%' | git > update-ref --stdin > finishes quickly (order of a few seconds). Nice! That trick is going in my wiki for devs to use on their VMs. Thanks for that.
Re: [PATCH 1/1] delete multiple tags in a single transaction
Phil Hord writes: > From: Phil Hord > > 'git tag -d' accepts one or more tag refs to delete, but each deletion > is done by calling `delete_ref` on each argv. This is painfully slow > when removing from packed refs. Use delete_refs instead so all the > removals can be done inside a single transaction with a single write. > > I have a repo with 24,000 tags, most of which are not useful to any > developers. Having this many refs slows down many operations that > would otherwise be very fast. Removing these tags when they've been > accidentally fetched again takes about 30 minutes using delete_ref. > > git tag -l feature/* | xargs git tag -d > > Removing the same tags using delete_refs takes less than 5 seconds. Makes sense. As mentioned elsewhere in the thread already, a batched update-ref would open the packed-refs ony once because everything is done in a single transaction, so presumably a pipeline like this git tag -l feature/* | sed -e 's|^|delete refs/tags/|' | git update-ref --stdin may work well, and "git tag -d" that gets these refs on the command line should be capable of doing the same. > -static int delete_tag(const char *name, const char *ref, > - const struct object_id *oid, const void *cb_data) > +struct tag_args { > + char *oid_abbrev; > + char *refname; > +}; > + > +static int make_string_list(const char *name, const char *ref, > + const struct object_id *oid, void *cb_data) Please think about a few more minutes before naming a function like this, and make it a habit for your future patches. We can see that the callback is used to insert more strings into a string list, but the type (i.e. string_list) used to represent the set is not all that important. What is more important is why you are building that set for, and saying what is in the set (as opposed to saying that the container happens to be a string_list) would be a good first step. I presume that you are enumerating the tags to be deleted, together with the data necessary for you to report the deletion of the tags? > { > - if (delete_ref(NULL, ref, oid, 0)) > - return 1; > - printf(_("Deleted tag '%s' (was %s)\n"), name, > -find_unique_abbrev(oid, DEFAULT_ABBREV)); > + struct string_list *ref_list = cb_data; > + struct tag_args *info = xmalloc(sizeof(struct tag_args)); > + > + string_list_append(ref_list, ref); > + > + info->oid_abbrev = xstrdup(find_unique_abbrev(oid, DEFAULT_ABBREV)); > + info->refname = xstrdup(name); > + ref_list->items[ref_list->nr - 1].util = info; > return 0; > } > > +static int delete_tags(const char **argv) > +{ > + int result; > + struct string_list ref_list = STRING_LIST_INIT_DUP; > + struct string_list_item *ref_list_item; > + > + result = for_each_tag_name(argv, make_string_list, (void *) &ref_list); > + if (!result) > + result = delete_refs(NULL, &ref_list, REF_NO_DEREF); > + > + for_each_string_list_item(ref_list_item, &ref_list) { > + struct tag_args * info = ref_list_item->util; > + if (!result) > + printf(_("Deleted tag '%s' (was %s)\n"), info->refname, > + info->oid_abbrev); > + free(info->oid_abbrev); > + free(info->refname); > + free(info); It is not performance critical, but info->refname is computable from ref_list_item->string, isn't it? I am just wondering if we can do this without having to allocate the .util field for each of 20,000 tags. We still need to remember oid (or oid_abbrev, but if I were writing this, I'd record the full oid in .util and make the code that prints call find_unique_abbrev() on it), so I guess we cannot really leave .util NULL. > + } > + string_list_clear(&ref_list, 0); > + return result; We used to return the returned value from for_each_tag_name() that repeatedly called delete_tag(). Now we return value from delete_refs(). Are our caller(s) OK with the values that may come back from that function? Can delete_refs() return a value that is not appropriate to be returned from cmd_tag(), for example a negative value? > +} > + > static int verify_tag(const char *name, const char *ref, > - const struct object_id *oid, const void *cb_data) > + const struct object_id *oid, void *cb_data) > { > int flags; > const struct ref_format *format = cb_data; > @@ -511,7 +543,7 @@ int cmd_tag(int argc, const char **argv, const char > *prefix) > if (filter.merge_commit) > die(_("--merged and --no-merged options are only allowed in > list mode")); > if (cmdmode == 'd') > - return for_each_tag_name(argv, delete_tag, NULL); > + return delete_tags(argv); Thanks.
Re: [PATCH 1/1] delete multiple tags in a single transaction
On Wed, Aug 7, 2019 at 9:11 PM Phil Hord wrote: > > From: Phil Hord > > 'git tag -d' accepts one or more tag refs to delete, but each deletion > is done by calling `delete_ref` on each argv. This is painfully slow > when removing from packed refs. Use delete_refs instead so all the > removals can be done inside a single transaction with a single write. Nice, thanks for working on this. > I have a repo with 24,000 tags, most of which are not useful to any > developers. Having this many refs slows down many operations that > would otherwise be very fast. Removing these tags when they've been > accidentally fetched again takes about 30 minutes using delete_ref. I also get really slow times on a repo with ~20,000 tags (though order ~3 minutes rather than ~30, probably due to having an SSD on this machine) -- but ONLY IF the refs are packed first (git pack-refs --all). If the refs are loose, it's relatively quick to delete a dozen thousand or so tags (order of a few seconds). It might be worth mentioning in the commit message that this only makes a significant difference in the case where the refs are packed. > git tag -l feature/* | xargs git tag -d > > Removing the same tags using delete_refs takes less than 5 seconds. It appears this same bug also affects `git branch -d` when deleting lots of branches (or remote tracking branches) and they are all packed; could you apply the same fix there? In constrast, it appears that `git update-ref --stdin` is fast regardless of whether the refs are packed, e.g. git tag -l feature/* | sed -e 's%^%delete refs/tags/%' | git update-ref --stdin finishes quickly (order of a few seconds).
Re: [PATCH 1/1] delete multiple tags in a single transaction
On Thu, 8 Aug 2019 at 06:09, Phil Hord wrote: > I have a repo with 24,000 tags, most of which are not useful to any > developers. Having this many refs slows down many operations that > would otherwise be very fast. Removing these tags when they've been > accidentally fetched again takes about 30 minutes using delete_ref. > > git tag -l feature/* | xargs git tag -d > > Removing the same tags using delete_refs takes less than 5 seconds. This looks worthwhile pursuing... > -static int delete_tag(const char *name, const char *ref, > - const struct object_id *oid, const void *cb_data) > +struct tag_args { > + char *oid_abbrev; > + char *refname; > +}; > + > +static int make_string_list(const char *name, const char *ref, > + const struct object_id *oid, void *cb_data) > { > - if (delete_ref(NULL, ref, oid, 0)) > - return 1; This provides `oid` for verifying that the tag actually points at that particular oid before deleting. As far as I can tell, `oid` is no longer used like that in the post-image. I'm not sure it matters, since we just looked it up, but that might be worth mentioning, perhaps. > - printf(_("Deleted tag '%s' (was %s)\n"), name, > - find_unique_abbrev(oid, DEFAULT_ABBREV)); > + struct string_list *ref_list = cb_data; > + struct tag_args *info = xmalloc(sizeof(struct tag_args)); > + > + string_list_append(ref_list, ref); > + > + info->oid_abbrev = xstrdup(find_unique_abbrev(oid, DEFAULT_ABBREV)); > + info->refname = xstrdup(name); > + ref_list->items[ref_list->nr - 1].util = info; > return 0; > } > > +static int delete_tags(const char **argv) > +{ > + int result; > + struct string_list ref_list = STRING_LIST_INIT_DUP; > + struct string_list_item *ref_list_item; > + > + result = for_each_tag_name(argv, make_string_list, (void *) > &ref_list); If any tag is non-existing (or some other error happens here), we don't continue to the actual deleting. That breaks t7004 which has a test for removing an existing and a non-existing tag -- it wants the existing one to be removed and the non-existing one not to interfere. > + if (!result) > + result = delete_refs(NULL, &ref_list, REF_NO_DEREF); So this should perhaps be something more like an unconditional result |= delete_refs(...); That makes the test suite happy, but perhaps only short-term ... See below... > + for_each_string_list_item(ref_list_item, &ref_list) { > + struct tag_args * info = ref_list_item->util; > + if (!result) > + printf(_("Deleted tag '%s' (was %s)\n"), > info->refname, > + info->oid_abbrev); Change this conditional here, too, methinks. You'd need to separate errors from looking up tags from errors about deleting refs, so having a single "result" is probably not sufficient. Probably worth inspecting the output of that `git tag -d` a bit in t7004, to make sure we just claim to delete one tag, and have errors. Your patch reshuffles the error and success messages (for certain usages). I think that's ok, but might be worth mentioning. I'm not too familiar with the refs API, so take this with a grain of salt... > + free(info->oid_abbrev); > + free(info->refname); > + free(info); > + } > + string_list_clear(&ref_list, 0); > + return result; > +} Martin
[PATCH 1/1] delete multiple tags in a single transaction
From: Phil Hord 'git tag -d' accepts one or more tag refs to delete, but each deletion is done by calling `delete_ref` on each argv. This is painfully slow when removing from packed refs. Use delete_refs instead so all the removals can be done inside a single transaction with a single write. I have a repo with 24,000 tags, most of which are not useful to any developers. Having this many refs slows down many operations that would otherwise be very fast. Removing these tags when they've been accidentally fetched again takes about 30 minutes using delete_ref. git tag -l feature/* | xargs git tag -d Removing the same tags using delete_refs takes less than 5 seconds. Signed-off-by: Phil Hord --- builtin/tag.c | 52 +-- 1 file changed, 42 insertions(+), 10 deletions(-) diff --git a/builtin/tag.c b/builtin/tag.c index e0a4c25382..f652af83e7 100644 --- a/builtin/tag.c +++ b/builtin/tag.c @@ -72,10 +72,10 @@ static int list_tags(struct ref_filter *filter, struct ref_sorting *sorting, } typedef int (*each_tag_name_fn)(const char *name, const char *ref, - const struct object_id *oid, const void *cb_data); + const struct object_id *oid, void *cb_data); static int for_each_tag_name(const char **argv, each_tag_name_fn fn, -const void *cb_data) +void *cb_data) { const char **p; struct strbuf ref = STRBUF_INIT; @@ -97,18 +97,50 @@ static int for_each_tag_name(const char **argv, each_tag_name_fn fn, return had_error; } -static int delete_tag(const char *name, const char *ref, - const struct object_id *oid, const void *cb_data) +struct tag_args { + char *oid_abbrev; + char *refname; +}; + +static int make_string_list(const char *name, const char *ref, + const struct object_id *oid, void *cb_data) { - if (delete_ref(NULL, ref, oid, 0)) - return 1; - printf(_("Deleted tag '%s' (was %s)\n"), name, - find_unique_abbrev(oid, DEFAULT_ABBREV)); + struct string_list *ref_list = cb_data; + struct tag_args *info = xmalloc(sizeof(struct tag_args)); + + string_list_append(ref_list, ref); + + info->oid_abbrev = xstrdup(find_unique_abbrev(oid, DEFAULT_ABBREV)); + info->refname = xstrdup(name); + ref_list->items[ref_list->nr - 1].util = info; return 0; } +static int delete_tags(const char **argv) +{ + int result; + struct string_list ref_list = STRING_LIST_INIT_DUP; + struct string_list_item *ref_list_item; + + result = for_each_tag_name(argv, make_string_list, (void *) &ref_list); + if (!result) + result = delete_refs(NULL, &ref_list, REF_NO_DEREF); + + for_each_string_list_item(ref_list_item, &ref_list) { + struct tag_args * info = ref_list_item->util; + if (!result) + printf(_("Deleted tag '%s' (was %s)\n"), info->refname, + info->oid_abbrev); + free(info->oid_abbrev); + free(info->refname); + free(info); + } + string_list_clear(&ref_list, 0); + return result; +} + static int verify_tag(const char *name, const char *ref, - const struct object_id *oid, const void *cb_data) + const struct object_id *oid, void *cb_data) { int flags; const struct ref_format *format = cb_data; @@ -511,7 +543,7 @@ int cmd_tag(int argc, const char **argv, const char *prefix) if (filter.merge_commit) die(_("--merged and --no-merged options are only allowed in list mode")); if (cmdmode == 'd') - return for_each_tag_name(argv, delete_tag, NULL); + return delete_tags(argv); if (cmdmode == 'v') { if (format.format && verify_ref_format(&format)) usage_with_options(git_tag_usage, options); -- 2.23.0.rc1.174.g4cc1b04b4c.dirty