Re: Poor performance of git describe in big repos
On 31 May 2013 10:57, Alex Bennée kernel-hac...@bennee.com wrote:
> On 31 May 2013 09:46, Thomas Rast tr...@inf.ethz.ch wrote:
>> So that deleted all unannotated tags pointing at commits, and then it was fast. Curious.
>>
>> However, if that turns out to be the culprit, it's not fixable currently[1]. Having commits with insanely long messages is just, well, insane.
>>
>> [1] unless we do a major rework of the loading infrastructure, so that we can teach it to load only the beginning of a commit as long as we are only interested in parents and such
>
> I'll do a bit of scripting to dig into the nature of these uber-commits and try and work out how they came about. I suspect they are simply start of branch states in our broken and disparate history. I'll get back to you once I've dug a little deeper.

So I wrote a little script [1] which I ran to remove all tags that did not exist on any branches:

    git-tag-cleaner.py -d no-branch

After a lot of churning:

    17:26 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager describe --long --tags
    ajb-build-test-5225-2-gdc0b771

    real 0m0.799s
    user 0m0.024s
    sys  0m0.052s

So at least I can fix up my repo. All the big ones look at least as though they were weird cvs2svn creations that exist to represent the detached state of a strange CVS tag from the converted repository.

However it does raise one question. Why is git attempting to parse a commit not on the DAG for the branch I'm attempting to describe?

Anyway, as I have a workaround, I'm going to do a slightly more conservative clean of the repo with my script and move on.

[1] https://github.com/stsquad/git-tag-cleaner

--
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Poor performance of git describe in big repos
Alex Bennée kernel-hac...@bennee.com writes:
> Why is git attempting to parse a commit not on the DAG for the branch I'm attempting to describe?

I think that is because you need to parse the objects at the tip of refs to see if they are on the DAG in the first place.

If there weren't any annotated tags, conceivably you could do without parsing these objects. You would:

 - First read the refs without parsing anything to learn the object name of the tips of refs;

 - Traverse the DAG, starting from the commit, and notice when you see commits that are at the tips of refs you learned in the first step, arranging to stop when you found the closest tip.

But with annotated tags (and git describe is designed to be primarily used with them; you would need the --tags option to make it notice unannotated tags), the object name you see sitting at the tip will never appear during the DAG traversal. You will only see commits from the latter, so you would need to parse the tips to learn what commits they refer to.

And of course, parsing only the annotated tags, without parsing commits, would not work, because you wouldn't know what the object is without looking at it ;-)
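The two steps described above can be sketched in a few lines of shell. This is only an illustration under stated assumptions, not how git describe is implemented: the helper name nearest_tag and its two input files are made up here, and the "distance" is simply the position in the list, which matches a linear history only.

```shell
# Hypothetical helper (not part of git): nearest_tag TAGMAP REVLIST
#   TAGMAP   lines of "<commit-name> <tag-name>", annotated tags already
#            peeled to the commit they point at
#   REVLIST  one commit name per line, newest first
# Prints "<tag-name> <distance>" for the first listed commit that carries
# a tag and returns non-zero when no listed commit is tagged.
nearest_tag() {
    awk '
        NR == FNR { tag[$1] = $2; next }    # first file: remember tagged commits
        $1 in tag {                         # second file: stop at the first hit
            print tag[$1], FNR - 1
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    ' "$1" "$2"
}
```

In a real repository the two inputs could come from something like `git show-ref --tags -d` (the entries ending in `^{}` are the peeled ones) and `git rev-list HEAD`. Producing the peeled entries is exactly the step that forces git to open the object at each tag tip.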
Re: Poor performance of git describe in big repos
On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
> On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
>> Alex Bennée kernel-hac...@bennee.com writes:
>>> On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
>>>> Alex Bennée kernel-hac...@bennee.com writes:
[snip]
>>> Will it be loading the blob for every commit it traverses or just ones that hit a tag? Why does it need to load the blob at all? Surely the commit tree state doesn't need to be walked down?
>>
>> No, my theory is that you tagged *the blobs*. Git supports this.

Wait is this the difference between annotated and non-annotated tags? I thought a non-annotated just acted like references to a particular tree state?

> You can see if that is the case by doing something like this:
>
>     eval $(
>         git for-each-ref --shell --format '
>             test $(git cat-file -t %(objectname)^{}) = commit ||
>             echo %(refname);'
>     )
>
> That will print out the name of any ref that doesn't point at a commit.

Hmm that didn't seem to work. But looking at the output by hand I certainly have a mix of tags that are commits vs tags:

    09:08 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags | grep commit | wc -l
    1345
    09:12 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags | grep -v commit | wc -l
    66

Unfortunately I can't just delete those tags as they do refer to known releases which we obviously care about. If I delete the tags on my local repo and test for a speed increase can I re-create them as annotated tag objects?

--
Alex, homepage: http://www.bennee.com/~alex/
Re: Poor performance of git describe in big repos
Alex Bennée kernel-hac...@bennee.com writes:
> On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
>> On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
>>> Alex Bennée kernel-hac...@bennee.com writes:
>>>> On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
>>>>> Alex Bennée kernel-hac...@bennee.com writes:
[snip]
>>>> Will it be loading the blob for every commit it traverses or just ones that hit a tag? Why does it need to load the blob at all? Surely the commit tree state doesn't need to be walked down?
>>>
>>> No, my theory is that you tagged *the blobs*. Git supports this.
>
> Wait is this the difference between annotated and non-annotated tags? I thought a non-annotated just acted like references to a particular tree state?

A tag is just a ref. It can point at anything, in particular also a blob (= some file *contents*).

An annotated tag is just a tag pointing at a tag object. A tag object contains tagger name/email/date, a reference to an object, and a tag message.

The slowness I found relates to having tags that point at blobs directly (unannotated).

>> You can see if that is the case by doing something like this:
>>
>>     eval $(
>>         git for-each-ref --shell --format '
>>             test $(git cat-file -t %(objectname)^{}) = commit ||
>>             echo %(refname);'
>>     )
>>
>> That will print out the name of any ref that doesn't point at a commit.
>
> Hmm that didn't seem to work. But looking at the output by hand I certainly have a mix of tags that are commits vs tags:
>
>     09:08 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags | grep commit | wc -l
>     1345
>     09:12 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags | grep -v commit | wc -l
>     66
>
> Unfortunately I can't just delete those tags as they do refer to known releases which we obviously care about. If I delete the tags on my local repo and test for a speed increase can I re-create them as annotated tag objects?

I would be more interested in this:

    git for-each-ref | grep ' blob'

and

    (git for-each-ref | grep ' blob' | cut -d' ' -f1 | xargs -n1 git cat-file blob) | wc -c

The first tells you if you have any refs pointing at blobs. The second computes their total unpacked size. My theory is that the second yields some large number (hundreds of megabytes at least).

It would be nice if you checked, because if there turn out to be big blobs, we have all the pieces and just need to assemble the best solution. Otherwise, there's something else going on and the problem remains open.

--
Thomas Rast
trast@{inf,student}.ethz.ch
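As a variation on the second command above, the per-type totals can be computed without piping every object through cat-file. The sketch below assumes its input comes from `git for-each-ref --format='%(objectsize) %(objecttype) %(refname)'`. The helper name size_by_type is made up for illustration. Like the command above, it counts an object once per ref that points at it.

```shell
# Hypothetical helper: reads "<size> <type> <refname>" lines on stdin and
# prints "<type> <number of refs> <total bytes>" for each object type.
size_by_type() {
    awk '
        { total[$2] += $1; count[$2]++ }
        END { for (t in total) printf "%s %d %d\n", t, count[t], total[t] }
    ' | sort
}
```

A large total next to "blob" would support the tagged-blob theory, while a large total next to "commit" would point at oversized commit objects instead.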
Re: Poor performance of git describe in big repos
On Fri, May 31, 2013 at 09:14:49AM +0100, Alex Bennée wrote:
> On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
>> On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
>>> Alex Bennée kernel-hac...@bennee.com writes:
>>>> On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
>>>>> Alex Bennée kernel-hac...@bennee.com writes:
[snip]
>>>> Will it be loading the blob for every commit it traverses or just ones that hit a tag? Why does it need to load the blob at all? Surely the commit tree state doesn't need to be walked down?
>>>
>>> No, my theory is that you tagged *the blobs*. Git supports this.
>
> Wait is this the difference between annotated and non-annotated tags? I thought a non-annotated just acted like references to a particular tree state?

No, this is something slightly different. In Git there are four types of object: tag, commit, tree and blob. When you have a heavyweight tag, the tag reference points at a tag object (which in turn points at another object). With a lightweight tag, the tag reference typically points at a commit object.

However, there is no restriction that says that a tag object must point to a commit or that a lightweight tag must point at a commit - it is equally possible to point directly at a tree or a blob (although a lot less common).

Thomas is suggesting that you might have a tag that does not point at a commit but instead points to a blob object.

>> You can see if that is the case by doing something like this:
>>
>>     eval $(
>>         git for-each-ref --shell --format '
>>             test $(git cat-file -t %(objectname)^{}) = commit ||
>>             echo %(refname);'
>>     )
>>
>> That will print out the name of any ref that doesn't point at a commit.
>
> Hmm that didn't seem to work.

You mean there was no output? In that case it's likely that all your references do indeed point at commits.

> But looking at the output by hand I certainly have a mix of tags that are commits vs tags:
>
>     09:08 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags | grep commit | wc -l
>     1345
>     09:12 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags | grep -v commit | wc -l
>     66

This means that you have 1345 lightweight tags and 66 heavyweight tags, assuming that all of the lines that don't say commit do say tag.

By the way, I don't remember if you said which version of Git you're using. If it's an older version then it's possible that something has changed.
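The count above rests on the assumption that every tag line says either commit or tag. A small sketch that checks this while counting, reading the default `git for-each-ref` output (`<name> <type><TAB><refname>`) on stdin; the helper name count_tag_kinds is made up for illustration:

```shell
# Hypothetical helper: classifies refs/tags/* entries from default
# `git for-each-ref` output by the type of object the ref points at.
count_tag_kinds() {
    awk '
        $3 ~ /^refs\/tags\// {
            if ($2 == "tag")          annotated++     # ref points at a tag object
            else if ($2 == "commit")  lightweight++   # ref points straight at a commit
            else                      other++         # tree or blob: the unusual case
        }
        END {
            printf "lightweight %d\nannotated %d\nother %d\n",
                   lightweight, annotated, other
        }
    '
}
```

Run as `git for-each-ref | count_tag_kinds`. A non-zero "other" line would mean some tag points directly at a tree or a blob.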
Re: Poor performance of git describe in big repos
On 31 May 2013 09:24, Thomas Rast tr...@inf.ethz.ch wrote:
> Alex Bennée kernel-hac...@bennee.com writes:
>> On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
>>> On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
>>>> Alex Bennée kernel-hac...@bennee.com writes:
>>>>> On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
[snip]
>>>> No, my theory is that you tagged *the blobs*. Git supports this.
>>
>> Wait is this the difference between annotated and non-annotated tags? I thought a non-annotated just acted like references to a particular tree state?
>
> A tag is just a ref. It can point at anything, in particular also a blob (= some file *contents*).
>
> An annotated tag is just a tag pointing at a tag object. A tag object contains tagger name/email/date, a reference to an object, and a tag message.
>
> The slowness I found relates to having tags that point at blobs directly (unannotated).

I think you are right. I was brave (well I assumed the tags would come back from the upstream repo) and ran:

    git for-each-ref | grep refs/tags | grep commit | cut -d '/' -f 3 | xargs git tag -d

And boom:

    09:19 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager describe --long --tags
    ajb-build-test-5225-2-gdc0b771

    real 0m0.009s
    user 0m0.008s
    sys  0m0.000s

Which is much better performance. So it does look like unannotated tags pointing at binary blobs is the failure case.

[snip]

> I would be more interested in this:
>
>     git for-each-ref | grep ' blob'

Hmmm that gives nothing. All the refs are either tag or commit.

> and
>
>     (git for-each-ref | grep ' blob' | cut -d' ' -f1 | xargs -n1 git cat-file blob) | wc -c

However I have some big commits it seems:

    09:37 ajb@sloy/x86_64 [work.git] (git for-each-ref | grep ' commit' | cut -d' ' -f1 | xargs -n1 git cat-file commit) | wc -c
    1147231984

> The first tells you if you have any refs pointing at blobs. The second computes their total unpacked size. My theory is that the second yields some large number (hundreds of megabytes at least).
>
> It would be nice if you checked, because if there turn out to be big blobs, we have all the pieces and just need to assemble the best solution. Otherwise, there's something else going on and the problem remains open.

If you want any other numbers I'm only too happy to help. Sorry I can't share the repo though...

--
Alex, homepage: http://www.bennee.com/~alex/
Re: Poor performance of git describe in big repos
Alex Bennée kernel-hac...@bennee.com writes:
> I think you are right. I was brave (well I assumed the tags would come back from the upstream repo) and ran:
>
>     git for-each-ref | grep refs/tags | grep commit | cut -d '/' -f 3 | xargs git tag -d

So that deleted all unannotated tags pointing at commits, and then it was fast. Curious.

> However I have some big commits it seems:
>
>     09:37 ajb@sloy/x86_64 [work.git] (git for-each-ref | grep ' commit' | cut -d' ' -f1 | xargs -n1 git cat-file commit) | wc -c
>     1147231984

How many unique entries are there in that list, i.e., what does

    git for-each-ref | grep ' commit' | cut -d' ' -f1 | sort -u | wc -l

say? Perhaps you can also find the biggest commit, e.g. like so:

    git for-each-ref | grep ' commit' | cut -d' ' -f1 |
    while read sha; do git cat-file commit $sha | wc -c; done | sort -n

However, if that turns out to be the culprit, it's not fixable currently[1]. Having commits with insanely long messages is just, well, insane.

[1] unless we do a major rework of the loading infrastructure, so that we can teach it to load only the beginning of a commit as long as we are only interested in parents and such

--
Thomas Rast
trast@{inf,student}.ethz.ch
Re: Poor performance of git describe in big repos
On 31 May 2013 09:32, John Keeping j...@keeping.me.uk wrote:
> On Fri, May 31, 2013 at 09:14:49AM +0100, Alex Bennée wrote:
>> On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
>>> On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
>>>> Alex Bennée kernel-hac...@bennee.com writes:
>>>>> On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
>>>>>> Alex Bennée kernel-hac...@bennee.com writes:
[snip]
>>>>> Will it be loading the blob for every commit it traverses or just ones that hit a tag? Why does it need to load the blob at all? Surely the commit tree state doesn't need to be walked down?
>>>>
>>>> No, my theory is that you tagged *the blobs*. Git supports this.
>>
>> Wait is this the difference between annotated and non-annotated tags? I thought a non-annotated just acted like references to a particular tree state?
>
> No, this is something slightly different. In Git there are four types of object: tag, commit, tree and blob. When you have a heavyweight tag, the tag reference points at a tag object (which in turn points at another object). With a lightweight tag, the tag reference typically points at a commit object.

I think this is the case in my repo.

> However, there is no restriction that says that a tag object must point to a commit or that a lightweight tag must point at a commit - it is equally possible to point directly at a tree or a blob (although a lot less common).
>
> Thomas is suggesting that you might have a tag that does not point at a commit but instead points to a blob object.

It's looking like I just have some very heavy commits. One data point I probably should have mentioned at the beginning is this was a converted CVS repo and I'm wondering if some of the artifacts the conversion introduced have contributed to this.

>>> You can see if that is the case by doing something like this:
>>>
>>>     eval $(
>>>         git for-each-ref --shell --format '
>>>             test $(git cat-file -t %(objectname)^{}) = commit ||
>>>             echo %(refname);'
>>>     )
>>>
>>> That will print out the name of any ref that doesn't point at a commit.
>>
>> Hmm that didn't seem to work.
>
> You mean there was no output? In that case it's likely that all your references do indeed point at commits.

Correct.

>> But looking at the output by hand I certainly have a mix of tags that are commits vs tags:
>>
>>     09:08 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags | grep commit | wc -l
>>     1345
>>     09:12 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags | grep -v commit | wc -l
>>     66
>
> This means that you have 1345 lightweight tags and 66 heavyweight tags, assuming that all of the lines that don't say commit do say tag.

Yep, all commits and tags, nothing else.

> By the way, I don't remember if you said which version of Git you're using. If it's an older version then it's possible that something has changed.

I'm running the GIT stable PPA:

    09:38 ajb@sloy/x86_64 [work.git] git --version
    git version 1.8.3

Although I have also tested with the latest git.git maint. I'm happy to try master if it's likely to have changed.

--
Alex, homepage: http://www.bennee.com/~alex/
Re: Poor performance of git describe in big repos
On Fri, May 31, 2013 at 09:49:57AM +0100, Alex Bennée wrote:
> On 31 May 2013 09:32, John Keeping j...@keeping.me.uk wrote:
>> Thomas is suggesting that you might have a tag that does not point at a commit but instead points to a blob object.
>
> It's looking like I just have some very heavy commits. One data point I probably should have mentioned at the beginning is this was a converted CVS repo and I'm wondering if some of the artifacts the conversion introduced have contributed to this.

You can try another for-each-ref invocation to see if that's the case:

    eval $(
        git for-each-ref --format '
            printf "%s %s\n" "$(git cat-file -s %(objectname))" %(refname);'
    ) | sort -n

That will print the size of each object followed by the ref that points to it, sorted by size.

> I'm running the GIT stable PPA:
>
>     09:38 ajb@sloy/x86_64 [work.git] git --version
>     git version 1.8.3
>
> Although I have also tested with the latest git.git maint. I'm happy to try master if it's likely to have changed.

master's still very close to 1.8.3 at the moment, so I don't think that will make a difference.
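A variant of the same idea that avoids eval is sketched below. It assumes the input is produced by `git for-each-ref --format='%(objectsize) %(objecttype) %(refname)'`, which reports the size of the object each ref points at directly. The helper name biggest_commit_refs is made up for illustration.

```shell
# Hypothetical helper: reads "<size> <type> <refname>" lines on stdin and
# prints the N largest commit objects as "<size> <refname>", biggest first.
# N defaults to 5.
biggest_commit_refs() {
    awk '$2 == "commit" { print $1, $3 }' | sort -rn | head -n "${1:-5}"
}
```

For example, `git for-each-ref --format='%(objectsize) %(objecttype) %(refname)' | biggest_commit_refs 10` would list the ten heaviest commits together with the refs that point at them.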
Re: Poor performance of git describe in big repos
On 31 May 2013 09:46, Thomas Rast tr...@inf.ethz.ch wrote:
> Alex Bennée kernel-hac...@bennee.com writes:
>> I think you are right. I was brave (well I assumed the tags would come back from the upstream repo) and ran:
>>
>>     git for-each-ref | grep refs/tags | grep commit | cut -d '/' -f 3 | xargs git tag -d
>
> So that deleted all unannotated tags pointing at commits, and then it was fast. Curious.
>
>> However I have some big commits it seems:
>>
>>     09:37 ajb@sloy/x86_64 [work.git] (git for-each-ref | grep ' commit' | cut -d' ' -f1 | xargs -n1 git cat-file commit) | wc -c
>>     1147231984
>
> How many unique entries are there in that list, i.e., what does
>
>     git for-each-ref | grep ' commit' | cut -d' ' -f1 | sort -u | wc -l

    09:49 ajb@sloy/x86_64 [work.git] git for-each-ref | grep ' commit' | cut -d' ' -f1 | sort -u | wc -l
    1508

> say? Perhaps you can also find the biggest commit, e.g. like so:
>
>     git for-each-ref | grep ' commit' | cut -d' ' -f1 |
>     while read sha; do git cat-file commit $sha | wc -c; done | sort -n

Yeah, there is a range from a few hundred bytes to a large number of 3M commits. I guess I need to identify which commits they are and remove the tags or convert them to annotated reference tags.

> However, if that turns out to be the culprit, it's not fixable currently[1]. Having commits with insanely long messages is just, well, insane.
>
> [1] unless we do a major rework of the loading infrastructure, so that we can teach it to load only the beginning of a commit as long as we are only interested in parents and such

I'll do a bit of scripting to dig into the nature of these uber-commits and try and work out how they came about. I suspect they are simply start of branch states in our broken and disparate history. I'll get back to you once I've dug a little deeper.

> --
> Thomas Rast
> trast@{inf,student}.ethz.ch

--
Alex, homepage: http://www.bennee.com/~alex/
Re: Poor performance of git describe in big repos
Thomas Rast tr...@inf.ethz.ch writes:
> However, if that turns out to be the culprit, it's not fixable currently[1]. Having commits with insanely long messages is just, well, insane.
>
> [1] unless we do a major rework of the loading infrastructure, so that we can teach it to load only the beginning of a commit as long as we are only interested in parents and such

Actually, Peff, doesn't your commit parent/tree pointer caching give us this for free?

--
Thomas Rast
trast@{inf,student}.ethz.ch
Re: Poor performance of git describe in big repos
On Fri, May 31, 2013 at 12:27:11PM +0200, Thomas Rast wrote:
> Thomas Rast tr...@inf.ethz.ch writes:
>> However, if that turns out to be the culprit, it's not fixable currently[1]. Having commits with insanely long messages is just, well, insane.
>>
>> [1] unless we do a major rework of the loading infrastructure, so that we can teach it to load only the beginning of a commit as long as we are only interested in parents and such
>
> Actually, Peff, doesn't your commit parent/tree pointer caching give us this for free?

It does. You can test it from the jk/metapacks branch at git://github.com/peff/git. After building, you'd need to do:

    $ git gc
    $ git metapack --all --commits

in the target repository. You can check that it's working because git rev-list --all --count should be an order of magnitude faster.

You may need to add save_commit_buffer = 0 in any commands you are checking, though, as the optimization can only kick in if parse_commit does not want to save the buffer as a side effect.

I also looked into trying to just read the beginning part of a commit[1], but it turned out not to be all that much of an improvement.

-Peff

[1] http://article.gmane.org/gmane.comp.version-control.git/212301
Poor performance of git describe in big repos
Hi,

I'm a fairly heavy user of the magit Emacs extension for interacting with my git repos. However I've noticed there are some cases where lag is very high. By analysing strace output of emacs calling git I found two commands that were particularly problematic when interrogating the repo:

    11:00 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager describe --long --tags
    ajb-build-test-5224-10-gfa296e6

    real 0m5.016s
    user 0m4.364s
    sys  0m0.444s

    11:34 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager describe --contains HEAD
    fatal: cannot describe 'fa296e61f549a1252a65a13b2f734d7afbc7e88e'

    real 0m4.805s
    user 0m4.388s
    sys  0m0.400s

Running the first command with the --debug flag on gives:

    11:34 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager describe --long --tags --debug
    searching to describe HEAD
     lightweight       10 ajb-build-test-5224
     lightweight       41 ajb-build-test-5222
     annotated        146 vnms-2-1-36-32
     annotated        155 vnms-2-1-36-31
     annotated        174 vnms-2-1-36-30
     annotated        183 vnms-2-1-36-29
     lightweight      188 vnms-2-1-36-28
     annotated        193 vnms-2-1-36-27
     annotated        206 vnms-2-1-36-26
     annotated        215 vectastar-4-2-83-5
    traversed 223 commits
    more than 10 tags found; listed 10 most recent
    gave up search at 2b69df72d47be8440e3ce4cee91b9b7ceaf8b77c
    ajb-build-test-5224-10-gfa296e6

    real 0m4.817s
    user 0m4.320s
    sys  0m0.464s

So it traversed only 223 commits before coming to a decision. This seems like a very low number of commits given the time it spent doing this.

One factor might be the size of my repo (.git is around 2.4G). Could this just be due to the computational cost of searching through large packs to walk the commit chain? Is there any way to make this easier for git to do?

--
Alex, homepage: http://www.bennee.com/~alex/
Re: Poor performance of git describe in big repos
Alex Bennée wrote:
>     time /usr/bin/git --no-pager
>     traversed 223 commits
>
>     real 0m4.817s
>     user 0m4.320s
>     sys  0m0.464s

I'm quite clueless about why it is taking this long: I think it's IO because there's nothing to compute? I really can't trace anything unless you can reproduce it on a public repository. On linux.git with my rotating hard disk:

    $ time git describe --debug --long --tags HEAD~1
    searching to describe HEAD~1
     annotated       5445 v2.6.33
     annotated       5660 v2.6.33-rc8
     annotated       5884 v2.6.33-rc7
     annotated       6140 v2.6.33-rc6
     annotated       6467 v2.6.33-rc5
     annotated       6999 v2.6.33-rc4
     annotated       7430 v2.6.33-rc3
     annotated       7746 v2.6.33-rc2
     annotated       8212 v2.6.33-rc1
     annotated      13854 v2.6.32
    traversed 18895 commits
    more than 10 tags found; listed 10 most recent
    gave up search at 648f4e3e50c4793d9dbf9a09afa193631f76fa26
    v2.6.33-5445-ge7c84ee

    real 0m0.509s
    user 0m0.470s
    sys  0m0.037s

18k+ commits traversed in half a second here, so I really don't know what is going on.
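To put the two timings in this message side by side, here is a small back-of-the-envelope calculation using only the figures quoted above. The helper name commits_per_second is made up for illustration.

```shell
# Commits traversed per second of wall-clock ("real") time.
commits_per_second() {
    # $1 = commits traversed, $2 = real time in seconds
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.0f\n", n / t }'
}

commits_per_second 223 4.817      # work.git run, prints 46
commits_per_second 18895 0.509    # linux.git run, prints 37122
```

That is roughly 46 commits per second against roughly 37,000, a factor of about 800. This suggests the time is going somewhere other than the commit walk itself.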
Re: Poor performance of git describe in big repos
On Thu, May 30, 2013 at 11:38:32AM +0100, Alex Bennée wrote:
> One factor might be the size of my repo (.git is around 2.4G). Could this just be due to computational cost of searching through large packs to walk the commit chain? Is there any way to make this easier for git to do?

What does git count-objects -v say for your repository? You may find that performance improves if you repack with git gc --aggressive.
Re: Poor performance of git describe in big repos
The repo is a fairly hairy one as it includes two historically unrelated but content-related repos which I'm in the process of cherry-picking stuff across.

    11:58 ajb@sloy/x86_64 [work.git] git count-objects -v
    count: 493
    size: 4572
    in-pack: 399307
    packs: 1
    size-pack: 1930755
    prune-packable: 0
    garbage: 0
    size-garbage: 0

This was after a repack which did have a slight negative effect on performance. The pack file is:

    13:27 ajb@sloy/x86_64 [work.git] ls -lh ./.git/objects/pack/*
    -r--r--r-- 1 ajb cvs  11M May 30 11:56 ./.git/objects/pack/pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.idx
    -r--r--r-- 1 ajb cvs 1.9G May 30 11:56 ./.git/objects/pack/pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack

I ran perf on it and the top items in the report were:

    41.58%  git  libcrypto.so.1.0.0  [.] 0x6ae73
    33.96%  git  libz.so.1.2.3.4     [.] 0xe0ec
    10.39%  git  libz.so.1.2.3.4     [.] adler32
     2.03%  git  [kernel.kallsyms]   [k] clear_page_c

So I'm guessing it's spending a lot of non-cache-efficient time unpacking and processing the deltas?

--
Alex.

On 30 May 2013 12:48, John Keeping j...@keeping.me.uk wrote:
> On Thu, May 30, 2013 at 11:38:32AM +0100, Alex Bennée wrote:
>> One factor might be the size of my repo (.git is around 2.4G). Could this just be due to computational cost of searching through large packs to walk the commit chain? Is there any way to make this easier for git to do?
>
> What does git count-objects -v say for your repository? You may find that performance improves if you repack with git gc --aggressive.

--
Alex, homepage: http://www.bennee.com/~alex/
Re: Poor performance of git describe in big repos
It looks like it's a file caching effect combined with my repo being more pathological in size and contents. Note run 1 (cold) vs run 2 on the linux file tree:

    13:52 ajb@sloy/x86_64 [linux.git] time git describe --debug --long --tags HEAD~1
    searching to describe HEAD~1
     annotated         57 v2.6.34-rc2
     annotated       1688 v2.6.34-rc1
     annotated       7932 v2.6.33
     annotated       8157 v2.6.33-rc8
     annotated       8381 v2.6.33-rc7
     annotated       8637 v2.6.33-rc6
     annotated       8964 v2.6.33-rc5
     annotated       9493 v2.6.33-rc4
     annotated       9912 v2.6.33-rc3
     annotated      10202 v2.6.33-rc2
    traversed 10547 commits
    more than 10 tags found; listed 10 most recent
    gave up search at 55639353a0035052d9ea6cfe4dde0ac7fcbb2c9f
    v2.6.34-rc2-57-gef5da59

    real 0m7.332s
    user 0m0.308s
    sys  0m0.244s

    14:03 ajb@sloy/x86_64 [linux.git] time git describe --debug --long --tags HEAD~1
    searching to describe HEAD~1
     annotated         57 v2.6.34-rc2
     annotated       1688 v2.6.34-rc1
     annotated       7932 v2.6.33
     annotated       8157 v2.6.33-rc8
     annotated       8381 v2.6.33-rc7
     annotated       8637 v2.6.33-rc6
     annotated       8964 v2.6.33-rc5
     annotated       9493 v2.6.33-rc4
     annotated       9912 v2.6.33-rc3
     annotated      10202 v2.6.33-rc2
    traversed 10547 commits
    more than 10 tags found; listed 10 most recent
    gave up search at 55639353a0035052d9ea6cfe4dde0ac7fcbb2c9f
    v2.6.34-rc2-57-gef5da59

    real 0m0.298s
    user 0m0.244s
    sys  0m0.036s

Although the perf profile looks subtly different. First through the linux tree:

    22.35%  git  libz.so.1.2.3.4  [.] inflate
    18.56%  git  libz.so.1.2.3.4  [.] inflate_fast
    17.48%  git  libz.so.1.2.3.4  [.] inflate_table
     7.84%  git  git              [.] hashcmp
     3.93%  git  git              [.] get_sha1_hex
     3.46%  git  libz.so.1.2.3.4  [.] adler32

And through my special repo:

    41.58%  git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
    33.62%  git  libz.so.1.2.3.4     [.] inflate_fast
    10.39%  git  libz.so.1.2.3.4     [.] adler32
     2.03%  git  [kernel.kallsyms]   [k] clear_page_c

I'm not sure why libcrypto features so highly in the results.

--
Alex.

On 30 May 2013 12:33, Ramkumar Ramachandra artag...@gmail.com wrote:
> Alex Bennée wrote:
>>     time /usr/bin/git --no-pager
>>     traversed 223 commits
>>
>>     real 0m4.817s
>>     user 0m4.320s
>>     sys  0m0.464s
>
> I'm quite clueless about why it is taking this long: I think it's IO because there's nothing to compute? I really can't trace anything unless you can reproduce it on a public repository. On linux.git with my rotating hard disk:
>
>     $ time git describe --debug --long --tags HEAD~1
>     searching to describe HEAD~1
>      annotated       5445 v2.6.33
>      annotated       5660 v2.6.33-rc8
>      annotated       5884 v2.6.33-rc7
>      annotated       6140 v2.6.33-rc6
>      annotated       6467 v2.6.33-rc5
>      annotated       6999 v2.6.33-rc4
>      annotated       7430 v2.6.33-rc3
>      annotated       7746 v2.6.33-rc2
>      annotated       8212 v2.6.33-rc1
>      annotated      13854 v2.6.32
>     traversed 18895 commits
>     more than 10 tags found; listed 10 most recent
>     gave up search at 648f4e3e50c4793d9dbf9a09afa193631f76fa26
>     v2.6.33-5445-ge7c84ee
>
>     real 0m0.509s
>     user 0m0.470s
>     sys  0m0.037s
>
> 18k+ commits traversed in half a second here, so I really don't know what is going on.

--
Alex, homepage: http://www.bennee.com/~alex/
Re: Poor performance of git describe in big repos
> You may find that performance improves if you repack with git gc --aggressive.

It seems that increases the time it takes to reach the same answer:

    14:12 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager describe --long --tags --debug
    searching to describe HEAD
     lightweight       10 ajb-build-test-5224
     lightweight       41 ajb-build-test-5222
     annotated        146 vnms-2-1-36-32
     annotated        155 vnms-2-1-36-31
     annotated        174 vnms-2-1-36-30
     annotated        183 vnms-2-1-36-29
     lightweight      188 vnms-2-1-36-28
     annotated        193 vnms-2-1-36-27
     annotated        206 vnms-2-1-36-26
     annotated        215 vectastar-4-2-83-5
    traversed 223 commits
    more than 10 tags found; listed 10 most recent
    gave up search at 2b69df72d47be8440e3ce4cee91b9b7ceaf8b77c
    ajb-build-test-5224-10-gfa296e6

    real 0m14.658s
    user 0m12.845s
    sys  0m1.776s

On 30 May 2013 12:48, John Keeping j...@keeping.me.uk wrote:
> On Thu, May 30, 2013 at 11:38:32AM +0100, Alex Bennée wrote:
>> One factor might be the size of my repo (.git is around 2.4G). Could this just be due to computational cost of searching through large packs to walk the commit chain? Is there any way to make this easier for git to do?
>
> What does git count-objects -v say for your repository? You may find that performance improves if you repack with git gc --aggressive.

--
Alex, homepage: http://www.bennee.com/~alex/
Re: Poor performance of git describe in big repos
On Thu, May 30, 2013 at 7:29 PM, Alex Bennée kernel-hac...@bennee.com wrote:

I ran perf on it and the top items in the report were:

    41.58% git libcrypto.so.1.0.0 [.] 0x6ae73
    33.96% git libz.so.1.2.3.4 [.] 0xe0ec
    10.39% git libz.so.1.2.3.4 [.] adler32
    2.03% git [kernel.kallsyms] [k] clear_page_c

So I'm guessing it's spending a lot of non-cache efficient time un-packing and processing the deltas?

If I'm not mistaken, commits are never deltified. They are usually small and packed close together for better I/O patterns. If you really just read hundreds of commits, it can't take that long. Maybe some code paths accidentally open a tree, a blob or something.

Can you try setting core.logpackaccess to a path and rerunning describe? Judging from the code (I never actually tried it), it'll create a file at the given path with the accessed pack offsets. You can check what offset corresponds to what object with verify-pack -v.

-- Duy
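The offset-to-object lookup suggested above can be sketched in Python. This is only an illustration under assumptions: the thread never shows the log format, so lines of the form "<pack path> <offset>" are assumed, and the verify-pack -v columns are taken to be "<sha1> <type> <size> <size-in-pack> <offset>" with two extra fields for deltas. The function names and all sample data are invented.

```python
# Sketch: join a pack-access log with `git verify-pack -v` output to see
# which object types (and how many bytes) a command touched.
# ASSUMPTION: log lines look like "<pack path> <offset>".
# ASSUMPTION: verify-pack -v object lines look like
#   "<sha1> <type> <size> <size-in-pack> <offset> [<depth> <base-sha1>]".
# The sample data below is made up purely for illustration.
from collections import Counter

OBJECT_TYPES = ("commit", "tree", "blob", "tag")

VERIFY_SAMPLE = """\
1111111111111111111111111111111111111111 commit 250 170 12
2222222222222222222222222222222222222222 blob   5242880 5100000 182
3333333333333333333333333333333333333333 tag    140 120 5100182
non delta: 3 objects
"""

LOG_SAMPLE = """\
.git/objects/pack/pack-example.pack 12
.git/objects/pack/pack-example.pack 182
.git/objects/pack/pack-example.pack 5100182
"""


def parse_verify_pack(text):
    """Return {offset: (sha1, type, size)} for every object line."""
    by_offset = {}
    for line in text.splitlines():
        fields = line.split()
        # Object lines have 5 fields (7 for deltas); skip summary lines.
        if len(fields) in (5, 7) and fields[1] in OBJECT_TYPES:
            sha1, obj_type, size, _size_in_pack, offset = fields[:5]
            by_offset[int(offset)] = (sha1, obj_type, int(size))
    return by_offset


def bytes_by_type(log_text, by_offset):
    """Sum the uncompressed size of every logged access, grouped by type."""
    totals = Counter()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        # The offset is the last whitespace-separated field on the line.
        offset = int(line.rsplit(None, 1)[-1])
        if offset in by_offset:
            _sha1, obj_type, size = by_offset[offset]
            totals[obj_type] += size
    return dict(totals)


if __name__ == "__main__":
    print(bytes_by_type(LOG_SAMPLE, parse_verify_pack(VERIFY_SAMPLE)))
```

If blobs dominate the byte totals for a command that should only need commits and tags, that points at objects being opened unnecessarily.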
Re: Poor performance of git describe in big repos
On Thu, May 30, 2013 at 8:34 PM, Alex Bennée kernel-hac...@bennee.com wrote:

From the following run:

    14:31 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager describe --long --tags
    ajb-build-test-5224-11-g9660048
    real 0m14.720s
    user 0m12.985s
    sys 0m1.700s
    14:31 ajb@sloy/x86_64 [work.git] wc -l /tmp/log-pack.txt
    1610 /tmp/log-pack.txt

The pack has been tuned with a gc --aggressive. Assuming the numbers are offsets into the pack it looks fairly random access until the last 100 or so. [snipped]

Thanks. Can you share verify-pack -v output of pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack? I think you need to put it somewhere on the Internet temporarily as it's likely to exceed git@vger limits.

-- Duy
Re: Poor performance of git describe in big repos
On 30 May 2013 14:45, Duy Nguyen pclo...@gmail.com wrote:

On Thu, May 30, 2013 at 8:34 PM, Alex Bennée kernel-hac...@bennee.com wrote:

[snip]

Thanks. Can you share verify-pack -v output of pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack? I think you need to put it somewhere on the Internet temporarily as it's likely to exceed git@vger limits.

-- Duy

http://www.bennee.com/~alex/stuff/git-pack-access.tar.bz2

-- Alex, homepage: http://www.bennee.com/~alex/
Re: Poor performance of git describe in big repos
Alex Bennée wrote:

And through my special repo:

    41.58% git libcrypto.so.1.0.0 [.] sha1_block_data_order_ssse3
    33.62% git libz.so.1.2.3.4 [.] inflate_fast
    10.39% git libz.so.1.2.3.4 [.] adler32
    2.03% git [kernel.kallsyms] [k] clear_page_c

I'm not sure why libcrypto features so highly in the results

While Duy churns on the delta chain, let me try to make a (rather crude) observation: What does it mean for libcrypto to be so high in your perf report? sha1_block_data_order is ultimately called by object.c:parse_object. While it indicates that deltas are taking a long time to apply (or are somehow not optimally organized for IO), I think it indicates either:

1. Your history is very deep and there are an unusually high number of deltas for each blob. What is the total number of commits?

2. You have huge (binary) files checked into your repository. Do you? If so, why isn't the code in streaming.c kicking in?
Re: Poor performance of git describe in big repos
On 30 May 2013 15:32, Ramkumar Ramachandra artag...@gmail.com wrote:

Alex Bennée wrote:

And through my special repo:

    41.58% git libcrypto.so.1.0.0 [.] sha1_block_data_order_ssse3
    33.62% git libz.so.1.2.3.4 [.] inflate_fast
    10.39% git libz.so.1.2.3.4 [.] adler32
    2.03% git [kernel.kallsyms] [k] clear_page_c

I'm not sure why libcrypto features so highly in the results

While Duy churns on the delta chain, let me try to make a (rather crude) observation: What does it mean for libcrypto to be so high in your perf report? sha1_block_data_order is ultimately called by object.c:parse_object. While it indicates that deltas are taking a long time to apply (or are somehow not optimally organized for IO), I think it indicates either:

1. Your history is very deep and there are an unusually high number of deltas for each blob. What is the total number of commits?

Well the history does encompass about 10 years of product development and has a high number of files in the repo (including about 3 copies of the kernel - sans upstream history).

    15:50 ajb@sloy/x86_64 [work.git] time git log --pretty=oneline | wc -l
    24648
    real 0m0.434s
    user 0m0.388s
    sys 0m0.112s

Although it doesn't take too long to walk the whole mainline history (obviously ignoring all the other branches).

    15:52 ajb@sloy/x86_64 [work.git] git count-objects -v -H
    count: 581
    size: 5.09 MiB
    in-pack: 399307
    packs: 1
    size-pack: 1.49 GiB
    prune-packable: 0
    garbage: 0
    size-garbage: 0 bytes

It is a pick repo. The gc --aggressive nearly took out my machine, keeping around 4gb resident for most of the half hour and using nearly 8gb of VM.

Of course most of the history is not needed for day to day stuff. Maybe if I split the pack files up it wouldn't be quite such a strain to work through them?

2. You have huge (binary) files checked into your repository. Do you? If so, why isn't the code in streaming.c kicking in?
We do have some binary blobs in the repository (mainly DSP and FPGA images) although not a huge number:

    15:58 ajb@sloy/x86_64 [work.git] time git log --pretty=oneline -- xxx xxx/xx/*.out ./xxx/xxx/*.out ./xxx/xxx/*.out | wc -l
    234
    real 0m0.590s
    user 0m0.552s
    sys 0m0.040s

How can I tell if streaming is kicking in or not?

-- Alex, homepage: http://www.bennee.com/~alex/
Re: Poor performance of git describe in big repos
Alex Bennée wrote:

    15:50 ajb@sloy/x86_64 [work.git] time git log --pretty=oneline | wc -l
    24648
    real 0m0.434s
    user 0m0.388s
    sys 0m0.112s

Although it doesn't take too long to walk the whole mainline history (obviously ignoring all the other branches).

Damn, non-starter. linux.git has 361k+ commits in mainline history. Nit: use git rev-list --count HEAD next time.

    15:52 ajb@sloy/x86_64 [work.git] git count-objects -v -H
    count: 581
    size: 5.09 MiB
    in-pack: 399307
    packs: 1
    size-pack: 1.49 GiB
    prune-packable: 0
    garbage: 0
    size-garbage: 0 bytes

linux.git has 2.9m+ in-pack. The pack-size is much lower at about 800+ MiB, but I don't think 1.49 GiB is a problem in itself. Looking forward to your big-files report to see why it's so big.

It is a pick repo. The gc --aggressive nearly took out my machine, keeping around 4gb resident for most of the half hour and using nearly 8gb of VM. Of course most of the history is not needed for day to day stuff. Maybe if I split the pack files up it wouldn't be quite such a strain to work through them?

Really out of my depth here, sorry. Let's see what Duy (or the others) have to say.

2. You have huge (binary) files checked into your repository. Do you? If so, why isn't the code in streaming.c kicking in?

We do have some binary blobs in the repository (mainly DSP and FPGA images) although not a huge number:

    15:58 ajb@sloy/x86_64 [work.git] time git log --pretty=oneline -- xxx xxx/xx/*.out ./xxx/xxx/*.out ./xxx/xxx/*.out | wc -l
    234
    real 0m0.590s
    user 0m0.552s
    sys 0m0.040s

log is streaming, and is not a good measure: it doesn't even walk the entire commit graph. How big are these files?

How can I tell if streaming is kicking in or not?

I use callgrind (and kcachegrind to visualize). Can you post callgrind output? It will be helpful in figuring out where exactly git is spending time.
Re: Poor performance of git describe in big repos
Alex Bennée kernel-hac...@bennee.com writes:

    41.58% git libcrypto.so.1.0.0 [.] sha1_block_data_order_ssse3
    33.62% git libz.so.1.2.3.4 [.] inflate_fast
    10.39% git libz.so.1.2.3.4 [.] adler32
    2.03% git [kernel.kallsyms] [k] clear_page_c

Do you have any large blobs in the repo that are referenced directly by a tag? Because this just so happens to exactly reproduce your symptoms:

    # in a random git.git
    $ time git describe --debug
    [...]
    real 0m0.390s
    user 0m0.037s
    sys 0m0.011s
    $ git tag big1 $(dd if=/dev/urandom bs=1M count=512 | git hash-object -w --stdin)
    512+0 records in
    512+0 records out
    536870912 bytes (537 MB) copied, 45.5088 s, 11.8 MB/s
    $ time git describe --debug
    [...]
    real 0m1.875s
    user 0m1.738s
    sys 0m0.129s
    $ git tag big2 $(dd if=/dev/urandom bs=1M count=512 | git hash-object -w --stdin)
    512+0 records in
    512+0 records out
    536870912 bytes (537 MB) copied, 44.972 s, 11.9 MB/s
    $ time git describe --debug
    searching to describe HEAD
    [...]
    real 0m3.620s
    user 0m3.357s
    sys 0m0.248s

(I actually ran the git-describe invocations more than once to ensure that they are again cache-hot.)

git-describe should probably be fixed to avoid loading blobs, though I'm not sure offhand if we have any infrastructure to infer the type of a loose object without inflating it. (This could probably be added by inflating only the first block.) We do have this for packed objects, so at least for packed repos there's a speedup to be had.

-- Thomas Rast trast@{inf,student}.ethz.ch
Re: Poor performance of git describe in big repos
Alex Bennée kernel-hac...@bennee.com writes:

On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:

Alex Bennée kernel-hac...@bennee.com writes:

    41.58% git libcrypto.so.1.0.0 [.] sha1_block_data_order_ssse3
    33.62% git libz.so.1.2.3.4 [.] inflate_fast
    10.39% git libz.so.1.2.3.4 [.] adler32
    2.03% git [kernel.kallsyms] [k] clear_page_c

Do you have any large blobs in the repo that are referenced directly by a tag?

Most probably. I've certainly done a bunch of releases (which are tagged) where the last thing that was updated was an FPGA image.

[...] git-describe should probably be fixed to avoid loading blobs, though I'm not sure offhand if we have any infrastructure to infer the type of a loose object without inflating it. (This could probably be added by inflating only the first block.) We do have this for packed objects, so at least for packed repos there's a speedup to be had.

Will it be loading the blob for every commit it traverses or just ones that hit a tag? Why does it need to load the blob at all? Surely the commit tree state doesn't need to be walked down?

No, my theory is that you tagged *the blobs*. Git supports this.

git-describe needs to look at the commit (if any) obtained by peeling each tag (i.e. dereferencing tags until it reaches a non-tag). So to do that, it resolves the tag's referent and loads it. Usually this will be a commit, in which case it is marked as reached by the tag.

As my example shows, it also resolves tags' referents if they refer to non-commits; in particular, it will decompress large blobs that are (directly) referenced by a tag.

Note that while annotated tags provide the type information themselves, e.g.

    $ git cat-file tag junio-gpg-pub
    object fe113d3f96636710600c6b02d5fd421fa7e87dd6
    type blob
    tag junio-gpg-pub
    [...]

unannotated tags are simply refs, so it is not enough to just look at the tag objects' referent type.
I had a brief look around sha1_file.c, in particular sha1_object_info, and it turns out we lack the logic to inflate only the early part of an object, as I suspected. So that'll have to be fixed first. After that I *think* it should automatically carry over into the tag readers.

-- Thomas Rast trast@{inf,student}.ethz.ch
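The "inflate only the early part" idea can be sketched in Python. This is an illustration, not git's code: it assumes the plain loose-object layout, a zlib stream whose inflated form starts with "<type> <size>" followed by a NUL byte and then the content. The helper names and the 64-byte probe size are invented for the example.

```python
# Sketch: learn a loose object's type and size without inflating its body.
# ASSUMPTION: the object is zlib("<type> <size>\0<content>").
# loose_object_type, make_loose and the probe size are hypothetical names.
import zlib


def make_loose(obj_type, content):
    """Build the compressed bytes of a loose object (for the demo only)."""
    header = ("%s %d" % (obj_type, len(content))).encode("ascii") + b"\0"
    return zlib.compress(header + content)


def loose_object_type(compressed, probe=64):
    """Return (type, size), producing at most `probe` inflated bytes.

    A real reader would also read only a prefix of the file; here the
    whole buffer is passed in and max_length bounds the inflate work.
    """
    inflater = zlib.decompressobj()
    head = inflater.decompress(compressed, probe)
    header, nul, _rest = head.partition(b"\0")
    if not nul:
        raise ValueError("no object header in the first %d bytes" % probe)
    obj_type, size = header.split(b" ")
    return obj_type.decode("ascii"), int(size)


if __name__ == "__main__":
    big_blob = make_loose("blob", b"x" * 1000000)
    print(loose_object_type(big_blob))
```

With a check like this, a caller that only wants commits could skip a blob after a few dozen bytes instead of inflating and hashing all of it.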
Re: Poor performance of git describe in big repos
The culprit, according to some callgrind investigation, is lookup_commit_reference_gently() [for the unannotated case] or deref_tag() [annotated case] calling parse_object(). Using the scenario you described earlier, I think it ends up spending most of its time in check_sha1_signature() (both deref_tag() and lookup_commit_reference_gently() go there), with 20% inflating and 80% in SHA1_Update(). Not much we can do about that, can we?
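To see why that split scales with object size, here is a conceptual Python sketch of a signature check: inflate the whole object, hash header plus content with SHA-1, and compare against the object name. It is not a transcription of check_sha1_signature(); the function names are invented and the loose-object layout "<type> <size>\0<content>" is assumed.

```python
# Sketch: verifying an object name means inflating AND hashing every byte,
# so a 512 MB blob behind a tag costs 512 MB of zlib and SHA-1 work.
# ASSUMPTION: loose-object layout "<type> <size>\0<content>";
# object_name and verify_loose are hypothetical helper names.
import hashlib
import zlib


def object_name(obj_type, content):
    """SHA-1 over the header and the content, as a hex string."""
    header = ("%s %d" % (obj_type, len(content))).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def verify_loose(expected_name, compressed):
    """Inflate the full object and re-hash it; both steps are O(size)."""
    raw = zlib.decompress(compressed)
    return hashlib.sha1(raw).hexdigest() == expected_name


if __name__ == "__main__":
    content = b"hello\n"
    name = object_name("blob", content)
    stored = zlib.compress(b"blob %d\0" % len(content) + content)
    print(name, verify_loose(name, stored))
```

The hashing cannot be skipped if the name is to be verified, which is why avoiding the load in the first place is the more promising direction.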
Re: Poor performance of git describe in big repos
On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:

Alex Bennée kernel-hac...@bennee.com writes:

On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:

Alex Bennée kernel-hac...@bennee.com writes:

    41.58% git libcrypto.so.1.0.0 [.] sha1_block_data_order_ssse3
    33.62% git libz.so.1.2.3.4 [.] inflate_fast
    10.39% git libz.so.1.2.3.4 [.] adler32
    2.03% git [kernel.kallsyms] [k] clear_page_c

Do you have any large blobs in the repo that are referenced directly by a tag?

Most probably. I've certainly done a bunch of releases (which are tagged) where the last thing that was updated was an FPGA image.

[...] git-describe should probably be fixed to avoid loading blobs, though I'm not sure offhand if we have any infrastructure to infer the type of a loose object without inflating it. (This could probably be added by inflating only the first block.) We do have this for packed objects, so at least for packed repos there's a speedup to be had.

Will it be loading the blob for every commit it traverses or just ones that hit a tag? Why does it need to load the blob at all? Surely the commit tree state doesn't need to be walked down?

No, my theory is that you tagged *the blobs*. Git supports this.

You can see if that is the case by doing something like this:

    eval $(git for-each-ref --shell --format '
        test $(git cat-file -t %(objectname)^{}) = commit ||
        echo %(refname);')

That will print out the name of any ref that doesn't point at a commit.
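The check that command performs (peel each ref through any tag objects, then test whether it lands on a commit) can be sketched in Python over a toy in-memory object store. Everything here is hypothetical: the object names, the ref names and the two helper functions are made up, and no real repository is read.

```python
# Sketch: peel refs through annotated tags and report those that do not
# end at a commit. The object store is a toy dict, not a git repository.
# Each entry maps an object name to (type, target-if-tag-else-None).
OBJECTS = {
    "c1": ("commit", None),
    "b1": ("blob", None),
    "t1": ("tag", "c1"),  # annotated tag pointing at a commit
    "t2": ("tag", "b1"),  # annotated tag pointing at a blob
    "t3": ("tag", "t1"),  # tag of a tag
}

REFS = {
    "refs/heads/master": "c1",
    "refs/tags/v1.0": "t1",
    "refs/tags/light": "c1",       # lightweight tag on a commit
    "refs/tags/fpga-image": "b1",  # lightweight tag on a blob
    "refs/tags/signed-key": "t2",
    "refs/tags/nested": "t3",
}


def peel(name, objects):
    """Follow tag objects until a non-tag is reached; return (name, type)."""
    obj_type, target = objects[name]
    while obj_type == "tag":
        name = target
        obj_type, target = objects[name]
    return name, obj_type


def refs_not_on_commits(refs, objects):
    """Names of refs whose peeled object is anything but a commit."""
    return sorted(
        ref for ref, name in refs.items()
        if peel(name, objects)[1] != "commit"
    )


if __name__ == "__main__":
    for ref in refs_not_on_commits(REFS, OBJECTS):
        print(ref)
```

Note that for the lightweight tag on a blob there is no tag object to consult, so the type has to come from the referenced object itself, which is exactly the lookup that is expensive for large blobs.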