Re: Poor performance of git describe in big repos

2013-06-03 Thread Alex Bennée
On 31 May 2013 10:57, Alex Bennée kernel-hac...@bennee.com wrote:
 On 31 May 2013 09:46, Thomas Rast tr...@inf.ethz.ch wrote:

 So that deleted all unannotated tags pointing at commits, and then it
 was fast.  Curious.

 However, if that turns out to be the culprit, it's not fixable
 currently[1].  Having commits with insanely long messages is just, well,
 insane.


 [1]  unless we do a major rework of the loading infrastructure, so that
 we can teach it to load only the beginning of a commit as long as we are
 only interested in parents and such

 I'll do a bit of scripting to dig into the nature of these
 uber-commits and try and work out how they came about. I suspect they
 are simply start of branch states in our broken and disparate history.

 I'll get back to you once I've dug a little deeper.

So I wrote a little script [1] which I ran to remove all tags that did
not exist on any branches:

git-tag-cleaner.py -d no-branch

After a lot of churning:

17:26 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m0.799s
user    0m0.024s
sys     0m0.052s

So at least I can fix up my repo. All the big ones look as though they
were weird cvs2svn creations that exist to represent the detached state
of a strange CVS tag from the converted repository. However, it does
raise one question.

Why is git attempting to parse a commit not on the DAG for the branch
I'm attempting to describe?

Anyway, as I have a workaround, I'm going to do a slightly more
conservative clean of the repo with my script and move on.

[1] https://github.com/stsquad/git-tag-cleaner

-- 
Alex, homepage: http://www.bennee.com/~alex/
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-06-03 Thread Alex Bennée
On 31 May 2013 17:17, Jeff King p...@peff.net wrote:
 On Fri, May 31, 2013 at 12:27:11PM +0200, Thomas Rast wrote:

 Thomas Rast tr...@inf.ethz.ch writes:

  However, if that turns out to be the culprit, it's not fixable
  currently[1].  Having commits with insanely long messages is just, well,
  insane.
 
  [1]  unless we do a major rework of the loading infrastructure, so that
  we can teach it to load only the beginning of a commit as long as we are
  only interested in parents and such

 Actually, Peff, doesn't your commit parent/tree pointer caching give us
 this for free?

 It does. You can test it from the jk/metapacks branch at
 git://github.com/peff/git. After building, you'd need to do:

   $ git gc
   $ git metapack --all --commits

 in the target repository. You can check that it's working because git
 rev-list --all --count should be an order of magnitude faster. You may
 need to add save_commit_buffer = 0 in any commands you are checking,
 though, as the optimization can only kick in if parse_commit does not
 want to save the buffer as a side effect.

Is this a command line argument? The tools don't seem to think so.

Anyway it seems to make a marginal difference to my case:

09:08 ajb@sloy/x86_64 [work.git] time git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m14.105s
user    0m12.409s
sys     0m1.660s
09:11 ajb@sloy/x86_64 [work.git] git gc
Counting objects: 399436, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (110874/110874), done.
Writing objects: 100% (399436/399436), done.
Total 399436 (delta 281538), reused 398357 (delta 280493)
Checking connectivity: 399436, done.
09:12 ajb@sloy/x86_64 [work.git] git metapack --all --commits
09:13 ajb@sloy/x86_64 [work.git] time git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m12.781s
user    0m11.669s
sys     0m1.080s
09:32 ajb@sloy/x86_64 [work.git] time git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m12.768s
user    0m11.817s
sys     0m0.908s
09:33 ajb@sloy/x86_64 [work.git] time git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m12.642s
user    0m11.705s
sys     0m0.904s



 I also looked into trying to just read the beginning part of a commit[1],
 but it turned out not to be all that much of an improvement.

 -Peff

 [1] http://article.gmane.org/gmane.comp.version-control.git/212301



-- 
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-06-03 Thread Junio C Hamano
Alex Bennée kernel-hac...@bennee.com writes:

 Why is git attempting to parse a commit not on the DAG for the branch
 I'm attempting to describe?

I think that is because you need to parse the objects at the tip of
refs to see if they are on the DAG in the first place.

If there weren't any annotated tag, conceivably you could do without
parsing these objects.  You would:

 - First read the refs without parsing anything to learn the object
   name of the tips of refs;

 - Traverse the DAG, starting from the commit and notice when you
   see commits that are at the tips of refs you learned in the first
   step, arranging to stop when you found the closest tip.

But with annotated tags (and git describe is designed to be
primarily used with them; you would need --tags option to make it
notice unannotated tags), the object name you see sitting at the tip
will never appear during the DAG traversal.  You will only see
commits from the latter, so you would need to parse the tips to
learn what commits they refer to.

And of course, then parse only annotated tags, without parsing
commits would not work, because you wouldn't know what the object
is without looking at it ;-)
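
As an illustration of the two steps above (not git's actual
implementation, which weighs several candidate tags and counts the
commits between them), here is a rough shell sketch. The helper names
tag_targets and first_tagged and the file name "pairs" are made up for
this example. The first helper turns for-each-ref output into
"commit refname" pairs, using %(*objectname) to get at the object an
annotated tag points to; the second stops at the first commit in a
traversal that carries a tag.

```shell
# stdin: lines from
#   git for-each-ref --format='%(objectname) %(*objectname) %(refname)' refs/tags
# For a lightweight tag the middle field is empty (2 fields); for an
# annotated tag it names the object the tag object points at (3 fields).
# Output: "<target-id> <refname>", i.e. the id a DAG walk would actually see.
tag_targets() {
    awk 'NF == 3 { print $2, $3 }
         NF == 2 { print $1, $2 }'
}

# $1: file of "<commit-id> <refname>" pairs as written by tag_targets
# stdin: commit ids in traversal order, e.g. from `git rev-list HEAD`
# Prints the ref name of the first traversed commit that has a tag.
first_tagged() {
    awk 'NR == FNR { name[$1] = $2; next }
         ($1 in name) { print name[$1]; exit }' "$1" -
}

# Usage, inside a repository:
#   git for-each-ref --format='%(objectname) %(*objectname) %(refname)' \
#       refs/tags | tag_targets > pairs
#   git rev-list HEAD | first_tagged pairs
```

Note that the id in the first column of the for-each-ref output for an
annotated tag never shows up in the rev-list output; only the peeled
id in the second column does, which is why the tips have to be looked
at before the walk.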


Re: Poor performance of git describe in big repos

2013-06-03 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 Alex Bennée kernel-hac...@bennee.com writes:

 Why is git attempting to parse a commit not on the DAG for the branch
 I'm attempting to describe?

 I think that is because you need to parse the objects at the tip of
 refs to see if they are on the DAG in the first place.

 If there weren't any annotated tag, conceivably you could do without
 parsing these objects.  You would:

  - First read the refs without parsing anything to learn the object
name of the tips of refs;

  - Traverse the DAG, starting from the commit and notice when you
see commits that are at the tips of refs you learned in the first
step, arranging to stop when you found the closest tip.

 But with annotated tags (and git describe is designed to be
 primarily used with them; you would need --tags option to make it
 notice unannotated tags), the object name you see sitting at the tip
 will never appear during the DAG traversal.  You will only see
 commits from the latter, so you would need to parse the tips to
 learn what commits they refer to.

 And of course, then parse only annotated tags, without parsing
 commits would not work, because you wouldn't know what the object
 is without looking at it ;-)

Having said all that, with changes by Peff and Michael Haggerty
around f85354b5c7b8 (pack_one_ref(): use function peel_entry(),
2013-04-22), recent Git does not parse as many refs as it used to,
only to figure out what commit an annotated tag points at when your
refs are packed, so we may be a lot closer to the optimum than I
hinted by the above description.


Re: Poor performance of git describe in big repos

2013-05-31 Thread Alex Bennée
On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
 On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
 Alex Bennée kernel-hac...@bennee.com writes:

  On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
  Alex Bennée kernel-hac...@bennee.com writes:
 snip
  Will it be loading the blob for every commit it traverses or just ones 
  that hit
  a tag? Why does it need to load the blob at all? Surely the commit
  tree state doesn't
  need to be walked down?

 No, my theory is that you tagged *the blobs*.  Git supports this.

Wait, is this the difference between annotated and non-annotated tags?
I thought a non-annotated tag just acted like a reference to a
particular tree state?


 You can see if that is the case by doing something like this:

 eval $(git for-each-ref --shell --format '
 test $(git cat-file -t %(objectname)^{}) = commit ||
 echo %(refname);')

 That will print out the name of any ref that doesn't point at a
 commit.

Hmm that didn't seem to work. But looking at the output by hand I
certainly have a mix of tags that are commits vs tags:


09:08 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags | grep commit | wc -l
1345
09:12 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags | grep -v commit | wc -l
66

Unfortunately I can't just delete those tags as they do refer to known
releases which we obviously care about. If I delete the tags on my
local repo and test for a speed increase can I re-create them as
annotated tag objects?

-- 
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-05-31 Thread Thomas Rast
Alex Bennée kernel-hac...@bennee.com writes:

 On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
 On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
 Alex Bennée kernel-hac...@bennee.com writes:

  On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
  Alex Bennée kernel-hac...@bennee.com writes:
 snip
  Will it be loading the blob for every commit it traverses or just ones 
  that hit
  a tag? Why does it need to load the blob at all? Surely the commit
  tree state doesn't
  need to be walked down?

 No, my theory is that you tagged *the blobs*.  Git supports this.

 Wait is this the difference between annotated and non-annotated tags?
 I thought a non-annotated just acted like references to a particular
 tree state?

A tag is just a ref.  It can point at anything, in particular also a
blob (= some file *contents*).

An annotated tag is just a tag pointing at a tag object.  A tag object
contains tagger name/email/date, a reference to an object, and a tag
message.

The slowness I found relates to having tags that point at blobs directly
(unannotated).

 You can see if that is the case by doing something like this:

 eval $(git for-each-ref --shell --format '
 test $(git cat-file -t %(objectname)^{}) = commit ||
 echo %(refname);')

 That will print out the name of any ref that doesn't point at a
 commit.

 Hmm that didn't seem to work. But looking at the output by hand I
 certainly have a mix of tags that are commits vs tags:


 09:08 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags
 | grep commit | wc -l
 1345
 09:12 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags
 | grep -v commit | wc -l
 66

 Unfortunately I can't just delete those tags as they do refer to known
 releases which we obviously care about. If I delete the tags on my
 local repo and test for a speed increase can I re-create them as
 annotated tag objects?

I would be more interested in this:

  git for-each-ref | grep ' blob'

and

  (git for-each-ref | grep ' blob' | cut -d\  -f1 | xargs -n1 git cat-file blob) | wc -c

The first tells you if you have any refs pointing at blobs.  The second
computes their total unpacked size.  My theory is that the second yields
some large number (hundreds of megabytes at least).
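
A possible variation on the second pipeline, sketched on the assumption
that the installed git's for-each-ref supports the %(objectsize) field:
it sums the sizes per object type without writing the objects to a
pipe. The function name total_bytes_of_type is made up for this
example.

```shell
# stdin: "<size> <type> <refname>" lines, e.g. from
#   git for-each-ref --format='%(objectsize) %(objecttype) %(refname)'
# $1: object type to total (blob, commit, tag or tree)
# Prints the summed size in bytes of all listed objects of that type.
total_bytes_of_type() {
    awk -v t="$1" '$2 == t { s += $1 } END { printf "%.0f\n", s }'
}

# Usage, inside a repository:
#   git for-each-ref --format='%(objectsize) %(objecttype) %(refname)' |
#       total_bytes_of_type blob
```

Like the pipeline above, this counts an object once per ref that
points at it, so an object with several refs is counted several times.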

It would be nice if you checked, because if there turn out to be big
blobs, we have all the pieces and just need to assemble the best
solution.  Otherwise, there's something else going on and the problem
remains open.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


Re: Poor performance of git describe in big repos

2013-05-31 Thread John Keeping
On Fri, May 31, 2013 at 09:14:49AM +0100, Alex Bennée wrote:
 On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
  On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
  Alex Bennée kernel-hac...@bennee.com writes:
 
   On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
   Alex Bennée kernel-hac...@bennee.com writes:
  snip
   Will it be loading the blob for every commit it traverses or just ones 
   that hit
   a tag? Why does it need to load the blob at all? Surely the commit
   tree state doesn't
   need to be walked down?
 
  No, my theory is that you tagged *the blobs*.  Git supports this.
 
 Wait is this the difference between annotated and non-annotated tags?
 I thought a non-annotated just acted like references to a particular
 tree state?

No, this is something slightly different.  In Git there are four types
of object: tag, commit, tree and blob.  When you have a heavyweight tag,
the tag reference points at a tag object (which in turn points at
another object).  With a lightweight tag, the tag reference typically
points at a commit object.

However, there is no restriction that says that a tag object must point
to a commit or that a lightweight tag must point at a commit - it is
equally possible to point directly at a tree or a blob (although a lot
less common).

Thomas is suggesting that you might have a tag that does not point at a
commit but instead points to a blob object.
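
One way to check this without eval is to let for-each-ref report the
types itself. This is only a sketch: the function name non_commit_refs
is made up, and it relies on %(*objecttype) being empty for refs that
do not point at a tag object.

```shell
# stdin: lines from
#   git for-each-ref --format='%(objecttype) %(*objecttype) %(refname)'
# Field layout: the ref's own object type, then the type of the object a
# tag object points at (empty unless the ref points at a tag object).
# Prints every ref whose effective target is not a commit.
non_commit_refs() {
    awk 'NF == 3 && $2 != "commit" { print $3 }
         NF == 2 && $1 != "commit" { print $2 }'
}

# Usage, inside a repository:
#   git for-each-ref --format='%(objecttype) %(*objecttype) %(refname)' |
#       non_commit_refs
```

No output means every ref ends up at a commit, either directly or
through a tag object.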

  You can see if that is the case by doing something like this:
 
  eval $(git for-each-ref --shell --format '
  test $(git cat-file -t %(objectname)^{}) = commit ||
  echo %(refname);')
 
  That will print out the name of any ref that doesn't point at a
  commit.
 
 Hmm that didn't seem to work.

You mean there was no output?  In that case it's likely that all your
references do indeed point at commits.

   But looking at the output by hand I
 certainly have a mix of tags that are commits vs tags:
 
 
 09:08 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags
 | grep commit | wc -l
 1345
 09:12 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags
 | grep -v commit | wc -l
 66

This means that you have 1345 lightweight tags and 66 heavyweight tags,
assuming that all of the lines that don't say commit do say tag.
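
That assumption can be checked by counting on the type column rather
than grepping whole lines, since "grep commit" also matches any tag
whose name happens to contain "commit". A small sketch (the function
name tally_tag_types is made up for this example):

```shell
# stdin: default `git for-each-ref` output, one ref per line in the form
#   <object-id> <type><TAB><refname>
# Prints "<type> <count>" for every object type found under refs/tags/.
tally_tag_types() {
    awk '$3 ~ /^refs\/tags\// { n[$2]++ }
         END { for (t in n) print t, n[t] }' | sort
}

# Usage, inside a repository:
#   git for-each-ref | tally_tag_types
```

A "commit" line counts lightweight tags and a "tag" line counts
annotated ones; any "blob" or "tree" line would be a tag pointing
directly at a non-commit.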

By the way, I don't remember if you said which version of Git you're
using.  If it's an older version then it's possible that something has
changed.


Re: Poor performance of git describe in big repos

2013-05-31 Thread Alex Bennée
On 31 May 2013 09:24, Thomas Rast tr...@inf.ethz.ch wrote:
 Alex Bennée kernel-hac...@bennee.com writes:
 On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
 On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
 Alex Bennée kernel-hac...@bennee.com writes:
  On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
 snip
 No, my theory is that you tagged *the blobs*.  Git supports this.

 Wait is this the difference between annotated and non-annotated tags?
 I thought a non-annotated just acted like references to a particular
 tree state?

 A tag is just a ref.  It can point at anything, in particular also a
 blob (= some file *contents*).

 An annotated tag is just a tag pointing at a tag object.  A tag object
 contains tagger name/email/date, a reference to an object, and a tag
 message.

 The slowness I found relates to having tags that point at blobs directly
 (unannotated).

I think you are right. I was brave (well I assumed the tags would come
back from the upstream repo) and ran:

git for-each-ref | grep refs/tags | grep commit | cut -d '/' -f 3 | xargs git tag -d

And boom:

09:19 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager describe --long --tags
ajb-build-test-5225-2-gdc0b771

real    0m0.009s
user    0m0.008s
sys     0m0.000s

Which is much better performance. So it does look like unannotated
tags pointing at binary blobs is the failure case.

snip

 I would be more interested in this:

   git for-each-ref | grep ' blob'

Hmmm that gives nothing. All the refs are either tag or commit

 and

   (git for-each-ref | grep ' blob' | cut -d\  -f1 | xargs -n1 git
cat-file blob) | wc -c

However I have some big commits it seems:

09:37 ajb@sloy/x86_64 [work.git] (git for-each-ref | grep ' commit' |
cut -d\  -f1 | xargs -n1 git cat-file commit) | wc -c
1147231984


 The first tells you if you have any refs pointing at blobs.  The second
 computes their total unpacked size.  My theory is that the second yields
 some large number (hundreds of megabytes at least).

 It would be nice if you checked, because if there turn out to be big
 blobs, we have all the pieces and just need to assemble the best
 solution.  Otherwise, there's something else going on and the problem
 remains open.

If you want any other numbers I'm only too happy to help. Sorry I
can't share the repo though...

-- 
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-05-31 Thread Thomas Rast
Alex Bennée kernel-hac...@bennee.com writes:

 I think you are right. I was brave (well I assumed the tags would come
 back from the upstream repo) and ran:

 git for-each-ref | grep refs/tags | grep commit | cut -d '/' -f 3
 | xargs git tag -d

So that deleted all unannotated tags pointing at commits, and then it
was fast.  Curious.

 However I have some big commits it seems:

 09:37 ajb@sloy/x86_64 [work.git] (git for-each-ref | grep ' commit' |
 cut -d\  -f1 | xargs -n1 git cat-file commit) | wc -c
 1147231984

How many unique entries are there in that list, i.e., what does

  git for-each-ref | grep ' commit' | cut -d\  -f1 | sort -u | wc -l

say?  Perhaps you can also find the biggest commit, e.g. like so:

  git for-each-ref | grep ' commit' | cut -d\  -f1 |
  while read sha; do git cat-file commit $sha | wc -c; done |
  sort -n

However, if that turns out to be the culprit, it's not fixable
currently[1].  Having commits with insanely long messages is just, well,
insane.


[1]  unless we do a major rework of the loading infrastructure, so that
we can teach it to load only the beginning of a commit as long as we are
only interested in parents and such

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


Re: Poor performance of git describe in big repos

2013-05-31 Thread Alex Bennée
On 31 May 2013 09:32, John Keeping j...@keeping.me.uk wrote:
 On Fri, May 31, 2013 at 09:14:49AM +0100, Alex Bennée wrote:
 On 30 May 2013 20:30, John Keeping j...@keeping.me.uk wrote:
  On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
  Alex Bennée kernel-hac...@bennee.com writes:
 
   On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
   Alex Bennée kernel-hac...@bennee.com writes:
  snip
   Will it be loading the blob for every commit it traverses or just ones 
   that hit
   a tag? Why does it need to load the blob at all? Surely the commit
   tree state doesn't
   need to be walked down?
 
  No, my theory is that you tagged *the blobs*.  Git supports this.

 Wait is this the difference between annotated and non-annotated tags?
 I thought a non-annotated just acted like references to a particular
 tree state?

 No, this is something slightly different.  In Git there are four types
 of object: tag, commit, tree and blob.  When you have a heavyweight tag,
 the tag reference points at a tag object (which in turn points at
 another object).  With a lightweight tag, the tag reference typically
 points at a commit object.

I think this is the case in my repo.

 However, there is no restriction that says that a tag object must point
 to a commit or that a lightweight tag must point at a commit - it is
 equally possible to point directly at a tree or a blob (although a lot
 less common).

 Thomas is suggesting that you might have a tag that does not point at a
 commit but instead points to a blob object.

It's looking like I just have some very heavy commits. One data point
I probably should have mentioned at the beginning is this was a
converted CVS repo and I'm wondering if some of the artifacts that
introduced has contributed to this.

  You can see if that is the case by doing something like this:
 
  eval $(git for-each-ref --shell --format '
  test $(git cat-file -t %(objectname)^{}) = commit ||
  echo %(refname);')
 
  That will print out the name of any ref that doesn't point at a
  commit.

 Hmm that didn't seem to work.

 You mean there was no output?  In that case it's likely that all your
 references do indeed point at commits.

Correct.


   But looking at the output by hand I
 certainly have a mix of tags that are commits vs tags:


 09:08 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags
 | grep commit | wc -l
 1345
 09:12 ajb@sloy/x86_64 [work.git] git for-each-ref | grep refs/tags
 | grep -v commit | wc -l
 66

 This means that you have 1345 lightweight tags and 66 heavyweight tags,
 assuming that all of the lines that don't say commit do say tag.

Yep all commits and tags, nothing else

 By the way, I don't remember if you said which version of Git you're
 using.  If it's an older version then it's possible that something has
 changed.

I'm running the GIT stable PPA:

09:38 ajb@sloy/x86_64 [work.git] git --version
git version 1.8.3

Although I have also tested with the latest git.git maint. I'm happy
to try master if it's likely to have changed.

-- 
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-05-31 Thread John Keeping
On Fri, May 31, 2013 at 09:49:57AM +0100, Alex Bennée wrote:
 On 31 May 2013 09:32, John Keeping j...@keeping.me.uk wrote:
  Thomas is suggesting that you might have a tag that does not point at a
  commit but instead points to a blob object.
 
 It's looking like I just have some very heavy commits. One data point
 I probably should have mentioned at the beginning is this was a
 converted CVS repo and I'm wondering if some of the artifacts that
 introduced has contributed to this.

You can try another for-each-ref invocation to see if that's the case:

eval "$(git for-each-ref --format 'printf "%s %s\n" \
"$(git cat-file -s %(objectname))" "%(refname)";')" | sort -n

That will print the size of each object followed by the ref that points
to it, sorted by size.
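
If the installed git's for-each-ref supports the %(objectsize) field,
the same listing can be had without eval, and a small filter narrows
it to the heavy objects. This is a sketch only; the function name
refs_over and the one-megabyte threshold in the usage line are made up
for illustration.

```shell
# stdin: "<size> <type> <refname>" lines, e.g. from
#   git for-each-ref --format='%(objectsize) %(objecttype) %(refname)'
# $1: minimum object size in bytes
# Prints the lines whose object is at least that big, largest first.
refs_over() {
    awk -v min="$1" '$1 + 0 >= min + 0' | sort -rn
}

# Usage, inside a repository:
#   git for-each-ref --format='%(objectsize) %(objecttype) %(refname)' |
#       refs_over 1000000
```

For a ref that points at a tag object, the size shown is that of the
tag object itself, not of the commit it refers to.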

 I'm running the GIT stable PPA:
 
 09:38 ajb@sloy/x86_64 [work.git] git --version
 git version 1.8.3
 
 Although I have also tested with the latest git.git maint. I'm happy
 to try master if it's likely to have changed.

master's still very close to 1.8.3 at the moment, so I don't think that
will make a difference.


Re: Poor performance of git describe in big repos

2013-05-31 Thread Alex Bennée
On 31 May 2013 09:46, Thomas Rast tr...@inf.ethz.ch wrote:
 Alex Bennée kernel-hac...@bennee.com writes:

 I think you are right. I was brave (well I assumed the tags would come
 back from the upstream repo) and ran:

 git for-each-ref | grep refs/tags | grep commit | cut -d '/' -f 3
 | xargs git tag -d

 So that deleted all unannotated tags pointing at commits, and then it
 was fast.  Curious.

 However I have some big commits it seems:

 09:37 ajb@sloy/x86_64 [work.git] (git for-each-ref | grep ' commit' |
 cut -d\  -f1 | xargs -n1 git cat-file commit) | wc -c
 1147231984

 How many unique entries are there in that list, i.e., what does

   git for-each-ref | grep ' commit' | cut -d\  -f1 | sort -u | wc -l

09:49 ajb@sloy/x86_64 [work.git] git for-each-ref | grep ' commit' |
cut -d\  -f1 | sort -u | wc -l
1508

 say?  Perhaps you can also find the biggest commit, e.g. like so:

   git for-each-ref | grep ' commit' | cut -d\  -f1 |
   while read sha; do git cat-file commit $sha | wc -c; done |
   sort -n

Yeah there is a range from a few hundred bytes to a large number of 3M
commits. I guess I need to identify which commits they are and remove
the tags or convert them to annotated reference tags.

 However, if that turns out to be the culprit, it's not fixable
 currently[1].  Having commits with insanely long messages is just, well,
 insane.



 [1]  unless we do a major rework of the loading infrastructure, so that
 we can teach it to load only the beginning of a commit as long as we are
 only interested in parents and such

I'll do a bit of scripting to dig into the nature of these
uber-commits and try and work out how they came about. I suspect they
are simply start of branch states in our broken and disparate history.

I'll get back to you once I've dug a little deeper.


 --
 Thomas Rast
 trast@{inf,student}.ethz.ch



-- 
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-05-31 Thread Thomas Rast
Thomas Rast tr...@inf.ethz.ch writes:

 However, if that turns out to be the culprit, it's not fixable
 currently[1].  Having commits with insanely long messages is just, well,
 insane.

 [1]  unless we do a major rework of the loading infrastructure, so that
 we can teach it to load only the beginning of a commit as long as we are
 only interested in parents and such

Actually, Peff, doesn't your commit parent/tree pointer caching give us
this for free?

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


Re: Poor performance of git describe in big repos

2013-05-31 Thread Jeff King
On Fri, May 31, 2013 at 12:27:11PM +0200, Thomas Rast wrote:

 Thomas Rast tr...@inf.ethz.ch writes:
 
  However, if that turns out to be the culprit, it's not fixable
  currently[1].  Having commits with insanely long messages is just, well,
  insane.
 
  [1]  unless we do a major rework of the loading infrastructure, so that
  we can teach it to load only the beginning of a commit as long as we are
  only interested in parents and such
 
 Actually, Peff, doesn't your commit parent/tree pointer caching give us
 this for free?

It does. You can test it from the jk/metapacks branch at
git://github.com/peff/git. After building, you'd need to do:

  $ git gc
  $ git metapack --all --commits

in the target repository. You can check that it's working because git
rev-list --all --count should be an order of magnitude faster. You may
need to add save_commit_buffer = 0 in any commands you are checking,
though, as the optimization can only kick in if parse_commit does not
want to save the buffer as a side effect.

I also looked into trying to just read the beginning part of a commit[1],
but it turned out not to be all that much of an improvement.

-Peff

[1] http://article.gmane.org/gmane.comp.version-control.git/212301


Re: Poor performance of git describe in big repos

2013-05-30 Thread Ramkumar Ramachandra
Alex Bennée wrote:
time /usr/bin/git --no-pager
 traversed 223 commits

 real    0m4.817s
 user    0m4.320s
 sys     0m0.464s

I'm quite clueless about why it is taking this long: I think it's IO
because there's nothing to compute?  I really can't trace anything
unless you can reproduce it on a public repository.  On linux.git with
my rotating hard disk:

$ time git describe --debug --long --tags HEAD~1
searching to describe HEAD~1
 annotated   5445 v2.6.33
 annotated   5660 v2.6.33-rc8
 annotated   5884 v2.6.33-rc7
 annotated   6140 v2.6.33-rc6
 annotated   6467 v2.6.33-rc5
 annotated   6999 v2.6.33-rc4
 annotated   7430 v2.6.33-rc3
 annotated   7746 v2.6.33-rc2
 annotated   8212 v2.6.33-rc1
 annotated  13854 v2.6.32
traversed 18895 commits
more than 10 tags found; listed 10 most recent
gave up search at 648f4e3e50c4793d9dbf9a09afa193631f76fa26
v2.6.33-5445-ge7c84ee

real    0m0.509s
user    0m0.470s
sys     0m0.037s

18k+ commits traversed in half a second here, so I really don't know
what is going on.


Re: Poor performance of git describe in big repos

2013-05-30 Thread John Keeping
On Thu, May 30, 2013 at 11:38:32AM +0100, Alex Bennée wrote:
 One factor might be the size of my repo (.git is around 2.4G). Could
 this just be due to computational cost of searching through large
 packs to walk the commit chain? Is there any way to make this easier
 for git to do?

What does git count-objects -v say for your repository?

You may find that performance improves if you repack with git gc
--aggressive.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
The repo is a fairly hairy one as it includes two historically
unrelated but content-related repos that I'm in the process of
cherry-picking stuff across.

11:58 ajb@sloy/x86_64 [work.git] git count-objects -v
count: 493
size: 4572
in-pack: 399307
packs: 1
size-pack: 1930755
prune-packable: 0
garbage: 0
size-garbage: 0

This was after a repack which did have a slight negative effect on
performance. The pack file is:

13:27 ajb@sloy/x86_64 [work.git] ls -lh ./.git/objects/pack/*
-r--r--r-- 1 ajb cvs  11M May 30 11:56
./.git/objects/pack/pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.idx
-r--r--r-- 1 ajb cvs 1.9G May 30 11:56
./.git/objects/pack/pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack

I ran perf on it and the top items in the report were:

 41.58%   git  libcrypto.so.1.0.0  [.] 0x6ae73
 33.96%   git  libz.so.1.2.3.4 [.] 0xe0ec
 10.39%   git  libz.so.1.2.3.4 [.] adler32
  2.03%   git  [kernel.kallsyms]   [k] clear_page_c

So I'm guessing it's spending a lot of non-cache efficient time
un-packing and processing the deltas?

--
Alex.

On 30 May 2013 12:48, John Keeping j...@keeping.me.uk wrote:
 On Thu, May 30, 2013 at 11:38:32AM +0100, Alex Bennée wrote:
 One factor might be the size of my repo (.git is around 2.4G). Could
 this just be due to computational cost of searching through large
 packs to walk the commit chain? Is there any way to make this easier
 for git to do?

 What does git count-objects -v say for your repository?

 You may find that performance improves if you repack with git gc
 --aggressive.



-- 
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
It looks like it's a file caching effect combined with my repo being
more pathological in size and contents. Note run 1 (cold) vs run 2 on
the linux file tree:

13:52 ajb@sloy/x86_64 [linux.git] time git describe --debug --long
--tags HEAD~1
searching to describe HEAD~1
 annotated     57 v2.6.34-rc2
 annotated   1688 v2.6.34-rc1
 annotated   7932 v2.6.33
 annotated   8157 v2.6.33-rc8
 annotated   8381 v2.6.33-rc7
 annotated   8637 v2.6.33-rc6
 annotated   8964 v2.6.33-rc5
 annotated   9493 v2.6.33-rc4
 annotated   9912 v2.6.33-rc3
 annotated  10202 v2.6.33-rc2
traversed 10547 commits
more than 10 tags found; listed 10 most recent
gave up search at 55639353a0035052d9ea6cfe4dde0ac7fcbb2c9f
v2.6.34-rc2-57-gef5da59

real    0m7.332s
user    0m0.308s
sys     0m0.244s
14:03 ajb@sloy/x86_64 [linux.git] time git describe --debug --long
--tags HEAD~1
searching to describe HEAD~1
 annotated     57 v2.6.34-rc2
 annotated   1688 v2.6.34-rc1
 annotated   7932 v2.6.33
 annotated   8157 v2.6.33-rc8
 annotated   8381 v2.6.33-rc7
 annotated   8637 v2.6.33-rc6
 annotated   8964 v2.6.33-rc5
 annotated   9493 v2.6.33-rc4
 annotated   9912 v2.6.33-rc3
 annotated  10202 v2.6.33-rc2
traversed 10547 commits
more than 10 tags found; listed 10 most recent
gave up search at 55639353a0035052d9ea6cfe4dde0ac7fcbb2c9f
v2.6.34-rc2-57-gef5da59

real    0m0.298s
user    0m0.244s
sys     0m0.036s

Although the perf profile looks subtly different.

First through the linux tree:

 22.35%   git  libz.so.1.2.3.4[.] inflate
 18.56%   git  libz.so.1.2.3.4[.] inflate_fast
 17.48%   git  libz.so.1.2.3.4[.] inflate_table
  7.84%   git  git[.] hashcmp
  3.93%   git  git[.] get_sha1_hex
  3.46%   git  libz.so.1.2.3.4[.] adler32

And through my special repo:

 41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
 33.62%   git  libz.so.1.2.3.4 [.] inflate_fast
 10.39%   git  libz.so.1.2.3.4 [.] adler32
  2.03%   git  [kernel.kallsyms]   [k] clear_page_c

 I'm not sure why libcrypto features so highly in the results


 --
 Alex.

On 30 May 2013 12:33, Ramkumar Ramachandra artag...@gmail.com wrote:
 Alex Bennée wrote:
time /usr/bin/git --no-pager
 traversed 223 commits

 real    0m4.817s
 user    0m4.320s
 sys     0m0.464s

 I'm quite clueless about why it is taking this long: I think it's IO
 because there's nothing to compute?  I really can't trace anything
 unless you can reproduce it on a public repository.  On linux.git with
 my rotating hard disk:

 $ time git describe --debug --long --tags HEAD~1
 searching to describe HEAD~1
  annotated   5445 v2.6.33
  annotated   5660 v2.6.33-rc8
  annotated   5884 v2.6.33-rc7
  annotated   6140 v2.6.33-rc6
  annotated   6467 v2.6.33-rc5
  annotated   6999 v2.6.33-rc4
  annotated   7430 v2.6.33-rc3
  annotated   7746 v2.6.33-rc2
  annotated   8212 v2.6.33-rc1
  annotated  13854 v2.6.32
 traversed 18895 commits
 more than 10 tags found; listed 10 most recent
 gave up search at 648f4e3e50c4793d9dbf9a09afa193631f76fa26
 v2.6.33-5445-ge7c84ee

 real    0m0.509s
 user    0m0.470s
 sys     0m0.037s

 18k+ commits traversed in half a second here, so I really don't know
 what is going on.



-- 
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
 You may find that performance improves if you repack with git gc
--aggressive.

It seems that increases the time it takes to get where it wants to go:

14:12 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager
describe --long --tags --debug
searching to describe HEAD
 lightweight   10 ajb-build-test-5224
 lightweight   41 ajb-build-test-5222
 annotated    146 vnms-2-1-36-32
 annotated    155 vnms-2-1-36-31
 annotated    174 vnms-2-1-36-30
 annotated    183 vnms-2-1-36-29
 lightweight  188 vnms-2-1-36-28
 annotated    193 vnms-2-1-36-27
 annotated    206 vnms-2-1-36-26
 annotated    215 vectastar-4-2-83-5
traversed 223 commits
more than 10 tags found; listed 10 most recent
gave up search at 2b69df72d47be8440e3ce4cee91b9b7ceaf8b77c
ajb-build-test-5224-10-gfa296e6

real    0m14.658s
user    0m12.845s
sys     0m1.776s

On 30 May 2013 12:48, John Keeping j...@keeping.me.uk wrote:
 On Thu, May 30, 2013 at 11:38:32AM +0100, Alex Bennée wrote:
 One factor might be the size of my repo (.git is around 2.4G). Could
 this just be due to computational cost of searching through large
 packs to walk the commit chain? Is there any way to make this easier
 for git to do?

 What does git count-objects -v say for your repository?

 You may find that performance improves if you repack with git gc
 --aggressive.



-- 
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-05-30 Thread Duy Nguyen
On Thu, May 30, 2013 at 7:29 PM, Alex Bennée kernel-hac...@bennee.com wrote:
 I ran perf on it and the top items in the report were:

  41.58%   git  libcrypto.so.1.0.0  [.] 0x6ae73
  33.96%   git  libz.so.1.2.3.4 [.] 0xe0ec
  10.39%   git  libz.so.1.2.3.4 [.] adler32
   2.03%   git  [kernel.kallsyms]   [k] clear_page_c

 So I'm guessing it's spending a lot of non-cache efficient time
 un-packing and processing the deltas?

If I'm not mistaken, commits are never deltified. They are usually
small and packed close together for better I/O patterns. If you really
just read hundreds of commits, it can't take that long. Maybe some
code paths accidentally open a tree, a blob or something..

Can you try setting core.logpackaccess to a path and rerunning
describe? Judging from the code (I never actually tried it), it'll
create a file at the given path with the accessed pack offsets. You
can check what offset corresponds to what object with verify-pack -v.
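
A rough sketch of that cross-referencing step, in Python. It assumes each
line of the access log ends with a pack offset (the exact log format is an
assumption here, untested as noted above), and it relies on the object lines
of verify-pack -v being "sha1 type size size-in-pack offset". The function
names and the sample data are made up for illustration.

```python
import re
from collections import Counter

# One object line of `git verify-pack -v`:
#   <sha1> <type> <size> <size-in-pack> <offset> [<depth> <base-sha1>]
OBJECT_LINE = re.compile(r"^([0-9a-f]{40}) +(\w+) +(\d+) +(\d+) +(\d+)")


def offset_table(verify_pack_lines):
    """Map pack offset -> (sha1, type, size); summary lines are skipped."""
    table = {}
    for line in verify_pack_lines:
        m = OBJECT_LINE.match(line)
        if m:
            sha, otype, size, _packed, offset = m.groups()
            table[int(offset)] = (sha, otype, int(size))
    return table


def accessed_types(log_lines, table):
    """Count accessed objects by type; the offset is assumed to be the last field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        entry = table.get(int(fields[-1]))
        counts[entry[1] if entry else "unknown"] += 1
    return counts


# Made-up sample data, only to show the shapes involved.
verify_out = [
    "a" * 40 + " commit 250 170 12",
    "b" * 40 + " blob   536870912 536900000 182",
    "non delta: 2 objects",
]
access_log = ["pack-demo.pack 12", "pack-demo.pack 182", "pack-demo.pack 182"]
result = accessed_types(access_log, offset_table(verify_out))
print(result)
```

If describe only read commits, the counts should be almost entirely
"commit"; any "blob" or "tree" entries would point at the unexpected loads.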
--
Duy


Re: Poor performance of git describe in big repos

2013-05-30 Thread Duy Nguyen
On Thu, May 30, 2013 at 8:34 PM, Alex Bennée kernel-hac...@bennee.com wrote:
 From the following run:


 14:31 ajb@sloy/x86_64 [work.git] time /usr/bin/git --no-pager
 describe --long --tags
 ajb-build-test-5224-11-g9660048

 real    0m14.720s
 user    0m12.985s
 sys     0m1.700s
 14:31 ajb@sloy/x86_64 [work.git] wc -l /tmp/log-pack.txt
 1610 /tmp/log-pack.txt

 The pack has been tuned with a gc --aggressive. Assuming the numbers
 are offsets into the pack it looks fairly random access until the last
 100 or so.

 [snipped]

Thanks. Can you share verify-pack -v output of
pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack? I think you need
to put it somewhere on Internet temporarily as it's likely to exceed
git@vger limits.
--
Duy


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
On 30 May 2013 14:45, Duy Nguyen pclo...@gmail.com wrote:
 On Thu, May 30, 2013 at 8:34 PM, Alex Bennée kernel-hac...@bennee.com wrote:
 snip
 Thanks. Can you share verify-pack -v output of
 pack-a9ba133a6f25ffa74c3c407e09ab030f8745b201.pack? I think you need
 to put it somewhere on Internet temporarily as it's likely to exceed
 git@vger limits.
 --
 Duy

http://www.bennee.com/~alex/stuff/git-pack-access.tar.bz2

--
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-05-30 Thread Ramkumar Ramachandra
Alex Bennée wrote:
 And through my special repo:

  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
  33.62%   git  libz.so.1.2.3.4 [.] inflate_fast
  10.39%   git  libz.so.1.2.3.4 [.] adler32
   2.03%   git  [kernel.kallsyms]   [k] clear_page_c

  I'm not sure why libcrypto features so highly in the results

While Duy churns on the delta chain, let me try to make a (rather
crude) observation:

What does it mean for libcrypto to be so high in your perf report?
sha1_block_data_order is ultimately called by object.c:parse_object.  While
it indicates that deltas are taking a long time to apply (or are
somehow not optimally organized for IO), I think it indicates either:

1. Your history is very deep and there are an unusually high number of
deltas for each blob.  What is the total number of commits?

2. You have huge (binary) files checked into your repository.  Do
you?  If so, why isn't the code in streaming.c kicking in?


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
On 30 May 2013 15:32, Ramkumar Ramachandra artag...@gmail.com wrote:
 Alex Bennée wrote:
 And through my special repo:

  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
  33.62%   git  libz.so.1.2.3.4 [.] inflate_fast
  10.39%   git  libz.so.1.2.3.4 [.] adler32
   2.03%   git  [kernel.kallsyms]   [k] clear_page_c

  I'm not sure why libcrypto features so highly in the results

 While Duy churns on the delta chain, let me try to make a (rather
 crude) observation:

 What does it mean for libcrypto to be so high in your perf report?
 sha1_block_data_order is ultimately called by object.c:parse_object.  While
 it indicates that deltas are taking a long time to apply (or are
 somehow not optimally organized for IO), I think it indicates either:

 1. Your history is very deep and there are an unusually high number of
 deltas for each blob.  What is the total number of commits?

Well, the history does encompass about 10 years of product development
and has a high number of files in the repo (including about 3 copies of
the kernel - sans upstream history).

15:50 ajb@sloy/x86_64 [work.git] time git log --pretty=oneline | wc -l
24648

real    0m0.434s
user    0m0.388s
sys     0m0.112s

Although it doesn't take too long to walk the whole mainline history
(obviously ignoring all the other branches).

15:52 ajb@sloy/x86_64 [work.git] git count-objects -v -H
count: 581
size: 5.09 MiB
in-pack: 399307
packs: 1
size-pack: 1.49 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

It is a big repo. The gc --aggressive nearly took out my machine, keeping
around 4 GB resident for most of the half hour and using nearly 8 GB of VM.

Of course most of the history is not needed for day to day stuff. Maybe
if I split the pack files up it wouldn't be quite such a strain to work
through them?

 2. You have huge (binary) files checked into your repository.  Do
 you?  If so, why isn't the code in streaming.c kicking in?

We do have some binary blobs in the repository (mainly DSP and FPGA images)
although not a huge number:

15:58 ajb@sloy/x86_64 [work.git] time git log --pretty=oneline -- xxx
xxx/xx/*.out ./xxx/xxx/*.out ./xxx/xxx/*.out | wc -l
234

real    0m0.590s
user    0m0.552s
sys     0m0.040s

How can I tell if streaming is kicking in or not?


-- 
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-05-30 Thread Ramkumar Ramachandra
Alex Bennée wrote:
 15:50 ajb@sloy/x86_64 [work.git] time git log --pretty=oneline | wc -l
 24648

 real    0m0.434s
 user    0m0.388s
 sys     0m0.112s

 Although it doesn't take too long to walk the whole mainline history
 (obviously ignoring all the other branches).

Damn, non-starter.  linux.git has 361k+ commits in mainline history.

Nit: use git rev-list --count HEAD next time.

 15:52 ajb@sloy/x86_64 [work.git] git count-objects -v -H
 count: 581
 size: 5.09 MiB
 in-pack: 399307
 packs: 1
 size-pack: 1.49 GiB
 prune-packable: 0
 garbage: 0
 size-garbage: 0 bytes

linux.git has 2.9m+ in-pack.  The pack-size is much lower at about
800+ MiB, but I don't think 1.49 GiB is a problem in itself.  Looking
forward to your big-files report to see why it's so big.
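
To put the two sets of figures side by side, here is a back-of-the-envelope
calculation in Python of the average packed size per object. It uses only the
numbers quoted in this message; the linux.git values are the rounded "2.9m+"
and "800+ MiB" above, so the result is approximate. The helper name is made
up for illustration.

```python
def avg_kib_per_object(size_pack_kib: float, in_pack: int) -> float:
    """Average packed size per object, in KiB."""
    return size_pack_kib / in_pack


# work.git: size-pack 1.49 GiB, in-pack 399307 (from count-objects -v -H).
work = avg_kib_per_object(1.49 * 1024 * 1024, 399307)
# linux.git: roughly 800 MiB and 2.9 million objects (rounded figures).
linux = avg_kib_per_object(800 * 1024, 2_900_000)

print(f"work.git:  {work:.2f} KiB/object")
print(f"linux.git: {linux:.2f} KiB/object")
print(f"ratio:     {work / linux:.1f}x")
```

On these rough numbers the average object in work.git is more than ten times
larger than in linux.git, which is consistent with a few very large objects
dominating the pack.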

 It is a big repo. The gc --aggressive nearly took out my machine, keeping
 around 4 GB resident for most of the half hour and using nearly 8 GB of VM.

 Of course most of the history is not needed for day to day stuff. Maybe
 if I split the pack files up it wouldn't be quite such a strain to work
 through them?

Really out of my depth here, sorry.  Let's see what Duy (or the
others) have to say.

 2. You have huge (binary) files checked into your repository.  Do
 you?  If so, why isn't the code in streaming.c kicking in?

 We do have some binary blobs in the repository (mainly DSP and FPGA images)
 although not a huge number:

 15:58 ajb@sloy/x86_64 [work.git] time git log --pretty=oneline -- xxx
 xxx/xx/*.out ./xxx/xxx/*.out ./xxx/xxx/*.out | wc -l
 234

 real    0m0.590s
 user    0m0.552s
 sys     0m0.040s

log is streaming, and is not a good measure: it doesn't even walk the
entire commit graph.  How big are these files?

 How can I tell if streaming is kicking in or not?

I use callgrind (and kcachegrind to visualize).  Can you post
callgrind output?  It will be helpful in figuring out where exactly
git is spending time.


Re: Poor performance of git describe in big repos

2013-05-30 Thread Thomas Rast
Alex Bennée kernel-hac...@bennee.com writes:

  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
  33.62%   git  libz.so.1.2.3.4 [.] inflate_fast
  10.39%   git  libz.so.1.2.3.4 [.] adler32
   2.03%   git  [kernel.kallsyms]   [k] clear_page_c

Do you have any large blobs in the repo that are referenced directly by
a tag?

Because this just so happens to exactly reproduce your symptoms:

  # in a random git.git
  $ time git describe --debug
  [...]
  real    0m0.390s
  user    0m0.037s
  sys     0m0.011s
  $ git tag big1 $(dd if=/dev/urandom bs=1M count=512 | git hash-object -w --stdin)
  512+0 records in
  512+0 records out
  536870912 bytes (537 MB) copied, 45.5088 s, 11.8 MB/s
  $ time git describe --debug
  [...]
  real    0m1.875s
  user    0m1.738s
  sys     0m0.129s
  $ git tag big2 $(dd if=/dev/urandom bs=1M count=512 | git hash-object -w --stdin)
  512+0 records in
  512+0 records out
  536870912 bytes (537 MB) copied, 44.972 s, 11.9 MB/s
  $ time git describe --debug
  searching to describe HEAD
  [...]
  real    0m3.620s
  user    0m3.357s
  sys     0m0.248s

(I actually ran the git-describe invocations more than once to ensure
that they are again cache-hot.)

git-describe should probably be fixed to avoid loading blobs, though I'm
not sure off hand if we have any infrastructure to infer the type of a
loose object without inflating it.  (This could probably be added by
inflating only the first block.)  We do have this for packed objects, so
at least for packed repos there's a speedup to be had.
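
To illustrate the "inflate only the first block" idea, here is a minimal
sketch in Python. It is not git's code: it only assumes the loose-object
layout of a zlib stream whose inflated content starts with
"<type> <size>" followed by a NUL byte, and the function name and the
64-byte limit are made up for illustration.

```python
import zlib


def loose_object_header(compressed: bytes, peek: int = 64):
    """Return (type, size) by inflating at most `peek` bytes of a loose object."""
    inflater = zlib.decompressobj()
    # The second argument caps how many inflated bytes are produced, so a
    # huge blob is never fully decompressed just to learn its type.
    head = inflater.decompress(compressed, peek)
    header, sep, _rest = head.partition(b"\0")
    if not sep:
        raise ValueError("no object header in the first %d bytes" % peek)
    otype, size = header.split(b" ")
    return otype.decode(), int(size)


# Build a fake 1 MiB loose blob in memory: "<type> <size>\0<content>", deflated.
content = b"x" * (1 << 20)
loose = zlib.compress(b"blob %d\0" % len(content) + content)
print(loose_object_header(loose))  # ('blob', 1048576)
```

The cost of the header read is bounded by `peek`, independent of how large
the object is.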

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


Re: Poor performance of git describe in big repos

2013-05-30 Thread Alex Bennée
On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
 Alex Bennée kernel-hac...@bennee.com writes:

  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
  33.62%   git  libz.so.1.2.3.4 [.] inflate_fast
  10.39%   git  libz.so.1.2.3.4 [.] adler32
   2.03%   git  [kernel.kallsyms]   [k] clear_page_c

 Do you have any large blobs in the repo that are referenced directly by
 a tag?

Most probably. I've certainly done a bunch of releases (which are tagged) where
the last thing that was updated was an FPGA image.

 Because this just so happens to exactly reproduce your symptoms:

   # in a random git.git
   $ time git describe --debug
   [...]
   real    0m0.390s
   user    0m0.037s
   sys     0m0.011s
   $ git tag big1 $(dd if=/dev/urandom bs=1M count=512 | git hash-object -w --stdin)
   512+0 records in
   512+0 records out
   536870912 bytes (537 MB) copied, 45.5088 s, 11.8 MB/s
   $ time git describe --debug
   [...]
   real    0m1.875s
   user    0m1.738s
   sys     0m0.129s
   $ git tag big2 $(dd if=/dev/urandom bs=1M count=512 | git hash-object -w --stdin)
   512+0 records in
   512+0 records out
   536870912 bytes (537 MB) copied, 44.972 s, 11.9 MB/s
   $ time git describe --debug
   searching to describe HEAD
   [...]
   real    0m3.620s
   user    0m3.357s
   sys     0m0.248s

 (I actually ran the git-describe invocations more than once to ensure
 that they are again cache-hot.)

That looks pretty promising as a replication.

 git-describe should probably be fixed to avoid loading blobs, though I'm
 not sure off hand if we have any infrastructure to infer the type of a
 loose object without inflating it.  (This could probably be added by
 inflating only the first block.)  We do have this for packed objects, so
 at least for packed repos there's a speedup to be had.

Will it be loading the blob for every commit it traverses or just ones that hit
a tag? Why does it need to load the blob at all? Surely the commit tree state
doesn't need to be walked down?


 --
 Thomas Rast
 trast@{inf,student}.ethz.ch



-- 
Alex, homepage: http://www.bennee.com/~alex/


Re: Poor performance of git describe in big repos

2013-05-30 Thread Thomas Rast
Alex Bennée kernel-hac...@bennee.com writes:

 On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
 Alex Bennée kernel-hac...@bennee.com writes:

  41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
  33.62%   git  libz.so.1.2.3.4 [.] inflate_fast
  10.39%   git  libz.so.1.2.3.4 [.] adler32
   2.03%   git  [kernel.kallsyms]   [k] clear_page_c

 Do you have any large blobs in the repo that are referenced directly by
 a tag?

 Most probably. I've certainly done a bunch of releases (which are tagged) where
 the last thing that was updated was an FPGA image.
[...]
 git-describe should probably be fixed to avoid loading blobs, though I'm
 not sure off hand if we have any infrastructure to infer the type of a
 loose object without inflating it.  (This could probably be added by
 inflating only the first block.)  We do have this for packed objects, so
 at least for packed repos there's a speedup to be had.

 Will it be loading the blob for every commit it traverses or just ones that
 hit a tag? Why does it need to load the blob at all? Surely the commit tree
 state doesn't need to be walked down?

No, my theory is that you tagged *the blobs*.  Git supports this.

git-describe needs to look at the commit (if any) obtained by peeling
each tag (i.e. dereferencing tags until it reaches a non-tag).  So to do
that, it resolves the tag's referent and loads it.  Usually this will be
a commit, in which case it is marked as reached by the tag.

As my example shows, it also resolves tags' referents if they refer to
non-commits, in particular, it will decompress large blobs that are
(directly) referenced by a tag.

Note that while annotated tags provide the type information themselves,
e.g.

  $ git cat-file tag junio-gpg-pub
  object fe113d3f96636710600c6b02d5fd421fa7e87dd6
  type blob
  tag junio-gpg-pub
  [...]

unannotated tags are simply refs, so it is not enough to just look at
the tag objects' referent type.
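
As a small illustration of the annotated case, this Python sketch reads the
referent and its type straight from the tag object's header, without loading
the referent itself. The function name is made up; the sample text is the
cat-file output shown above with the elided part left out.

```python
def tag_target(tag_body: str):
    """Return (object_id, type) from the header of an annotated tag object."""
    fields = {}
    for line in tag_body.splitlines():
        if not line:  # a blank line ends the header; the message follows
            break
        key, _, value = line.partition(" ")
        fields.setdefault(key, value)
    return fields["object"], fields["type"]


sample = (
    "object fe113d3f96636710600c6b02d5fd421fa7e87dd6\n"
    "type blob\n"
    "tag junio-gpg-pub\n"
)
print(tag_target(sample))
```

For an unannotated tag there is no such header to read, which is why the
type of the referent has to be looked up some other way.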

I had a brief look around sha1_file.c, in particular sha1_object_info,
and it turns out we lack the inflate-only-the-early-part logic as I
suspected.  So that'll have to be fixed first.  After that I *think* it
should automatically carry over into the tag readers.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


Re: Poor performance of git describe in big repos

2013-05-30 Thread Thomas Rast
Thomas Rast tr...@inf.ethz.ch writes:

 I had a brief look around sha1_file.c, in particular sha1_object_info,
 and it turns out we lack the inflate-only-the-early-part logic as I
 suspected.  So that'll have to be fixed first.  After that I *think* it
 should automatically carry over into the tag readers.

Strike that, I'm wrong.  sha1_object_info is fast even for these big
loose objects.

The culprit, according to some callgrind investigation, is
lookup_commit_reference_gently() [for the unannotated case] or
deref_tag() [annotated case] calling parse_object().

-- 
Thomas Rast
trast@{inf,student}.ethz.ch


Re: Poor performance of git describe in big repos

2013-05-30 Thread Antoine Pelisse
 The culprit, according to some callgrind investigation, is
 lookup_commit_reference_gently() [for the unannotated case] or
 deref_tag() [annotated case] calling parse_object().

Using the scenario you described earlier, I think it ends up spending
most of its time in check_sha1_signature() (both deref_tag() and
lookup_commit_reference_gently() go there), with 20% inflating and 80% in
SHA1_Update(). Not much we can do about that, can we?
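
For context on why that check scales with object size: an object's name is
the SHA-1 of a short header plus the full content, so verifying it means
hashing every byte. A minimal Python sketch of that naming scheme follows; it
illustrates the cost and is not git's implementation, and the function name
is made up.

```python
import hashlib


def git_object_id(otype: str, content: bytes) -> str:
    """SHA-1 over "<type> <size>", a NUL byte, then the content."""
    h = hashlib.sha1()
    h.update(f"{otype} {len(content)}".encode() + b"\0")
    h.update(content)  # the whole content is hashed: cost is linear in size
    return h.hexdigest()


print(git_object_id("blob", b""))         # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_object_id("blob", b"hello\n"))  # ce013625030ba8dba906f756967f9e9ca394464a
```

For a 512 MB blob, that is 512 MB inflated and then fed through SHA-1 on
every describe run that touches the tag.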


Re: Poor performance of git describe in big repos

2013-05-30 Thread John Keeping
On Thu, May 30, 2013 at 06:21:55PM +0200, Thomas Rast wrote:
 Alex Bennée kernel-hac...@bennee.com writes:
 
  On 30 May 2013 16:33, Thomas Rast tr...@inf.ethz.ch wrote:
  Alex Bennée kernel-hac...@bennee.com writes:
 
   41.58%   git  libcrypto.so.1.0.0  [.] sha1_block_data_order_ssse3
   33.62%   git  libz.so.1.2.3.4 [.] inflate_fast
   10.39%   git  libz.so.1.2.3.4 [.] adler32
    2.03%   git  [kernel.kallsyms]   [k] clear_page_c
 
  Do you have any large blobs in the repo that are referenced directly by
  a tag?
 
  Most probably. I've certainly done a bunch of releases (which are tagged)
  where the last thing that was updated was an FPGA image.
 [...]
  git-describe should probably be fixed to avoid loading blobs, though I'm
  not sure off hand if we have any infrastructure to infer the type of a
  loose object without inflating it.  (This could probably be added by
  inflating only the first block.)  We do have this for packed objects, so
  at least for packed repos there's a speedup to be had.
 
  Will it be loading the blob for every commit it traverses or just ones that
  hit a tag? Why does it need to load the blob at all? Surely the commit tree
  state doesn't need to be walked down?
 
 No, my theory is that you tagged *the blobs*.  Git supports this.

You can see if that is the case by doing something like this:

eval $(git for-each-ref --shell --format '
test $(git cat-file -t %(objectname)^{}) = commit ||
echo %(refname);')

That will print out the name of any ref that doesn't point at a commit.