Thomas Koch wrote:
first I just wanted to report that I have a git-annex repo that is really big
and slow and that this makes me kind of unhappy. Then I realized, that it may
be a good idea to add a diagnostics command to git-annex that will gather
all informations useful for you to improve git-annex, e.g. for my repo:
`git annex status` is essentially that, combined with the --debug flag
when there's a specific problem. There is also the ability to build
with `make PROFILE=1`, at which point the techniques described here can
be used to profile for time or space:
find . -type l -a \( -path .git -prune -o -print \) | wc -l
This is the most relevant number, probably.
find .git/objects -type f | wc -l
This is surprisingly many. git's automatic gc typically keeps the number
of loose objects lower, packing them once there are more than 6700.
(I have 194.) Packing does tend to improve git repository performance,
since the kernel can buffer pack files better, rather than seeking like
mad among many loose objects. I'd be curious how your fsck performs
after packing.
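
To make the packing suggestion concrete, here is a minimal sketch that
counts loose objects before and after `git gc`. It runs in a throwaway
repository; the file names and the demo identity are made up for
illustration only.

```shell
#!/bin/sh
# Sketch: compare the loose object count before and after packing.
# Everything happens in a scratch repository created just for this demo.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.invalid"   # hypothetical identity
git config user.name "Demo"

# Each commit leaves a few loose objects behind (blob, tree, commit).
for i in 1 2 3 4 5; do
    echo "content $i" > "file$i"
    git add "file$i"
    git commit -q -m "add file$i"
done

# `git count-objects` prints "<N> objects, <M> kilobytes" for loose objects.
loose_before=$(git count-objects | cut -d' ' -f1)

# Pack the loose objects and prune the now-redundant loose copies.
git gc -q

loose_after=$(git count-objects | cut -d' ' -f1)
echo "loose objects before: $loose_before, after: $loose_after"
```

In a real repository, only the `git count-objects` and `git gc` lines are
needed; `git count-objects -v` also reports how many objects already sit
in packs.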
time git annex fsck --fast | grep -A 10 -v ok$
1200.66s real 45.35s user 5.86s system 156 maxmem/kb 301856 nrInOps 4%
By comparison, I have a repo with 40 thousand files, and running fsck on
an SSD (on an otherwise 3 years out of date netbook) takes 10 minutes:
225.33user 59.37system 9:58.26elapsed 47%CPU (0avgtext+0avgdata
Note the 47% CPU usage. The other half of the CPU was used by git cat-file,
which looks up the location log for each file being fscked.
Indeed, as the number of files, rather than the size of files, increases,
the largest source of scalability problems is git itself. Some helpful
tips:
* Use `git status .` to check the status of only the current subdirectory,
rather than scanning the entire repository, and use `git commit .` or
staged commits, rather than `git commit -a`.
* Run git annex fsck in or on active directories; put inactive files in a
different directory of the same repository. Even in a large repository
git-annex will be fast if run in a relatively small (tens of thousands
of files) subdirectory.
* If doing git annex add (or move, or drop) on a large number of files,
consider setting `git config annex.alwayscommit false` with the newest
version, to avoid running the slow git commit as much as possible.
* Use branches. Now that git-annex fully supports them, if there's a
sensible branch strategy for your repository that can segment the
files in a useful way, you can avoid performance issues, since
the files in the non-checked-out branch are essentially free.
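
The first and third tips above can be tried out with something like the
following sketch. It uses a scratch repository with hypothetical `active`
and `inactive` directories, and it only sets the `annex.alwayscommit`
config key; it does not run git-annex itself.

```shell
#!/bin/sh
# Sketch: scope status and commit to one subdirectory of a scratch repo.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.invalid"   # hypothetical identity
git config user.name "Demo"

# Hypothetical layout: one busy directory, one rarely touched.
mkdir active inactive
echo a > active/a.txt
echo b > inactive/b.txt
git add active inactive
git commit -q -m "initial"

# Modify a file in each directory.
echo a2 >> active/a.txt
echo b2 >> inactive/b.txt

cd active
# Report only on the current subdirectory.
scoped_status=$(git status --short .)
# Commit only the changes under the current subdirectory.
git commit -q -m "update active" .
cd ..

# The change under inactive/ is still uncommitted.
remaining=$(git status --short)

# Setting from the third tip; it only takes effect when git-annex runs.
git config annex.alwayscommit false
always_commit=$(git config annex.alwayscommit)

echo "scoped status: $scoped_status"
echo "still modified: $remaining"
echo "annex.alwayscommit: $always_commit"
```

In the same spirit, running `git annex fsck --fast` from inside the
active directory limits the check to the files below it.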
see shy jo
vcs-home mailing list