Re: git-annex diagnostics

2012-03-27 Thread Thomas Koch
Certainly not perfect but good enough:

#!/bin/sh
# Copy files from a CD into the current directory, skipping any file
# whose name already exists somewhere in this repository.
CDDIR=$1

find "$CDDIR" -type f -print | while IFS= read -r F
do
  # echo "searching $F"
  FILENAME=$(basename "$F")
  # look for an existing file with the same name, ignoring .git
  FOUND=$(find . -path ./.git -prune -o -name "$FILENAME" -print | head -n 1)
  if [ -n "$FOUND" ] && [ -r "$FOUND" ]
  then
    echo "found $FOUND"
  else
    echo "not found: $F"
    DIRNAME=$(dirname "$F")
    mkdir -p "./$DIRNAME"
    cp -v "$F" "./$DIRNAME"
  fi
done
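
To try it, run it from the top of the repository; the script name and the
mount point here are just placeholders:

cd ~/annex
sh import-cd.sh /media/cdrom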

Still, a solution baked into git-annex would be wonderful!

Thomas Koch, http://www.koch.ro


git-annex diagnostics

2012-03-03 Thread Thomas Koch
Hi,

First I just wanted to report that I have a git-annex repo that is really big
and slow, and that this makes me kind of unhappy. Then I realized that it may
be a good idea to add a diagnostics command to git-annex that gathers
all the information useful for you to improve git-annex, e.g. for my repo:

du -hs
11G

time git status
4.34s real  0.07s user  0.15s system  13 maxmem/kb  36384 nrInOps  5% CPU

find .git/annex/objects -type f | wc -l
32598

find . -type l -a \( -path .git -prune -o -print \) | wc -l 
37738

find .git/objects -type f | wc -l
207864

time git annex fsck --fast | grep -A 10 -v ok$
1200.66s real  45.35s user  5.86s system  156 maxmem/kb  301856 nrInOps  4%

The last one is the annoying one. It takes 1200 sec = 20 min to run git annex
fsck --fast over the repo.

git gc --aggressive
Counting objects: 1067858, done.
Delta compression using up to 2 threads.
Compressing objects: 100% (1063155/1063155), done.
Writing objects: 100% (1067858/1067858), done.
Total 1067858 (delta 856150), reused 165564 (delta 0)
Removing duplicate objects: 100% (256/256), done.
Checking connectivity: 1067858, done.

I didn't time the last call, but it took well over an hour.
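
To make the idea concrete, here is a rough sketch of what such a diagnostics
command could collect, as a plain shell script. The numbers simply mirror what
I gathered above; none of this is an existing git-annex interface:

#!/bin/sh
# Collect size and object-count statistics for the current git-annex
# repository, suitable for attaching to performance reports.
echo "repository size:";      du -hs .
echo "annexed object files:"; find .git/annex/objects -type f | wc -l
echo "work tree symlinks:";   find . -path ./.git -prune -o -type l -print | wc -l
echo "loose git objects:";    find .git/objects -type f | wc -l
echo "time for git status:";  time git status >/dev/null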

Best regards,

Thomas Koch, http://www.koch.ro


Re: git-annex diagnostics

2012-03-03 Thread Joey Hess
Thomas Koch wrote:
> First I just wanted to report that I have a git-annex repo that is really big
> and slow, and that this makes me kind of unhappy. Then I realized that it may
> be a good idea to add a diagnostics command to git-annex that gathers
> all the information useful for you to improve git-annex, e.g. for my repo:

`git annex status` is essentially that, combined with the --debug flag
when there's a specific problem. There is also the ability to build
with `make PROFILE=1`, at which point the techniques described here can
be used to profile for time or space:
http://book.realworldhaskell.org/read/profiling-and-optimization.html
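
For example, a time profile of a slow command can be gathered like this (a
sketch: it assumes the PROFILE=1 build turns on GHC's -prof and -rtsopts,
and the +RTS flags are standard GHC profiling options, nothing git-annex
specific):

make PROFILE=1
# run the slow operation with time profiling enabled; the report is
# written to git-annex.prof in the current directory
./git-annex fsck --fast +RTS -p -RTS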

> find . -type l -a \( -path .git -prune -o -print \) | wc -l
> 37738

This is probably the most relevant number.

> find .git/objects -type f | wc -l
> 207864

This is surprisingly many. git's auto gc typically keeps the number of loose
objects lower, packing when there are more than 6700. (I have 194.)
Packing does tend to improve git repository performance, since the
kernel can buffer pack files better, rather than seeking like mad among
many loose objects. I'd be curious how your fsck performs after packing.
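
For reference, checking and packing by hand uses only standard git commands
(nothing git-annex specific):

git count-objects -v   # reports the loose object count and pack statistics
git gc                 # packs loose objects; --aggressive repacks harder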

> time git annex fsck --fast | grep -A 10 -v ok$
> 1200.66s real  45.35s user  5.86s system  156 maxmem/kb  301856 nrInOps  4%

By comparison, I have a repo with 40 thousand files, and running fsck on
an SSD (on an otherwise 3-years-out-of-date netbook) takes 10 minutes:

225.33user 59.37system 9:58.26elapsed 47%CPU (0avgtext+0avgdata 54448maxresident)k

Note the 47% CPU usage. The other half of the CPU was used by git cat-file,
which is looking up the location log for each file being fscked.

Indeed, as the number of files (rather than the size of files) increases,
the largest source of scalability problems is git itself. Some helpful
tips include:
* Use `git status .` to check the status of only the current subdirectory,
  rather than scanning the entire repository, and `git commit .` or
  staged commits, rather than commit -a.
* Run git annex fsck only on active directories; put inactive files in a
  different directory of the same repository. Even in a large repository,
  git-annex will be fast if run in a relatively small (tens of thousands
  of files) subdirectory.
* If doing git annex add (or move, or drop) on a large number of files,
  consider setting `git config annex.alwayscommit false` with the newest
  version, to avoid running the slow git commit more often than necessary;
  see the sketch after this list.
* Use branches. Now that git-annex fully supports them, if there's a
  sensible branch strategy for your repository that can segment the
  files in a useful way, you can avoid performance issues, since
  the files in the non-checked-out branch are essentially free.
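
A sketch of the batch-add workflow from the third tip (the directory name is
a placeholder, and I'm assuming `git annex merge` as the way to commit the
accumulated changes afterwards):

git config annex.alwayscommit false
git annex add photos/   # journal entries accumulate instead of being
                        # committed to the git-annex branch on each run
git annex merge         # commit the accumulated changes in one go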

-- 
see shy jo

