Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-10 Thread Junio C Hamano
Jeff King <p...@peff.net> writes:

>> ... I'd be happy to
>> contribute a patch that gives 'gc' a flag to do the equivalent of:
>>
>> git reflog expire --expire=now --all && git gc --prune=now --aggressive
>>
>> Maybe:
>>
>> git gc --purge
>
> Yeah, that is common enough that it might be worthwhile (you probably
> want --expire-unreachable in the reflog invocation, though).

Also you would not want an unconditional --aggressive.
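
For illustration, the combination discussed above can be wrapped in a small shell function today. This is only a sketch: the name git_purge is made up for this example and is not a git command, and `git gc --purge` remains just a suggestion in this thread. It uses --expire-unreachable in the reflog step and leaves --aggressive to the caller instead of applying it unconditionally.

```shell
# Hypothetical helper; not part of git.
git_purge() {
    # Drop reflog entries that are not reachable from the current tip of
    # their ref, regardless of age.
    git reflog expire --expire-unreachable=now --all &&
    # Repack and delete unreachable objects immediately. Extra arguments
    # (for example --aggressive or -q) are passed through to 'git gc'.
    git gc --prune=now "$@"
}
```

Calling `git_purge` on its own does the quicker clean-up; `git_purge --aggressive` opts in to the slower full re-optimisation.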
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-10 Thread Roberto Tyley
On 10 December 2014 at 16:07, Junio C Hamano <gits...@pobox.com> wrote:
> Jeff King <p...@peff.net> writes:
>>> git reflog expire --expire=now --all && git gc --prune=now --aggressive
>>>
>>> Maybe:
>>>
>>> git gc --purge
>>
>> Yeah, that is common enough that it might be worthwhile (you probably
>> want --expire-unreachable in the reflog invocation, though).
>
> Also you would not want an unconditional --aggressive.

After a big rewrite that deletes files, the re-optimisation done by
--aggressive can make a big difference to pack size (for instance, 1.2GB
down to 768MB in a test I just ran), but of course it is *much* slower,
so I suspect you're right about not including it.

I wasn't aware of the '--expire-unreachable=all' switch, though it
seems like a 'milder' version of the '--expire=now' switch? - in that
it would keep reflog entries if they haven't been changed, which is
fair enough and compatible with the 'purge' goal.


Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-09 Thread Jeff King
On Mon, Dec 08, 2014 at 05:22:23PM +0100, Martin Scherer wrote:

> # invoke bfg --delete-folders something multiple times with different
> pattern.
>
> # try to cleanup
>
> git gc --aggressive --prune=now # big blobs still in history
> git fsck # no results
> git fsck --full  --unreachable --dangling # no results

Might you still have reflogs pointing to the objects? Try:

  git reflog expire --expire-unreachable=now --all

I also don't know if BFG keeps backup refs around (filter-branch, for
example, writes a copy of the original refs into refs/original; you
would want to delete that if you're trying to slim down the repo).
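
If a filter-branch run did leave backup refs behind, they can be listed and removed along these lines. This is a sketch only: the function name drop_backup_refs is hypothetical, and it assumes a git recent enough to support `git update-ref --stdin`.

```shell
# Hypothetical helper; not part of git or filter-branch.
drop_backup_refs() {
    # Print the backup refs that exist, so the caller can see what goes.
    git for-each-ref --format='%(refname)' refs/original/
    # Turn each ref into a "delete <ref>" instruction and apply them all
    # through update-ref; empty input is a no-op.
    git for-each-ref --format='delete %(refname)' refs/original/ |
    git update-ref --stdin
}
```

The objects those refs pointed at only become prunable once the reflogs have been expired as well.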

In general, you can see the on-disk size of the objects required for a
particular ref with something like:

  size() {
    # list every object reachable from the given revisions, keep only
    # the object ids, and sum their on-disk sizes
    git rev-list --objects "$@" |
    cut -d' ' -f1 |
    git cat-file --batch-check='%(objectsize:disk)' |
    perl -lne '$t += $_; END { print $t }'
  }

  # size of master branch
  size master

  # size of each ref on top of what is in the master branch
  git for-each-ref --format='%(refname)' |
  while read -r ref; do
    echo "$(size "master..$ref") $ref"
  done | sort -rn


Note that these sizes are somewhat approximate. We may store object X
needed by one ref as a delta against Y used by another ref. The
accounting shows X as tiny compared to Y. And then a repack may find the
delta in the opposite direction. But if you're talking about rewriting
history to drop a bunch of gigantic objects, the output of the final
loop is a good way to see which refs are still referring to the old
history.
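
To check whether one specific object (for example a large blob id reported by verify-pack) is still reachable from any ref, a loop in the same spirit can be used. This is a rough sketch: refs_reaching is a made-up name, it expects a full object id, and it only looks at refs, not at reflogs.

```shell
# Hypothetical helper: print every ref from which the given object id
# is still reachable.
refs_reaching() {
    obj=$1
    git for-each-ref --format='%(refname)' |
    while read -r ref; do
        # rev-list --objects prints "<id> <path>" for each object, so
        # match the id at the start of a line.
        if git rev-list --objects "$ref" | grep -q "^$obj"; then
            echo "$ref"
        fi
    done
}
```

An empty result means no ref reaches the object any more, which points at reflogs or other non-ref references as the thing keeping it alive.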

-Peff


Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-09 Thread Roberto Tyley
On 9 December 2014 at 14:14, Jeff King <p...@peff.net> wrote:
> On Mon, Dec 08, 2014 at 05:22:23PM +0100, Martin Scherer wrote:
>
>> # invoke bfg --delete-folders something multiple times with different
>> pattern.
>>
>> # try to cleanup
>>
>> git gc --aggressive --prune=now # big blobs still in history
>> git fsck # no results
>> git fsck --full  --unreachable --dangling # no results
>
> Might you still have reflogs pointing to the objects? Try:
>
>   git reflog expire --expire-unreachable=now --all

Yeah, we figured that's what it was!

https://github.com/rtyley/bfg-repo-cleaner/issues/62#issuecomment-66152559

> I also don't know if BFG keeps backup refs around (filter-branch, for
> example, writes a copy of the original refs into refs/original; you
> would want to delete that if you're trying to slim down the repo).

The BFG reports the ref changes to the command line (and outputs a
full list of changed object-ids in
repo-name.git.bfg-report/[datetime]/object-id-map.old-new.txt) but
doesn't keep refs (like refs/original) around because that would get
in the way of the BFG's explicit intended use-case of removing
unwanted data.

Thanks for the object-size checking scripts, very useful.

Roberto


Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-09 Thread Roberto Tyley
On Tuesday, 9 December 2014, Jeff King <p...@peff.net> wrote:
> I actually think filter-branch's refs/original is a bit outdated at
> this point. The information is there in the reflogs already, and
> dealing with refs/original often causes confusion in my experience. It
> could probably use a git filter-branch --restore or something to
> switch each $ref to $ref@{1} (after making sure that the reflog entry
> was from filter-branch, of course).

Yeah, I'd agree that refs/original can cause confusion.


> Not that I expect you to want to work on filter-branch. :) But maybe
> food for thought for a BFG feature.

I haven't heard much demand for a recover/restore feature on the BFG
(I think by the time people get to the BFG, they're pretty sure they
want to go ahead with the procedure!) but I'll bear it in mind. Mind
you, to make the post-rewrite clean-up easier, I'd be happy to
contribute a patch that gives 'gc' a flag to do the equivalent of:

git reflog expire --expire=now --all && git gc --prune=now --aggressive

Maybe:

git gc --purge

??


Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-09 Thread Jeff King
On Tue, Dec 09, 2014 at 10:15:31PM +, Roberto Tyley wrote:

>> Not that I expect you to want to work on filter-branch. :) But maybe
>> food for thought for a BFG feature.
>
> I haven't heard much demand for a recover/restore feature on the BFG
> (I think by the time people get to the BFG, they're pretty sure they
> want to go ahead with the procedure!) but I'll bear it in mind. Mind
> you, to make the post-rewrite clean-up easier, I'd be happy to
> contribute a patch that gives 'gc' a flag to do the equivalent of:
>
> git reflog expire --expire=now --all && git gc --prune=now --aggressive
>
> Maybe:
>
> git gc --purge

Yeah, that is common enough that it might be worthwhile (you probably
want --expire-unreachable in the reflog invocation, though).

-Peff


Blobs not referenced by file (anymore) are not removed by GC

2014-12-08 Thread Martin Scherer
Hi,

after using BFG on a repo with certain directory globs, all of those
file names are gone from history, but the blobs cannot be collected by
garbage collection. So the blobs of the underlying files are not deleted;
only the file names are no longer associated with the blobs. I
wonder if I have discovered a bug (at least in BFG). I would expect git
to discover that these blobs are not used in any way (so they must still
be associated with something, right?)

# invoke bfg --delete-folders something multiple times with different
pattern.

# try to cleanup

git gc --aggressive --prune=now # big blobs still in history
git fsck # no results
git fsck --full  --unreachable --dangling # no results

To verify that the blobs are still there, see the output of:

git gc && git verify-pack -v .git/objects/pack/pack-*.idx |
  egrep '^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$' |
  sort -k 3 -n -r > bigobjects.txt

head bigobjects.txt # outputs:
# 9451427d7335395779b91864418630d2f0af780a blob   7895212 1869047 7657491


Also, if BFG is told to remove the biggest blob (bfg -B 1) with
--no-blob-protection, it does not succeed in removing it.

--- output of bfg -B 1

Found 1 blob ids for large blobs - biggest=7895212 smallest=7895212


BFG aborting: No refs to update - no dirty commits found??
---

The repo can be found here.

https://github.com/marscher/stallone_stale_objects

I will restart all over to cleanup the history, but I guess this might
be interesting for git developers.


Best,
Martin


Re: Blobs not referenced by file (anymore) are not removed by GC

2014-12-08 Thread Roberto Tyley
Hi Martin, I'm the developer of the BFG - I'd guess that there
probably isn't a bug for Git developers here, so you might want to
open one or more issues at
https://github.com/rtyley/bfg-repo-cleaner/issues, where I'd be happy
to take a look.

best regards,
Roberto

On 8 Dec 2014 16:35, Martin Scherer <m.sche...@fu-berlin.de> wrote:
>
> Hi,
>
> after using BFG on a repo with certain directory globs, all of those
> file names are gone from history, but the blobs cannot be collected by
> garbage collection. So the blobs of the underlying files are not deleted;
> only the file names are no longer associated with the blobs. I
> wonder if I have discovered a bug (at least in BFG). I would expect git
> to discover that these blobs are not used in any way (so they must still
> be associated with something, right?)
>
> # invoke bfg --delete-folders something multiple times with different
> pattern.
>
> # try to cleanup
>
> git gc --aggressive --prune=now # big blobs still in history
> git fsck # no results
> git fsck --full  --unreachable --dangling # no results
>
> To verify that the blobs are still there, see the output of:
>
> git gc && git verify-pack -v .git/objects/pack/pack-*.idx |
>   egrep '^\w+ blob\W+[0-9]+ [0-9]+ [0-9]+$' |
>   sort -k 3 -n -r > bigobjects.txt
>
> head bigobjects.txt # outputs:
> # 9451427d7335395779b91864418630d2f0af780a blob   7895212 1869047 7657491
>
>
> Also, if BFG is told to remove the biggest blob (bfg -B 1) with
> --no-blob-protection, it does not succeed in removing it.
>
> --- output of bfg -B 1
>
> Found 1 blob ids for large blobs - biggest=7895212 smallest=7895212
>
>
> BFG aborting: No refs to update - no dirty commits found??
> ---
>
> The repo can be found here.
>
> https://github.com/marscher/stallone_stale_objects
>
> I will restart all over to cleanup the history, but I guess this might
> be interesting for git developers.
>
>
> Best,
> Martin