Re: Import/Export as a fast way to purge files from Git?

2018-11-12 Thread Elijah Newren
On Mon, Nov 12, 2018 at 1:17 AM Ævar Arnfjörð Bjarmason wrote:
>
> FWIW I'd be very happy to have this tool itself included in git.git
> if/when it's stable / useful enough, and as you point out the language
> doesn't really matter as much as what features it exposes.

Well, I'm happy to propose it for inclusion once it gets 

Re: Import/Export as a fast way to purge files from Git?

2018-11-12 Thread Ævar Arnfjörð Bjarmason


On Thu, Nov 01 2018, Elijah Newren wrote:

> Yeah, when portability matters, perl makes sense.  I thought about
> switching it over, but I'm not sure I want to rewrite 1-2k lines of
> code.  Especially since repo-filtering tools are kind of one-shot by
> nature, and only need to be done by one person of a team, on one
> specific machine, and won't affect daily development thereafter.
> (Also, since I don't depend on any libraries and use only stuff from
> the default python library, it ought to be relatively portable
> anyway.)

FWIW I'd be very happy to have this tool itself included in git.git
if/when it's stable / useful enough, and as you point out the language
doesn't really matter as much as what features it exposes.


Re: Import/Export as a fast way to purge files from Git?

2018-11-01 Thread Elijah Newren
On Wed, Oct 31, 2018 at 12:16 PM Lars Schneider wrote:
>
> That library sounds like a very interesting idea. Unfortunately, the
> referenced repo seems not to be available anymore:
> git://gitorious.org/git_fast_filter/mainline.git

Yeah, gitorious went down at a time when I was busy with enough other
things that I never bothered moving my repos to a new hosting site.
Sorry about that.

I've got a copy locally, but I've been editing it heavily, without the
testing I should have in place, so I hesitate to point you at it right
now.  (Also, the old version failed to handle things like --no-data
output, which is important.)  I'll post an updated copy soon; feel
free to ping me in a week if you haven't heard anything yet.

> I very much like Python. However, more recently I started to
> write Git tools in Perl as they work out of the box on every
> machine with Git installed ... and I think Perl can be quite
> readable if no shortcuts are used :-).

Yeah, when portability matters, perl makes sense.  I thought about
switching it over, but I'm not sure I want to rewrite 1-2k lines of
code.  Especially since repo-filtering tools are kind of one-shot by
nature, and only need to be done by one person of a team, on one
specific machine, and won't affect daily development thereafter.
(Also, since I don't depend on any libraries and use only stuff from
the default python library, it ought to be relatively portable
anyway.)


Re: Import/Export as a fast way to purge files from Git?

2018-10-31 Thread Lars Schneider



> On Sep 24, 2018, at 7:24 PM, Elijah Newren  wrote:
> 
> The basic approach is fine, though if you try to extend it much you
> can run into a few possible edge/corner cases (more on that below).
> I've been using this basic approach for years and even created a
> mini-python library[1] designed specifically to allow people to create
> "fast-filters", used as
>   git fast-export  | your-fast-filter | git fast-import 
> 
> But that library didn't really take off; even I have rarely used it,
> often opting for filter-branch despite its horrible performance or a
> simple fast-export | long-sed-command | fast-import (with some extra
> pre-checking to make sure the sed wouldn't unintentionally munge other
> data).  BFG is great, as long as you're only interested in removing a
> few big items, but otherwise doesn't seem very useful (to be fair,
> it's very upfront about only wanting to solve that problem).
> Recently, due to continuing questions on filter-branch and folks still
> getting confused with it, I looked at existing tools, decided I didn't
> think any quite fit, and started looking into converting
> git_fast_filter into a filter-branch-like tool instead of just a
> library.  Found some bugs and missing features in fast-export along the
> way (and have some patches I still need to send in).  But I kind of
> got stuck -- if the tool is in python, will that limit adoption too
> much?  It'd be kind of nice to have this tool in core git.  But I kind
> of like leaving open the possibility of using it as a tool _or_ as a
> library, the latter for the special cases where case-specific
> programmatic filtering is needed.  But a developer-convenience library
> makes almost no sense unless in a higher level language, such as
> python.  I'm still trying to make up my mind about what I want (and
> what others might want), and have been kind of blocking on that.  (If
> others have opinions, I'm all ears.)

That library sounds like a very interesting idea. Unfortunately, the 
referenced repo seems not to be available anymore:
git://gitorious.org/git_fast_filter/mainline.git

I very much like Python. However, more recently I started to
write Git tools in Perl as they work out of the box on every
machine with Git installed ... and I think Perl can be quite
readable if no shortcuts are used :-). 



Re: Import/Export as a fast way to purge files from Git?

2018-09-24 Thread Elijah Newren
On Sun, Sep 23, 2018 at 6:08 AM Lars Schneider wrote:
>
> Hi,
>
> I recently had to purge files from large Git repos (many files, many commits).
> The usual recommendation is to use `git filter-branch --index-filter` to purge
> files. However, this is *very* slow for large repos (e.g. it takes 45min to
> remove the `builtin` directory from git core). I realized that I can remove
> files *way* faster by exporting the repo, removing the file references,
> and then importing the repo (see Perl script below, it takes ~30sec to remove
> the `builtin` directory from git core). Do you see any problem with this
> approach?

It looks like others have pointed you at other tools, and you're
already shifting to that route.  But I think it's a useful question to
answer more generally, so for those that are really curious...


The basic approach is fine, though if you try to extend it much you
can run into a few possible edge/corner cases (more on that below).
I've been using this basic approach for years and even created a
mini-python library[1] designed specifically to allow people to create
"fast-filters", used as
   git fast-export  | your-fast-filter | git fast-import 

But that library didn't really take off; even I have rarely used it,
often opting for filter-branch despite its horrible performance or a
simple fast-export | long-sed-command | fast-import (with some extra
pre-checking to make sure the sed wouldn't unintentionally munge other
data).  BFG is great, as long as you're only interested in removing a
few big items, but otherwise doesn't seem very useful (to be fair,
it's very upfront about only wanting to solve that problem).
Recently, due to continuing questions on filter-branch and folks still
getting confused with it, I looked at existing tools, decided I didn't
think any quite fit, and started looking into converting
git_fast_filter into a filter-branch-like tool instead of just a
library.  Found some bugs and missing features in fast-export along the
way (and have some patches I still need to send in).  But I kind of
got stuck -- if the tool is in python, will that limit adoption too
much?  It'd be kind of nice to have this tool in core git.  But I kind
of like leaving open the possibility of using it as a tool _or_ as a
library, the latter for the special cases where case-specific
programmatic filtering is needed.  But a developer-convenience library
makes almost no sense unless in a higher level language, such as
python.  I'm still trying to make up my mind about what I want (and
what others might want), and have been kind of blocking on that.  (If
others have opinions, I'm all ears.)
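
As a rough illustration of what a "fast-filter" in the middle of such a
pipeline could look like, here is a minimal Python sketch. It is not the
git_fast_filter library and not the Perl script from the original post;
the function name `drop_path` is made up for this example. It assumes the
stream is the usual fast-export output (`data <count>` blocks, `M` and `D`
file-change lines) and it ignores renames/copies (`R`/`C`), paths that
fast-export has quoted and escaped, and all of the empty-commit questions
discussed in this thread.

```python
import io


def drop_path(stream_in, stream_out, prefix):
    """Copy a fast-export stream, omitting M/D file changes under `prefix`.

    Hypothetical helper for illustration only; both streams are binary.
    """
    prefix = prefix.rstrip(b"/")
    while True:
        line = stream_in.readline()
        if not line:
            break
        if line.startswith(b"data "):
            # Blob contents and commit messages follow as raw bytes;
            # copy them verbatim so they are never mistaken for commands.
            size = int(line[5:].strip())
            stream_out.write(line)
            stream_out.write(stream_in.read(size))
            continue
        if line.startswith(b"M "):
            # Format: M <mode> <dataref> <path>
            path = line.rstrip(b"\n").split(b" ", 3)[3]
        elif line.startswith(b"D "):
            # Format: D <path>
            path = line.rstrip(b"\n")[2:]
        else:
            path = None
        if path is not None:
            # Simplification: only strip a leading quote, no unescaping.
            bare = path[1:] if path.startswith(b'"') else path
            if bare == prefix or bare.startswith(prefix + b"/"):
                continue  # drop this file change
        stream_out.write(line)
```

One way to wire this up, under the same assumptions, is to call
`drop_path(sys.stdin.buffer, sys.stdout.buffer, b"builtin")` from a small
script placed between `git fast-export --all` and a `git fast-import`
running in a fresh repository.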


Anyway, the edge/corner cases you can watch out for:

  - Signed tags are a problem; you may need to specify
--signed-tags=strip to fast-export

  - References to other commits in your commit messages will now be
incorrect.  I think a good tool should either default to rewriting
commit ids in commit messages or at least have an option to do so
(BFG does this; filter-branch doesn't; fast-export format makes it
really hard for a filter based on it to do so)

  - If the paths you remove are the only paths modified in a commit,
the commit can become empty.  If you're only filtering a few paths
out, this might be nothing more than a minor inconvenience for you.
However, if you're trying to prune directories (and perhaps several
toplevel ones), then it can be extremely annoying to have a new
history with the vast majority of all commits being empty.
(filter-branch has an option for this; BFG does not; tools based on
fast-export output can do it with sufficient effort).

  - If you start pruning empty commits, you have to worry about
rewriting branches and tags to remaining parents.  This _might_ happen
for free depending on your history's structure and the fast-export
stream, but to be correct in general you will have to specify the new
commit for some branches or tags.

  - If you start pruning empty commits, you have to decide whether to
allow pruning of merge commits.  Your first reaction might be to not
allow it, but if one parent and its entire history are all pruned,
then transforming the merge commit to a normal commit and then
considering whether it is empty and allowing it to be pruned is much
better.

  - If you start pruning empty commits, you also have to worry about
history topology changing, beyond the all-ancestors-empty case above.
For example, the last non-empty commit in the ancestry of a merge on
both sides may be the same commit, making the merge-commit have the
same parent twice.  Should the duplicate parent be de-duped,
transforming the commit into a normal non-merge commit?  (I'd say yes
-- this commit is likely to be empty and prunable once you do so, but
I'm not sure everyone would agree with converting a merge commit to a
non-merge.)  Similarly, what if the rewritten parents of a merge have
one parent that is the direct ancestor of another?  Can the extra
unnecessary parent be removed as a 

Re: Import/Export as a fast way to purge files from Git?

2018-09-23 Thread Jeff King
On Sun, Sep 23, 2018 at 03:53:38PM +, brian m. carlson wrote:

> I suspect you're gaining speed mostly because you're running three
> processes total instead of at least one process (sh) per commit.  So I
> don't think there's anything that Git can do to make this faster on our
> end without a redesign.

It's not just the process startup overhead that makes it faster. Using
multiple processes means they have to communicate somehow. In this case,
git-read-tree is writing out the whole index for each commit, which
git-rm reads in and modifies, and then git-commit-tree finally converts
back to a tree. In addition to the raw CPU of that work, there's a bunch
of latency as each step is performed serially.

Whereas in the proposed pipeline, fast-export is writing out a diff and
fast-import is turning that directly back into tree objects. And both
processes are proceeding independently, so you benefit from multiple
cores.
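
To make the "both processes are proceeding independently" point
concrete, here is a small Python sketch of how such a pipeline might be
wired up. The names `run_pipeline` and `transform` are invented for this
example. The exporter and importer run as separate processes connected by
pipes while the filter runs in between, so nothing is written to disk per
commit.

```python
import subprocess


def run_pipeline(export_cmd, import_cmd, transform):
    """Stream export_cmd's stdout through transform() into import_cmd's stdin.

    Hypothetical helper: transform(line) returns the bytes to forward,
    or None to drop the line. Returns both exit codes.
    """
    exporter = subprocess.Popen(export_cmd, stdout=subprocess.PIPE)
    importer = subprocess.Popen(import_cmd, stdin=subprocess.PIPE)
    # Both child processes are already running here; we only shuttle bytes.
    for line in exporter.stdout:
        out = transform(line)
        if out is not None:
            importer.stdin.write(out)
    importer.stdin.close()  # signal end of stream to the importer
    exporter.stdout.close()
    return exporter.wait(), importer.wait()
```

For a real repository the commands would be something like
`["git", "fast-export", "--all"]` and
`["git", "-C", new_repo, "fast-import"]`, where `new_repo` is a
placeholder for a freshly initialized repository. A line-by-line
`transform` is only a simplification: a real filter has to parse `data`
blocks instead of treating blob contents and commit messages as lines.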

Which isn't to say I really disagree with "Git can't really make this
faster". filter-branch has a ton of power to let you replay arbitrary
commands (including non-Git commands!), so the speed tradeoff in its
approach is very intentional. If we could modify the index in-place that
would probably make it a little faster, but that probably counts as
"redesign" in your statement. ;)

-Peff


Re: Import/Export as a fast way to purge files from Git?

2018-09-23 Thread Lars Schneider



> On Sep 23, 2018, at 4:55 PM, Eric Sunshine  wrote:
> 
> A couple comments:
> 
> For purging files from a history, take a look at BFG[1] which bills
> itself as "a simpler, faster alternative to git-filter-branch for
> cleansing bad data out of your Git repository history".

Yes, BFG is great. Unfortunately, it requires Java which is not available
on every system I have to work with. I required a solution that would work
in every Git environment. Hence the Perl script :-)


> The approach of exporting to a fast-import stream, modifying the
> stream, and re-importing is quite reasonable.

Thanks for the confirmation!


> However, rather than
> re-inventing, take a look at reposurgeon[2], which allows you to do
> major surgery on fast-import streams. Not only can it purge files from
> a repository, but it can slice, dice, puree, and saute pretty much any
> attribute of a repository.

Wow. Reposurgeon looks very interesting. Thanks a lot for the pointer!

Cheers,
Lars


> [1]: https://rtyley.github.io/bfg-repo-cleaner/
> [2]: http://www.catb.org/esr/reposurgeon/



Re: Import/Export as a fast way to purge files from Git?

2018-09-23 Thread brian m. carlson
On Sun, Sep 23, 2018 at 03:04:58PM +0200, Lars Schneider wrote:
> Hi,
> 
> I recently had to purge files from large Git repos (many files, many commits).
> The usual recommendation is to use `git filter-branch --index-filter` to purge
> files. However, this is *very* slow for large repos (e.g. it takes 45min to
> remove the `builtin` directory from git core). I realized that I can remove
> files *way* faster by exporting the repo, removing the file references,
> and then importing the repo (see Perl script below, it takes ~30sec to remove
> the `builtin` directory from git core). Do you see any problem with this
> approach?

I don't know of any problems with this approach.  I didn't audit your
specific Perl script for any issues, though.

I suspect you're gaining speed mostly because you're running three
processes total instead of at least one process (sh) per commit.  So I
don't think there's anything that Git can do to make this faster on our
end without a redesign.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204




Re: Import/Export as a fast way to purge files from Git?

2018-09-23 Thread Eric Sunshine
On Sun, Sep 23, 2018 at 9:05 AM Lars Schneider wrote:
> I recently had to purge files from large Git repos (many files, many commits).
> The usual recommendation is to use `git filter-branch --index-filter` to purge
> files. However, this is *very* slow for large repos (e.g. it takes 45min to
> remove the `builtin` directory from git core). I realized that I can remove
> files *way* faster by exporting the repo, removing the file references,
> and then importing the repo (see Perl script below, it takes ~30sec to remove
> the `builtin` directory from git core). Do you see any problem with this
> approach?

A couple comments:

For purging files from a history, take a look at BFG[1] which bills
itself as "a simpler, faster alternative to git-filter-branch for
cleansing bad data out of your Git repository history".

The approach of exporting to a fast-import stream, modifying the
stream, and re-importing is quite reasonable. However, rather than
re-inventing, take a look at reposurgeon[2], which allows you to do
major surgery on fast-import streams. Not only can it purge files from
a repository, but it can slice, dice, puree, and saute pretty much any
attribute of a repository.

[1]: https://rtyley.github.io/bfg-repo-cleaner/
[2]: http://www.catb.org/esr/reposurgeon/