Re: [Request] Git export with hardlinks

2013-02-11 Thread Jeff King
On Sun, Feb 10, 2013 at 11:33:26AM +0100, Thomas Koch wrote:

> thank you very much for your idea! It's good and simple. It just breaks down 
> for the case when a large folder got renamed.

Yes, it would never find renames, which a true sha1->path map could.

> But I already hacked the basic layout of the algorithm and it's not 
> complicated at all, I believe:
> 
> https://github.com/thkoch2001/git_export_hardlinks/blob/master/git_export_hardlinks.py

It looks like you create the sha1->path mapping by asking the user to
provide (tree sha1, path) pairs, and then assuming that the exported
tree at each path exactly matches its tree sha1. Which it would in the
workflow you've proposed, but it is also easy for that not to be the
case (e.g., somebody munges a file in an exported path after it has
been exported).

So it's a bit dangerous as a general purpose tool, IMHO. It's also a
slight pain in that you have to keep track of the tree sha1 for each
exported path somehow.

A safer and more convenient (but slightly less efficient) solution would
be to keep a git index file for each exported tree. Then we can just
refresh that index, which would check that our sha1 for each path is up
to date (and in the common case of nothing changed, would only be as
expensive as stat()-ing each entry). And then we use that index as the
sha1->path map.
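
A rough sketch of that idea in shell, not a tested tool. It assumes each
export was written with its own index file, that GIT_DIR points at the
repository, and that the index path is absolute. The function name
sha1_path_map and its two arguments are made up for illustration.

```shell
# sha1_path_map <worktree> <index-file>
#
# Hypothetical helper: print "<sha1><TAB><path>" for every file recorded in
# <index-file>, but only if the files in <worktree> still match the index.
# Assumes GIT_DIR points at the repository and <index-file> is absolute.
sha1_path_map() (
    worktree=$1 index=$2
    cd "$worktree" || exit 1
    export GIT_WORK_TREE="$PWD" GIT_INDEX_FILE="$index"
    # Without -q, --refresh exits non-zero when any entry is out of date
    # (it reports "<path>: needs update"); treat that as "do not trust
    # this export".
    git update-index --refresh >&2 || exit 1
    # ls-files -s prints: <mode> <sha1> <stage><TAB><path>
    # (paths with unusual characters come out quoted; use -z to avoid that)
    git ls-files -s | awk -F'\t' '{ split($1, f, " "); print f[2] "\t" $2 }'
)
```

Called with an export directory and its index, this prints one line per
file, or fails if somebody has munged a file in the export since it was
written.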

The simplest way to have an index for each export would be to actually
give each one its own git repo (which does not have to use much space,
if you use "-s" to share the objects with the master repo).
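
As a minimal sketch of the one-repo-per-export idea (the function name
new_export and its arguments are hypothetical, and this assumes the
master repo is reachable as a local path):

```shell
# new_export <master-repo> <dest> <rev>
#
# Hypothetical helper: create a small repository at <dest> that borrows its
# objects from <master-repo>, then check out <rev> there. The export's index
# then lives in <dest>/.git/index.
new_export() (
    master=$1 dest=$2 rev=$3
    # Make sure a GIT_DIR exported for other commands does not interfere.
    unset GIT_DIR GIT_WORK_TREE
    # -s (--shared) records the master's object store in
    # objects/info/alternates instead of copying objects;
    # -n skips the initial checkout so we can pick the revision ourselves.
    git clone -q -s -n "$master" "$dest" || exit 1
    git -C "$dest" checkout -q --detach "$rev"
)
```

One caveat with "-s": the export depends on the master repo keeping its
objects, so pruning in the master can break the exports.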

That's more complex, and uses more disk than what your script does, but
I do think the added safety would be worth it for a general-purpose
tool.

> I had to interrupt work on this and could not yet finish and test it. But I 
> thought you might be interested. Maybe something like this might one day be 
> rewritten in C and become part of git core?

I think if we had a `git export` command (and we do not, but there has
been discussion in a nearby thread about whether such a thing might be a
good idea), having a `--hard-link-from` option to link with other
checkouts would make sense. It could also potentially be an option to
git-checkout-index, and you could script around it at that low level.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Request] Git export with hardlinks

2013-02-10 Thread Thomas Koch
Jeff King:
> [...]
> So a full checkout is 24M. For the next deploy, we'll start by asking
> "cp" to duplicate the old, using hard links:

Hi Jeff,

thank you very much for your idea! It's good and simple. It just breaks down 
for the case when a large folder got renamed.

But I already hacked the basic layout of the algorithm and it's not 
complicated at all, I believe:

https://github.com/thkoch2001/git_export_hardlinks/blob/master/git_export_hardlinks.py

I had to interrupt work on this and could not yet finish and test it. But I 
thought you might be interested. Maybe something like this might one day be 
rewritten in C and become part of git core?

Regards,

Thomas Koch, http://www.koch.ro


Re: [Request] Git export with hardlinks

2013-02-08 Thread Jeff King
On Wed, Feb 06, 2013 at 04:19:07PM +0100, Thomas Koch wrote:

> I'd like to script a git export command that can be given a list of already 
> exported worktrees and the tree SHA1s these worktrees correspond to. The git 
> export command should then for every file it wants to export lookup in the 
> existing worktrees whether an identical file is already present and in that 
> case hardlink to the new export location instead of writing the same file 
> again.
> 
> Use Case: A git based web deployment system that exports git trees to be 
> served by a web server. Every new deployment is written to a new folder. After 
> the export the web server should start serving new requests from the new 
> folder.
> 
> It might be possible that this is premature optimization. But I'd like to 
> learn more Python and dulwich by hacking this.
> 
> Do you have any additional thoughts or use cases about this?

If you can handle losing the generality of N deployments, you can do it
in a few lines of shell.

Let's assume for a moment that you keep two trees at any given time:
the existing tree being used, and the tree you are setting up to deploy.
To save space, you want the new deployment to reuse (via hardlinks) as
many of the files from the old deployment as possible.

So imagine you have a bare repository storing the actual data:

  $ git clone --bare /some/test/repo repo.git
  $ du -sh *
  49M repo.git

and then you have one deployment you've set up previously by checking
out the repo contents:

  $ export GIT_DIR=$PWD/repo.git
  $ mkdir old
  $ (cd old && GIT_WORK_TREE=$PWD git checkout HEAD)
  $ du -sh *
  24M old
  49M repo.git

So a full checkout is 24M. For the next deploy, we'll start by asking
"cp" to duplicate the old, using hard links:

  $ cp -rl old new
  $ du -sh *
  24M new
  768K old
  49M repo.git

and we use hardly any extra space (it should just be directory inodes).
And now we can ask git to make "new" look like some other commit. It
will only touch files which have changed, so the rest remain hardlinked,
and we use only a small amount of extra space:

  $ (cd new && GIT_WORK_TREE=$PWD git checkout HEAD~10)
  $ du -sh *
  24M new
  1.3M old
  49M repo.git
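
If you want to see which files stopped being shared after the checkout,
something along these lines should do. It is only a sketch; the function
name list_unshared is made up, and it relies on the "-ef" (same file)
operator of test:

```shell
# list_unshared <old> <new>
#
# Print every regular file under <new> that is not the same inode as the
# file at the same relative path under <old> (changed, added, or re-created).
list_unshared() {
    old=$1 new=$2
    (cd "$new" && find . -type f) | while IFS= read -r f; do
        # -ef is true only when both names refer to the same device and inode.
        [ "$new/$f" -ef "$old/$f" ] || printf '%s\n' "${f#./}"
    done
}
```

Right after the "cp -rl" it should print nothing; after the checkout it
should list only the files that differ between the two commits.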

Now you point your deployment at "new", and you are free to leave "old"
sitting around or remove it at your leisure. You save space while the
two co-exist, and you saved the I/O of copying any files from "old" to
"new".

This breaks down, of course, if you want to keep N trees around and
hard-link to whichever one has the content you want. For that you'd have
to write some custom code.
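
To give an idea of what such custom code might look like, here is a
sketch under several assumptions: the helper name export_with_links and
the map file format are invented for this example, the map is assumed to
be accurate for the existing trees, GIT_DIR points at the repository, and
only regular files are handled.

```shell
# export_with_links <tree-ish> <dest> <map-file>
#
# Hypothetical helper. <map-file> is assumed to hold one
# "<mode> <sha1><TAB><absolute path>" line per file in earlier exports.
# Symlinks and submodules are skipped.
export_with_links() {
    rev=$1 dest=$2 map=$3
    tab=$(printf '\t')
    # ls-tree -r prints: <mode> <type> <sha1><TAB><path>
    git ls-tree -r "$rev" | while IFS= read -r line; do
        meta=${line%%"$tab"*}
        path=${line#*"$tab"}
        set -- $meta
        mode=$1 type=$2 sha=$3
        case "$type $mode" in
        "blob 100644"|"blob 100755") ;;
        *) echo "skipping $path ($type $mode)" >&2; continue ;;
        esac
        mkdir -p "$(dirname "$dest/$path")" || exit 1
        # Key on mode as well as sha1: hardlinks share permission bits.
        src=$(awk -F'\t' -v k="$mode $sha" '$1 == k { print $2; exit }' "$map")
        if [ -n "$src" ] && ln "$src" "$dest/$path" 2>/dev/null; then
            continue
        fi
        # No usable copy: write the blob out (no filters or eol conversion).
        git cat-file blob "$sha" >"$dest/$path" || exit 1
        if [ "$mode" = 100755 ]; then chmod +x "$dest/$path"; fi
    done
}
```

Because the lookup is by content rather than by path, a file that was
renamed between deployments can still be linked from an older tree.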

-Peff


[Request] Git export with hardlinks

2013-02-06 Thread Thomas Koch
Hi,

I'd like to script a git export command that can be given a list of already 
exported worktrees and the tree SHA1s these worktrees correspond to. The git 
export command should then for every file it wants to export lookup in the 
existing worktrees whether an identical file is already present and in that 
case hardlink to the new export location instead of writing the same file 
again.

Use Case: A git based web deployment system that exports git trees to be 
served by a web server. Every new deployment is written to a new folder. After 
the export the web server should start serving new requests from the new 
folder.

It might be possible that this is premature optimization. But I'd like to 
learn more Python and dulwich by hacking this.

Do you have any additional thoughts or use cases about this?

Regards,

Thomas Koch, http://www.koch.ro