Re: GC of alternate object store (was: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?)

2012-08-29 Thread Oswald Buddenhagen
On Tue, Aug 28, 2012 at 09:19:53PM +0200, Hallvard Breien Furuseth wrote:
 Oswald Buddenhagen wrote:
  (...)so the second approach is the bare aggregator repo which adds
  all other repos as remotes, and the other repos link back via
  alternates. problems:
  
  - to actually share objects, one always needs to push to the aggregator
 
 Run a cron job which frequently does that?
 
nope. i also have separate repos which share the same code, so when i
develop it i need to pick between them live. of course it's unlikely
to get conflicts in this case, so the missing object sharing is not that
bad (the objects are transferred via format-patch, as i'm rewriting
paths anyway), but when it happens it's messy to get out again.

  - tags having a shared namespace doesn't actually work, because the
  repos have the same tags on different commits (they are independent
  repos, after all)
 
 Junio's proposal partially fixes that: It pushes refs/* instead of
 refs/heads/*, to refs/remotes/<borrowing repo>/.  However...
 
i did exactly that. the tags are *still* not populated - git just tries
very hard to treat them specially.
and the stash file is also ignored, unfortunately.

  - one still cannot safely garbage-collect the aggregator, as the refs
  don't include the stashes and the index, so rebasing may invalidate
  these more transient objects.
 
 Also if you copy a repo (e.g. making a backup) instead of cloning it,
 and then start using both, they'll push into the same namespace -
 overwriting each other's refs.

right. it's a clear user error, though - i wouldn't *expect* it to work.
anyway, i don't have *that* problem, as my aggregator actually pulls,
not the other way round.

anyway, the bottom line is that using alternates as-is for anything but
sharing refs/remotes/origin/* (which i'm assuming to be ff-only) is
a recipe for disaster.

anything which is supposed to be in any way safe must make the donor
object store aware of the sharing, which at the very least means setting
the proposed append-only flag _by the borrowing_ object store. which
means that the info/alternates file should be obfuscated, so people
can't edit it manually.

  i would re-propose hallvard's volatile alternates (at least i think that's
  what he was talking about two weeks ago): they can be used to obtain
  objects, but every object which is in any way referenced from the current
  clone must be available locally (or from a regular alternate). that means
  that diffing, etc.  would get objects only temporarily, while cherry-picking
  would actually copy (some of) the objects. this would make it possible to
  cross-link repositories, safely and without any 3rd parties.
 
 I'm afraid that idea by itself won't work:-(

 Either you borrow from a store or not.

correct. from regular alternates you borrow, in volatile ones you
only peek.
so apparently our definitions are different after all.

 If Git uses an object from the volatile store, it can't always know if
 the caller needs the object to be copied.
 
it doesn't have to. the distinction comes when creating objects: if an
object is only in a volatile alternate, it does not already exist for the
purpose of object creation and is thus created locally.
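
A toy model may make the read/write distinction concrete. It is only a sketch: each "store" is a plain directory with one file per object id, and the variables LOCAL, REGULAR_ALT, VOLATILE_ALT as well as the function names are invented for illustration. Git has no volatile alternates.

```shell
#!/bin/sh
# Toy model: a "store" is a directory holding one file per object id.
# LOCAL, REGULAR_ALT and VOLATILE_ALT are hypothetical paths set by the caller.

# Reading may peek into every store, including the volatile one.
obj_readable() {
    for store in "$LOCAL" "$REGULAR_ALT" "$VOLATILE_ALT"; do
        [ -f "$store/$1" ] && return 0
    done
    return 1
}

# For object creation, only the local store and regular alternates count.
obj_exists_for_write() {
    for store in "$LOCAL" "$REGULAR_ALT"; do
        [ -f "$store/$1" ] && return 0
    done
    return 1
}

# Creating an object writes it locally unless a non-volatile store has it.
# Usage: obj_create <id> <content>
obj_create() {
    if obj_exists_for_write "$1"; then
        return 0
    fi
    printf '%s\n' "$2" > "$LOCAL/$1"
}
```

In this model a diff only calls obj_readable, while a cherry-pick goes through obj_create and therefore ends up with a local copy of anything that was visible only in the volatile store.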

regards

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


GC of alternate object store (was: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?)

2012-08-28 Thread Hallvard Breien Furuseth
Oswald Buddenhagen wrote:
 (...)so the second approach is the bare aggregator repo which adds
 all other repos as remotes, and the other repos link back via
 alternates. problems:
 
 - to actually share objects, one always needs to push to the aggregator

Run a cron job which frequently does that?

 - tags having a shared namespace doesn't actually work, because the
 repos have the same tags on different commits (they are independent
 repos, after all)

Junio's proposal partially fixes that: It pushes refs/* instead of
refs/heads/*, to refs/remotes/<borrowing repo>/.  However...

 - one still cannot safely garbage-collect the aggregator, as the refs
 don't include the stashes and the index, so rebasing may invalidate
 these more transient objects.

Also if you copy a repo (e.g. making a backup) instead of cloning it,
and then start using both, they'll push into the same namespace -
overwriting each other's refs.  Non-fast-forward pushes can thus lose
refs to objects needed by the other repo.

receive.denyNonFastForwards only rejects pushes to refs/heads/ or
something.  (A feature, as I learned when I reported it as a bug:-)
IIRC Git has no config option to reject all non-fast-forward pushes.
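
In the absence of such an option, an update hook in the shared repo could approximate it. The following is a sketch, not a tested policy: the function names are made up, and it relies on "git merge-base --is-ancestor", which needs a Git version that provides that option.

```shell
#!/bin/sh
# Sketch of an "update" hook body: refuse any ref update that is not a
# fast-forward, whatever namespace the ref lives in.
# Arguments follow the update hook convention: <refname> <old-sha> <new-sha>.

is_null_sha() {
    # An all-zero object name marks ref creation (old) or deletion (new).
    case "$1" in
        *[!0]*) return 1 ;;
        *) return 0 ;;
    esac
}

check_fast_forward() {
    refname=$1
    old=$2
    new=$3
    if is_null_sha "$old"; then
        return 0                      # new ref: nothing can be lost
    fi
    if is_null_sha "$new"; then
        echo "refusing to delete $refname" >&2
        return 1
    fi
    # Succeeds only if the old value is an ancestor of the new one.
    if git merge-base --is-ancestor "$old" "$new" 2>/dev/null; then
        return 0
    fi
    echo "refusing non-fast-forward update of $refname" >&2
    return 1
}

# In hooks/update of the shared repository one would end with:
#   check_fast_forward "$@"
```

Refusing deletions as well is a design choice of this sketch: a deleted ref can drop the last reference to objects another repo still borrows.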

 i would re-propose hallvard's volatile alternates (at least i think that's
 what he was talking about two weeks ago): they can be used to obtain
 objects, but every object which is in any way referenced from the current
 clone must be available locally (or from a regular alternate). that means
 that diffing, etc.  would get objects only temporarily, while cherry-picking
 would actually copy (some of) the objects. this would make it possible to
 cross-link repositories, safely and without any 3rd parties.

I'm afraid that idea by itself won't work:-(  Either you borrow from a
store or not.  If Git uses an object from the volatile store, it can't
always know if the caller needs the object to be copied.

OTOH volatile stores which you do *not* borrow from would be useful:
Let fetch/repack/gc/whatever copy missing objects from there.


2nd attempt at a way to gc the alternate repo:  Copy the to-be-removed
objects into each borrowing repo, then gc them.   Like this:

1. gc, but pack all to-be-removed objects into a removable pack.

2. Hardlink/copy the removable pack - with a .keep file - into
   borrowing repos when feasible:  I.e. repos you can find and
   have write access to.  Update their .git/objects/info/packs.
   (Is there a Git command for this?)  Repeat until nothing to do,
   in case someone created a new repo during this step.

3. Move the pack from the alternate repo to a backup object store
   which will keep it for a while.

4. Delete the .keep files from step (2).  They were needed in case
   a user gc'ed away an object from the pack and then added an
   identical object - borrowed from the to-be-removed pack.

5. gc/repack the other repos at your leisure.

666. Repos you could not update in step (2), can get temporarily
   broken.  Their owners must link the pack from the backup store by
   hand, or use that store as a volatile store and then gc/repack.
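
Steps 1 and 2 could be scripted roughly as follows. This is a sketch under assumptions: the function name is invented, it copies instead of hardlinking, it handles one borrowing repo per call, and it takes "unreachable" to mean whatever "git fsck --unreachable" reports. On the question in step 2, "git update-server-info" is the command that rewrites objects/info/packs.

```shell
#!/bin/sh
# Usage: export_removable_pack <alternate-git-dir> <borrower-git-dir>
# Prints the name of the exported pack, or nothing if there was none.
export_removable_pack() {
    alt=$1
    borrower=$2
    tmp=$(mktemp -d) || return 1

    # Step 1: pack what fsck reports as unreachable from refs, reflogs
    # and the index. --non-empty suppresses an empty pack.
    hash=$(git --git-dir="$alt" fsck --unreachable 2>/dev/null |
           awk '$1 == "unreachable" { print $3 }' |
           git --git-dir="$alt" pack-objects -q --non-empty "$tmp/pack")
    if [ -z "$hash" ]; then
        rm -rf "$tmp"
        return 0
    fi

    # Step 2: copy pack and index into the borrower, guarded by a .keep
    # file, then refresh objects/info/packs (used by dumb transports).
    dest=$borrower/objects/pack
    mkdir -p "$dest" &&
    cp "$tmp/pack-$hash.pack" "$tmp/pack-$hash.idx" "$dest/" &&
    : > "$dest/pack-$hash.keep" &&
    git --git-dir="$borrower" update-server-info
    status=$?
    rm -rf "$tmp"
    [ "$status" -eq 0 ] && echo "$hash"
}
```

Step 4 then amounts to removing the pack-<name>.keep files again once the pack has been moved out of the alternate repo.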

Loose objects are a problem:  If a repo has longer expiry time(s)
than the alternate store, it will get loads of loose objects from all
repos which push into the alternate store.  Worse, gc can *unpack*
those objects, consuming a lot of space.  See threads "git gc == git
garbage-create from removed branch" (3 May) and "Keeping unreachable
objects in a separate pack instead of loose?" (10 Jun).

Presumably the work-arounds are:
- Use long expiry times in the alternate repo.  I don't know which
  expiration config settings are relevant, or how.
- Add some command which checks and warns if the repo has longer
  expiry time than the repo it borrows from.
Also I hope Git will be changed to instead pack such loose objects
somewhere, as discussed in the above threads.
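
A first cut of such a check could simply list the settings side by side. The sketch below assumes that gc.pruneExpire, gc.reflogExpire and gc.reflogExpireUnreachable are the settings that matter, and that every alternate is the objects/ directory of an ordinary repository. It prints rather than compares, since the values are free-form dates. The function names are made up.

```shell
#!/bin/sh
# Print the gc expiry settings of a repository and of every object store
# listed in its objects/info/alternates, so they can be compared by eye.

show_expiry() {
    for key in gc.pruneExpire gc.reflogExpire gc.reflogExpireUnreachable; do
        value=$(git --git-dir="$1" config --get "$key") || value="(default)"
        printf '%s\t%s\t%s\n' "$1" "$key" "$value"
    done
}

# Usage: compare_expiry <git-dir>
compare_expiry() {
    gitdir=$1
    show_expiry "$gitdir"
    alternates=$gitdir/objects/info/alternates
    [ -f "$alternates" ] || return 0
    while IFS= read -r objdir; do
        case "$objdir" in
            ''|'#'*) continue ;;                  # blank line or comment
            /*) ;;                                # absolute path
            *) objdir=$gitdir/objects/$objdir ;;  # relative to objects/
        esac
        # Assumption: the alternate is <git-dir>/objects of a normal repo.
        show_expiry "${objdir%/objects}"
    done < "$alternates"
}
```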

All in all, this isn't something you'd want to do every day.  But it
looks doable and can be scripted.


Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?

2012-08-27 Thread Oswald Buddenhagen
hi,

Junio C Hamano <gitster at pobox.com> writes:
 The alternates mechanism [...]

sorry for the somewhat late response - i found this thread only now.

at qt-project.org we have a somewhat peculiar setup: we have the qt4 repository,
and a bunch of qt5 repositories which resulted from a split. qt5 is under active
development, but qt4 is still maintained. that means that we need to cherry-pick
between those repositories quite a lot. for an optimal cherry-picking experience
one needs three-way-merging, which means we need shared object stores. which is
where the problems start:

my first approach was just a common objects/ directory with all repositories
symlinking into it. problems:
- the object store can never be garbage-collected. with a lot of heavy rebasing
and temporarily added remotes, it gets messy after a while.
- there is a constant risk of destroying the object store by inadvertently
running git gc - which is particularly likely with git-gui, as it seems to be
retarded enough to ignore the auto-gc setting.

so the second approach is the bare aggregator repo which adds all other repos
as remotes, and the other repos link back via alternates. problems:
- to actually share objects, one always needs to push to the aggregator
- tags having a shared namespace doesn't actually work, because the repos have
the same tags on different commits (they are independent repos, after all)
- one still cannot safely garbage-collect the aggregator, as the refs don't
include the stashes and the index, so rebasing may invalidate these more
transient objects.

i would re-propose hallvard's volatile alternates (at least i think that's
what he was talking about two weeks ago): they can be used to obtain objects,
but every object which is in any way referenced from the current clone must be
available locally (or from a regular alternate). that means that diffing, etc.
would get objects only temporarily, while cherry-picking would actually copy
(some of) the objects. this would make it possible to cross-link repositories,
safely and without any 3rd parties.

thoughts?

regards



Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?

2012-08-11 Thread Hallvard Breien Furuseth
Junio C Hamano wrote:
Some ideas:
 
- Make clone --reference without -s not to borrow from the
  reference repository.  (...)

Generalize: Introduce volatile alternate object stores.  Commands like
(remote) fetch, repack, gc will copy desired objects they see there.

That allows pruneable alternates if people want them: Make every
borrowing repo also borrow from a companion volatile store.  To prune
some shared objects:  Move them from the alternate to the volatile.
Repack or gc all borrowing repos.  Empty the volatile alternate.
Similarly, to detach from one alternate repo while keeping others:
gc with the to-be-dropped alternate as a volatile.
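
For comparison, existing commands can already detach a repo from all of its alternates at once, though not from one while keeping others. A sketch, with a made-up function name, assuming the alternates are intact while it runs:

```shell
#!/bin/sh
# Detach a repository from all of its alternates using existing commands.
# Usage: detach_alternates <git-dir>
detach_alternates() {
    gitdir=$1
    alternates=$gitdir/objects/info/alternates
    [ -f "$alternates" ] || return 0
    # Without -l, "repack -a" also packs objects found only in alternates.
    # The old file is kept under another name until fsck has passed.
    git --git-dir="$gitdir" repack -a -d -q &&
    mv "$alternates" "$alternates.detached" &&
    git --git-dir="$gitdir" fsck
}
```

This copies every borrowed object that the repo's own history needs, which is the behaviour proposed above for gc with a volatile alternate, except that it applies to all alternates together.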

Also it gives a simple way to try to repair a repo with missing
objects, if you have some other repositories which might have the
objects: Repack with the other repositories as volatile alternates.

BTW, if a wanted object disappears from the volatile alternate while
fetch is running, fetch should get it from the remote after all.

- Make the distinction between a regular repository and an object
  store that is meant to be used for object sharing stronger.
 
  Perhaps a configuration item core.objectstore = readonly can
  be introduced, and we forbid clone -s from pointing at a
  repository without such a configuration.  We also forbid object
  pruning operations such as gc and repack from being run in
  a repository marked as such.

I hope Michael's append-only/donor is feasible instead.  In which
case safer gc/repack are needed, like you outline:

  It may be necessary to allow some special kind of repacking of
  such a readonly object store, in order to reduce the number
  of packfiles (and get rid of loose object files); it needs to
  be implemented carefully not to lose any object, regardless of
  local reachability.

And it needs to be default behavior in such stores, so users won't
need don't-shoot-myself-in-foot options.

- It might not be a bad idea to have a dedicated new command to
  help users manage alternates ("git alternates"?); obviously
  this will be one of its subcommands, "git alternates detach", if
  we go that route.

git object-store <subcommand>  -- manage alternates and object stores?

- Or just an entry in the documentation is sufficient?

Better doc would be useful anyway, and this command gives a place to
put it:-)  I had no idea alternates were intended to be read-only,
but that does explain some seeming defects I'd wondered about.

  - When you have two or more repositories that do not share objects,
you may want to rearrange things so that they share their objects
from a single common object store.
 
There is no direct UI to do this, as far as I know.  You can
obviously create a new bare repository, push there from all
of these repositories, and then borrow from there, e.g.

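    git --bare init shared.git &&
    for r in a.git b.git c.git ...
    do
        (
            cd "$r" &&
            # publish every ref of this repo under its own namespace
            git push ../shared.git "refs/*:refs/remotes/$r/*" &&
            # then start borrowing from the shared object store
            echo ../../../shared.git/objects >.git/objects/info/alternates
        )
    done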
 
And then repack shared.git once.

...and finally gc the other repositories.

The refs/remotes/$r/ namespace becomes misleading if the user renames
or copies the corresponding Git repository, and then cleverly does
something to the shared repo and the repo (if any) in directory $r.

I suggest refs/remotes/$unique_number/ and note $unique_number
somewhere in the borrowing repo.  If someone insists on being clever,
this may force them to read up on what they're doing first.
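
One way to note such a number is in the borrowing repo's own config. This is only a sketch: "alternates.borrowerId" is an invented config key that Git does not define, and the id is a random token from /dev/urandom rather than a number. A copied repo would carry the same id along, so this does not by itself stop two copies from sharing a namespace.

```shell
#!/bin/sh
# Push all refs of a borrowing repo into the shared repo under a namespace
# keyed by a stored random id instead of the directory name.
# "alternates.borrowerId" is a made-up config key, not something Git defines.
# Usage: (cd borrowing-repo && push_to_shared /path/to/shared.git)
push_to_shared() {
    shared=$1
    id=$(git config --get alternates.borrowerId) || {
        # First use: create a 16-hex-digit token and remember it.
        id=$(od -An -N8 -tx1 /dev/urandom | tr -d ' \n') &&
        git config alternates.borrowerId "$id"
    } || return 1
    git push -q "$shared" "refs/*:refs/remotes/$id/*"
}
```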

Or store no refs, since the shared repo shouldn't lose objects anyway.

If we're sure objects won't be lost: Create a proper remote with the
shared repo.  That way the user can push into it once in a while, and
he can configure just which refs should be shared.

 
Some ideas:
 
- (obvious: give a canned command to do the above, perhaps then
  set the core.objectstore=readonly in the resulting shared.git)

That's getting closer to 'bzr init-repository': One dir with the
shared repo and all borrowing repositories.  A simple model which Git
can track and the user need not think further about.

This way, git clone/init of a new repo in this dir can learn to notice
and use the shared repo.

We can also have a command (git object-store?) to maintain the
repository collection, since Git knows where to find them all:
Push from all repos into the shared repo, gc all repos, even prune
unused objects from the shared repo - after implementing sufficient
paranoia.

  - When you have one object store and a repository that does not yet
borrow from it, you may want to make the repository borrow from
the object store.  Obviously you can run echo like the sample
script in the previous item above, but it is not obvious how to
perform the logical next step of shrinking $GIT_DIR/objects of
the repository that 

Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?

2012-08-07 Thread Sascha Cunz
[..]
  - By design, the borrowed object store MUST not ever lose any
object from it, as such an object loss can corrupt the borrowing
repositories.  In theory, it is OK for the object store whose
objects are borrowed by repositories to acquire new objects, but
losing existing objects is an absolute no-no.
[...]
In practice, this means that users who use clone -s to make a
new repository can *never* prune the original repository without
risking to corrupt its borrowing repository [*1*].
[...]

Given your example of /git/linux.git being a clone of Linus' repository, 
cloning a related repository using it as --reference:

 $ cd /git
 $ git clone --reference /git/linux.git git://k.org/linux-next.git mine

Wouldn't it be by far a less intrusive alternative to do the following (in the 
clone step above):

- create the file /git/linux.git/objects/borrowing/_git_mine (This is where we
  borrow FROM).
  This file would hold a packed-ref list of HEADs from the /git/mine clone of
  the repository.

  _git_mine here is slash-stripped version of the destination path. Maybe the
  packed-ref format could also be extended by a single line containing a full
  path to the foreign repository.

- On every update-ref to /git/mine, update the 'borrowing' refs in
  /git/linux.git

- On any maintenance on /git/linux.git (gc, prune, repack, etc.) consider refs
  in the packed-refs at objects/borrowing to be valid references.

  If packed-ref format was adopted like stated above, we could stat() here if
  this directory still exists and error out if it doesn't (In this case the
  user should tell us if she moved or removed the clone).

Any alternative for looking up the packed-refs list for borrowing would also
be doable; i.e. putting the list of valid borrowing-packed-refs files into the
config file (as opposed to looking them up in $GIT_DIR/objects/borrowing as
above). Putting this list into the config file would eliminate the need for
the packed-ref format change and give the user the ability to maintain her
clones with the well-known command 'git config'.
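
The bookkeeping side of this proposal can be sketched with existing commands. Everything here is hypothetical: nothing in Git reads objects/borrowing/, and the function names and the "# path:" header line are invented. The part that would need changes inside Git, making gc, prune and repack honour these files, is not shown.

```shell
#!/bin/sh
# Sketch of the proposed bookkeeping. Nothing in Git reads
# objects/borrowing/; the directory and file format are hypothetical.

# Slash-stripped name, e.g. /git/mine -> _git_mine
borrow_name() {
    printf '%s\n' "$1" | tr / _
}

# Record the borrower's current refs inside the donor repository.
# Usage: record_borrower <donor-git-dir> <borrower-path>
record_borrower() {
    donor=$1
    borrower=$2
    dir=$donor/objects/borrowing
    mkdir -p "$dir" &&
    {
        echo "# path: $borrower"
        git -C "$borrower" for-each-ref --format='%(objectname) %(refname)'
    } > "$dir/$(borrow_name "$borrower")"
}

# List every object name the donor would have to treat as reachable.
# Usage: borrowed_tips <donor-git-dir>
borrowed_tips() {
    cat "$1"/objects/borrowing/* 2>/dev/null |
    awk '!/^#/ { print $1 }' | sort -u
}
```

record_borrower would have to run after every ref update in the borrower, for instance from a hook, which is the costly part of the scheme.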



Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?

2012-08-07 Thread Jeff King
On Sun, Aug 05, 2012 at 11:38:12AM +0200, Michael Haggerty wrote:

 I have some other crazy ideas for making the concept even more powerful:
 
 * Support remote alternate repositories.  Local repository obtains
 missing objects from the remote as needed.  This would probably be
 insanely inefficient without also supporting...
 
 * Lazy copying of borrowed objects to the local repository.  Any
 object fetched from the alternate object store is copied to the local
 object store.
 
 Together, I think that these two features would give fully-functional
 shallow clones.

You might be interested in looking at my rough (_very_ rough) experiment
with object db hooks:

  https://github.com/peff/git/commits/jk/external-odb

The basic idea is to have helper programs that basically have two
commands: give a list of sha1s you can provide, and fetch a specific
object by sha1. That's enough for the low levels of git to fall-back to
a helper on an object lookup failure, and copy the object to a local
cache. Managing the cache could be done externally by helper-specific
code.
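
To illustrate the shape of such a helper, here is a sketch backed by another local repository. The subcommand names "have" and "get", the output formats and the ODB_GIT_DIR variable are guesses for illustration, not the interface used on the branch above.

```shell
#!/bin/sh
# Hypothetical helper backed by another local repository. The subcommand
# names and output formats are guesses, not the jk/external-odb interface.
# ODB_GIT_DIR names the repository that provides the objects.

odb_helper() {
    case "$1" in
        have)
            # One line per available object: <sha1> <type> <size>
            git --git-dir="$ODB_GIT_DIR" cat-file --batch-all-objects --batch-check
            ;;
        get)
            # Emit "<sha1> <type> <size>" followed by the raw contents.
            echo "$2" | git --git-dir="$ODB_GIT_DIR" cat-file --batch
            ;;
        *)
            echo "usage: odb_helper have | get <sha1>" >&2
            return 1
            ;;
    esac
}
```

The caller would consult "have" once to learn what the helper can provide, and call "get" only after a local lookup has failed.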

Sorry, there's no documentation on the format or behavior, and most of
the changes are in one big patch. If you're interested and find it
unreadable, I can try to clean it up.

-Peff


Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?

2012-08-06 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

  - When you have one object store and a repository that does not yet
borrow from it, you may want to make the repository borrow from
the object store.  Obviously you can run echo like the sample
script in the previous item above, but it is not obvious how to
perform the logical next step of shrinking $GIT_DIR/objects of
the repository that now borrows the objects.

I think "git repack -a -d" is the way to do this, but if you
compare this command to "git repack -a -d -f" we saw previously
in this message, it is not surprising that the users would be
confused---it is not obvious at all.

Some ideas:

- (obvious: give a canned subcommand to do this)

The analysis of this item is wrong, I think.  "git repack -a -d -l"
should be the way to do so.

The message looks wrong when it turns out that there is no need to
have any object in the borrowing repository, though.  We only see
"Nothing new to pack" (which technically is correct), and the
command exits successfully.  You can peek into .git/objects/ to find
out that all the objects of which the borrower used to have its own
copy are now gone (because they are available at the alternate), but
the message gives a false impression that we thought about doing
something, found nothing new to be packed, and gave up without doing
anything.

But that is not what is happening.  We traversed the connectivity,
found that all the objects necessary for our history are housed in
our alternates, gave "Nothing new to pack" (because we do not have
to have any object on our own), and then removed all the object
files and packs in our repository.
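
Put together, the sequence looks roughly like this. It is a sketch: the function name is invented, and count-objects is only there to show what is left in the borrower afterwards.

```shell
#!/bin/sh
# Make an existing repository borrow from an object store, then drop the
# local copies of objects the store already has.
# Usage: borrow_from <git-dir> <absolute-path-to-store-objects-dir>
borrow_from() {
    gitdir=$1
    store_objects=$2
    # -l restricts the new pack to objects not available from alternates;
    # "Nothing new to pack" here means everything is already in the store.
    echo "$store_objects" >> "$gitdir/objects/info/alternates" &&
    git --git-dir="$gitdir" repack -a -d -l &&
    git --git-dir="$gitdir" count-objects -v
}
```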


Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?

2012-08-05 Thread Michael Haggerty

On 08/05/2012 06:56 AM, Junio C Hamano wrote:

The alternates mechanism [...]
The UI for this mechanism however has some room for improvement, and
we may want to start improving it for the next release after the
upcoming Git 1.7.12 (or even Git 2.0 if the change is a large one
that may be backward incompatible but gives us a vast improvement).



Here are some random thoughts as a discussion starter. [...]

[...]

- Make the distinction between a regular repository and an object
  store that is meant to be used for object sharing stronger.

  Perhaps a configuration item core.objectstore = readonly can
  be introduced, and we forbid clone -s from pointing at a
  repository without such a configuration.  We also forbid object
  pruning operations such as gc and repack from being run in
  a repository marked as such.


Must the repository necessarily be readonly?  It seems that it would
be permissible to push new objects to such a repository; just not to
delete existing objects.  Thus maybe another term would be better to
describe such a repository, like "appendonly" or "noprune" or even
something more abstract like "donor".
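
Until Git itself knows such a setting, a wrapper can at least honour the mark by convention. In this sketch core.objectstore is the key proposed in this thread, which Git does not interpret, and the accepted values are the candidate names mentioned here. Only callers that go through the wrapper are protected.

```shell
#!/bin/sh
# Wrapper that refuses to prune in a repository marked as a donor.
# core.objectstore is the setting proposed in this thread; Git itself does
# not know it, so only this wrapper honours the mark.
# Usage: guarded_gc <git-dir> [gc options...]
guarded_gc() {
    gitdir=$1
    shift
    mode=$(git --git-dir="$gitdir" config --get core.objectstore) || mode=
    case "$mode" in
        readonly|appendonly|noprune|donor)
            echo "$gitdir is marked '$mode'; not running gc" >&2
            return 1
            ;;
    esac
    git --git-dir="$gitdir" gc "$@"
}
```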


I have some other crazy ideas for making the concept even more powerful:

* Support remote alternate repositories.  Local repository obtains 
missing objects from the remote as needed.  This would probably be 
insanely inefficient without also supporting...


* Lazy copying of borrowed objects to the local repository.  Any 
object fetched from the alternate object store is copied to the local 
object store.


Together, I think that these two features would give fully-functional 
shallow clones.


Such alternates could even be chained together: for example, keep a 
single local lazy clone of the upstream repository somewhere on your 
site or on your computer, and use that as read-through cache for other 
clones.


* To help manage local disk space, allow intelligent curation of the 
objects kept in the local store when they are also available in the 
alternate.  The criteria for what to keep could be things like
"revisions with depth = 20 on branches X, Y/*, and Z", "objects that
have been accessed within the last 3 months", "all tag objects
refs/tags/release-*".  It should be possible to cull objects not meeting
the criteria with or without actively fetching all objects meeting the 
criteria.  Probably the criteria would be stored in the configuration to 
be reused (and perhaps run as part of git gc).


This would cure a lot of the "storing big, non-deltaable files" pain because
big blobs could be stored on a central server without multiplying the 
size of every clone.


Michael

--
Michael Haggerty
mhag...@alum.mit.edu



Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?

2012-08-05 Thread Junio C Hamano
Michael Haggerty mhag...@alum.mit.edu writes:

 I have some other crazy ideas for making the concept even more powerful:

Sorry, but the "a bit more sanity" topic is not interested in making
the concept powerful at all.

This is about making it usable with ease without the user having to
worry about "oh, I was about to shoot myself in the foot by running
repack; it is good that I remembered objects in this repository are
borrowed by other repositories" and things like that.

For the purpose of the "a bit more sanity" topic, adding new things
users have to worry about to the mix, e.g. "what happens if my
network goes away?  I can afford not to have access to these kinds
of objects for a while, but I must always have access to those
objects, so I can borrow the former but not the latter", is going in
the other way.

The ideas in your messages are *not* useless.  Enhancements along
those lines may be useful, but they do not fit in the same
discussion as making the current mechanism simpler and easier for
users to use in a safe and sane way.