Re: GC of alternate object store (was: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?)
On Tue, Aug 28, 2012 at 09:19:53PM +0200, Hallvard Breien Furuseth wrote:
> Oswald Buddenhagen wrote:
> > (...) so the second approach is the bare aggregator repo which adds
> > all other repos as remotes, and the other repos link back via
> > alternates. problems:
> > - to actually share objects, one always needs to push to the
> >   aggregator
>
> Run a cron job which frequently does that?

nope. i also have separate repos which share the same code, so when i
develop it i need to pick between them live. of course it's unlikely to
get conflicts in this case, so the missing object sharing is not that
bad (the objects are transferred via format-patch, as i'm rewriting
paths anyway), but when it happens it's messy to get out again.

> > - tags having a shared namespace doesn't actually work, because the
> >   repos have the same tags on different commits (they are
> >   independent repos, after all)
>
> Junio's proposal partially fixes that: It pushes refs/* instead of
> refs/heads/*, to refs/remotes/<borrowing repo>/. However...

i did exactly that. the tags are *still* not populated - git just tries
very hard to treat them specially. and the stash file is also ignored,
unfortunately.

> > - one still cannot safely garbage-collect the aggregator, as the
> >   refs don't include the stashes and the index, so rebasing may
> >   invalidate these more transient objects.
>
> Also if you copy a repo (e.g. making a backup) instead of cloning it,
> and then start using both, they'll push into the same namespace -
> overwriting each other's refs.

right. it's a clear user error, though - i wouldn't *expect* it to
work. anyway, i don't have *that* problem, as my aggregator actually
pulls, not the other way round.

anyway, the bottom line is that using alternates as-is for anything but
sharing refs/remotes/origin/* (which i'm assuming to be ff-only) is a
recipe for disaster.
anything which is supposed to be in any way safe must make the donor
object store aware of the sharing, which at the very least means
setting the proposed append-only flag _by the borrowing_ object store.
which means that the info/alternates file should be obfuscated, so
people can't edit it manually.

> > i would re-propose hallvard's volatile alternates (at least i think
> > that's what he was talking about two weeks ago): they can be used
> > to obtain objects, but every object which is in any way referenced
> > from the current clone must be available locally (or from a regular
> > alternate). that means that diffing, etc. would get objects only
> > temporarily, while cherry-picking would actually copy (some of) the
> > objects. this would make it possible to cross-link repositories,
> > safely and without any 3rd parties.
>
> I'm afraid that idea by itself won't work:-( Either you borrow from a
> store or not.

correct. from regular alternates you borrow, in volatile ones you only
peek. so apparently our definitions are different after all.

> If Git uses an object from the volatile store, it can't always know
> if the caller needs the object to be copied.

it doesn't have to. the distinction comes when creating objects: if an
object is only in a volatile alternate, it does not already exist for
the purpose of object creation and is thus created locally.

regards
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
GC of alternate object store (was: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?)
Oswald Buddenhagen wrote:
> (...) so the second approach is the bare aggregator repo which adds
> all other repos as remotes, and the other repos link back via
> alternates. problems:
> - to actually share objects, one always needs to push to the
>   aggregator

Run a cron job which frequently does that?

> - tags having a shared namespace doesn't actually work, because the
>   repos have the same tags on different commits (they are independent
>   repos, after all)

Junio's proposal partially fixes that: It pushes refs/* instead of
refs/heads/*, to refs/remotes/<borrowing repo>/. However...

> - one still cannot safely garbage-collect the aggregator, as the refs
>   don't include the stashes and the index, so rebasing may invalidate
>   these more transient objects.

Also if you copy a repo (e.g. making a backup) instead of cloning it,
and then start using both, they'll push into the same namespace -
overwriting each other's refs. Non-fast-forward pushes can thus lose
refs to objects needed by the other repo. receive.denyNonFastForwards
only rejects pushes to refs/heads/ or something. (A feature, as I
learned when I reported it as a bug:-) IIRC Git has no config option
to reject all non-fast-forward pushes.

> i would re-propose hallvard's volatile alternates (at least i think
> that's what he was talking about two weeks ago): they can be used to
> obtain objects, but every object which is in any way referenced from
> the current clone must be available locally (or from a regular
> alternate). that means that diffing, etc. would get objects only
> temporarily, while cherry-picking would actually copy (some of) the
> objects. this would make it possible to cross-link repositories,
> safely and without any 3rd parties.

I'm afraid that idea by itself won't work:-( Either you borrow from a
store or not. If Git uses an object from the volatile store, it can't
always know if the caller needs the object to be copied.
OTOH volatile stores which you do *not* borrow from would be useful:
Let fetch/repack/gc/whatever copy missing objects from there.

2nd attempt for a way to gc the alternate repo: Copy the to-be-removed
objects into each borrowing repo, then gc them. Like this:

1. gc, but pack all to-be-removed objects into a "removable" pack.

2. Hardlink/copy the removable pack - with a .keep file - into
   borrowing repos when feasible: I.e. repos you can find and have
   write access to. Update their .git/objects/info/packs. (Is there a
   Git command for this?) Repeat until nothing to do, in case someone
   created a new repo during this step.

3. Move the pack from the alternate repo to a backup object store
   which will keep it for a while.

4. Delete the .keep files from step (2). They were needed in case a
   user gc'ed away an object from the pack and then added an identical
   object - borrowed from the to-be-removed pack.

5. gc/repack the other repos at your leisure.

666. Repos you could not update in step (2) can get temporarily
   broken. Their owners must link the pack from the backup store by
   hand, or use that store as a volatile store and then gc/repack.

Loose objects are a problem: If a repo has longer expiry time(s) than
the alternate store, it will get loads of loose objects from all repos
which push into the alternate store. Worse, gc can *unpack* those
objects, consuming a lot of space. See threads "git gc == git
garbage-create from removed branch" (3 May) and "Keeping unreachable
objects in a separate pack instead of loose?" (10 Jun).

Presumably the work-arounds are:
- Use long expiry times in the alternate repo. I don't know which
  expiration config settings are relevant how.
- Add some command which checks and warns if the repo has a longer
  expiry time than the repo it borrows from.

Also I hope Git will be changed to instead pack such loose objects
somewhere, as discussed in the above threads.

All in all, this isn't something you'd want to do every day.
But it looks doable and can be scripted.
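[A minimal sketch of what such a script could look like for steps
(2)-(4) above. All paths and the pack name are illustrative stand-ins:
it builds a throwaway layout with empty files in place of a real pack,
since step (1) - packing only the to-be-removed objects - has no canned
git command.]

```shell
#!/bin/sh
# Sketch of distributing a "removable" pack (steps 2-4 above).
# STORE, REPOS and PACK are hypothetical; replace with real paths.
set -e
STORE=demo/shared.git
REPOS="demo/a demo/b"
PACK=pack-removable            # hypothetical pack from step (1)

# throwaway layout with empty files standing in for the real pack
mkdir -p "$STORE/objects/pack" demo/backup
: > "$STORE/objects/pack/$PACK.pack"
: > "$STORE/objects/pack/$PACK.idx"
for r in $REPOS; do mkdir -p "$r/.git/objects/pack"; done

# step (2): hardlink (or copy) pack + index into each borrowing repo,
# with a .keep file so their own gc/repack leaves the pack alone
for r in $REPOS; do
    dst=$r/.git/objects/pack
    for ext in pack idx; do
        ln "$STORE/objects/pack/$PACK.$ext" "$dst/" 2>/dev/null \
            || cp "$STORE/objects/pack/$PACK.$ext" "$dst/"
    done
    touch "$dst/$PACK.keep"
done

# step (3): move the pack out of the store into a backup location
mv "$STORE/objects/pack/$PACK.pack" "$STORE/objects/pack/$PACK.idx" \
    demo/backup/

# step (4): once every repo has the pack, the .keep files can go
for r in $REPOS; do rm "$r/.git/objects/pack/$PACK.keep"; done
```

Because step (2) hardlinks where it can, the borrowing repos keep their
copies even after step (3) moves the store's copy away.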
Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?
hi,

Junio C Hamano <gitster at pobox.com> writes:
> The alternates mechanism [...]

sorry for the somewhat late response - i found this thread only now.

at qt-project.org we have a somewhat peculiar setup: we have the qt4
repository, and a bunch of qt5 repositories which resulted from a
split. qt5 is under active development, but qt4 is still maintained.
that means that we need to cherry-pick between those repositories
quite a lot. for an optimal cherry-picking experience one needs
three-way merging, which means we need shared object stores. which is
where the problems start:

my first approach was just a common objects/ directory with all
repositories symlinking into it. problems:
- the object store can never be garbage-collected. with a lot of heavy
  rebasing and temporarily added remotes, it gets messy after a while.
- there is a constant risk of destroying the object store by
  inadvertently running git gc - which is particularly likely with
  git-gui, as it seems to be retarded enough to ignore the auto-gc
  setting.

so the second approach is the bare aggregator repo which adds all
other repos as remotes, and the other repos link back via alternates.
problems:
- to actually share objects, one always needs to push to the
  aggregator
- tags having a shared namespace doesn't actually work, because the
  repos have the same tags on different commits (they are independent
  repos, after all)
- one still cannot safely garbage-collect the aggregator, as the refs
  don't include the stashes and the index, so rebasing may invalidate
  these more transient objects.

i would re-propose hallvard's volatile alternates (at least i think
that's what he was talking about two weeks ago): they can be used to
obtain objects, but every object which is in any way referenced from
the current clone must be available locally (or from a regular
alternate). that means that diffing, etc. would get objects only
temporarily, while cherry-picking would actually copy (some of) the
objects.
this would make it possible to cross-link repositories, safely and
without any 3rd parties.

thoughts?

regards
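[For reference, the "second approach" described above can be set up
with stock git roughly as follows. Repo names (qt4, qtbase) and the
agg-demo directory are illustrative stand-ins; the key detail is that
the alternates path is relative to the borrower's .git/objects
directory.]

```shell
#!/bin/sh
# Sketch of the "bare aggregator" layout: the aggregator pulls from
# every repo, and every repo borrows the aggregator's objects.
# All names here (agg-demo, qt4, qtbase) are illustrative.
set -e
mkdir -p agg-demo
git init --bare -q agg-demo/aggregator.git

for r in qt4 qtbase; do
    git init -q "agg-demo/$r"
    # the aggregator adds every repo as a remote ...
    git -C agg-demo/aggregator.git remote add "$r" "../$r"
    # ... and every repo links back via alternates; the path is
    # relative to the borrower's .git/objects directory
    echo ../../../aggregator.git/objects \
        >> "agg-demo/$r/.git/objects/info/alternates"
done

# sharing only happens once the aggregator has actually fetched
git -C agg-demo/qt4 -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m initial
git -C agg-demo/aggregator.git fetch -q qt4
```

This also makes the first complaint concrete: until that final fetch
(or a push into the aggregator) runs, nothing is shared.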
Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?
Junio C Hamano wrote:
> Some ideas:
> - Make "clone --reference" without "-s" not to borrow from the
>   reference repository. (...)

Generalize: Introduce volatile alternate object stores. Commands like
(remote) fetch, repack, gc will copy desired objects they see there.

That allows pruneable alternates if people want them: Make every
borrowing repo also borrow from a companion volatile store. To prune
some shared objects: Move them from the alternate to the volatile.
Repack or gc all borrowing repos. Empty the volatile alternate.

Similar to detach from one alternate repo while keeping others: gc
with the to-be-dropped alternate as a volatile.

Also it gives a simple way to try to repair a repo with missing
objects, if you have some other repositories which might have the
objects: Repack with the other repositories as volatile alternates.

BTW, if a wanted object disappears from the volatile alternate while
fetch is running, fetch should get it from the remote after all.

> - Make the distinction between a regular repository and an "object
>   store" that is meant to be used for object sharing stronger.
>   Perhaps a configuration item "core.objectstore = readonly" can be
>   introduced, and we forbid "clone -s" from pointing at a repository
>   without such a configuration. We also forbid object pruning
>   operations such as "gc" and "repack" from being run in a repository
>   marked as such.

I hope Michael's append-only/"donor" is feasible instead. In which
case safer gc/repack are needed, like you outline:

>   It may be necessary to allow some special kind of repacking of such
>   a readonly object store, in order to reduce the number of packfiles
>   (and get rid of loose object files); it needs to be implemented
>   carefully not to lose any object, regardless of local reachability.

And it needs to be default behavior in such stores, so users won't
need don't-shoot-myself-in-foot options.
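[Volatile alternates do not exist in stock git; the nearest existing
equivalent of "detach from an alternate" is to localize all objects
with a full repack before dropping the alternates file. A sketch, with
illustrative paths (det-demo, origin, borrower):]

```shell
#!/bin/sh
# Detaching a borrower from its alternate with stock git.
# All paths are throwaway stand-ins for real repositories.
set -e
mkdir -p det-demo
git init -q det-demo/origin
git -C det-demo/origin -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m initial

# "clone -s" borrows via objects/info/alternates instead of copying
git clone -q -s det-demo/origin det-demo/borrower

# repack -a -d (without -l!) writes all reachable objects, including
# borrowed ones, into a local pack; after that the alternates file can
# be dropped safely
git -C det-demo/borrower repack -a -d -q
rm det-demo/borrower/.git/objects/info/alternates
```

Afterwards the borrower is fully self-contained, which is the property
the volatile-alternates proposal would enforce automatically.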
> - It might not be a bad idea to have a dedicated new command to help
>   users manage alternates ("git alternates"?); obviously this will be
>   one of its subcommands "git alternates detach" if we go that route.

"git object-store <subcommand>" -- manage alternates object stores?

> - Or just an entry in the documentation is sufficient?

Better doc would be useful anyway, and this command gives a place to
put it:-) I had no idea alternates were intended to be read-only, but
that does explain some seeming defects I'd wondered about.

> - When you have two or more repositories that do not share objects,
>   you may want to rearrange things so that they share their objects
>   from a single common object store. There is no direct UI to do
>   this, as far as I know. You can obviously create a new bare
>   repository, push there from all of these repositories, and then
>   borrow from there, e.g.
>
>       git --bare init shared.git
>       for r in a.git b.git c.git ...
>       do
>           ( cd $r
>             git push ../shared.git "refs/*:refs/remotes/$r/*"
>             echo ../../../shared.git/objects > .git/objects/info/alternates
>           )
>       done
>
>   And then repack shared.git once.

...and finally gc the other repositories.

The refs/remotes/$r/ namespace becomes misleading if the user renames
or copies the corresponding Git repository, and then cleverly does
something to the shared repo and the repo (if any) in directory $r. I
suggest refs/remotes/$unique_number/ and note $unique_number somewhere
in the borrowing repo. If someone insists on being clever, this may
force them to read up on what they're doing first.

Or store no refs, since the shared repo shouldn't lose objects anyway.
If we're sure objects won't be lost: Create a proper remote with the
shared repo. That way the user can push into it once in a while, and
he can configure just which refs should be shared.
> Some ideas:
> - (obvious: give a canned command to do the above, perhaps then set
>   the core.objectstore=readonly in the resulting shared.git)

That's getting closer to 'bzr init-repository': One dir with the
shared repo and all borrowing repositories. A simple model which Git
can track and the user need not think further about. This way, git
clone/init of a new repo in this dir can learn to notice and use the
shared repo.

We can also have a command (git object-store?) to maintain the
repository collection, since Git knows where to find them all: Push
from all repos into the shared repo, gc all repos, even prune unused
objects from the shared repo - after implementing sufficient paranoia.

> - When you have one object store and a repository that does not yet
>   borrow from it, you may want to make the repository borrow from the
>   object store. Obviously you can run "echo" like the sample script
>   in the previous item above, but it is not obvious how to perform
>   the logical next step of shrinking $GIT_DIR/objects of the
>   repository that
Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?
[..]
> - By design, the borrowed object store MUST not ever lose any object
>   from it, as such an object loss can corrupt the borrowing
>   repositories. In theory, it is OK for the object store whose
>   objects are borrowed by repositories to acquire new objects, but
>   losing existing objects is an absolute no-no.
[...]
>   In practice, this means that users who use "clone -s" to make a new
>   repository can *never* prune the original repository without
>   risking to corrupt its borrowing repository [*1*].
[...]
> Given your example of /git/linux.git being a clone of Linus'
> repository, cloning a related repository using it as --reference:
>
>     $ cd /git
>     $ git clone --reference /git/linux.git git://k.org/linux-next.git mine

Wouldn't it be by far a less intrusive alternative to do the following
(in the clone step above):

- Create the file /git/linux.git/objects/borrowing/_git_mine (this is
  where we borrow FROM). This file would hold a packed-ref list of
  HEADs from the /git/mine clone of the repository. _git_mine here is
  a slash-stripped version of the destination path. Maybe the
  packed-ref format could also be extended by a single line containing
  a full path to the foreign repository.

- On every update-ref to /git/mine, update the 'borrowing' refs in
  /git/linux.git.

- On any maintenance on /git/linux.git (gc, prune, repack, etc.),
  consider refs in the packed-refs at objects/borrowing to be valid
  references. If the packed-ref format was extended as stated above,
  we could stat() here to check whether this directory still exists
  and error out if it doesn't (in this case the user should tell us if
  she moved or removed the clone).

Any alternatives for looking up the packed-refs list for borrowing
would also be doable; i.e. putting the list of valid
borrowing-packed-refs files into the config file (as opposed to
looking up $GIT_DIR/objects/borrowing above).
Putting this list into the config file would eliminate the need for
the packed-ref format change and give the user the ability to maintain
her clones with the well-known command 'git config'.
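[If the config-file variant were chosen, the bookkeeping might look
something like this. The key name "objectstore.borrowingrefs" is
entirely invented for illustration - no version of git attaches any
meaning to it, but "git config" stores multi-valued keys like this
today.]

```shell
#!/bin/sh
# Hypothetical bookkeeping for the config-file variant above.
# "objectstore.borrowingrefs" is an invented key name; the repo path
# (bor-demo/linux.git) is likewise illustrative.
set -e
mkdir -p bor-demo
git init -q bor-demo/linux.git

# register one entry per borrowing repo
git -C bor-demo/linux.git config --add \
    objectstore.borrowingrefs objects/borrowing/_git_mine
git -C bor-demo/linux.git config --add \
    objectstore.borrowingrefs objects/borrowing/_git_next

# maintenance tooling would read the list back before pruning:
git -C bor-demo/linux.git config --get-all objectstore.borrowingrefs
```

The user can then inspect and repair the list with plain "git config",
which is the whole point of this variant.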
Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?
On Sun, Aug 05, 2012 at 11:38:12AM +0200, Michael Haggerty wrote:
> I have some other crazy ideas for making the concept even more
> powerful:
>
> * Support remote alternate repositories. Local repository obtains
>   missing objects from the remote as needed. This would probably be
>   insanely inefficient without also supporting...
>
> * Lazy copying of borrowed objects to the local repository. Any
>   object fetched from the alternate object store is copied to the
>   local object store.
>
> Together, I think that these two features would give fully-functional
> shallow clones.

You might be interested in looking at my rough (_very_ rough)
experiment with object db hooks:

  https://github.com/peff/git/commits/jk/external-odb

The basic idea is to have helper programs that basically have two
commands: give a list of sha1s you can provide, and fetch a specific
object by sha1. That's enough for the low levels of git to fall back
to a helper on an object lookup failure, and copy the object to a
local cache. Managing the cache could be done externally by
helper-specific code.

Sorry, there's no documentation on the format or behavior, and most of
the changes are in one big patch. If you're interested and find it
unreadable, I can try to clean it up.

-Peff
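[A sketch of a helper with the two-command shape described above. The
command names ("have", "get") and the flat sha1-named object directory
are guesses for illustration only - as Peff says, the real
jk/external-odb interface is undocumented.]

```shell
#!/bin/sh
# Toy external-odb-style helper: "have" lists the sha1s it can
# provide, "get <sha1>" emits that object's bytes. The store layout
# (a flat directory of files named by sha1) is invented.
odb_helper() {
    store=${ODB_STORE:-odb-store}   # hypothetical object directory
    case "$1" in
    have) ls "$store" 2>/dev/null ;;   # one sha1 per line
    get)  cat "$store/$2" ;;           # raw object contents to stdout
    *)    echo "usage: odb_helper have | get <sha1>" >&2; return 1 ;;
    esac
}
```

A caller (here, git's object-lookup fallback) would run "have" once to
learn what the helper offers, and "get" per missing object, caching
the result locally.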
Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?
Junio C Hamano <gits...@pobox.com> writes:
> - When you have one object store and a repository that does not yet
>   borrow from it, you may want to make the repository borrow from the
>   object store. Obviously you can run "echo" like the sample script
>   in the previous item above, but it is not obvious how to perform
>   the logical next step of shrinking $GIT_DIR/objects of the
>   repository that now borrows the objects. I think "git repack -a -d"
>   is the way to do this, but if you compare this command to "git
>   repack -a -d -f" we saw previously in this message, it is not
>   surprising that the users would be confused---it is not obvious at
>   all. Some ideas:
>   - (obvious: give a canned subcommand to do this)

The analysis of this item is wrong, I think. "git repack -a -d -l"
should be the way to do so.

The message looks wrong when it turns out that there is no need to
have any object in the borrowing repository, though. We only see
"Nothing new to pack" (which technically is correct), and the command
exits successfully. You can peek .git/objects/ to find out that all
the objects the borrower used to have its own copy of are now gone
(because they are available at the alternate), but the message gives a
false impression that we thought about doing something, found nothing
new to be packed, and gave up without doing anything.

But that is not what is happening. We traversed the connectivity,
found that all the objects necessary for our history are housed in our
alternates, gave "Nothing new to pack" (because we do not have to have
any object on our own), and then removed all the object files and
packs in our repository.
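[The shrink-the-borrower step described above, as a runnable sketch
with illustrative paths (shrink-demo, store, borrower):]

```shell
#!/bin/sh
# Shrinking a repo that newly borrows from an alternate, using
# "git repack -a -d -l". Paths are throwaway stand-ins.
set -e
mkdir -p shrink-demo
git init -q shrink-demo/store
git -C shrink-demo/store -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m initial

# an ordinary clone keeps its own copies of all objects
git clone -q --no-local shrink-demo/store shrink-demo/borrower

# make the borrower borrow from the store (absolute path, since the
# entry is resolved relative to the borrower's .git/objects otherwise)
echo "$PWD/shrink-demo/store/.git/objects" \
    > shrink-demo/borrower/.git/objects/info/alternates

# -l packs only objects not available from alternates; with everything
# borrowed this reports "Nothing new to pack", and -d then drops the
# now-redundant local copies, as described in the message above
git -C shrink-demo/borrower repack -a -d -l -q
```

The repository stays fully usable afterwards - every lookup is simply
satisfied through the alternate.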
Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?
On 08/05/2012 06:56 AM, Junio C Hamano wrote:
> The alternates mechanism [...] The UI for this mechanism however has
> some room for improvement, and we may want to start improving it for
> the next release after the upcoming Git 1.7.12 (or even Git 2.0 if
> the change is a large one that may be backward incompatible but gives
> us a vast improvement). Here are some random thoughts as a discussion
> starter.
[...]
> - Make the distinction between a regular repository and an "object
>   store" that is meant to be used for object sharing stronger.
>   Perhaps a configuration item "core.objectstore = readonly" can be
>   introduced, and we forbid "clone -s" from pointing at a repository
>   without such a configuration. We also forbid object pruning
>   operations such as "gc" and "repack" from being run in a repository
>   marked as such.

Must the repository necessarily be readonly? It seems that it would be
permissible to push new objects to such a repository; just not to
delete existing objects. Thus maybe another term would be better to
describe such a repository, like "appendonly" or "noprune" or even
something more abstract like "donor".

I have some other crazy ideas for making the concept even more
powerful:

* Support remote alternate repositories. Local repository obtains
  missing objects from the remote as needed. This would probably be
  insanely inefficient without also supporting...

* Lazy copying of borrowed objects to the local repository. Any
  object fetched from the alternate object store is copied to the
  local object store.

Together, I think that these two features would give fully-functional
shallow clones. Such alternates could even be chained together: for
example, keep a single local lazy clone of the upstream repository
somewhere on your site or on your computer, and use that as a
read-through cache for other clones.

* To help manage local disk space, allow intelligent curation of the
  objects kept in the local store when they are also available in the
  alternate.
The criteria for what to keep could be things like "revisions with
depth <= 20 on branches X, Y/*, and Z", "objects that have been
accessed within the last 3 months", "all tag objects
refs/tags/release-*". It should be possible to cull objects not
meeting the criteria with or without actively fetching all objects
meeting the criteria. Probably the criteria would be stored in the
configuration to be reused (and perhaps run as part of "git gc").

This would cure a lot of "storing big, non-deltaable files" pain
because big blobs could be stored on a central server without
multiplying the size of every clone.

Michael

--
Michael Haggerty
mhag...@alum.mit.edu
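[As a thought experiment, the curation criteria above might be spelled
as configuration. Every key here is hypothetical - no version of git
understands any of this - it only illustrates the shape such settings
could take.]

```ini
# hypothetical curation settings -- not understood by any git version
[objectcuration]
	keepRevs = depth<=20:refs/heads/X
	keepRevs = depth<=20:refs/heads/Y/*
	keepRevs = depth<=20:refs/heads/Z
	keepAccessedWithin = 3.months
	keepTags = refs/tags/release-*
```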
Re: Bringing a bit more sanity to $GIT_DIR/objects/info/alternates?
Michael Haggerty <mhag...@alum.mit.edu> writes:
> I have some other crazy ideas for making the concept even more
> powerful:

Sorry, but the "a bit more sanity" topic is not interested in making
the concept powerful at all. This is about making it usable with ease,
without the user having to worry about "oh, I was about to shoot
myself in the foot by running repack; it is good that I remembered
objects in this repository are borrowed by other repositories" and
things like that.

For the purpose of the "a bit more sanity" topic, adding new things
users have to worry about to the mix, e.g. "what happens if my network
goes away? I can afford not to have access to these kinds of objects
for a while, but I must always have access to those objects, so I can
borrow the former but not the latter", is going in the other way.

The ideas in your messages are *not* useless. Enhancements along those
lines may be useful, but they do not fit in the same discussion of
making the current mechanism simpler and easier for users to use in a
safe and sane way.