Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
On 7/31/2017 5:02 PM, Jonathan Tan wrote:
> Besides review changes, this patch set now includes my rewritten
> lazy-loading sha1_file patch, so you can now do this (excerpted from
> one of the tests):
>
>     test_create_repo server
>     test_commit -C server 1 1.t abcdefgh
>     HASH=$(git hash-object server/1.t)
>
>     test_create_repo client
>     test_must_fail git -C client cat-file -p "$HASH"
>     git -C client config core.repositoryformatversion 1
>     git -C client config extensions.lazyobject \
>         "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
>     git -C client cat-file -p "$HASH"
>
> with fsck still working. Also, there is no need for a list of promised
> blobs, and the long-running process protocol is being used.
>
> Changes from v1:
> - added last patch that supports lazy loading
> - clarified documentation in "introduce lazyobject extension" patch
>   (following Junio's comments [1])
>
> As listed in the changes above, I have rewritten my lazy-loading
> sha1_file patch to no longer use the list of promises. Also, I have
> added documentation about the protocol used to (hopefully) the
> appropriate places.

Glad to see the removal of the promises. Given the ongoing
conversation, I'm interested to see how you are detecting locally
created objects vs those downloaded from a server.

> This is a minimal implementation, hopefully enough of a foundation to
> be built upon. In particular, I haven't added the environment variable
> to suppress lazy loading, and the lazy loading protocol only supports
> one object at a time. We can add multiple object support to the
> protocol when we get to the point that we have code that will utilize
> it.
>
> Other work
> ----------
>
> This differs slightly from Ben Peart's patch [2] in that the
> lazy-loading functionality is provided through a configured shell
> command instead of a hook shell script. I envision commands like "git
> clone", in the future, needing to pre-configure lazy loading, and I
> think that it will be less surprising to the user if "git clone" wrote
> a default configuration instead of a default hook.
This was on my "todo" list to investigate as I've been told it can
enable people to use taskset to set CPU affinity and get some
significant performance wins. I'd be interested to see if it actually
helps here at all.

> This also differs from Christian Couder's patch set [3] that implement
> a larger-scale object database, in that (i) my patch set does not
> support putting objects into external databases, and (ii) my patch set
> requires the lazy loader to make the objects available in the local
> repo, instead of allowing the objects to only be stored in the
> external database.

This is the model we're using today so I'm confident it will meet our
requirements.

> [1] https://public-inbox.org/git/xmqqzibpn1zh@gitster.mtv.corp.google.com/
> [2] https://public-inbox.org/git/20170714132651.170708-2-benpe...@microsoft.com/
> [3] https://public-inbox.org/git/20170620075523.26961-1-chrisc...@tuxfamily.org/
>
> Jonathan Tan (5):
>   environment, fsck: introduce lazyobject extension
>   fsck: support refs pointing to lazy objects
>   fsck: support referenced lazy objects
>   fsck: support lazy objects as CLI argument
>   sha1_file: support loading lazy objects
>
>  Documentation/Makefile                             |   1 +
>  Documentation/gitattributes.txt                    |  54 ++
>  Documentation/gitrepository-layout.txt             |   3 +
>  .../technical/long-running-process-protocol.txt    |  50 +
>  Documentation/technical/repository-version.txt     |  23 +
>  Makefile                                           |   1 +
>  builtin/cat-file.c                                 |   2 +
>  builtin/fsck.c                                     |  25 -
>  cache.h                                            |   4 +
>  environment.c                                      |   1 +
>  lazy-object.c                                      |  80 +++
>  lazy-object.h                                      |  12 +++
>  object.c                                           |   7 ++
>  object.h                                           |  13 +++
>  setup.c                                            |   7 +-
>  sha1_file.c                                        |  44 +---
>  t/t0410-lazy-object.sh                             | 113 +
>  t/t0410/lazy-object                                | 102 +++
>  18 files changed, 478 insertions(+), 64 deletions(-)
>  create mode 100644 Documentation/technical/long-running-process-protocol.txt
>  create mode 100644 lazy-object.c
>  create mode 100644 lazy-object.h
>  create mode 100755 t/t0410-lazy-object.sh
>  create mode 100755 t/t0410/lazy-object
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
On Wed, 02 Aug 2017 13:51:37 -0700 Junio C Hamano wrote:
> > The complication is in the "git gc" operation for the case (*).
> > Today, "git gc" uses a reachability walk to decide which objects to
> > remove --- an object referenced by no other object is fair game to
> > remove. With (*), there is another kind of object that must not be
> > removed: if an object that I made, M, points to a missing/promised
> > object, O, pointed to by an object I fetched, F, then I cannot prune
> > F unless there is another fetched object present to anchor O.
>
> Absolutely. Lazy-objects support comes with certain cost and this
> is one of them.
>
> But I do not think it is realistic to expect that you can prune
> anything you fetched from the "other place" (i.e. the source
> 'lazy-objects' hook reads from). After all, once they give out
> objects to their clients (like us in this case), they cannot prune
> it, if we take the "implicit promise" approach to avoid the cost to
> transmit and maintain a separate "object list".

By this, do you mean that the client cannot prune anything lazily
loaded from the server? If yes, I understand that the server cannot
prune anything (for the reasons that you describe), but I don't see how
that applies to the client.

> > For example: suppose I have a sparse checkout and run
> >
> >     git fetch origin refs/pulls/x
> >     git checkout -b topic FETCH_HEAD
> >     echo "Some great modification" >> README
> >     git add README
> >     git commit --amend
> >
> > When I run "git gc", there is nothing pointing to the commit that
> > was pointed to by the remote ref refs/pulls/x, so it can be pruned.
> > I would naively also expect that the tree pointed to by that commit
> > could be pruned. But pruning it means pruning the promise that made
> > it permissible to lack various blobs that my topic branch refers to
> > that are outside the sparse checkout area. So "git gc" must notice
> > that it is not safe to prune that tree.
> >
> > This feels hacky.
> > I prefer the promised object list over this
> > approach.
>
> I think they are moral equivalents implemented differently with
> different assumptions. The example we are discussing makes an extra
> assumption: In order to reduce the cost of transferring and
> maintaining the list, we assume that all objects that came during
> that transfer are implicitly "promised", i.e. everything behind each
> of these objects will later be available on demand. How these
> objects are marked is up to the exact mechanism (my preference is to
> mark the resulting packfile as special; Jon Tan's message to which
> my message was a response alluded to using an alternate object
> store). If you choose to maintain a separate "object list" and have
> the "other side" explicitly give it, perhaps you can lift that
> assumption and replace it with some other assumption that assumes
> less.

Marking might be an issue if we expect the lazy loader to emit an
object after every hash, like in the current design. There would thus
be one mark per object, with overhead similar to the promise list.
(Having said that, batching is possible - I plan to optimize common
cases like checkout, and have such a patch online for an older
"promised blob" design [1].)

Overhead could be reduced by embedding the mark in both the packed and
loose objects, requiring a different format (instead of having a
separate "catalog" of marked files). This seems more complicated than
using an alternate object store, hence my suggestion.

[1] https://github.com/jonathantanmy/git/commit/14f07d3f06bc3a1a2c9bca85adc8c42b230b9143
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
Junio C Hamano wrote:
> Jonathan Nieder writes:
>> Can you spell this out more? To be clear, are you speaking as a
>> reviewer or as the project maintainer? In other words, if other
>> reviewers are able to settle on a design that involves a relaxed
>> guarantee for fsck in this mode that they can agree on, does this
>> represent a veto meaning the patch can still not go through?
>
> Consider it a veto over punting without making sure that we can
> later come up with a solution to give such a guarantee. I am not
> getting a feeling that "other reviewers" are even seeking a "relaxed
> guarantee"---all I've seen in the thread is to give up any guarantee
> and to hope for the best.

Thank you. That makes sense.

In my defense, one reason I had for being okay with dropping the
connectivity check in the "lazy object" setup (at a higher level than
this patch currently does it, to avoid wasted work) is that this patch
series does not include the required components to do it more properly,
and previous discussions on list had pointed to some of those
components that will arrive later (the "object size cache", which
doubles as an incomplete list of promises). But that doesn't put the
project in a good position because it isn't an explicitly spelled out
plan. The set of other reviewers that I was hoping will weigh in at
some point is the GVFS team at Microsoft.

I'll write up a summary of the ideas discussed so far to try to get
this unblocked.

Sincerely,
Jonathan
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
Jonathan Nieder writes:
> Junio C Hamano wrote:
>> Jonathan Tan writes:
>>> One possibility to conceptually have the same thing without the
>>> overhead of the list is to put the obtained-from-elsewhere objects
>>> into its own alternate object store, so that we can distinguish the
>>> two.
>>
>> Now you are talking. Either a separate object store, or a packfile
>> that is specially marked as such, would work.
>
> Jonathan's not in today, so let me say a few more words about this
> approach.
>
> This approach implies a relaxed connectivity guarantee, by creating
> two classes of objects:
>
> 1. Objects that I made should satisfy the connectivity check. They
>    can point to other objects I made, objects I fetched, or (*)
>    objects pointed to directly by objects I fetched. More on (*)
>    below.

Or objects that are referred to by objects I fetched. If you narrowly
clone while omitting a subdirectory, updated a file that is outside the
subdirectory, and created a new commit, while recording the same tree
object name for the directory you do not know its contents (because you
didn't fetch), then it is OK for the top-level tree of the resulting
commit you created to be pointing at the tree that represents the
subdirectory you never touched.

> The complication is in the "git gc" operation for the case (*).
> Today, "git gc" uses a reachability walk to decide which objects to
> remove --- an object referenced by no other object is fair game to
> remove. With (*), there is another kind of object that must not be
> removed: if an object that I made, M, points to a missing/promised
> object, O, pointed to by an object I fetched, F, then I cannot prune
> F unless there is another fetched object present to anchor O.

Absolutely. Lazy-objects support comes with certain cost and this
is one of them.

But I do not think it is realistic to expect that you can prune
anything you fetched from the "other place" (i.e. the source
'lazy-objects' hook reads from).
After all, once they give out objects to their clients (like us in
this case), they cannot prune it, if we take the "implicit promise"
approach to avoid the cost to transmit and maintain a separate "object
list".

> For example: suppose I have a sparse checkout and run
>
>     git fetch origin refs/pulls/x
>     git checkout -b topic FETCH_HEAD
>     echo "Some great modification" >> README
>     git add README
>     git commit --amend
>
> When I run "git gc", there is nothing pointing to the commit that was
> pointed to by the remote ref refs/pulls/x, so it can be pruned. I
> would naively also expect that the tree pointed to by that commit
> could be pruned. But pruning it means pruning the promise that made
> it permissible to lack various blobs that my topic branch refers to
> that are outside the sparse checkout area. So "git gc" must notice
> that it is not safe to prune that tree.
>
> This feels hacky. I prefer the promised object list over this
> approach.

I think they are moral equivalents implemented differently with
different assumptions. The example we are discussing makes an extra
assumption: In order to reduce the cost of transferring and
maintaining the list, we assume that all objects that came during
that transfer are implicitly "promised", i.e. everything behind each
of these objects will later be available on demand. How these
objects are marked is up to the exact mechanism (my preference is to
mark the resulting packfile as special; Jon Tan's message to which
my message was a response alluded to using an alternate object
store). If you choose to maintain a separate "object list" and have
the "other side" explicitly give it, perhaps you can lift that
assumption and replace it with some other assumption that assumes
less.

> Can you spell this out more? To be clear, are you speaking as a
> reviewer or as the project maintainer?
> In other words, if other
> reviewers are able to settle on a design that involves a relaxed
> guarantee for fsck in this mode that they can agree on, does this
> represent a veto meaning the patch can still not go through?

Consider it a veto over punting without making sure that we can
later come up with a solution to give such a guarantee. I am not
getting a feeling that "other reviewers" are even seeking a "relaxed
guarantee"---all I've seen in the thread is to give up any guarantee
and to hope for the best.
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
Hi,

Junio C Hamano wrote:
> Jonathan Tan writes:
>> One possibility to conceptually have the same thing without the
>> overhead of the list is to put the obtained-from-elsewhere objects
>> into its own alternate object store, so that we can distinguish the
>> two.
>
> Now you are talking. Either a separate object store, or a packfile
> that is specially marked as such, would work.

Jonathan's not in today, so let me say a few more words about this
approach.

This approach implies a relaxed connectivity guarantee, by creating
two classes of objects:

 1. Objects that I made should satisfy the connectivity check. They
    can point to other objects I made, objects I fetched, or (*)
    objects pointed to directly by objects I fetched. More on (*)
    below.

 2. Objects that I fetched do not need to satisfy a connectivity
    check. I can trust the server to provide objects that they point
    to when I ask for them, except in extraordinary cases like a
    credit card number that was accidentally pushed to the server and
    prompted a rewriting of history to remove it (**).

The guarantee (1) looks like it should be easy to satisfy (just like
the current connectivity guarantee where all objects are in class (1)).
I have to know about an object to point to it --- that means the
pointed-to object has to be in the object store or pointed to by
something in the object store.

The complication is in the "git gc" operation for the case (*).
Today, "git gc" uses a reachability walk to decide which objects to
remove --- an object referenced by no other object is fair game to
remove. With (*), there is another kind of object that must not be
removed: if an object that I made, M, points to a missing/promised
object, O, pointed to by an object I fetched, F, then I cannot prune
F unless there is another fetched object present to anchor O.
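[Editorial note: the prune rule above can be sketched as a toy
reachability walk. This is illustrative Python, not Git code; the
object names M, F, O and the edge/set representation follow the
paragraph above and are otherwise made up.]

```python
# Toy model of the gc complication: M is a commit I made, F is a tree
# I fetched, and O is a missing/promised object both reference.
# "Fetched" objects implicitly promise everything they point to.
edges = {"M": ["O"], "F": ["O"]}   # object -> objects it references
fetched = {"F"}
have = {"M", "F"}                  # O is absent from the local store
roots = {"M"}                      # only my own ref points anywhere

def reachable(tips):
    """Objects reachable from tips, traversing only what we have."""
    seen, stack = set(), list(tips)
    while stack:
        o = stack.pop()
        if o in seen or o not in have:
            continue
        seen.add(o)
        stack.extend(edges.get(o, []))
    return seen

live = reachable(roots)            # {"M"}: F looks like garbage
# Promised objects still needed by live objects...
promised = {t for o in live for t in edges.get(o, []) if t not in have}
# ...keep any fetched object that anchors such a promise.
anchors = {f for f in fetched for m in promised if m in edges.get(f, [])}
safe_to_prune = have - live - anchors
```

Naive reachability would prune F, but `safe_to_prune` comes out empty
here because F is the only fetched object anchoring the promise for O.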
For example: suppose I have a sparse checkout and run

    git fetch origin refs/pulls/x
    git checkout -b topic FETCH_HEAD
    echo "Some great modification" >> README
    git add README
    git commit --amend

When I run "git gc", there is nothing pointing to the commit that was
pointed to by the remote ref refs/pulls/x, so it can be pruned. I
would naively also expect that the tree pointed to by that commit
could be pruned. But pruning it means pruning the promise that made
it permissible to lack various blobs that my topic branch refers to
that are outside the sparse checkout area. So "git gc" must notice
that it is not safe to prune that tree.

This feels hacky. I prefer the promised object list over this
approach.

> "Maintaining a list
> of object names in a flat file is too costly" is not a valid excuse
> to discard the integrity of locally created objects, without which
> Git will no longer be a version control system,

I am confused by this: I think that Git without a "git fsck" command
at all would still be a version control system, just not as good of
one.

Can you spell this out more? To be clear, are you speaking as a
reviewer or as the project maintainer? In other words, if other
reviewers are able to settle on a design that involves a relaxed
guarantee for fsck in this mode that they can agree on, does this
represent a veto meaning the patch can still not go through?

On one hand I'm grateful for the help exploring the design space, and
I think it has helped get a better understanding of the issues
involved. On the other hand, if this is a veto then it feels very
black and white and a hard starting point to build a consensus from.
I am worried.

[...]
>> I mentioned
>> this in my e-mail but rejected it, but after some more thought, this
>> might be sufficient - we might still need to iterate through every
>> object to know exactly what we can assume the remote to have, but
>> the "frontier" solution also needs this iteration, so we are no
>> worse off.
>
> Most importantly, this is allowed to be costly---we are doing this
> not at runtime all the time, but when the user says "make sure that
> I haven't lost objects and it is safe for me to build further on
> what I created locally so far" by running "git fsck".

check_everything_connected is also used in some other circumstances:
e.g. when running a fetch, and when receiving a push in git
receive-pack.

Thanks,
Jonathan
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
Jonathan Tan writes:
> One possibility to conceptually have the same thing without the
> overhead of the list is to put the obtained-from-elsewhere objects
> into its own alternate object store, so that we can distinguish the
> two.

Now you are talking. Either a separate object store, or a packfile
that is specially marked as such, would work. "Maintaining a list
of object names in a flat file is too costly" is not a valid excuse
to discard the integrity of locally created objects, without which
Git will no longer be a version control system, and your "We have to
trust the sender of objects on the other side anyway when operating
in lazy-objects mode, so devise a way that we can use to tell which
objects we _know_ the other side has" that leads to the above idea
is a good line of thought.

> I mentioned
> this in my e-mail but rejected it, but after some more thought, this
> might be sufficient - we might still need to iterate through every
> object to know exactly what we can assume the remote to have, but the
> "frontier" solution also needs this iteration, so we are no worse off.

Most importantly, this is allowed to be costly---we are doing this
not at runtime all the time, but when the user says "make sure that
I haven't lost objects and it is safe for me to build further on
what I created locally so far" by running "git fsck".
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
On Tue, 01 Aug 2017 10:11:38 -0700 Junio C Hamano wrote:
> Let's step back a bit and think what already happens in the pre-
> lazy-object world. We record cut-off commits when a depth limited
> clone is created in "shallow". These essentially are promises,
> saying something like:
>
>     Rest assured that everything in the history behind these commits
>     are on the other side and you can retrieve them by unshallowing.
>
>     If you traverse from your local tips and find no missing objects
>     before reaching one of these commits, then you do not have any
>     local corruption you need to worry about.
>
> the other end made to us, when the shallow clone was made. And we
> take this promise and build more commits on top, and then we adjust
> these cut-off commits incrementally as we deepen our clone or make
> it even shallower. For this assurance to work, we of course need to
> assume a bit more than what we assume for a complete clone, namely,
> the "other side" will hold onto the history behind these, i.e. does
> not rewind the tips it already has shown to us, or even if it does,
> the objects that are reachable from these cut-off points will
> somehow always be available to us on demand.
>
> Can we do something similar, i.e. maintain minimum set of cut-off
> points and adjust that set incrementally, just sufficient to ensure
> the integrity of objects locally created and not yet safely stored
> away by pushing them the "other side"?

This suggestion (the "frontier" of what we have) does seem to incur
less overhead than the original promise suggestion (the "frontier" of
what we don't have), but after some in-office discussion, I'm
convinced that it might not be the case - for example, one tree (that
we have) might reference many blobs (that we don't have), but at the
same time, many trees (that we have) might have the same blob (that we
don't have). And the promise overhead was already decided to be too
much - which is why we moved away from it.
One possibility to conceptually have the same thing without the
overhead of the list is to put the obtained-from-elsewhere objects
into its own alternate object store, so that we can distinguish the
two. I mentioned this in my e-mail but rejected it, but after some
more thought, this might be sufficient - we might still need to
iterate through every object to know exactly what we can assume the
remote to have, but the "frontier" solution also needs this iteration,
so we are no worse off.

Going back to the original use cases that motivated this (the monorepo
like Microsoft's repo and the large-blob repo like Android's repo), it
might be better just to disable the connectivity check when
extensions.lazyObject is set (as you mentioned). This does change the
meaning of fsck, but it may be fine since the "meaning" of the repo (a
view of another repo, and no longer a full repo) has changed too. Then
this patch set will be more about ensuring that the lazy object loader
is not inadvertently run. As future work, we could add diagnostics
that, for example, attempt a walk anyway and print a list of missing
SHA-1s.

(I suspect that we will also need to disable the connectivity check
for things like "git fetch", which means that we won't be able to tell
locally if the server sent us all the objects that we requested for.
This might not be a problem, though, since the local repo already has
some measure of trust for the server.)
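[Editorial note: the "attempt a walk anyway and print a list of
missing SHA-1s" diagnostic could look roughly like this. A toy sketch
over an in-memory object graph; the object names are made up.]

```python
# Sketch of the proposed diagnostic: with the connectivity check
# disabled under extensions.lazyObject, walk from the refs anyway and
# report objects absent locally (each is either lazily loadable from
# the server or genuinely lost -- we cannot tell offline).
def missing_objects(tips, have, edges):
    seen, missing, stack = set(), set(), list(tips)
    while stack:
        o = stack.pop()
        if o in seen:
            continue
        seen.add(o)
        if o not in have:
            missing.add(o)   # do not traverse; we lack its contents
            continue
        stack.extend(edges.get(o, []))
    return sorted(missing)

# A commit whose tree references two blobs outside our sparse checkout:
print(missing_objects(["commit"], {"commit", "tree"},
                      {"commit": ["tree"], "tree": ["blob1", "blob2"]}))
# -> ['blob1', 'blob2']
```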
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
Jonathan Nieder writes:
> If we are deeply worried about this kind of broken connectivity,
> there is another case to care about: the server can "promise" to
> serve requests for some object (e.g., the tree pointed to by the
> server's "master") and then decide it does not want to fulfill that
> promise (e.g., that tree pointed to private key material and "master"
> was rewound to avoid it).

I think I've already covered that in my message (i.e. "we need to
assume more than the normal Git"). In short, it is not our problem,
but the "lazy-object" service's problem. If developers cannot trust
the "central server", most likely owned by the organization that
employs them and forces them to offload the access to these objects to
the "central server", I think there is much larger problem there.

> In the promises model, how do we get a fresh
> understanding of what the server wants to promise now?

Yes, that is one of the things that needs to be designed if we want to
officially support lazy-objects like structure. We need a way to
incrementally adjust the cut-off point, below which it is the
responsibility of the "other side" to ensure that necessary objects
are available (on demand), and above which it is a local repository's
responsibility to notice corrupted and/or missing objects.
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
Hi,

Junio C Hamano wrote:
> Can we do something similar, i.e. maintain minimum set of cut-off
> points and adjust that set incrementally, just sufficient to ensure
> the integrity of objects locally created and not yet safely stored
> away by pushing them the "other side"?

This sounds like a variant on the "promises" idea (maintaining a list
of objects at the frontier) described before. Instead of listing blobs
that the server promised, you are proposing listing trees that the
server has promised to handle all references from.

> I haven't thought things through (and I know you, Ben and others
> have thought much longer and harder), but I would imagine if we have
> a commit object [*1*], some of whose parent commits, trees and blobs
> are locally missing, and know that the commit exists on the "other
> side", we know that all of these "missing" objects that are
> referenced by the commit are also available from the "other side".
> IOW, I suspect that the same principle "shallow" uses to give us the
> integrity guarantee can be naturally extended to allow us to see if
> a broken connectivity is OK.

If we are deeply worried about this kind of broken connectivity, there
is another case to care about: the server can "promise" to serve
requests for some object (e.g., the tree pointed to by the server's
"master") and then decide it does not want to fulfill that promise
(e.g., that tree pointed to private key material and "master" was
rewound to avoid it). In the promises model, how do we get a fresh
understanding of what the server wants to promise now?

Earlier in this discussion of fsck, I thought you were proposing a
slightly different idea. The idea I heard is that you want to check
connectivity for whatever you have built locally, while accepting a
relaxed guarantee for objects from upstream. If an object is missing,
the idea would be that at least this way you know whose fault it
is. :) (Not that there's much to do with that knowledge.)
Implementing that by treating everything reachable from a
remote-tracking branch as "from upstream" seems natural. But that
implementation suffers from the same problems: not all objects from
upstream need be reachable from a remote-tracking branch (e.g. after a
fetch-by-object-id, or because a remote branch can be rewound).

Both variants proposed of the promises idea also hold some promise,
but my understanding was that the cost of maintaining the promises
file (getting data to fill it, locking on update, merging in new
objects into it on update) for little benefit wasn't palatable to
Microsoft, who has been coping fine without such a file.

Jonathan
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
Jonathan Tan writes:
> Well, the fsck can still detect issues like corrupt objects (as you
> mention above) and dangling heads, which might be real issues. But it
> is true that it does not give you the guarantee you describe.

Which makes it pretty much useless. The whole point of running "fsck"
is to make sure that we won't waste work by not finding a corruption
long after it was introduced and spent a lot of effort building on top
of a state that nobody can reproduce.

> From a user standpoint, this might be able to be worked around by
> providing a network-requiring object connectivity checking tool or by
> just having the user running a build to ensure that all necessary
> files are present.

I actually was hoping that you do not have to go to the network for
the checking. And I have to say that "only the tip matters" is a
horrible cop-out that is not even a workaround. Your users would be
served better if you honestly admit that your fsck will not be useful
when this feature is used---at least they won't be harmed by a false
expectation that "fsck" would give them some assurance, which is not
the case.

Let's step back a bit and think what already happens in the pre-
lazy-object world. We record cut-off commits when a depth limited
clone is created in "shallow". These essentially are promises,
saying something like:

    Rest assured that everything in the history behind these commits
    are on the other side and you can retrieve them by unshallowing.

    If you traverse from your local tips and find no missing objects
    before reaching one of these commits, then you do not have any
    local corruption you need to worry about.

the other end made to us, when the shallow clone was made. And we
take this promise and build more commits on top, and then we adjust
these cut-off commits incrementally as we deepen our clone or make
it even shallower.
For this assurance to work, we of course need to assume a bit more
than what we assume for a complete clone, namely, the "other side"
will hold onto the history behind these, i.e. does not rewind the tips
it already has shown to us, or even if it does, the objects that are
reachable from these cut-off points will somehow always be available
to us on demand.

Can we do something similar, i.e. maintain minimum set of cut-off
points and adjust that set incrementally, just sufficient to ensure
the integrity of objects locally created and not yet safely stored
away by pushing them the "other side"?

I haven't thought things through (and I know you, Ben and others
have thought much longer and harder), but I would imagine if we have
a commit object [*1*], some of whose parent commits, trees and blobs
are locally missing, and know that the commit exists on the "other
side", we know that all of these "missing" objects that are
referenced by the commit are also available from the "other side".
IOW, I suspect that the same principle "shallow" uses to give us the
integrity guarantee can be naturally extended to allow us to see if
a broken connectivity is OK.

[Footnote]

*1* The same can be said for a tag or a tree object that we know
exists on the "other side"; they may refer, directly or indirectly
through objects we locally have, to objects that are missing locally,
and as long as the starting point objects are known to be available on
the "other side", it is OK for them to be missing locally.
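[Editorial note: the "shallow" principle described above can be
sketched as a walk that treats cut-off objects known to exist on the
"other side" as safe boundaries. Toy Python, not Git's implementation;
the commit names C0..C2 are made up.]

```python
# "Shallow"-style integrity check: traverse from local tips, stop at
# cut-off objects promised by the other side, and flag only objects
# that are missing without such a promise.
def fsck_ok(tips, cutoffs, have, edges):
    seen, stack = set(), list(tips)
    while stack:
        o = stack.pop()
        if o in seen:
            continue
        seen.add(o)
        if o in cutoffs:
            continue        # behind this point is the other side's duty
        if o not in have:
            return False    # missing with no promise: local corruption
        stack.extend(edges.get(o, []))
    return True

# History C2 -> C1 -> C0; we have C2 and C1, and C1 is a recorded
# cut-off, so the absence of C0 is fine:
edges = {"C2": ["C1"], "C1": ["C0"]}
print(fsck_ok(["C2"], {"C1"}, {"C2", "C1"}, edges))   # True
print(fsck_ok(["C2"], set(), {"C2", "C1"}, edges))    # False: C0 missing
```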
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
On Mon, 31 Jul 2017 14:21:56 -0700 Junio C Hamano wrote:
> Jonathan Tan writes:
>
>> Besides review changes, this patch set now includes my rewritten
>> lazy-loading sha1_file patch, so you can now do this (excerpted from
>> one of the tests):
>>
>>     test_create_repo server
>>     test_commit -C server 1 1.t abcdefgh
>>     HASH=$(git hash-object server/1.t)
>>
>>     test_create_repo client
>>     test_must_fail git -C client cat-file -p "$HASH"
>>     git -C client config core.repositoryformatversion 1
>>     git -C client config extensions.lazyobject \
>>         "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
>>     git -C client cat-file -p "$HASH"
>>
>> with fsck still working. Also, there is no need for a list of
>> promised blobs, and the long-running process protocol is being used.
>
> I do not think I read your response to my last comment on this
> series, so I could be missing something large, but I am afraid that
> the resulting fsck is only half as useful as the normal fsck. I do
> not see it any better than a hypothetical castrated version of fsck
> that only checks the integrity of objects that appear in the local
> object store, without doing any connectivity checks.

Sorry, I haven't replied to your last response [1]. That does sound
like a good idea, though, and probably can be extended to trees and
blobs in that we need to make sure that any object referenced from
local-only commits (calculated as you describe in [1]) can be obtained
through an object walk from a remote-tracking branch.

I haven't fully thought of the implications of things like building
commits on top of an arbitrary upstream commit (so since our upstream
commit is not a tip, the object walk from all remote-tracking branches
might not reach our upstream commit).
To try to solve that, we could use an alternate object store to hold remote objects, so that we can find remote objects quickly without doing a traversal. But that does not fully solve the problem, because some information about remote object possession lies only in their parents (for example, if we don't have a remote blob, sometimes the only way to know that the remote has it is by having a tree containing that blob). In addition, this also couples the lazy object loading with either a remote ref (or all remote refs, if we decide to consider objects from all remote refs as potentially loadable). I'll think about this further.

[1] https://public-inbox.org/git/xmqq379fkz4x@gitster.mtv.corp.google.com/

> Don't get me wrong. The integrity check on local objects you still
> do is important---that is what allows us to make sure that the local
> "cache" does not prevent us from going to the real source of the
> remote object store by having a corrupt copy.
>
> But not being able to tell if a missing object is OK to be missing
> (because we can get them if/as needed from elsewhere) or we lost the
> sole copy of an object that we created and have not pushed out
> (hence we are in deep yogurt) makes it pretty much pointless to run
> "fsck", doesn't it? It does not give us any guarantee that our
> repository plus perfect network connectivity gives us an environment
> to build further work on.
>
> Or am I missing something fundamental?

Well, fsck can still detect issues like corrupt objects (as you mention above) and dangling heads, which might be real issues. But it is true that it does not give you the guarantee you describe. From a user standpoint, this might be worked around by providing a network-requiring object-connectivity-checking tool, or simply by having the user run a build to ensure that all necessary files are present. Having said that, the feature you describe would be very nice to have.
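The object walk discussed above can be sketched with existing plumbing. This is an illustrative sketch, not part of the series: enumerate objects reachable from local branches but not from any remote-tracking branch, i.e. the objects for which no remote can be assumed to hold a copy. (It does not yet address the arbitrary-upstream case mentioned in the text.)

```shell
# Build a tiny repo in which one commit is "known to the remote"
# (simulated with a remote-tracking ref) and one is local-only, then
# walk the objects that only exist locally.
set -e
git init --quiet demo
cd demo
git -c user.name=a -c user.email=a@example.com \
    commit --quiet --allow-empty -m shared
git update-ref refs/remotes/origin/master HEAD   # pretend "shared" was pushed
git -c user.name=a -c user.email=a@example.com \
    commit --quiet --allow-empty -m local-only
git rev-list --objects --branches --not --remotes   # only the local-only objects
```

A connectivity check restricted to this set would cover exactly the objects whose loss is unrecoverable.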
Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
Jonathan Tan writes:
> Besides review changes, this patch set now includes my rewritten
> lazy-loading sha1_file patch, so you can now do this (excerpted from one
> of the tests):
>
> 	test_create_repo server
> 	test_commit -C server 1 1.t abcdefgh
> 	HASH=$(git hash-object server/1.t)
>
> 	test_create_repo client
> 	test_must_fail git -C client cat-file -p "$HASH"
> 	git -C client config core.repositoryformatversion 1
> 	git -C client config extensions.lazyobject \
> 		"\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
> 	git -C client cat-file -p "$HASH"
>
> with fsck still working. Also, there is no need for a list of promised
> blobs, and the long-running process protocol is being used.

I do not think I read your response to my last comment on this series, so I could be missing something large, but I am afraid that the resulting fsck is only half as useful as the normal fsck. I do not see it any better than a hypothetical castrated version of fsck that only checks the integrity of objects that appear in the local object store, without doing any connectivity checks.

Don't get me wrong. The integrity check on local objects you still do is important---that is what allows us to make sure that the local "cache" does not prevent us from going to the real source of the remote object store by having a corrupt copy.

But not being able to tell if a missing object is OK to be missing (because we can get them if/as needed from elsewhere) or we lost the sole copy of an object that we created and have not pushed out (hence we are in deep yogurt) makes it pretty much pointless to run "fsck", doesn't it? It does not give us any guarantee that our repository plus perfect network connectivity gives us an environment to build further work on.

Or am I missing something fundamental?
[PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader
Besides review changes, this patch set now includes my rewritten lazy-loading sha1_file patch, so you can now do this (excerpted from one of the tests):

	test_create_repo server
	test_commit -C server 1 1.t abcdefgh
	HASH=$(git hash-object server/1.t)

	test_create_repo client
	test_must_fail git -C client cat-file -p "$HASH"
	git -C client config core.repositoryformatversion 1
	git -C client config extensions.lazyobject \
		"\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
	git -C client cat-file -p "$HASH"

with fsck still working. Also, there is no need for a list of promised blobs, and the long-running process protocol is being used.

Changes from v1:
 - added last patch that supports lazy loading
 - clarified documentation in "introduce lazyobject extension" patch
   (following Junio's comments [1])

As listed in the changes above, I have rewritten my lazy-loading sha1_file patch to no longer use the list of promises. Also, I have added documentation about the protocol used in (hopefully) the appropriate places.

This is a minimal implementation, hopefully enough of a foundation to be built upon. In particular, I haven't added the environment variable to suppress lazy loading, and the lazy loading protocol only supports one object at a time.

Other work
----------

This differs slightly from Ben Peart's patch [2] in that the lazy-loading functionality is provided through a configured shell command instead of a hook shell script. I envision commands like "git clone", in the future, needing to pre-configure lazy loading, and I think that it will be less surprising to the user if "git clone" wrote a default configuration instead of a default hook.
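For readers following along, here is a hypothetical one-shot sketch of what a lazy loader must accomplish: copy a single object from a "server" repository into the local object store. The real t0410/lazy-object helper does this over the long-running process protocol; the `lazy_load_one` function below (a made-up name) only illustrates the data movement, not the protocol.

```shell
# Copy one object, by hash, from another repository's object store into
# the current repository's, using only cat-file and hash-object plumbing.
lazy_load_one() {
	server_git_dir=$1
	object=$2
	type=$(git --git-dir="$server_git_dir" cat-file -t "$object") &&
	git --git-dir="$server_git_dir" cat-file "$type" "$object" |
		git hash-object -w --stdin -t "$type" >/dev/null
}

# Demo loosely mirroring the test excerpt above:
set -e
git init --quiet server
echo abcdefgh >server/1.t
HASH=$(git -C server hash-object -w 1.t)
git init --quiet client
cd client
lazy_load_one ../server/.git "$HASH"
git cat-file -p "$HASH"   # prints "abcdefgh": the blob is now local
```

Because hashing is content-addressed, writing the object on the client side necessarily reproduces the same object name, which is what lets fsck then verify the copy's integrity locally.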
This also differs from Christian Couder's patch set [3], which implements a larger-scale object database, in that (i) my patch set does not support putting objects into external databases, and (ii) my patch set requires the lazy loader to make the objects available in the local repo, instead of allowing the objects to only be stored in the external database.

[1] https://public-inbox.org/git/xmqqzibpn1zh@gitster.mtv.corp.google.com/
[2] https://public-inbox.org/git/20170714132651.170708-2-benpe...@microsoft.com/
[3] https://public-inbox.org/git/20170620075523.26961-1-chrisc...@tuxfamily.org/

Jonathan Tan (5):
  environment, fsck: introduce lazyobject extension
  fsck: support refs pointing to lazy objects
  fsck: support referenced lazy objects
  fsck: support lazy objects as CLI argument
  sha1_file: support loading lazy objects

 Documentation/Makefile                             |   1 +
 Documentation/gitattributes.txt                    |  54 ++
 Documentation/gitrepository-layout.txt             |   3 +
 .../technical/long-running-process-protocol.txt    |  50 +
 Documentation/technical/repository-version.txt     |  23 +
 Makefile                                           |   1 +
 builtin/cat-file.c                                 |   2 +
 builtin/fsck.c                                     |  25 -
 cache.h                                            |   4 +
 environment.c                                      |   1 +
 lazy-object.c                                      |  80 +++
 lazy-object.h                                      |  12 +++
 object.c                                           |   7 ++
 object.h                                           |  13 +++
 setup.c                                            |   7 +-
 sha1_file.c                                        |  44 +---
 t/t0410-lazy-object.sh                             | 113 +
 t/t0410/lazy-object                                | 102 +++
 18 files changed, 478 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/technical/long-running-process-protocol.txt
 create mode 100644 lazy-object.c
 create mode 100644 lazy-object.h
 create mode 100755 t/t0410-lazy-object.sh
 create mode 100755 t/t0410/lazy-object
--
2.14.0.rc1.383.gd1ce394fe2-goog