Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-08-08 Thread Ben Peart



On 7/31/2017 5:02 PM, Jonathan Tan wrote:

Besides review changes, this patch set now includes my rewritten
lazy-loading sha1_file patch, so you can now do this (excerpted from one
of the tests):

 test_create_repo server
 test_commit -C server 1 1.t abcdefgh
 HASH=$(git hash-object server/1.t)
 
 test_create_repo client

 test_must_fail git -C client cat-file -p "$HASH"
 git -C client config core.repositoryformatversion 1
 git -C client config extensions.lazyobject \
 "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
 git -C client cat-file -p "$HASH"

with fsck still working. Also, there is no need for a list of promised
blobs, and the long-running process protocol is being used.

Changes from v1:
  - added last patch that supports lazy loading
  - clarified documentation in "introduce lazyobject extension" patch
(following Junio's comments [1])

As listed in the changes above, I have rewritten my lazy-loading
sha1_file patch to no longer use the list of promises. Also, I have
added documentation about the protocol used in (hopefully) the
appropriate places.


Glad to see the removal of the promises.  Given the ongoing 
conversation, I'm interested to see how you are detecting locally 
created objects vs. those downloaded from a server.




This is a minimal implementation, hopefully enough of a foundation to be
built upon. In particular, I haven't added the environment variable to
suppress lazy loading, and the lazy loading protocol only supports one
object at a time.


We can add multiple object support to the protocol when we get to the 
point that we have code that will utilize it.
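For the curious, the core copy step such a loader performs can be sketched as a small shell function. This is a hypothetical simplification (the function name is made up, and the real t0410/lazy-object helper additionally wraps this exchange in the long-running process protocol, which is omitted here):

```shell
#!/bin/sh
# Hypothetical sketch of the loader's core step: given the server's
# $GIT_DIR and one object hash, copy that object into the current
# repository's object store.  One object per invocation, matching the
# current one-object-at-a-time protocol.
lazy_load_one () {
	server_git_dir=$1
	hash=$2
	type=$(git --git-dir="$server_git_dir" cat-file -t "$hash") &&
	git --git-dir="$server_git_dir" cat-file "$type" "$hash" |
	git hash-object -w -t "$type" --stdin
}
```

It would be invoked as, e.g., `lazy_load_one /path/to/server/.git "$HASH"`; batching several hashes per request, as discussed above, would amortize the per-object overhead.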




Other work
----------

This differs slightly from Ben Peart's patch [2] in that the
lazy-loading functionality is provided through a configured shell
command instead of a hook shell script. I envision commands like "git
clone", in the future, needing to pre-configure lazy loading, and I
think that it will be less surprising to the user if "git clone" wrote a
default configuration instead of a default hook.


This was on my "todo" list to investigate as I've been told it can 
enable people to use taskset to set CPU affinity and get some 
significant performance wins. I'd be interested to see if it actually 
helps here at all.




This also differs from Christian Couder's patch set [3], which implements
a larger-scale object database, in that (i) my patch set does not support
putting objects into external databases, and (ii) my patch set requires
the lazy loader to make the objects available in the local repo, instead
of allowing the objects to only be stored in the external database.


This is the model we're using today so I'm confident it will meet our 
requirements.




[1] https://public-inbox.org/git/xmqqzibpn1zh@gitster.mtv.corp.google.com/
[2] https://public-inbox.org/git/20170714132651.170708-2-benpe...@microsoft.com/
[3] https://public-inbox.org/git/20170620075523.26961-1-chrisc...@tuxfamily.org/

Jonathan Tan (5):
   environment, fsck: introduce lazyobject extension
   fsck: support refs pointing to lazy objects
   fsck: support referenced lazy objects
   fsck: support lazy objects as CLI argument
   sha1_file: support loading lazy objects

  Documentation/Makefile |   1 +
  Documentation/gitattributes.txt|  54 ++
  Documentation/gitrepository-layout.txt |   3 +
  .../technical/long-running-process-protocol.txt|  50 +
  Documentation/technical/repository-version.txt |  23 +
  Makefile   |   1 +
  builtin/cat-file.c |   2 +
  builtin/fsck.c |  25 -
  cache.h|   4 +
  environment.c  |   1 +
  lazy-object.c  |  80 +++
  lazy-object.h  |  12 +++
  object.c   |   7 ++
  object.h   |  13 +++
  setup.c|   7 +-
  sha1_file.c|  44 +---
  t/t0410-lazy-object.sh | 113 +
  t/t0410/lazy-object| 102 +++
  18 files changed, 478 insertions(+), 64 deletions(-)
  create mode 100644 Documentation/technical/long-running-process-protocol.txt
  create mode 100644 lazy-object.c
  create mode 100644 lazy-object.h
  create mode 100755 t/t0410-lazy-object.sh
  create mode 100755 t/t0410/lazy-object



Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-08-03 Thread Jonathan Tan
On Wed, 02 Aug 2017 13:51:37 -0700
Junio C Hamano  wrote:

> > The complication is in the "git gc" operation for the case (*).
> > Today, "git gc" uses a reachability walk to decide which objects to
> > remove --- an object referenced by no other object is fair game to
> > remove.  With (*), there is another kind of object that must not be
> > removed: if an object that I made, M, points to a missing/promised
> > object, O, pointed to by an object I fetched, F, then I cannot prune
> > F unless there is another fetched object present to anchor O.
> 
> Absolutely.  Lazy-objects support comes with certain cost and this
> is one of them.  
> 
> But I do not think it is realistic to expect that you can prune
> anything you fetched from the "other place" (i.e. the source
> 'lazy-objects' hook reads from).  After all, once they give out
> objects to their clients (like us in this case), they cannot prune
> it, if we take the "implicit promise" approach to avoid the cost to
> transmit and maintain a separate "object list".

By this, do you mean that the client cannot prune anything lazily loaded
from the server? If yes, I understand that the server cannot prune
anything (for the reasons that you describe), but I don't see how that
applies to the client.

> > For example: suppose I have a sparse checkout and run
> >
> > git fetch origin refs/pulls/x
> > git checkout -b topic FETCH_HEAD
> > echo "Some great modification" >> README
> > git add README
> > git commit --amend
> >
> > When I run "git gc", there is nothing pointing to the commit that was
> > pointed to by the remote ref refs/pulls/x, so it can be pruned.  I
> > would naively also expect that the tree pointed to by that commit
> > could be pruned.  But pruning it means pruning the promise that made
> > it permissible to lack various blobs that my topic branch refers to
> > that are outside the sparse checkout area.  So "git gc" must notice
> > that it is not safe to prune that tree.
> >
> > This feels hacky.  I prefer the promised object list over this
> > approach.
> 
> I think they are moral equivalents implemented differently with
> different assumptions.  The example we are discussing makes an extra
> assumption: In order to reduce the cost of transferring and
> maintaining the list, we assume that all objects that came during
> that transfer are implicitly "promised", i.e. everything behind each
> of these objects will later be available on demand.  How these
> objects are marked is up to the exact mechanism (my preference is to
> mark the resulting packfile as special; Jon Tan's message to which
> my message was a response alluded to using an alternate object
> store).  If you choose to maintain a separate "object list" and have
> the "other side" explicitly give it, perhaps you can lift that
> assumption and replace it with some other assumption that assumes
> less.

Marking might be an issue if we expect the lazy loader to emit an object
after every hash, like in the current design. There would thus be one
mark per object, with overhead similar to the promise list. (Having said
that, batching is possible - I plan to optimize common cases like
checkout, and have such a patch online for an older "promised blob"
design [1].)

Overhead could be reduced by embedding the mark in both the packed and
loose objects, requiring a different format (instead of having a
separate "catalog" of marked files). This seems more complicated than
using an alternate object store, hence my suggestion.

[1] 
https://github.com/jonathantanmy/git/commit/14f07d3f06bc3a1a2c9bca85adc8c42b230b9143


Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-08-02 Thread Jonathan Nieder
Junio C Hamano wrote:
> Jonathan Nieder  writes:

>> Can you spell this out more?  To be clear, are you speaking as a
>> reviewer or as the project maintainer?  In other words, if other
>> reviewers are able to settle on a design that involves a relaxed
>> guarantee for fsck in this mode that they can agree on, does this
>> represent a veto meaning the patch can still not go through?
>
> Consider it a veto over punting without making sure that we can
> later come up with a solution to give such a guarantee.  I am not
> getting a feeling that "other reviewers" are even seeking a "relaxed
> guarantee"---all I've seen in the thread is to give up any guarantee
> and to hope for the best.

Thank you.  That makes sense.

In my defense, one reason I had for being okay with dropping the
connectivity check in the "lazy object" setup (at a higher level than
this patch currently does it, to avoid wasted work) is that this patch
series does not include the required components to do it more properly
and previous discussions on the list had pointed to some of those
components that will arrive later (the "object size cache", which
doubles as an incomplete list of promises).  But that doesn't put the
project in a good position because it isn't an explicitly spelled out
plan.

The set of other reviewers that I was hoping will weigh in at some
point is the GVFS team at Microsoft.

I'll write up a summary of the ideas discussed so far to try to get
this unblocked.

Sincerely,
Jonathan


Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-08-02 Thread Junio C Hamano
Jonathan Nieder  writes:

> Junio C Hamano wrote:
>> Jonathan Tan  writes:
>
>>> One possibility to conceptually have the same thing without the overhead
>>> of the list is to put the obtained-from-elsewhere objects into its own
>>> alternate object store, so that we can distinguish the two.
>>
>> Now you are talking.  Either a separate object store, or a packfile
>> that is specially marked as such, would work.
>
> Jonathan's not in today, so let me say a few more words about this
> approach.
>
> This approach implies a relaxed connectivity guarantee, by creating
> two classes of objects:
>
>  1. Objects that I made should satisfy the connectivity check.  They
> can point to other objects I made, objects I fetched, or (*) objects
> pointed to directly by objects I fetched.  More on (*) below.

Or objects that are referred to by objects I fetched.  If you
narrowly clone while omitting a subdirectory, update a file that
is outside the subdirectory, and create a new commit, while
recording the same tree object name for the directory whose
contents you do not know (because you didn't fetch them), then it
is OK for the top-level tree of the resulting commit you created
to be pointing at the tree that represents the subdirectory you
never touched.

> The complication is in the "git gc" operation for the case (*).
> Today, "git gc" uses a reachability walk to decide which objects to
> remove --- an object referenced by no other object is fair game to
> remove.  With (*), there is another kind of object that must not be
> removed: if an object that I made, M, points to a missing/promised
> object, O, pointed to by an object I fetched, F, then I cannot prune
> F unless there is another fetched object present to anchor O.

Absolutely.  Lazy-objects support comes with certain cost and this
is one of them.  

But I do not think it is realistic to expect that you can prune
anything you fetched from the "other place" (i.e. the source
'lazy-objects' hook reads from).  After all, once they give out
objects to their clients (like us in this case), they cannot prune
it, if we take the "implicit promise" approach to avoid the cost to
transmit and maintain a separate "object list".

> For example: suppose I have a sparse checkout and run
>
>   git fetch origin refs/pulls/x
>   git checkout -b topic FETCH_HEAD
>   echo "Some great modification" >> README
>   git add README
>   git commit --amend
>
> When I run "git gc", there is nothing pointing to the commit that was
> pointed to by the remote ref refs/pulls/x, so it can be pruned.  I
> would naively also expect that the tree pointed to by that commit
> could be pruned.  But pruning it means pruning the promise that made
> it permissible to lack various blobs that my topic branch refers to
> that are outside the sparse checkout area.  So "git gc" must notice
> that it is not safe to prune that tree.
>
> This feels hacky.  I prefer the promised object list over this
> approach.

I think they are moral equivalents implemented differently with
different assumptions.  The example we are discussing makes an extra
assumption: In order to reduce the cost of transferring and
maintaining the list, we assume that all objects that came during
that transfer are implicitly "promised", i.e. everything behind each
of these objects will later be available on demand.  How these
objects are marked is up to the exact mechanism (my preference is to
mark the resulting packfile as special; Jon Tan's message to which
my message was a response alluded to using an alternate object
store).  If you choose to maintain a separate "object list" and have
the "other side" explicitly give it, perhaps you can lift that
assumption and replace it with some other assumption that assumes
less.

> Can you spell this out more?  To be clear, are you speaking as a
> reviewer or as the project maintainer?  In other words, if other
> reviewers are able to settle on a design that involves a relaxed
> guarantee for fsck in this mode that they can agree on, does this
> represent a veto meaning the patch can still not go through?

Consider it a veto over punting without making sure that we can
later come up with a solution to give such a guarantee.  I am not
getting a feeling that "other reviewers" are even seeking a "relaxed
guarantee"---all I've seen in the thread is to give up any guarantee
and to hope for the best.



Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-08-02 Thread Jonathan Nieder
Hi,

Junio C Hamano wrote:
> Jonathan Tan  writes:

>> One possibility to conceptually have the same thing without the overhead
>> of the list is to put the obtained-from-elsewhere objects into its own
>> alternate object store, so that we can distinguish the two.
>
> Now you are talking.  Either a separate object store, or a packfile
> that is specially marked as such, would work.

Jonathan's not in today, so let me say a few more words about this
approach.

This approach implies a relaxed connectivity guarantee, by creating
two classes of objects:

 1. Objects that I made should satisfy the connectivity check.  They
can point to other objects I made, objects I fetched, or (*) objects
pointed to directly by objects I fetched.  More on (*) below.

 2. Objects that I fetched do not need to satisfy a connectivity
check.  I can trust the server to provide objects that they point
to when I ask for them, except in extraordinary cases like a
credit card number that was accidentally pushed to the server and
prompted a rewriting of history to remove it (**).

The guarantee (1) looks like it should be easy to satisfy (just like
the current connectivity guarantee where all objects are in class (1)).
I have to know about an object to point to it --- that means the
pointed-to object has to be in the object store or pointed to by
something in the object store.

The complication is in the "git gc" operation for the case (*).
Today, "git gc" uses a reachability walk to decide which objects to
remove --- an object referenced by no other object is fair game to
remove.  With (*), there is another kind of object that must not be
removed: if an object that I made, M, points to a missing/promised
object, O, pointed to by an object I fetched, F, then I cannot prune
F unless there is another fetched object present to anchor O.

For example: suppose I have a sparse checkout and run

git fetch origin refs/pulls/x
git checkout -b topic FETCH_HEAD
echo "Some great modification" >> README
git add README
git commit --amend

When I run "git gc", there is nothing pointing to the commit that was
pointed to by the remote ref refs/pulls/x, so it can be pruned.  I
would naively also expect that the tree pointed to by that commit
could be pruned.  But pruning it means pruning the promise that made
it permissible to lack various blobs that my topic branch refers to
that are outside the sparse checkout area.  So "git gc" must notice
that it is not safe to prune that tree.

This feels hacky.  I prefer the promised object list over this
approach.
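As a minimal illustration of the reachability walk under discussion (ordinary Git, no lazy objects involved; the repository name and commit messages are made up): after an amend, the original commit becomes unreachable, and an aggressive gc prunes it along with everything only it referenced.

```shell
#!/bin/sh
# Demo: reachability-based pruning.  After --amend, the original
# commit is referenced only by the reflog; once that is expired, an
# aggressive gc removes it.  With lazy objects, the same walk would
# also have to keep promise-carrying fetched objects alive.
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
git config user.email you@example.com
git config user.name you
echo one >README && git add README && git commit -qm one
old=$(git rev-parse HEAD)
echo "Some great modification" >>README
git add README && git commit -q --amend -m amended
git reflog expire --expire=now --all
git gc --quiet --prune=now
git cat-file -e "$old" 2>/dev/null || echo "pruned: $old"
```

With lazy objects, "pruned" here could silently discard the only evidence that the remote promised some missing object, which is exactly the hazard described above.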

> "Maintaining a list
> of object names in a flat file is too costly" is not a valid excuse
> to discard the integrity of locally created objects, without which
> Git will no longer be a version control system,

I am confused by this: I think that Git without a "git fsck" command
at all would still be a version control system, just not as good of
one.

Can you spell this out more?  To be clear, are you speaking as a
reviewer or as the project maintainer?  In other words, if other
reviewers are able to settle on a design that involves a relaxed
guarantee for fsck in this mode that they can agree on, does this
represent a veto meaning the patch can still not go through?

On one hand I'm grateful for the help exploring the design space, and
I think it has helped get a better understanding of the issues
involved.

On the other hand, if this is a veto then it feels very black and
white and a hard starting point to build a consensus from.  I am
worried.

[...]
>> I mentioned
>> this in my e-mail but rejected it, but after some more thought, this
>> might be sufficient - we might still need to iterate through every
>> object to know exactly what we can assume the remote to have, but the
>> "frontier" solution also needs this iteration, so we are no worse off.
>
> Most importantly, this is allowed to be costly---we are doing this
> not at runtime all the time, but when the user says "make sure that
> I haven't lost objects and it is safe for me to build further on
> what I created locally so far" by running "git fsck".

check_everything_connected is also used in some other circumstances:
e.g. when running a fetch, and when receiving a push in git
receive-pack.

Thanks,
Jonathan


Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-08-02 Thread Junio C Hamano
Jonathan Tan  writes:

> One possibility to conceptually have the same thing without the overhead
> of the list is to put the obtained-from-elsewhere objects into its own
> alternate object store, so that we can distinguish the two.

Now you are talking.  Either a separate object store, or a packfile
that is specially marked as such, would work.  "Maintaining a list
of object names in a flat file is too costly" is not a valid excuse
to discard the integrity of locally created objects, without which
Git will no longer be a version control system, and your "We have to
trust the sender of objects on the other side anyway when operating
in lazy-objects mode, so devise a way that we can use to tell which
objects we _know_ the other side has" that leads to the above idea
is a good line of thought.

> I mentioned
> this in my e-mail but rejected it, but after some more thought, this
> might be sufficient - we might still need to iterate through every
> object to know exactly what we can assume the remote to have, but the
> "frontier" solution also needs this iteration, so we are no worse off.

Most importantly, this is allowed to be costly---we are doing this
not at runtime all the time, but when the user says "make sure that
I haven't lost objects and it is safe for me to build further on
what I created locally so far" by running "git fsck".


Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-08-01 Thread Jonathan Tan
On Tue, 01 Aug 2017 10:11:38 -0700
Junio C Hamano  wrote:

> Let's step back a bit and think what already happens in the pre-
> lazy-object world.  We record cut-off commits when a depth limited
> clone is created in "shallow".  These essentially are promises,
> saying something like:
> 
> Rest assured that everything in the history behind these commits
> is on the other side, and you can retrieve it by unshallowing.
> 
> If you traverse from your local tips and find no missing objects
> before reaching one of these commits, then you do not have any
> local corruption you need to worry about.
> 
> That is a promise the other end made to us when the shallow clone
> was created.  We take this promise and build more commits on top,
> and then we adjust these cut-off commits incrementally as we deepen
> our clone or make it even shallower.  For this assurance to work, we
> of course need to assume a bit more than what we assume for a
> complete clone, namely, that the "other side" will hold onto the
> history behind these, i.e. does not rewind the tips it has already
> shown to us, or, even if it does, that the objects reachable from
> these cut-off points will somehow always be available to us on demand.
> 
> Can we do something similar, i.e. maintain a minimum set of cut-off
> points and adjust that set incrementally, just sufficient to ensure
> the integrity of objects locally created and not yet safely stored
> away by pushing them to the "other side"?

This suggestion (the "frontier" of what we have) did seem to incur less
overhead than the original promise suggestion (the "frontier" of what we
don't have), but after some in-office discussion, I'm convinced that
might not be the case - for example, one tree (that we have) might
reference many blobs (that we don't have), and at the same time, many
trees (that we have) might reference the same blob (that we don't have).
And the promise overhead was already decided to be too much - which is
why we moved away from it.

One possibility to conceptually have the same thing without the overhead
of the list is to put the obtained-from-elsewhere objects into its own
alternate object store, so that we can distinguish the two. I mentioned
this in my e-mail but rejected it, but after some more thought, this
might be sufficient - we might still need to iterate through every
object to know exactly what we can assume the remote to have, but the
"frontier" solution also needs this iteration, so we are no worse off.
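For concreteness, Git's existing alternates mechanism (objects/info/alternates) could serve as the basis for such a separate store; a hedged sketch, with made-up directory names:

```shell
#!/bin/sh
# Sketch: objects written into a separate "fetched" store become
# visible to the main repository through objects/info/alternates,
# while remaining physically distinguishable from local objects.
set -e
cd "$(mktemp -d)"
git init -q fetched-store
git init -q repo
echo "$(pwd)/fetched-store/.git/objects" \
	>repo/.git/objects/info/alternates
# An object that "came from elsewhere" lands in the fetched store...
obj=$(echo hello | git -C fetched-store hash-object -w --stdin)
# ...yet is readable from the main repository.
git -C repo cat-file -p "$obj"   # prints "hello"
```

Locally created objects would land under repo/.git/objects as usual, so "is this object ours or the remote's" becomes a question of which store physically holds it.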

Going back to the original use cases that motivated this (the monorepo
like Microsoft's repo and the large-blob repo like Android's repo), it
might be better just to disable the connectivity check when
extensions.lazyObject is set (as you mentioned). This does change the
meaning of fsck, but it may be fine since the "meaning" of the repo (a
view of another repo, and no longer a full repo) has changed too. Then
this patch set will be more about ensuring that the lazy object loader
is not inadvertently run. As future work, we could add diagnostics that,
for example, attempt a walk anyway and print a list of missing SHA-1s.

(I suspect that we will also need to disable the connectivity check for
things like "git fetch", which means that we won't be able to tell
locally if the server sent us all the objects that we requested.
This might not be a problem, though, since the local repo already has
some measure of trust for the server.)


Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-08-01 Thread Junio C Hamano
Jonathan Nieder  writes:

> If we are deeply worried about this kind of broken connectivity, there
> is another case to care about: the server can "promise" to serve
> requests for some object (e.g., the tree pointed to by the server's
> "master") and then decide it does not want to fulfill that promise
> (e.g., that tree pointed to private key material and "master" was
> rewound to avoid it).  

I think I've already covered that in my message (i.e. "we need to
assume more than the normal Git").  In short, it is not our problem,
but the "lazy-object" service's problem.  If developers cannot trust
the "central server", most likely owned by the organization that
employs them and forces them to offload the access to these objects
to the "central server", I think there is a much larger problem there.

> In the promises model, how do we get a fresh
> understanding of what the server wants to promise now?

Yes, that is one of the things that needs to be designed if we want
to officially support lazy-objects like structure.  We need a way to
incrementally adjust the cut-off point, below which it is the
responsibility of the "other side" to ensure that necessary objects
are available (on demand), and above which it is a local
repository's responsibility to notice corrupted and/or missing
objects.



Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-08-01 Thread Jonathan Nieder
Hi,

Junio C Hamano wrote:

> Can we do something similar, i.e. maintain a minimum set of cut-off
> points and adjust that set incrementally, just sufficient to ensure
> the integrity of objects locally created and not yet safely stored
> away by pushing them to the "other side"?

This sounds like a variant on the "promises" idea (maintaining a list
of objects at the frontier) described before.  Instead of listing
blobs that the server promised, you are proposing listing trees that
the server has promised to handle all references from.

> I haven't thought things through (and I know you, Ben and others
> have thought much longer and harder), but I would imagine if we have
> a commit object [*1*], some of whose parent commits, trees and blobs
> are locally missing, and know that the commit exists on the "other
> side", we know that all of these "missing" objects that are
> referenced by the commit are also available from the "other side".
> IOW, I suspect that the same principle "shallow" uses to give us the
> integrity guarantee can be naturally extended to allow us to see if
> a broken connectivity is OK.

If we are deeply worried about this kind of broken connectivity, there
is another case to care about: the server can "promise" to serve
requests for some object (e.g., the tree pointed to by the server's
"master") and then decide it does not want to fulfill that promise
(e.g., that tree pointed to private key material and "master" was
rewound to avoid it).  In the promises model, how do we get a fresh
understanding of what the server wants to promise now?

Earlier in this discussion of fsck, I thought you were proposing a
slightly different idea.  The idea I heard is that you want to check
connectivity for whatever you have built locally, while accepting a
relaxed guarantee for objects from upstream.  If an object is missing,
the idea would be that at least this way you know whose fault it is.
:)  (Not that there's much to do with that knowledge.)  Implementing
that by treating everything reachable from a remote-tracking branch as
"from upstream" seems natural.  But that implementation suffers from
the same problems: not all objects from upstream need be reachable
from a remote-tracking branch (e.g. after a fetch-by-object-id, or
because a remote branch can be rewound).

Both variants of the promises idea proposed so far also hold some
promise, but my understanding was that the cost of maintaining the
promises file (getting data to fill it, locking on update, merging new
objects into it on update) for little benefit wasn't palatable to
Microsoft, which has been coping fine without such a file.

Jonathan


Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-08-01 Thread Junio C Hamano
Jonathan Tan  writes:

> Well, the fsck can still detect issues like corrupt objects (as you
> mention above) and dangling heads, which might be real issues. But it is
> true that it does not give you the guarantee you describe.

Which makes it pretty much useless.  The whole point of running
"fsck" is to make sure that we don't find a corruption only long
after it was introduced, having already spent a lot of effort
building on top of a state that nobody can reproduce.

> From a user standpoint, this might be able to be worked around by
> providing a network-requiring object connectivity checking tool or by
> just having the user running a build to ensure that all necessary files
> are present.

I actually was hoping that you do not have to go to the network for
the checking.  And I have to say that "only the tip matters" is a
horrible cop-out that is not even a workaround.  Your users would be
served better if you honestly admit that your fsck will not be
useful when this feature is used---at least they won't be harmed by
a false expectation that "fsck" would give them some assurance,
which is not the case.

Let's step back a bit and think what already happens in the pre-
lazy-object world.  We record cut-off commits when a depth limited
clone is created in "shallow".  These essentially are promises,
saying something like:

Rest assured that everything in the history behind these commits
is on the other side, and you can retrieve it by unshallowing.

If you traverse from your local tips and find no missing objects
before reaching one of these commits, then you do not have any
local corruption you need to worry about.

That is a promise the other end made to us when the shallow clone
was created.  We take this promise and build more commits on top,
and then we adjust these cut-off commits incrementally as we deepen
our clone or make it even shallower.  For this assurance to work, we
of course need to assume a bit more than what we assume for a
complete clone, namely, that the "other side" will hold onto the
history behind these, i.e. does not rewind the tips it has already
shown to us, or, even if it does, that the objects reachable from
these cut-off points will somehow always be available to us on demand.

Can we do something similar, i.e. maintain a minimum set of cut-off
points and adjust that set incrementally, just sufficient to ensure
the integrity of objects locally created and not yet safely stored
away by pushing them to the "other side"?
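Concretely, those cut-off commits are recorded in $GIT_DIR/shallow; a small illustration (repository names are made up):

```shell
#!/bin/sh
# Demo: a depth-limited clone records its promise cut-off commits,
# one hash per line, in .git/shallow.
set -e
cd "$(mktemp -d)"
git init -q server
git -C server config user.email you@example.com
git -C server config user.name you
git -C server commit -q --allow-empty -m one
git -C server commit -q --allow-empty -m two
# --no-local forces the regular transport so --depth takes effect
git clone -q --no-local --depth=1 server client
cat client/.git/shallow   # the cut-off commit(s)
```

The question above is whether an analogous, incrementally maintained file could mark the lazy-object cut-off points in the same way.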

I haven't thought things through (and I know you, Ben and others
have thought much longer and harder), but I would imagine if we have
a commit object [*1*], some of whose parent commits, trees and blobs
are locally missing, and know that the commit exists on the "other
side", we know that all of these "missing" objects that are
referenced by the commit are also available from the "other side".
IOW, I suspect that the same principle "shallow" uses to give us the
integrity guarantee can be naturally extended to allow us to see if
a broken connectivity is OK.


[Footnote]

*1* The same can be said for a tag or a tree object that we know
exist on the "other side"; they may refer, directly or
indirectly through objects we locally have, to objects that
are missing locally, and as long as the starting-point objects
are known to be available on the "other side", it is OK for them
to be missing locally.



Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-07-31 Thread Jonathan Tan
On Mon, 31 Jul 2017 14:21:56 -0700
Junio C Hamano  wrote:

> Jonathan Tan  writes:
> 
> > Besides review changes, this patch set now includes my rewritten
> > lazy-loading sha1_file patch, so you can now do this (excerpted from one
> > of the tests):
> >
> > test_create_repo server
> > test_commit -C server 1 1.t abcdefgh
> > HASH=$(git hash-object server/1.t)
> > 
> > test_create_repo client
> > test_must_fail git -C client cat-file -p "$HASH"
> > git -C client config core.repositoryformatversion 1
> > git -C client config extensions.lazyobject \
> > "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
> > git -C client cat-file -p "$HASH"
> >
> > with fsck still working. Also, there is no need for a list of promised
> > blobs, and the long-running process protocol is being used.
> 
> I do not think I read your response to my last comment on this
> series, so I could be missing something large, but I am afraid that
> the resulting fsck is only half as useful as the normal fsck.  I do
> not see it any better than a hypothetical castrated version of fsck
> that only checks the integrity of objects that appear in the local
> object store, without doing any connectivity checks.

Sorry, I haven't replied to your last response [1]. That does sound like
a good idea, though, and can probably be extended to trees and blobs: we
need to make sure that any object referenced from local-only commits
(calculated as you describe in [1]) can be obtained through an object
walk from a remote-tracking branch.

I haven't fully thought through the implications of things like building
commits on top of an arbitrary upstream commit (since our upstream
commit is not a tip, the object walk from all remote-tracking branches
might not reach it).
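The check described above might look something like this hedged sketch
(the git commands are real, but the pipeline is only an illustration
of the idea, run inside the repository being checked):

```shell
# Hedged sketch: list objects reachable from local branches but not
# from any remote-tracking ref (the "local-only" objects), then ask
# the object store about each one.  Any of these that is absent is a
# genuinely lost local object, not one that is safe to lazily fetch.
#
# Caveat from the discussion: the rev-list walk itself must be able
# to complete, and a missing tree can hide further missing objects.
git rev-list --objects --branches --not --remotes |
cut -d' ' -f1 |
git cat-file --batch-check 2>/dev/null |
grep ' missing$' ||
echo "no local-only object is missing"
```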

To try to solve that, we could use an alternate object store for remote
objects, so that remote objects can be identified quickly without doing
a traversal. But that does not fully solve the problem, because some of
the information about which objects the remote has lies only in the
objects that reference them (for example, if we don't have a remote
blob, sometimes the only way to know that the remote has it is by
having a tree that contains it).

In addition, this also couples the lazy object loading to a particular
remote ref (or to all remote refs, if we decide to consider objects
reachable from any remote ref as potentially loadable).

I'll think about this further.

[1] https://public-inbox.org/git/xmqq379fkz4x@gitster.mtv.corp.google.com/

> Don't get me wrong.  The integrity check on local objects you still
> do is important---that is what allows us to make sure that the local
> "cache" does not prevent us from going to the real source of the
> remote object store by having a corrupt copy.  
> 
> But not being able to tell if a missing object is OK to be missing
> (because we can get them if/as needed from elsewhere) or we lost the
> sole copy of an object that we created and have not pushed out
> (hence we are in deep yogurt) makes it pretty much pointless to run
> "fsck", doesn't it?  It does not give us any guarantee that our
> repository plus perfect network connectivity gives us an environment
> to build further work on.
> 
> Or am I missing something fundamental?

Well, the fsck can still detect issues like corrupt objects (as you
mention above) and dangling heads, which might be real issues. But it is
true that it does not give you the guarantee you describe.

From a user standpoint, this might be worked around by providing a
connectivity-checking tool that is allowed to use the network, or simply
by having the user run a build to ensure that all necessary files are
present.
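Such a network-using check might, as a hedged sketch in the style of
the t0410 tests, simply read every object reachable from HEAD, so that
anything absent locally is requested through the configured lazy
loader (assuming, as in the tests, that "git cat-file -p" triggers the
loader):

```shell
# Hedged sketch of a connectivity check that is allowed to hit the
# network: walk everything HEAD refers to and read each object, so a
# locally absent object is fetched through the configured lazy loader.
# Anything that is neither local nor loadable is reported.
git rev-list --objects HEAD |
while read oid path
do
	git cat-file -p "$oid" >/dev/null 2>&1 ||
		echo "unobtainable: $oid"
done
```

In a repository where everything is present (or loadable), this prints
nothing.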

Having said that, this feature will be very nice to have.


Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-07-31 Thread Junio C Hamano
Jonathan Tan  writes:

> Besides review changes, this patch set now includes my rewritten
> lazy-loading sha1_file patch, so you can now do this (excerpted from one
> of the tests):
>
> test_create_repo server
> test_commit -C server 1 1.t abcdefgh
> HASH=$(git hash-object server/1.t)
> 
> test_create_repo client
> test_must_fail git -C client cat-file -p "$HASH"
> git -C client config core.repositoryformatversion 1
> git -C client config extensions.lazyobject \
> "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
> git -C client cat-file -p "$HASH"
>
> with fsck still working. Also, there is no need for a list of promised
> blobs, and the long-running process protocol is being used.

I do not think I read your response to my last comment on this
series, so I could be missing something large, but I am afraid that
the resulting fsck is only half as useful as the normal fsck.  I do
not see it any better than a hypothetical castrated version of fsck
that only checks the integrity of objects that appear in the local
object store, without doing any connectivity checks.

Don't get me wrong.  The integrity check on local objects you still
do is important---that is what allows us to make sure that the local
"cache" does not prevent us from going to the real source of the
remote object store by having a corrupt copy.  

But not being able to tell if a missing object is OK to be missing
(because we can get them if/as needed from elsewhere) or we lost the
sole copy of an object that we created and have not pushed out
(hence we are in deep yogurt) makes it pretty much pointless to run
"fsck", doesn't it?  It does not give us any guarantee that our
repository plus perfect network connectivity gives us an environment
to build further work on.

Or am I missing something fundamental?


[PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

2017-07-31 Thread Jonathan Tan
Besides review changes, this patch set now includes my rewritten
lazy-loading sha1_file patch, so you can now do this (excerpted from one
of the tests):

test_create_repo server
test_commit -C server 1 1.t abcdefgh
HASH=$(git hash-object server/1.t)

test_create_repo client
test_must_fail git -C client cat-file -p "$HASH"
git -C client config core.repositoryformatversion 1
git -C client config extensions.lazyobject \
"\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
git -C client cat-file -p "$HASH"

with fsck still working. Also, there is no need for a list of promised
blobs, and the long-running process protocol is being used.

Changes from v1:
 - added last patch that supports lazy loading
 - clarified documentation in "introduce lazyobject extension" patch
   (following Junio's comments [1])

As listed in the changes above, I have rewritten my lazy-loading
sha1_file patch to no longer use the list of promises. Also, I have
added documentation about the protocol used to (hopefully) the
appropriate places.

This is a minimal implementation, hopefully enough of a foundation to be
built upon. In particular, I haven't added the environment variable to
suppress lazy loading, and the lazy loading protocol only supports one
object at a time.

Other work
----------

This differs slightly from Ben Peart's patch [2] in that the
lazy-loading functionality is provided through a configured shell
command instead of a hook shell script. I envision commands like "git
clone", in the future, needing to pre-configure lazy loading, and I
think that it would be less surprising to the user if "git clone" wrote
a default configuration instead of a default hook.

This also differs from Christian Couder's patch set [3], which
implements a larger-scale object database, in that (i) my patch set does
not support
putting objects into external databases, and (ii) my patch set requires
the lazy loader to make the objects available in the local repo, instead
of allowing the objects to only be stored in the external database.

[1] https://public-inbox.org/git/xmqqzibpn1zh@gitster.mtv.corp.google.com/
[2] https://public-inbox.org/git/20170714132651.170708-2-benpe...@microsoft.com/
[3] https://public-inbox.org/git/20170620075523.26961-1-chrisc...@tuxfamily.org/

Jonathan Tan (5):
  environment, fsck: introduce lazyobject extension
  fsck: support refs pointing to lazy objects
  fsck: support referenced lazy objects
  fsck: support lazy objects as CLI argument
  sha1_file: support loading lazy objects

 Documentation/Makefile |   1 +
 Documentation/gitattributes.txt|  54 ++
 Documentation/gitrepository-layout.txt |   3 +
 .../technical/long-running-process-protocol.txt|  50 +
 Documentation/technical/repository-version.txt |  23 +
 Makefile   |   1 +
 builtin/cat-file.c |   2 +
 builtin/fsck.c |  25 -
 cache.h|   4 +
 environment.c  |   1 +
 lazy-object.c  |  80 +++
 lazy-object.h  |  12 +++
 object.c   |   7 ++
 object.h   |  13 +++
 setup.c|   7 +-
 sha1_file.c|  44 +---
 t/t0410-lazy-object.sh | 113 +
 t/t0410/lazy-object| 102 +++
 18 files changed, 478 insertions(+), 64 deletions(-)
 create mode 100644 Documentation/technical/long-running-process-protocol.txt
 create mode 100644 lazy-object.c
 create mode 100644 lazy-object.h
 create mode 100755 t/t0410-lazy-object.sh
 create mode 100755 t/t0410/lazy-object

-- 
2.14.0.rc1.383.gd1ce394fe2-goog