Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-05-07 Thread Junio C Hamano
Jonathan Nieder  writes:

> - there shouldn't be any need for the blobs to even be mentioned in
>   the .pack stored locally.  The .idx file maps from sha1 to offset
>   within the packfile --- a special offset could mean "this is a
>   missing blob".

Clever.

> - However, the list of missing blobs can be inferred from the existing
>   pack format, without a change to the pack format used in git
>   protocol.  As part of constructing the idx, "git index-pack"
>   inflates every object in the pack file sent by the server.  This
>   means we know what blobs they reference, so we can easily produce a
>   list for the idx file without changing the pack file format.

A minor wrinkle to keep in mind if you were to go this route is that
you'd need a way to tell the reason why a blob that is referenced by
a tree in the pack stream is not in the same pack stream.  

If the resulting repository on the receiving side has that blob
after the transfer, it is likely that the reason why the blob does
not appear in the pack is because the want/have/ack exchange told
the sending side that the receiving side has a commit that contains
the blob.  But when the blob does not exist in the receiving side
after the transfer, we cannot distinguish between two possible
cases.  The server may have actively wanted to omit it (i.e. the
case we are interested in in this discussion thread).  Or the
receiving end said that it has a commit that contains the blob but,
due to preexisting corruption, the receiving repository was actually
missing the blob.
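
To make the ambiguity concrete, here is a minimal sketch (Python;
`have_locally` and `missing_list` are illustrative stand-ins, not real
git APIs) of how a client holding a "positive" list of omitted blobs
could classify a blob that a tree references but that is absent:

    # Hypothetical classification of a referenced-but-absent blob.
    # `have_locally` would check loose objects and packs; `missing_list`
    # is the positive list of intentionally omitted blobs discussed in
    # this thread.
    def classify_blob(oid, have_locally, missing_list):
        if have_locally(oid):
            return "present"
        if oid in missing_list:
            return "omitted"       # the server actively withheld it
        # Without the positive list we could not tell this apart from
        # "omitted": the sender may have skipped the blob because we
        # claimed a commit containing it, yet we lost it to preexisting
        # corruption.
        return "corrupt-or-lost"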


Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-05-04 Thread Jonathan Nieder
Hi again,

Jeff Hostetler wrote:

> In my original RFC there were comments/complaints that with
> missing blobs we lose the ability to detect corruptions.  My
> proposed changes to index-pack and rev-list (and suggestions
> for other commands like fsck) just disabled those errors.
> Personally, I'm OK with that, but I understand that others
> would like to save the ability to distinguish between missing
> and corrupted.

I'm also okay with it.  In a partial clone, in the same way as a
missing ref represents a different valid state and thus passes fsck
regardless of how it happened, a missing blob is a valid state and it
is sensible for it to pass fsck.

A person might object that previously a repository that passed "git
fsck" was a repository where "git fast-export --all" would succeed,
and if I omit a blob that is not present on the remote server then
that invariant is gone.  But that problem exists even if we have a
list of missing blobs.  The server could rewind history and garbage
collect, causing attempts on the client to fetch a previously
advertised missing blob to fail.  Or the server can disappear
completely, or it can lose all its data and have to be restored from
an older backup that is missing newer blobs.

> Right, only the .pack is sent over the wire.  And that's why I
> suggest listing the missing SHAs in it.  I think we need the server
> to send a list of them -- whether in individual per-file type-5
> records as I suggested, or a single record containing a list of
> SHAs+data (which I think I prefer in hindsight).

A list of SHAs+data sounds sensible as a way of conveying additional
per-blob information (such as size).

> WRT being able to discover the missing blobs, is that true in
> all cases?  I don't think it is for thin-packs -- where the server
> only sends stuff you don't (supposedly) already have, right?

Generate the list of blobs referenced by trees in the pack, when you
are inflating them in git index-pack.  Omit any objects that you
already know about.  The remainder is the set of missing blobs.

One thing this doesn't tell you is whether the same missing blob is
available from multiple remotes.  It associates each blob with a
single remote, the first one it was discovered from.
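
For concreteness, a rough sketch of that inference (Python;
`tree_payloads` and `have` are stand-ins for index-pack's inflated tree
objects and the local object lookup):

    def blobs_referenced(tree_payload):
        # Entries of an inflated tree object are "<mode> <name>\0"
        # followed by the raw 20-byte SHA-1 of the entry.
        BLOB_MODES = (b'100644', b'100755', b'120000')  # file, exec, symlink
        i = 0
        while i < len(tree_payload):
            nul = tree_payload.index(b'\0', i)
            mode = tree_payload[i:nul].split(b' ', 1)[0]
            oid = tree_payload[nul + 1:nul + 21].hex()
            if mode in BLOB_MODES:  # skip 40000 (subtree), 160000 (gitlink)
                yield oid
            i = nul + 21

    def missing_blobs(tree_payloads, have):
        referenced = {o for t in tree_payloads for o in blobs_referenced(t)}
        return sorted(o for o in referenced if not have(o))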

> If instead, we have pack-objects indicate that it *would have*
> sent this blob normally, we don't change any of the semantics
> of how pack files are assembled.  This gives the client a
> definitive list of what's missing.

If there is additional information the server wants to convey about
the missing blobs, then this makes sense to me --- it has to send it
somewhere, and a separate list outside the pack seems like a good
place to put that.

When there is no additional information beyond "this is a blob I am
omitting", there is nothing the wire protocol needs to convey.  But
you've convinced me that that's usually moot because the blob size
is valuable information.

[...]
> The more I think about it, I'd like to NOT put the list in the .idx
> file.  If we put it in a separate peer file next to the *.{pack,idx}
> we could decorate it with the name of the remote used in the fetch
> or clone.

I have no strong opinions about this in either direction.

Since it only affects the local repository format and doesn't affect
the protocol, we can experiment without too much fuss. :)

[...]
> I've always wondered if repack, fetch, and friends should build
> a meta-idx that merges all of the current .idx files, but that
> is a different conversation

Yes, we've been thinking about a one-index-for-many-packs facility
to deal with the proliferation of packfiles with only one or a few
large objects without having to waste I/O copying them into a
concatenated pack file.

Another thing we're looking into is incorporating something like
Martin Fick's "git exproll" (
http://public-inbox.org/git/1375756727-1275-1-git-send-email-artag...@gmail.com/)
into "git gc --auto" to more aggressively keep the number of packs
down.

> On 5/3/2017 2:27 PM, Jonathan Nieder wrote:

>> If we were starting over, would this belong in the tree object?
>> (Part of the reason I ask is that we have an opportunity to sort
>> of start over as part of the transition away from SHA-1.)
>
> Yes, putting the size in the tree would be nice.  That does
> add 5-8 bytes to every entry in every tree (on top of the
> larger hash), but it would be useful.
>
> If we're going there, we might just define the new hash
> as the concatenation of the SHA* and the length of the data
> hashed.  So instead of a 32-byte SHA256, we'd have a (32 + 8)
> byte value.  (Or maybe a (32 + 5) if we want to squeeze it.)

Thanks --- that feedback helps.

It doesn't stop us from having to figure something else out in the
short term, of course.
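
As a toy illustration of that concatenation idea (Python; not a real
git format, just the arithmetic of the suggestion):

    import hashlib

    def extended_oid(data: bytes) -> bytes:
        # 32-byte SHA-256 of the data followed by its 8-byte big-endian
        # length, making every object name carry the object's size.
        return hashlib.sha256(data).digest() + len(data).to_bytes(8, 'big')

    oid = extended_oid(b'hello\n')
    assert len(oid) == 40                       # 32 + 8
    size = int.from_bytes(oid[32:], 'big')      # size recoverable without data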

[...]
>> I am worried about the implications of storing this kind of policy
>> information in the pack file.  There may be useful information along
>> these lines for a server to advertise, but I don't think it belongs in
>> the pack file.

Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-05-04 Thread Jeff Hostetler



On 5/3/2017 2:27 PM, Jonathan Nieder wrote:
> Hi,
>
> Jeff Hostetler wrote:
>
>> Missing-Blob Support
>>
>> Let me offer up an alternative idea for representing
>> missing blobs.  This differs from both of our previous
>> proposals.  (I don't have any code for this new proposal,
>> I just want to think out loud a bit and see if this is a
>> direction worth pursuing -- or a complete non-starter.)
>>
>> Both proposals talk about detecting and adapting to a missing
>> blob and ways to recover -- when we fail to find a blob.
>> Comments on the thread asked about:
>> () being able to detect missing blobs vs corrupt repos
>> () being unable to detect duplicate blobs
>> () expense of blob search.
>>
>> Suppose we store "positive" information about missing blobs?
>> This would let us know that a blob is intentionally missing
>> and possibly some meta-data about it.


> We've discussed this a little informally but didn't go more into
> it, so I'm glad you brought it up.
>
> There are two use cases I care about.  I'll want names to refer to
> them later, so describing them now:
>
>  A. A p4 or svn style "monorepo" containing all code within a company.
> We want to make git scale well when working with such a
> repository.  Disk usage, network usage, index size, and object
> lookup time are all issues for such a repository.
>
> A repository can creep up in size so it starts falling into this
> category even though git coped well with it before.  Another way
> to end up in this category is a conversion from a version control
> system like p4 or svn.
>
>  B. A more modestly sized repository with some large blobs in it.  At
> clone time we want to omit unneeded large blobs to speed up the
> clone, save disk space, and save bandwidth.
>
> For this kind of repository, the idx file already contained all
> those blobs and that was not causing problems --- the only problem
> was the actual blob size.


Yes, I've been primarily concerned with "case A" repos.
I work with the team converting the Windows source repo
to git.  This was discussed in Brussels as part of the
GVFS presentation.

The Windows tree has 3.5M files in the worktree for a simple checkout
of HEAD.  The index is 450MB.  There are 500K trees/folders in
the commit.  Multiply that by a scale factor considering the number
of trunk/release branches, number of developers, typical number of
commits per day per developer, and n years (decades) of history, and
we get to a very large number.

FWIW, there's also a "case C" which has both, but that
just hurts to think about.




>> 1. Suppose we update the .pack file format slightly.
>
> [...]
>
>> 2. Make a similar change in the .idx format and git-index-pack
>>    to include them there.  Then blob lookup operations could
>>    definitively determine that a blob exists and is just not
>>    present locally.
>
> Some nits:
>
> - there shouldn't be any need for the blobs to even be mentioned in
>   the .pack stored locally.  The .idx file maps from sha1 to offset
>   within the packfile --- a special offset could mean "this is a
>   missing blob".
>
> - Git protocol works by sending pack files over the wire.  The idx
>   file is not transmitted to the client --- the client instead
>   reconstructs it from the pack file.  I assume this is why you
>   stored the SHA-1 of the object in the pack file, but it could
>   equally well be sent through another stream (since this proposal
>   involves a change to git protocol anyway).
>
> - However, the list of missing blobs can be inferred from the existing
>   pack format, without a change to the pack format used in git
>   protocol.  As part of constructing the idx, "git index-pack"
>   inflates every object in the pack file sent by the server.  This
>   means we know what blobs they reference, so we can easily produce a
>   list for the idx file without changing the pack file format.


In my original RFC there were comments/complaints that with
missing blobs we lose the ability to detect corruptions.  My
proposed changes to index-pack and rev-list (and suggestions
for other commands like fsck) just disabled those errors.
Personally, I'm OK with that, but I understand that others
would like to save the ability to distinguish between missing
and corrupted.

Right, only the .pack is sent over the wire.  And that's why I
suggest listing the missing SHAs in it.  I think we need the server
to send a list of them -- whether in individual per-file type-5
records as I suggested, or a single record containing a list of
SHAs+data (which I think I prefer in hindsight).
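
A sketch of what such a single SHAs+data record could look like
(Python; the layout is invented for illustration, not a proposed wire
format: a 4-byte count, then a raw 20-byte SHA-1 plus a 5-byte uint:40
size per omitted blob):

    import struct

    def encode_missing(entries):
        # entries: list of (sha1_hex, size) for blobs omitted from the pack
        out = [struct.pack('>I', len(entries))]
        for sha_hex, size in entries:
            out.append(bytes.fromhex(sha_hex))    # 20 raw bytes
            out.append(size.to_bytes(5, 'big'))   # uint:40 raw blob size
        return b''.join(out)

    def decode_missing(buf):
        (n,) = struct.unpack_from('>I', buf, 0)
        off, entries = 4, []
        for _ in range(n):
            sha_hex = buf[off:off + 20].hex()
            size = int.from_bytes(buf[off + 20:off + 25], 'big')
            entries.append((sha_hex, size))
            off += 25
        return entries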

WRT being able to discover the missing blobs, is that true in
all cases?  I don't think it is for thin-packs -- where the server
only sends stuff you don't (supposedly) already have, right?

If instead, we have pack-objects indicate that it *would have*
sent this blob normally, we don't change any of the semantics
of how pack files are assembled.  This gives the client a
definitive list of what's missing.



> If this is done by only changing the idx file format and not the pack
> file, then it does not affect the protocol.

Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-05-03 Thread Jonathan Nieder
Hi,

Jonathan Tan wrote:

> The binary search to lookup a packfile offset from a .idx file
> (which involves disk reads) would take longer for all lookups (not
> just lookups for missing blobs) - I think I prefer keeping the lists
> separate, to avoid pessimizing the (likely) usual case where the
> relevant blobs are all already in local repo storage.

Another relevant operation is looking up objects by offset or
index_nr.  The current implementation involves building an in-memory
reverse index on demand by reading the idx file and sorting it by
offset --- see pack-revindex.c::create_pack_revindex.  This takes O(n
log n) time where n is the size of the idx file.

That said, it could be avoided by storing an on-disk reverse index
with the pack.  That's something we've been wanting to do anyway.
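
For reference, a sketch of what that on-demand construction effectively
does (Python; `idx_entries` stands in for the parsed .idx contents):
the forward index maps sha1 to offset, and the reverse index is the
same pairs re-sorted by offset:

    import bisect

    def build_revindex(idx_entries):
        # idx_entries: list of (sha1_hex, pack_offset) read from the .idx.
        # The sort is the O(n log n) step done on demand today; an on-disk
        # reverse index would turn it into a plain read.
        return sorted((off, sha) for sha, off in idx_entries)

    def object_at(revindex, offset):
        i = bisect.bisect_left(revindex, (offset,))
        if i < len(revindex) and revindex[i][0] == offset:
            return revindex[i][1]
        return None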

Thanks,
Jonathan


Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-05-03 Thread Jonathan Tan

On 05/03/2017 09:38 AM, Jeff Hostetler wrote:
> On 3/8/2017 1:50 PM, g...@jeffhostetler.com wrote:
>> From: Jeff Hostetler 
>>
>> [RFC] Partial Clone and Fetch
>> =============================
>> [...]
>> E. Unresolved Thoughts
>> ======================
>>
>> *TODO* The server should optionally return (in a side-band?) a list
>> of the blobs that it omitted from the packfile (and possibly the sizes
>> or sha1_object_info() data for them) during the fetch-pack/upload-pack
>> operation.  This would allow the client to distinguish between invalid
>> SHAs and missing ones.  Size information would allow the client to
>> maybe choose between various servers.
>
> Since I first posted this, Jonathan Tan has started a related
> discussion on missing blob support.
> https://public-inbox.org/git/cagf8dgk05+f4ux-8+imfvqd0n2jp6yxj18ag8udaeh6qc6s...@mail.gmail.com/T/
>
> I want to respond to both of these threads here.
> -


Thanks for your input. I see that you have explained both "storing
'positive' information about missing blobs" and "what to store with
that positive information"; I'll just comment on the former for now.



> Missing-Blob Support
>
> Let me offer up an alternative idea for representing
> missing blobs.  This differs from both of our previous
> proposals.  (I don't have any code for this new proposal,
> I just want to think out loud a bit and see if this is a
> direction worth pursuing -- or a complete non-starter.)
>
> Both proposals talk about detecting and adapting to a missing
> blob and ways to recover -- when we fail to find a blob.
> Comments on the thread asked about:
> () being able to detect missing blobs vs corrupt repos
> () being unable to detect duplicate blobs
> () expense of blob search.
>
> Suppose we store "positive" information about missing blobs?
> This would let us know that a blob is intentionally missing
> and possibly some meta-data about it.


I thought about this (see "Some alternative designs" in [1]), listing 
some similar benefits, but concluded that "it is difficult to scale to 
large repos".


Firstly, to be clear, by large repos I meant (and mean) the svn-style 
"monorepos" that Jonathan Nieder mentions as use case "A" [2].


My concern is that such lists (whether in separate file(s) or in .idx
files) would be too unwieldy to manipulate. Even if we design things to
avoid modifying such lists (for example, by adding a new list whenever
we fetch instead of trying to modify an existing one), we would at least
need to sort their contents (for example, when generating an .idx in the
first place). For a repo with 10M-100M blobs [3], this might be doable
on today's computers, but I would be concerned if a repo were to exceed
such numbers.


[1] <20170426221346.25337-1-jonathanta...@google.com>
[2] <20170503182725.gc28...@aiede.svl.corp.google.com>
[3] In Microsoft's announcement of Git Virtual File System [4], they 
mentioned "over 3.5 million files" in the Windows codebase. I'm not sure 
if this refers to files in a snapshot (that is, working copy) or all 
historical versions.
[4] 
https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-gvfs-git-virtual-file-system/



> 1. Suppose we update the .pack file format slightly.
>    () We use the value 5 in "enum object_type" to mean a
>   "missing-blob".
>    () We update git-pack-objects as I did in my RFC, but have it
>   create type 5 entries for the blobs that are omitted,
>   rather than nothing.
>    () Hopefully, the same logic that currently keeps pack-objects
>   from sending unnecessary blobs on subsequent fetches can
>   also be used to keep it from sending unnecessary missing-blob
>   entries.
>    () The type 5 missing-blob entry would contain the SHA-1 of the
>   blob and some meta-data to be explained later.


My original idea was to have sorted list(s) of hashes in separate 
file(s) much like the currently existing shallow file; it would have the 
semantics of "a hash here might be present or absent; if it is absent, 
use the hook". (Initially I thought that one list would be sufficient, 
but after reading your idea and considering it some more, multiple lists 
might be better.)
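
A minimal sketch of that lookup (Python; the file layout --
concatenated, sorted, raw 20-byte SHA-1s -- is assumed here for
illustration):

    import bisect

    class MissingBlobList:
        def __init__(self, path):
            with open(path, 'rb') as f:
                data = f.read()
            # the file is N sorted raw 20-byte SHA-1s, back to back
            self.oids = [data[i:i + 20] for i in range(0, len(data), 20)]

        def __contains__(self, sha1_hex):
            raw = bytes.fromhex(sha1_hex)
            i = bisect.bisect_left(self.oids, raw)
            return i < len(self.oids) and self.oids[i] == raw

    # absent from every list => fall back to the hook (or report corruption)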


Your idea of storing them in an .idx (and possibly corresponding .pack 
file) is similar, I think. Although mine is probably simpler - at least, 
we wouldn't need a new object_type.


As described above, I don't think this list-of-hashes idea will work 
(because of the large numbers of blobs involved), but I'll compare it to 
yours anyway just in case we end up being convinced that this general 
idea works.



> 2. Make a similar change in the .idx format and git-index-pack
>    to include them there.  Then blob lookup operations could
>    definitively determine that a blob exists and is just not
>    present locally.
>
> 3. With this, packfile-based blob-lookup operations can get a
>    "missing-blob" result.
>    () It should be possible to short-cut searching in other
>   packfiles (because we don't have to assume that the blob
>   was just misplaced in another packfile).

Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-05-03 Thread Jonathan Nieder
Hi,

Jeff Hostetler wrote:
> On 3/8/2017 1:50 PM, g...@jeffhostetler.com wrote:

>> [RFC] Partial Clone and Fetch
>> =============================
>> [...]
>> E. Unresolved Thoughts
>> ======================
>>
>> *TODO* The server should optionally return (in a side-band?) a list
>> of the blobs that it omitted from the packfile (and possibly the sizes
>> or sha1_object_info() data for them) during the fetch-pack/upload-pack
>> operation.  This would allow the client to distinguish between invalid
>> SHAs and missing ones.  Size information would allow the client to
>> maybe choose between various servers.
>
> Since I first posted this, Jonathan Tan has started a related
> discussion on missing blob support.
> https://public-inbox.org/git/cagf8dgk05+f4ux-8+imfvqd0n2jp6yxj18ag8udaeh6qc6s...@mail.gmail.com/T/
>
> I want to respond to both of these threads here.

Thanks much for this.

> Missing-Blob Support
> 
>
> Let me offer up an alternative idea for representing
> missing blobs.  This differs from both of our previous
> proposals.  (I don't have any code for this new proposal,
> I just want to think out loud a bit and see if this is a
> direction worth pursuing -- or a complete non-starter.)
>
> Both proposals talk about detecting and adapting to a missing
> blob and ways to recover -- when we fail to find a blob.
> Comments on the thread asked about:
> () being able to detect missing blobs vs corrupt repos
> () being unable to detect duplicate blobs
> () expense of blob search.
>
> Suppose we store "positive" information about missing blobs?
> This would let us know that a blob is intentionally missing
> and possibly some meta-data about it.

We've discussed this a little informally but didn't go more into
it, so I'm glad you brought it up.

There are two use cases I care about.  I'll want names to refer to
them later, so describing them now:

 A. A p4 or svn style "monorepo" containing all code within a company.
We want to make git scale well when working with such a
repository.  Disk usage, network usage, index size, and object
lookup time are all issues for such a repository.

A repository can creep up in size so it starts falling into this
category even though git coped well with it before.  Another way
to end up in this category is a conversion from a version control
system like p4 or svn.

 B. A more modestly sized repository with some large blobs in it.  At
clone time we want to omit unneeded large blobs to speed up the
clone, save disk space, and save bandwidth.

For this kind of repository, the idx file already contained all
those blobs and that was not causing problems --- the only problem
was the actual blob size.

> 1. Suppose we update the .pack file format slightly.
[...]
> 2. Make a similar change in the .idx format and git-index-pack
>to include them there.  Then blob lookup operations could
>definitively determine that a blob exists and is just not
>present locally.

Some nits:

- there shouldn't be any need for the blobs to even be mentioned in
  the .pack stored locally.  The .idx file maps from sha1 to offset
  within the packfile --- a special offset could mean "this is a
  missing blob".

- Git protocol works by sending pack files over the wire.  The idx
  file is not transmitted to the client --- the client instead
  reconstructs it from the pack file.  I assume this is why you
  stored the SHA-1 of the object in the pack file, but it could
  equally well be sent through another stream (since this proposal
  involves a change to git protocol anyway).

- However, the list of missing blobs can be inferred from the existing
  pack format, without a change to the pack format used in git
  protocol.  As part of constructing the idx, "git index-pack"
  inflates every object in the pack file sent by the server.  This
  means we know what blobs they reference, so we can easily produce a
  list for the idx file without changing the pack file format.

If this is done by only changing the idx file format and not the pack
file, then it does not affect the protocol.  That is good for
experimentation --- it lets us try out different formats client-side
without having to coordinate with server authors.
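
To illustrate the first nit, a sketch of a lookup against such an idx
(Python; the sentinel value and the dict stand-in for the idx are
invented for illustration):

    MISSING_OFFSET = 0xFFFFFFFFFFFFFFFF   # reserved sentinel, not a real format

    def lookup(idx, sha1_hex):
        # idx: mapping sha1_hex -> pack offset, as reconstructed from the .idx
        off = idx.get(sha1_hex)
        if off is None:
            return ('unknown', None)     # not in this pack at all
        if off == MISSING_OFFSET:
            return ('missing', None)     # known to exist, intentionally omitted
        return ('present', off)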

In case (A), this proposal means we get back some of the per-object
overhead that we were trying to avoid.  I would like to avoid that
if possible.

In case (B), this proposal doesn't hurt.

One problem with proposals so far has been how to handle repositories
with multiple remotes.  Having a local list of missing blobs is
convenient because it can be associated to the remote --- when a blob
is referenced later, we know which remote to ask for for it.  This is
a niche feature, but it's a nice bonus.

[...]
> 3. With this, packfile-based blob-lookup operations can get a
>    "missing-blob" result.
>    () It should be possible to short-cut searching in other
>   packfiles (because we don't have to assume that the blob
>   was just misplaced in another packfile).

Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-05-03 Thread Jeff Hostetler



On 3/8/2017 1:50 PM, g...@jeffhostetler.com wrote:
> From: Jeff Hostetler 
>
> [RFC] Partial Clone and Fetch
> =============================
> [...]
> E. Unresolved Thoughts
> ======================
>
> *TODO* The server should optionally return (in a side-band?) a list
> of the blobs that it omitted from the packfile (and possibly the sizes
> or sha1_object_info() data for them) during the fetch-pack/upload-pack
> operation.  This would allow the client to distinguish between invalid
> SHAs and missing ones.  Size information would allow the client to
> maybe choose between various servers.


Since I first posted this, Jonathan Tan has started a related
discussion on missing blob support.
https://public-inbox.org/git/cagf8dgk05+f4ux-8+imfvqd0n2jp6yxj18ag8udaeh6qc6s...@mail.gmail.com/T/

I want to respond to both of these threads here.
-

Missing-Blob Support


Let me offer up an alternative idea for representing
missing blobs.  This differs from both of our previous
proposals.  (I don't have any code for this new proposal,
I just want to think out loud a bit and see if this is a
direction worth pursuing -- or a complete non-starter.)

Both proposals talk about detecting and adapting to a missing
blob and ways to recover -- when we fail to find a blob.
Comments on the thread asked about:
() being able to detect missing blobs vs corrupt repos
() being unable to detect duplicate blobs
() expense of blob search.

Suppose we store "positive" information about missing blobs?
This would let us know that a blob is intentionally missing
and possibly some meta-data about it.


1. Suppose we update the .pack file format slightly.
   () We use the value 5 in "enum object_type" to mean a
  "missing-blob".
   () We update git-pack-objects as I did in my RFC, but have it
  create type 5 entries for the blobs that are omitted,
  rather than nothing.
   () Hopefully, the same logic that currently keeps pack-objects
  from sending unnecessary blobs on subsequent fetches can
  also be used to keep it from sending unnecessary missing-blob
  entries.
   () The type 5 missing-blob entry would contain the SHA-1 of the
  blob and some meta-data to be explained later.

2. Make a similar change in the .idx format and git-index-pack
   to include them there.  Then blob lookup operations could
   definitively determine that a blob exists and is just not
   present locally.

3. With this, packfile-based blob-lookup operations can get a
   "missing-blob" result.
   () It should be possible to short-cut searching in other
  packfiles (because we don't have to assume that the blob
  was just misplaced in another packfile).
   () Lookup can still look for the corresponding loose blob
  (in case a previous lookup already "faulted it in").

4. We can then think about dynamically fetching it.
   () Several techniques for this are currently being
  discussed on the mailing list in other threads,
  so I won't go into this here.
   () There has also been debate about whether this should
  yield a loose blob or a new packfile.  I think both
  forms have merit and depend on whether we are limited
  to asking for a single blob or can make a batch request.
   () A dynamically-fetched loose blob is placed in the normal
  loose blob directory hierarchy so that subsequent
  lookups can find it as mentioned above.
   () A dynamically-fetched packfile (with one or more blobs)
  is written to the ODB and then the lookup operation
  completes.
  {} I want to isolate these packfiles from the main
 packfiles, so that they behave like a second-stage
 lookup and don't affect the caching/LRU nature of
 the existing first-stage packfile lookup.
  {} I also don't want the ambiguity of having 2 primary
 packfiles with a blob marked as missing in 1 and
 present in the other.

5. git-repack should be updated to "do the right thing" and
   squash missing-blob entries.

6. And etc.
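
For reference, a sketch of the loose-blob placement relied on in
point 4 above (Python; this mirrors git's standard loose-object layout:
a zlib-deflated "blob <size>\0" header plus data, stored under
objects/xx/yyyy...):

    import hashlib
    import os
    import zlib

    def write_loose_blob(objdir, data):
        # a loose object is "blob <size>\0<data>", hashed then deflated
        payload = b'blob %d\x00' % len(data) + data
        sha1_hex = hashlib.sha1(payload).hexdigest()
        path = os.path.join(objdir, sha1_hex[:2], sha1_hex[2:])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as f:
            f.write(zlib.compress(payload))
        return sha1_hex       # subsequent lookups find it here as usual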


Missing-Blob Entry Data
===

A missing-blob entry needs to contain the SHA-1 value of
the blob (obviously).  Other fields are nice to have, but
are not necessary.  Here are a few fields to consider.

A. The SHA-1 (20 bytes)

B. The raw size of the blob (5? bytes).
   () This is the cleaned size of the file as stored.  The
  server does not (and should not) have any knowledge
  of the smudging that may happen.
   () This may be useful if whatever dynamic-fetch-hook
  wants to customize its behavior, such as individually
  fetching large blobs and batch fetching smaller ones
  from the same server.
   () GVFS found it necessary to create a custom server
  end-point to get blob size data so that "ls -l"
  could show file sizes for non-present virtualized
  files.
   () 5 bytes (uint:40) should be more than enough for this.

C. A server "hint" (20 

Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-03-22 Thread Jeff Hostetler



On 3/22/2017 12:21 PM, Johannes Schindelin wrote:
> Hi Kostis,
>
> On Wed, 22 Mar 2017, ankostis wrote:
>
>> On 8 March 2017 at 19:50,   wrote:
>>> From: Jeff Hostetler 
>>>
>>> [RFC] Partial Clone and Fetch
>>> =============================
>>>
>>> This is a WIP RFC for a partial clone and fetch feature wherein the
>>> client can request that the server omit various blobs from the
>>> packfile during clone and fetch.  Clients can later request omitted
>>> blobs (either from a modified upload-pack-like request to the server
>>> or via a completely independent mechanism).
>>
>> Is it foreseen that the server may *decide* which partial objects to
>> serve, and the cloning client still works OK?
>
> The foreseeable use case will be to speed up clones of insanely large
> repositories by omitting blobs that are not immediately required, and let
> the client fetch them later on demand.
>
> That is all, no additional permission model or anything. In fact, we do
> not even need to ensure that blobs are reachable in our use case, as only
> trusted parties are allowed to access the server to begin with.
>
> That does not mean, of course, that there should not be an option to limit
> access to objects that are reachable.
>
>> My case in mind is storing confidential files in Git (server)
>> that I want to publicize to partial-cloning clients,
>> for non-repudiation, by sending out trees and commits alone
>> (or any non-sensitive blobs).
>>
>> A possible UI would be to rely on a `.gitattributes` to specify
>> which objects are to be upheld.
>>
>> Apologies if I'm intruding with an unrelated feature request.
>
> I think this is a valid use case, and Jeff's design certainly does not
> prevent future patches to that end.
>
> However, given that Jeff's use case does not require any such feature, I
> would expect the people who want those features to do the heavy lifting on
> top of his work. It is too different from the intended use case to
> reasonably ask of Jeff.


As Johannes said, all I'm proposing is a way to limit the amount of
data the client receives to help git scale to extremely large
repositories.  For example, I probably don't need 20 years of history
or the entire source tree if I'm only working in a narrow subset of
the tree.

I'm not sure how you would achieve the confidential file scenario
that you describe, but you might try to build on it and see if you
can make it work.

Jeff





Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-03-22 Thread Johannes Schindelin
Hi Kostis,

On Wed, 22 Mar 2017, ankostis wrote:

> On 8 March 2017 at 19:50,   wrote:
> > From: Jeff Hostetler 
> >
> > [RFC] Partial Clone and Fetch
> > =============================
> >
> > This is a WIP RFC for a partial clone and fetch feature wherein the
> > client can request that the server omit various blobs from the
> > packfile during clone and fetch.  Clients can later request omitted
> > blobs (either from a modified upload-pack-like request to the server
> > or via a completely independent mechanism).
> 
> Is it foreseen that the server may *decide* which partial objects to
> serve, and the cloning client still works OK?

The foreseeable use case will be to speed up clones of insanely large
repositories by omitting blobs that are not immediately required, and let
the client fetch them later on demand.

That is all, no additional permission model or anything. In fact, we do
not even need to ensure that blobs are reachable in our use case, as only
trusted parties are allowed to access the server to begin with.

That does not mean, of course, that there should not be an option to limit
access to objects that are reachable.

> My case in mind is storing confidential files in Git (server)
> that I want to publicize to partial-cloning clients,
> for non-repudiation, by sending out trees and commits alone
> (or any non-sensitive blobs).
>
> A possible UI would be to rely on a `.gitattributes` to specify
> which objects are to be upheld.
>
>
> Apologies if I'm intruding with an unrelated feature request.

I think this is a valid use case, and Jeff's design certainly does not
prevent future patches to that end.

However, given that Jeff's use case does not require any such feature, I
would expect the people who want those features to do the heavy lifting on
top of his work. It is too different from the intended use case to
reasonably ask of Jeff.

Ciao,
Johannes


Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-03-22 Thread ankostis
Dear Jeff

I read most of the valuable references you provided
but could not find anything along the lines of what I describe inline.


On 8 March 2017 at 19:50,   wrote:
> From: Jeff Hostetler 
>
>
> [RFC] Partial Clone and Fetch
> =============================
>
> This is a WIP RFC for a partial clone and fetch feature wherein the client
> can request that the server omit various blobs from the packfile during
> clone and fetch.  Clients can later request omitted blobs (either from a
> modified upload-pack-like request to the server or via a completely
> independent mechanism).

Is it foreseen that the server may *decide* which partial objects to
serve, and the cloning client still works OK?

My case in mind is storing confidential files in Git (server)
that I want to publicize to partial-cloning clients,
for non-repudiation, by sending out trees and commits alone
(or any non-sensitive blobs).

A possible UI would be to rely on a `.gitattributes` to specify
which objects are to be upheld.


Apologies if I'm intruding with an unrelated feature request.
  Kostis


Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-03-17 Thread Jeff Hostetler



On 3/16/2017 5:43 PM, Jeff Hostetler wrote:
> On 3/9/2017 3:18 PM, Jonathan Tan wrote:
>> Overall, this fetch/clone approach seems reasonable to me, except
>> perhaps some unanswered questions (some of which are also being
>> discussed elsewhere):
>>  - does the server need to tell us of missing blobs?
>>  - if yes, does the server need to tell us their file sizes?
>
> File sizes are a nice addition.  For example, with a virtual
> file system, a "ls -l" can lie and tell you the sizes of the
> yet-to-be-populated files.


Never mind the "ls -l" case, I forgot about the need for the
client to display the size of the (possibly) smudged file,
rather than the actual blob size.


Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-03-16 Thread Jeff Hostetler



On 3/9/2017 3:18 PM, Jonathan Tan wrote:
> Overall, this fetch/clone approach seems reasonable to me, except
> perhaps some unanswered questions (some of which are also being
> discussed elsewhere):
>  - does the server need to tell us of missing blobs?
>  - if yes, does the server need to tell us their file sizes?


File sizes are a nice addition.  For example, with a virtual
file system, a "ls -l" can lie and tell you the sizes of the
yet-to-be-populated files.  Or if the client wants to distinguish
between going back to the original remote or going to S3 for the
blob, it could use the size to choose.  (I'm not saying we actually
build that yet, but others on the mailing list have spoken about
parking large blobs in S3.)

So, not necessary, but might be nice to have.


>  - do we need to store the list of missing blobs somewhere (whether the
>    server told it to us or whether we inferred it from the fetched
>    trees)


We should be able to infer the list of missing blobs; I hadn't
considered that.  However, by doing so we will need to disable
some of the integrity checking (as I had to do with the
"--allow-partial" option) and some concerns about that were
discussed earlier in the thread.  But if we do that inference
during clone/fetch and record it somewhere, we could get back
the integrity checking.



> The answers to this probably depend on the answers in "B. Issues
> Backfilling Omitted Blobs" (especially the additional concepts I listed
> below).
>
> Also, do you have any plans to implement other functionality, e.g. "git
> checkout" (which will allow fetches and clones to repositories with a
> working directory)? (I don't know what the mailing list consensus would
> be for the "acceptance criteria" for this patch set, but I would at
> least include "checkout".)

Yes, supporting "checkout" is essential. Commands like "merge", "diff",
etc. will come later.  In Ben's RFC, he has been investigating
demand-loading blobs in read_object().  I've been focusing on
pre-fetching the missing blobs for a particular command.  I need
to make more progress on this topic.



> On 03/08/2017 10:50 AM, g...@jeffhostetler.com wrote:
>> B. Issues Backfilling Omitted Blobs
>> ===================================
>>
>> Ideally, if the client only does "--partial-by-profile" fetches, it
>> should not need to fetch individual missing blobs, but we have to allow
>> for it to handle the other commands and other unexpected issues.
>>
>> There are 3 orthogonal concepts here:  when, how and where?
>
> Another concept is "how to determine if a blob is really omitted" - do
> we store a list somewhere or do we assume that all missing blobs are
> purposely omitted (like in this patch set)?
>
> Yet another concept is "whether to fetch" - for example, a checkout
> should almost certainly fetch, but a rev-list used by a connectivity
> check (like in patch 6 of this set) should not.
>
> For example, for historical-blob-searching commands like "git log -S",
> should we:
>  a) fetch everything missing (so users should use date-limiting
> arguments)
>  b) fetch nothing missing
>  c) use the file size to automatically exclude big files, but fetch
> everything else
>
> For a) and b), we wouldn't need file size information for missing blobs,
> but for c), we do. This might determine if we need file size information
> in the fetch-pack/upload-pack protocol.


good points.




>> C. New Blob-Fetch Protocol (2a)
>> ===============================
>>
>> *TODO* A new pair of commands, such as fetch-blob-pack and upload-blob-pack,
>> will be created to let the client request a batch of blobs and receive a
>> packfile.  A protocol similar to the fetch-pack/upload-pack will be spoken
>> between them.  (This avoids complicating the existing protocol and the work
>> of enumerating the refs.)  Upload-blob-pack will use pack-objects to build
>> the packfile.
>>
>> It is also more efficient than requesting a single blob at a time using
>> the existing fetch-pack/upload-pack mechanism (with the various allow
>> unreachable options).
>>
>> *TODO* The new request protocol will be defined in the patch series.
>> It will include: a list of the desired blob SHAs.  Possibly also the commit
>> SHA, branch name, and pathname of each blob (or whatever is necessary to let
>> the server address the reachability concerns).  Possibly also the last
>> known SHA for each blob to allow for deltafication in the packfile.
>
> Context (like the commit SHA-1) would help in reachability checks, but
> I'm not sure if we can rely on that. It is true that I can't think of a
> way that the client would dissociate a blob that is missing from its
> tree or commit (because it would first need to "fault-in" that blob to
> do its operation). But clients operating on non-contextual SHA-1s (e.g.
> "git cat-file") and servers manipulating commits (so that the commit
> SHA-1 that the client had in its context is no longer reachable) are not
> uncommon, I think.
>
> Having said that, it might be useful to include the context in the
> protocol anyway as an optional "hint".


That is what I was thinking. A hint of the branch or 

Re: [PATCH 00/10] RFC Partial Clone and Fetch

2017-03-09 Thread Jonathan Tan
Overall, this fetch/clone approach seems reasonable to me, except 
perhaps some unanswered questions (some of which are also being 
discussed elsewhere):

 - does the server need to tell us of missing blobs?
 - if yes, does the server need to tell us their file sizes?
 - do we need to store the list of missing blobs somewhere (whether the
   server told it to us or whether we inferred it from the fetched
   trees)

The answers to this probably depend on the answers in "B. Issues 
Backfilling Omitted Blobs" (especially the additional concepts I listed 
below).


Also, do you have any plans to implement other functionality, e.g. "git 
checkout" (which will allow fetches and clones to repositories with a 
working directory)? (I don't know what the mailing list consensus would 
be for the "acceptance criteria" for this patch set, but I would at 
least include "checkout".)


On 03/08/2017 10:50 AM, g...@jeffhostetler.com wrote:
> B. Issues Backfilling Omitted Blobs
> ===================================
>
> Ideally, if the client only does "--partial-by-profile" fetches, it
> should not need to fetch individual missing blobs, but we have to allow
> for it to handle the other commands and other unexpected issues.
>
> There are 3 orthogonal concepts here:  when, how and where?


Another concept is "how to determine if a blob is really omitted" - do 
we store a list somewhere or do we assume that all missing blobs are 
purposely omitted (like in this patch set)?


Yet another concept is "whether to fetch" - for example, a checkout 
should almost certainly fetch, but a rev-list used by a connectivity 
check (like in patch 6 of this set) should not.


For example, for historical-blob-searching commands like "git log -S", 
should we:

 a) fetch everything missing (so users should use date-limiting
arguments)
 b) fetch nothing missing
 c) use the file size to automatically exclude big files, but fetch
everything else

For a) and b), we wouldn't need file size information for missing blobs, 
but for c), we do. This might determine if we need file size information 
in the fetch-pack/upload-pack protocol.
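
To make option c) concrete, a sketch of the decision rule (Python; the
threshold and names are invented for illustration):

    def should_fetch(policy, blob_size, big_file_limit=1 << 20):
        # policy 'a': fetch everything missing; 'b': fetch nothing;
        # 'c': fetch unless the recorded size marks the blob as big.
        if policy == 'a':
            return True
        if policy == 'b':
            return False
        # option c) is the one that needs per-blob size information
        return blob_size is not None and blob_size <= big_file_limit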



> C. New Blob-Fetch Protocol (2a)
> ===============================
>
> *TODO* A new pair of commands, such as fetch-blob-pack and upload-blob-pack,
> will be created to let the client request a batch of blobs and receive a
> packfile.  A protocol similar to the fetch-pack/upload-pack will be spoken
> between them.  (This avoids complicating the existing protocol and the work
> of enumerating the refs.)  Upload-blob-pack will use pack-objects to build
> the packfile.
>
> It is also more efficient than requesting a single blob at a time using
> the existing fetch-pack/upload-pack mechanism (with the various allow
> unreachable options).
>
> *TODO* The new request protocol will be defined in the patch series.
> It will include: a list of the desired blob SHAs.  Possibly also the commit
> SHA, branch name, and pathname of each blob (or whatever is necessary to let
> the server address the reachability concerns).  Possibly also the last
> known SHA for each blob to allow for deltafication in the packfile.


Context (like the commit SHA-1) would help in reachability checks, but 
I'm not sure if we can rely on that. It is true that I can't think of a 
way that the client would dissociate a blob that is missing from its 
tree or commit (because it would first need to "fault-in" that blob to 
do its operation). But clients operating on non-contextual SHA-1s (e.g. 
"git cat-file") and servers manipulating commits (so that the commit 
SHA-1 that the client had in its context is no longer reachable) are not 
uncommon, I think.


Having said that, it might be useful to include the context in the 
protocol anyway as an optional "hint".


I'm not sure what you mean by "last known SHA for each blob".

(If we do store the file size of a blob somewhere, we could also store 
some context there. I'm not sure how useful this is, though.)
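
To visualize such a request, a sketch using git's pkt-line framing
(Python; the framing is real -- four hex digits giving the length
including themselves, "0000" as flush -- but the "want-blob" and "hint"
verbs are invented for illustration):

    def pkt_line(s: str) -> bytes:
        data = s.encode()
        return b'%04x' % (len(data) + 4) + data

    def build_blob_request(blob_shas, commit_hint=None):
        lines = [pkt_line('want-blob %s\n' % sha) for sha in blob_shas]
        if commit_hint is not None:
            # the optional context "hint" discussed above
            lines.append(pkt_line('hint commit %s\n' % commit_hint))
        lines.append(b'0000')     # flush-pkt terminates the request
        return b''.join(lines)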



> E. Unresolved Thoughts
> ======================
>
> *TODO* The partial clone arguments should be recorded in ".git/info/"
> so that subsequent fetch commands can inherit them and rev-list/index-pack
> know to not complain by default.
>
> *TODO* Update GC like rev-list to not complain when there are missing blobs.


These 2 points would be part of "whether to fetch" above.