Re: [PATCH 00/10] RFC Partial Clone and Fetch
Jonathan Nieder writes:

> - there shouldn't be any need for the blobs to even be mentioned in
>   the .pack stored locally. The .idx file maps from sha1 to offset
>   within the packfile --- a special offset could mean "this is a
>   missing blob".

Clever.

> - However, the list of missing blobs can be inferred from the existing
>   pack format, without a change to the pack format used in git
>   protocol. As part of constructing the idx, "git index-pack"
>   inflates every object in the pack file sent by the server. This
>   means we know what blobs they reference, so we can easily produce a
>   list for the idx file without changing the pack file format.

A minor wrinkle to keep in mind if you were to go this route is that you'd need a way to tell why a blob that is referenced by a tree in the pack stream is not in the same pack stream.

If the resulting repository on the receiving side has that blob after the transfer, it is likely that the blob does not appear in the pack because the want/have/ack exchange told the sending side that the receiving side has a commit that contains the blob.

But when the blob does not exist on the receiving side after the transfer, we cannot distinguish between two possible cases. The server may have actively wanted to omit it (i.e. the case we are interested in in this discussion thread). Or the receiving end said that it has a commit that contains the blob, but due to preexisting corruption, the receiving repository was missing the blob in reality.
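The "special offset" idea quoted above could be modeled roughly like this toy sketch (the dict-based table, the sentinel value, and the helper name are all invented for illustration; this is not git's real .idx layout):

```python
# Toy model of an .idx-style table mapping object id -> pack offset.
# A reserved sentinel offset marks "blob intentionally omitted".
MISSING_OFFSET = 0xFFFFFFFFFFFFFFFF  # invented sentinel, not real git

def lookup(idx_table, oid):
    """Return ('present', offset), 'missing', or None.

    None means the object is not in this pack at all -- the ambiguous
    case the mail discusses (possibly a corrupt/incomplete repo)."""
    offset = idx_table.get(oid)
    if offset is None:
        return None
    if offset == MISSING_OFFSET:
        return 'missing'       # known blob, intentionally not stored
    return ('present', offset)

idx = {'aaaa': 1234, 'bbbb': MISSING_OFFSET}
```

The point of the sentinel is exactly the distinction discussed above: a `'missing'` result is a deliberate omission, while `None` leaves the corruption question open.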
Re: [PATCH 00/10] RFC Partial Clone and Fetch
Hi again,

Jeff Hostetler wrote:

> In my original RFC there were comments/complaints that with
> missing blobs we lose the ability to detect corruptions. My
> proposed changes to index-pack and rev-list (and suggestions
> for other commands like fsck) just disabled those errors.
> Personally, I'm OK with that, but I understand that others
> would like to save the ability to distinguish between missing
> and corrupted.

I'm also okay with it. In a partial clone, in the same way that a missing ref represents a different valid state and thus passes fsck regardless of how it happened, a missing blob is a valid state and it is sensible for it to pass fsck.

A person might object that previously a repository that passed "git fsck" was a repository where "git fast-export --all" would succeed, and if I omit a blob that is not present on the remote server then that invariant is gone. But that problem exists even if we have a list of missing blobs. The server could rewind history and garbage collect, causing attempts on the client to fetch a previously advertised missing blob to fail. Or the server can disappear completely, or it can lose all its data and have to be restored from an older backup that is missing newer blobs.

> Right, only the .pack is sent over the wire. And that's why I
> suggest listing the missing SHAs in it. I think we need the server
> to send a list of them -- whether in individual per-file type-5
> records as I suggested, or a single record containing a list of
> SHAs+data (which I think I prefer in hindsight).

A list of SHAs+data sounds sensible as a way of conveying additional per-blob information (such as size).

> WRT being able to discover the missing blobs, is that true in
> all cases? I don't think it is for thin-packs -- where the server
> only sends stuff you don't (supposedly) already have, right?

Generate the list of blobs referenced by trees in the pack, when you are inflating them in git index-pack. Omit any objects that you already know about.
The remainder is the set of missing blobs.

One thing this doesn't tell you is whether the same missing blob is available from multiple remotes. It associates each blob with a single remote, the first one it was discovered from.

> If instead, we have pack-object indicate that it *would have*
> sent this blob normally, we don't change any of the semantics
> of how pack files are assembled. This gives the client a
> definitive list of what's missing.

If there is additional information the server wants to convey about the missing blobs, then this makes sense to me --- it has to send it somewhere, and a separate list outside the pack seems like a good place to put it. When there is no additional information beyond "this is a blob I am omitting", there is nothing the wire protocol needs to convey. But you've convinced me that that's usually moot because the blob size is valuable information.

[...]

> The more I think about it, I'd like to NOT put the list in the .idx
> file. If we put it in a separate peer file next to the *.{pack,idx}
> we could decorate it with the name of the remote used in the fetch
> or clone.

I have no strong opinions about this in either direction. Since it only affects the local repository format and doesn't affect the protocol, we can experiment without too much fuss. :)

[...]

> I've always wondered if repack, fetch, and friends should build
> a meta-idx that merges all of the current .idx files, but that
> is a different conversation

Yes, we've been thinking about a one-index-for-many-packs facility to deal with the proliferation of packfiles with only one or a few large objects, without having to waste I/O copying them into a concatenated pack file.

Another thing we're looking into is incorporating something like Martin Fick's "git exproll" (http://public-inbox.org/git/1375756727-1275-1-git-send-email-artag...@gmail.com/) into "git gc --auto" to more aggressively keep the number of packs down.
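The inference described above -- collect the blobs referenced by trees while index-pack inflates the incoming pack, then subtract what is present -- amounts to a set difference. A rough sketch (all names invented; real index-pack would parse the inflated tree objects rather than receive precomputed reference lists):

```python
def infer_missing_blobs(tree_refs, in_pack, already_local):
    """tree_refs: mapping of tree id -> ids referenced by that tree,
    as discovered while inflating the incoming pack.
    Returns ids referenced by the pack's trees but present neither in
    the pack nor in the local object store: the inferred missing list."""
    referenced = set()
    for refs in tree_refs.values():
        referenced.update(refs)
    return referenced - set(in_pack) - set(already_local)

missing = infer_missing_blobs(
    {'tree1': ['blobA', 'blobB'], 'tree2': ['blobB', 'blobC']},
    in_pack={'tree1', 'tree2', 'blobA'},
    already_local={'blobC'},
)
```

Note that `already_local` is what makes the thin-pack question above tricky: anything the receiver is merely assumed to have lands in the "already known" set, whether or not it really exists on disk.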
> On 5/3/2017 2:27 PM, Jonathan Nieder wrote:
>> If we were starting over, would this belong in the tree object?
>> (Part of the reason I ask is that we have an opportunity to sort
>> of start over as part of the transition away from SHA-1.)
>
> Yes, putting the size in the tree would be nice. That does
> add 5-8 bytes to every entry in every tree (on top of the
> larger hash), but it would be useful.
>
> If we're going there, we might just define the new hash
> as the concatenation of the SHA* and the length of the data
> hashed. So instead of a 32-byte SHA256, we'd have a (32 + 8)
> byte value. (Or maybe a (32 + 5) if we want to squeeze it.)

Thanks --- that feedback helps. It doesn't stop us from having to figure something else out in the short term, of course.

[...]

>> I am worried about the implications of storing this kind of policy
>> information in the pack file. There may be useful information along
>> these lines for a server to advertise, but I don't think it belongs in
>> the pack
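The "(32 + 8)-byte value" floated in the quoted text could look something like this sketch (purely illustrative; git has not adopted such an object-id format):

```python
import hashlib

def object_id(data: bytes) -> bytes:
    """40-byte id: 32-byte SHA-256 digest + 64-bit big-endian length.

    With such an id, the blob's size can be read straight out of any
    tree entry that names it, without fetching the blob itself."""
    return hashlib.sha256(data).digest() + len(data).to_bytes(8, 'big')

oid = object_id(b'hello, world')
size_from_id = int.from_bytes(oid[32:], 'big')  # length recovered from the id
```

The "(32 + 5)" squeeze mentioned above would simply shrink the length field to a uint:40, matching the size field discussed elsewhere in the thread.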
Re: [PATCH 00/10] RFC Partial Clone and Fetch
On 5/3/2017 2:27 PM, Jonathan Nieder wrote:
> Hi,
>
> Jeff Hostetler wrote:
>> Missing-Blob Support
>>
>> Let me offer up an alternative idea for representing missing
>> blobs. This differs from both of our previous proposals. (I
>> don't have any code for this new proposal, I just want to think
>> out loud a bit and see if this is a direction worth pursuing --
>> or a complete non-starter.)
>>
>> Both proposals talk about detecting and adapting to a missing
>> blob and ways to recover -- when we fail to find a blob.
>> Comments on the thread asked about:
>> () being able to detect missing blobs vs corrupt repos
>> () being unable to detect duplicate blobs
>> () expense of blob search.
>>
>> Suppose we store "positive" information about missing blobs?
>> This would let us know that a blob is intentionally missing and
>> possibly some meta-data about it.
>
> We've discussed this a little informally but didn't go more into it,
> so I'm glad you brought it up.
>
> There are two use cases I care about. I'll want names to refer to
> them later, so describing them now:
>
> A. A p4 or svn style "monorepo" containing all code within a
>    company. We want to make git scale well when working with such a
>    repository. Disk usage, network usage, index size, and object
>    lookup time are all issues for such a repository.
>
>    A repository can creep up in size so it starts falling into this
>    category even though git coped well with it before. Another way
>    to end up in this category is a conversion from a version control
>    system like p4 or svn.
>
> B. A more modestly sized repository with some large blobs in it. At
>    clone time we want to omit unneeded large blobs to speed up the
>    clone, save disk space, and save bandwidth. For this kind of
>    repository, the idx file already contained all those blobs and
>    that was not causing problems --- the only problem was the actual
>    blob size.

Yes, I've been primarily concerned with "case A" repos. I work with the team converting the Windows source repo to git. This was discussed in Brussels as part of the GVFS presentation.
The Windows tree has 3.5M files in the worktree for a simple checkout of HEAD. The index is 450MB. There are 500K trees/folders in the commit. Multiply that by a scale factor considering the number of trunk/release branches, number of developers, typical number of commits per day per developer, and n years (decades) of history, and we get to a very large number.

FWIW, there's also a "case C" which has both, but that just hurts to think about.

>> 1. Suppose we update the .pack file format slightly.
[...]
>> 2. Make a similar change in the .idx format and git-index-pack
>>    to include them there. Then blob lookup operations could
>>    definitively determine that a blob exists and is just not
>>    present locally.
>
> Some nits:
>
> - there shouldn't be any need for the blobs to even be mentioned in
>   the .pack stored locally. The .idx file maps from sha1 to offset
>   within the packfile --- a special offset could mean "this is a
>   missing blob".
>
> - Git protocol works by sending pack files over the wire. The idx
>   file is not transmitted to the client --- the client instead
>   reconstructs it from the pack file. I assume this is why you
>   stored the SHA-1 of the object in the pack file, but it could
>   equally well be sent through another stream (since this proposal
>   involves a change to git protocol anyway).
>
> - However, the list of missing blobs can be inferred from the
>   existing pack format, without a change to the pack format used in
>   git protocol. As part of constructing the idx, "git index-pack"
>   inflates every object in the pack file sent by the server. This
>   means we know what blobs they reference, so we can easily produce
>   a list for the idx file without changing the pack file format.

In my original RFC there were comments/complaints that with missing blobs we lose the ability to detect corruptions. My proposed changes to index-pack and rev-list (and suggestions for other commands like fsck) just disabled those errors.
Personally, I'm OK with that, but I understand that others would like to save the ability to distinguish between missing and corrupted.

Right, only the .pack is sent over the wire. And that's why I suggest listing the missing SHAs in it. I think we need the server to send a list of them -- whether in individual per-file type-5 records as I suggested, or a single record containing a list of SHAs+data (which I think I prefer in hindsight).

WRT being able to discover the missing blobs, is that true in all cases? I don't think it is for thin-packs -- where the server only sends stuff you don't (supposedly) already have, right?

If instead, we have pack-object indicate that it *would have* sent this blob normally, we don't change any of the semantics of how pack files are assembled. This gives the client a definitive list of what's missing.

If this is done by only changing the idx file format and not the pack file, then it does not
Re: [PATCH 00/10] RFC Partial Clone and Fetch
Hi,

Jonathan Tan wrote:
> The binary search to lookup a packfile offset from a .idx file
> (which involves disk reads) would take longer for all lookups (not
> just lookups for missing blobs) - I think I prefer keeping the lists
> separate, to avoid pessimizing the (likely) usual case where the
> relevant blobs are all already in local repo storage.

Another relevant operation is looking up objects by offset or index_nr. The current implementation involves building an in-memory reverse index on demand by reading the idx file and sorting it by offset --- see pack-revindex.c::create_pack_revindex. This takes O(n log n) time where n is the size of the idx file.

That said, it could be avoided by storing an on-disk reverse index with the pack. That's something we've been wanting to do anyway.

Thanks,
Jonathan
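The on-demand reverse index described above amounts to sorting the idx entries by offset and binary-searching the result. A rough Python stand-in for the C logic in pack-revindex.c (function names invented):

```python
import bisect

def create_revindex(idx_entries):
    """idx_entries: list of (object id, pack offset) pairs, as an .idx
    stores them (sorted by id). Returns (offset, id) pairs sorted by
    offset -- the O(n log n) step the mail refers to."""
    return sorted((off, oid) for oid, off in idx_entries)

def object_at_offset(revindex, offset):
    """Binary-search the reverse index for an exact offset match."""
    i = bisect.bisect_left(revindex, (offset,))
    if i < len(revindex) and revindex[i][0] == offset:
        return revindex[i][1]
    return None

rev = create_revindex([('c', 300), ('a', 12), ('b', 150)])
```

An on-disk reverse index, as suggested above, would simply persist the sorted array so this rebuild is skipped on every process start.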
Re: [PATCH 00/10] RFC Partial Clone and Fetch
On 05/03/2017 09:38 AM, Jeff Hostetler wrote:
> On 3/8/2017 1:50 PM, g...@jeffhostetler.com wrote:
>> From: Jeff Hostetler
>>
>> [RFC] Partial Clone and Fetch
>> =
>> [...]
>> E. Unresolved Thoughts
>> ==
>>
>> *TODO* The server should optionally return (in a side-band?) a list
>> of the blobs that it omitted from the packfile (and possibly the
>> sizes or sha1_object_info() data for them) during the
>> fetch-pack/upload-pack operation. This would allow the client to
>> distinguish between invalid SHAs and missing ones. Size information
>> would allow the client to maybe choose between various servers.
>
> Since I first posted this, Jonathan Tan has started a related
> discussion on missing blob support.
> https://public-inbox.org/git/cagf8dgk05+f4ux-8+imfvqd0n2jp6yxj18ag8udaeh6qc6s...@mail.gmail.com/T/
>
> I want to respond to both of these threads here.

Thanks for your input. I see that you have explained both "storing 'positive' information about missing blobs" and "what to store with that positive information"; I'll just comment on the former for now.

> Missing-Blob Support
>
> Let me offer up an alternative idea for representing missing blobs.
> This differs from both of our previous proposals. (I don't have any
> code for this new proposal, I just want to think out loud a bit and
> see if this is a direction worth pursuing -- or a complete
> non-starter.)
>
> Both proposals talk about detecting and adapting to a missing blob
> and ways to recover -- when we fail to find a blob. Comments on the
> thread asked about:
> () being able to detect missing blobs vs corrupt repos
> () being unable to detect duplicate blobs
> () expense of blob search.
>
> Suppose we store "positive" information about missing blobs? This
> would let us know that a blob is intentionally missing and possibly
> some meta-data about it.

I thought about this (see "Some alternative designs" in [1]), listing some similar benefits, but concluded that "it is difficult to scale to large repos".
Firstly, to be clear, by large repos I meant (and mean) the svn-style "monorepos" that Jonathan Nieder mentions as use case "A" [2]. My concern is that such lists (whether in separate file(s) or in .idx files) would be too unwieldy to manipulate. Even if we design things to avoid modifying such lists (for example, by adding a new list whenever we fetch instead of trying to modify an existing one), we would at least need to sort their contents (for example, when generating an .idx in the first place). For a repo with 10M-100M blobs [3], this might be doable on today's computers, but I would be concerned if a repo would exceed such numbers.

[1] <20170426221346.25337-1-jonathanta...@google.com>
[2] <20170503182725.gc28...@aiede.svl.corp.google.com>
[3] In Microsoft's announcement of Git Virtual File System [4], they mentioned "over 3.5 million files" in the Windows codebase. I'm not sure if this refers to files in a snapshot (that is, working copy) or all historical versions.
[4] https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-gvfs-git-virtual-file-system/

> 1. Suppose we update the .pack file format slightly.
>    () We use the 5 value in "enum object_type" to mean a
>       "missing-blob".
>    () We update git-pack-object as I did in my RFC, but have it
>       create type 5 entries for the blobs that are omitted, rather
>       than nothing.
>    () Hopefully, the same logic that currently keeps pack-object
>       from sending unnecessary blobs on subsequent fetches can also
>       be used to keep it from sending unnecessary missing-blob
>       entries.
>    () The type 5 missing-blob entry would contain the SHA-1 of the
>       blob and some meta-data to be explained later.

My original idea was to have sorted list(s) of hashes in separate file(s), much like the currently existing shallow file; it would have the semantics of "a hash here might be present or absent; if it is absent, use the hook".
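A sorted-list-of-hashes file (shallow-file-like) would be probed by binary search; a minimal sketch with invented names:

```python
import bisect

def is_listed_missing(sorted_hashes, oid):
    """Membership test against a sorted flat list of omitted-blob
    hashes. O(log n) per lookup; as noted above, the painful part is
    not the probe but producing and re-sorting such a list when the
    repo has tens of millions of blobs."""
    i = bisect.bisect_left(sorted_hashes, oid)
    return i < len(sorted_hashes) and sorted_hashes[i] == oid

omitted = sorted(['1f00', '9abc', 'e777'])
```

With fixed-width binary hashes on disk, the same `bisect` idea works directly over the file via seek-and-compare, with no need to load the whole list into memory.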
(Initially I thought that one list would be sufficient, but after reading your idea and considering it some more, multiple lists might be better.) Your idea of storing them in an .idx (and possibly corresponding .pack file) is similar, I think, although mine is probably simpler - at least, we wouldn't need a new object_type. As described above, I don't think this list-of-hashes idea will work (because of the large numbers of blobs involved), but I'll compare it to yours anyway, just in case we end up being convinced that this general idea works.

> 2. Make a similar change in the .idx format and git-index-pack to
>    include them there. Then blob lookup operations could
>    definitively determine that a blob exists and is just not present
>    locally.
> 3. With this, packfile-based blob-lookup operations can get a
>    "missing-blob" result.
>    () It should be possible to short-cut searching in other
>       packfiles (because we don't have
Re: [PATCH 00/10] RFC Partial Clone and Fetch
Hi,

Jeff Hostetler wrote:
> On 3/8/2017 1:50 PM, g...@jeffhostetler.com wrote:
>> [RFC] Partial Clone and Fetch
>> =
>> [...]
>> E. Unresolved Thoughts
>> ==
>>
>> *TODO* The server should optionally return (in a side-band?) a list
>> of the blobs that it omitted from the packfile (and possibly the sizes
>> or sha1_object_info() data for them) during the fetch-pack/upload-pack
>> operation. This would allow the client to distinguish between invalid
>> SHAs and missing ones. Size information would allow the client to
>> maybe choose between various servers.
>
> Since I first posted this, Jonathan Tan has started a related
> discussion on missing blob support.
> https://public-inbox.org/git/cagf8dgk05+f4ux-8+imfvqd0n2jp6yxj18ag8udaeh6qc6s...@mail.gmail.com/T/
>
> I want to respond to both of these threads here.

Thanks much for this.

> Missing-Blob Support
>
> Let me offer up an alternative idea for representing
> missing blobs. This differs from both of our previous
> proposals. (I don't have any code for this new proposal,
> I just want to think out loud a bit and see if this is a
> direction worth pursuing -- or a complete non-starter.)
>
> Both proposals talk about detecting and adapting to a missing
> blob and ways to recover -- when we fail to find a blob.
> Comments on the thread asked about:
> () being able to detect missing blobs vs corrupt repos
> () being unable to detect duplicate blobs
> () expense of blob search.
>
> Suppose we store "positive" information about missing blobs?
> This would let us know that a blob is intentionally missing
> and possibly some meta-data about it.

We've discussed this a little informally but didn't go more into it, so I'm glad you brought it up.

There are two use cases I care about. I'll want names to refer to them later, so describing them now:

A. A p4 or svn style "monorepo" containing all code within a company. We want to make git scale well when working with such a repository.
   Disk usage, network usage, index size, and object lookup time are
   all issues for such a repository.

   A repository can creep up in size so it starts falling into this
   category even though git coped well with it before. Another way to
   end up in this category is a conversion from a version control
   system like p4 or svn.

B. A more modestly sized repository with some large blobs in it. At
   clone time we want to omit unneeded large blobs to speed up the
   clone, save disk space, and save bandwidth. For this kind of
   repository, the idx file already contained all those blobs and that
   was not causing problems --- the only problem was the actual blob
   size.

> 1. Suppose we update the .pack file format slightly.
[...]
> 2. Make a similar change in the .idx format and git-index-pack
>    to include them there. Then blob lookup operations could
>    definitively determine that a blob exists and is just not
>    present locally.

Some nits:

- there shouldn't be any need for the blobs to even be mentioned in the .pack stored locally. The .idx file maps from sha1 to offset within the packfile --- a special offset could mean "this is a missing blob".

- Git protocol works by sending pack files over the wire. The idx file is not transmitted to the client --- the client instead reconstructs it from the pack file. I assume this is why you stored the SHA-1 of the object in the pack file, but it could equally well be sent through another stream (since this proposal involves a change to git protocol anyway).

- However, the list of missing blobs can be inferred from the existing pack format, without a change to the pack format used in git protocol. As part of constructing the idx, "git index-pack" inflates every object in the pack file sent by the server. This means we know what blobs they reference, so we can easily produce a list for the idx file without changing the pack file format.

If this is done by only changing the idx file format and not the pack file, then it does not affect the protocol.
That is good for experimentation --- it lets us try out different formats client-side without having to coordinate with server authors.

In case (A), this proposal means we get back some of the per-object overhead that we were trying to avoid. I would like to avoid that if possible. In case (B), this proposal doesn't hurt.

One problem with proposals so far has been how to handle repositories with multiple remotes. Having a local list of missing blobs is convenient because it can be associated with the remote --- when a blob is referenced later, we know which remote to ask for it. This is a niche feature, but it's a nice bonus.

[...]

> 3. With this, packfile-based blob-lookup operations can get a
>    "missing-blob" result.
>    () It should be possible to short-cut searching in other
>       packfiles (because we don't have to assume that the
Re: [PATCH 00/10] RFC Partial Clone and Fetch
On 3/8/2017 1:50 PM, g...@jeffhostetler.com wrote:
> From: Jeff Hostetler
>
> [RFC] Partial Clone and Fetch
> =
> [...]
> E. Unresolved Thoughts
> ==
>
> *TODO* The server should optionally return (in a side-band?) a list
> of the blobs that it omitted from the packfile (and possibly the
> sizes or sha1_object_info() data for them) during the
> fetch-pack/upload-pack operation. This would allow the client to
> distinguish between invalid SHAs and missing ones. Size information
> would allow the client to maybe choose between various servers.

Since I first posted this, Jonathan Tan has started a related discussion on missing blob support.
https://public-inbox.org/git/cagf8dgk05+f4ux-8+imfvqd0n2jp6yxj18ag8udaeh6qc6s...@mail.gmail.com/T/

I want to respond to both of these threads here.

Missing-Blob Support

Let me offer up an alternative idea for representing missing blobs. This differs from both of our previous proposals. (I don't have any code for this new proposal, I just want to think out loud a bit and see if this is a direction worth pursuing -- or a complete non-starter.)

Both proposals talk about detecting and adapting to a missing blob and ways to recover -- when we fail to find a blob. Comments on the thread asked about:
() being able to detect missing blobs vs corrupt repos
() being unable to detect duplicate blobs
() expense of blob search.

Suppose we store "positive" information about missing blobs? This would let us know that a blob is intentionally missing and possibly some meta-data about it.

1. Suppose we update the .pack file format slightly.
   () We use the 5 value in "enum object_type" to mean a
      "missing-blob".
   () We update git-pack-object as I did in my RFC, but have it create
      type 5 entries for the blobs that are omitted, rather than
      nothing.
   () Hopefully, the same logic that currently keeps pack-object from
      sending unnecessary blobs on subsequent fetches can also be used
      to keep it from sending unnecessary missing-blob entries.
   () The type 5 missing-blob entry would contain the SHA-1 of the
      blob and some meta-data to be explained later.

2. Make a similar change in the .idx format and git-index-pack to
   include them there. Then blob lookup operations could definitively
   determine that a blob exists and is just not present locally.

3. With this, packfile-based blob-lookup operations can get a
   "missing-blob" result.
   () It should be possible to short-cut searching in other packfiles
      (because we don't have to assume that the blob was just
      misplaced in another packfile).
   () Lookup can still look for the corresponding loose blob (in case
      a previous lookup already "faulted it in").

4. We can then think about dynamically fetching it.
   () Several techniques for this are currently being discussed on the
      mailing list in other threads, so I won't go into this here.
   () There has also been debate about whether this should yield a
      loose blob or a new packfile. I think both forms have merit and
      depend on whether we are limited to asking for a single blob or
      can make a batch request.
   () A dynamically-fetched loose blob is placed in the normal loose
      blob directory hierarchy so that subsequent lookups can find it
      as mentioned above.
   () A dynamically-fetched packfile (with one or more blobs) is
      written to the ODB and then the lookup operation completes.
      {} I want to isolate these packfiles from the main packfiles, so
         that they behave like a second-stage lookup and don't affect
         the caching/LRU nature of the existing first-stage packfile
         lookup.
      {} I also don't want the ambiguity of having 2 primary packfiles
         with a blob marked as missing in 1 and present in the other.

5. git-repack should be updated to "do the right thing" and squash
   missing-blob entries.

6. And etc.

Missing-Blob Entry Data
===

A missing-blob entry needs to contain the SHA-1 value of the blob (obviously). Other fields are nice to have, but are not necessary. Here are a few fields to consider.

A. The SHA-1 (20 bytes)

B.
The raw size of the blob (5? bytes).
   () This is the cleaned size of the file as stored. The server does
      not (and should not) have any knowledge of the smudging that may
      happen.
   () This may be useful if whatever dynamic-fetch-hook wants to
      customize its behavior, such as individually fetching large
      blobs and batch-fetching smaller ones from the same server.
   () GVFS found it necessary to create a custom server end-point to
      get blob size data so that "ls -l" could show file sizes for
      non-present virtualized files.
   () 5 bytes (uint:40) should be more than enough for this.

C. A server "hint" (20
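Fields (A) and (B) above -- a 20-byte SHA-1 plus a 5-byte uint:40 size -- would give a fixed 25-byte record per missing blob. A toy serialization sketch (function names invented):

```python
def encode_entry(sha1: bytes, size: int) -> bytes:
    """20-byte SHA-1 followed by a 5-byte big-endian size (uint:40)."""
    assert len(sha1) == 20 and 0 <= size < 1 << 40
    return sha1 + size.to_bytes(5, 'big')

def decode_entry(entry: bytes):
    """Inverse of encode_entry: split the 25-byte record back apart."""
    return entry[:20], int.from_bytes(entry[20:25], 'big')

entry = encode_entry(bytes(20), 123456)
```

A uint:40 caps the representable size at 2^40 - 1 bytes (a terabyte), which matches the "should be more than enough" remark above.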
Re: [PATCH 00/10] RFC Partial Clone and Fetch
On 3/22/2017 12:21 PM, Johannes Schindelin wrote:
> Hi Kostis,
>
> On Wed, 22 Mar 2017, ankostis wrote:
>> On 8 March 2017 at 19:50,wrote:
>>> From: Jeff Hostetler
>>>
>>> [RFC] Partial Clone and Fetch
>>> =
>>>
>>> This is a WIP RFC for a partial clone and fetch feature wherein
>>> the client can request that the server omit various blobs from the
>>> packfile during clone and fetch. Clients can later request omitted
>>> blobs (either from a modified upload-pack-like request to the
>>> server or via a completely independent mechanism).
>>
>> Is it foreseen for the server to *decide* which partial objects to
>> serve, and the cloning client to still work OK?
>
> The foreseeable use case will be to speed up clones of insanely
> large repositories by omitting blobs that are not immediately
> required, and let the client fetch them later on demand. That is
> all, no additional permission model or anything.
>
> In fact, we do not even need to ensure that blobs are reachable in
> our use case, as only trusted parties are allowed to access the
> server to begin with. That does not mean, of course, that there
> should not be an option to limit access to objects that are
> reachable.
>
>> My case in mind is storing confidential files in Git (server) that
>> I want to publicize to partial-cloning clients, for
>> non-repudiation, by sending out trees and commits alone (or any
>> non-sensitive blobs).
>>
>> A possible UI would be to rely on a `.gitattributes` to specify
>> which objects are to be withheld.
>>
>> Apologies if I'm intruding with an unrelated feature request.
>
> I think this is a valid use case, and Jeff's design certainly does
> not prevent future patches to that end. However, given that Jeff's
> use case does not require any such feature, I would expect the
> people who want those features to do the heavy lifting on top of his
> work. It is too different from the intended use case to reasonably
> ask of Jeff.

As Johannes said, all I'm proposing is a way to limit the amount of data the client receives to help git scale to extremely large repositories.
For example, I probably don't need 20 years of history or the entire source tree if I'm only working in a narrow subset of the tree. I'm not sure how you would achieve the confidential-file scenario that you describe, but you might try building on this and see if you can make it work.

Jeff
Re: [PATCH 00/10] RFC Partial Clone and Fetch
Hi Kostis,

On Wed, 22 Mar 2017, ankostis wrote:
> On 8 March 2017 at 19:50,wrote:
>> From: Jeff Hostetler
>>
>> [RFC] Partial Clone and Fetch
>> =
>>
>> This is a WIP RFC for a partial clone and fetch feature wherein the
>> client can request that the server omit various blobs from the
>> packfile during clone and fetch. Clients can later request omitted
>> blobs (either from a modified upload-pack-like request to the server
>> or via a completely independent mechanism).
>
> Is it foreseen for the server to *decide* which partial objects to
> serve, and the cloning client to still work OK?

The foreseeable use case will be to speed up clones of insanely large repositories by omitting blobs that are not immediately required, and let the client fetch them later on demand. That is all, no additional permission model or anything.

In fact, we do not even need to ensure that blobs are reachable in our use case, as only trusted parties are allowed to access the server to begin with. That does not mean, of course, that there should not be an option to limit access to objects that are reachable.

> My case in mind is storing confidential files in Git (server)
> that I want to publicize to partial-cloning clients,
> for non-repudiation, by sending out trees and commits alone
> (or any non-sensitive blobs).
>
> A possible UI would be to rely on a `.gitattributes` to specify
> which objects are to be withheld.
>
> Apologies if I'm intruding with an unrelated feature request.

I think this is a valid use case, and Jeff's design certainly does not prevent future patches to that end. However, given that Jeff's use case does not require any such feature, I would expect the people who want those features to do the heavy lifting on top of his work. It is too different from the intended use case to reasonably ask of Jeff.

Ciao,
Johannes
Re: [PATCH 00/10] RFC Partial Clone and Fetch
Dear Jeff,

I read most of the valuable references you provided, but could not find something along the lines described inline below.

On 8 March 2017 at 19:50,wrote:
> From: Jeff Hostetler
>
> [RFC] Partial Clone and Fetch
> =
>
> This is a WIP RFC for a partial clone and fetch feature wherein the client
> can request that the server omit various blobs from the packfile during
> clone and fetch. Clients can later request omitted blobs (either from a
> modified upload-pack-like request to the server or via a completely
> independent mechanism).

Is it foreseen for the server to *decide* which partial objects to serve, and the cloning client to still work OK?

My case in mind is storing confidential files in Git (server) that I want to publicize to partial-cloning clients, for non-repudiation, by sending out trees and commits alone (or any non-sensitive blobs).

A possible UI would be to rely on a `.gitattributes` to specify which objects are to be withheld.

Apologies if I'm intruding with an unrelated feature request.

Kostis
Re: [PATCH 00/10] RFC Partial Clone and Fetch
On 3/16/2017 5:43 PM, Jeff Hostetler wrote:
> On 3/9/2017 3:18 PM, Jonathan Tan wrote:
>> Overall, this fetch/clone approach seems reasonable to me, except
>> perhaps some unanswered questions (some of which are also being
>> discussed elsewhere):
>> - does the server need to tell us of missing blobs?
>> - if yes, does the server need to tell us their file sizes?
>
> File sizes are a nice addition. For example, with a virtual file
> system, an "ls -l" can lie and tell you the sizes of the
> yet-to-be-populated files.

Never mind the "ls -l" case; I forgot about the need for the client to display the size of the (possibly) smudged file, rather than the actual blob size.
Re: [PATCH 00/10] RFC Partial Clone and Fetch
On 3/9/2017 3:18 PM, Jonathan Tan wrote:
> Overall, this fetch/clone approach seems reasonable to me, except
> perhaps some unanswered questions (some of which are also being
> discussed elsewhere):
>
> - does the server need to tell us of missing blobs?
> - if yes, does the server need to tell us their file sizes?

File sizes are a nice addition. For example, with a virtual file
system, a "ls -l" can lie and tell you the sizes of the
yet-to-be-populated files. Or if the client wants to distinguish
between going back to the original remote or going to S3 for the
blob, it could use the size to choose. (I'm not saying we actually
build that yet, but others on the mailing list have spoken about
parking large blobs in S3.) So, not necessary, but might be nice to
have.

> - do we need to store the list of missing blobs somewhere (whether
>   the server told it to us or whether we inferred it from the
>   fetched trees)

We should be able to infer the list of missing blobs; I hadn't
considered that. However, by doing so we will need to disable some of
the integrity checking (as I had to do with the "--allow-partial"
option), and some concerns about that were discussed earlier in the
thread. But if we do that inference during clone/fetch and record it
somewhere, we could get back the integrity checking.

> The answers to this probably depend on the answers in "B. Issues
> Backfilling Omitted Blobs" (especially the additional concepts I
> listed below).
>
> Also, do you have any plans to implement other functionality, e.g.
> "git checkout" (which will allow fetches and clones to repositories
> with a working directory)? (I don't know what the mailing list
> consensus would be for the "acceptance criteria" for this patch set,
> but I would at least include "checkout".)

Yes, supporting "checkout" is essential. Commands like "merge",
"diff", etc. will come later. In Ben's RFC, he has been investigating
demand-loading blobs in read_object().
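The inference step mentioned here can be sketched in miniature: while
index-pack inflates the trees in a pack, collect every object they
reference and subtract what is already present locally. This toy model
uses (mode, sha) tuples rather than git's binary tree-entry format,
and all names are illustrative:

```python
# Toy model of inferring the missing-blob list while indexing a pack.
# Real index-pack parses binary tree entries; here a tree is a list of
# (mode, sha) tuples. Mode "40000" marks a subtree and "160000" a
# gitlink (a commit in another repository), so neither counts as a
# potentially missing blob.
def infer_missing_blobs(trees, have):
    """trees: tree entry lists from the pack; have: SHAs present locally."""
    referenced = set()
    for entries in trees:
        for mode, sha in entries:
            if mode not in ("40000", "160000"):
                referenced.add(sha)
    return referenced - set(have)
```

Recording a set like this at clone/fetch time is what would let fsck
and friends distinguish "omitted on purpose" from "corrupt".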
I've been focusing on pre-fetching the missing blobs for a particular
command. I need to make more progress on this topic.

> On 03/08/2017 10:50 AM, g...@jeffhostetler.com wrote:
>> B. Issues Backfilling Omitted Blobs
>> ===================================
>>
>> Ideally, if the client only does "--partial-by-profile" fetches, it
>> should not need to fetch individual missing blobs, but we have to
>> allow for it to handle the other commands and other unexpected
>> issues. There are 3 orthogonal concepts here: when, how and where?
>
> Another concept is "how to determine if a blob is really omitted" -
> do we store a list somewhere or do we assume that all missing blobs
> are purposely omitted (like in this patch set)?
>
> Yet another concept is "whether to fetch" - for example, a checkout
> should almost certainly fetch, but a rev-list used by a connectivity
> check (like in patch 6 of this set) should not.
>
> For example, for historical-blob-searching commands like
> "git log -S", should we:
>
>  a) fetch everything missing (so users should use date-limiting
>     arguments)
>  b) fetch nothing missing
>  c) use the file size to automatically exclude big files, but fetch
>     everything else
>
> For a) and b), we wouldn't need file size information for missing
> blobs, but for c), we do. This might determine if we need file size
> information in the fetch-pack/upload-pack protocol.

Good points.

>> C. New Blob-Fetch Protocol (2a)
>> ===============================
>>
>> *TODO* A new pair of commands, such as fetch-blob-pack and
>> upload-blob-pack, will be created to let the client request a batch
>> of blobs and receive a packfile. A protocol similar to
>> fetch-pack/upload-pack will be spoken between them. (This avoids
>> complicating the existing protocol and the work of enumerating the
>> refs.) Upload-blob-pack will use pack-objects to build the packfile.
>>
>> It is also more efficient than requesting a single blob at a time
>> using the existing fetch-pack/upload-pack mechanism (with the
>> various allow-unreachable options).
>>
>> *TODO* The new request protocol will be defined in the patch
>> series. It will include: a list of the desired blob SHAs. Possibly
>> also the commit SHA, branch name, and pathname of each blob (or
>> whatever is necessary to let the server address the reachability
>> concerns). Possibly also the last known SHA for each blob to allow
>> for deltafication in the packfile.
>
> Context (like the commit SHA-1) would help in reachability checks,
> but I'm not sure if we can rely on that. It is true that I can't
> think of a way that the client would dissociate a blob that is
> missing from its tree or commit (because it would first need to
> "fault in" that blob to do its operation). But clients operating on
> non-contextual SHA-1s (e.g. "git cat-file") and servers manipulating
> commits (so that the commit SHA-1 that the client had in its context
> is no longer reachable) are not uncommon, I think.
>
> Having said that, it might be useful to include the context in the
> protocol anyway as an optional "hint".

That is what I was thinking. A hint of the branch or
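To make the shape of such a request concrete, here is a sketch in
git's pkt-line framing (a 4-hex-digit length prefix that counts
itself, with a flush packet "0000" as terminator). The "want-blob"
and "hint" keywords are invented for illustration; they are not a
finalized protocol:

```python
# Sketch of a hypothetical fetch-blob-pack request in pkt-line framing.
# "want-blob" and "hint" are made-up keywords, not real git protocol.
def pkt_line(payload):
    data = payload.encode()
    return b"%04x" % (len(data) + 4) + data  # length prefix includes itself

def blob_fetch_request(blob_shas, hints=None):
    """One want-blob line per SHA, optional (commit, path) context
    hints for reachability checks, then a flush packet."""
    out = [pkt_line("want-blob %s\n" % sha) for sha in blob_shas]
    for sha, (commit, path) in (hints or {}).items():
        out.append(pkt_line("hint %s %s %s\n" % (sha, commit, path)))
    out.append(b"0000")  # flush-pkt ends the request
    return b"".join(out)
```

Treating the hints as optional, as suggested, means the server can use
them when present but must still answer a bare list of SHAs.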
Re: [PATCH 00/10] RFC Partial Clone and Fetch
Overall, this fetch/clone approach seems reasonable to me, except
perhaps some unanswered questions (some of which are also being
discussed elsewhere):

- does the server need to tell us of missing blobs?
- if yes, does the server need to tell us their file sizes?
- do we need to store the list of missing blobs somewhere (whether
  the server told it to us or whether we inferred it from the fetched
  trees)

The answers to this probably depend on the answers in "B. Issues
Backfilling Omitted Blobs" (especially the additional concepts I
listed below).

Also, do you have any plans to implement other functionality, e.g.
"git checkout" (which will allow fetches and clones to repositories
with a working directory)? (I don't know what the mailing list
consensus would be for the "acceptance criteria" for this patch set,
but I would at least include "checkout".)

On 03/08/2017 10:50 AM, g...@jeffhostetler.com wrote:
> B. Issues Backfilling Omitted Blobs
> ===================================
>
> Ideally, if the client only does "--partial-by-profile" fetches, it
> should not need to fetch individual missing blobs, but we have to
> allow for it to handle the other commands and other unexpected
> issues. There are 3 orthogonal concepts here: when, how and where?

Another concept is "how to determine if a blob is really omitted" - do
we store a list somewhere or do we assume that all missing blobs are
purposely omitted (like in this patch set)?

Yet another concept is "whether to fetch" - for example, a checkout
should almost certainly fetch, but a rev-list used by a connectivity
check (like in patch 6 of this set) should not.

For example, for historical-blob-searching commands like "git log -S",
should we:

 a) fetch everything missing (so users should use date-limiting
    arguments)
 b) fetch nothing missing
 c) use the file size to automatically exclude big files, but fetch
    everything else

For a) and b), we wouldn't need file size information for missing
blobs, but for c), we do. This might determine if we need file size
information in the fetch-pack/upload-pack protocol.

> C. New Blob-Fetch Protocol (2a)
> ===============================
>
> *TODO* A new pair of commands, such as fetch-blob-pack and
> upload-blob-pack, will be created to let the client request a batch
> of blobs and receive a packfile. A protocol similar to
> fetch-pack/upload-pack will be spoken between them. (This avoids
> complicating the existing protocol and the work of enumerating the
> refs.) Upload-blob-pack will use pack-objects to build the packfile.
>
> It is also more efficient than requesting a single blob at a time
> using the existing fetch-pack/upload-pack mechanism (with the
> various allow-unreachable options).
>
> *TODO* The new request protocol will be defined in the patch series.
> It will include: a list of the desired blob SHAs. Possibly also the
> commit SHA, branch name, and pathname of each blob (or whatever is
> necessary to let the server address the reachability concerns).
> Possibly also the last known SHA for each blob to allow for
> deltafication in the packfile.

Context (like the commit SHA-1) would help in reachability checks, but
I'm not sure if we can rely on that. It is true that I can't think of
a way that the client would dissociate a blob that is missing from its
tree or commit (because it would first need to "fault in" that blob to
do its operation). But clients operating on non-contextual SHA-1s
(e.g. "git cat-file") and servers manipulating commits (so that the
commit SHA-1 that the client had in its context is no longer
reachable) are not uncommon, I think.

Having said that, it might be useful to include the context in the
protocol anyway as an optional "hint".

I'm not sure what you mean by "last known SHA for each blob".

(If we do store the file size of a blob somewhere, we could also store
some context there. I'm not sure how useful this is, though.)

> E. Unresolved Thoughts
> ======================
>
> *TODO* The partial clone arguments should be recorded in ".git/info/"
> so that subsequent fetch commands can inherit them and
> rev-list/index-pack know to not complain by default.
>
> *TODO* Update GC, like rev-list, to not complain when there are
> missing blobs.

These 2 points would be part of "whether to fetch" above.
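The "whether to fetch" decision discussed in this thread (a checkout
must fault in the blob, a connectivity-check rev-list must not, and
option (c) uses a size cutoff for bulk commands like "git log -S")
could look roughly like the sketch below. The command names and the
10 MiB threshold are illustrative assumptions, not git behavior:

```python
# Illustrative policy for whether a command should fault in a missing
# blob. Command names and the size cutoff are assumptions.
SIZE_LIMIT = 10 * 1024 * 1024  # assumed cutoff for option (c)

def should_fetch(command, blob_size=None):
    if command == "checkout":
        return True   # the working tree needs the real content
    if command == "connectivity-check":
        return False  # an omitted blob is an expected, valid state
    if command == "log-S":
        # Option (c): skip blobs whose recorded size exceeds the
        # cutoff; fetch when the size is small or unknown.
        return blob_size is None or blob_size <= SIZE_LIMIT
    return False      # default: leave the blob omitted
```

Note that option (c) is the branch that needs per-blob size
information from the server, which is why the size question above
feeds back into the fetch-pack/upload-pack protocol design.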