Re: Partial clone design (with connectivity check for locally-created objects)
Hi,

Ben Peart wrote:

> We've discussed a couple of different possible solutions, each of which have different tradeoffs. Let me try to summarize here and perhaps suggest some other possibilities:

Thanks for this. Some comments below.

> Promised list
> -------------
> This provides an external data structure that allowed us to flag objects that came from a remote server (vs created locally).
>
> The biggest drawback is that this data structure can get very large and become difficult/expensive to generate/transfer/maintain.

Agreed. Using a single immutable file to maintain this data with lock-and-rename update means that I/O when updating it can be a bottleneck and that contention can be a problem.

> It also (at least in one proposal) required protocol and server side changes to support it.

I don't think that's a very big problem. This is the Git project: we control the protocol and the server. Partial clone requires changing the protocol and server already.

> Annotated via filename
> ----------------------
> This idea is to annotate the file names of objects that came from a remote server (pack files and loose objects) with a unique file extension (.remote) that indicates whether they are locally created or not.
>
> To make this work, git must understand both types of loose objects and pack files and search in both locations when looking for objects.

I don't understand the drawback you're describing here. To avoid a number of serious problems, Git already needs to be aware of partial clone. I don't think anyone has been proposing adding partial clone to upstream Git without a repository format extension (see Documentation/technical/repository-version.txt) to prevent older versions of Git from being confused about such repositories.

If you don't do this, some problems include:

 - confusing messages due to missing objects
 - errors over the wire protocol from trying to serve fetches and getting confused
 - "git gc" running and not knowing which objects are safe to be deleted

So the relevant issue cannot be that Git has to be changed at all: it would be that a change is excessively invasive. But it's not clear to me that the change you are describing is very invasive.

> Another drawback of this is that commands (repack, gc) that optimize loose objects and pack files must now be aware of the different extensions and handle both while not merging remote and non-remote objects.
>
> In short, we're creating separate object stores - one for locally created objects and one for everything else.

These also seem like non-issues. Some examples of problems I could imagine:

 - is writing multiple files when writing a loose object a problem for your setup?
 - is one of the operations described (repack, prune, fsck) too slow?

Do you foresee either of those being an issue?

> Now a couple of different ideas:
>
> Annotated via flags
> ===================
> The fundamental idea here is that we add the ability to flag locally created objects on the object itself.

Do you mean changing the underlying object format that produces an object's object id? Or do you mean changing the container format?

Changing the container format is exactly what was described in the previous example ("Annotated via filename"). There are other ways to change the container format: e.g. if writing multiple files when writing a loose object is a problem, we could add a field that does not affect the object id to the loose object format.

[...]

> Local list
> ----------
> Given the number of locally created objects is usually very small in comparison to the total number of objects (even just due to history), it makes more sense to track locally created objects instead of promised/remote objects.
>
> The biggest advantage of this over the "promised list" is that the "local list" being maintained is _significantly_ smaller (often orders of magnitude smaller).

[...]

> On the surface, this seems like the simplest solution that meets the stated requirements.

This has the same problems as the list of promised objects: excessive I/O and contention when updating the list.

Moreover, it doesn't bring one of the main benefits of the list of promised objects. Promised objects are not present in the local repository, so the list of promises provided a way to maintain some information about them (e.g., object size). Locally created objects are present in the local repository, so they don't need such metadata.

> Object DB
> ---------

If I understand correctly, this is pushing the issues described in the other cases into a hook and making them not upstream Git's problem. But it is still someone's problem. It just means upstream Git doesn't benefit from their solution to it. I don't see a need to give up in that way just yet.

I'm also available on #git-devel on freenode.net for real-time conversation. Logs are at http://bit.ly/aLzrmv. You can prepend a message with "[off]" to prevent
Re: Partial clone design (with connectivity check for locally-created objects)
On 8/7/2017 3:41 PM, Junio C Hamano wrote:
> Ben Peart writes:
>
>> My concern with this proposal is the combination of 1) writing a new pack file for every git command that ends up bringing down a missing object and 2) gc not compressing those pack files into a single pack file.
>
> Your noticing these is a sign that you read the outline of the design correctly, I think.
>
> The basic idea is that the local fsck should tolerate missing objects when they are known to be obtainable from that external service, but should still be able to diagnose missing objects that we do not know if the external service has, especially the ones that have been newly created locally and not yet made available to them by pushing them back.

This helps me a lot as now I think I understand the primary requirement we're trying to solve for. Let me rephrase it and see if this makes sense:

We need to be able to identify whether an object was created locally (and should pass more strict fsck/connectivity tests) or whether it came from a remote (and so any missing objects could presumably be fetched from the server). I agree it would be nice to solve this (and not just punt fsck - even if it is an opt-in behavior).

We've discussed a couple of different possible solutions, each of which have different tradeoffs. Let me try to summarize here and perhaps suggest some other possibilities:

Promised list
-------------
This provides an external data structure that allowed us to flag objects that came from a remote server (vs created locally).

The biggest drawback is that this data structure can get very large and become difficult/expensive to generate/transfer/maintain.

It also (at least in one proposal) required protocol and server side changes to support it.

Annotated via filename
----------------------
This idea is to annotate the file names of objects that came from a remote server (pack files and loose objects) with a unique file extension (.remote) that indicates whether they are locally created or not.

To make this work, git must understand both types of loose objects and pack files and search in both locations when looking for objects.

Another drawback of this is that commands (repack, gc) that optimize loose objects and pack files must now be aware of the different extensions and handle both while not merging remote and non-remote objects.

In short, we're creating separate object stores - one for locally created objects and one for everything else.

Now a couple of different ideas:

Annotated via flags
===================
The fundamental idea here is that we add the ability to flag locally created objects on the object itself. Given that, at the core, "Git is a simple key-value data store," can we take advantage of that fact and include a "locally created" bit as a property on every object?

I could not think of a good way to accomplish this as it is ultimately changing the object format, which creates rapidly expanding ripples of change. For example, the object header currently includes the type, a space, the length and a NUL. Even if we could add a "local" property (either by adding a 5th item, taking over the space, creating new object types, etc), the fact that the header is included in the sha1 means that push would become problematic, as flipping the bit would change the sha and the trees and commits that reference it.

Local list
----------
Given the number of locally created objects is usually very small in comparison to the total number of objects (even just due to history), it makes more sense to track locally created objects instead of promised/remote objects.

The biggest advantage of this over the "promised list" is that the "local list" being maintained is _significantly_ smaller (often orders of magnitude smaller). Another advantage over the "promised list" solution is that it doesn't require any server side or protocol changes.

On the client, when objects are created (write_loose_object?), the new objects are added to the "local list" and in the connectivity check (fsck), if the object is not in the "local list," the connectivity check can be skipped as any missing object can presumably be retrieved from the server.

A simple file format could be used (header + list of SHA1 values) and write_loose_object could do a trivial append. In fsck, the file could be loaded into a hashmap to make for fast existence checks. Entries could be removed from the "local list" for objects later fetched from a server (though I had a hard time contriving a scenario where this would happen, so I consider this optional).

On the surface, this seems like the simplest solution that meets the stated requirements.

Object DB
---------
This is a different way of providing separate object stores than the "Annotated via filename" proposal. It should be a cleaner/more elegant solution that enables several other capabilities but it is also more
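As a concrete illustration of the append-plus-hashmap scheme described above, here is a minimal Python sketch. The file name, magic header, and helper names are invented for illustration; a real implementation would hang off write_loose_object and fsck, in C.

```python
import hashlib
import os
import tempfile

MAGIC = b"LOCL\x00\x01"  # invented header: magic bytes plus a version

def append_local(path, sha1_hex):
    """Record a locally-created object with a trivial append, as
    write_loose_object could do."""
    is_new = not os.path.exists(path)
    with open(path, "ab") as f:
        if is_new:
            f.write(MAGIC)
        f.write(bytes.fromhex(sha1_hex))  # 20 raw bytes per entry

def load_local(path):
    """Load the list into a set for O(1) membership tests in fsck."""
    if not os.path.exists(path):
        return set()
    with open(path, "rb") as f:
        body = f.read()[len(MAGIC):]
    return {body[i:i + 20].hex() for i in range(0, len(body), 20)}

path = os.path.join(tempfile.mkdtemp(), "local-list")
a = hashlib.sha1(b"blob 5\x00hello").hexdigest()
b = hashlib.sha1(b"blob 5\x00world").hexdigest()
append_local(path, a)
append_local(path, b)
local = load_local(path)
# fsck logic: objects NOT in the list are presumed fetchable from the
# server, so their connectivity check can be skipped.
assert a in local and b in local
```

Lock-free appends keep the write path cheap; the cost moves to fsck, which pays one sequential read to build the hashmap.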
Re: Partial clone design (with connectivity check for locally-created objects)
On 8/7/2017 3:21 PM, Jonathan Nieder wrote:
> Hi,
>
> Ben Peart wrote:
>>> On Fri, 04 Aug 2017 15:51:08 -0700 Junio C Hamano wrote:
>>>> Jonathan Tan writes:
>>>>> "Imported" objects must be in a packfile that has a ".remote" file with arbitrary text (similar to the ".keep" file). They come from clones, fetches, and the object loader (see below).
>>>>> ...
>>>>> A "homegrown" object is valid if each object it references:
>>>>> 1. is a "homegrown" object,
>>>>> 2. is an "imported" object, or
>>>>> 3. is referenced by an "imported" object.
>>>>
>>>> Overall it captures what was discussed, and I think it is a good start.
>>
>> I missed the offline discussion and so am trying to piece together what this latest design is trying to do. Please let me know if I'm not understanding something correctly.
>
> I believe https://public-inbox.org/git/cover.1501532294.git.jonathanta...@google.com/ and the surrounding thread (especially https://public-inbox.org/git/xmqqefsudjqk@gitster.mtv.corp.google.com/) is the discussion Junio is referring to.
>
> [...]
>> This segmentation is what is driving the need for the object loader to build a new local pack file for every command that has to fetch a missing object. For example, we can't just write a tree object from a "partial" clone into the loose object store as we have no way for fsck to treat them differently and ignore any missing objects referenced by that tree object.
>
> That's related and how it got lumped into this proposal, but it's not the only motivation. Other aspects:
>
>  1. using pack files instead of loose objects means we can use deltas. This is the primary motivation.
>  2. pack files can use reachability bitmaps (I realize there are obstacles to getting benefit out of this because git's bitmap format currently requires a pack to be self-contained, but I thought it was worth mentioning for completeness).
>  3. existing git servers are oriented around pack files; they can more cheaply serve objects from pack files in pack format, including reusing deltas from them.
>  4. file systems cope better with a few large files than with many small files
>
> [...]
>> We all know that git doesn't scale well with a lot of pack files as it has to do a linear search through all the pack files when attempting to find an object. I can see that very quickly, there would be a lot of pack files generated and with gc ignoring "partial" pack files, this would never get corrected.
>
> Yes, that's an important point. Regardless of this proposal, we need to get more aggressive about concatenating pack files (e.g. by implementing exponential rollup in "git gc --auto").
>
>> In our usage scenarios, _all_ of the objects come from "partial" clones so all of our objects would end up in a series of "partial" pack files and would have pretty poor performance as a result.
>
> Can you say more about this? Why would the pack files (or loose objects, for that matter) never end up being consolidated into few pack files?

Our initial clone is very sparse - we only pull down the commit we are about to checkout and none of the blobs. All missing objects are then downloaded on demand (and in this proposal, would end up in a "partial" pack file). For performance reasons, we also (by default) download a server computed pack file of commits and trees to pre-populate the local cache.

Without modification, fsck, repack, prune, and gc will trigger every object in the repo to be downloaded. We punted for now and just block those commands, but eventually they need to be aware of missing objects so that they do not cause them to be downloaded. Jonathan is already working on this for fsck in another patch series.

> [...]
>> That thinking did lead me back to wondering again if we could live with a repo specific flag. If any clone/fetch was "partial" the flag is set and fsck ignores missing objects whether they came from a "partial" remote or not.
>>
>> I'll admit it isn't as robust if someone is mixing and matching remotes from different servers some of which are partial and some of which are not. I'm not sure how often that would actually happen but I _am_ certain a single repo specific flag is a _much_ simpler model than anything else we've come up with so far.
>
> The primary motivation in this thread is locally-created objects, not objects obtained from other remotes. Objects obtained from other remotes are more of an edge case.

Thank you - that helps me to better understand the requirements of the problem we're trying to solve. In short, that means what we really need is a way to identify locally created objects so that fsck can do a complete connectivity check on them. I'll have to think about a good way to do that - we've talked about a few but each has a different set of trade-offs and none of them are great (yet :)).

> Thanks for your thoughtful comments.
> Jonathan
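For what "exponential rollup" might mean concretely, here is a small sketch of one possible boundary rule. The factor-of-two threshold and the algorithm are assumptions for illustration, not what "git gc --auto" does today.

```python
def rollup_boundary(sizes, factor=2):
    """Pick which packs to consolidate so that the surviving packs form
    a geometric progression: each pack at least `factor` times larger
    than the sum of all smaller packs. Returns the sizes (smallest
    first) of the packs that should be rolled into one new pack."""
    sizes = sorted(sizes)
    prefix = 0   # running sum of all packs smaller than the current one
    split = 0    # everything before this index gets rolled up
    for i, size in enumerate(sizes):
        if size < factor * prefix:
            split = i + 1  # this pack is too small to stand alone
        prefix += size
    return sizes[:split]

# three tiny packs get consolidated; the big pack is left alone
assert rollup_boundary([1, 1, 1, 100]) == [1, 1, 1]
assert rollup_boundary([100]) == []
```

With a rule like this, the number of packs stays roughly logarithmic in total repository size, which is what addresses the linear-search concern above.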
Re: Partial clone design (with connectivity check for locally-created objects)
On Mon, 7 Aug 2017 15:12:11 -0400 Ben Peart wrote:

> I missed the offline discussion and so am trying to piece together what this latest design is trying to do. Please let me know if I'm not understanding something correctly.
>
> From what I can tell, objects are going to be segmented into two "types" - those that were fetched from a remote source that allows partial clones/fetches (lazyobject/imported) and those that come from "regular" remote sources (homegrown) that requires all objects to exist locally.
>
> FWIW, the names here are not making things clearer for me. If I'm correct, perhaps "partial" and "normal" would be better to indicate the type of the source? Anyway...

That's right. As for names, I'm leaning now towards "imported" and "non-imported". "Partial" is a bit strange because such an object is fully available; it's just that the objects that it references are promised by the server.

> Once the objects are segmented into the 2 types, the fsck connectivity check code is updated to ignore missing objects from "partial" remotes but still expect/validate them from "normal" remotes.
>
> This compromise seems reasonable - don't generate errors for missing objects for remotes that returned a partial clone but do generate errors for missing objects from normal clones, as a missing object is always an error in this case.

Yes. In addition, the references of "imported" objects are also potentially used when connectivity-checking "non-imported" objects - if a "non-imported" object refers to an object that an "imported" object refers to, that is fine, even though we don't have that object.

> This segmentation is what is driving the need for the object loader to build a new local pack file for every command that has to fetch a missing object. For example, we can't just write a tree object from a "partial" clone into the loose object store as we have no way for fsck to treat them differently and ignore any missing objects referenced by that tree object.
>
> My concern with this proposal is the combination of 1) writing a new pack file for every git command that ends up bringing down a missing object and 2) gc not compressing those pack files into a single pack file.
>
> We all know that git doesn't scale well with a lot of pack files as it has to do a linear search through all the pack files when attempting to find an object. I can see that very quickly, there would be a lot of pack files generated and with gc ignoring "partial" pack files, this would never get corrected.
>
> In our usage scenarios, _all_ of the objects come from "partial" clones so all of our objects would end up in a series of "partial" pack files and would have pretty poor performance as a result.

One possible solution: would support for annotating loose objects with ".remote" be sufficient? (That is, for each loose object file created, create another of the same name but with ".remote" appended.) This means that a loose-object-creating lazy loader would need to create 2 files per object instead of one.

The lazy loader protocol will thus be updated to something resembling a prior version, with the loader writing objects directly to the object database, but now the loader is also responsible for creating the ".remote" files. (In the Android use case, we probably won't need the writing-to-partial-packfile mechanism anymore since only comparatively few and large blobs will go in there.)
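The ".remote" companion-file convention for loose objects could be sketched as follows. The directory layout mirrors Git's objects/ab/cdef... fan-out; the helper names and the blob-only header are simplifications for illustration.

```python
import hashlib
import os
import tempfile
import zlib

objdir = os.path.join(tempfile.mkdtemp(), "objects")

def loose_path(sha1_hex):
    # .git/objects/ab/cdef... fan-out layout
    return os.path.join(objdir, sha1_hex[:2], sha1_hex[2:])

def write_loose(data, remote=False):
    """Write a loose blob; if it came from the lazy loader, also create
    the companion ".remote" marker file (content is arbitrary)."""
    hdr = b"blob %d\x00" % len(data)
    sha1 = hashlib.sha1(hdr + data).hexdigest()
    path = loose_path(sha1)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(hdr + data))
    if remote:
        open(path + ".remote", "w").close()
    return sha1

def is_imported(sha1_hex):
    """fsck's test: missing objects referenced only by "imported"
    objects are tolerated; everything else must connect."""
    return os.path.exists(loose_path(sha1_hex) + ".remote")

local_id = write_loose(b"created here")
fetched_id = write_loose(b"from the server", remote=True)
assert not is_imported(local_id)
assert is_imported(fetched_id)
```

The two-files-per-object cost is the trade-off Jonathan raises above: the loader does one extra open/close per object it writes.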
Re: Partial clone design (with connectivity check for locally-created objects)
Ben Peart writes:

> My concern with this proposal is the combination of 1) writing a new pack file for every git command that ends up bringing down a missing object and 2) gc not compressing those pack files into a single pack file.

Your noticing these is a sign that you read the outline of the design correctly, I think.

The basic idea is that the local fsck should tolerate missing objects when they are known to be obtainable from that external service, but should still be able to diagnose missing objects that we do not know if the external service has, especially the ones that have been newly created locally and not yet made available to them by pushing them back.

So we need a way to tell if an object that we do not have (but we know about) can later be obtained from the external service. Maintaining an explicit list of such objects obviously is one way, but we can get the moral equivalent by using pack files. After receiving a pack file that has a commit from such an external service, if the commit refers to its parent commit that we do not have locally, the design proposes us to consider that the parent commit that is missing is available at the external service that gave the pack to us. Similarly for missing trees, blobs, and any objects that are supposed to be "reachable" from objects in such a packfile.

We can extend the approach to cover loose objects if we wanted to; just define an alternate object store used internally for this purpose and drop loose objects obtained from such an external service in that object store.

Because we do not want to leave too many loose objects and small packfiles lying around, we will need a new way of packing these. Just enumerate these objects known to have come from the external service (by being in packfiles marked as such or being loose objects in the dedicated alternate object store), and create a single larger packfile, which is marked as "holding the objects that are known to be in the external service". We do not have such a mode of gc, and that is a new development that needs to happen, but we know that is doable.

> That thinking did lead me back to wondering again if we could live with a repo specific flag. If any clone/fetch was "partial" the flag is set and fsck ignores missing objects whether they came from a "partial" remote or not.

The only reason people run "git fsck" is to make sure that their local repository is sound and they can rely on the objects they have as the base of building new stuff on top of. That is why we are trying to find a way to make sure "fsck" can be used to detect broken or missing objects that cannot be obtained from the lazy-object store, without incurring undue overhead for the normal codepath (i.e. outside fsck).

It is OK to go back to wondering again, but I think that essentially tosses "git fsck" out of the window and declares that it is OK to hope that local objects will never go bad. We can make such a declaration anytime, but I do not want to see us doing so without first trying to solve the issue without punting.
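The new gc mode described above - enumerate everything known to come from the external service and write it into one pack marked as such - is essentially a partitioning pass over the object store. A hypothetical sketch (the data shapes are invented for illustration):

```python
def plan_repack(packs, loose):
    """Partition object sources into two repack jobs, never merging
    objects known to come from the external service with local ones.

    packs: list of (pack_name, has_remote_marker) tuples
    loose: list of (object_id, in_lazy_alternate) tuples
    Returns the sources feeding each of the two new packs; the
    "remote" pack would then be written with a ".remote" marker."""
    remote = [n for n, r in packs if r] + [o for o, lazy in loose if lazy]
    local = [n for n, r in packs if not r] + [o for o, lazy in loose if not lazy]
    return {"remote_pack": remote, "local_pack": local}

plan = plan_repack(
    [("pack-1.pack", True), ("pack-2.pack", False)],
    [("abc123", True), ("def456", False)],
)
assert plan["remote_pack"] == ["pack-1.pack", "abc123"]
assert plan["local_pack"] == ["pack-2.pack", "def456"]
```

The invariant gc must preserve is that the "known to be in the external service" property survives repacking, which is why the two groups can never be merged into one pack.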
Re: Partial clone design (with connectivity check for locally-created objects)
Hi,

Ben Peart wrote:
>> On Fri, 04 Aug 2017 15:51:08 -0700 Junio C Hamano wrote:
>>> Jonathan Tan writes:
>>>> "Imported" objects must be in a packfile that has a ".remote" file with arbitrary text (similar to the ".keep" file). They come from clones, fetches, and the object loader (see below).
>>>> ...
>>>> A "homegrown" object is valid if each object it references:
>>>> 1. is a "homegrown" object,
>>>> 2. is an "imported" object, or
>>>> 3. is referenced by an "imported" object.
>>>
>>> Overall it captures what was discussed, and I think it is a good start.
>
> I missed the offline discussion and so am trying to piece together what this latest design is trying to do. Please let me know if I'm not understanding something correctly.

I believe https://public-inbox.org/git/cover.1501532294.git.jonathanta...@google.com/ and the surrounding thread (especially https://public-inbox.org/git/xmqqefsudjqk@gitster.mtv.corp.google.com/) is the discussion Junio is referring to.

[...]
> This segmentation is what is driving the need for the object loader to build a new local pack file for every command that has to fetch a missing object. For example, we can't just write a tree object from a "partial" clone into the loose object store as we have no way for fsck to treat them differently and ignore any missing objects referenced by that tree object.

That's related and how it got lumped into this proposal, but it's not the only motivation. Other aspects:

 1. using pack files instead of loose objects means we can use deltas. This is the primary motivation.
 2. pack files can use reachability bitmaps (I realize there are obstacles to getting benefit out of this because git's bitmap format currently requires a pack to be self-contained, but I thought it was worth mentioning for completeness).
 3. existing git servers are oriented around pack files; they can more cheaply serve objects from pack files in pack format, including reusing deltas from them.
 4. file systems cope better with a few large files than with many small files

[...]
> We all know that git doesn't scale well with a lot of pack files as it has to do a linear search through all the pack files when attempting to find an object. I can see that very quickly, there would be a lot of pack files generated and with gc ignoring "partial" pack files, this would never get corrected.

Yes, that's an important point. Regardless of this proposal, we need to get more aggressive about concatenating pack files (e.g. by implementing exponential rollup in "git gc --auto").

> In our usage scenarios, _all_ of the objects come from "partial" clones so all of our objects would end up in a series of "partial" pack files and would have pretty poor performance as a result.

Can you say more about this? Why would the pack files (or loose objects, for that matter) never end up being consolidated into few pack files?

[...]
> That thinking did lead me back to wondering again if we could live with a repo specific flag. If any clone/fetch was "partial" the flag is set and fsck ignores missing objects whether they came from a "partial" remote or not.
>
> I'll admit it isn't as robust if someone is mixing and matching remotes from different servers some of which are partial and some of which are not. I'm not sure how often that would actually happen but I _am_ certain a single repo specific flag is a _much_ simpler model than anything else we've come up with so far.

The primary motivation in this thread is locally-created objects, not objects obtained from other remotes. Objects obtained from other remotes are more of an edge case.

Thanks for your thoughtful comments.
Jonathan
Re: Partial clone design (with connectivity check for locally-created objects)
On 8/4/2017 8:21 PM, Jonathan Tan wrote:
> On Fri, 04 Aug 2017 15:51:08 -0700 Junio C Hamano wrote:
>> Jonathan Tan writes:
>>> "Imported" objects must be in a packfile that has a ".remote" file with arbitrary text (similar to the ".keep" file). They come from clones, fetches, and the object loader (see below).
>>> ...
>>> A "homegrown" object is valid if each object it references:
>>> 1. is a "homegrown" object,
>>> 2. is an "imported" object, or
>>> 3. is referenced by an "imported" object.
>>
>> Overall it captures what was discussed, and I think it is a good start.

I missed the offline discussion and so am trying to piece together what this latest design is trying to do. Please let me know if I'm not understanding something correctly.

From what I can tell, objects are going to be segmented into two "types" - those that were fetched from a remote source that allows partial clones/fetches (lazyobject/imported) and those that come from "regular" remote sources (homegrown) that requires all objects to exist locally.

FWIW, the names here are not making things clearer for me. If I'm correct, perhaps "partial" and "normal" would be better to indicate the type of the source? Anyway...

Once the objects are segmented into the 2 types, the fsck connectivity check code is updated to ignore missing objects from "partial" remotes but still expect/validate them from "normal" remotes.

This compromise seems reasonable - don't generate errors for missing objects for remotes that returned a partial clone but do generate errors for missing objects from normal clones, as a missing object is always an error in this case.

This segmentation is what is driving the need for the object loader to build a new local pack file for every command that has to fetch a missing object. For example, we can't just write a tree object from a "partial" clone into the loose object store as we have no way for fsck to treat them differently and ignore any missing objects referenced by that tree object.

My concern with this proposal is the combination of 1) writing a new pack file for every git command that ends up bringing down a missing object and 2) gc not compressing those pack files into a single pack file.

We all know that git doesn't scale well with a lot of pack files as it has to do a linear search through all the pack files when attempting to find an object. I can see that very quickly, there would be a lot of pack files generated and with gc ignoring "partial" pack files, this would never get corrected.

In our usage scenarios, _all_ of the objects come from "partial" clones so all of our objects would end up in a series of "partial" pack files and would have pretty poor performance as a result.

I wondered if it is possible to flag a specific remote as "partial" and have fsck be able to track any given object back to the remote and then properly handle the fact that it was missing based on that. I couldn't think of a good way to do that without some additional data structure that would have to be built/maintained (ie promises).

That thinking did lead me back to wondering again if we could live with a repo specific flag. If any clone/fetch was "partial" the flag is set and fsck ignores missing objects whether they came from a "partial" remote or not. I'll admit it isn't as robust if someone is mixing and matching remotes from different servers some of which are partial and some of which are not. I'm not sure how often that would actually happen but I _am_ certain a single repo specific flag is a _much_ simpler model than anything else we've come up with so far.

>> I doubt you want to treat all fetches/clones the same way as the "lazy object" loading, though. You may be critically relying on the corporate central server that will give the objects it "promised" when you cloned from it lazily (i.e. it may have given you a commit, but not its parents or objects contained in its tree--you still know that the parents and the tree and its contents will later be available and rely on that fact). You trust that and build on top, so the packfile you obtained when you cloned from such a server should count as "imported". But if you exchanged wip changes with your colleagues by fetching or pushing peer-to-peer, without the corporate central server knowing, you would want to treat objects in packs (or loose objects) you obtained that way as "not imported".
>
> That's true. I discussed this with a teammate and we might need to make extensions.lazyObject be the name of the "corporate central server" remote instead, and have a "loader" setting within that remote, so that we can distinguish that objects from this server are "imported" but objects from other servers are not. The connectivity check shouldn't be slow in this case because fetches are usually onto tips that we have (so we don't hit case 3).
>
>> Also I think "imported" vs "homegrown" may be a bit of a misnomer;
Re: Partial clone design (with connectivity check for locally-created objects)
On Fri, 04 Aug 2017 15:51:08 -0700 Junio C Hamano wrote:

> Jonathan Tan writes:
>
>> "Imported" objects must be in a packfile that has a ".remote" file with arbitrary text (similar to the ".keep" file). They come from clones, fetches, and the object loader (see below).
>> ...
>> A "homegrown" object is valid if each object it references:
>> 1. is a "homegrown" object,
>> 2. is an "imported" object, or
>> 3. is referenced by an "imported" object.
>
> Overall it captures what was discussed, and I think it is a good start.
>
> I doubt you want to treat all fetches/clones the same way as the "lazy object" loading, though. You may be critically relying on the corporate central server that will give the objects it "promised" when you cloned from it lazily (i.e. it may have given you a commit, but not its parents or objects contained in its tree--you still know that the parents and the tree and its contents will later be available and rely on that fact). You trust that and build on top, so the packfile you obtained when you cloned from such a server should count as "imported". But if you exchanged wip changes with your colleagues by fetching or pushing peer-to-peer, without the corporate central server knowing, you would want to treat objects in packs (or loose objects) you obtained that way as "not imported".

That's true. I discussed this with a teammate and we might need to make extensions.lazyObject be the name of the "corporate central server" remote instead, and have a "loader" setting within that remote, so that we can distinguish that objects from this server are "imported" but objects from other servers are not. The connectivity check shouldn't be slow in this case because fetches are usually onto tips that we have (so we don't hit case 3).

> Also I think "imported" vs "homegrown" may be a bit of a misnomer; the idea to split objects into two camps sounds like a good idea, and "imported" probably is an OK name to use for the category that is a group of objects which you know/trust are backed by your lazy loader. But the other one does not have to be "home"-grown.
>
> Well, the names are not that important, but I think the line between the two classes should not be "everything that came from clone and fetch is imported", which is a more important point I am trying to make.
>
> Thanks.

Maybe "imported" vs "non-imported" would be better. I agree that the objects in the non-"imported" group could still be obtained from elsewhere.

Thanks for your comments.
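A toy illustration of that per-remote distinction. The config keys shown, including remote.&lt;name&gt;.loader and extensions.lazyObject naming a remote, are the proposal under discussion here, not settings that exist in Git today.

```python
# Hypothetical configuration mirroring the keys discussed in this
# thread; none of these settings exist in Git today.
config = {
    "extensions.lazyobject": "origin",   # the trusted central server
    "remote.origin.url": "https://central.example.com/repo.git",
    "remote.origin.loader": "/usr/local/bin/read-object-hook",
    "remote.peer.url": "peer.example.com:repo.git",
}

def objects_are_imported(remote_name):
    """Only objects fetched from the lazyObject remote count as
    "imported" (their missing referents are promised by that server);
    objects from any other remote get the full connectivity check."""
    return config.get("extensions.lazyobject") == remote_name

assert objects_are_imported("origin")       # corporate central server
assert not objects_are_imported("peer")     # peer-to-peer wip exchange
```

This captures Junio's point: a fetch from a colleague's repository must not silently inherit the central server's promises.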
Re: Partial clone design (with connectivity check for locally-created objects)
Jonathan Tan writes:

> "Imported" objects must be in a packfile that has a ".remote"
> file with arbitrary text (similar to the ".keep" file). They come from
> clones, fetches, and the object loader (see below).
> ...
> A "homegrown" object is valid if each object it references:
>  1. is a "homegrown" object,
>  2. is an "imported" object, or
>  3. is referenced by an "imported" object.

Overall it captures what was discussed, and I think it is a good
start.

I doubt you want to treat all fetches/clones the same way as the
"lazy object" loading, though. You may critically rely on the
corporate central server that will give the objects it "promised"
when you cloned from it lazily (i.e. it may have given you a commit,
but not its parents or objects contained in its tree--you still know
that the parents and the tree and its contents will later be
available and rely on that fact). You trust that and build on top,
so the packfile you obtained when you cloned from such a server
should count as "imported". But if you exchanged wip changes with
your colleagues by fetching or pushing peer-to-peer, without the
corporate central server knowing, you would want to treat objects in
packs (or loose objects) you obtained that way as "not imported".

Also I think "imported" vs "homegrown" may be a bit of a misnomer;
the idea to split objects into two camps sounds like a good idea, and
"imported" probably is an OK name to use for the category that is a
group of objects which you know/trust are backed by your lazy
loader. But the other one does not have to be "home"-grown.

Well, the names are not that important, but I think the line between
the two classes should not be "everything that came from clone and
fetch is imported", which is the more important point I am trying to
make.

Thanks.
Partial clone design (with connectivity check for locally-created objects)
After some discussion in [1] (in particular, about preserving the
functionality of the connectivity check as much as possible) and some
in-office discussion, here's an updated design.

Overview
========

This is an update of the design in [1]. The main difference between
this and other related work [1] [2] [3] is that we can still check
connectivity between locally-created objects without having to consult
a remote server for any information.

In addition, the object loader writes to an incomplete packfile. This
(i) ensures that Git has immediate access to the object, (ii) ensures
that not too many files are written during a single Git invocation,
and (iii) prevents some unnecessary copies (compared to, for example,
transmitting entire objects through the protocol).

Local repo layout
=================

Objects in the local repo are further divided into "homegrown" and
"imported" objects.

"Imported" objects must be in a packfile that has a ".remote" file
with arbitrary text (similar to the ".keep" file). They come from
clones, fetches, and the object loader (see below).

"Homegrown" objects are all other objects.

Object loader
=============

The object loader is a process that can obtain objects from elsewhere,
given their hashes, and write their packed representation to a
client-given file.

The first time a missing object is needed during an invocation of Git,
Git creates a temporary packfile and writes the header with a
placeholder number of objects. Then, it starts the object loader,
passing in the name of that temporary packfile. Whenever a missing
object is needed, Git sends the hash of the missing object and expects
the loader to append (with O_APPEND) the object to that packfile. Git
keeps track of the object offsets as it goes, so that Git can use the
contents of that incomplete packfile. This is similar to what "git
fast-import" does.
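To illustrate the placeholder-header trick, here is a rough Python
sketch. The header layout ("PACK", a 4-byte version, then a big-endian
4-byte object count) is the standard pack format; everything else here
is a simplification, not actual Git code:

```python
import struct

def write_placeholder_header(f):
    # Standard pack header: magic, version 2, and an object count
    # that is not yet known -- write 0 as a placeholder.
    f.write(b"PACK")
    f.write(struct.pack(">I", 2))
    f.write(struct.pack(">I", 0))

def patch_object_count(f, count):
    # On exit, Git seeks back and fills in the real object count
    # at offset 8, then appends the pack checksum and renames the
    # file into place (not shown here).
    f.seek(8)
    f.write(struct.pack(">I", count))
```

The loader process appends object entries with O_APPEND while Git
tracks their offsets, so the incomplete pack is usable even before the
header is patched.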
When Git exits, it writes the number of objects in the header, writes
the packfile checksum, moves the packfile to its final location, and
writes a .idx and a .remote file.

Connectivity check
==================

An object walk is performed as usual from the tips (see the
documentation for fsck etc. for which tips they use).

A "homegrown" object is valid if each object it references:

 1. is a "homegrown" object,
 2. is an "imported" object, or
 3. is referenced by an "imported" object.

The references of an "imported" object are not checked.

Performance notes
-----------------

Because of rule 3 above, iteration through every "imported" object
(or, at least, every "imported" object of a certain type) is sometimes
required.

For fsck, this should be fine because (i) this is not a regression,
since currently all objects must be iterated through anyway, and (ii)
fsck prioritizes correctness over speed.

For fetch, the speed of the connectivity check is immaterial; the
connectivity check no longer needs to be performed, because all
objects obtained from the remote are, by definition, "imported"
objects.

There might be connectivity checks run during other commands like
"receive-pack". I don't expect partial clones to use these often.
These commands will still work, but their performance is a secondary
concern in this design.

Impact on other tools
=====================

"git gc" will need to not do anything to an "imported" object, even if
it is unreachable, without ensuring that the connectivity check will
succeed in that object's absence. (Pay special attention to rule 3
under "Connectivity check".)

If this design stands, the initial patch set will probably have "git
gc" not touch "imported" packs at all, trivially satisfying the above.
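In pseudocode, the validity rule for "homegrown" objects could be
sketched like this (a toy model using plain sets, not actual Git
code):

```python
def valid_homegrown(obj, refs_of, homegrown, imported, imported_refs):
    """A "homegrown" object is valid if each object it references
    (1) is a "homegrown" object, (2) is an "imported" object, or
    (3) is referenced by an "imported" object (imported_refs)."""
    return all(
        r in homegrown or r in imported or r in imported_refs
        for r in refs_of.get(obj, ())
    )
```

Rule 3 is what lets a homegrown object point at an object that is
missing locally, as long as some imported object also references it
(so the lazy loader is known to be able to provide it).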
In the future, "git gc" will either need to expel such objects into
loose objects (like what is currently done for normal packs), treating
them like "homegrown" objects (unreachable, so they won't interfere
with future connectivity checks), or delete them outright - but there
may be race conditions to think about.

"git repack" will need to differentiate between packs with ".remote"
and packs without.

[1] https://public-inbox.org/git/cover.1501532294.git.jonathanta...@google.com/
[2] https://public-inbox.org/git/20170714132651.170708-1-benpe...@microsoft.com/
[3] https://public-inbox.org/git/20170803091926.1755-1-chrisc...@tuxfamily.org/