Re: Re-Transmission of blobs?
On Fri, Sep 20, 2013 at 11:27:15AM +0200, Josef Wolf wrote:

> > Yes. If you know that the receiver has commit X, and you want to
> > know if it has some blob Y, the only way to know for sure is to look
> > at every tree of every commit reachable from X, and see whether any
> > of them references Y.
>
> Jeff, in my original example, I did a cherry-pick from
> origin/somebranch.

Sorry, I thought we were talking about the general case, not your
specific example.

> Even without asking, we can assume with great probability that
> origin/somebranch is available at origin.

Bear in mind that the transfer process does not know about
cherry-picking at all. It only sees the other side's tips and traverses.

> And the file in question happens to reside in the tree at the very tip
> of origin/somebranch, not in some of its ancestors. In this case,
> there's no need to search the history at all. And even in this pretty
> simple case, the algorithm seems to fail for some reason.

Correct. And in the current code, we should be looking at the tip tree
for your case. However, the usual reason to do so is to mark those
objects as a preferred base in pack-objects for doing deltas. I wonder
if we are not correctly noticing the case that an object is both
requested to be sent and marked as a preferred base (in which case we
should drop it from our sending list).

If that's the problem, it should be easy to fix cheaply. It would not
work in the general case, but it would for your specific example. But
since it costs nothing, there's no reason not to. I'll see if I can
investigate using the example script you posted.

> > Yes, you do not have to recurse into sub-trees (or commits) you have
> > already seen. And we already do that optimization.
>
> Why is the file re-transmitted, then?

I meant we do the optimization during history traversal that avoids
going into sub-trees we have already seen. We do _not_ do the full
history traversal for a partial push.
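The traversal being discussed here, walking every object reachable from
a tip while never entering an already-seen sub-tree, can be sketched in
a few lines. This is a toy in-memory model, not Git's actual code; the
object ids and the name walk_reachable are illustrative:

  import sys  # stdlib only; toy model of a git object store

  # Fake object store: commits point at a tree and parents; trees point
  # at blobs/sub-trees. Keys are stand-ins for sha1s.
  OBJECTS = {
      "C1": ("commit", ["T1"]),        # root commit
      "C2": ("commit", ["T2", "C1"]),  # child commit
      "T1": ("tree",   ["B1"]),
      "T2": ("tree",   ["B1", "B2"]),  # T2 still references blob B1
      "B1": ("blob",   []),
      "B2": ("blob",   []),
  }

  def walk_reachable(tips):
      """Collect every object reachable from the tips, never entering
      an object twice (the already-seen optimization Jeff mentions)."""
      seen = set()
      stack = list(tips)
      while stack:
          sha = stack.pop()
          if sha in seen:
              continue  # skip sub-trees/commits we have already seen
          seen.add(sha)
          _, children = OBJECTS[sha]
          stack.extend(children)
      return seen

  # Objects to send = reachable from our tip, minus reachable from
  # the receiver's tip.
  to_send = walk_reachable(["C2"]) - walk_reachable(["C1"])

Note that even with the seen-set, a full subtraction like the last line
is exactly the history traversal that Git skips for a partial push,
which is why the cherry-picked blob gets resent.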
> With a little change in the protocol, a very simple optimization could
> be implemented, avoiding the complicated bitmap strategy you were
> talking about: [...] In the last step, instead of sending a packfile,
> the sending side should send a list of the SHA's which would be
> included in this packfile. The receiving side would then be able to
> request all the objects it needs to get up-to-date.

I think you have a mis-impression of the problem bitmaps are trying to
solve, mostly because it is not the problem you presented in your thread
(but your problem is one that bitmaps can help with).

Consider what the sending side has to do to come up with that list of
objects to send. It has to traverse history to do it, looking at each
tree of each commit that is going to be sent. That effort is
proportional to the amount of history we are going to send. For a small
push or fetch, it is not much. For a clone, it can be quite a lot (tens
of seconds of CPU time per clone). Bitmaps drastically reduce the amount
of CPU required.

Bitmaps can _also_ solve other problems, like letting us be more
thorough in realizing which objects the other side has (without spending
effort on an expensive traversal). If that were the only thing they did,
it might not be worth it. But we basically get that for free by solving
the other problem.

So I do not think such a protocol extension is an argument against pack
bitmaps; you would want them with or without the protocol change.

> I think this change would be considerably simpler than the
> reachability bitmap you are talking about. And it would avoid all
> those time-consuming traversals through the history and the tree. And
> it would omit _all_ redundant retransmissions. Even in the case when
> sender and receiver do not have _any_ common heads at all, _no_ files
> at all would be retransmitted unnecessarily.

Yes, that would be nice. However, in the common cases it would make
things much worse. A clone of linux.git has ~3.5M objects. That's 70MB
of sha1s that the server tells the client ("tell me which of these you
need"), and then another 70MB for the client to say "yep, I need all of
them". You could have the client instead say "here are the few I
_don't_ need", which would save the second half in the common cases.

And of course it would be smaller for a smaller fetch/push. Just looking
at "git rev-list --objects --all --since=1.week.ago" in the kernel,
there are ~77K new objects. So over the course of a week, we use an
extra 1.5MB of bandwidth. How many objects did we save ourselves from
sending, and how big were they (keep in mind they will typically be
deltas against objects you already have anyway)? The answer would depend
on your cherry-picking and reverting habits. But I would guess in a
normal workflow that you would not come close to breaking even.

There are, of course, cases where the user _knows_ there is a huge
object that the other side has, and they do not want to have to send it
again. For that case, it would be useful to have something like the
protocol you described.
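The figures above follow directly from the size of a binary SHA-1: 20
bytes per object id. A quick check of the arithmetic:

  SHA1_BYTES = 20  # a binary sha1 is 20 bytes on the wire

  def listing_cost_mb(num_objects):
      """Bandwidth, in megabytes, to list one sha1 per object."""
      return num_objects * SHA1_BYTES / 1e6

  clone_cost = listing_cost_mb(3_500_000)  # full clone of linux.git
  weekly_cost = listing_cost_mb(77_000)    # one week of kernel objects

clone_cost comes out at 70 MB (each direction), and weekly_cost at about
1.5 MB, matching the numbers in the message.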
Re: Re-Transmission of blobs?
On Tue, Sep 24, 2013 at 03:36:13AM -0400, Jeff King wrote:

> On Fri, Sep 20, 2013 at 11:27:15AM +0200, Josef Wolf wrote:
>
> > Even without asking, we can assume with great probability that
> > origin/somebranch is available at origin.
>
> Bear in mind that the transfer process does not know about
> cherry-picking at all.

It doesn't need to know.

> It only sees the other side's tips and traverses.

The sending side knows with high probability that origin/somebranch is
available at the receiving side (unless it was deleted). And since the
file in question is part of the tree at the tip of origin/somebranch, we
can deduce that the file is available on the other side (unless it was
deleted).

> > And the file in question happens to reside in the tree at the very
> > tip of origin/somebranch, not in some of its ancestors. In this
> > case, there's no need to search the history at all. And even in this
> > pretty simple case, the algorithm seems to fail for some reason.
>
> Correct. And in the current code, we should be looking at the tip tree
> for your case. However, the usual reason to do so is to mark those
> objects as a preferred base in pack-objects for doing deltas. I wonder
> if we are not correctly noticing the case that an object is both
> requested to be sent and marked as a preferred base (in which case we
> should drop it from our sending list).

Further, it seems that the marking as preferred base had no effect,
since the delta should have been zero in this case. Or is this mechanism
deactivated for binary data (/dev/zero in this case)?

> If that's the problem, it should be easy to fix cheaply. It would not
> work in the general case, but it would for your specific example. But
> since it costs nothing, there's no reason not to. I'll see if I can
> investigate using the example script you posted.

Thanks!

> I meant we do the optimization during history traversal that avoids
> going into sub-trees we have already seen. We do _not_ do the full
> history traversal for a partial push.

OK. I see.

Maybe a config option to request a full traversal would be a reasonable
compromise? That way CPU could be traded against bandwidth for
repositories that happen to have slow/unreliable/expensive connections.

> Yes, that would be nice. However, in the common cases it would make
> things much worse. A clone of linux.git has ~3.5M objects.

Of course, if there's nothing you can drop, any attempt to drop objects
will add overhead. That's similar to compressing compressed files: this
will enlarge the original file. Would that be a reasonable argument to
get rid of all attempts to compress files?

-- 
Josef Wolf
j...@raven.inka.de
--
To unsubscribe from this list: send the line "unsubscribe git" in the
body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Re-Transmission of blobs?
On Mon, Sep 16, 2013 at 05:55:36PM -0400, Jeff King wrote:

> On Fri, Sep 13, 2013 at 12:09:35PM +0200, Josef Wolf wrote:
>
> > I'm not sure I understand correctly. I see that bitmaps can be used
> > to implement set operations. But how come that walking the graph
> > requires a lot of CPU? Isn't it O(n)?
>
> Yes and no. Your n there is the entirety of history.

Is this really true?

> Yes. If you know that the receiver has commit X, and you want to know
> if it has some blob Y, the only way to know for sure is to look at
> every tree of every commit reachable from X, and see whether any of
> them references Y.

Jeff, in my original example, I did a cherry-pick from
origin/somebranch. Even without asking, we can assume with great
probability that origin/somebranch is available at origin. And the file
in question happens to reside in the tree at the very tip of
origin/somebranch, not in some of its ancestors. In this case, there's
no need to search the history at all. And even in this pretty simple
case, the algorithm seems to fail for some reason.

> Yes, you do not have to recurse into sub-trees (or commits) you have
> already seen. And we already do that optimization.

Why is the file re-transmitted, then?

With a little change in the protocol, a very simple optimization could
be implemented, avoiding the complicated bitmap strategy you were
talking about. Please consider Junio's description of the dialog:

[ Junio wrote: ]
> Consider this simple history with only a handful of commits (as usual,
> time flows from left to right):
>
>           E
>          /
> A---B---C---D
>
> where D is at the tip of the sending side, E is at the tip of the
> receiving side. The exchange goes roughly like this:
>
> (receiving side): what do you have?
> (sending side): my tip is at D.
> (receiving side): D? I've never heard of it --- please give it to me.
>                   I have E.
> (sending side): E? I don't know about it; must be something you
>                 created since you forked from me. Tell me about its
>                 ancestors.
> (receiving side): OK, I have C.
> (sending side): Oh, C I know about. You do not have to tell me
>                 anything more. A packfile to bring you up to date
>                 will follow.

In the last step, instead of sending a packfile, the sending side should
send a list of the SHA's which would be included in this packfile. The
receiving side would then be able to request all the objects it needs to
get up-to-date.

I think this change would be considerably simpler than the reachability
bitmap you are talking about. And it would avoid all those
time-consuming traversals through the history and the tree. And it would
omit _all_ redundant retransmissions. Even in the case when sender and
receiver do not have _any_ common heads at all, _no_ files at all would
be retransmitted unnecessarily.

-- 
Josef Wolf
j...@raven.inka.de
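The proposed last step, advertising the would-be packfile's object list
and letting the receiver request only what it lacks, would look roughly
like this. This is purely illustrative (it is not Git's wire protocol,
and the function name negotiate is made up):

  def negotiate(sender_objects, receiver_objects):
      """Sender advertises the ids it would pack; receiver requests
      only the ones it does not already have."""
      advertised = sorted(sender_objects)  # the SHA list on the wire
      requested = [sha for sha in advertised
                   if sha not in receiver_objects]
      return advertised, requested

  advertised, requested = negotiate(
      sender_objects={"B1", "B2", "T2", "C2"},
      receiver_objects={"B1"},  # e.g. a blob gotten via cherry-pick
  )

As Jeff points out elsewhere in the thread, the advertisement itself
costs 20 bytes per object, which is what makes this trade unattractive
for a clone, where the receiver needs nearly everything anyway.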
Re: Re-Transmission of blobs?
On Fri, Sep 13, 2013 at 12:09:35PM +0200, Josef Wolf wrote:

> > > I'm not sure I understand correctly. I see that bitmaps can be
> > > used to implement set operations. But how come that walking the
> > > graph requires a lot of CPU? Isn't it O(n)?
> >
> > Yes and no. Your n there is the entirety of history.
>
> Is this really true?

Yes. If you know that the receiver has commit X, and you want to know if
it has some blob Y, the only way to know for sure is to look at every
tree of every commit reachable from X, and see whether any of them
references Y. You might get lucky and see that one of the first commits
you looked at mentions Y, but in the negative case, you have to go all
the way down to the roots.

> > (and each one needs to be pulled off of the disk, decompressed, and
> > reconstructed from deltas).
>
> While you need to unpack commits/trees to traverse further down, I
> can't see any reason to unpack/reconstruct blobs just to see whether
> you need to send it. The SHA is all you need to know, isn't it?

Correct. The full sentence that you partially quoted above was:

  So even though you are looking at each commit and tree only once,
  it's still a large number of them (and each one needs to be pulled
  off of the disk, decompressed, and reconstructed from deltas).

I.e., the "each one" is commits and trees. Even reading just them takes
a fair bit of time. Pulling each blob off of disk, too, takes even
longer. You can try it yourself like this:

  git rev-list --objects --all |
  cut -d' ' -f1 |
  git cat-file --batch-check >all-objects
  for i in commit tree blob; do
    grep $i all-objects | cut -d' ' -f1 >$i-objects
    echo >&2 "== $i"
    time git cat-file --batch <$i-objects >/dev/null
  done

For git.git, commits take about 0.5 seconds on my machine, trees 1
second, and blobs 13 seconds. For the kernel, it's 5, 22, and 210
seconds, respectively. Now those are times to actually cat the content
to /dev/null. Just looking at it internally is cheaper, but it gives you
a ballpark figure (and most of that time goes to zlib inflation, which
is the same either way).

> > Secondly, the graph traversal ends up seeing the same sha1s over and
> > over again in tree entries (because most entries in the tree don't
> > change from commit to commit).
>
> Whenever you see an object (whether commit or tree) that you already
> have seen, you can stop traversing further down this part of the
> graph/tree, as everything you will see on this part has already been
> seen before. Why would you see the same commits/trees over and over
> again? You'd stop traversing on the boundary of the already-seen
> territory, leaving the vast majority of the duplicated structure under
> the carpet. Somehow I fail to see the problem here.

Yes, you do not have to recurse into sub-trees (or commits) you have
already seen. And we already do that optimization. So you do not see the
whole recursive tree over and over, but you see almost-the-same
single-level trees repeatedly. Let me try to give an example. Here's the
root tree of git.git's v1.8.4 release:

  $ git ls-tree v1.8.4 | wc -l
  361

So we have to do 361 lookups, one per entry, to find that we haven't yet
processed each one. Now what happens when we look at the next commit?

  $ git ls-tree v1.8.4^ | wc -l
  361
  $ git diff-tree --abbrev v1.8.4 v1.8.4^
  :040000 040000 f3aec4c... a6e780e... M  Documentation
  :100755 100755 06026ea... 572dfeb... M  GIT-VERSION-GEN

Still 361 entries, but only two are changed. Yet we still have to go
through all 361 to figure out _which_ were changed. We can do that by
going linearly through the tree and checking each sha1 against a "have
we seen this?" data structure. Or we could diff on-the-fly between
adjacent trees, and only process those that we know we didn't just see.

The current code uses the "seen this" strategy with a hash table. I've
tried the diff strategy, but I couldn't make it any faster than using
the hash table. If we had a tree storage format that made diffs cheap
(like packv4), then traversing via diffs would probably be a win.

Note also that the cost of traversing is dependent on the shape of the
tree. Putting all of your files in the root directory does not perform
as well as having a nicely balanced tree structure, because we can't
weed out as many entries by noticing the whole sub-tree is unchanged.

-Peff
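The two strategies compared above, a "have we seen this?" hash table
versus diffing adjacent trees, can be sketched side by side. Toy
single-level trees stand in for git trees, and both function names are
mine, not Git's:

  def new_entries_hashtable(tree, seen):
      """Check all N entries against a seen-set: O(N) lookups per
      tree, even when almost nothing changed."""
      fresh = [sha for _, sha in tree if sha not in seen]
      seen.update(sha for _, sha in tree)
      return fresh

  def new_entries_diff(tree, prev_tree):
      """Diff against the adjacent tree: only changed entries
      surface, without touching the unchanged ones' ids."""
      prev = dict(prev_tree)
      return [sha for name, sha in tree if prev.get(name) != sha]

  # Modeled on the v1.8.4 example: two entries changed, one reused.
  tip = [("Documentation", "f3aec4c"),
         ("GIT-VERSION-GEN", "06026ea"),
         ("README", "aaaaaaa")]
  parent = [("Documentation", "a6e780e"),
            ("GIT-VERSION-GEN", "572dfeb"),
            ("README", "aaaaaaa")]

  seen = set()
  new_entries_hashtable(parent, seen)  # process the parent first
  changed_via_table = new_entries_hashtable(tip, seen)
  changed_via_diff = new_entries_diff(tip, parent)

Both find the same two changed entries; the difference is that the hash
table pays a lookup per entry (361 in the real example), while the diff
only pays for entries that actually changed, which is why cheap tree
diffs (packv4) would make the second strategy attractive.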
Re: Re-Transmission of blobs?
On Thu, Sep 12, 2013 at 03:44:53PM -0400, Jeff King wrote:

> On Thu, Sep 12, 2013 at 12:35:32PM +0200, Josef Wolf wrote:
>
> > I'm not sure I understand correctly. I see that bitmaps can be used
> > to implement set operations. But how come that walking the graph
> > requires a lot of CPU? Isn't it O(n)?
>
> Yes and no. Your n there is the entirety of history.

Is this really true?

> (and each one needs to be pulled off of the disk, decompressed, and
> reconstructed from deltas).

While you need to unpack commits/trees to traverse further down, I can't
see any reason to unpack/reconstruct blobs just to see whether you need
to send it. The SHA is all you need to know, isn't it?

> Secondly, the graph traversal ends up seeing the same sha1s over and
> over again in tree entries (because most entries in the tree don't
> change from commit to commit).

Whenever you see an object (whether commit or tree) that you already
have seen, you can stop traversing further down this part of the
graph/tree, as everything you will see on this part has already been
seen before. Why would you see the same commits/trees over and over
again? You'd stop traversing on the boundary of the already-seen
territory, leaving the vast majority of the duplicated structure under
the carpet. Somehow I fail to see the problem here.

-- 
Josef Wolf
j...@raven.inka.de
Re: Re-Transmission of blobs?
On Thu, Sep 12, 2013 at 08:06:35PM +0000, Pyeron, Jason J CTR (US)
wrote:

> Yes, but it is those awfully slow connections (slower than the looping
> issue) which happen to always drop while cloning from our office. And
> the round trip should be mitigated by http-keep-alives.
> [...]
> But, again if the connection drops, we have already lost the delta
> advantage. I would think the scenario would go like this:
>
> git clone url://blah/blah
> [fail]
> cd blah
> git clone --resume #uses normal methods
> [fail]
> while ! git clone --resume --HitItWithAStick
>
> replace clone with fetch for that use case too

Last time I checked, cloning could not be resumed:

http://git.661346.n2.nabble.com/git-clone-and-unreliable-links-td7570652.html

If you're on a slow/unreliable link, you've lost. :-( :-( :-(

-- 
Josef Wolf
j...@raven.inka.de
RE: Re-Transmission of blobs?
-----Original Message-----
From: Josef Wolf
Sent: Friday, September 13, 2013 6:23

> On Thu, Sep 12, 2013 at 08:06:35PM +0000, Pyeron, Jason J CTR (US)
> wrote:
>
> > Yes, but it is those awfully slow connections (slower than the
> > looping issue) which happen to always drop while cloning from our
> > office. And the round trip should be mitigated by http-keep-alives.
> > [...]
> > But, again if the connection drops, we have already lost the delta
> > advantage. I would think the scenario would go like this:
> >
> > git clone url://blah/blah
> > [fail]
> > cd blah
> > git clone --resume #uses normal methods

I am using the mythical --resume, where it would fetch packs and
indexes.

> > [fail]
> > while ! git clone --resume --HitItWithAStick
> >
> > replace clone with fetch for that use case too
>
> Last time I checked, cloning could not be resumed:
>
> http://git.661346.n2.nabble.com/git-clone-and-unreliable-links-td7570652.html
>
> If you're on a slow/unreliable link, you've lost.

That is kind of the point. It should be possible.

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
- Jason Pyeron                      PD Inc. http://www.pdinc.us  -
- Principal Consultant              10 West 24th Street #100     -
- +1 (443) 269-1555 x333            Baltimore, Maryland 21218    -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.
Re: Re-Transmission of blobs?
On Fri, Sep 13, 2013 at 3:06 AM, Pyeron, Jason J CTR (US)
jason.j.pyeron@mail.mil wrote:

> But, again if the connection drops, we have already lost the delta
> advantage. I would think the scenario would go like this:
>
> git clone url://blah/blah
> [fail]
> cd blah
> git clone --resume #uses normal methods
> [fail]
> while ! git clone --resume --HitItWithAStick
>
> replace clone with fetch for that use case too

Sorry if I missed something in this thread. But I think we could
stabilize the transferred pack so that --resume works. The sender
constructs exactly the same pack as in the first git clone, then it
starts sending from the offset given by the client. For that to work,
the first git clone must also be "git clone --resume". I started working
on that, but now my focus is pack v4, so that has to wait.
-- 
Duy
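Duy's idea is essentially range-resume over a byte-stable pack: if the
sender can rebuild a byte-identical pack, the client only needs to send
the offset where the last transfer died. A sketch under that assumption
(build_pack and serve are made-up stand-ins; real deterministic pack
construction is the hard part he defers):

  def build_pack(objects):
      """Stand-in for deterministic pack construction: same input,
      byte-identical output (the property --resume relies on)."""
      return b"".join(sha.encode() for sha in sorted(objects))

  def serve(objects, offset):
      """Send the stable pack starting at the client's resume
      offset."""
      return build_pack(objects)[offset:]

  objs = {"C1", "B1", "T1"}
  first_try = serve(objs, 0)[:4]          # connection drops at byte 4
  resumed = first_try + serve(objs, 4)    # client resumes at offset 4

The concatenation of the partial download and the resumed tail equals
the full pack, which is exactly what makes the retry loop in the quoted
scenario viable on a flaky link.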
Re: Re-Transmission of blobs?
On Wed, Sep 11, 2013 at 10:14:54 -0700, Junio C Hamano wrote:

> Josef Wolf j...@raven.inka.de writes:
>
> > On Tue, Sep 10, 2013 at 10:51:02 -0700, Junio C Hamano wrote:
> >
> > > Consider this simple history with only a handful of commits (as
> > > usual, time flows from left to right):
> > >
> > >           E
> > >          /
> > > A---B---C---D
> > >
> > > where D is at the tip of the sending side, E is at the tip of the
> > > receiving side. The exchange goes roughly like this:
> > >
> > > (receiving side): what do you have?
> > > (sending side): my tip is at D.
> > > (receiving side): D? I've never heard of it --- please give it to
> > >                   me. I have E.
> >
> > At this point, why would the receiving side not tell all the heads
> > it knows about?
>
> It did. The receiving end had only one branch whose tip is E. It may
> have a tracking branch that knows where the tip of the sending end
> used to be when it forked (which is C), so the above may say "I have E
> and C". It actually would say "I have B" and "I have A" and ... for a
> bounded number of commits, but that does not fundamentally change the
> picture---the important point is it is bounded and there is a horizon.

> > Therefore, the sending side has all information it needs to do any
> > optimizations you can think of...
>
> There is some work being done to optimize this further using various
> techniques, but they are not ready yet.

And this still stands. Do you have a pointer or something? I'd like to
check out whether I can contribute to this work.

-- 
Josef Wolf
j...@raven.inka.de
Re: Re-Transmission of blobs?
On Thu, Sep 12, 2013 at 09:42:41AM +0200, Josef Wolf wrote:

> > There is some work being done to optimize this further using various
> > techniques, but they are not ready yet.
>
> And this still stands. Do you have a pointer or something? I'd like to
> check out whether I can contribute to this work.

I think Junio is referring to the reachability bitmap work. We may know
that the other side has commit E (and therefore every object reachable
from it), but we do not walk the graph to find the complete set of
reachable objects. Doing so requires a lot of CPU and I/O, and in most
cases does not help much. However, if we had an index of reachable
objects (e.g., a bitmap) for each commit, then we could very cheaply
compute the set difference between what the other side wants and what
they have.

JGit has support for pack bitmaps already. There was a patch series a
few months ago to implement similar functionality for C git, but the
on-disk format was not compatible with JGit's. That series has been
reworked off-list to be compatible with the JGit implementation. Those
patches need a little cleanup before they are ready for the list, but
hopefully that should happen soon-ish.

-Peff
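The bitmap idea can be modeled with plain integers as bitsets: assign
every object a bit position, store reachability per commit as a bitmask,
and the "what to send" computation collapses to an AND-NOT with no graph
walk at all. Illustrative only; the real on-disk format is a compressed
bitmap index, not a Python dict:

  # One bit per object (in a real index, one bit per packed object).
  BIT = {"C1": 0, "T1": 1, "B1": 2, "C2": 3, "T2": 4, "B2": 5}

  # Precomputed reachability bitmaps for two commits: C1 reaches its
  # own tree and blob; C2 (a child of C1) reaches everything.
  BITMAP = {
      "C1": (1 << BIT["C1"]) | (1 << BIT["T1"]) | (1 << BIT["B1"]),
      "C2": 0b111111,
  }

  def objects_to_send(want_tip, have_tip):
      """Set difference via bit operations: reachable-from-want minus
      reachable-from-have."""
      diff = BITMAP[want_tip] & ~BITMAP[have_tip]
      return {name for name, bit in BIT.items() if diff & (1 << bit)}

  send = objects_to_send("C2", "C1")

This is the "very cheaply compute the set difference" step: two bitmap
loads and one bitwise expression, independent of how deep the history
is.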
Re: Re-Transmission of blobs?
On Thu, Sep 12, 2013 at 05:23:40 -0400, Jeff King wrote:

> On Thu, Sep 12, 2013 at 09:42:41AM +0200, Josef Wolf wrote:
>
> I think Junio is referring to the reachability bitmap work. We may
> know that the other side has commit E (and therefore every object
> reachable from it), but we do not walk the graph to find the complete
> set of reachable objects. Doing so requires a lot of CPU and I/O, and
> in most cases does not help much.

I'm not sure I understand correctly. I see that bitmaps can be used to
implement set operations. But how come that walking the graph requires a
lot of CPU? Isn't it O(n)?

> However, if we had an index of reachable objects (e.g., a bitmap) for
> each commit, then we could very cheaply compute the set difference
> between what the other side wants and what they have.

Those bitmaps would be stored in the git metadata? Is it worth it?
Storing a bitmap for every commit just to be used once in a while seems
to be a pretty big overhead to me. Not to mention the interoperability
problems you mentioned below.

> JGit has support for pack bitmaps already. There was a patch series a
> few months ago to implement similar functionality for C git, but the
> on-disk format was not compatible with JGit's. That series has been
> reworked off-list to be compatible with the JGit implementation. Those
> patches need a little cleanup before they are ready for the list, but
> hopefully that should happen soon-ish.

Sounds like you're already almost done and don't really need help
anymore. Just out of curiosity, I'd be interested in a pointer anyway
;-)

-- 
Josef Wolf
j...@raven.inka.de
RE: Re-Transmission of blobs?
-----Original Message-----
From: Jeff King
Sent: Thursday, September 12, 2013 5:24 AM

> On Thu, Sep 12, 2013 at 09:42:41AM +0200, Josef Wolf wrote:
>
> > > There is some work being done to optimize this further using
> > > various techniques, but they are not ready yet.
> >
> > And this still stands. Do you have a pointer or something? I'd like
> > to check out whether I can contribute to this work.
>
> I think Junio is referring to the reachability bitmap work. We may
> know that the other side has commit E (and therefore every object
> reachable from it), but we do not walk the graph to find the complete
> set of reachable objects. Doing so requires a lot of CPU and I/O, and
> in most cases does not help much.

If the rules of engagement are changed a bit, the server side can be
released from most of its work (CPU/IO). The client does the following,
looping as needed:

  Heads = server->heads();
  KnownCommits = Local->AllCommits();
  MissingBlobs = [];
  foreach (commit : Heads)
    if (!KnownCommits->contains(commit))
      MissingBlobs[] = commit;
  foreach (commit : KnownCommits)
    if (!commit->isValid())
      MissingBlobs[] = commit->blobs();
  if (MissingBlobs->size() > 0)
    server->FetchBlobs(MissingBlobs);

This should work efficiently for the server if a) the client is empty,
b) the client is corrupt, or c) the client is up to date. Extending
server->FetchBlobs() to be more fancy, like taking patterns such as
"between aa and dd exclusive", is an exercise for someone else.
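Read operationally, the loop above has the client fetch whatever the
server advertises that it lacks, then repeat until nothing is missing. A
runnable sketch of that client-driven pull (toy server dict and function
name are mine; the pseudocode's validity check is reduced to "do I have
the object"):

  # Toy server: advertised heads plus an object map (sha -> children).
  SERVER = {
      "heads": ["C2"],
      "objects": {"C1": ["B1"], "C2": ["B1", "B2"],
                  "B1": [], "B2": []},
  }

  def fetch_missing(local):
      """Client-driven sync: ask for unknown heads, then for anything
      they reference that is absent locally, looping as needed."""
      fetched = []
      pending = [h for h in SERVER["heads"] if h not in local]
      while pending:
          sha = pending.pop()
          local[sha] = SERVER["objects"][sha]  # one request, in effect
          fetched.append(sha)
          pending.extend(c for c in local[sha] if c not in local)
      return fetched

  local = {"C1": ["B1"], "B1": []}  # client already has C1 and B1
  got = fetch_missing(local)

This is efficient when the client is empty, corrupt, or up to date, as
claimed; Jeff's follow-up points out the cost in the middle cases: each
newly discovered layer of references needs another round trip.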
Re: Re-Transmission of blobs?
On Thu, Sep 12, 2013 at 12:35:32PM +0200, Josef Wolf wrote:

> I'm not sure I understand correctly. I see that bitmaps can be used to
> implement set operations. But how come that walking the graph requires
> a lot of CPU? Isn't it O(n)?

Yes and no. Your n there is the entirety of history. Whereas a simple
git push generally only has to look at the recent history. So even
though you are looking at each commit and tree only once, it's still a
large number of them (and each one needs to be pulled off of the disk,
decompressed, and reconstructed from deltas).

Secondly, the graph traversal ends up seeing the same sha1s over and
over again in tree entries (because most entries in the tree don't
change from commit to commit). We spend a non-trivial amount of time
looking those up in a hash table.

Just try "git rev-list --objects --all" in your favorite repository to
get a sense. It takes something like 30 seconds in the kernel repo. You
would probably not want to add 30 seconds of CPU time to a trivial push.

> Those bitmaps would be stored in the git metadata? Is it worth it?
> Storing a bitmap for every commit just to be used once in a while
> seems to be a pretty big overhead to me. Not to mention the
> interoperability problems you mentioned below.

There are tricks to make them smaller (run-length compression,
bitmapping a subset of commits and traversing to the nearest one,
storing bitmaps as deltas against nearby bitmaps). And how often it is
used depends on your git workload. For a repository serving git clones
and fetches, it speeds up every operation. Try starting a clone of:

  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

versus

  git://github.com/torvalds/linux.git

and see which one starts sending you data more quickly.

> Sounds like you're already almost done and don't really need help
> anymore. Just out of curiosity, I'd be interested in a pointer anyway
> ;-)

Shawn gave a talk on JGit here:

  http://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/files/Scaling%20Up%20JGit%20-%20EclipseCon%202013.pdf

and the scrapped patches for git are here:

  http://article.gmane.org/gmane.comp.version-control.git/228918

-Peff
Re: Re-Transmission of blobs?
On Thu, Sep 12, 2013 at 12:45:44PM +0000, Pyeron, Jason J CTR (US)
wrote:

> If the rules of engagement are changed a bit, the server side can be
> released from most of its work (CPU/IO). The client does the
> following, looping as needed:
>
>   Heads = server->heads();
>   KnownCommits = Local->AllCommits();
>   MissingBlobs = [];
>   foreach (commit : Heads)
>     if (!KnownCommits->contains(commit))
>       MissingBlobs[] = commit;
>   foreach (commit : KnownCommits)
>     if (!commit->isValid())
>       MissingBlobs[] = commit->blobs();
>   if (MissingBlobs->size() > 0)
>     server->FetchBlobs(MissingBlobs);

That doesn't quite work. The client does not know the set of missing
objects just from the commits. It knows the sha1 of the root trees it is
missing. And then if it fetches those, it knows the sha1 of any
top-level entries it is missing. And when it gets those, it knows the
sha1 of any 2nd-level entries it is missing, and so forth. You can
progressively ask for each level, but:

  1. You are spending a round-trip for each request. Doing it
     per-object is awful (the dumb http walker will do this if the repo
     is not packed, and it's S-L-O-W). Doing it per-level would be
     better, but not great.

  2. You are losing opportunities for deltas (or you are making the
     state the server needs to maintain very complicated, as it must
     remember from request to request which objects you have gotten
     that can be used as delta bases).

  3. There is a lot of overhead in this protocol. The client has to
     mention each object individually by sha1. It may not seem like a
     lot, but it can easily add 10% to a clone (just look at the size
     of the pack .idx files versus the packfiles themselves).

-Peff
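The per-level cost in point 1 can be counted directly: each layer of
tree nesting adds one request, so a path nested N directories deep costs
N+1 round trips before its blob arrives. A toy model (not the real dumb
http walker; fetch_by_level is a name I made up):

  # Toy object store: each object maps to the children it references.
  TREES = {"root": ["dir1"], "dir1": ["dir2"],
           "dir2": ["blob"], "blob": []}

  def fetch_by_level(start):
      """Fetch one level of references per round trip, as the naive
      client-driven protocol would."""
      have, frontier, round_trips = set(), [start], 0
      while frontier:
          round_trips += 1  # one request covers the whole level
          next_frontier = []
          for sha in frontier:
              have.add(sha)
              next_frontier.extend(
                  c for c in TREES[sha] if c not in have)
          frontier = next_frontier
      return have, round_trips

  have, round_trips = fetch_by_level("root")

Four objects along one nested path cost four round trips; batching
whole levels is already the optimistic case, and per-object requests
would cost one trip per object instead.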
RE: Re-Transmission of blobs?
-Original Message- From: Jeff King Sent: Thursday, September 12, 2013 3:57 PM On Thu, Sep 12, 2013 at 12:45:44PM +, Pyeron, Jason J CTR (US) wrote: If the rules of engagement are change a bit, the server side can be release from most of its work (CPU/IO). Client does the following, looping as needed: Heads=server-heads(); KnownCommits=Local-AllCommits(); Missingblobs=[]; Foreach(commit:heads) if (!knownCommits-contains(commit)) MissingBlobs[]=commit; Foreach(commit:knownCommit) if (!commit-isValid()) MissingBlobs[]=commit-blobs(); If (missingBlobs-size()0) server-FetchBlobs(missingBlobs); That doesn't quite work. The client does not know the set of missing looping as needed objects just from the commits. It knows the sha1 of the root trees it is missing. And then if it fetches those, it knows the sha1 of any top-level entries it is missing. And when it gets those, it knows the sha1 of any 2nd-level entries it is missing, and so forth. You can progressively ask for each level, but: 1. You are spending a round-trip for each request. Doing it per- object is awful (the dumb http walker will do this if the repo is not packed, and it's S-L-O-W). Doing it per-level would be better, but not great. Yes, but it is those awfully slow connections (slower that the looping issue) which happen to always drop while cloning from our office. And the round trip should be mitigated by http-keep-alives. 2. You are losing opportunities for deltas (or you are making the state the server needs to maintain very complicated, as it must remember from request to request which objects you have gotten that can be used as delta bases). But, again if the connection drops, we have already lost the delta advantage. I would think the scenario would go like this: git clone url://blah/blah [fail] cd blah git clone --resume #uses normal methods [fail] while ! git clone --resume --HitItWithAStick replace clone with fetch for that use case too 3. There is a lot of overhead in this protocol. 
> The client has to mention each object individually by sha1. It may not
> seem like a lot, but it can easily add 10% to a clone (just look at
> the size of the pack .idx files versus the packfiles themselves).

But if it finishes in a week, that is a lot better than never finishing at all. I draw attention to the time I had to download DB2 UDB in Thailand via a 28k modem. It was resumable with wget; if it had not been, it would have required the use of a plane to sneaker-net it back.
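[Editor's sketch] Jeff's round-trip and per-object-naming concerns can be seen in miniature with git plumbing commands. This is a hypothetical local setup, not the wire protocol: each `ls-tree` call stands in for one request, and every sha1 the client learns is another object id it would have to send back by name.

```shell
# Hypothetical local sketch of the level-by-level walk Jeff describes:
# each ls-tree call stands in for one round trip, and every sha1 the
# client learns is another object id it would have to name later.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name Demo

mkdir -p dir/sub
echo top  > file
echo deep > dir/sub/leaf
git add . && git commit -qm demo

root=$(git rev-parse 'HEAD^{tree}')   # "round trip" 1: the root tree id
git ls-tree "$root"                   # "round trip" 2: top-level entries
git ls-tree "$root:dir"               # "round trip" 3: one level down
git ls-tree "$root:dir/sub"           # "round trip" 4: ...and so on
```

Per-object instead of per-level would mean one such request for every single blob and tree, which is what makes the unpacked dumb-http walker so slow.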
Re: Re-Transmission of blobs?
On Tue, Sep 10, 2013 at 10:51:02 -0700, Junio C Hamano wrote:

> Consider this simple history with only a handful of commits (as usual,
> time flows from left to right):
>
>           E
>          /
> A---B---C---D
>
> where D is at the tip of the sending side, E is at the tip of the
> receiving side. The exchange goes roughly like this:
>
>     (receiving side): what do you have?
>     (sending side): my tip is at D.
>     (receiving side): D? I've never heard of it --- please give it to
>                       me. I have E.

At this point, why would the receiving side not tell all the heads it knows about? That would enable the sending side to see whether it knows any of those commits. It could then remove from the sending list all the objects that are reachable from the commits it knows about.

>     (sending side): E? I don't know about it; must be something you
>                     created since you forked from me. Tell me about
>                     its ancestors.

This is not exactly true. In the example I had given, the sending side has all three commits: C, D, and E. So the sending side has no reason to say that it doesn't know anything about E. Therefore the sending side has all the information it needs to deduce exactly which objects need to be sent to the receiving side: all the objects in C..D, minus all the objects in C..E. I guess this operation would be called a set difference in English.

And if the receiving side had said that it has heads X, Y, and Z in addition, and the sending side happens to have Y, then the sending side could additionally remove any objects reachable from Y from the sending list.

-- 
Josef Wolf j...@raven.inka.de
--
To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Re-Transmission of blobs?
Josef Wolf j...@raven.inka.de writes:

> On Tue, Sep 10, 2013 at 10:51:02 -0700, Junio C Hamano wrote:
>
> > Consider this simple history with only a handful of commits (as
> > usual, time flows from left to right):
> >
> >           E
> >          /
> > A---B---C---D
> >
> > where D is at the tip of the sending side, E is at the tip of the
> > receiving side. The exchange goes roughly like this:
> >
> >     (receiving side): what do you have?
> >     (sending side): my tip is at D.
> >     (receiving side): D? I've never heard of it --- please give it
> >                       to me. I have E.
>
> At this point, why would the receiving side not tell all the heads it
> knows about?

It did. The receiving end had only one branch, whose tip is E. It may have a tracking branch that knows where the tip of the sending end used to be when it forked (which is C), so the above may say "I have E and C". It actually would say "I have B" and "I have A" and so on for a bounded number of commits, but that does not fundamentally change the picture --- the important point is that the advertisement is bounded and there is a horizon.

> There is some work being done to optimize this further using various
> techniques, but it is not ready yet.

And this still stands.
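[Editor's sketch] The bounded "I have" advertisement Junio mentions can be observed on a real fetch: GIT_TRACE_PACKET logs the raw negotiation packets. This is a hypothetical two-repository setup purely for illustration:

```shell
# Hypothetical setup: a server repo, a clone that has diverged, and a
# fetch with packet tracing enabled to watch the "have" lines go by.
set -e
work=$(mktemp -d) && cd "$work"
git init -q server && cd server
git config user.email demo@example.com && git config user.name Demo
echo one > f && git add f && git commit -qm A
cd .. && git clone -q server client

cd server && echo two > f && git add f && git commit -qm B   # new upstream work
cd ../client
git config user.email demo@example.com && git config user.name Demo
echo mine > g && git add g && git commit -qm E               # receiver-only commit

GIT_TRACE_PACKET=1 git fetch -q origin 2>trace
grep -o 'have [0-9a-f]*' trace | head -3    # the bounded advertisement
```

The client names the tips it has (here E and its ancestors) up to a horizon; it does not, and could not cheaply, enumerate every object it holds.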
Re: Re-Transmission of blobs?
Josef Wolf j...@raven.inka.de writes:

> As we all know, files are identified by their SHA. Thus I had the
> impression that when transferring files, git would know by the SHA
> whether a given file is already available in the destination
> repository, and that transferring such a file would be of no use.

That is unfortunately not how things work. It is not like the receiving end sends the names of all the objects it has, and the sending end excludes these objects from what it is going to send.

Consider this simple history with only a handful of commits (as usual, time flows from left to right):

          E
         /
A---B---C---D

where D is at the tip of the sending side, E is at the tip of the receiving side. The exchange goes roughly like this:

    (receiving side): what do you have?
    (sending side): my tip is at D.
    (receiving side): D? I've never heard of it --- please give it to
                      me. I have E.
    (sending side): E? I don't know about it; must be something you
                    created since you forked from me. Tell me about its
                    ancestors.
    (receiving side): OK, I have C.
    (sending side): Oh, C I know about. You do not have to tell me
                    anything more. A packfile to bring you up to date
                    will follow.

At this point, the sender knows that the receiver needs the commit D and the trees and blobs in D. It also knows that the receiver has the commit C and the trees and blobs in C. It does the best thing it can do using this (and only this) information, namely, send the commit D, and send the trees and blobs in D that are not in the commit C.

You may happen to have something in E that matches what is in D but not in C. Because the sender does not know anything about E in the first place, that information cannot be used to reduce the transfer. The sender theoretically _could_ also exploit the fact that any receiver that has C must have B and A and all the trees and blobs associated with these ancestor commits [*1*], but that information is currently neither discovered nor used during the object transfer.
There may happen to be a tree or a blob in A that matches a tree or a blob in D. But because the common-ancestor discovery exchange above stops at C, the sender does not bother enumerating all the objects in the ancestor commits of C when figuring out what objects to send to ensure that the receiving end has everything necessary to complete D. If you modified a blob at B (or C) and then resurrected the old version of the blob at D, it is likely that the blob is going to be sent again when the receiving end asks for D.

There is some work being done to optimize this further using various techniques, but it is not ready yet.

[Footnote]

*1* Only down to the shallow boundary, if the receiving end is a shallow clone.
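[Editor's sketch] The resurrected-blob case is easy to reproduce. In this hypothetical scratch repo, a blob introduced at A and brought back at D shows up in the C..D object list even though any receiver that has C has held it since A:

```shell
# Hypothetical demo: blob "v1" enters at A, is replaced at B and C, and
# is resurrected at D. Because ancestor discovery stopped at C, the
# sender lists (and would re-send) the v1 blob.
set -e
work=$(mktemp -d) && cd "$work"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name Demo

echo v1 > file && git add file && git commit -qm A
echo v2 > file && git add file && git commit -qm B
echo v3 > file && git add file && git commit -qm C
git tag C
echo v1 > file && git add file && git commit -qm D   # old content again
git tag D

old_blob=$(git rev-parse C~2:file)        # the v1 blob, in history since A
git rev-list --objects D --not C | grep "$old_blob"   # still listed for C..D
```

Only C's own tree is marked as "the receiver has this"; the trees of B and A are never walked, so nothing stops the v1 blob from being packed again.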