Re: Re-Transmission of blobs?

2013-09-24 Thread Jeff King
On Fri, Sep 20, 2013 at 11:27:15AM +0200, Josef Wolf wrote:

  Yes. If you know that the receiver has commit X, and you want to know if
  it has some blob Y, the only way to know for sure is to look at every
  tree of every commit reachable from X, and see whether any of them
  references Y.
 
 Jeff, in my original example, I did a cherry-pick from origin/somebranch.

Sorry, I thought we were talking about the general case, not your
specific example.

 Even without asking, we can assume with great probability that
 origin/somebranch is available at origin.

Bear in mind that the transfer process does not know about
cherry-picking at all. It only sees the other side's tips and traverses.

 And the file in question happens to reside in the tree at the very tip
 of origin/somebranch, not in some of its ancestors. In this case,
 there's no need to search the history at all. And even in this pretty
 simple case, the algorithm seems to fail for some reason.

Correct. And in the current code, we should be looking at the tip tree
for your case.  However, the usual reason to do so is to mark those
objects as a preferred base in pack-objects for doing deltas. I wonder
if we are not correctly noticing the case that an object is both
requested to be sent and marked as a preferred base (in which case we
should drop it from our sending list).

If that's the problem, it should be easy to fix cheaply. It would not
work in the general case, but it would for your specific example. But
since it costs nothing, there's no reason not to.

I'll see if I can investigate using the example script you posted.

  Yes, you do not have to recurse into sub-trees (or commits) you have
  already seen. And we already do that optimization.
 
 Why is the file re-transmitted, then?

I meant we do the optimization during history traversal that avoids
going into sub-trees we have already seen. We do _not_ do the full
history traversal for a partial push.

 With a little change in the protocol, a very simple optimization could be
 implemented, avoiding the complicated bitmap strategy you were talking about:
 [...]
 In the last step, instead of sending a packfile, the sending side should send
 a list of the SHA's which would be included in this packfile. The receiving
 side would then be able to request all the objects it needs to get up-to-date.

I think you have a mis-impression of the problem bitmaps are trying to
solve, mostly because it is not the problem you presented in your
thread (though your problem is one that bitmaps can help with).

Consider what the sending side has to do to come up with that list of
objects to send. It has to traverse history to do it, looking at each
tree of each commit that is going to be sent. That effort is
proportional to the amount of history we are going to send. For a small
push or fetch, it is not much. For a clone, it can be quite a lot (tens
of seconds of CPU time per clone). Bitmaps drastically reduce the amount
of CPU required.

Bitmaps can _also_ solve other problems, like letting us be more
thorough in realizing which objects the other side has (without spending
effort on an expensive traversal). If that were the only thing they did,
it might not be worth it. But we basically get that for free by
solving the other problem.

So I do not think such a protocol extension is an argument against
pack bitmaps; you would want them with or without the protocol change.

 I think this change would be considerably simpler than the reachability bitmap
 you are talking about. And it would avoid all those time consuming traversals
 through the history and the tree. And it would omit _all_ redundant
 retransmissions. Even in the case when sender and receiver do not have _any_
 common heads at all, _no_ files at all would be retransmitted unnecessarily.

Yes, that would be nice. However, in the common cases it would make
things much worse. A clone of linux.git has ~3.5M objects. That's 70MB
of sha1s for the server to tell the client "tell me which of these you
need", and then another 70MB for the client to say "yep, I need all of
them".  You could have the client instead say "here are the few I
_don't_ need", which would save the second half in the common cases.
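
To put numbers on that figure (back-of-the-envelope arithmetic, assuming
20-byte binary SHA-1s; a hex representation would double it):

```python
# Cost of naming every object in a linux.git clone individually.
objects = 3_500_000        # rough object count of a linux.git clone
sha1_bytes = 20            # binary SHA-1; 40 bytes if sent as hex
mb = objects * sha1_bytes / 1e6
print(mb)                  # 70.0 -- megabytes, in *each* direction
```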

And of course it would be smaller for a smaller fetch/push. Just looking
at git rev-list --objects --all --since=1.week.ago in the kernel,
there are ~77K new objects. So over the course of a week, we use an
extra 1.5MB of bandwidth. How many objects did we save ourselves from
sending, and how big were they (keep in mind they will typically be
deltas against objects you already have anyway)?

The answer would depend on your cherry-picking and reverting habits. But
I would guess in a normal workflow that you would not come close to
breaking even.

There are, of course, cases where the user _knows_ there is a huge
object that the other side has, and they do not want to have to send it
again. For that case, it would be useful to have something like the
protocol you described.

Re: Re-Transmission of blobs?

2013-09-24 Thread Josef Wolf
On Tue, Sep 24, 2013 at 03:36:13AM -0400, Jeff King wrote:
 On Fri, Sep 20, 2013 at 11:27:15AM +0200, Josef Wolf wrote:

  Even without asking, we can assume with great probability that
  origin/somebranch is available at origin.
 Bear in mind that the transfer process does not know about
 cherry-picking at all.

It doesn't need to know.

 It only sees the other side's tips and traverses.

The sender side knows with high probability that origin/somebranch is available
at the receiving side (unless it was deleted). And since the file in question
is part of the tree at the tip of origin/somebranch, we can deduce that the
file is available on the other side (unless it was deleted).

  And the file in question happens to reside in the tree at the very tip
  of origin/somebranch, not in some of its ancestors. In this case,
  there's no need to search the history at all. And even in this pretty
  simple case, the algorithm seems to fail for some reason.
 
 Correct. And in the current code, we should be looking at the tip tree
 for your case.  However, the usual reason to do so is to mark those
 objects as a preferred base in pack-objects for doing deltas. I wonder
 if we are not correctly noticing the case that an object is both
 requested to be sent and marked as a preferred base (in which case we
 should drop it from our sending list).

Further, it seems that marking the object as a preferred base had no effect,
since the delta should have been zero in this case. Or is this mechanism
deactivated for binary data (/dev/zero in this case)?

 If that's the problem, it should be easy to fix cheaply. It would not
 work in the general case, but it would for your specific example. But
 since it costs nothing, there's no reason not to.
 
 I'll see if I can investigate using the example script you posted.

Thanks!

 I meant we do the optimization during history traversal that avoids
 going into sub-trees we have already seen. We do _not_ do the full
 history traversal for a partial push.

OK. I see. Maybe a config option to request a full traversal would be a
reasonable compromise? That way CPU could be traded against bandwidth for
repositories that happen to have slow/unreliable/expensive connections.

 Yes, that would be nice. However, in the common cases it would make
 things much worse. A clone of linux.git has ~3.5M objects.

Of course, if there's nothing you can drop, any attempt to drop objects will
add overhead. That's similar to compressing already-compressed files, which
enlarges them. Would that be a reasonable argument to get rid of all attempts
to compress files?

-- 
Josef Wolf
j...@raven.inka.de
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re-Transmission of blobs?

2013-09-20 Thread Josef Wolf
On Mon, Sep 16, 2013 at 05:55:36PM -0400, Jeff King wrote:
 On Fri, Sep 13, 2013 at 12:09:35PM +0200, Josef Wolf wrote:
 
I'm not sure I understand correctly. I see that bitmaps can be used to
implement set operations. But how come walking the graph requires a lot
of CPU? Isn't it O(n)?
   
   Yes and no. Your n there is the entirety of history.
  
  Is this really true?
 
 Yes. If you know that the receiver has commit X, and you want to know if
 it has some blob Y, the only way to know for sure is to look at every
 tree of every commit reachable from X, and see whether any of them
 references Y.

Jeff, in my original example, I did a cherry-pick from origin/somebranch.
Even without asking, we can assume with great probability that
origin/somebranch is available at origin. And the file in question happens to
reside in the tree at the very tip of origin/somebranch, not in some of its
ancestors. In this case, there's no need to search the history at all. And
even in this pretty simple case, the algorithm seems to fail for some reason.

 Yes, you do not have to recurse into sub-trees (or commits) you have
 already seen. And we already do that optimization.

Why is the file re-transmitted, then?


With a little change in the protocol, a very simple optimization could be
implemented, avoiding the complicated bitmap strategy you were talking about:

Please consider Junio's description of the dialog:

[ Junio wrote: ]
 Consider this simple history with only a handful of commits (as
 usual, time flows from left to right):

          E
         /
A---B---C---D

 where D is at the tip of the sending side, E is at the tip of the
 receiving side.  The exchange goes roughly like this:

(receiving side): what do you have?

(sending side): my tip is at D.

(receiving side): D?  I've never heard of it --- please give it
  to me.  I have E.

(sending side): E?  I don't know about it; must be something you
created since you forked from me.  Tell me about
its ancestors.

(receiving side): OK, I have C.

(sending side): Oh, C I know about. You do not have to tell me
anything more.  A packfile to bring you up to
date will follow.

In the last step, instead of sending a packfile, the sending side should send
a list of the SHA's which would be included in this packfile. The receiving
side would then be able to request all the objects it needs to get up-to-date.
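
A toy sketch of that proposed last step (purely illustrative; this is not
git's actual protocol, and the object names are made up):

```python
# The sender advertises the SHAs it would have packed; the receiver
# answers with the subset it does not already have.
def negotiate(sender_would_pack, receiver_has):
    return [sha for sha in sender_would_pack if sha not in receiver_has]

# The receiver already got "blob-big" via an earlier cherry-pick, so it
# is not re-transmitted.
print(negotiate(["commit-D", "tree-D", "blob-big"], {"blob-big"}))
```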

I think this change would be considerably simpler than the reachability bitmap
you are talking about. And it would avoid all those time consuming traversals
through the history and the tree. And it would omit _all_ redundant
retransmissions. Even in the case when sender and receiver do not have _any_
common heads at all, _no_ files at all would be retransmitted unnecessarily.

-- 
Josef Wolf
j...@raven.inka.de


Re: Re-Transmission of blobs?

2013-09-16 Thread Jeff King
On Fri, Sep 13, 2013 at 12:09:35PM +0200, Josef Wolf wrote:

   I'm not sure I understand correctly. I see that bitmaps can be used to
   implement set operations. But how come walking the graph requires a lot
   of CPU? Isn't it O(n)?
  
  Yes and no. Your n there is the entirety of history.
 
 Is this really true?

Yes. If you know that the receiver has commit X, and you want to know if
it has some blob Y, the only way to know for sure is to look at every
tree of every commit reachable from X, and see whether any of them
references Y. You might get lucky and see that one of the first commits
you looked at mentions Y, but in the negative case, you have to go all
the way down to the roots.

  (and each one needs to be pulled off of the disk,
  decompressed, and reconstructed from deltas).
 
 While you need to unpack commits/trees to traverse further down, I can't see
 any reason to unpack/reconstruct blobs just to see whether you need to send
 it. The SHA is all you need to know, isn't it?

Correct. The full sentence that you partially quoted above was:

   So even though you are looking at each commit and tree only once,
   it's still a large number of them (and each one needs to be pulled
   off of the disk, decompressed, and reconstructed from deltas).

I.e., the "each one" is commits and trees.  Even reading just them
takes a fair bit of time. Pulling each blob off of disk, too, takes even
longer. You can try it yourself like this:

  git rev-list --objects --all |
    cut -d' ' -f1 |
    git cat-file --batch-check >all-objects
  for i in commit tree blob; do
    grep $i all-objects | cut -d' ' -f1 >$i-objects
    echo >&2 "== $i"
    time git cat-file --batch <$i-objects >/dev/null
  done

For git.git, commits take about 0.5 seconds on my machine, trees 1
second, and blobs 13 seconds. For the kernel, it's 5, 22, and 210
seconds, respectively.

Now those are times to actually cat the content to /dev/null. Just
looking at it internally is cheaper, but it gives you a ballpark figure
(and most of that time goes to zlib inflation, which is the same either
way).

  Secondly, the graph traversal ends up seeing the same sha1s over and
  over again in tree entries (because most entries in the tree don't
  change from commit to commit).
 
 Whenever you see an object (whether commit or tree) that you already have
 seen, you can stop traversing further down this part of the graph/tree, as
 everything you will see on this part has already been seen before.
 
 Why would you see the same commits/trees over and over again? You'd stop
 traversing on the boundary of the already-seen-territory, leaving the vast
 majority of the duplicated structure under the carpet. Somehow I fail to see
 the problem here.

Yes, you do not have to recurse into sub-trees (or commits) you have
already seen. And we already do that optimization.  So you do not see
the whole recursive tree over and over, but you see almost the same
single-level trees repeatedly.

Let me try to give an example.  Here's the root tree of git.git's v1.8.4
release:

  $ git ls-tree v1.8.4 | wc -l
  361

So we have to do 361 lookups, one per entry, to find that we haven't
yet processed each one.

Now what happens when we look at the next commit?

  $ git ls-tree v1.8.4^ | wc -l
  361
  $ git diff-tree --abbrev v1.8.4 v1.8.4^
  :040000 040000 f3aec4c... a6e780e... M  Documentation
  :100755 100755 06026ea... 572dfeb... M  GIT-VERSION-GEN

Still 361 entries, but only two are changed. Yet we still have
to go through all 361 to figure out _which_ were changed.

We can do that by going linearly through the tree and checking each sha1
against a "have we seen this?" data structure. Or we could diff
on-the-fly between adjacent trees, and only process those that we know
we didn't just see.

The current code uses the "seen this" strategy with a hash table. I've
tried the diff strategy, but I couldn't make it any faster than using
the hash table. If we had a tree storage format that made diffs cheap
(like packv4), then traversing via diffs would probably be a win.
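
A toy model of the "seen this" strategy (illustrative only, not git's
actual code; the tree ids are hypothetical): each root-level entry still
costs one lookup, but recursion stops at any sub-tree already seen.

```python
def walk(tree, trees, seen):
    """Collect objects not seen before; skip already-seen sub-trees."""
    new = []
    for sha in trees[tree]:
        if sha in seen:            # boundary of already-seen territory:
            continue               # do not recurse below it
        seen.add(sha)
        new.append(sha)
        if sha in trees:           # entry is itself a tree: recurse
            new += walk(sha, trees, seen)
    return new

# Two adjacent root trees sharing most of their entries.
trees = {
    "root-new": ["Documentation-new", "GIT-VERSION-GEN-new", "README"],
    "root-old": ["Documentation-old", "GIT-VERSION-GEN-old", "README"],
    "Documentation-new": ["doc-blob-new"],
    "Documentation-old": ["doc-blob-old"],
}
seen = set()
walk("root-new", trees, seen)         # first commit: everything is new
print(walk("root-old", trees, seen))  # second: only the changed entries
```

Note that the unchanged README still costs a lookup in `seen` even though
nothing is emitted for it, which is exactly the per-entry cost described
above.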

Note also that the cost of traversing is dependent on the shape of the
tree. Putting all of your files in the root directory does not perform
as well as having a nicely balanced tree structure, because we can't
weed out as many entries by noticing the whole sub-tree is unchanged.

-Peff


Re: Re-Transmission of blobs?

2013-09-13 Thread Josef Wolf
On Thu, Sep 12, 2013 at 03:44:53PM -0400, Jeff King wrote:
 On Thu, Sep 12, 2013 at 12:35:32PM +0200, Josef Wolf wrote:
 
  I'm not sure I understand correctly. I see that bitmaps can be used to
  implement set operations. But how come walking the graph requires a lot
  of CPU? Isn't it O(n)?
 
 Yes and no. Your n there is the entirety of history.

Is this really true?

 (and each one needs to be pulled off of the disk,
 decompressed, and reconstructed from deltas).

While you need to unpack commits/trees to traverse further down, I can't see
any reason to unpack/reconstruct blobs just to see whether you need to send
it. The SHA is all you need to know, isn't it?

 Secondly, the graph traversal ends up seeing the same sha1s over and
 over again in tree entries (because most entries in the tree don't
 change from commit to commit).

Whenever you see an object (whether commit or tree) that you already have
seen, you can stop traversing further down this part of the graph/tree, as
everything you will see on this part has already been seen before.

Why would you see the same commits/trees over and over again? You'd stop
traversing on the boundary of the already-seen-territory, leaving the vast
majority of the duplicated structure under the carpet. Somehow I fail to see
the problem here.

-- 
Josef Wolf
j...@raven.inka.de


Re: Re-Transmission of blobs?

2013-09-13 Thread Josef Wolf
On Thu, Sep 12, 2013 at 08:06:35PM +0000, Pyeron, Jason J CTR (US) wrote:

 Yes, but it is those awfully slow connections (slower than the looping
 issue) which happen to always drop while cloning from our office. And the
 round trip should be mitigated by http-keep-alives.
[ ... ]
 But, again if the connection drops, we have already lost the delta
 advantage. I would think the scenario would go like this:
 
 git clone url://blah/blah
 [fail]
 cd blah
 git clone --resume #uses normal methods
 [fail]
 while ! git clone --resume --HitItWithAStick
 
 replace clone with fetch for that use case too

Last time I checked, cloning could not be resumed:

http://git.661346.n2.nabble.com/git-clone-and-unreliable-links-td7570652.html

If you're on a slow/unreliable link, you've lost.

:-( :-( :-(

-- 
Josef Wolf
j...@raven.inka.de


RE: Re-Transmission of blobs?

2013-09-13 Thread Jason Pyeron
 -Original Message-
 From: Josef Wolf
 Sent: Friday, September 13, 2013 6:23
 
 On Thu, Sep 12, 2013 at 08:06:35PM +0000, Pyeron, Jason J CTR (US) wrote:
 
  Yes, but it is those awfully slow connections (slower than the looping
  issue) which happen to always drop while cloning from our office. And
  the round trip should be mitigated by http-keep-alives.
 [ ... ]
  But, again if the connection drops, we have already lost the delta 
  advantage. I would think the scenario would go like this:
  
  git clone url://blah/blah
  [fail]
  cd blah
  git clone --resume #uses normal methods

I am using the mythical --resume, where it would fetch packs and indexes.

  [fail]
  while ! git clone --resume --HitItWithAStick
  
  replace clone with fetch for that use case too
 
 Last time I checked, cloning could not be resumed:
 
 http://git.661346.n2.nabble.com/git-clone-and-unreliable-links-td7570652.html
 
 If you're on a slow/unreliable link, you've lost.


That is kind of the point. It should be possible.


--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
-   -
- Jason Pyeron  PD Inc. http://www.pdinc.us -
- Principal Consultant  10 West 24th Street #100-
- +1 (443) 269-1555 x333Baltimore, Maryland 21218   -
-   -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.

 



Re: Re-Transmission of blobs?

2013-09-13 Thread Duy Nguyen
On Fri, Sep 13, 2013 at 3:06 AM, Pyeron, Jason J CTR (US)
<jason.j.pyeron@mail.mil> wrote:
 But, again if the connection drops, we have already lost the delta advantage. 
 I would think the scenario would go like this:

 git clone url://blah/blah
 [fail]
 cd blah
 git clone --resume #uses normal methods
 [fail]
 while ! git clone --resume --HitItWithAStick

 replace clone with fetch for that use case too

Sorry if I missed something in this thread. But I think we could
stabilize the transferred pack so that --resume works. The sender
constructs exactly the same pack as in the first "git clone", then it
starts sending from the offset given by the client. For that to work,
the first "git clone" must also be "git clone --resume". I started
working on that but now my focus is pack v4, so that has to wait.
-- 
Duy


Re: Re-Transmission of blobs?

2013-09-12 Thread Josef Wolf
On Wed, Sep 11, 2013 at 10:14:54 -0700, Junio C Hamano wrote:
 Josef Wolf <j...@raven.inka.de> writes:
  On Tue, Sep 10, 2013 at 10:51:02 -0700, Junio C Hamano wrote:
  Consider this simple history with only a handful of commits (as
  usual, time flows from left to right):
  
            E
           /
  A---B---C---D
  
  where D is at the tip of the sending side, E is at the tip of the
  receiving side.  The exchange goes roughly like this:
  
  (receiving side): what do you have?
  
  (sending side): my tip is at D.
  
  (receiving side): D?  I've never heard of it --- please give it
to me.  I have E.
 
  At this point, why would the receiving side not tell all the heads it knows
  about?
 
 It did.  The receiving end had only one branch whose tip is E.  It
 may have a tracking branch that knows where the tip of the sending
 end used to be when it forked (which is C), so the above may say "I
 have E and C".  It actually would say "I have B and A and ..." for a
 bounded number of commits, but that does not fundamentally change
 the picture---the important point is it is bounded and there is a
 horizon.

Therefore, the sending side has all the information it needs to do any
optimizations you can think of...

  There are some work being done to optimize this further using
  various techniques, but they are not ready yet.
 
 And this still stands.

Do you have a pointer or something? I'd like to check out whether I can
contribute to this work.

-- 
Josef Wolf
j...@raven.inka.de


Re: Re-Transmission of blobs?

2013-09-12 Thread Jeff King
On Thu, Sep 12, 2013 at 09:42:41AM +0200, Josef Wolf wrote:

   There are some work being done to optimize this further using
   various techniques, but they are not ready yet.
  
  And this still stands.
 
 Do you have a pointer or something? I'd like to check out whether I can
 contribute to this work.

I think Junio is referring to the reachability bitmap work. We may know
that the other side has commit E (and therefore every object reachable
from it), but we do not walk the graph to find the complete set of
reachable objects. Doing so requires a lot of CPU and I/O, and in most
cases does not help much.

However, if we had an index of reachable objects (e.g., a bitmap) for
each commit, then we could very cheaply compute the set difference
between what the other side wants and what they have.
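
The set difference then becomes a cheap bitwise operation. A minimal
sketch using Python ints as bitmaps (a hypothetical 8-object repository;
real bitmaps are compressed, e.g. with EWAH):

```python
# Bit i set means "object i is reachable from that commit".
have = 0b00101111       # objects reachable from the tip the client has
want = 0b11101111       # objects reachable from the tip the client wants
to_send = want & ~have  # "want minus have", no graph walk required
print(bin(to_send))     # only the two newest objects need sending
```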

JGit has support for pack bitmaps already. There was a patch series a
few months ago to implement a similar functionality for C git, but the
on-disk format was not compatible with JGit's. That series has been
reworked off-list to be compatible with the JGit implementation.

Those patches need a little cleanup before they are ready for the list,
but hopefully that should happen soon-ish.

-Peff


Re: Re-Transmission of blobs?

2013-09-12 Thread Josef Wolf
On Thu, Sep 12, 2013 at 05:23:40 -0400, Jeff King wrote:
 On Thu, Sep 12, 2013 at 09:42:41AM +0200, Josef Wolf wrote:

 I think Junio is referring to the reachability bitmap work. We may know
 that the other side has commit E (and therefore every object reachable
 from it), but we do not walk the graph to find the complete set of
 reachable objects. Doing so requires a lot of CPU and I/O, and in most
 cases does not help much.

I'm not sure I understand correctly. I see that bitmaps can be used to
implement set operations. But how come walking the graph requires a lot
of CPU? Isn't it O(n)?

 However, if we had an index of reachable objects (e.g., a bitmap) for
 each commit, then we could very cheaply compute the set difference
 between what the other side wants and what they have.

Those bitmaps would be stored in the git metadata? Is it worth it? Storing a
bitmap for every commit just to be used once-in-a-while seems to be a pretty
big overhead to me. Not to mention the interoperability problems you mentioned
below.

 JGit has support for pack bitmaps already. There was a patch series a
 few months ago to implement a similar functionality for C git, but the
 on-disk format was not compatible with JGit's. That series has been
 reworked off-list to be compatible with the JGit implementation.
 
 Those patches need a little cleanup before they are ready for the list,
 but hopefully that should happen soon-ish.

Sounds like you're already almost done and don't really need help
anymore. Just out of curiosity, I'd be interested in a pointer anyway ;-)

-- 
Josef Wolf
j...@raven.inka.de


RE: Re-Transmission of blobs?

2013-09-12 Thread Pyeron, Jason J CTR (US)
 -Original Message-
 From: Jeff King
 Sent: Thursday, September 12, 2013 5:24 AM
 
 On Thu, Sep 12, 2013 at 09:42:41AM +0200, Josef Wolf wrote:
 
There are some work being done to optimize this further using
various techniques, but they are not ready yet.
  
   And this still stands.
 
  Do you have a pointer or something? I'd like to check out whether I
 can
  contribute to this work.
 
 I think Junio is referring to the reachability bitmap work. We may know
 that the other side has commit E (and therefore every object
 reachable
 from it), but we do not walk the graph to find the complete set of
 reachable objects. Doing so requires a lot of CPU and I/O, and in most
 cases does not help much.

If the rules of engagement are changed a bit, the server side can be released
from most of its work (CPU/IO).

Client does the following, looping as needed:

Heads = server->heads();
KnownCommits = Local->AllCommits();
MissingBlobs = [];
Foreach (commit : Heads) if (!KnownCommits->contains(commit)) MissingBlobs[] = commit;
Foreach (commit : KnownCommits) if (!commit->isValid()) MissingBlobs[] = commit->blobs();
If (MissingBlobs->size() > 0) server->FetchBlobs(MissingBlobs);


This should work efficiently for the server if
a) the client is empty
b) the client is corrupt
c) the client is up to date

Extending the server->fetchBlobs() to be more fancy, like taking patterns, such
as "between aa and dd exclusive", is an exercise for someone else.






Re: Re-Transmission of blobs?

2013-09-12 Thread Jeff King
On Thu, Sep 12, 2013 at 12:35:32PM +0200, Josef Wolf wrote:

 I'm not sure I understand correctly. I see that bitmaps can be used to
 implement set operations. But how come walking the graph requires a lot
 of CPU? Isn't it O(n)?

Yes and no. Your n there is the entirety of history. Whereas a simple
git push generally only has to look at the recent history. So even
though you are looking at each commit and tree only once, it's still a
large number of them (and each one needs to be pulled off of the disk,
decompressed, and reconstructed from deltas).

Secondly, the graph traversal ends up seeing the same sha1s over and
over again in tree entries (because most entries in the tree don't
change from commit to commit). We spend a non-trivial amount of time
looking those up in a hash table.

Just try "git rev-list --objects --all" in your favorite repository to
get a sense. It takes something like 30 seconds in the kernel repo. You
would probably not want to add 30 seconds of CPU time to a trivial push.

 Those bitmaps would be stored in the git metadata? Is it worth it? Storing a
 bitmap for every commit just to be used once-in-a-while seems to be a pretty
 big overhead to me. Not to mention the interoperability problems you mentioned
 below.

There are tricks to make them smaller (run-length compression,
bitmapping a subset of commits and traversing to the nearest one,
storing bitmaps as deltas against nearby bitmaps). And how often it is
used depends on your git workload. For a repository serving git clones
and fetches, it speeds up every operation.
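
For intuition, here is a toy version of one of those tricks, run-length
compression (a sketch only; real implementations use word-aligned schemes
such as EWAH rather than per-character runs):

```python
from itertools import groupby

def rle(bits):
    """Collapse a bit string into (bit, run_length) pairs."""
    return [(b, len(list(g))) for b, g in groupby(bits)]

# Reachability bitmaps are dominated by long runs of 1s (old history
# that everyone has), so run-length compression shrinks them a lot.
print(rle("1111111111110001"))  # [('1', 12), ('0', 3), ('1', 1)]
```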

Try starting a clone of:

  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

versus

  git://github.com/torvalds/linux.git

and see which one starts sending you data more quickly.

 Sounds like you're already almost done and don't really need help
 anymore. Just out of curiosity, I'd be interested in a pointer anyway
 ;-)

Shawn gave a talk on JGit here:

  
http://www.eclipsecon.org/2013/sites/eclipsecon.org.2013/files/Scaling%20Up%20JGit%20-%20EclipseCon%202013.pdf

and the scrapped patches for git are here:

  http://article.gmane.org/gmane.comp.version-control.git/228918

-Peff


Re: Re-Transmission of blobs?

2013-09-12 Thread Jeff King
On Thu, Sep 12, 2013 at 12:45:44PM +0000, Pyeron, Jason J CTR (US) wrote:

 If the rules of engagement are changed a bit, the server side can be released
 from most of its work (CPU/IO).
 
 Client does the following, looping as needed:
 
 Heads = server->heads();
 KnownCommits = Local->AllCommits();
 MissingBlobs = [];
 Foreach (commit : Heads) if (!KnownCommits->contains(commit)) MissingBlobs[] = commit;
 Foreach (commit : KnownCommits) if (!commit->isValid()) MissingBlobs[] = commit->blobs();
 If (MissingBlobs->size() > 0) server->FetchBlobs(MissingBlobs);

That doesn't quite work. The client does not know the set of missing
objects just from the commits. It knows the sha1 of the root trees it is
missing. And then if it fetches those, it knows the sha1 of any
top-level entries it is missing. And when it gets those, it knows the
sha1 of any 2nd-level entries it is missing, and so forth.

You can progressively ask for each level, but:

  1. You are spending a round-trip for each request. Doing it per-object
 is awful (the dumb http walker will do this if the repo is not
 packed, and it's S-L-O-W). Doing it per-level would be better, but
 not great.

  2. You are losing opportunities for deltas (or you are making the
 state the server needs to maintain very complicated, as it must
 remember from request to request which objects you have gotten that
 can be used as delta bases).

  3. There is a lot of overhead in this protocol. The client has to
 mention each object individually by sha1. It may not seem like a
 lot, but it can easily add 10% to a clone (just look at the size of
 the pack .idx files versus the packfiles themselves).

-Peff


RE: Re-Transmission of blobs?

2013-09-12 Thread Pyeron, Jason J CTR (US)
 -Original Message-
 From: Jeff King
 Sent: Thursday, September 12, 2013 3:57 PM
 
 On Thu, Sep 12, 2013 at 12:45:44PM +0000, Pyeron, Jason J CTR (US)
 wrote:
 
  If the rules of engagement are change a bit, the server side can be
 release from most of its work (CPU/IO).
 
  Client does the following, looping as needed:
 
  Heads=server-heads();
  KnownCommits=Local-AllCommits();
  Missingblobs=[];
  Foreach(commit:heads) if (!knownCommits-contains(commit))
 MissingBlobs[]=commit;
  Foreach(commit:knownCommit) if (!commit-isValid())
 MissingBlobs[]=commit-blobs();
  If (missingBlobs-size()0) server-FetchBlobs(missingBlobs);
 
 That doesn't quite work. The client does not know the set of missing

looping as needed

 objects just from the commits. It knows the sha1 of the root trees it
 is
 missing. And then if it fetches those, it knows the sha1 of any
 top-level entries it is missing. And when it gets those, it knows the
 sha1 of any 2nd-level entries it is missing, and so forth.
 
 You can progressively ask for each level, but:
 
    1. You are spending a round-trip for each request. Doing it
  per-object is awful (the dumb http walker will do this if the repo
  is not packed, and it's S-L-O-W). Doing it per-level would be
  better, but not great.

Yes, but it is those awfully slow connections (slower than the looping
issue) which always happen to drop while cloning from our office. And the
round trips should be mitigated by HTTP keep-alives.

 
    2. You are losing opportunities for deltas (or you are making the
  state the server needs to maintain very complicated, as it must
  remember from request to request which objects you have gotten
  that can be used as delta bases).

But, again, if the connection drops, we have already lost the delta
advantage. I would think the scenario would go like this:

git clone url://blah/blah
[fail]
cd blah
git clone --resume #uses normal methods
[fail]
while ! git clone --resume --HitItWithAStick

replace clone with fetch for that use case too

 
    3. There is a lot of overhead in this protocol. The client has to
  mention each object individually by sha1. It may not seem like a
  lot, but it can easily add 10% to a clone (just look at the size
  of the pack .idx files versus the packfiles themselves).

But if it finishes in a week, that is a lot better than never finishing at all.

I draw attention to the time I had to download DB2 UDB in Thailand via a 28k
modem. It was resumable with wget; if it were not, it would have required the
use of a plane to sneaker-net it back.






Re: Re-Transmission of blobs?

2013-09-11 Thread Josef Wolf
On Tue, Sep 10, 2013 at 10:51:02 -0700, Junio C Hamano wrote:
 Consider this simple history with only a handful of commits (as
 usual, time flows from left to right):
 
            E
           /
  A---B---C---D
 
 where D is at the tip of the sending side, E is at the tip of the
 receiving side.  The exchange goes roughly like this:
 
 (receiving side): what do you have?
 
 (sending side): my tip is at D.
 
 (receiving side): D?  I've never heard of it --- please give it
   to me.  I have E.

At this point, why would the receiving side not tell all the heads it knows
about? That would enable the sending side to see whether it knows any of those
commits. It then would be able to remove from the sending list all the objects
that can be reached from the commits it knows about.

 (sending side): E?  I don't know about it; must be something you
 created since you forked from me.  Tell me about
 its ancestors.

This is not exactly true. In the example I had given, the sending side has all
three commits: C, D and E. So the sending side has no reason to say that it
doesn't know anything about E. Therefore the sending side has all information
it needs to deduce exactly which objects need to be sent to the receiving side.

What needs to be sent are all the objects in C..D, minus all the objects
in C..E. I guess this operation would be called a set difference in English.

And if the receiving side had additionally said that it has heads X, Y, and
Z, and the sending side happens to have Y, then the sending side could also
remove from the sending list any objects that can be reached from Y.
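The set-difference idea can be illustrated with a toy reachability walk. This sketches the *proposed* behavior, assuming the sender knows both tips, which is not what git's negotiation currently does; `parents` and `objects_of` are made-up structures (commit parents, and the blobs in each commit's tree).

```python
def reachable(tip, parents, objects_of):
    """Commits reachable from `tip`, plus the blobs in their trees."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        seen |= objects_of[commit]           # blobs in this commit's tree
        stack.extend(parents[commit])        # walk toward the roots
    return seen

# Toy history A---B---C---D with E forked off C, as in the example.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["C"]}
objects_of = {"A": {"blob1"}, "B": {"blob2"}, "C": {"blob2"},
              "D": {"blob1", "blob3"}, "E": {"blob1"}}

# Set difference: blob1, which D shares with E, never enters the list.
to_send = reachable("D", parents, objects_of) - reachable("E", parents, objects_of)
```

Only the commit D and the one genuinely new blob remain in `to_send`; everything reachable from E is excluded.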

-- 
Josef Wolf
j...@raven.inka.de


Re: Re-Transmission of blobs?

2013-09-11 Thread Junio C Hamano
Josef Wolf j...@raven.inka.de writes:

 On Tue, Sep 10, 2013 at 10:51:02 -0700, Junio C Hamano wrote:
 Consider this simple history with only a handful of commits (as
 usual, time flows from left to right):
 
            E
           /
  A---B---C---D
 
 where D is at the tip of the sending side, E is at the tip of the
 receiving side.  The exchange goes roughly like this:
 
 (receiving side): what do you have?
 
 (sending side): my tip is at D.
 
 (receiving side): D?  I've never heard of it --- please give it
   to me.  I have E.

 At this point, why would the receiving side not tell all the heads it knows
 about?

It did.  The receiving end had only one branch whose tip is E.  It
may have a tracking branch that knows where the tip of the sending
end used to be when it forked (which is C), so the above may say I
have E and C.  It actually would say I have B and A and ... for a
bounded number of commits, but that does not fundamentally change
the picture---the important point is it is bounded and there is a
horizon.

  There is some work being done to optimize this further using
  various techniques, but none of it is ready yet.

And this still stands.



Re: Re-Transmission of blobs?

2013-09-10 Thread Junio C Hamano
Josef Wolf j...@raven.inka.de writes:

 as we all know, files are identified by their SHA. Thus I had the impression
 that when transferring files, git would know by the SHA whether a given file
 is already available in the destination repository, making a re-transfer
 unnecessary.

That is unfortunately not how things work.  It is not like the
receiving end sends the names of all objects it has, and the sending
end excludes these objects from what it is going to send.

Consider this simple history with only a handful of commits (as
usual, time flows from left to right):

          E
         /
A---B---C---D

where D is at the tip of the sending side, E is at the tip of the
receiving side.  The exchange goes roughly like this:

(receiving side): what do you have?

(sending side): my tip is at D.

(receiving side): D?  I've never heard of it --- please give it
  to me.  I have E.

(sending side): E?  I don't know about it; must be something you
created since you forked from me.  Tell me about
its ancestors.

(receiving side): OK, I have C.

(sending side): Oh, C I know about. You do not have to tell me
anything more.  A packfile to bring you up to
date will follow.

At this point, the sender knows that the receiver needs the commit
D, and the trees and blobs in D.  It also knows that the receiver has
the commit C and the trees and blobs in C.  It does the best thing it
can using this (and only this) information, namely, it sends the
commit D, and the trees and blobs in D that are not in the commit C.

You may happen to have something in E that match what is in D but
not in C.  Because the sender does not know anything about E at all
in the first place, that information cannot be used to reduce the
transfer.

The sender theoretically _could_ also exploit the fact that any
receiver that has C must have B and A and all trees and blobs
associated with these ancestor commits [*1*], but that information
is currently neither discovered nor used during the object transfer.

There may happen to be a tree or a blob in A that matches a tree or
a blob in D.  But because the common ancestor discovery exchange
above stops at C, the sender does not bother enumerating all the
objects that are in the ancestor commits of C when figuring out what
objects to send to ensure that the receiving end has all the objects
necessary to complete D.  If you modified a blob at B (or C) and
then resurrected the old version of the blob at D, it is likely that
the blob is going to be sent again when the receiving end asks for
D.
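The resurrected-blob case in that last paragraph reduces to a small set computation. A toy illustration with hypothetical blob names, flattening each commit's tree to the set of blobs it contains:

```python
# History A---B---C---D: blob "old" is replaced by "new" at B and
# resurrected at D; tree[X] is the set of blobs in commit X's snapshot.
tree = {"A": {"old"}, "B": {"new"}, "C": {"new"}, "D": {"old"}}

# What the sender computes: objects in D that are not in C, C being
# the only common commit the negotiation discovered.
sent = tree["D"] - tree["C"]

# What also enumerating C's ancestors would have allowed instead:
receiver_has = tree["C"] | tree["B"] | tree["A"]
ideal = tree["D"] - receiver_has
```

Because the traversal stops at C, `sent` still contains the resurrected blob even though the receiver already has it from A; walking the ancestors would have emptied the list, at the cost of enumerating far more history.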

There is some work being done to optimize this further using
various techniques, but none of it is ready yet.


[Footnote]

*1* only down to the shallow boundary, if the receiving end is a
shallow clone.