Re: push race

2012-10-16 Thread Junio C Hamano
Jeff King  writes:

> But still...the complexity is ugly, and we do not even have a measured
> problem in the real world. This is not worth thinking about. :)

Yup.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-16 Thread Jeff King
On Tue, Oct 16, 2012 at 10:21:02AM -0700, Junio C Hamano wrote:

> > I suppose we could do the way unpack-objects does: prefer present
> > objects and drop the new identical ones, no memcmp. Objects that are
> > not bases, or are ref-delta bases, can be safely dropped. ofs-delta
> > bases may lead to rewriting the pack. Do-able but not sure it's worth
> > the effort.
> 
> Until you read all the incoming pack data, you won't know what
> objects are used as bases for others, so unless you are keeping
> everything in core, you would have to spool the incoming data to a
> file and then rewrite the final pack file to "drop" these "can be
> safely dropped" objects, with or without offset delta encoding.

By definition, you know that you have another copy of these objects
(that is why you are dropping them). So you could treat later delta
references to them the same as thin-pack references, and re-add your
existing on-disk copy of the object to the end of the pack.

But still...the complexity is ugly, and we do not even have a measured
problem in the real world. This is not worth thinking about. :)

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-16 Thread Junio C Hamano
Nguyen Thai Ngoc Duy  writes:

> On Tue, Oct 16, 2012 at 12:37 PM, Jeff King  wrote:
>> I suspect a site that is heavy on alternates is invoking the index-pack
>> code path more frequently than necessary (e.g., history gets pushed to
>> one forked repo, then when it goes to the next one, we may not share the
>> ref that tells the client we already have the object and receive it a
>> second time).
>
> I suppose we could do the way unpack-objects does: prefer present
> objects and drop the new identical ones, no memcmp. Objects that are
> not bases, or are ref-delta bases, can be safely dropped. ofs-delta
> bases may lead to rewriting the pack. Do-able but not sure it's worth
> the effort.

Until you read all the incoming pack data, you won't know what
objects are used as bases for others, so unless you are keeping
everything in core, you would have to spool the incoming data to a
file and then rewrite the final pack file to "drop" these "can be
safely dropped" objects, with or without offset delta encoding.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-16 Thread Jeff King
On Tue, Oct 16, 2012 at 05:45:12PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Tue, Oct 16, 2012 at 12:37 PM, Jeff King  wrote:
> > I suspect a site that is heavy on alternates is invoking the index-pack
> > code path more frequently than necessary (e.g., history gets pushed to
> > one forked repo, then when it goes to the next one, we may not share the
> > ref that tells the client we already have the object and receive it a
> > second time).
> 
> I suppose we could do the way unpack-objects does: prefer present
> objects and drop the new identical ones, no memcmp. Objects that are
> not bases, or are ref-delta bases, can be safely dropped. ofs-delta
> bases may lead to rewriting the pack. Do-able but not sure it's worth
> the effort.

Yeah, I think that complexity is why we don't do it currently. We are
pretty alternates-heavy at GitHub, and we have not noticed a performance
impact. So I think it is probably not worth worrying about.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Angelo Borsotti
Hi Jeff,

it would be worth to put your description as comments in the code for future
reference.

Thanks
-Angelo
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Jeff King
On Tue, Oct 16, 2012 at 12:15:21PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Tue, Oct 16, 2012 at 11:51 AM, Jeff King  wrote:
> >> Its worth nothing that a SHA-1 collision can be identified at the
> >> server because the server performs a byte-for-byte compare of both
> >> copies of the object to make sure they match exactly in every way. Its
> >> not fast, but its safe. :-)
> >
> > Do we? I thought early versions of git did that, but we did not
> > double-check collisions any more for performance reasons. You don't
> > happen to remember where that code is, do you (not that it really
> > matters, but I am just curious)?
> 
> We do. I touched that sha-1 collision code last time I updated
> index-pack, to support large blobs. We only do that when we receive an
> object that we already have, which should not happen often unless
> you're under attack, so little performance impact normally. Search
> "collision" in index-pack.c

Ah, thanks, I remember this now. I think that I was thinking of the very
early code to check every sha1 file write. E.g., the code killed off by
aac1794 (Improve sha1 object file writing., 2005-05-03). But that is
ancient history that is not really relevant.

Interesting that we check only in index-pack. If the pushed content is
small enough, we will call unpack-objects. That follows the usual code
path for writing the object, which will prefer the existing copy.

I suspect a site that is heavy on alternates is invoking the index-pack
code path more frequently than necessary (e.g., history gets pushed to
one forked repo, then when it goes to the next one, we may not share the
ref that tells the client we already have the object and receive it a
second time).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Jeff King
On Mon, Oct 15, 2012 at 07:09:52PM -0700, Shawn O. Pearce wrote:

> On Mon, Oct 15, 2012 at 11:56 AM, Jeff King  wrote:
> > Right. The only thing that needs locking is the refs, because the object
> > database is add-only for normal operations, and by definition collisions
> > mean you have the same content (or are astronomically unlucky, but your
> > consolation prize is that you can write a paper on how you found a sha1
> > collision).
> 
> Its worth nothing that a SHA-1 collision can be identified at the
> server because the server performs a byte-for-byte compare of both
> copies of the object to make sure they match exactly in every way. Its
> not fast, but its safe. :-)

Do we? I thought early versions of git did that, but we did not
double-check collisions any more for performance reasons. You don't
happen to remember where that code is, do you (not that it really
matters, but I am just curious)?

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Shawn Pearce
On Mon, Oct 15, 2012 at 11:56 AM, Jeff King  wrote:
> Right. The only thing that needs locking is the refs, because the object
> database is add-only for normal operations, and by definition collisions
> mean you have the same content (or are astronomically unlucky, but your
> consolation prize is that you can write a paper on how you found a sha1
> collision).

Its worth nothing that a SHA-1 collision can be identified at the
server because the server performs a byte-for-byte compare of both
copies of the object to make sure they match exactly in every way. Its
not fast, but its safe. :-)
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Jeff King
On Mon, Oct 15, 2012 at 05:50:47PM +0200, Angelo Borsotti wrote:

> correct, there will be no file overwriting because no files are
> written on the work tree.
> I tried to follow the actions of the program, but did not quite catch
> the 6. you mention.

It is the "oldval" parameter to refs.c:update_ref. Or if you are using
the "git update-ref" plumbing, it is the "oldvalue" parameter.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Jeff King
On Mon, Oct 15, 2012 at 10:29:08AM -0400, Marc Branchaud wrote:

> Here's a previous discussion of a race in concurrent updates to the same ref,
> even when the updates are all identical:
> 
> http://news.gmane.org/find-root.php?group=gmane.comp.version-control.git&article=164636
> 
> In that thread, Peff outlines the lock procedure for refs:
> 
> 1. get the lock
> 2. check and remember the sha1
> 3. release the lock
> 4. do some long-running work (like the actual push)
> 5. get the lock
> 6. check that the sha1 is the same as the remembered one
> 7. update the sha1
> 8. release the lock

A minor nit, but I was wrong on steps 1-3. We don't have to take a lock
on reading, because our write mechanism uses atomic replacement. So it
is really:

  1. read and remember the original sha1
  2. do some long-running work (like the actual push)
  3. get the write lock
  4. read the sha1 and check that it's the same as our original
  5. write the new sha1 to the lockfile
  6. simultaneously release the lock and update the ref by atomically
 renaming the lockfile to the actual ref

Any simultaneous push may see the "old" sha1 before step 6, and when it
gets to its own step 4, will fail (and two processes cannot be in steps
3-6 simultaneously).

> Angelo, in your case I think one of your concurrent updates would fail in
> step 6.  As you say, this is after the changes have been uploaded.  However,
> there's none of the file-overwriting that you fear, because the changes are
> stored in git's object database under their SHA hashes.  So there'll only be
> an object-level collision if two parties upload the exact same object, in
> which case it doesn't matter.

Right. The only thing that needs locking is the refs, because the object
database is add-only for normal operations, and by definition collisions
mean you have the same content (or are astronomically unlucky, but your
consolation prize is that you can write a paper on how you found a sha1
collision).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Angelo Borsotti
Hi Marc,

correct, there will be no file overwriting because no files are
written on the work tree.
I tried to follow the actions of the program, but did not quite catch
the 6. you mention.

-Angelo
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Marc Branchaud
On 12-10-15 10:09 AM, Ævar Arnfjörð Bjarmason wrote:
> On Mon, Oct 15, 2012 at 11:14 AM, Angelo Borsotti
>  wrote:
>> Hello,
> 
> FWIW we have a lot of lemmings pushing to the same ref all the time at
> $work, and while I've seen cases where:
> 
>  1. Two clients try to push
>  2. They both get the initial lock
>  3. One of them fails to get the secondary lock (I think updating the ref)
> 
> I've never seen cases where they clobber each other in #3 (and I would
> have known from "dude, where's my commit that I just pushed" reports).
> 
> So while we could fix git to make sure there's no race condition such
> that two clients never get the #2 lock I haven't seen it cause actual
> data issues because of two clients getting the #3 lock.
> 
> It might still happen in some cases, I recommend testing it with e.g.
> lots of pushes in parallel with GNU Parallel.

Here's a previous discussion of a race in concurrent updates to the same ref,
even when the updates are all identical:

http://news.gmane.org/find-root.php?group=gmane.comp.version-control.git&article=164636

In that thread, Peff outlines the lock procedure for refs:

1. get the lock
2. check and remember the sha1
3. release the lock
4. do some long-running work (like the actual push)
5. get the lock
6. check that the sha1 is the same as the remembered one
7. update the sha1
8. release the lock

Angelo, in your case I think one of your concurrent updates would fail in
step 6.  As you say, this is after the changes have been uploaded.  However,
there's none of the file-overwriting that you fear, because the changes are
stored in git's object database under their SHA hashes.  So there'll only be
an object-level collision if two parties upload the exact same object, in
which case it doesn't matter.

M.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread demerphq
On 15 October 2012 16:09, Ævar Arnfjörð Bjarmason  wrote:
> On Mon, Oct 15, 2012 at 11:14 AM, Angelo Borsotti
>  wrote:
>> Hello,
>
> FWIW we have a lot of lemmings pushing to the same ref all the time at
> $work, and while I've seen cases where:
>
>  1. Two clients try to push
>  2. They both get the initial lock
>  3. One of them fails to get the secondary lock (I think updating the ref)
>
> I've never seen cases where they clobber each other in #3 (and I would
> have known from "dude, where's my commit that I just pushed" reports).

Except that the error message is really cryptic. It definitely doesnt
shout out "maybe you collided with someone elses push".

Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Ævar Arnfjörð Bjarmason
On Mon, Oct 15, 2012 at 11:14 AM, Angelo Borsotti
 wrote:
> Hello,

FWIW we have a lot of lemmings pushing to the same ref all the time at
$work, and while I've seen cases where:

 1. Two clients try to push
 2. They both get the initial lock
 3. One of them fails to get the secondary lock (I think updating the ref)

I've never seen cases where they clobber each other in #3 (and I would
have known from "dude, where's my commit that I just pushed" reports).

So while we could fix git to make sure there's no race condition such
that two clients never get the #2 lock I haven't seen it cause actual
data issues because of two clients getting the #3 lock.

It might still happen in some cases, I recommend testing it with e.g.
lots of pushes in parallel with GNU Parallel.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Nguyen Thai Ngoc Duy
On Mon, Oct 15, 2012 at 6:05 PM, Matthieu Moy
 wrote:
> Angelo Borsotti  writes:
>
>> the push command checks first if the tips of the branches match those
>> of the remote references, and if it does uploads the snapshot.
>
> The update does two things: upload objects to the database, and then
> update the reference. Adding objects to the database does not change the
> repository until the objects are reachable from a ref. Updating the ref
> is usually done giving the expected old sha1, and locks the ref, so it
> can't change in the meantime.
>
> I don't know this part of the code very well, but check refs.c for the C
> part, and "git update-ref" for the plumbing interface.

I think it's lock_any_ref_for_update(), which is called inside
refs.c:update_ref().
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: push race

2012-10-15 Thread Matthieu Moy
Angelo Borsotti  writes:

> the push command checks first if the tips of the branches match those
> of the remote references, and if it does uploads the snapshot.

The update does two things: upload objects to the database, and then
update the reference. Adding objects to the database does not change the
repository until the objects are reachable from a ref. Updating the ref
is usually done giving the expected old sha1, and locks the ref, so it
can't change in the meantime.

I don't know this part of the code very well, but check refs.c for the C
part, and "git update-ref" for the plumbing interface.

-- 
Matthieu Moy
http://www-verimag.imag.fr/~moy/
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html