Re: [PATCH 0/6] receive-pack: quarantine pushed objects

2016-10-03 Thread Christian Couder
On Sun, Oct 2, 2016 at 3:02 PM, Jeff King wrote:
> On Sun, Oct 02, 2016 at 11:20:59AM +0200, Christian Couder wrote:
>
>> I wonder if the patch you sent in:
>>
>> https://public-inbox.org/git/20160816144642.5ikkta4l5hyx6...@sigill.intra.peff.net/
>>
>> is still useful or not.
>
> It is potentially still useful for other code paths besides
> receive-pack. But if the main concern is pushes, then yeah, I think it
> is not really doing anything.
>
>> I guess if we fail the receive-pack because the pack is bigger than
>> receive.maxInputSize, then the "quarantine" directory will also be
>> removed, so the part of the pack that we received before failing the
>> receive-pack will be deleted.
>
> Correct. _Any_ failure up to the tmp_objdir_migrate() call will drop the
> objects. So that includes index-pack failing for any reason.

Great, thanks for explaining!

>> > These two patches set that up by letting index-pack and pre-receive
>> > know that quarantine path and use it to store arbitrary files that
>> > _don't_ get migrated to the main object database (i.e., the log file
>> > mentioned above).
>>
>> It would be nice to have a diffstat for the whole series.
>
> You mean in the cover letter? I do not mind including it if people find
> them useful, but I personally have always just found them to be clutter
> at that level.

I think it can help to quickly get an idea about what the series
impacts, and it would have made it easier for me to see that the
changes in the patch you sent previously
(https://public-inbox.org/git/20160816144642.5ikkta4l5hyx6...@sigill.intra.peff.net/)
are not part of this series.

Thanks anyway,
Christian.


Re: [PATCH 0/6] receive-pack: quarantine pushed objects

2016-10-02 Thread Jeff King
On Sun, Oct 02, 2016 at 11:20:59AM +0200, Christian Couder wrote:

> On Fri, Sep 30, 2016 at 9:35 PM, Jeff King wrote:
> > I've mentioned before on the list that GitHub "quarantines" objects
> > while the pre-receive hook runs. Here are the patches to implement
> > that.
> 
> Great! Thanks for upstreaming these patches!
> 
> I wonder if the patch you sent in:
> 
> https://public-inbox.org/git/20160816144642.5ikkta4l5hyx6...@sigill.intra.peff.net/
> 
> is still useful or not.

It is potentially still useful for other code paths besides
receive-pack. But if the main concern is pushes, then yeah, I think it
is not really doing anything.

> I guess if we fail the receive-pack because the pack is bigger than
> receive.maxInputSize, then the "quarantine" directory will also be
> removed, so the part of the pack that we received before failing the
> receive-pack will be deleted.

Correct. _Any_ failure up to the tmp_objdir_migrate() call will drop the
objects. So that includes index-pack failing for any reason.

> > These two patches set that up by letting index-pack and pre-receive
> > know that quarantine path and use it to store arbitrary files that
> > _don't_ get migrated to the main object database (i.e., the log file
> > mentioned above).
> 
> It would be nice to have a diffstat for the whole series.

You mean in the cover letter? I do not mind including it if people find
them useful, but I personally have always just found them to be clutter
at that level.

-Peff


Re: [PATCH 0/6] receive-pack: quarantine pushed objects

2016-10-02 Thread Christian Couder
On Fri, Sep 30, 2016 at 9:35 PM, Jeff King wrote:
> I've mentioned before on the list that GitHub "quarantines" objects
> while the pre-receive hook runs. Here are the patches to implement
> that.

Great! Thanks for upstreaming these patches!

I wonder if the patch you sent in:

https://public-inbox.org/git/20160816144642.5ikkta4l5hyx6...@sigill.intra.peff.net/

is still useful or not.

> The basic problem is that as-is, index-pack admits pushed objects into
> the main object database immediately, before the pre-receive hook runs.
> It _has_ to, since the hook needs to be able to actually look at the
> objects. However, this means that if the pre-receive hook rejects the
> push, we still end up with the objects in the repository. We can't just
> delete them as temporary files, because we don't know what other
> processes might have started referencing them.
>
> The solution here is to push into a "quarantine" directory that is
> accessible only to pre-receive, check_connected(), etc., and only
> move the objects into the main object database after we've finished
> those basic checks.

I guess if we fail the receive-pack because the pack is bigger than
receive.maxInputSize, then the "quarantine" directory will also be
removed, so the part of the pack that we received before failing the
receive-pack will be deleted.

[...]

> These two patches set that up by letting index-pack and pre-receive
> know that quarantine path and use it to store arbitrary files that
> _don't_ get migrated to the main object database (i.e., the log file
> mentioned above).

It would be nice to have a diffstat for the whole series.

Thanks,
Christian.


[PATCH 0/6] receive-pack: quarantine pushed objects

2016-09-30 Thread Jeff King
I've mentioned before on the list that GitHub "quarantines" objects
while the pre-receive hook runs. Here are the patches to implement
that.

The basic problem is that as-is, index-pack admits pushed objects into
the main object database immediately, before the pre-receive hook runs.
It _has_ to, since the hook needs to be able to actually look at the
objects. However, this means that if the pre-receive hook rejects the
push, we still end up with the objects in the repository. We can't just
delete them as temporary files, because we don't know what other
processes might have started referencing them.

The solution here is to push into a "quarantine" directory that is
accessible only to pre-receive, check_connected(), etc., and only
move the objects into the main object database after we've finished
those basic checks.
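
To make the flow concrete, here is a minimal Python sketch of the
quarantine idea. It is not code from the series (which is C): the
function name, the "incoming-" directory prefix, and the check callback
standing in for pre-receive and check_connected() are all assumptions
made for illustration.

```python
import os
import shutil
import tempfile


def receive_with_quarantine(main_objdir, incoming, check):
    """Stage incoming files in a quarantine dir; migrate only if check passes.

    `incoming` maps file names to bytes. `check` is a hypothetical callable
    that stands in for the pre-receive hook and the connectivity check; it
    receives the quarantine path and returns True to accept the push.
    """
    # Assumption: the quarantine lives in a temporary directory created
    # inside the main object directory.
    quarantine = tempfile.mkdtemp(prefix="incoming-", dir=main_objdir)
    try:
        for name, data in incoming.items():
            with open(os.path.join(quarantine, name), "wb") as f:
                f.write(data)
        if not check(quarantine):
            return False  # rejected: nothing reached the main object dir
        # Accepted: move everything into the main object directory.
        for name in os.listdir(quarantine):
            os.rename(os.path.join(quarantine, name),
                      os.path.join(main_objdir, name))
        return True
    finally:
        # Any failure before migration drops everything that was received.
        shutil.rmtree(quarantine, ignore_errors=True)
```

The point of the try/finally is that rejection, an exception while
receiving, and success all end with the quarantine directory gone; only
the success path moves anything out of it first.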

One of the things we use it for at GitHub is object-size policy, which
we implement via a pre-receive hook (sort of; see below). This scheme
has been in use for about 2 years, though I did do a fair bit of
tweaking to make it ready for upstream (squashing bugfixes and merges
from upstream that came later, along with polishing a few rough edges I
saw while doing so). So I may have introduced new bugs. :)

The patches are:

  [1/6]: check_connected: accept an env argument
  [2/6]: sha1_file: always allow relative paths to alternates

These two are preparatory.

  [3/6]: tmp-objdir: introduce API for temporary object directories
  [4/6]: receive-pack: quarantine objects until pre-receive accepts

This is the interesting part.

  [5/6]: tmp-objdir: put quarantine information in the environment
  [6/6]: tmp-objdir: do not migrate files starting with '.'

These are two changes that I ended up doing later to support another
series. They're not strictly necessary here, but I think they're
worth including now, as they change the visible behavior in minor
ways. It seems like a good idea to start with what I think should be
the final behavior.

The other series is basically an optimization for the object-size
policy. Without it, you are stuck walking the graph again in the
pre-receive hook to find the new objects and check their sizes.

index-pack can do that for you very cheaply; it has the size of
each object already. But it _doesn't_ produce nice error messages;
it has no idea at what path the objects are found, and it doesn't
know what kind of advice it should give the user.

So what we can do is ask index-pack to make a note of any objects
larger than N bytes, and write their sha1 and size into a file in
the quarantine path. Then the pre-receive hook can look in that log
and generate any nice message it wants. In the common case, the log
is empty, and it does not have to do any work at all.
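
As a rough illustration of the hook side, the following Python sketch
reads such a log and builds messages from it. Everything specific here
is an assumption: the log file name ".oversized-objects", the
"sha1 size" line format, and the message wording are hypothetical, and
the environment variable is taken to be GIT_QUARANTINE_PATH (the name
is not given in this message).

```python
import os


def check_sizes(environ, log_name=".oversized-objects"):
    """Return one message per object recorded in the hypothetical size log.

    `environ` is a mapping such as os.environ. An empty list means the
    hook has nothing to complain about.
    """
    quarantine = environ.get("GIT_QUARANTINE_PATH")  # assumed variable name
    if quarantine is None:
        return []
    log_path = os.path.join(quarantine, log_name)
    if not os.path.exists(log_path):
        return []  # common case: no log, so no work to do
    messages = []
    with open(log_path) as log:
        for line in log:
            if not line.strip():
                continue
            # Assumed format: "<sha1> <size in bytes>" per line.
            sha1, size = line.split()
            messages.append(
                "object %s is %d bytes, over the size limit" % (sha1, int(size))
            )
    return messages
```

A real hook would print these messages to stderr and exit non-zero if
the list is non-empty; the graph walk is only needed if the hook wants
to report the paths at which the objects appear.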

These two patches set that up by letting index-pack and pre-receive
know that quarantine path and use it to store arbitrary files that
_don't_ get migrated to the main object database (i.e., the log file
mentioned above).
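
The "do not migrate files starting with '.'" rule can be sketched as
below. This is a simplified Python illustration, not the C code in the
series: it handles only a flat directory, and the function name and
return value are assumptions.

```python
import os


def migrate(quarantine, main_objdir):
    """Move quarantined files into main_objdir, skipping dot-files.

    Returns the sorted list of names that were migrated. Files whose names
    start with '.' (such as a scratch log) stay behind and are discarded
    along with the quarantine directory.
    """
    moved = []
    for name in sorted(os.listdir(quarantine)):
        if name.startswith("."):
            continue  # scratch data, never part of the object database
        os.rename(os.path.join(quarantine, name),
                  os.path.join(main_objdir, name))
        moved.append(name)
    return moved
```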

-Peff