Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-08-06 Thread Jeff King
On Mon, Aug 06, 2018 at 10:44:14AM +0200, Christian Couder wrote:

> Taking a look at how we use regexec() in our code base, it looks like
> it might be better to use regexec_buf() defined in
> "git-compat-util.h".
> 
> I am not completely sure about that because apply.c has:
> 
> status = regexec(stamp, timestamp, ARRAY_SIZE(m), m, 0);
> if (status) {
> if (status != REG_NOMATCH)
> warning(_("regexec returned %d for input: %s"),
> status, timestamp);
> return 0;
> }
> 
> Though the above uses a regex that is defined in apply.c. The regex
> doesn't come from the config file.
> 
> Actually except the above there is a mix of regexec() and
> regexec_buf() in our code base, which are used with only 0, 1 or 2
> capture groups, so it is not very clear what should be used.

I don't think we need regexec_buf(). The advantage it has is that it can
operate on strings that aren't NUL-terminated, but that isn't the case
here.

> And anyway I still don't see how we could diagnose when the end user
> input requires more captures than we support.

We can use the final element as a sentinel, and complain if it gets
filled in:

diff --git a/delta-islands.c b/delta-islands.c
index dcc6590cc1..18426ffb18 100644
--- a/delta-islands.c
+++ b/delta-islands.c
@@ -375,6 +375,10 @@ static int find_island_for_ref(const char *refname, const 
struct object_id *oid,
if (i < 0)
return 0;
 
+   if (matches[ARRAY_SIZE(matches)-1].rm_so != -1)
+   die("island regex had too many matches (max=%d)",
+   (int)ARRAY_SIZE(matches) - 2);
+
for (m = 1; m < ARRAY_SIZE(matches); m++) {
regmatch_t *match = [m];
 

The big downside is that it only kicks in when you actually successfully
make a match. So you could have:

  [pack]
  island = refs/(one)/(two)/(three)/(four)/(five)/(six)/(seven)

in your config for years, and then one day it blows up when somebody
actually has a ref that matches it.

I think it would be fine to just say "we only respect the first N
capture groups". And maybe even issue a warning (based on the detection
above). I'd also be fine with bumping the "matches" array to something
more ridiculous, like 32. The current value of 8 was supposed to be
ridiculous already (we've never used more than 2).

-Peff


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-08-06 Thread Christian Couder
On Sun, Aug 5, 2018 at 7:40 PM, Christian Couder
 wrote:
> On Tue, Jul 24, 2018 at 7:11 PM, Junio C Hamano  wrote:

> This is the new documentation:
>
> -> Refs are grouped into islands based on their "names", and two regexes
> -> that produce the same name are considered to be in the same
> -> island. The names are computed from the regexes by concatenating any
> -> capture groups from the regex, with a '-' dash in between. (And if
> -> there are no capture groups, then the name is the empty string, as in
> -> the above example.) This allows you to create arbitrary numbers of
> -> islands. Only up to 7 such capture groups are supported though.
>
> I added the last sentence above, but I wonder if it is 7 or 8.

Actually after reading the man page again, it looks like the first
element in the array we pass is used for "the entire regular
expression's match addresses", so we only get 7 capture groups (not
8).

> The code is the following:
>
> -> static int find_island_for_ref(const char *refname, const struct
> object_id *oid,
> ->int flags, void *data)
> -> {
> -> int i, m;
> -> regmatch_t matches[8];
> -> struct strbuf island_name = STRBUF_INIT;
> ->
> -> /* walk backwards to get last-one-wins ordering */
> -> for (i = island_regexes_nr - 1; i >= 0; i--) {
> -> if (!regexec(_regexes[i], refname,
> ->  ARRAY_SIZE(matches), matches, 0))
> -> break;
> -> }
>
> I also wonder if the above is enough to diagnose end-user input when
> it requires more captures than we support. A quick look at the man
> page of the regex functions wasn't enough to enlighten me. Any input
> on these issues is very welcome!

Taking a look at how we use regexec() in our code base, it looks like
it might be better to use regexec_buf() defined in
"git-compat-util.h".

I am not completely sure about that because apply.c has:

status = regexec(stamp, timestamp, ARRAY_SIZE(m), m, 0);
if (status) {
if (status != REG_NOMATCH)
warning(_("regexec returned %d for input: %s"),
status, timestamp);
return 0;
}

Though the above uses a regex that is defined in apply.c. The regex
doesn't come from the config file.

Actually except the above there is a mix of regexec() and
regexec_buf() in our code base, which are used with only 0, 1 or 2
capture groups, so it is not very clear what should be used.

And anyway I still don't see how we could diagnose when the end user
input requires more captures than we support.


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-08-05 Thread Christian Couder
On Tue, Jul 24, 2018 at 7:11 PM, Junio C Hamano  wrote:
>
> Another thing I noticed from 2/5 is that you can have up to 7 such
> capture groups.  I do not have any opinion if 7 is too few or too
> many, but we would want the number to be documented, and end-user
> input diagnosed when it requires more captures than we support (if
> we are not already checking, that is).

This is the new documentation:

-> Refs are grouped into islands based on their "names", and two regexes
-> that produce the same name are considered to be in the same
-> island. The names are computed from the regexes by concatenating any
-> capture groups from the regex, with a '-' dash in between. (And if
-> there are no capture groups, then the name is the empty string, as in
-> the above example.) This allows you to create arbitrary numbers of
-> islands. Only up to 7 such capture groups are supported though.

I added the last sentence above, but I wonder if it is 7 or 8. The
code is the following:

-> static int find_island_for_ref(const char *refname, const struct
object_id *oid,
->int flags, void *data)
-> {
-> int i, m;
-> regmatch_t matches[8];
-> struct strbuf island_name = STRBUF_INIT;
->
-> /* walk backwards to get last-one-wins ordering */
-> for (i = island_regexes_nr - 1; i >= 0; i--) {
-> if (!regexec(_regexes[i], refname,
->  ARRAY_SIZE(matches), matches, 0))
-> break;
-> }

I also wonder if the above is enough to diagnose end-user input when
it requires more captures than we support. A quick look at the man
page of the regex functions wasn't enough to enlighten me. Any input
on these issues is very welcome!


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-08-05 Thread Christian Couder
On Sun, Jul 22, 2018 at 10:55 AM, Duy Nguyen  wrote:
> On Sun, Jul 22, 2018 at 7:52 AM Christian Couder
>  wrote:
>>
>> -   /*
>> -* And then all remaining commits and tags.
>> -*/
>> -   for (i = last_untagged; i < to_pack.nr_objects; i++) {
>> -   if (oe_type([i]) != OBJ_COMMIT &&
>> -   oe_type([i]) != OBJ_TAG)
>> -   continue;
>> -   add_to_write_order(wo, _end, [i]);
>> -   }
>> +   /*
>> +* Then fill all the tagged tips.
>> +*/
>
> If we move the code in this loop to a separate function, in a separate
> patch, first, would it produce a better diff? I think all the
> indentation change here makes it a bit hard to read.

Sorry I just realized that I forgot to try to do that. It will be on
my todo list for the next iteration.

Thanks,
Christian.


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-28 Thread Christian Couder
On Sat, Jul 28, 2018 at 11:00 AM, Jeff King  wrote:
>
> On Fri, Jul 27, 2018 at 10:22:09AM -0700, Stefan Beller wrote:
>
> > > That would solve the problem that fetching torvalds/linux from GitHub
> > > yields a bigger pack than fetching it from kernel.org. But again, it's
> > > making that root fork weirdly magical. People who fetch solely from
> > > other forks won't get any benefit (and may even see worse packs).
> >
> > Thanks for the explanation.
> > I think this discussion just hints at me being dense reading the
> > documentation. After all I like the concept.
>
> I actually think it hints that the documentation in the commits
> themselves is not adequate. :)

Ok, I will try to improve it using information from the discussion threads.

Thanks,
Christian.


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-28 Thread Jeff King
On Fri, Jul 27, 2018 at 10:22:09AM -0700, Stefan Beller wrote:

> > That would solve the problem that fetching torvalds/linux from GitHub
> > yields a bigger pack than fetching it from kernel.org. But again, it's
> > making that root fork weirdly magical. People who fetch solely from
> > other forks won't get any benefit (and may even see worse packs).
> 
> Thanks for the explanation.
> I think this discussion just hints at me being dense reading the
> documentation. After all I like the concept.

I actually think it hints that the documentation in the commits
themselves is not adequate. :)

-Peff


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-27 Thread Stefan Beller
On Fri, Jul 27, 2018 at 6:13 AM Jeff King  wrote:
>
> On Tue, Jul 24, 2018 at 10:20:05AM -0700, Stefan Beller wrote:
>
> > So in my understanding we have a "common base pack" and specific
> > packs on top for each "island".
>
> Sort of. This is another hacky part. The islands themselves are
> generally just about forbidding deltas, and not any particular kind of
> layering.
>
> But there's some magic layering only for the "core" island, which gets
> to go first (and makes a sort of pseudo-pack at the front of the one
> pack). And then everything else is written willy nilly. This is a hack
> to try to make the "blit the pack bytes out" code path for cloning fast.

yup, I do understand its purpose; we had the same discussions here
for the JGit based hosting.

So you are saying island hopping is disallowed, but the core island
has an airport using the spokes system to reach all other islands?
(I described the core island as sea bed before). Sounds reasonable.

> So no, we don't really layer in any sane way. If pack-objects were fed
> the topological relationships between the forks, in theory we could
> create a layered packfile that respects that.
>
> But even that is not quite enough. At the time of forking, you might
> imagine that torvalds/linux has the base pack, and then somebody forks
> from them and contains all of those objects plus more, and somebody
> forks from them, and so on. But that's just a snapshot. Later
> torvalds/linux will get a bunch of new objects pushed to it. And some of
> its forks will merge those objects, too. But some of them will just rot,
> abandoned, as nobody ever touches them again.
>
> So I don't think there's much to be gained by paying attention to the
> external forking relationships. We have to discover afresh the
> relationships between objects, and which refs (and thus which islands)
> point to them.
>
> One thing I don't think we ever tried was doubling down on the
> islandCore concept and making the "root" fork as tightly packed as it
> could be (with the assumption that _most_ people grab that). And then
> just respect the islands for all the other objects (remember this is an
> optimization, so the worst case is somebody asks for an object during a
> fetch and we have to throw away its on-disk delta).
>
> That would solve the problem that fetching torvalds/linux from GitHub
> yields a bigger pack than fetching it from kernel.org. But again, it's
> making that root fork weirdly magical. People who fetch solely from
> other forks won't get any benefit (and may even see worse packs).

Thanks for the explanation.
I think this discussion just hints at me being dense reading the
documentation. After all I like the concept.

Thanks,
Stefan


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-27 Thread Jeff King
On Tue, Jul 24, 2018 at 10:20:05AM -0700, Stefan Beller wrote:

> So in my understanding we have a "common base pack" and specific
> packs on top for each "island".

Sort of. This is another hacky part. The islands themselves are
generally just about forbidding deltas, and not any particular kind of
layering.

But there's some magic layering only for the "core" island, which gets
to go first (and makes a sort of pseudo-pack at the front of the one
pack). And then everything else is written willy nilly. This is a hack
to try to make the "blit the pack bytes out" code path for cloning fast.
And that has to pick _one_ winner, so ideally you'd point it at the
thing that gets cloned the most, and everybody else gets to be a loser.

Again, this was designed for the current pack-reuse code we have
upstream, which we (GitHub) found to be pretty crappy (which I feel
justified in saying as one of the authors). I need to clean up and share
the alternative strategy we ended up with.

> Do you envision to have "groups of islands" (an atoll) for say all
> open source clones of linux.git, such that you can layer the packs?
> You would not just have the base pack + island pack, but have one
> pack that is common to most islands?

So no, we don't really layer in any sane way. If pack-objects were fed
the topological relationships between the forks, in theory we could
create a layered packfile that respects that.

But even that is not quite enough. At the time of forking, you might
imagine that torvalds/linux has the base pack, and then somebody forks
from them and contains all of those objects plus more, and somebody
forks from them, and so on. But that's just a snapshot. Later
torvalds/linux will get a bunch of new objects pushed to it. And some of
its forks will merge those objects, too. But some of them will just rot,
abandoned, as nobody ever touches them again.

So I don't think there's much to be gained by paying attention to the
external forking relationships. We have to discover afresh the
relationships between objects, and which refs (and thus which islands)
point to them.

One thing I don't think we ever tried was doubling down on the
islandCore concept and making the "root" fork as tightly packed as it
could be (with the assumption that _most_ people grab that). And then
just respect the islands for all the other objects (remember this is an
optimization, so the worst case is somebody asks for an object during a
fetch and we have to throw away its on-disk delta).

That would solve the problem that fetching torvalds/linux from GitHub
yields a bigger pack than fetching it from kernel.org. But again, it's
making that root fork weirdly magical. People who fetch solely from
other forks won't get any benefit (and may even see worse packs).

-Peff


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-24 Thread Stefan Beller
On Tue, Jul 24, 2018 at 2:58 AM Jeff King  wrote:
>
> On Mon, Jul 23, 2018 at 11:52:49AM -0700, Stefan Beller wrote:
>
> > > +DELTA ISLANDS
> > > +-
> > [...]
> >
> > I had to read all of this [background information] to understand the
> > concept and I think it is misnamed, as my gut instinct first told me
> > to have deltas only "within an island and no island hopping is allowed".
> > (This message reads a bit like a commit message, not as documentation
> > as it is long winded, too).
>
> I'm not sure if I'm parsing your sentence correctly, but the reason I
> called them "islands" is exactly that you'd have deltas within an island
> and want to forbid island hopping. So I wasn't sure if you were saying
> "that's what I think, but not how the feature works".

Yeah, you want to avoid island hopping, but still want the sea bed to be
useful for all islands (i.e. to have the largest possible base pack, that all
islands can share without being overexposed with objects they should
not see). And that is the main feature IMHO. It is not so much about
island separation (as you could use its own physically separated repo
for these separation tasks), but the main selling point of this feature
is that it enables a base pack that can be shared across all islands
without violating ACLs for example or making one island "too big"
(having unneeded objects on the island).

So metaphorically speaking I assumed the sea bed to support
all islands, which themselves are independent from each other.

> There _is_ a tricky thing, which is that a given object is going to
> appear in many islands. So the rule is really "you cannot hop to a base
> that is not in all of your islands".

Yes, if islands were numbers, we'd be looking for the Greatest common
common divisor for all of them, as all islands have to have access to
all objects in the base pack.

>
> > What about renaming this feature to
> >
> > [pack]
> > excludePartialReach = refs/virtual/[0-9]]+/tags/
> >
> >   "By setting `pack.excludePartialReach`, object deltafication is
> >   prohibited for objects that are not reachable from all
> >   manifestations of the given regex"
> >
> > Cryptic, but it explains it in my mind in a shorter, more concise way. ;-)
>
> So I'm hopelessly biased at this point, having developed and worked with
> the feature under the "island" name for several years. But I find your
> name and explanation pretty confusing. :)

Ok.

>
> Worse, though, it does not have any noun to refer to the reachable set.
> The regex capture and the island names are an important part of the
> feature, because it lets you make a single island out of:
>
>   refs/virtual/([0-9]+)/heads
>   refs/virtual/([0-9]+)/tags
>
> but exclude:
>
>   refs/virtual/([0-9]+)/(foo)
>
> which goes into its own island ("123-foo" instead of "123"). So what's
> the equivalent nomenclature to "island name"?

So in my understanding we have a "common base pack" and specific
packs on top for each "island".

Regarding naming I find islands interesting and well fitting here, I don't
know of any better name.

I was just confused in the beginning as the name indicates that we care
about the islands, but we rather care about *not* hopping them.

Do you envision to have "groups of islands" (an atoll) for say all
open source clones of linux.git, such that you can layer the packs?
You would not just have the base pack + island pack, but have one
pack that is common to most islands?

Sorry if the initial email came off as a rant; I think the islands
metaphor is very good.

Thanks,
Stefan


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-24 Thread Junio C Hamano
Junio C Hamano  writes:

>> +---
>> +[pack]
>> +island = refs/virtual/([0-9]+)/heads/
>> +island = refs/virtual/([0-9]+)/tags/
>> +island = refs/virtual/([0-9]+)/(pull)/
>> +---
>
> It becomes clear only from this example that what the feature calls
> (and documented in patch 2/5) "regexp" is not BRE but ERE.  Update
> 2/5 so that it is clear to readers of "git config --help" who looks
> for "pack.island" in the output.
>
>> +That puts the heads and tags for each fork in their own island (named
>> +"1234" or similar), and the pull refs for each go into their own
>> +"1234-pull".
>
> "by concatenating any capture groups" made me imagine that the last
> one would be "1234pull" without dash.  The actual rule should be
> mentioned in that paragraph (i.e.  "concatenating any capture groups
> from the regex, with a '-' dash in between" or something like that).

Another thing I noticed from 2/5 is that you can have up to 7 such
capture groups.  I do not have any opinion if 7 is too few or too
many, but we would want the number to be documented, and end-user
input diagnosed when it requires more captures than we support (if
we are not already checking, that is).


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-24 Thread Junio C Hamano
Christian Couder  writes:

> +Islands are configured via the `pack.island` option, which can be
> +specified multiple times. Each value is a left-anchored regular
> +expressions matching refnames. For example:
> +
> +---
> +[pack]
> +island = refs/heads/
> +island = refs/tags/
> +---
> +
> +puts heads and tags into an island (whose name is the empty string; see
> +below for more on naming). Any refs which do not match those regular
> +expressions (e.g., `refs/pull/123`) is not in any island. Any object
> +which is reachable only from `refs/pull/` (but not heads or tags) is
> +therefore not a candidate to be used as a base for `refs/heads/`.
> +
> +Refs are grouped into islands based on their "names", and two regexes
> +that produce the same name are considered to be in the same island. The
> +names are computed from the regexes by concatenating any capture groups
> +from the regex (and if there are none, then the name is the empty
> +string, as in the above example). This allows you to create arbitrary
> +numbers of islands. For example, imagine you store the refs for each
> +fork in `refs/virtual/ID`, where `ID` is a numeric identifier. You might
> +then configure:
> +
> +---
> +[pack]
> +island = refs/virtual/([0-9]+)/heads/
> +island = refs/virtual/([0-9]+)/tags/
> +island = refs/virtual/([0-9]+)/(pull)/
> +---

It becomes clear only from this example that what the feature calls
(and documented in patch 2/5) "regexp" is not BRE but ERE.  Update
2/5 so that it is clear to readers of "git config --help" who looks
for "pack.island" in the output.

> +That puts the heads and tags for each fork in their own island (named
> +"1234" or similar), and the pull refs for each go into their own
> +"1234-pull".

"by concatenating any capture groups" made me imagine that the last
one would be "1234pull" without dash.  The actual rule should be
mentioned in that paragraph (i.e.  "concatenating any capture groups
from the regex, with a '-' dash in between" or something like that).

> +Note that we pick a single island for each regex to go into, using "last
> +one wins" ordering (which allows repo-specific config to take precedence
> +over user-wide config, and so forth).


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-24 Thread Jeff King
On Mon, Jul 23, 2018 at 11:52:49AM -0700, Stefan Beller wrote:

> > +DELTA ISLANDS
> > +-
> [...]
> 
> I had to read all of this [background information] to understand the
> concept and I think it is misnamed, as my gut instinct first told me
> to have deltas only "within an island and no island hopping is allowed".
> (This message reads a bit like a commit message, not as documentation
> as it is long winded, too).

I'm not sure if I'm parsing your sentence correctly, but the reason I
called them "islands" is exactly that you'd have deltas within an island
and want to forbid island hopping. So I wasn't sure if you were saying
"that's what I think, but not how the feature works".

There _is_ a tricky thing, which is that a given object is going to
appear in many islands. So the rule is really "you cannot hop to a base
that is not in all of your islands".

> What about renaming this feature to
> 
> [pack]
> excludePartialReach = refs/virtual/[0-9]]+/tags/
> 
>   "By setting `pack.excludePartialReach`, object deltafication is
>   prohibited for objects that are not reachable from all
>   manifestations of the given regex"
> 
> Cryptic, but it explains it in my mind in a shorter, more concise way. ;-)

So I'm hopelessly biased at this point, having developed and worked with
the feature under the "island" name for several years. But I find your
name and explanation pretty confusing. :)

Worse, though, it does not have any noun to refer to the reachable set.
The regex capture and the island names are an important part of the
feature, because it lets you make a single island out of:

  refs/virtual/([0-9]+)/heads
  refs/virtual/([0-9]+)/tags

but exclude:

  refs/virtual/([0-9]+)/(foo)

which goes into its own island ("123-foo" instead of "123"). So what's
the equivalent nomenclature to "island name"?

> > @@ -3182,6 +3225,8 @@ int cmd_pack_objects(int argc, const char **argv, 
> > const char *prefix)
> >   option_parse_missing_action },
> > OPT_BOOL(0, "exclude-promisor-objects", 
> > _promisor_objects,
> >  N_("do not pack objects in promisor packfiles")),
> > +   OPT_BOOL(0, "delta-islands", _delta_islands,
> > +N_("enable islands for delta compression")),
> 
> We enable this feature, but we disallow certain patterns to be used in 
> packing,
> so it sounds weird to me as we tell Git to *not* explore the full design 
> space,
> so we're not enabling it, but rather restricting it?

Yeah, perhaps "respect islands during delta compression" makes more sense.

-Peff


Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-23 Thread Stefan Beller
On Sat, Jul 21, 2018 at 10:49 PM Christian Couder
 wrote:
>
> From: Jeff King 
>
> Implement support for delta islands in git pack-objects
> and document how delta islands work in
> "Documentation/git-pack-objects.txt".
>
> Signed-off-by: Jeff King 
> Signed-off-by: Christian Couder 
> ---
>  Documentation/git-pack-objects.txt |  88 +++
>  builtin/pack-objects.c | 130 -
>  2 files changed, 177 insertions(+), 41 deletions(-)
>
> diff --git a/Documentation/git-pack-objects.txt 
> b/Documentation/git-pack-objects.txt
> index d95b472d16..7b7a36056f 100644
> --- a/Documentation/git-pack-objects.txt
> +++ b/Documentation/git-pack-objects.txt
> @@ -289,6 +289,94 @@ Unexpected missing object will raise an error.
>  --unpack-unreachable::
> Keep unreachable objects in loose form. This implies `--revs`.
>
> +--delta-islands::
> +   Restrict delta matches based on "islands". See DELTA ISLANDS
> +   below.
> +
> +
> +DELTA ISLANDS
> +-
> +
> +When possible, `pack-objects` tries to reuse existing on-disk deltas to
> +avoid having to search for new ones on the fly. This is an important
> +optimization for serving fetches, because it means the server can avoid
> +inflating most objects at all and just send the bytes directly from
> +disk. This optimization can't work when an object is stored as a delta
> +against a base which the receiver does not have (and which we are not
> +already sending). In that case the server "breaks" the delta and has to
> +find a new one, which has a high CPU cost. Therefore it's important for
> +performance that the set of objects in on-disk delta relationships match
> +what a client would fetch.
> +
> +In a normal repository, this tends to work automatically. The objects
> +are mostly reachable from the branches and tags, and that's what clients
> +fetch. Any deltas we find on the server are likely to be between objects
> +the client has or will have.
> +
> +But in some repository setups, you may have several related but separate
> +groups of ref tips, with clients tending to fetch those groups
> +independently. For example, imagine that you are hosting several "forks"
> +of a repository in a single shared object store, and letting clients
> +view them as separate repositories through `GIT_NAMESPACE` or separate
> +repos using the alternates mechanism. A naive repack may find that the
> +optimal delta for an object is against a base that is only found in
> +another fork. But when a client fetches, they will not have the base
> +object, and we'll have to find a new delta on the fly.
> +
> +A similar situation may exist if you have many refs outside of
> +`refs/heads/` and `refs/tags/` that point to related objects (e.g.,
> +`refs/pull` or `refs/changes` used by some hosting providers). By
> +default, clients fetch only heads and tags, and deltas against objects
> +found only in those other groups cannot be sent as-is.
> +
> +Delta islands solve this problem by allowing you to group your refs into
> +distinct "islands". Pack-objects computes which objects are reachable
> +from which islands, and refuses to make a delta from an object `A`
> +against a base which is not present in all of `A`'s islands. This
> +results in slightly larger packs (because we miss some delta
> +opportunities), but guarantees that a fetch of one island will not have
> +to recompute deltas on the fly due to crossing island boundaries.
> +
> +Islands are configured via the `pack.island` option, which can be
> +specified multiple times. Each value is a left-anchored regular
> +expressions matching refnames. For example:
> +
> +---
> +[pack]
> +island = refs/heads/
> +island = refs/tags/
> +---
> +
> +puts heads and tags into an island (whose name is the empty string; see
> +below for more on naming). Any refs which do not match those regular
> +expressions (e.g., `refs/pull/123`) is not in any island. Any object
> +which is reachable only from `refs/pull/` (but not heads or tags) is
> +therefore not a candidate to be used as a base for `refs/heads/`.
> +
> +Refs are grouped into islands based on their "names", and two regexes
> +that produce the same name are considered to be in the same island. The
> +names are computed from the regexes by concatenating any capture groups
> +from the regex (and if there are none, then the name is the empty
> +string, as in the above example). This allows you to create arbitrary
> +numbers of islands. For example, imagine you store the refs for each
> +fork in `refs/virtual/ID`, where `ID` is a numeric identifier. You might
> +then configure:
> +
> +---
> +[pack]
> +island = refs/virtual/([0-9]+)/heads/
> +island = refs/virtual/([0-9]+)/tags/
> +island = refs/virtual/([0-9]+)/(pull)/
> +---
> +
> +That puts the heads and tags for each fork in 

Re: [RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-22 Thread Duy Nguyen
On Sun, Jul 22, 2018 at 7:52 AM Christian Couder
 wrote:
> @@ -700,51 +705,58 @@ static struct object_entry **compute_write_order(void)
>  */
> for_each_tag_ref(mark_tagged, NULL);
>
> -   /*
> -* Give the objects in the original recency order until
> -* we see a tagged tip.
> -*/
> +   if (use_delta_islands)
> +   max_layers = compute_pack_layers(_pack);
> +
> ALLOC_ARRAY(wo, to_pack.nr_objects);
> -   for (i = wo_end = 0; i < to_pack.nr_objects; i++) {
> -   if (objects[i].tagged)
> -   break;
> -   add_to_write_order(wo, _end, [i]);
> -   }
> -   last_untagged = i;
> +   wo_end = 0;
>
> -   /*
> -* Then fill all the tagged tips.
> -*/
> -   for (; i < to_pack.nr_objects; i++) {
> -   if (objects[i].tagged)
> +   for (; write_layer < max_layers; ++write_layer) {
> +   /*
> +* Give the objects in the original recency order until
> +* we see a tagged tip.
> +*/
> +   for (i = 0; i < to_pack.nr_objects; i++) {
> +   if (objects[i].tagged)
> +   break;
> add_to_write_order(wo, _end, [i]);
> -   }
> +   }
> +   last_untagged = i;
>
> -   /*
> -* And then all remaining commits and tags.
> -*/
> -   for (i = last_untagged; i < to_pack.nr_objects; i++) {
> -   if (oe_type([i]) != OBJ_COMMIT &&
> -   oe_type([i]) != OBJ_TAG)
> -   continue;
> -   add_to_write_order(wo, _end, [i]);
> -   }
> +   /*
> +* Then fill all the tagged tips.
> +*/

If we move the code in this loop to a separate function, in a separate
patch, first, would it produce a better diff? I think all the
indentation change here makes it a bit hard to read.

> +   for (; i < to_pack.nr_objects; i++) {
> +   if (objects[i].tagged)
> +   add_to_write_order(wo, _end, [i]);
> +   }
-- 
Duy


[RFC PATCH 3/5] pack-objects: add delta-islands support

2018-07-21 Thread Christian Couder
From: Jeff King 

Implement support for delta islands in git pack-objects
and document how delta islands work in
"Documentation/git-pack-objects.txt".

Signed-off-by: Jeff King 
Signed-off-by: Christian Couder 
---
 Documentation/git-pack-objects.txt |  88 +++
 builtin/pack-objects.c | 130 -
 2 files changed, 177 insertions(+), 41 deletions(-)

diff --git a/Documentation/git-pack-objects.txt 
b/Documentation/git-pack-objects.txt
index d95b472d16..7b7a36056f 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -289,6 +289,94 @@ Unexpected missing object will raise an error.
 --unpack-unreachable::
Keep unreachable objects in loose form. This implies `--revs`.
 
+--delta-islands::
+   Restrict delta matches based on "islands". See DELTA ISLANDS
+   below.
+
+
+DELTA ISLANDS
+-
+
+When possible, `pack-objects` tries to reuse existing on-disk deltas to
+avoid having to search for new ones on the fly. This is an important
+optimization for serving fetches, because it means the server can avoid
+inflating most objects at all and just send the bytes directly from
+disk. This optimization can't work when an object is stored as a delta
+against a base which the receiver does not have (and which we are not
+already sending). In that case the server "breaks" the delta and has to
+find a new one, which has a high CPU cost. Therefore it's important for
+performance that the set of objects in on-disk delta relationships match
+what a client would fetch.
+
+In a normal repository, this tends to work automatically. The objects
+are mostly reachable from the branches and tags, and that's what clients
+fetch. Any deltas we find on the server are likely to be between objects
+the client has or will have.
+
+But in some repository setups, you may have several related but separate
+groups of ref tips, with clients tending to fetch those groups
+independently. For example, imagine that you are hosting several "forks"
+of a repository in a single shared object store, and letting clients
+view them as separate repositories through `GIT_NAMESPACE` or separate
+repos using the alternates mechanism. A naive repack may find that the
+optimal delta for an object is against a base that is only found in
+another fork. But when a client fetches, they will not have the base
+object, and we'll have to find a new delta on the fly.
+
+A similar situation may exist if you have many refs outside of
+`refs/heads/` and `refs/tags/` that point to related objects (e.g.,
+`refs/pull` or `refs/changes` used by some hosting providers). By
+default, clients fetch only heads and tags, and deltas against objects
+found only in those other groups cannot be sent as-is.
+
+Delta islands solve this problem by allowing you to group your refs into
+distinct "islands". Pack-objects computes which objects are reachable
+from which islands, and refuses to make a delta from an object `A`
+against a base which is not present in all of `A`'s islands. This
+results in slightly larger packs (because we miss some delta
+opportunities), but guarantees that a fetch of one island will not have
+to recompute deltas on the fly due to crossing island boundaries.
+
+Islands are configured via the `pack.island` option, which can be
+specified multiple times. Each value is a left-anchored regular
+expressions matching refnames. For example:
+
+---
+[pack]
+island = refs/heads/
+island = refs/tags/
+---
+
+puts heads and tags into an island (whose name is the empty string; see
+below for more on naming). Any refs which do not match those regular
+expressions (e.g., `refs/pull/123`) is not in any island. Any object
+which is reachable only from `refs/pull/` (but not heads or tags) is
+therefore not a candidate to be used as a base for `refs/heads/`.
+
+Refs are grouped into islands based on their "names", and two regexes
+that produce the same name are considered to be in the same island. The
+names are computed from the regexes by concatenating any capture groups
+from the regex (and if there are none, then the name is the empty
+string, as in the above example). This allows you to create arbitrary
+numbers of islands. For example, imagine you store the refs for each
+fork in `refs/virtual/ID`, where `ID` is a numeric identifier. You might
+then configure:
+
+---
+[pack]
+island = refs/virtual/([0-9]+)/heads/
+island = refs/virtual/([0-9]+)/tags/
+island = refs/virtual/([0-9]+)/(pull)/
+---
+
+That puts the heads and tags for each fork in their own island (named
+"1234" or similar), and the pull refs for each go into their own
+"1234-pull".
+
+Note that we pick a single island for each regex to go into, using "last
+one wins" ordering (which allows repo-specific config to take precedence
+over