Re: caching pull - stable partitioning of bundle requests

2019-04-15 Thread Pierre-Yves David



On 3/3/19 10:01 PM, Pulkit Goyal wrote:



On Tue, Feb 19, 2019 at 3:23 PM Boris FELD wrote:


On 15/02/2019 21:16, Pulkit Goyal wrote:



On Thu, Oct 4, 2018 at 6:16 PM Boris FELD wrote:

The road for moving this in Core is clear, but not short. So
far we have not been able to free the necessary time to do it.
Between the paying-client work we have to do to pay salaries
(and keep users happy) and all the time we are already
investing in the community, we are fairly busy.


In early 2017, Bitbucket gave $13,500 to the Mercurial
project to be spent on helping evolution move forward. As far
as we know, this money is still unspent. Since stable range is
a critical part of obsmarkers discovery, unlocking this money
to be spent on upstreaming stable range would be a good idea
(and fits its initial purpose). Paying for this kind of work
will reduce the contention with client work and help us, or
others, to dedicate time for it sooner rather than later.


I definitely agree that obsmarker discovery is a critical part.
Pulling from `hg-committed` is sometimes slower than pulling
from a repo (5-7x the size of hg-committed) whose server has
thousands of heads.

Do you have any updates on the stable-range cache? In the current
state the cache is pretty big and a lot of people have faced
problems with the cache size. Also, after strip and some other
commands, it rebuilds the cache, which takes more than 10 minutes
on a large repo and is definitely a bummer. Are you working on
making it faster and smaller? How is the experimentation with
evolve+stablerange going?


# Regarding the cache-size:

We know that the current version caches many entries that are
trivial to compute and do not need to be cached. In addition, the
current storage (SQLite) does not seem very efficient.

So the next iteration of the cache should be significantly smaller.


# Regarding cache invalidation:

A lot of the data in the caches is an inherent property of the
changeset and therefore immutable. It is easy to preserve it during
strip to avoid having to recompute things from scratch. In addition,
this immutable data should be exchanged during pull alongside the
associated changesets, to avoid clients recomputing the same data
over and over.

The current implementation is an experimental/research
implementation; all this should get smoothed directly in Core during
the upstreaming.


I am a bit confused when you say "things should get smoothed directly in 
Core during the upstreaming". Which one of the following did you mean:


1) send a patch of the current implementation to core and, once that patch 
gets in, try to improve the implementation in core
2) send a series to core which contains a patch of the current 
implementation and other patches improving the implementation


2) is the same as what Augie did for narrow, linelog, remotefilelog, Greg 
did for sparse, and I did for infinitepush.


Which one do you mean here?


Mostly a variation of the second one. The code currently in evolve has a 
lot of unnecessary extra complexity that comes from the fact that it lives 
in an extension and that it was initially research code that evolved 
toward the current solution. Overall, we can keep the main algorithm and 
reimplement more of what surrounds it directly in core. In particular, 
most of the immutable properties we compute could go directly into a new 
version of the revlog index to trivialize their storage.
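
As a rough illustration of that idea (a hedged sketch only; the class and
method names are made up, this is not evolve's or core's actual code), an
immutable per-changeset property can live in an append-only array kept
next to the revlog index. The property shown here, a generation number,
is just a stand-in:

    class GenerationIndex(object):
        """Append-only side array for an immutable per-changeset value.

        Because the value depends only on the changeset's ancestors, it
        never needs invalidation: strip is a plain truncation, and pull
        could ship the values alongside the changesets themselves.
        """

        def __init__(self):
            self._gen = []  # rev -> 1 + max(generation of parents)

        def add(self, p1, p2):
            # p1/p2 are parent revs, -1 for the null revision; the value
            # is computed once, at the time the revision is added
            pgen = max(self._gen[p] if p != -1 else -1 for p in (p1, p2))
            self._gen.append(pgen + 1)
            return len(self._gen) - 1

        def strip(self, minrev):
            # stripping revisions >= minrev cannot affect older entries
            del self._gen[minrev:]

Storing such values in the index itself would make them as cheap to read
as parents or linkrevs.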


For the record, Joerg Sonnenberger looked at this during the mini-sprint 
and we discussed how it could be applied to exchanging arbitrary notes for 
changesets (e.g. a new tags mechanism, code signing, CI status, etc.)


Cheers,

--
Pierre-Yves David
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: caching pull - stable partitioning of bundle requests

2019-03-03 Thread Pulkit Goyal
On Tue, Feb 19, 2019 at 3:23 PM Boris FELD wrote:

> On 15/02/2019 21:16, Pulkit Goyal wrote:
>
>
>
> On Thu, Oct 4, 2018 at 6:16 PM Boris FELD wrote:
>
>> The road for moving this in Core is clear, but not short. So far we have
>> not been able to free the necessary time to do it. Between the
>> paying-client work we have to do to pay salaries (and keep users happy)
>> and all the time we are already investing in the community, we are
>> fairly busy.
>>
>> In early 2017, Bitbucket gave $13,500 to the Mercurial project to be
>> spent on helping evolution move forward. As far as we know, this money is
>> still unspent. Since stable range is a critical part of obsmarkers
>> discovery, unlocking this money to be spent on upstreaming stable range
>> would be a good idea (and fits its initial purpose). Paying for this kind
>> of work will reduce the contention with client work and help us, or others,
>> to dedicate time for it sooner rather than later.
>>
>
> I definitely agree that obsmarker discovery is a critical part. Pulling
> from `hg-committed` is sometimes slower than pulling from a repo (5-7x
> the size of hg-committed) whose server has thousands of heads.
>
> Do you have any updates on the stable-range cache? In the current state
> the cache is pretty big and a lot of people have faced problems with the
> cache size. Also, after strip and some other commands, it rebuilds the
> cache, which takes more than 10 minutes on a large repo and is definitely
> a bummer. Are you working on making it faster and smaller? How is the
> experimentation with evolve+stablerange going?
>
>
> # Regarding the cache-size:
>
> We know that the current version caches many entries that are trivial to
> compute and do not need to be cached. In addition, the current storage
> (SQLite) does not seem very efficient.
>
> So the next iteration of the cache should be significantly smaller.
>
>
> # Regarding cache invalidation:
>
> A lot of the data in the caches is an inherent property of the changeset
> and therefore immutable. It is easy to preserve it during strip to avoid
> having to recompute things from scratch. In addition, this immutable data
> should be exchanged during pull alongside the associated changesets, to
> avoid clients recomputing the same data over and over.
>
> The current implementation is an experimental/research implementation;
> all this should get smoothed directly in Core during the upstreaming.
>

I am a bit confused when you say "things should get smoothed directly in Core
during the upstreaming". Which one of the following did you mean:

1) send a patch of the current implementation to core and, once that patch
gets in, try to improve the implementation in core
2) send a series to core which contains a patch of the current implementation
and other patches improving the implementation

2) is the same as what Augie did for narrow, linelog, remotefilelog, Greg did
for sparse, and I did for infinitepush.

Which one do you mean here?
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: caching pull - stable partitioning of bundle requests

2019-02-19 Thread Boris FELD
On 15/02/2019 21:16, Pulkit Goyal wrote:
>
>
> On Thu, Oct 4, 2018 at 6:16 PM Boris FELD wrote:
>
> The road for moving this in Core is clear, but not short. So far
> we have not been able to free the necessary time to do it. Between
> the paying-client work we have to do to pay salaries (and keep
> users happy) and all the time we are already investing in the
> community, we are fairly busy.
>
>
> In early 2017, Bitbucket gave $13,500 to the Mercurial project to
> be spent on helping evolution move forward. As far as we know,
> this money is still unspent. Since stable range is a critical part
> of obsmarkers discovery, unlocking this money to be spent on
> upstreaming stable range would be a good idea (and fits its
> initial purpose). Paying for this kind of work will reduce the
> contention with client work and help us, or others, to dedicate
> time for it sooner rather than later.
>
>
> I definitely agree that obsmarker discovery is a critical part.
> Pulling from `hg-committed` is sometimes slower than pulling from a
> repo (5-7x the size of hg-committed) whose server has thousands of
> heads.
>
> Do you have any updates on the stable-range cache? In the current
> state the cache is pretty big and a lot of people have faced problems
> with the cache size. Also, after strip and some other commands, it
> rebuilds the cache, which takes more than 10 minutes on a large repo
> and is definitely a bummer. Are you working on making it faster and
> smaller? How is the experimentation with evolve+stablerange going?

# Regarding the cache-size:

We know that the current version caches many entries that are trivial to
compute and do not need to be cached. In addition, the current storage
(SQLite) does not seem very efficient.

So the next iteration of the cache should be significantly smaller.


# Regarding cache invalidation:

A lot of the data in the caches is an inherent property of the
changeset and therefore immutable. It is easy to preserve it during
strip to avoid having to recompute things from scratch. In addition,
this immutable data should be exchanged during pull alongside the
associated changesets, to avoid clients recomputing the same data over
and over.
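
To illustrate that invalidation story (a hedged sketch with a hypothetical
node-to-value schema, not the layout evolve actually uses), a cache of
immutable per-changeset values only needs two simple hooks:

    class ImmutableNodeCache(object):
        """Cache whose values are immutable properties of changesets."""

        def __init__(self):
            self._data = {}  # node -> immutable value

        def onstrip(self, strippednodes):
            # strip only invalidates the entries of the stripped
            # changesets; everything else stays valid and is kept
            for node in strippednodes:
                self._data.pop(node, None)

        def onpull(self, received):
            # values shipped by the server alongside the changesets can
            # be trusted forever, so the client never recomputes them
            self._data.update(received)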

The current implementation is an experimental/research implementation;
all this should get smoothed directly in Core during the upstreaming.


# Regarding cache-computation speed:

The current implementation is a "research" version written in Python; it
is not geared toward efficiency and contains a lot of indirections that
were helpful to reach the current solution but are now getting in the
way of performance.

The initial implementation (in Evolve) focused on finding a solution
with good scaling properties (good computational and space complexity).
However, we did not spend much time improving the "constant" factor.
Now that we know where we are headed, we can build a much better
implementation.

Once we have better on-disk storage, native code and client/server
exchange of most of the data, the impact of stable-range should drop to
a negligible level.


# Regarding what's next:

The experimental implementation cleared the unknowns around stable-range
computation and caching. However, even if the road is clear, a sizable
amount of work remains, especially to move away from the unsuitable
SQLite storage. We think that putting the Bitbucket donation to use is
the best way to make sure this work gets done soon.
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: caching pull - stable partitioning of bundle requests

2019-02-15 Thread Pulkit Goyal
On Thu, Oct 4, 2018 at 6:16 PM Boris FELD wrote:

> The road for moving this in Core is clear, but not short. So far we have
> not been able to free the necessary time to do it. Between the
> paying-client work we have to do to pay salaries (and keep users happy)
> and all the time we are already investing in the community, we are
> fairly busy.
>
> In early 2017, Bitbucket gave $13,500 to the Mercurial project to be
> spent on helping evolution move forward. As far as we know, this money is
> still unspent. Since stable range is a critical part of obsmarkers
> discovery, unlocking this money to be spent on upstreaming stable range
> would be a good idea (and fits its initial purpose). Paying for this kind
> of work will reduce the contention with client work and help us, or others,
> to dedicate time for it sooner rather than later.
>

I definitely agree that obsmarker discovery is a critical part. Pulling
from `hg-committed` is sometimes slower than pulling from a repo (5-7x the
size of hg-committed) whose server has thousands of heads.

Do you have any updates on the stable-range cache? In the current state the
cache is pretty big and a lot of people have faced problems with the cache
size. Also, after strip and some other commands, it rebuilds the cache,
which takes more than 10 minutes on a large repo and is definitely a
bummer. Are you working on making it faster and smaller? How is the
experimentation with evolve+stablerange going?
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: caching pull - stable partitioning of bundle requests

2018-10-08 Thread Erik van Zijst
On Thu, Oct 4, 2018 at 8:16 AM Boris FELD wrote:
> This slicing is based on the "stablerange" algorithm, the same as the one
> used to bisect obsmarkers during obsmarkers discovery.
>
> The road for moving this in Core is clear, but not short. So far we have not
> been able to free the necessary time to do it. Between the paying-client
> work we have to do to pay salaries (and keep users happy) and all the time
> we are already investing in the community, we are fairly busy.
>
> In early 2017, Bitbucket gave $13,500 to the Mercurial project to be spent
> on helping evolution move forward. As far as we know, this money is still
> unspent. Since stable range is a critical part of obsmarkers discovery,
> unlocking this money to be spent on upstreaming stable range would be a good
> idea (and fits its initial purpose). Paying for this kind of work will
> reduce the contention with client work and help us, or others, to dedicate
> time for it sooner rather than later.

Safe rebasing remains a priority for Bitbucket, and its absence is the
biggest reason Mercurial has lost feature parity with Git on Bitbucket
over the past years as we added online merge strategies, squashing and
rebasing.

While smf worked hard to add Evolve as an experimental labs feature,
along with pull request squash/rebasing, we sadly had to remove it
after customers opted in without properly realizing that Evolve is not
a Core feature and requires local installation of the extension across
all clients, plunging their workflows and repos into disarray.

Early access to non-core functionality is fine for skilled developers
thoroughly familiar with their tools, but has proved to be unsuitable
for average users, and so as long as things don't land in Core they
remain out of reach for everyone. This was the motivation behind the
donation: to get Evolve, or at least the obsmarker exchange, into Core
and enabled by default. Only then can Bitbucket start to offer
rebasing workflows, regain parity with Git and position Mercurial as
the superior VCS it should be.

We sent the donation through the SFC to leave the final selection of
Evolve contributors at the discretion of the steering committee. We
had pressed for speedy allocation to those contributing materially to
Evolve's development, yet 18 months later it still has not been spent.
Even so, development of Evolve has seen substantial progress and it
seems fair to allocate the funds accordingly. The fact that the work
on obsmarker discovery is now being leveraged to extend clonebundles
to regular, non-clone pulls seems to further justify applying the
money towards this work.

Cheers,
Erik
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: caching pull - stable partitioning of bundle requests

2018-10-04 Thread Boris FELD

On 27/09/2018 18:15, Gregory Szorc wrote:
> On Wed, Sep 26, 2018 at 11:13 AM Boris FELD wrote:
>
> Hi everyone,
>
> Pulling from a server involves expensive server-side computation
> that we wish to cache. However, since the client can pull any
> arbitrary set of revisions, grouping and dispatching the data to be
> cached is a hard problem.
>
> When we implemented the new discovery for obsolescence markers, we
> developed a "stablerange" method to build an efficient way to
> slice the changesets graph into ranges. In addition to solving the
> obsolescence markers discovery problem, this "stablerange"
> principle seemed to be useful for more usages, in particular, the
> caching of pulls.
>
> Right now, with the current pull bundle implementation, here is
> how it works: you manually create and manually declare bundles
> containing either all changesets (that could also be used for
> clone bundles) or more specific ones. When the client requests some
> changesets, the server searches for a bundle containing the needed
> range and sends it. This often involves more than the requested
> data. The client needs to filter out the extraneous data. Then the
> client does a discovery to catch any missing changesets from the
> bundle. If the server doesn't find a valid pull bundle, a normal
> discovery is done. The manual bundle management is suboptimal, the
> search for appropriate bundles has bad complexity, and the extra
> roundtrip and discovery add extra slowness.
>
> This weekend, we built a "simple" prototype that uses "stablerange"
> to slice the changegroup request in "getbundle" into multiple
> bundles that can be reused from one pull to another. That slicing
> happens as part of a normal pull, during the getbundle call and
> after the normal discovery happens. There is no need for an
> extra discovery and getbundle call after it.
>
> With this"stablerange"based strategy,we start from the set of
> requested changesets to generate a set of "standard" range
> covering all of them. This slicing has a good algorithmic
> complexity that depends on the size of the selected "missing" set
> of changesets. So the associated cost of will scale well with the
> size of the associated pull. In addition, we no longer have to do
> an expensive search into a list existing bundles. This helps to
> scale small pulls and increase the number of bundles we can cache,
> as the time we spend selecting bundle no longer depends on the
> numbers of cached ones. Since we can exactly cover the client
> request, we also no longer need to issue anextra pull roundtrip
> after the cache retrieval.
>
> That slicing focuses on producing ranges that:
>
>   * Have a high chance to be reusable in a pull selecting similar
> changesets,
>
>   * Gather most of the changesets in large bundles.
>
>
> This caching strategy inherits the nice "stablerange" properties
> regarding repository growth:
>
>   * When a few changesets are appended to a repository, only a few
> ranges have to be added.
>
>   * The overall number of ranges (and associated bundles) to
> create to represent all possible ranges has an O(N log(N))
> complexity.
>
>
> For example, here are the 15 ranges selected for a full clone of
> mozilla-central:
>
>   * 262114 changesets
>
>   * 30 changesets
>
>   * 65536 changesets
>
>   * 32741 changesets
>
>   * 20 changesets
>
>   * 7 changesets
>
>   * 8192 changesets
>
>   * 243 changesets
>
>   * 13 changesets
>
>   * 114 changesets
>
>   * 14 changesets
>
>   * 32 changesets
>
>   * 16 changesets
>
>   * 8 changesets
>
>   * 1 changesets
>
>
> If we only clone a subset of the repository, the larger ranges get
> reused (hg clone --rev -5000):
>
>   * 262114 changesets found in caches
>
>   * 30 changesets found in caches
>
>   * 65536 changesets found in caches
>
>   * 32741 changesets found in caches
>
>   * 20 changesets found in caches
>
>   * 7 changesets found in caches
>
>   * 2048 changesets found
>
>   * 1024 changesets found
>
>   * 482 changesets found
>
>   * 30 changesets found
>
>   * 32 changesets found
>
>   * 1 changesets found
>
>   * 7 changesets found
>
>   * 4 changesets found
>
>   * 2 changesets found
>
>   * 1 changesets found
>
> As you can see, the larger ranges of this second pull are common
> with the previous pull, allowing cached bundles to be reused.
>
> The prototype is available in a small "pullbundle" extension. It
> focuses on the slicing itself and we did not implement anything
> fancy for the cache storage and delivery. We simply store the
> generated bundles on disk and read them from disk when needed again.

Re: caching pull - stable partitioning of bundle requests

2018-09-27 Thread Gregory Szorc
On Wed, Sep 26, 2018 at 11:13 AM Boris FELD wrote:

> Hi everyone,
>
> Pulling from a server involves expensive server-side computation that we
> wish to cache. However, since the client can pull any arbitrary set of
> revisions, grouping and dispatching the data to be cached is a hard
> problem.
>
> When we implemented the new discovery for obsolescence markers, we
> developed a "stablerange" method to build an efficient way to slice the
> changesets graph into ranges. In addition to solving the obsolescence
> markers discovery problem, this "stablerange" principle seemed to be useful
> for more usages, in particular, the caching of pulls.
>
> Right now, with the current pull bundle implementation, here is how it
> works: you manually create and manually declare bundles containing either
> all changesets (that could also be used for clone bundles) or more specific
> ones. When the client requests some changesets, the server searches for a
> bundle containing the needed range and sends it. This often involves more
> than the requested data. The client needs to filter out the extraneous
> data. Then the client does a discovery to catch any missing changesets
> from the bundle. If the server doesn't find a valid pull bundle, a normal
> discovery is done. The manual bundle management is suboptimal, the search
> for appropriate bundles has bad complexity, and the extra roundtrip and
> discovery add extra slowness.
>
> This weekend, we built a "simple" prototype that uses "stablerange" to
> slice the changegroup request in "getbundle" into multiple bundles that
> can be reused from one pull to another. That slicing happens as part of a
> normal pull, during the getbundle call and after the normal discovery
> happens. There is no need for an extra discovery and getbundle call after
> it.
>
> With this "stablerange" based strategy, we start from the set of
> requested changesets to generate a set of "standard" range covering all
> of them. This slicing has a good algorithmic complexity that depends on the
> size of the selected "missing" set of changesets. So the associated cost of
> will scale well with the size of the associated pull. In addition, we no
> longer have to do an expensive search into a list existing bundles. This
> helps to scale small pulls and increase the number of bundles we can cache,
> as the time we spend selecting bundle no longer depends on the numbers of
> cached ones. Since we can exactly cover the client request, we also no
> longer need to issue an extra pull roundtrip after the cache retrieval.
>
> That slicing focuses on producing ranges that:
>
>- Have a high chance to be reusable in a pull selecting similar
>changesets,
>
>
>- Gather most of the changesets in large bundles.
>
>
> This caching strategy inherits the nice "stablerange" properties regarding
> repository growth:
>
>- When a few changesets are appended to a repository, only a few
>ranges have to be added.
>
>
>- The overall number of ranges (and associated bundles) to create to
>represent all possible ranges has an O(N log(N)) complexity.
>
>
> For example, here are the 15 ranges selected for a full clone of
> mozilla-central:
>
>
>- 262114 changesets
>- 30 changesets
>- 65536 changesets
>- 32741 changesets
>- 20 changesets
>- 7 changesets
>- 8192 changesets
>- 243 changesets
>- 13 changesets
>- 114 changesets
>- 14 changesets
>- 32 changesets
>- 16 changesets
>- 8 changesets
>- 1 changesets
>
> If we only clone a subset of the repository, the larger ranges get reused
> (hg clone --rev -5000):
>
>- 262114 changesets found in caches
>- 30 changesets found in caches
>- 65536 changesets found in caches
>- 32741 changesets found in caches
>- 20 changesets found in caches
>- 7 changesets found in caches
>- 2048 changesets found
>- 1024 changesets found
>- 482 changesets found
>- 30 changesets found
>- 32 changesets found
>- 1 changesets found
>- 7 changesets found
>- 4 changesets found
>- 2 changesets found
>- 1 changesets found
>
> As you can see, the larger ranges of this second pull are common with the
> previous pull, allowing cached bundles to be reused.
>
> The prototype is available in a small "pullbundle" extension. It focuses
> on the slicing itself and we did not implement anything fancy for the cache
> storage and delivery. We simply store the generated bundles on disk and we
> read them from disk when needed again. Others, like Joerg Sonnenberger or
> Gregory Szorc, are already working on the "cache delivery" problem.
>
> We are getting good results out of that prototype when testing it on
> clones of mozilla-central and netbsd-src. See the "Example Result" section
> for details.
>
> The prototype is up and running on our hgweb "mirror" instance
> https://mirror.octobus.net/.

Re: caching pull - stable partitioning of bundle requests

2018-09-26 Thread Joerg Sonnenberger
On Wed, Sep 26, 2018 at 08:13:13PM +0200, Boris FELD wrote:
> Then the client does a discovery to catch any missing changesets from
> the bundle. If the server doesn't find a valid pull bundle, a normal
> discovery is done. The manual bundle management is suboptimal, the
> search for appropriate bundles has bad complexity and the extra
> roundtrip and discovery add extra slowness.

I don't think this classification is correct. The discovery phase is run
once, to find the common revisions and the missing heads. The server
decides based on that if it wants to send a prebuilt bundle. The client
updates the sets *without* running discovery again and just asks the
server again. As such, the overhead is one additional roundtrip per
bundle. The searching for appropriate bundles is currently unnecessarily
slow, because the necessary queries are unnecessarily slow.
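
In other words, the flow is roughly as follows (a sketch with
illustrative names, not the actual wire-protocol calls):

    def pull(client, server):
        # discovery runs exactly once
        common, missingheads = client.finddiscovery(server)
        while missingheads:
            bundle = server.getbundle(common=common, heads=missingheads)
            client.apply(bundle)
            # the server may have answered with a prebuilt bundle that
            # covers only part of the request; update the sets locally
            # and ask again, without re-running discovery
            common, missingheads = client.updatesets(common,
                                                     missingheads, bundle)

Each iteration after the first is the one extra roundtrip per prebuilt
bundle.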

All that said, I will look at the stableranges and, more importantly, how
they deal with various situations.

Joerg
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


caching pull - stable partitioning of bundle requests

2018-09-26 Thread Boris FELD

Hi everyone,

Pulling from a server involves expensive server-side computation that we 
wish to cache. However, since the client can pull any arbitrary set of 
revisions, grouping and dispatching the data to be cached is a hard problem.


When we implemented the new discovery for obsolescence markers, we 
developed a "stablerange" method to build an efficient way to slice the 
changesets graph into ranges. In addition to solving the obsolescence 
markers discovery problem, this "stablerange" principle seemed to be 
useful for more usages, in particular, the caching of pulls.


Right now, with the current pull bundle implementation, here is how it 
works: you manually create and manually declare bundles containing either 
all changesets (that could also be used for clone bundles) or more 
specific ones. When the client requests some changesets, the server 
searches for a bundle containing the needed range and sends it. This often 
involves more than the requested data. The client needs to filter out 
the extraneous data. Then the client does a discovery to catch any 
missing changesets from the bundle. If the server doesn't find a valid 
pull bundle, a normal discovery is done. The manual bundle management is 
suboptimal, the search for appropriate bundles has bad complexity, and 
the extra roundtrip and discovery add extra slowness.


This weekend, we built a "simple" prototype that uses "stablerange" to 
slice the changegroup request in "getbundle" into multiple bundles that 
can be reused from one pull to another. That slicing happens as part of a 
normal pull, during the getbundle call and after the normal discovery 
happens. There is no need for an extra discovery and getbundle call 
after it.


With this"stablerange"based strategy,we start from the set of requested 
changesets to generate a set of "standard" range covering all of them. 
This slicing has a good algorithmic complexity that depends on the size 
of the selected "missing" set of changesets. So the associated cost of 
will scale well with the size of the associated pull. In addition, we no 
longer have to do an expensive search into a list existing bundles. This 
helps to scale small pulls and increase the number of bundles we can 
cache, as the time we spend selecting bundle no longer depends on the 
numbers of cached ones. Since we can exactly cover the client request, 
we also no longer need to issue anextra pull roundtrip after the cache 
retrieval.


That slicing focuses on producing ranges that:

 * Have a high chance to be reusable in a pull selecting similar
   changesets,

 * Gather most of the changesets in large bundles.


This caching strategy inherits the nice "stablerange" properties 
regarding repository growth:


 * When a few changesets are appended to a repository, only a few
   ranges have to be added.

 * The overall number of ranges (and associated bundles) to create to
   represent all possible ranges has an O(N log(N)) complexity, as
   sketched below.
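
To give an intuition of where that complexity comes from, here is a
hedged sketch for the simple case of a linear history, where "standard"
ranges behave like the classic aligned power-of-two decomposition. This
is only an illustration; the real stablerange algorithm must also deal
with merges:

    def standardslices(start, stop):
        """Slice the revision span [start, stop) into ranges whose size
        is a power of two and whose start is a multiple of that size.

        Any span decomposes into O(log N) such ranges, and only
        O(N log(N)) distinct ranges exist overall, so a bundle cached
        for one pull has a high chance of being reused by the next.
        """
        slices = []
        cur = start
        while cur < stop:
            # largest power of two that `cur` is aligned on
            align = cur & -cur or stop - cur
            size = min(align, stop - cur)
            size = 1 << (size.bit_length() - 1)  # round down to 2**k
            slices.append((cur, cur + size))
            cur += size
        return slices

For instance, standardslices(0, 300000) yields a 262144-changeset range
followed by quickly shrinking ones, and standardslices(0, 305000) reuses
the largest of those ranges, mirroring the listings below.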


For example, here are the 15 ranges selected for a full clone of 
mozilla-central:


 * 262114 changesets

 * 30 changesets

 * 65536 changesets

 * 32741 changesets

 * 20 changesets

 * 7 changesets

 * 8192 changesets

 * 243 changesets

 * 13 changesets

 * 114 changesets

 * 14 changesets

 * 32 changesets

 * 16 changesets

 * 8 changesets

 * 1 changesets


If we only clone a subset of the repository, the larger ranges get 
reused (hg clone --rev -5000):


 * 262114 changesets found in caches

 * 30 changesets found in caches

 * 65536 changesets found in caches

 * 32741 changesets found in caches

 * 20 changesets found in caches

 * 7 changesets found in caches

 * 2048 changesets found

 * 1024 changesets found

 * 482 changesets found

 * 30 changesets found

 * 32 changesets found

 * 1 changesets found

 * 7 changesets found

 * 4 changesets found

 * 2 changesets found

 * 1 changesets found

As you can see, the larger ranges of this second pull are common with 
the previous pull, allowing cached bundles to be reused.


The prototype is available in a small "pullbundle" extension. It focuses 
on the slicing itself and we did not implement anything fancy for the 
cache storage and delivery. We simply store the generated bundles on disk 
and read them from disk when needed again. Others, like Joerg 
Sonnenberger or Gregory Szorc, are already working on the "cache 
delivery" problem.


We are getting good results out of that prototype when testing it on 
clones of mozilla-central and netbsd-src. See the "Example Result" section 
for details.


The prototype is up and running on our hgweb "mirror" instance 
https://mirror.octobus.net/.


The extension comes with a small debug command that produces statistics 
about the ranges that multiple random pulls would use.


The "stablerange" implementation currently still[1] live in the evolve 
extensions, so we put the extensions in the same repository for 
simplicity as "pullbundle". This is not