Re: RFC: bitmap storage for precursors and phases

2017-03-07 Thread Pierre-Yves David



On 02/18/2017 04:51 AM, Jun Wu wrote:

Excerpts from Stanislau Hlebik's message of 2017-02-17 16:06:33 +:

This is implementation of two caches (nonpublic + precursor) using
serialized sorted lists and sets
https://bitbucket.org/stashlebik/hg/commits/99879579ac2848a2567810b677d8344150a7b319?at=hiddenbitmaps_lists


I had a quick look. Here are my suggestions:

1. Prefer caching at a lower-level, so they get more widely used.


+1 on that,



   Practically, do not change localrepo.py, change phasecache.py and
   obsolete.py instead.

   I happened to do some refactoring in phasecache [1]. I'll send a V2 soon,
   since the change seems related.

   Note: phasecache is some C code so you need some data to prove the change
   is worthy.

2. The "Python set" serialization is already in repoview.py. Reuse it.

   Maybe move "_writehiddencache" to scmutil, or a new simple class.
   "tryreadcache" could be changed and then moved.

   Then the functions could be reused in obsolete.getrevs.

[1]: 
https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-February/092817.html
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel



--
Pierre-Yves David
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-03-07 Thread Pierre-Yves David



On 02/14/2017 02:04 AM, Sean Farley wrote:

Jun Wu  writes:


In general, I think this is a good direction. Some random thoughts:

  - general purposed

I think the bitmap is not always a cache, so it should only have
operations like set/unset/readfromdisk/writetodisk. Practically, I won't
couple cache invalidation with the bitmap implementation.

In additional, I'll try to avoid using Python-only types in the
interface. So once we decide to rewrite part of the implementation in
native C, we won't have trouble.

See "revset" below for a possibility that bitmap is used as a non-set.

  - revset

This is a possibility that probably won't happen any time soon.

The revset currently uses Python set for maintaining its state. For huge
sets, Python sets may not be a good option. And various operations could
benefit from an always-topologically-sorted set, which is the bitmap.

  - mmap

My intuition is that bitmaps fit better with mmap which can reduce the
reading loading cost. I think "vfs.mmapread" could be a thing, and
various places can benefit from it - Gabor recently showed interest in
loading revlog data by mmap, I had patches that uses mmap to read revlog
index.

In additional, not directly related to this series, I'm a big fan of
single direction data flow. But the current code base does not seem to do a
good job in this area. As we are adding more caching layers to the code
base, it'd be nice if we have some tiny framework maintaining the dependency
of all kinds of data, to be able to understand the data flow easily, and
just to be more confident about loading orders. I think people more
experienced on architecture may want to share some ideas here.


I was thinking about a more high-level approach (please feel free to
pick apart):

r = repo.filtered("bitmap1")
r2 = r.filtered("bitmap2")

So that r2 would be an intersection of bitmap1 and bitmap2 (haven't
thought about a union nor the inverse).



This double filtering idea is interresting. maybe we could have the 
'repoview' API understant repo.filtered("foo+bar") as a combination of 
filtering of foo+bar. The smart part of repoview (eg: filter hierarchy 
for cache inheritance, cache key, etc) should be able to automatically 
compute what do to for a combinaison.


Exposing the bitmap at that level seems strange. I think it is better to 
have the internal implementation of the filtering rely on a bitmat than 
to have the repository/repoview API to expose bitmap directly.


Cheers,

--
Pierre-Yves David
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-21 Thread Jun Wu
Excerpts from Sean Farley's message of 2017-02-21 15:45:44 -0800:
> Augie Fackler  writes:
> 
> > On Fri, Feb 17, 2017 at 07:14:12PM -0800, Jun Wu wrote:
> >> Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> >> > I think there are multiple topics being discussed:
> >> >
> >> >   1. How to solve the overhead loading hiddenrevs with minimal changes?
> >> >   2. Why is the bitmap format interesting (future use-cases)?
> >> >
> >> For 2,
> >>
> >>   len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think
> >>   about solutions that scale longer and are not too complex to build. That
> >>   may or may not be a good habit.
> >>
> >
> > For what it's worth, my clone of hg has 12716 hidden revisions - about
> > a quarter of the entire repo history is hidden.
> 
> Actually, that's 40%. Mine is a bit over 25%.

I think you have incorrectly excluded hidden revs from repo history:

12716.0 / (len(repo) - len(repo.changelog.filteredrevs) + 12716) = 28.8% # 
correct
12716.0 / (len(repo) - len(repo.changelog.filteredrevs)) = 40.3% # incorrect

Yours should be around 20%:

8000.0 / (len(repo) - len(repo.changelog.filteredrevs) + 8000) = 20.3%
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-21 Thread Sean Farley
Augie Fackler  writes:

> On Fri, Feb 17, 2017 at 07:14:12PM -0800, Jun Wu wrote:
>> Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
>> > I think there are multiple topics being discussed:
>> >
>> >   1. How to solve the overhead loading hiddenrevs with minimal changes?
>> >   2. Why is the bitmap format interesting (future use-cases)?
>> >
>> For 2,
>>
>>   len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think
>>   about solutions that scale longer and are not too complex to build. That
>>   may or may not be a good habit.
>>
>
> For what it's worth, my clone of hg has 12716 hidden revisions - about
> a quarter of the entire repo history is hidden.

Actually, that's 40%. Mine is a bit over 25%.
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-21 Thread Sean Farley
Jun Wu  writes:

> Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
>> On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu  wrote:
>> 
>> > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +:
>> > > As I said before we will load all non-public revs in one set and all
>> >
>> > The problem is, loading a Python set from disk is O(size-of-the-set).
>> >
>> > Bitmap's loading cost should be basically 0 (with mmap). I think that's why
>> > we want bitmap at the first place. There are other choices like packfile
>> > index, hash tables, but bitmap is the simplest and most efficient.
>> >
>> 
>> Hey folks,
>> 
>> I haven't yet seen mention of some considerations that seem very important
>> in driving the decision-making, so I'd appreciate it if someone could fill
>> me in.
>> 
>> Firstly, what's our current understanding of the sizes and compositions of
>> these sets of numbers? In theory, we have a lot of data from practical
>> application at Facebook, but nobody's brought it up.
>> 
>> To clarify why this matters, if an obsstore contains say 100 entries, then
>> a very naive implementation should be fine. If it's related to something
>> else, such as the size of, amount of activity in, or local age of a repo,
>> then maybe we have a different set of decisions to make.
>
> First, I agree that if N is small, O(N) or O(log(N)), or O(1), do not matter
> much in practice. I tend to always reason about future-proof solutions,
> which may or may not be a good habit - I'll adjust in the future.
>
> I think there are multiple topics being discussed:
>
>   1. How to solve the overhead loading hiddenrevs with minimal changes?
>   2. Why is the bitmap format interesting (future use-cases)?
>
> For 1,
>
>   I once tried to make one bit of revlog flag as the source of truth of
>   "hidden", which looks a nice solution. But it got rejected because revlog
>   has to be append-only. Moving the flag bits out formed the bitmap idea
>   naturally.
>
>   Bitmap fits here but it does not mean we have to use it to solve this
>   particular problem. For example, stateful chg will get us some easy wins,
>   and see the problem analysis and proposed solution below.
>
>   Note that we actually have a simple caching format for the "hiddenrevs".
>   See repoview.py, computehidden.
>
>   But what's wrong with "computehidden" is the choice of the cache key - it
>   calls "hideablerevs", which takes 0.5 seconds in obsolete.getrevs(repo,
>   'obsolete'), which reads the giant obsstore (in hg-committed) and do some
>   complex computation. The result is then hashed and used as a cache key of
>   "computehidden".
>
>   So, if we do not want to know the "obsolete()" revisions, but just want to
>   know what's hidden. Changing the cache key would be a very straightforward
>   and effective solution which does not require a new file format.
>
>   That'll help commands like "hg st", "hg log -T foo", etc. But it won't
>   help complex "log"s which do require the "obsolete()" revisions. To speed
>   up the latter, we can probably just copy the caching infrastructure from
>   repoview.py.
>
> For 2,
>
>   len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think

Another data point for my hg-committed:

len(filteredrevs) > 8k

len(obs markers) > 120k
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-20 Thread Stanislau Hlebik
Excerpts from Augie Fackler's message of 2017-02-19 21:06:53 -0500:
> On Fri, Feb 17, 2017 at 09:59:48PM +, Stanislau Hlebik wrote:
> > Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> > > On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu  wrote:
> > >
> > > > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +:
> > > > > As I said before we will load all non-public revs in one set and all
> > > >
> > > > The problem is, loading a Python set from disk is O(size-of-the-set).
> > > >
> > > > Bitmap's loading cost should be basically 0 (with mmap). I think that's 
> > > > why
> > > > we want bitmap at the first place. There are other choices like packfile
> > > > index, hash tables, but bitmap is the simplest and most efficient.
> > > >
> > >
> > > Hey folks,
> > >
> > > I haven't yet seen mention of some considerations that seem very important
> > > in driving the decision-making, so I'd appreciate it if someone could fill
> > > me in.
> > >
> > > Firstly, what's our current understanding of the sizes and compositions of
> > > these sets of numbers? In theory, we have a lot of data from practical
> > > application at Facebook, but nobody's brought it up.
> >
> > I assume that both sets (set for nonpublic commits and set for
> > obsstore) are going to be very small comparing to the repo size. I
> > expect both sets < 1% of the repo size. And the sets is going to be
> > sparse.
> 
> I replied elsewhere in the thread, but in my clone of hg it's on the
> order of 25-30% of the history, so assuming it's going to be very
> sparse is probably unwise.

In that case it's better to use bitmaps. But to do it we need to get rid
of filteredrevs iteration in scmutil.filteredhash() function.
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-19 Thread Augie Fackler
On Fri, Feb 17, 2017 at 09:59:48PM +, Stanislau Hlebik wrote:
> Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> > On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu  wrote:
> >
> > > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +:
> > > > As I said before we will load all non-public revs in one set and all
> > >
> > > The problem is, loading a Python set from disk is O(size-of-the-set).
> > >
> > > Bitmap's loading cost should be basically 0 (with mmap). I think that's 
> > > why
> > > we want bitmap at the first place. There are other choices like packfile
> > > index, hash tables, but bitmap is the simplest and most efficient.
> > >
> >
> > Hey folks,
> >
> > I haven't yet seen mention of some considerations that seem very important
> > in driving the decision-making, so I'd appreciate it if someone could fill
> > me in.
> >
> > Firstly, what's our current understanding of the sizes and compositions of
> > these sets of numbers? In theory, we have a lot of data from practical
> > application at Facebook, but nobody's brought it up.
>
> I assume that both sets (set for nonpublic commits and set for
> obsstore) are going to be very small comparing to the repo size. I
> expect both sets < 1% of the repo size. And the sets is going to be
> sparse.

I replied elsewhere in the thread, but in my clone of hg it's on the
order of 25-30% of the history, so assuming it's going to be very
sparse is probably unwise.
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-19 Thread Augie Fackler
On Fri, Feb 17, 2017 at 07:14:12PM -0800, Jun Wu wrote:
> Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> > I think there are multiple topics being discussed:
> >
> >   1. How to solve the overhead loading hiddenrevs with minimal changes?
> >   2. Why is the bitmap format interesting (future use-cases)?
> >
> For 2,
>
>   len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think
>   about solutions that scale longer and are not too complex to build. That
>   may or may not be a good habit.
>

For what it's worth, my clone of hg has 12716 hidden revisions - about
a quarter of the entire repo history is hidden.
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-18 Thread Stanislau Hlebik
Excerpts from Jun Wu's message of 2017-02-17 19:14:12 -0800:
> Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> > On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu  wrote:
> > 
> > > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +:
> > > > As I said before we will load all non-public revs in one set and all
> > >
> > > The problem is, loading a Python set from disk is O(size-of-the-set).
> > >
> > > Bitmap's loading cost should be basically 0 (with mmap). I think that's 
> > > why
> > > we want bitmap at the first place. There are other choices like packfile
> > > index, hash tables, but bitmap is the simplest and most efficient.
> > >
> > 
> > Hey folks,
> > 
> > I haven't yet seen mention of some considerations that seem very important
> > in driving the decision-making, so I'd appreciate it if someone could fill
> > me in.
> > 
> > Firstly, what's our current understanding of the sizes and compositions of
> > these sets of numbers? In theory, we have a lot of data from practical
> > application at Facebook, but nobody's brought it up.
> > 
> > To clarify why this matters, if an obsstore contains say 100 entries, then
> > a very naive implementation should be fine. If it's related to something
> > else, such as the size of, amount of activity in, or local age of a repo,
> > then maybe we have a different set of decisions to make.
> 
> First, I agree that if N is small, O(N) or O(log(N)), or O(1), do not matter
> much in practice. I tend to always reason about future-proof solutions,
> which may or may not be a good habit - I'll adjust in the future.
> 
> I think there are multiple topics being discussed:
> 
>   1. How to solve the overhead loading hiddenrevs with minimal changes?
>   2. Why is the bitmap format interesting (future use-cases)?
> 
> For 1,
> 
>   I once tried to make one bit of revlog flag as the source of truth of
>   "hidden", which looks a nice solution. But it got rejected because revlog
>   has to be append-only. Moving the flag bits out formed the bitmap idea
>   naturally.
> 
>   Bitmap fits here but it does not mean we have to use it to solve this
>   particular problem. For example, stateful chg will get us some easy wins,
>   and see the problem analysis and proposed solution below.
> 
>   Note that we actually have a simple caching format for the "hiddenrevs".
>   See repoview.py, computehidden.
> 
>   But what's wrong with "computehidden" is the choice of the cache key - it
>   calls "hideablerevs", which takes 0.5 seconds in obsolete.getrevs(repo,
>   'obsolete'), which reads the giant obsstore (in hg-committed) and do some
>   complex computation. The result is then hashed and used as a cache key of
>   "computehidden".
> 
>   So, if we do not want to know the "obsolete()" revisions, but just want to
>   know what's hidden. Changing the cache key would be a very straightforward
>   and effective solution which does not require a new file format.

Hm, changing the cache key to smth like phaseroots + obsstore +
changelog mtime + size  may be an easy and good win. That's a good idea.
Although it will still get invalidated after each write.

> 
>   That'll help commands like "hg st", "hg log -T foo", etc. But it won't
>   help complex "log"s which do require the "obsolete()" revisions. To speed
>   up the latter, we can probably just copy the caching infrastructure from
>   repoview.py.
> 
> For 2,
> 
>   len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think
>   about solutions that scale longer and are not too complex to build. That
>   may or may not be a good habit.
>   
>   That said, I agree if the set is small and perf, then whatever easier to
>   implement makes sense.
>   
>   I have some crazy ideas about the future where a giant bitmap is
>   practically useful. For example, source controlling "visible heads" and
>   the obsstore, then adding some C code to quickly fill a bitmap (requires
>   "set" and "setancestors/descendants"), then we can show the user whenever
>   their repo used to look like back in any time by using the bitmap filter.
>   That'll help investigate hard issues, and implement "hg undo".
> 
> > Secondly, what operations are being performed on these sets today, and do
> > we know if those operations are going to change?
> >
> > For instance, if a may be extended via new elements added to its end, then
> > an mmap-based implementation immediately becomes awkward. If a set is
> 
> Not that bad if we preallocate space.
> 
> > traversed or we perform operations like union or intersection, then a
> > bitmap might make sense if the set is very small and dense, but not if it's
> > big and sparse.
> 
> Not necessarily true if we use an advanced variant, like the "roaringbitmap"
> mentioned by you last year [1].
> 
> [1]: 
> https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087864.html
> 
> > (Jun's idea that mmap is effectively free is also not borne out in
> > 

Re: RFC: bitmap storage for precursors and phases

2017-02-17 Thread Jun Wu
Excerpts from Stanislau Hlebik's message of 2017-02-17 16:06:33 +:
> This is implementation of two caches (nonpublic + precursor) using
> serialized sorted lists and sets
> https://bitbucket.org/stashlebik/hg/commits/99879579ac2848a2567810b677d8344150a7b319?at=hiddenbitmaps_lists

I had a quick look. Here are my suggestions:

1. Prefer caching at a lower-level, so they get more widely used.

   Practically, do not change localrepo.py, change phasecache.py and
   obsolete.py instead.

   I happened to do some refactoring in phasecache [1]. I'll send a V2 soon,
   since the change seems related.

   Note: phasecache is some C code so you need some data to prove the change
   is worthy.

2. The "Python set" serialization is already in repoview.py. Reuse it.

   Maybe move "_writehiddencache" to scmutil, or a new simple class.
   "tryreadcache" could be changed and then moved.

   Then the functions could be reused in obsolete.getrevs.

[1]: 
https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-February/092817.html
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-17 Thread Jun Wu
Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu  wrote:
> 
> > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +:
> > > As I said before we will load all non-public revs in one set and all
> >
> > The problem is, loading a Python set from disk is O(size-of-the-set).
> >
> > Bitmap's loading cost should be basically 0 (with mmap). I think that's why
> > we want bitmap at the first place. There are other choices like packfile
> > index, hash tables, but bitmap is the simplest and most efficient.
> >
> 
> Hey folks,
> 
> I haven't yet seen mention of some considerations that seem very important
> in driving the decision-making, so I'd appreciate it if someone could fill
> me in.
> 
> Firstly, what's our current understanding of the sizes and compositions of
> these sets of numbers? In theory, we have a lot of data from practical
> application at Facebook, but nobody's brought it up.
> 
> To clarify why this matters, if an obsstore contains say 100 entries, then
> a very naive implementation should be fine. If it's related to something
> else, such as the size of, amount of activity in, or local age of a repo,
> then maybe we have a different set of decisions to make.

First, I agree that if N is small, O(N) or O(log(N)), or O(1), do not matter
much in practice. I tend to always reason about future-proof solutions,
which may or may not be a good habit - I'll adjust in the future.

I think there are multiple topics being discussed:

  1. How to solve the overhead loading hiddenrevs with minimal changes?
  2. Why is the bitmap format interesting (future use-cases)?

For 1,

  I once tried to make one bit of revlog flag as the source of truth of
  "hidden", which looks a nice solution. But it got rejected because revlog
  has to be append-only. Moving the flag bits out formed the bitmap idea
  naturally.

  Bitmap fits here but it does not mean we have to use it to solve this
  particular problem. For example, stateful chg will get us some easy wins,
  and see the problem analysis and proposed solution below.

  Note that we actually have a simple caching format for the "hiddenrevs".
  See repoview.py, computehidden.

  But what's wrong with "computehidden" is the choice of the cache key - it
  calls "hideablerevs", which takes 0.5 seconds in obsolete.getrevs(repo,
  'obsolete'), which reads the giant obsstore (in hg-committed) and do some
  complex computation. The result is then hashed and used as a cache key of
  "computehidden".

  So, if we do not want to know the "obsolete()" revisions, but just want to
  know what's hidden. Changing the cache key would be a very straightforward
  and effective solution which does not require a new file format.

  That'll help commands like "hg st", "hg log -T foo", etc. But it won't
  help complex "log"s which do require the "obsolete()" revisions. To speed
  up the latter, we can probably just copy the caching infrastructure from
  repoview.py.

For 2,

  len(filteredrevs) in my hg-committed is 2155. It's small. I tend to think
  about solutions that scale longer and are not too complex to build. That
  may or may not be a good habit.
  
  That said, I agree if the set is small and perf, then whatever easier to
  implement makes sense.
  
  I have some crazy ideas about the future where a giant bitmap is
  practically useful. For example, source controlling "visible heads" and
  the obsstore, then adding some C code to quickly fill a bitmap (requires
  "set" and "setancestors/descendants"), then we can show the user whenever
  their repo used to look like back in any time by using the bitmap filter.
  That'll help investigate hard issues, and implement "hg undo".

> Secondly, what operations are being performed on these sets today, and do
> we know if those operations are going to change?
>
> For instance, if a may be extended via new elements added to its end, then
> an mmap-based implementation immediately becomes awkward. If a set is

Not that bad if we preallocate space.

> traversed or we perform operations like union or intersection, then a
> bitmap might make sense if the set is very small and dense, but not if it's
> big and sparse.

Not necessarily true if we use an advanced variant, like the "roaringbitmap"
mentioned by you last year [1].

[1]: 
https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087864.html

> (Jun's idea that mmap is effectively free is also not borne out in
> practice, but that's a second-order issue here.)
> 
> So tell us what the needed API is, what the characteristics of the data
> are, and that'll help understand how to make a good decision.

___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-17 Thread Stanislau Hlebik
Excerpts from Bryan O'Sullivan's message of 2017-02-17 13:29:58 -0800:
> On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu  wrote:
> 
> > Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +:
> > > As I said before we will load all non-public revs in one set and all
> >
> > The problem is, loading a Python set from disk is O(size-of-the-set).
> >
> > Bitmap's loading cost should be basically 0 (with mmap). I think that's why
> > we want bitmap at the first place. There are other choices like packfile
> > index, hash tables, but bitmap is the simplest and most efficient.
> >
> 
> Hey folks,
> 
> I haven't yet seen mention of some considerations that seem very important
> in driving the decision-making, so I'd appreciate it if someone could fill
> me in.
> 
> Firstly, what's our current understanding of the sizes and compositions of
> these sets of numbers? In theory, we have a lot of data from practical
> application at Facebook, but nobody's brought it up.

I assume that both sets (set for nonpublic commits and set for
obsstore) are going to be very small comparing to the repo size. I
expect both sets < 1% of the repo size. And the sets is going to be
sparse.

> 
> To clarify why this matters, if an obsstore contains say 100 entries, then
> a very naive implementation should be fine. If it's related to something
> else, such as the size of, amount of activity in, or local age of a repo,
> then maybe we have a different set of decisions to make.
> 
> Secondly, what operations are being performed on these sets today, and do
> we know if those operations are going to change?
> 
> For instance, if a may be extended via new elements added to its end, then
> an mmap-based implementation immediately becomes awkward. If a set is
> traversed or we perform operations like union or intersection, then a
> bitmap might make sense if the set is very small and dense, but not if it's
> big and sparse.

We need three operations: iteration over the whole
set, testing if an element belongs to the set and adding new elements.
New elements can be added to the end of the set.
Not sure if more operations will be added, probably not.

Agree, we can't use bitmaps for iteration.
But potentially we can clean all of the places where iteration is used.
Most of them are pretty straight-forward except for one tricky case in
scmutil.filteredhash() function.

> 
> (Jun's idea that mmap is effectively free is also not borne out in
> practice, but that's a second-order issue here.)
> 
> So tell us what the needed API is, what the characteristics of the data
> are, and that'll help understand how to make a good decision.
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-17 Thread Bryan O'Sullivan
On Fri, Feb 17, 2017 at 10:30 AM, Jun Wu  wrote:

> Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +:
> > As I said before we will load all non-public revs in one set and all
>
> The problem is, loading a Python set from disk is O(size-of-the-set).
>
> Bitmap's loading cost should be basically 0 (with mmap). I think that's why
> we want bitmap at the first place. There are other choices like packfile
> index, hash tables, but bitmap is the simplest and most efficient.
>

Hey folks,

I haven't yet seen mention of some considerations that seem very important
in driving the decision-making, so I'd appreciate it if someone could fill
me in.

Firstly, what's our current understanding of the sizes and compositions of
these sets of numbers? In theory, we have a lot of data from practical
application at Facebook, but nobody's brought it up.

To clarify why this matters, if an obsstore contains say 100 entries, then
a very naive implementation should be fine. If it's related to something
else, such as the size of, amount of activity in, or local age of a repo,
then maybe we have a different set of decisions to make.

Secondly, what operations are being performed on these sets today, and do
we know if those operations are going to change?

For instance, if a may be extended via new elements added to its end, then
an mmap-based implementation immediately becomes awkward. If a set is
traversed or we perform operations like union or intersection, then a
bitmap might make sense if the set is very small and dense, but not if it's
big and sparse.

(Jun's idea that mmap is effectively free is also not borne out in
practice, but that's a second-order issue here.)

So tell us what the needed API is, what the characteristics of the data
are, and that'll help understand how to make a good decision.
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-17 Thread Stanislau Hlebik
Excerpts from Jun Wu's message of 2017-02-17 10:30:45 -0800:
> Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +:
> > As I said before we will load all non-public revs in one set and all
> 
> The problem is, loading a Python set from disk is O(size-of-the-set).
> 
> Bitmap's loading cost should be basically 0 (with mmap). I think that's why
> we want bitmap at the first place. There are other choices like packfile
> index, hash tables, but bitmap is the simplest and most efficient.

Agree, but Python set is going to be small and loading it is also fast.

> 
> > precursor revs in another set. So `ispublic()` and `isprecursor()`
> > checks are very fast, it's just a set lookup. Same with update - it's
> > just an insertion in the set.
> 
> If cache invalidation happens for common write operations and the cache
> needs to be rebuilt from scratch, then the solution is not better than
> stateful chg - the chg way is easier (no need to serialize data), and safer
> (not vulnerable to cache file corruption).
> 
> To make the bitmap cache really useful (solving problems that chg cannot not
> solve), we need to incrementally update the cache, and avoid rebuilding it
> from scratch, at least for common commands like "rebase", "amend", "commit".
> 
> > > Some code can be changed - "scmutil.filteredhash" seems to be one user
> > > that iterates "filteredrevs". But what it needs is only a hash - it
> > > could hash something else, like the mtime, size etc.
> > 
> > Bookmarks, changelog, obsstore and tags can affect filtered set. 
> > For filtered repo we'll need to use size + mtime of bookmarks,
> > changelog, obsstore, tags and maybe even smth else. That maybe
> > error-prone.
> 
> The "filteredhash" seems to be only used by tags and branchmap cache. I
> think we can actually get rid of it by only caching unfiltered versions and
> do the filtering later. It also seems wrong to have tags and filteredrevs
> mutually depend on each other.

If we can get rid of `filteredhash` then we can probably proceed with
bitmaps. Although using bitmaps will still require bigger code changes
than using serialized sets.

> 
> > > Bitmaps could also be smarter - like maintaining the min and max revisions
> > > so it does not need to be exactly len(repo).
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-17 Thread Jun Wu
Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +:
> As I said before we will load all non-public revs in one set and all

The problem is, loading a Python set from disk is O(size-of-the-set).

Bitmap's loading cost should be basically 0 (with mmap). I think that's why
we want bitmap at the first place. There are other choices like packfile
index, hash tables, but bitmap is the simplest and most efficient.

> precursor revs in another set. So `ispublic()` and `isprecursor()`
> checks are very fast, it's just a set lookup. Same with update - it's
> just an insertion in the set.

If cache invalidation happens for common write operations and the cache
needs to be rebuilt from scratch, then the solution is not better than
stateful chg - the chg way is easier (no need to serialize data), and safer
(not vulnerable to cache file corruption).

To make the bitmap cache really useful (solving problems that chg cannot not
solve), we need to incrementally update the cache, and avoid rebuilding it
from scratch, at least for common commands like "rebase", "amend", "commit".

> > Some code can be changed - "scmutil.filteredhash" seems to be one user
> > that iterates "filteredrevs". But what it needs is only a hash - it
> > could hash something else, like the mtime, size etc.
> 
> Bookmarks, changelog, obsstore and tags can affect filtered set. 
> For filtered repo we'll need to use size + mtime of bookmarks,
> changelog, obsstore, tags and maybe even smth else. That maybe
> error-prone.

The "filteredhash" seems to be only used by tags and branchmap cache. I
think we can actually get rid of it by only caching unfiltered versions and
do the filtering later. It also seems wrong to have tags and filteredrevs
mutually depend on each other.

> > Bitmaps could also be smarter - like maintaining the min and max revisions
> > so it does not need to be exactly len(repo).
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-17 Thread Stanislau Hlebik
Excerpts from Stanislau Hlebik's message of 2017-02-17 11:24:34 +:
> Excerpts from Jun Wu's message of 2017-02-16 13:42:46 -0800:
> > Excerpts from Stanislau Hlebik's message of 2017-02-16 19:39:07 +:
> > > Excerpts from Stanislau Hlebik's message of 2017-02-14 09:29:25 +:
> > > > Excerpts from Sean Farley's message of 2017-02-13 18:30:25 -0800:
> > > > > Jun Wu  writes:
> > > > > 
> > > > > > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800:
> > > > > >> I was thinking about a more high-level approach (please feel free 
> > > > > >> to
> > > > > >> pick apart):
> > > > > >> 
> > > > > >> r = repo.filtered("bitmap1")
> > > > > >> r2 = r.filtered("bitmap2")
> > > > > >> 
> > > > > >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't
> > > > > >> thought about a union nor the inverse).
> > > > > >
> > > > > > That does not conflict with my comments. It could be implemented as 
> > > > > > nested
> > > > > > filters, or flatten the bitmap by doing an "or" operation.
> > > > > 
> > > > > Righto. Just wanted to bring it up early before things are set in 
> > > > > stone.
> > > > 
> > > > Current `repoview.filtered()` implementation can apply only one filter. 
> > > > I
> > > > think it will be error-prone to change it, won't it?
> > > 
> > > It seems that it's better to use sorted lists instead of bitmaps. In a
> > > couple of places it is expected that repo.filteredrevs supports
> > > iteration. But iteration over a bitmap is very slow. Instead we can
> > > store list of non-public revs and list of precursor revs and load them
> > > in a set.
> > > 
> > > It will be slow for the case where repo has lots of draft commits.
> > > In this case it's probably better to disable this feature completely.
> > 
> > I think it's fine to have slow iteration, as long as the iteration operation
> > is uncommon. The point is, the most common operation should be fast - that
> > is, ishidden(rev) should be fast. Stateful chg will help the read-only case,
> > the bitmap + mmap would ideally make write (like rebase) cases fast
> .
> As I said before we will load all non-public revs in one set and all
> precursor revs in another set. So `ispublic()` and `isprecursor()`
> checks are very fast, it's just a set lookup. Same with update - it's
> just an insertion in the set.
> 
> > 
> > Some code can be changed - "scmutil.filteredhash" seems to be one user that
> > iterates "filteredrevs". But what it needs is only a hash - it could hash
> > something else, like the mtime, size etc.
> 
> Bookmarks, changelog, obsstore and tags can affect filtered set. 
> For filtered repo we'll need to use size + mtime of bookmarks,
> changelog, obsstore, tags and maybe even smth else. That maybe
> error-prone.

This is implementation of two caches (nonpublic + precursor) using
serialized sorted lists and sets
https://bitbucket.org/stashlebik/hg/commits/99879579ac2848a2567810b677d8344150a7b319?at=hiddenbitmaps_lists

> > 
> > Bitmaps could also be smarter - like maintaining the min and max revisions
> > so it does not need to be exactly len(repo).
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-17 Thread Stanislau Hlebik
Excerpts from Jun Wu's message of 2017-02-16 13:42:46 -0800:
> Excerpts from Stanislau Hlebik's message of 2017-02-16 19:39:07 +:
> > Excerpts from Stanislau Hlebik's message of 2017-02-14 09:29:25 +:
> > > Excerpts from Sean Farley's message of 2017-02-13 18:30:25 -0800:
> > > > Jun Wu  writes:
> > > > 
> > > > > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800:
> > > > >> I was thinking about a more high-level approach (please feel free to
> > > > >> pick apart):
> > > > >> 
> > > > >> r = repo.filtered("bitmap1")
> > > > >> r2 = r.filtered("bitmap2")
> > > > >> 
> > > > >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't
> > > > >> thought about a union nor the inverse).
> > > > >
> > > > > That does not conflict with my comments. It could be implemented as 
> > > > > nested
> > > > > filters, or flatten the bitmap by doing an "or" operation.
> > > > 
> > > > Righto. Just wanted to bring it up early before things are set in stone.
> > > 
> > > Current `repoview.filtered()` implementation can apply only one filter. I
> > > think it will be error-prone to change it, won't it?
> > 
> > It seems that it's better to use sorted lists instead of bitmaps. In a
> > couple of places it is expected that repo.filteredrevs supports
> > iteration. But iteration over a bitmap is very slow. Instead we can
> > store list of non-public revs and list of precursor revs and load them
> > in a set.
> > 
> > It will be slow for the case where repo has lots of draft commits.
> > In this case it's probably better to disable this feature completely.
> 
> I think it's fine to have slow iteration, as long as the iteration operation
> is uncommon. The point is, the most common operation should be fast - that
> is, ishidden(rev) should be fast. Stateful chg will help the read-only case,
> the bitmap + mmap would ideally make write (like rebase) cases fast
.
As I said before we will load all non-public revs in one set and all
precursor revs in another set. So `ispublic()` and `isprecursor()`
checks are very fast, it's just a set lookup. Same with update - it's
just an insertion in the set.

> 
> Some code can be changed - "scmutil.filteredhash" seems to be one user that
> iterates "filteredrevs". But what it needs is only a hash - it could hash
> something else, like the mtime, size etc.

Bookmarks, changelog, obsstore and tags can affect filtered set. 
For filtered repo we'll need to use size + mtime of bookmarks,
changelog, obsstore, tags and maybe even smth else. That maybe
error-prone.
> 
> Bitmaps could also be smarter - like maintaining the min and max revisions
> so it does not need to be exactly len(repo).
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-16 Thread Jun Wu
Excerpts from Stanislau Hlebik's message of 2017-02-16 19:39:07 +:
> Excerpts from Stanislau Hlebik's message of 2017-02-14 09:29:25 +:
> > Excerpts from Sean Farley's message of 2017-02-13 18:30:25 -0800:
> > > Jun Wu  writes:
> > > 
> > > > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800:
> > > >> I was thinking about a more high-level approach (please feel free to
> > > >> pick apart):
> > > >> 
> > > >> r = repo.filtered("bitmap1")
> > > >> r2 = r.filtered("bitmap2")
> > > >> 
> > > >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't
> > > >> thought about a union nor the inverse).
> > > >
> > > > That does not conflict with my comments. It could be implemented as 
> > > > nested
> > > > filters, or flatten the bitmap by doing an "or" operation.
> > > 
> > > Righto. Just wanted to bring it up early before things are set in stone.
> > 
> > Current `repoview.filtered()` implementation can apply only one filter. I
> > think it will be error-prone to change it, won't it?
> 
> It seems that it's better to use sorted lists instead of bitmaps. In a
> couple of places it is expected that repo.filteredrevs supports
> iteration. But iteration over a bitmap is very slow. Instead we can
> store list of non-public revs and list of precursor revs and load them
> in a set.
> 
> It will be slow for the case where repo has lots of draft commits.
> In this case it's probably better to disable this feature completely.

I think it's fine to have slow iteration, as long as the iteration operation
is uncommon. The point is, the most common operation should be fast - that
is, ishidden(rev) should be fast. Stateful chg will help the read-only case,
the bitmap + mmap would ideally make write (like rebase) cases fast.

Some code can be changed - "scmutil.filteredhash" seems to be one user that
iterates "filteredrevs". But what it needs is only a hash - it could hash
something else, like the mtime, size etc.

Bitmaps could also be smarter - like maintaining the min and max revisions
so it does not need to be exactly len(repo).
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-16 Thread Stanislau Hlebik
Excerpts from Stanislau Hlebik's message of 2017-02-14 09:29:25 +:
> Excerpts from Sean Farley's message of 2017-02-13 18:30:25 -0800:
> > Jun Wu  writes:
> > 
> > > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800:
> > >> I was thinking about a more high-level approach (please feel free to
> > >> pick apart):
> > >> 
> > >> r = repo.filtered("bitmap1")
> > >> r2 = r.filtered("bitmap2")
> > >> 
> > >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't
> > >> thought about a union nor the inverse).
> > >
> > > That does not conflict with my comments. It could be implemented as nested
> > > filters, or flatten the bitmap by doing an "or" operation.
> > 
> > Righto. Just wanted to bring it up early before things are set in stone.
> 
> Current `repoview.filtered()` implementation can apply only one filter. I
> think it will be error-prone to change it, won't it?

It seems that it's better to use sorted lists instead of bitmaps. In a
couple of places it is expected that repo.filteredrevs supports
iteration. But iteration over a bitmap is very slow. Instead we can
store list of non-public revs and list of precursor revs and load them
in a set.

It will be slow for the case where repo has lots of draft commits.
In this case it's probably better to disable this feature completely.
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-14 Thread Stanislau Hlebik
Excerpts from Gregory Szorc's message of 2017-02-13 20:28:37 -0700:
> 
> > On Feb 10, 2017, at 10:33, Stanislau Hlebik  wrote:
> > 
> > Last year Durham sent a proposal for bitmap storage for computehidden() 
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mercurial-2Dscm.org_pipermail_mercurial-2Ddevel_2016-2DSeptember_087860.html=DwIFAg=5VD0RTtNlTh3ycd41b3MUw=1EQ58Dmb5uX1qHujcsT1Mg=ee96J8IDM8uhZeONo6skPSXItouNwkeRGNxHeCaXiXM=biY_gtRGCl25baAE_Nlw8cl4YUefBM0Xb7L0kuGklRA=
> >  .
> >  
> > It got a few useful comments, two most important comments:
> > Use bitmaps for lower-level data structures, for example, bitmap for public 
> > commits and bitmap for commits that are affected by obsolescence markers
> > Add cache validation checks
> >  
> > This is a new RFC that addresses these issues.
> > In the code, there is a cache only for precursors. Later I'm planning to 
> > add cache for public commits.
> > Cache validation key was added. It can be any key. For precursorcache I've 
> > chosen obsstore file size and mtime and cache validation key. For 
> > phasecache we can use hash of sorted phaseroots file since this file is 
> > usually small. 
> 
> Until there is a better supported mechanism for storing "draft pushes" in an 
> external store (Facebook's inifinitepush extension may fulfill this), 
> Mozilla's "Try" and code review repos need to scale to >100,000 heads / phase 
> roots. So a hash of phaseroots may not scale. (Of course, one solution to 
> this problem would be inverting phases storage to be head based instead of 
> root based. That doesn't solve problem if you have 100k public heads and 100k 
> draft heads. Scaling is hard.)

Then I will use the same cache key as for 'obsstore' - mtime + size.
BTW infinitepush is ready and opensourced so you can give it a try: 
https://bitbucket.org/facebook/hg-experimental/src/567c1477eebf43ff5fe7412c650ec49aa3e44a3c/infinitepush/?at=default

> 
> >  
> > In a few places, I decided to delete cache entirely:
> > Caches are blown up if we receive obsmarkers via bundle2 or 
> > exchange._pullobsolete(). This is not necessary since cache won't be valid 
> > on the next run anyway because obsstore size and/or mtime was changed. We 
> > can change it.
> > Obsstore can store markers for node that are not present in the repo. Since 
> > bitmap is basically a dict from rev to boolean it can't store markers for 
> > revisions that are not in the repo. It doesn't cause issues unless missing 
> > node is received from bundle or during pull. In this case precursorcache 
> > doesn't recognize it as obsoleted. To fix it cache is deleted during 
> > changegroup.apply() if we receive a node that is in obsstore.successors 
> > dict.
> > Similar thing in bundlerepo - new obsoleted nodes may have been added, 
> > let's just not use precursorcache in bundlerepo.
> >  
> >  
> > The code is here: 
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__bitbucket.org_stashlebik_hg_commits_branch_hiddenbitmaps=DwIFAg=5VD0RTtNlTh3ycd41b3MUw=1EQ58Dmb5uX1qHujcsT1Mg=ee96J8IDM8uhZeONo6skPSXItouNwkeRGNxHeCaXiXM=dp9GiK3_WKq2zB8mVuMy2ksPF5tnGV_1T_OSAp6jD8o=
> >  
> > It's not super clean yet, but all tests pass (except maybe for commit/code 
> > checks).
> > Please look and let me know what you think!
> >  
> >  
> > Thanks,
> > Stas
> > ___
> > Mercurial-devel mailing list
> > Mercurial-devel@mercurial-scm.org
> > https://urldefense.proofpoint.com/v2/url?u=https-3A__www.mercurial-2Dscm.org_mailman_listinfo_mercurial-2Ddevel=DwIFAg=5VD0RTtNlTh3ycd41b3MUw=1EQ58Dmb5uX1qHujcsT1Mg=ee96J8IDM8uhZeONo6skPSXItouNwkeRGNxHeCaXiXM=mqvxWBluu-CxyLDxFLW49s61VhbzU_NiqHHCSu6NVxw=
> >  
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-14 Thread Stanislau Hlebik
Excerpts from Sean Farley's message of 2017-02-13 18:30:25 -0800:
> Jun Wu  writes:
> 
> > Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800:
> >> I was thinking about a more high-level approach (please feel free to
> >> pick apart):
> >> 
> >> r = repo.filtered("bitmap1")
> >> r2 = r.filtered("bitmap2")
> >> 
> >> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't
> >> thought about a union nor the inverse).
> >
> > That does not conflict with my comments. It could be implemented as nested
> > filters, or flatten the bitmap by doing an "or" operation.
> 
> Righto. Just wanted to bring it up early before things are set in stone.

Current `repoview.filtered()` implementation can apply only one filter. I
think it will be error-prone to change it, won't it?
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-13 Thread Gregory Szorc


> On Feb 10, 2017, at 10:33, Stanislau Hlebik  wrote:
> 
> Last year Durham sent a proposal for bitmap storage for computehidden() 
> https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087860.html.
>  
> It got a few useful comments, two most important comments:
> Use bitmaps for lower-level data structures, for example, bitmap for public 
> commits and bitmap for commits that are affected by obsolescence markers
> Add cache validation checks
>  
> This is a new RFC that addresses these issues.
> In the code, there is a cache only for precursors. Later I'm planning to add 
> cache for public commits.
> Cache validation key was added. It can be any key. For precursorcache I've 
> chosen obsstore file size and mtime and cache validation key. For phasecache 
> we can use hash of sorted phaseroots file since this file is usually small. 

Until there is a better supported mechanism for storing "draft pushes" in an 
external store (Facebook's inifinitepush extension may fulfill this), Mozilla's 
"Try" and code review repos need to scale to >100,000 heads / phase roots. So a 
hash of phaseroots may not scale. (Of course, one solution to this problem 
would be inverting phases storage to be head based instead of root based. That 
doesn't solve problem if you have 100k public heads and 100k draft heads. 
Scaling is hard.)

>  
> In a few places, I decided to delete cache entirely:
> Caches are blown up if we receive obsmarkers via bundle2 or 
> exchange._pullobsolete(). This is not necessary since cache won't be valid on 
> the next run anyway because obsstore size and/or mtime was changed. We can 
> change it.
> Obsstore can store markers for node that are not present in the repo. Since 
> bitmap is basically a dict from rev to boolean it can't store markers for 
> revisions that are not in the repo. It doesn't cause issues unless missing 
> node is received from bundle or during pull. In this case precursorcache 
> doesn't recognize it as obsoleted. To fix it cache is deleted during 
> changegroup.apply() if we receive a node that is in obsstore.successors dict.
> Similar thing in bundlerepo - new obsoleted nodes may have been added, let's 
> just not use precursorcache in bundlerepo.
>  
>  
> The code is here: 
> https://bitbucket.org/stashlebik/hg/commits/branch/hiddenbitmaps
> It's not super clean yet, but all tests pass (except maybe for commit/code 
> checks).
> Please look and let me know what you think!
>  
>  
> Thanks,
> Stas
> ___
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-13 Thread Sean Farley
Jun Wu  writes:

> Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800:
>> I was thinking about a more high-level approach (please feel free to
>> pick apart):
>> 
>> r = repo.filtered("bitmap1")
>> r2 = r.filtered("bitmap2")
>> 
>> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't
>> thought about a union nor the inverse).
>
> That does not conflict with my comments. It could be implemented as nested
> filters, or flatten the bitmap by doing an "or" operation.

Righto. Just wanted to bring it up early before things are set in stone.
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-13 Thread Jun Wu
Excerpts from Sean Farley's message of 2017-02-13 17:04:35 -0800:
> I was thinking about a more high-level approach (please feel free to
> pick apart):
> 
> r = repo.filtered("bitmap1")
> r2 = r.filtered("bitmap2")
> 
> So that r2 would be an intersection of bitmap1 and bitmap2 (haven't
> thought about a union nor the inverse).

That does not conflict with my comments. It could be implemented as nested
filters, or flatten the bitmap by doing an "or" operation.
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-13 Thread Sean Farley
Jun Wu  writes:

> In general, I think this is a good direction. Some random thoughts:
>
>   - general purposed
>
> I think the bitmap is not always a cache, so it should only have
> operations like set/unset/readfromdisk/writetodisk. Practically, I won't
> couple cache invalidation with the bitmap implementation.
>
> In additional, I'll try to avoid using Python-only types in the
> interface. So once we decide to rewrite part of the implementation in
> native C, we won't have trouble.
>
> See "revset" below for a possibility that bitmap is used as a non-set.
>
>   - revset
>
> This is a possibility that probably won't happen any time soon.
>
> The revset currently uses Python set for maintaining its state. For huge
> sets, Python sets may not be a good option. And various operations could
> benefit from an always-topologically-sorted set, which is the bitmap.
>
>   - mmap
> 
> My intuition is that bitmaps fit better with mmap which can reduce the
> reading loading cost. I think "vfs.mmapread" could be a thing, and
> various places can benefit from it - Gabor recently showed interest in
> loading revlog data by mmap, I had patches that uses mmap to read revlog
> index.
>
> In additional, not directly related to this series, I'm a big fan of
> single direction data flow. But the current code base does not seem to do a
> good job in this area. As we are adding more caching layers to the code
> base, it'd be nice if we have some tiny framework maintaining the dependency
> of all kinds of data, to be able to understand the data flow easily, and
> just to be more confident about loading orders. I think people more
> experienced on architecture may want to share some ideas here.

I was thinking about a more high-level approach (please feel free to
pick apart):

r = repo.filtered("bitmap1")
r2 = r.filtered("bitmap2")

So that r2 would be an intersection of bitmap1 and bitmap2 (haven't
thought about a union nor the inverse).
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-13 Thread Stanislau Hlebik
Excerpts from Jun Wu's message of 2017-02-10 11:04:24 -0800:
> In general, I think this is a good direction. Some random thoughts:
> 
>   - general purposed
> 
> I think the bitmap is not always a cache, so it should only have
> operations like set/unset/readfromdisk/writetodisk. Practically, I won't
> couple cache invalidation with the bitmap implementation.
> 
> In additional, I'll try to avoid using Python-only types in the
> interface. So once we decide to rewrite part of the implementation in
> native C, we won't have trouble.

I can decouple them into bitmap and bitmapcache. bitmap then need to
have an generic header. bitmapcache will store invalidation key there.

> 
> See "revset" below for a possibility that bitmap is used as a non-set.
> 
>   - revset
> 
> This is a possibility that probably won't happen any time soon.
> 
> The revset currently uses Python set for maintaining its state. For huge
> sets, Python sets may not be a good option. And various operations could
> benefit from an always-topologically-sorted set, which is the bitmap.
> 
>   - mmap
> 
> My intuition is that bitmaps fit better with mmap which can reduce the
> reading loading cost. I think "vfs.mmapread" could be a thing, and
> various places can benefit from it - Gabor recently showed interest in
> loading revlog data by mmap, I had patches that uses mmap to read revlog
> index.

Bitmap files are going to be small and reading them all in memory should
be fast. But if it'll ever become a bottleneck we can add mmap read of
bitmap files.

> 
> In additional, not directly related to this series, I'm a big fan of
> single direction data flow. But the current code base does not seem to do a
> good job in this area. As we are adding more caching layers to the code
> base, it'd be nice if we have some tiny framework maintaining the dependency
> of all kinds of data, to be able to understand the data flow easily, and
> just to be more confident about loading orders. I think people more
> experienced on architecture may want to share some ideas here.
> 
> Excerpts from Stanislau Hlebik's message of 2017-02-10 17:33:28 +:
> > Last year Durham sent a proposal for bitmap storage for computehidden() 
> > https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087860.html
> >  .
> > 
> > It got a few useful comments, two most important comments:
> > 
> >   1.  Use bitmaps for lower-level data structures, for example, bitmap for 
> > public commits and bitmap for commits that are affected by obsolescence 
> > markers
> >   2.  Add cache validation checks
> > 
> > This is a new RFC that addresses these issues.
> > 
> >   1.  In the code, there is a cache only for precursors. Later I'm planning 
> > to add cache for public commits.
> >   2.  Cache validation key was added. It can be any key. For precursorcache 
> > I've chosen obsstore file size and mtime and cache validation key. For 
> > phasecache we can use hash of sorted phaseroots file since this file is 
> > usually small.
> > 
> > 
> > In a few places, I decided to delete cache entirely:
> > 
> >   1.  Caches are blown up if we receive obsmarkers via bundle2 or 
> > exchange._pullobsolete(). This is not necessary since cache won't be valid 
> > on the next run anyway because obsstore size and/or mtime was changed. We 
> > can change it.
> >   2.  Obsstore can store markers for node that are not present in the repo. 
> > Since bitmap is basically a dict from rev to boolean it can't store markers 
> > for revisions that are not in the repo. It doesn't cause issues unless 
> > missing node is received from bundle or during pull. In this case 
> > precursorcache doesn't recognize it as obsoleted. To fix it cache is 
> > deleted during changegroup.apply() if we receive a node that is in 
> > obsstore.successors dict.
> >   3.  Similar thing in bundlerepo - new obsoleted nodes may have been 
> > added, let's just not use precursorcache in bundlerepo.
> > 
> > 
> > The code is here: 
> > https://bitbucket.org/stashlebik/hg/commits/branch/hiddenbitmaps 
> > It's not super clean yet, but all tests pass (except maybe for commit/code 
> > checks).
> > Please look and let me know what you think!
> > 
> > 
> > Thanks,
> > Stas
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: RFC: bitmap storage for precursors and phases

2017-02-10 Thread Jun Wu
In general, I think this is a good direction. Some random thoughts:

  - general purposed

I think the bitmap is not always a cache, so it should only have
operations like set/unset/readfromdisk/writetodisk. Practically, I won't
couple cache invalidation with the bitmap implementation.

In additional, I'll try to avoid using Python-only types in the
interface. So once we decide to rewrite part of the implementation in
native C, we won't have trouble.

See "revset" below for a possibility that bitmap is used as a non-set.

  - revset

This is a possibility that probably won't happen any time soon.

The revset currently uses Python set for maintaining its state. For huge
sets, Python sets may not be a good option. And various operations could
benefit from an always-topologically-sorted set, which is the bitmap.

  - mmap

My intuition is that bitmaps fit better with mmap which can reduce the
reading loading cost. I think "vfs.mmapread" could be a thing, and
various places can benefit from it - Gabor recently showed interest in
loading revlog data by mmap, I had patches that uses mmap to read revlog
index.

In additional, not directly related to this series, I'm a big fan of
single direction data flow. But the current code base does not seem to do a
good job in this area. As we are adding more caching layers to the code
base, it'd be nice if we have some tiny framework maintaining the dependency
of all kinds of data, to be able to understand the data flow easily, and
just to be more confident about loading orders. I think people more
experienced on architecture may want to share some ideas here.

Excerpts from Stanislau Hlebik's message of 2017-02-10 17:33:28 +:
> Last year Durham sent a proposal for bitmap storage for computehidden() 
> https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-September/087860.html
>  .
> 
> It got a few useful comments, two most important comments:
> 
>   1.  Use bitmaps for lower-level data structures, for example, bitmap for 
> public commits and bitmap for commits that are affected by obsolescence 
> markers
>   2.  Add cache validation checks
> 
> This is a new RFC that addresses these issues.
> 
>   1.  In the code, there is a cache only for precursors. Later I'm planning 
> to add cache for public commits.
>   2.  Cache validation key was added. It can be any key. For precursorcache 
> I've chosen obsstore file size and mtime and cache validation key. For 
> phasecache we can use hash of sorted phaseroots file since this file is 
> usually small.
> 
> 
> In a few places, I decided to delete cache entirely:
> 
>   1.  Caches are blown up if we receive obsmarkers via bundle2 or 
> exchange._pullobsolete(). This is not necessary since cache won't be valid on 
> the next run anyway because obsstore size and/or mtime was changed. We can 
> change it.
>   2.  Obsstore can store markers for node that are not present in the repo. 
> Since bitmap is basically a dict from rev to boolean it can't store markers 
> for revisions that are not in the repo. It doesn't cause issues unless 
> missing node is received from bundle or during pull. In this case 
> precursorcache doesn't recognize it as obsoleted. To fix it cache is deleted 
> during changegroup.apply() if we receive a node that is in 
> obsstore.successors dict.
>   3.  Similar thing in bundlerepo - new obsoleted nodes may have been added, 
> let's just not use precursorcache in bundlerepo.
> 
> 
> The code is here: 
> https://bitbucket.org/stashlebik/hg/commits/branch/hiddenbitmaps 
> It's not super clean yet, but all tests pass (except maybe for commit/code 
> checks).
> Please look and let me know what you think!
> 
> 
> Thanks,
> Stas
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel