Re: Making chg stateful
Excerpts from Yuya Nishihara's message of 2017-02-05 13:59:32 +0900:
> Perhaps that is a matter of taste, I like the explicit one. But I heard
> Python 3 has a generator-based coroutine, so using "yield" for that purpose
> is more common in Python 3?
>
>   @preload('index')
>   def preloadindex(repo):
>       f = repo.svfs('changelog.i')  # "f" can be shared naturally
>       hash = fstat(f).st_size
>       def loadindex():
>           indexdata = f.read()
>           hash = len(indexdata)
>           return parseindex(indexdata), hash

The inner "loadindex" function makes things unnecessarily complex in my
opinion. It makes the final return type harder to read, and introduces extra
indentation and a function definition.

If the goal is to get rid of "yield", it could be done in different ways.
For example,

Choice A: explicitly pass the old hash and the old object. We probably need
the old object anyway to allow incremental updates:

  @preload('obsstore', depend=['changelog'])
  def preloadobsstore(repo, oldhash, oldobsstore):
      obsstore = oldobsstore
      newhash = ...
      if newhash != oldhash:
          # note: this is an indentation level that does not exist with yield
          obsstore = ...
      return newhash, obsstore

Choice B: just give the preload function access to the cache object:

  @preload(['obsstore', 'changelog'])
  def preloadmultiple(repo, cache):
      oldhash, oldobsstore = cache['obsstore']

Choice A looks simple. Choice B is more expressive, but the function body
could get out of control - ex. it could skip setting cache['obsstore'] even
if it says so in its decorator.

While Choice A may serve most requirements just fine, I think "yield" is
more expressive and avoids unnecessary indentation. For example, the
following could not be expressed cleanly and easily using "return":

- Optional dependency:

    if fastpath_cannot_work:
        yield preload.require('changelog')

  Maybe "raise MissingRequirement('changelog')" could work, but that makes
  the function "restart", while a generator could continue executing without
  losing context.
- Optionally change the cache key: as mentioned in "4)". [1]

Therefore I think the generator way avoids all kinds of indentation (it
reminds me of nodejs callback hell) and has better extensibility, so I
currently prefer it. Otherwise, "Choice A" may be good enough for now. The
code base already uses "yield" this way in its "future" implementation for
batched remote peer calls.

[1] https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-February/092584.html

>       return hash, loadindex
>
> BTW, when will 'f' be closed if we take the generator-based abstraction?
>
>   def preloadindex(repo):
>       f = repo.svfs('changelog.i')
>       try:
>           yield ...
>           yield ...
>       finally:
>           f.close()

___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
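To make the trade-off concrete, here is a runnable toy sketch of "Choice A".
All names here (fakerepo, preloadbookmarks, runpreload) are invented for
illustration and are not real Mercurial APIs; the point is only the
(repo, oldhash, oldobject) -> (newhash, newobject) protocol.

```python
def preloadbookmarks(repo, oldhash, oldbookmarks):
    newhash = repo['bookmarks_mtime']       # cheap hash, e.g. a file mtime
    obj = oldbookmarks
    if newhash != oldhash:                  # the extra indentation level
        obj = dict(repo['bookmarks_data'])  # that "yield" would avoid
    return newhash, obj

def runpreload(repo, cache, name, func):
    # the framework passes the old (hash, object) pair in explicitly
    oldhash, oldobj = cache.get(name, (None, None))
    cache[name] = func(repo, oldhash, oldobj)
    return cache[name]

fakerepo = {'bookmarks_mtime': 1, 'bookmarks_data': {'master': 'abc123'}}
cache = {}
h1, b1 = runpreload(fakerepo, cache, 'bookmarks', preloadbookmarks)
h2, b2 = runpreload(fakerepo, cache, 'bookmarks', preloadbookmarks)
assert b1 is b2   # unchanged hash: the old object is reused, no reload
```

The same structure extends to incremental updates: on a hash mismatch the
function may patch `oldbookmarks` in place instead of rebuilding it.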
Re: Making chg stateful
Excerpts from Yuya Nishihara's message of 2017-02-05 13:31:29 +0900:
> > > Since we shouldn't pass repo to revlog (it's a layering violation), I
> > > think we'll need a thin wrapper for chgcache anyway.
> >
> > I mentioned this in the second mail, "4) Where to get preloaded results
> > (in worker)"; we could just expose some kind of global state, like a
> > "globalcache" module.
>
> Does it mean any low-level objects will directly access the global cache?
> That seems ugly (but maybe I'm biased, as I'm really allergic to global
> data.)

Yes. But I don't think the alternative would be better. To avoid using
global state, existing functions need to take extra arguments. That's also
contagious - to preload some low-level objects, all functions through the
call stack need to change.

I think global state is considered bad mainly because it's mutable by random
code, which could make things hard to predict or debug. We can enforce that
it is only mutable by the chg preloading API, so people cannot modify its
state outside the preloading framework. That sounds good enough to me.

> > > > Things to consider:
> > > >
> > > > a) Objects being preloaded have dependencies - ex. the obsstore
> > > >    depends on the changelog and other things. The preload functions
> > > >    run in a defined order.
> > >
> > > Maybe dependencies could be passed as arguments?
> >
> > Ideally, this expensive calculation (ex. the obsstore) could be moved to
> > the index object. In reality, that requires too much work, and obsstore
> > preloading requires a subset of "repo", including "repo.revs".
> >
> > It's possible to decouple obsstore preloading from the repo object, but
> > that could be a lot of work too.
>
> My opinion for the obsstore is that the most costly part would be
> populating 100k+ objects from the file, and the other costly parts could
> be mitigated by some higher-level cache in repoview.py.
> But I think this topic was discussed thoroughly between you and pyd before.
> I don't intend to bring it up again.

Not discussed. I did mention a plan to move it to the index, but that'll
surely take a very long time.

There are multiple levels of logic related to the obsstore, namely
obsstore.getrevs, where the current logic uses revsets (thus needs a heavy
repo object), and repoview.compute*, which calls obsstore.getrevs and does
some extra simple caching and calculation.

Even without the obsstore, other things (phases, bookmarks) depend on the
changelog. Therefore I think the dependency problem must be solved if we
want to have more things preloaded to achieve the best perf.

> > > > b) The index file is not always a single file, depending on "vfs".
> > >
> > > Yes. vfs could be owned by the storage/chgcache class.
> > >
> > > > c) The user may want to control what to preload. For example, if they
> > > >    have an incompatible manifest, they could make the changelog
> > > >    preloaded, but not the manifest.
> > >
> > > No idea about (c).
> > >
> > > > d) Users can add other preloading items easily, not only just the
> > > >    predefined ones.
> > >
> > > So probably we'll need an extensible table of preloadable items.
> >
> > If you check my prototype code, it's using a registrar to collect all
> > @preload functions.
>
> Yes. I wanted to say we would need this kind of abstraction anyway.
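The "global state that only the preloading API may mutate" idea can be
sketched as a tiny module-level cache with a controlled interface. The
`_store`, `update`, and `lookup` names are hypothetical, not part of chg;
real enforcement would need more than convention, but the shape is this:

```python
# Module-level cache: {repopath: {name: (hash, content)}}.
# By convention (and ideally by API design), only the chg preloading
# framework calls update(); everything else goes through lookup().
_store = {}

def update(repopath, name, hash_, content):
    """Mutation entry point reserved for the preload framework."""
    _store.setdefault(repopath, {})[name] = (hash_, content)

def lookup(repopath, name, expectedhash):
    """Return cached content only if the recorded hash still matches."""
    entry = _store.get(repopath, {}).get(name)
    if entry is not None and entry[0] == expectedhash:
        return entry[1]
    return None  # stale or missing: caller falls back to a normal load

update('/home/foo/repo1', 'index', 'h1', ['rev0', 'rev1'])
assert lookup('/home/foo/repo1', 'index', 'h1') == ['rev0', 'rev1']
assert lookup('/home/foo/repo1', 'index', 'stale') is None
```

Because reads are hash-checked, random code can at worst see a cache miss,
never a silently stale object, which addresses part of the "global data"
concern.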
Re: Making chg stateful
On Fri, 3 Feb 2017 19:50:48 +, Jun Wu wrote: > Excerpts from Yuya Nishihara's message of 2017-02-04 00:31:45 +0900: > > On Fri, 3 Feb 2017 09:33:32 +, Jun Wu wrote: > > > 1) What does the worker talk to the master via IPC? > > > > > > Current: worker sends "possible_dirty_repo: $REPO_PATH" after runcommand > > > > > > It's coupled with the (pseudo) repo object. But the master could > > > preload > > > things in correct order - first changelog, then phase, obsstore, etc. > > > > > > What if: worker sends "possible_dirty_index: $INDEX_PATH", > > > "possible_dirty_obsstore $OBSSTORE_PATH" etc. > > > > > > It looks more flexible at first sight. However the dependency issue > > > could be tricky to solve - for example, the obsstore requires > > > changelog > > > (and revset). It's much harder to preload obsstore without a pseudo > > > repo object with repo.changelog, repo.revs available. > > > > > > Therefore, probably just send the repo path. > > > > Given this, I think pipes would be simpler than using shared memory. The > > master > > IO can be multiplexed by select(). > > If it's one pipe per worker, the code change (I tried) is too much. I think > a single non-blocking pipe could be an option. Yep. As we'll rely on atomic write(), we won't have to create pipe per worker. > > > 2) Preload function signature (used by master's background thread). > > > > > > This function is to hash and load things. It runs in the master, and is > > > basically a generator (so the "loading" part could be skipped on cache > > > hit, and it's easier to make content and hash_with_content consistent, > > > if > > > they share states): > > > > > > @preloadregistar(something) > > > def preloadfunc(...): > > > yield quick_hash > > > yield content, hash_with_content > > > > Isn't it simpler if preloadfunc() returns quick_hash and a delayed function > > to build content? I feel this generator is kinda abuse and makes it less > > clear > > where the costly part starts. 
> I have thought about this. "Generator" is used as a lightweight process
> that could be terminated cheaply. It shares states (local variables) for
> hashing and loading, making certain things easier.
>
> Compare:
>
>   @preload('index')
>   def preloadindex(repo):
>       f = repo.svfs('changelog.i')  # "f" can be shared naturally
>       hash = fstat(f).st_size
>       yield hash
>
>       indexdata = f.read()
>       hash = len(indexdata)
>       yield parseindex(indexdata), hash
>
> vs.
>
>   def loadindex(f):
>       indexdata = f.read()
>       hash = len(indexdata)
>       return parseindex(indexdata), hash
>
>   @preload('index')
>   def preloadindex(repo):
>       f = repo.svfs('changelog.i')
>       hash = repo.svfs('changelog.i').st_size
>       loadfunc = lambda: loadindex(f)  # pass "f" explicitly
>       return hash, loadfunc
>
> The generator seems to be the shortest way to express the logic.
> I think it's easy to say that things after the first "yield" are costly.
>
> For readability, I think the improved form mentioned at the end of the
> previous mail, "yield preload.foo(...)", makes things obvious.

Perhaps that is a matter of taste, I like the explicit one. But I heard
Python 3 has a generator-based coroutine, so using "yield" for that purpose
is more common in Python 3?

  @preload('index')
  def preloadindex(repo):
      f = repo.svfs('changelog.i')  # "f" can be shared naturally
      hash = fstat(f).st_size
      def loadindex():
          indexdata = f.read()
          hash = len(indexdata)
          return parseindex(indexdata), hash
      return hash, loadindex

BTW, when will 'f' be closed if we take the generator-based abstraction?

  def preloadindex(repo):
      f = repo.svfs('changelog.i')
      try:
          yield ...
          yield ...
      finally:
          f.close()
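On the "when will 'f' be closed" question: Python runs a generator's
`finally` block when the generator is closed or exhausted, so a master-side
driver that calls `gen.close()` answers it. The following toy (with a fake
file name standing in for `f`, and invented `preloadindex`/`drive` names)
shows both paths; it is a sketch of the protocol, not chg code:

```python
closed = []  # records which "files" had their cleanup run

def preloadindex(fakefile):
    try:
        yield 'quickhash'                   # cheap hash first
        yield ('parsedindex', 'fullhash')   # costly part, only on cache miss
    finally:
        closed.append(fakefile)             # stands in for f.close()

def drive(gen, cachedhash):
    quickhash = next(gen)
    if quickhash == cachedhash:
        gen.close()   # cache hit: GeneratorExit is raised at the first
        return None   # yield, so the finally block runs right here
    content, fullhash = next(gen)
    gen.close()       # cache miss: generator is done; finally runs too
    return content, fullhash

assert drive(preloadindex('f1'), 'quickhash') is None  # hit: load skipped
assert closed == ['f1']
assert drive(preloadindex('f2'), 'stale') == ('parsedindex', 'fullhash')
assert closed == ['f1', 'f2']
```

So with the generator form, `f.close()` runs as soon as the framework
decides it is done with the preload function, on either path.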
Re: Making chg stateful
On Fri, 3 Feb 2017 20:03:18 +, Jun Wu wrote: > Excerpts from Yuya Nishihara's message of 2017-02-04 00:11:22 +0900: > > On Thu, 2 Feb 2017 16:56:11 +, Jun Wu wrote: > > > Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900: > > > > On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote: > > > > > So what state do we store? > > > > > > > > > > {repopath: {name: (hash, content)}}. For example: > > > > > > > > > > cache = {'/home/foo/repo1': {'index': ('hash', changelogindex), > > > > > 'bookmarks': ('hash', bookmarks), > > > > > }, > > > > > '/home/foo/repo2': { }, } > > > > > > > > > > The main ideas here are: > > > > > 1) Store the lowest level objects, like the C changelog index. > > > > >Because higher level objects could be changed by extensions in > > > > >unpredictable ways. (this is not true in my hacky prototype > > > > > though) > > > > > 2) Hash everything. For changelog, it's like the file stat of > > > > >changelog.i. There must be a strong guarantee that the hash > > > > > matches > > > > >the content, which could be challenging, but not impossible. > > > > > I'll > > > > >cover more details below. > > > > > > > > > > The cache is scoped by repo to make the API simpler/easy to use. It > > > > > may > > > > > be interesting to have some global state (like passing back the > > > > > extension > > > > > path to import them at runtime). > > > > > > > > Regarding this and "2) Side-effect-free repo", can or should we design > > > > the API > > > > as something like a low-level storage interface? That will allow us to > > > > not > > > > make repo/revlog know too much about chg. > > > > > > > > I don't have any concrete idea, but that would work as follows: > > > > > > > > 1. chg injects an object to select storage backends > > > > e.g. repo.storage = chgpreloadable(repo.storage, cache) > > > > 2. repo passes it to revlog, etc. > > > > 3. revlog uses it to read indexfile, where in-memory cache may be > > > > returned > > > > e.g. 
storage.parserevlog(indexfile) > > > > > > > > Perhaps, this 'storage' object is similar to the one you call > > > > 'baserepository'. > > > > > > I'm not sure if I get the idea (probably not). How does the implementation > > > in the master server look like? > > > > I was just thinking about how to hack the real repo object without > > introducing > > much mess. Perhaps the master server wouldn't be that different from your > > idea. > > > > > It feels more like "repo.chgcache" to me and the difference is that the > > > vanilla hg will be changed to access objects via it (so the interface > > > looks > > > more consistent). > > > > Yeah, it might be like repo.chgcache. > > > > Since we shouldn't pass repo to revlog (it's layering violation), I think > > we'll need a thin wrapper for chgcache anyway. > > I mentioned this in the second mail, "4) Where to get preloaded results (in > worker)", we could just expose some kind of global state, like a > "globalcache" module. Does it mean any low-level objects will directly access to the global cache? That seems ugly (but maybe I'm biased as I'm really allergic to global data.) > > > Things to consider: > > > > > > a) Objects being preloaded have dependency - ex. the obsstore depends on > > > changelog and other things. The preload functions run in a defined > > > order. > > > > Maybe dependencies could be passed as arguments? > > Ideally, these expensive calculating (ex. obsstore) could be moved to the > index object. In the reality, that requires too much work, and obsstore > preloading requires a subset of "repo", including "repo.revs". > > It's possible to decouple obsstore preloading from the repo object, but that > could be a lot of work too. My opinion for obsstore is that the most costly part would be populating 100k+ objects from file, and the other costly parts could be mitigated by some higher- level cache in repoview.py. But I think this topic was discussed thoroughly between you and pyd before. 
I don't intend to bring it up again.

> > > b) The index file is not always a single file, depending on "vfs".
> >
> > Yes. vfs could be owned by the storage/chgcache class.
> >
> > > c) The user may want to control what to preload. For example, if they
> > >    have an incompatible manifest, they could make the changelog
> > >    preloaded, but not the manifest.
> >
> > No idea about (c).
> >
> > > d) Users can add other preloading items easily, not only just the
> > >    predefined ones.
> >
> > So probably we'll need an extensible table of preloadable items.
>
> If you check my prototype code, it's using a registrar to collect all
> @preload functions.

Yes. I wanted to say we would need this kind of abstraction anyway.
Re: Making chg stateful
Excerpts from Yuya Nishihara's message of 2017-02-04 00:11:22 +0900: > On Thu, 2 Feb 2017 16:56:11 +, Jun Wu wrote: > > Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900: > > > On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote: > > > > So what state do we store? > > > > > > > > {repopath: {name: (hash, content)}}. For example: > > > > > > > > cache = {'/home/foo/repo1': {'index': ('hash', changelogindex), > > > > 'bookmarks': ('hash', bookmarks), > > > > }, > > > > '/home/foo/repo2': { }, } > > > > > > > > The main ideas here are: > > > > 1) Store the lowest level objects, like the C changelog index. > > > >Because higher level objects could be changed by extensions in > > > >unpredictable ways. (this is not true in my hacky prototype > > > > though) > > > > 2) Hash everything. For changelog, it's like the file stat of > > > >changelog.i. There must be a strong guarantee that the hash > > > > matches > > > >the content, which could be challenging, but not impossible. I'll > > > >cover more details below. > > > > > > > > The cache is scoped by repo to make the API simpler/easy to use. It > > > > may > > > > be interesting to have some global state (like passing back the > > > > extension > > > > path to import them at runtime). > > > > > > Regarding this and "2) Side-effect-free repo", can or should we design > > > the API > > > as something like a low-level storage interface? That will allow us to not > > > make repo/revlog know too much about chg. > > > > > > I don't have any concrete idea, but that would work as follows: > > > > > > 1. chg injects an object to select storage backends > > > e.g. repo.storage = chgpreloadable(repo.storage, cache) > > > 2. repo passes it to revlog, etc. > > > 3. revlog uses it to read indexfile, where in-memory cache may be > > > returned > > > e.g. storage.parserevlog(indexfile) > > > > > > Perhaps, this 'storage' object is similar to the one you call > > > 'baserepository'. 
> > > > I'm not sure if I get the idea (probably not). How does the implementation > > in the master server look like? > > I was just thinking about how to hack the real repo object without introducing > much mess. Perhaps the master server wouldn't be that different from your > idea. > > > It feels more like "repo.chgcache" to me and the difference is that the > > vanilla hg will be changed to access objects via it (so the interface looks > > more consistent). > > Yeah, it might be like repo.chgcache. > > Since we shouldn't pass repo to revlog (it's layering violation), I think > we'll need a thin wrapper for chgcache anyway. I mentioned this in the second mail, "4) Where to get preloaded results (in worker)", we could just expose some kind of global state, like a "globalcache" module. > > Things to consider: > > > > a) Objects being preloaded have dependency - ex. the obsstore depends on > > changelog and other things. The preload functions run in a defined > > order. > > Maybe dependencies could be passed as arguments? Ideally, these expensive calculating (ex. obsstore) could be moved to the index object. In the reality, that requires too much work, and obsstore preloading requires a subset of "repo", including "repo.revs". It's possible to decouple obsstore preloading from the repo object, but that could be a lot of work too. > > b) The index file is not always a single file, depending on "vfs". > > Yes. vfs could be owned by storage/chgcache class. > > > c) The user may want to control what to preload. For example, if they have > > an incompatible manifest, they could make changelog preloaded, but not > > manifest. > > No idea about (c). > > > d) Users can add other preloading items easily, not only just the > > predefined ones. > > So probably we'll need an extensible table of preloadable items. If you check my prototype code, it's using a registrar to collect all @preload functions. 
> > I think "storage.parserevlog(indexfile)" (treating the index separately
> > from a repo object) may have trouble dealing with "a)".
Re: Making chg stateful
Excerpts from Yuya Nishihara's message of 2017-02-04 00:31:45 +0900: > On Fri, 3 Feb 2017 09:33:32 +, Jun Wu wrote: > > 1) What does the worker talk to the master via IPC? > > > > Current: worker sends "possible_dirty_repo: $REPO_PATH" after runcommand > > > > It's coupled with the (pseudo) repo object. But the master could preload > > things in correct order - first changelog, then phase, obsstore, etc. > > > > What if: worker sends "possible_dirty_index: $INDEX_PATH", > > "possible_dirty_obsstore $OBSSTORE_PATH" etc. > > > > It looks more flexible at first sight. However the dependency issue > > could be tricky to solve - for example, the obsstore requires changelog > > (and revset). It's much harder to preload obsstore without a pseudo > > repo object with repo.changelog, repo.revs available. > > > > Therefore, probably just send the repo path. > > Given this, I think pipes would be simpler than using shared memory. The > master > IO can be multiplexed by select(). If it's one pipe per worker, the code change (I tried) is too much. I think a single non-blocking pipe could be an option. > (But I don't have strong opinion to disagree with using shm.) > > > 2) Preload function signature (used by master's background thread). > > > > This function is to hash and load things. It runs in the master, and is > > basically a generator (so the "loading" part could be skipped on cache > > hit, and it's easier to make content and hash_with_content consistent, if > > they share states): > > > > @preloadregistar(something) > > def preloadfunc(...): > > yield quick_hash > > yield content, hash_with_content > > Isn't it simpler if preloadfunc() returns quick_hash and a delayed function > to build content? I feel this generator is kinda abuse and makes it less clear > where the costly part starts. I have thought about this. "Generator" is used as a lightweight process that could be terminated cheaply. 
It shares states (local variables) for hashing and loading, making certain
things easier.

Compare:

  @preload('index')
  def preloadindex(repo):
      f = repo.svfs('changelog.i')  # "f" can be shared naturally
      hash = fstat(f).st_size
      yield hash

      indexdata = f.read()
      hash = len(indexdata)
      yield parseindex(indexdata), hash

vs.

  def loadindex(f):
      indexdata = f.read()
      hash = len(indexdata)
      return parseindex(indexdata), hash

  @preload('index')
  def preloadindex(repo):
      f = repo.svfs('changelog.i')
      hash = repo.svfs('changelog.i').st_size
      loadfunc = lambda: loadindex(f)  # pass "f" explicitly
      return hash, loadfunc

The generator seems to be the shortest way to express the logic. I think
it's easy to say that things after the first "yield" are costly.

For readability, I think the improved form mentioned at the end of the
previous mail, "yield preload.foo(...)", makes things obvious.

> > Problem: What's "..." ? It could be "repopath" as that's exactly what the
> > worker sends to us. But that will make the preload function more
> > complicated to be correct - repo has "requirements", etc. and the index
> > is not always simply path.join(repo_path, 'store', '00changelog.i').
> >
> > Similar to 1, if we decouple from the repo object, index preloading
> > would be fine:
> >
> >   @preload('index')
> >   def preloadindex(indexpath):
> >
> > But things with dependency will have trouble:
> >
> >   @preload('obsstore')
> >   def preloadobsstore(obsstorepath):
> >       # how to access repo.revs('') ?
> >
> > Therefore, the preload function needs a repo (even if it's "pseudo") to
> > be convenient to implement.
>
> For the "...", I agree repopath isn't enough.
Re: Making chg stateful
On Fri, 3 Feb 2017 09:33:32 +, Jun Wu wrote: > 1) What does the worker talk to the master via IPC? > > Current: worker sends "possible_dirty_repo: $REPO_PATH" after runcommand > > It's coupled with the (pseudo) repo object. But the master could preload > things in correct order - first changelog, then phase, obsstore, etc. > > What if: worker sends "possible_dirty_index: $INDEX_PATH", > "possible_dirty_obsstore $OBSSTORE_PATH" etc. > > It looks more flexible at first sight. However the dependency issue > could be tricky to solve - for example, the obsstore requires changelog > (and revset). It's much harder to preload obsstore without a pseudo > repo object with repo.changelog, repo.revs available. > > Therefore, probably just send the repo path. Given this, I think pipes would be simpler than using shared memory. The master IO can be multiplexed by select(). (But I don't have strong opinion to disagree with using shm.) > 2) Preload function signature (used by master's background thread). > > This function is to hash and load things. It runs in the master, and is > basically a generator (so the "loading" part could be skipped on cache > hit, and it's easier to make content and hash_with_content consistent, if > they share states): > > @preloadregistar(something) > def preloadfunc(...): > yield quick_hash > yield content, hash_with_content Isn't it simpler if preloadfunc() returns quick_hash and a delayed function to build content? I feel this generator is kinda abuse and makes it less clear where the costly part starts. > Problem: What's "..." ? It could be "repopath" as that's exactly what the > worker sends to us. But that will make the preload function more > complicated to be correct - repo has "requirements", etc and the index is > not always simply path.join(repo_path, 'store', '00changelog.i'). 
> Similar to 1, if we decouple from the repo object, index preloading would
> be fine:
>
>   @preload('index')
>   def preloadindex(indexpath):
>
> But things with dependency will have trouble:
>
>   @preload('obsstore')
>   def preloadobsstore(obsstorepath):
>       # how to access repo.revs('') ?
>
> Therefore, the preload function needs a repo (even if it's "pseudo") to be
> convenient to implement.

For the "...", I agree repopath isn't enough.
Re: Making chg stateful
On Thu, 2 Feb 2017 16:56:11 +, Jun Wu wrote: > Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900: > > On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote: > > > So what state do we store? > > > > > > {repopath: {name: (hash, content)}}. For example: > > > > > > cache = {'/home/foo/repo1': {'index': ('hash', changelogindex), > > > 'bookmarks': ('hash', bookmarks), > > > }, > > > '/home/foo/repo2': { }, } > > > > > > The main ideas here are: > > > 1) Store the lowest level objects, like the C changelog index. > > >Because higher level objects could be changed by extensions in > > >unpredictable ways. (this is not true in my hacky prototype though) > > > 2) Hash everything. For changelog, it's like the file stat of > > >changelog.i. There must be a strong guarantee that the hash matches > > >the content, which could be challenging, but not impossible. I'll > > >cover more details below. > > > > > > The cache is scoped by repo to make the API simpler/easy to use. It may > > > be interesting to have some global state (like passing back the > > > extension > > > path to import them at runtime). > > > > Regarding this and "2) Side-effect-free repo", can or should we design the > > API > > as something like a low-level storage interface? That will allow us to not > > make repo/revlog know too much about chg. > > > > I don't have any concrete idea, but that would work as follows: > > > > 1. chg injects an object to select storage backends > > e.g. repo.storage = chgpreloadable(repo.storage, cache) > > 2. repo passes it to revlog, etc. > > 3. revlog uses it to read indexfile, where in-memory cache may be returned > > e.g. storage.parserevlog(indexfile) > > > > Perhaps, this 'storage' object is similar to the one you call > > 'baserepository'. > > I'm not sure if I get the idea (probably not). How does the implementation > in the master server look like? I was just thinking about how to hack the real repo object without introducing much mess. 
Perhaps the master server wouldn't be that different from your idea.

> It feels more like "repo.chgcache" to me and the difference is that the
> vanilla hg will be changed to access objects via it (so the interface
> looks more consistent).

Yeah, it might be like repo.chgcache.

Since we shouldn't pass repo to revlog (it's a layering violation), I think
we'll need a thin wrapper for chgcache anyway.

> Things to consider:
>
> a) Objects being preloaded have dependencies - ex. the obsstore depends on
>    the changelog and other things. The preload functions run in a defined
>    order.

Maybe dependencies could be passed as arguments?

> b) The index file is not always a single file, depending on "vfs".

Yes. vfs could be owned by the storage/chgcache class.

> c) The user may want to control what to preload. For example, if they have
>    an incompatible manifest, they could make the changelog preloaded, but
>    not the manifest.

No idea about (c).

> d) Users can add other preloading items easily, not only just the
>    predefined ones.

So probably we'll need an extensible table of preloadable items.

> I think "storage.parserevlog(indexfile)" (treating the index separately
> from a repo object) may have trouble dealing with "a)".
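The "extensible table of preloadable items" plus the dependency ordering
from a) can be sketched with a registrar-style decorator and a topological
sort. This is illustrative only: the `preload` decorator, `depend` keyword,
and `preloadorder` helper are invented here, and Mercurial's real registrar
module is organized differently.

```python
preloadtable = {}  # name -> (func, depends)

def preload(name, depend=()):
    """Register a preload function under `name`, with optional deps."""
    def register(func):
        preloadtable[name] = (func, tuple(depend))
        return func
    return register

def preloadorder():
    """Topologically sort registered items so dependencies load first."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in preloadtable[name][1]:
            visit(dep)
        order.append(name)
    for name in preloadtable:
        visit(name)
    return order

@preload('changelog')
def loadchangelog():
    return 'index'

@preload('obsstore', depend=['changelog'])
def loadobsstore():
    return 'markers'

# the changelog always comes before anything depending on it
order = preloadorder()
assert order.index('changelog') < order.index('obsstore')
```

A real table would also need cycle detection and per-repo scoping, but the
decorator plus ordering pass is the core of the abstraction.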
Re: Making chg stateful
To be clear, the whole preloading framework needs design decisions made in
different places. I'll go through them one by one:

1) What does the worker say to the master via IPC?

   Current: the worker sends "possible_dirty_repo: $REPO_PATH" after
   runcommand.

   It's coupled with the (pseudo) repo object. But the master could preload
   things in the correct order - first changelog, then phases, obsstore,
   etc.

   What if: the worker sends "possible_dirty_index: $INDEX_PATH",
   "possible_dirty_obsstore: $OBSSTORE_PATH", etc.?

   It looks more flexible at first sight. However, the dependency issue
   could be tricky to solve - for example, the obsstore requires the
   changelog (and revsets). It's much harder to preload the obsstore
   without a pseudo repo object with repo.changelog and repo.revs
   available.

   Therefore, probably just send the repo path.

2) Preload function signature (used by the master's background thread).

   This function is to hash and load things. It runs in the master, and is
   basically a generator (so the "loading" part could be skipped on a cache
   hit, and it's easier to keep content and hash_with_content consistent if
   they share states):

     @preloadregistrar(something)
     def preloadfunc(...):
         yield quick_hash
         yield content, hash_with_content

   Problem: What's "..." ? It could be "repopath" as that's exactly what
   the worker sends to us. But that will make the preload function more
   complicated to be correct - repo has "requirements", etc. and the index
   is not always simply path.join(repo_path, 'store', '00changelog.i').

   Similar to 1, if we decouple from the repo object, index preloading
   would be fine:

     @preload('index')
     def preloadindex(indexpath):

   But things with dependencies will have trouble:

     @preload('obsstore')
     def preloadobsstore(obsstorepath):
         # how to access repo.revs('') ?

   Therefore, the preload function needs a repo (even if it's "pseudo") to
   be convenient to implement. What to yield may be changed, see "4)".

3) Where to store preloaded results (in the master)?

   The storage can only be a global state.
Currently it's a two-level dict: repo_path is the 1st-level key, while "name"s (ex. "changelog", "obsstore") are the 2nd level. I think the global state could be either public or private.

4) Where to get preloaded results (in the worker)?

repo.chgcache (or repo.whatever) mentioned previously is just a shortcut to "globalstate[repo.path]". In the real world, globalstate[repo.path]['index'], globalstate[repo.path]['obsstore'], etc. will be accessed. Note: when they are accessed, the framework should know how to calculate the latest hash and detect cache invalidation.

If we want to decouple from repo, we may want to swap the two keys of "globalstate", ex. globalstate['index'][indexpath]. That requires the function to yield the indexpath out so the framework knows where to check:

  @preload('index')
  def preloadindex(repo):
      yield indexpath
      yield hash
      yield content, hash

This is helpful if something is shared, like a shared repo. Otherwise, it may make the code a bit longer than necessary. So maybe we make what to "yield" typed:

  @preload('index')
  def preloadindex(repo):
      yield preload.key(indexpath)  # optional, default to repo.root
      yield preload.checkhash(hash)
      yield preload.update(content, hash)

This also looks more flexible - we may extend the preloading framework by adding more content types.

Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900:
> On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote:
> > Perf Numbers
> >
> > I wrote a hacky prototype [2] that shows significant improvements on
> > various commands in our repo:
> >
> >                              Before  After (in seconds)
> >   chg bookmark               0.40    0.08
> >   chg log -r . -T '{node}'   0.50    0.08
> >   chg sl                     0.71    0.27
> >   chg id                     0.71    0.37
> >
> > And hg-committed (with 12M obsstore) could benefit from it too mainly
> > because the obsstore becomes preloaded:
> >
> >                              Before  After
> >   chg bookmark               0.56    0.06
> >   chg log -r . -T '{node}'   0.56    0.07
> >   chg log -r . -p            0.86    0.08
> >   chg sl                     0.84    0.21
> >   chg id                     0.54    0.06
> >
> > So I think it's nice to get a formal implementation upstreamed. It's also
> > easier to solve the perf issues individually like building the hidden
> > bitmap, building a serialization format for the radix tree, etc.
> > Although we can also build them later.
>
> Nice. I want this perf number today. :)
>
> > So what state do we store?
> >
> > {repopath: {name: (hash, content)}}. For example:
> >
> >   cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
> >                                'bookmarks': ('hash', bookmarks),
> >                               },
> >            '/home/foo/repo2': { }, }
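The "typed yield" protocol discussed in this message could be driven by a small dispatcher in the master. A rough sketch under assumed names - the key/checkhash/update wrappers are modeled as namedtuples, and none of this is actual Mercurial or chg API:

```python
# Hypothetical driver for a typed-yield preload generator: each yielded
# value is wrapped so the framework knows whether it is a cache key, a
# hash to verify, or the final loaded content.
import collections

key = collections.namedtuple('key', 'path')
checkhash = collections.namedtuple('checkhash', 'hash')
update = collections.namedtuple('update', 'content hash')

def runpreload(gen, cache, defaultkey):
    """Drive a preload generator against a {key: (hash, content)} cache."""
    cachekey = defaultkey
    for step in gen:
        if isinstance(step, key):
            cachekey = step.path       # optional, overrides the default
        elif isinstance(step, checkhash):
            old = cache.get(cachekey)
            if old is not None and old[0] == step.hash:
                gen.close()            # cache hit: skip the slow load
                return old[1]
        elif isinstance(step, update):
            cache[cachekey] = (step.hash, step.content)
            return step.content
```

One attraction of the typed form is visible here: new step types can be added to the dispatcher without changing existing preload functions.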
Re: Making chg stateful
Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900:
> On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote:
> > Perf Numbers
> >
> > I wrote a hacky prototype [2] that shows significant improvements on
> > various commands in our repo:
> >
> >                              Before  After (in seconds)
> >   chg bookmark               0.40    0.08
> >   chg log -r . -T '{node}'   0.50    0.08
> >   chg sl                     0.71    0.27
> >   chg id                     0.71    0.37
> >
> > And hg-committed (with 12M obsstore) could benefit from it too mainly
> > because the obsstore becomes preloaded:
> >
> >                              Before  After
> >   chg bookmark               0.56    0.06
> >   chg log -r . -T '{node}'   0.56    0.07
> >   chg log -r . -p            0.86    0.08
> >   chg sl                     0.84    0.21
> >   chg id                     0.54    0.06
> >
> > So I think it's nice to get a formal implementation upstreamed. It's also
> > easier to solve the perf issues individually like building the hidden
> > bitmap, building a serialization format for the radix tree, etc.
> > Although we can also build them later.
>
> Nice. I want this perf number today. :)
>
> > So what state do we store?
> >
> > {repopath: {name: (hash, content)}}. For example:
> >
> >   cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
> >                                'bookmarks': ('hash', bookmarks),
> >                               },
> >            '/home/foo/repo2': { }, }
> >
> > The main ideas here are:
> > 1) Store the lowest level objects, like the C changelog index.
> >    Because higher level objects could be changed by extensions in
> >    unpredictable ways. (this is not true in my hacky prototype though)
> > 2) Hash everything. For changelog, it's like the file stat of
> >    changelog.i. There must be a strong guarantee that the hash matches
> >    the content, which could be challenging, but not impossible. I'll
> >    cover more details below.
> >
> > The cache is scoped by repo to make the API simpler/easy to use. It may
> > be interesting to have some global state (like passing back the extension
> > path to import them at runtime).
> Regarding this and "2) Side-effect-free repo", can or should we design the API
> as something like a low-level storage interface? That will allow us to not
> make repo/revlog know too much about chg.
>
> I don't have any concrete idea, but that would work as follows:
>
> 1. chg injects an object to select storage backends,
>    e.g. repo.storage = chgpreloadable(repo.storage, cache)
> 2. repo passes it to revlog, etc.
> 3. revlog uses it to read indexfile, where an in-memory cache may be
>    returned, e.g. storage.parserevlog(indexfile)
>
> Perhaps, this 'storage' object is similar to the one you call
> 'baserepository'.

I'm not sure I get the idea (probably not). What does the implementation in the master server look like? It feels more like "repo.chgcache" to me, and the difference is that vanilla hg would be changed to access objects via it (so the interface looks more consistent).

Things to consider:

a) Objects being preloaded have dependencies - ex. the obsstore depends on
   the changelog and other things. The preload functions run in a defined
   order.
b) The index file is not always a single file, depending on "vfs".
c) The user may want to control what to preload. For example, if they have
   an incompatible manifest, they could have the changelog preloaded, but
   not the manifest.
d) Users can add other preloading items easily, not only the predefined
   ones.

I think "storage.parserevlog(indexfile)" (treating the index separately, without a repo object) may have trouble dealing with "a)".

> > Then, repo.chgcache['index'] becomes available in worker processes. When
> > initializing the changelog index, try to use the chg cache:
> >
> >   # inside changelog's revlog.__init__:
> >   # note: repo.chgcache is empty for non-chg cases, a fallback is needed
> >   self.index = repo.chgcache.get('index') or _loadindex(...)
> > The API is the simplest that I can think of, while being also reasonably
> > flexible (for example, we can add additional steps like forcing building
> > the radix tree etc). But I'm open to suggestions.
> >
> > Some implementation details
> >
> > (This is the part that I want feedback the most)
> >
> > 1) IPC between chg master and forked worker
> >
> > This is how the worker tells the master about what (ex. repo paths) to
> > preload.
> >
> > I think it does not need to be 100% reliable. So I use shared memory in
> > the hacky prototype [2]. Pipes are reliable and may notify the master
> > quicker, while having risks of blocking. shm seems much easier to
> > implement and I think the latency of
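The ordering concern in a) above - preload functions running in a defined order - could be handled by a registry that topologically sorts functions by their declared dependencies. A sketch under assumed names (the decorator and its depend= argument are illustrations from this thread, not actual chg code):

```python
# Hypothetical preload registry: functions declare dependencies, and the
# master's background thread runs them in topological order, so e.g.
# 'obsstore' is only preloaded after 'changelog'.
_registry = {}   # name -> (func, dependencies)

def preload(name, depend=()):
    """Decorator registering a preload function under 'name'."""
    def register(func):
        _registry[name] = (func, tuple(depend))
        return func
    return register

def preloadorder():
    """Return registered names, dependencies before dependents."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in _registry[name][1]:
            visit(dep)       # depth-first: emit dependencies first
        order.append(name)
    for name in _registry:
        visit(name)
    return order
```

This also gives point c) a natural hook: a user-facing config could simply filter which registered names get visited.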
Re: Making chg stateful
Excerpts from Simon Farnsworth's message of 2017-02-02 15:25:45 +:
> On 02/02/2017 09:34, Jun Wu wrote:
> >
> > What's the API?
> >
> > (This is an existing implementation detail. I'm open to any ideas)
> >
> > For example, let's say we want to preload the changelog index (code
> > simplified so it does not take care of all corner cases).
> >
> > First, tell chg how to hash and load something:
> >
> >   from mercurial.chgserver import repopreload
> >
> >   @repopreload('index')
> >   def foopreloader(repo):
> >       # use the size of changelog.i as the hash. note: the hash function
> >       # must be very fast.
> >       hash = repo.svfs.stat('00changelog.i').st_size
> >       # tell chg about the current hash. if the hash matches, the
> >       # generator function stops here.
> >       yield hash
> >
> >       # if the hash mismatches, load the changelog in a slower way.
> >       with repo.svfs('00changelog.i') as f:
> >           data = f.read()
> >           hash = len(data)
> >           index = _buildindex(data)
> >           index.partialmatch('')  # force build the radix tree
> >           # tell chg about the loading result and the hash that
> >           # absolutely matches the result.
> >           yield index, hash
> >
> > Then, repo.chgcache['index'] becomes available in worker processes. When
> > initializing the changelog index, try to use the chg cache:
> >
> >   # inside changelog's revlog.__init__:
> >   # note: repo.chgcache is empty for non-chg cases, a fallback is needed
> >   self.index = repo.chgcache.get('index') or _loadindex(...)
>
> Can we ensure that chgcache.get(key) always gets an answer (i.e. no
> fallback needed)? My concern is divergence between the fallback and
> chgcache versions of the data fetch.

It's diverged intentionally, mainly because:

1. The chg version could have more internal states to preload. Take the
   changelog index for example: chg would like to force pre-populating the
   partialmatch radix tree, and other indexes we will have in the future.
2. If chg is not used (ex. Windows), it's not necessary to calculate the
   hash, although that's cheap.
I think the chg preload function could call the vanilla load function, plus some extra work. That will ensure consistency.

> It seems like it should be trivial to implement, too - repo.chgcache for
> the non-chg case would do:
>
>   g = preloadfunctable[name](self._prepo)
>   hash = next(g)
>   content, hash = next(g)
>   return g
>
> knowing that computing hash is cheap.
>
> This would mean that you'd just do:
>
>   self.index = repo.chgcache.get('index')
>
> and if chg can speed you up, it does, while if it's not in use, you get
> the old slow path. It also means that chg and hg will give the same
> results, and that there's pressure to use a cheap hash for invalidations
> (because the repo is effectively always hashed).
>
> > The API is the simplest that I can think of, while being also reasonably
> > flexible (for example, we can add additional steps like forcing building
> > the radix tree etc). But I'm open to suggestions.
> >
> > Some implementation details
> >
> > (This is the part that I want feedback the most)
> >
> > 1) IPC between chg master and forked worker
> >
> > This is how the worker tells the master about what (ex. repo paths) to
> > preload.
> >
> > I think it does not need to be 100% reliable. So I use shared memory in
> > the hacky prototype [2]. Pipes are reliable and may notify the master
> > quicker, while having risks of blocking. shm seems much easier to
> > implement and I think the latency of the master getting the information
> > is not a big deal. I prefer shm, but other ideas are welcomed.
>
> Did you try O_NONBLOCK pipes, out of interest?

That could be an option. The ugly part is that "write"ing more than PIPE_BUF bytes could succeed with only partial content written. So it requires special logic to deal with message boundaries, maybe by adding '\0's at both ends.

> Having said that, given that the worker is forked from the master, shm
> seems perfectly reasonable to me - the worker and master will, by
> definition, be running the same code.
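For the pipe option, the boundary problem above disappears if every message is kept under PIPE_BUF, since POSIX guarantees a write of at most PIPE_BUF bytes to a pipe is atomic. A hypothetical framing sketch, not the chg implementation (Unix-only; names invented):

```python
# Sketch: length-prefixed messages on an O_NONBLOCK pipe. Each message is
# kept under PIPE_BUF (POSIX guarantees at least 512 bytes) so a single
# write() is atomic and no boundary-repair logic is needed.
import fcntl
import os
import struct

PIPE_BUF = 512  # conservative POSIX minimum

def makechannel():
    """Create a pipe with a non-blocking read end for the master."""
    r, w = os.pipe()
    fcntl.fcntl(r, fcntl.F_SETFL, os.O_NONBLOCK)
    return r, w

def send(w, path):
    """Worker side: post a repo path in a single atomic write."""
    data = path.encode('utf-8')
    msg = struct.pack('H', len(data)) + data
    assert len(msg) <= PIPE_BUF  # atomicity guaranteed below PIPE_BUF
    os.write(w, msg)

def recv(r):
    """Master side: poll for one message; None if nothing is pending."""
    try:
        n, = struct.unpack('H', os.read(r, 2))
        return os.read(r, n).decode('utf-8')
    except OSError:              # EAGAIN: pipe currently empty
        return None
```

Repo paths comfortably fit under 512 bytes, so this avoids the partial-write corner entirely rather than repairing it with sentinel bytes.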
> However, rather than coding it ourselves, I'd be inclined to use
> multiprocessing (https://docs.python.org/2/library/multiprocessing.html)
> to provide the shared memory primitives you need.

I actually checked multiprocessing before. It requires you to use its special fork() implementation instead of just "os.fork()" so that its fork hooks get executed. It has a complex abstraction around fork hooks, raw pointers, serialization, etc., and does not actually support Windows well. Given the unnecessary complexity and overhead of multiprocessing, I think a 50-line implementation is better.

> > 2) Side-effect-free repo
> >
> > This is the most "painful" part for the preloading framework
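The small hand-rolled alternative alluded to above could be built on an anonymous shared mmap created before fork(), which child processes inherit. A simplified sketch of that idea only - no locking or error handling, and every name here is made up:

```python
# Sketch: a fixed-size anonymous shared mmap region. Created by the
# master before fork(), it is visible to forked workers, which can leave
# a repo path in it for the master's background thread to pick up.
import mmap
import struct

REGIONSIZE = 4096

def makeregion():
    # fd=-1: anonymous mapping; mmap defaults to MAP_SHARED on Unix,
    # so the region is shared with children forked afterwards.
    return mmap.mmap(-1, REGIONSIZE)

def postpath(region, path):
    """Worker side: publish a repo path (length-prefixed)."""
    data = path.encode('utf-8')
    region.seek(0)
    region.write(struct.pack('I', len(data)) + data)

def readpath(region):
    """Master side: read the last published path, or None if empty."""
    region.seek(0)
    n, = struct.unpack('I', region.read(4))
    if not n:
        return None
    return region.read(n).decode('utf-8')
```

A real version would still need a mutex or an atomic generation counter for concurrent workers, but the core stays far smaller than what multiprocessing drags in.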
Re: Making chg stateful
On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote:
> Perf Numbers
>
> I wrote a hacky prototype [2] that shows significant improvements on
> various commands in our repo:
>
>                              Before  After (in seconds)
>   chg bookmark               0.40    0.08
>   chg log -r . -T '{node}'   0.50    0.08
>   chg sl                     0.71    0.27
>   chg id                     0.71    0.37
>
> And hg-committed (with 12M obsstore) could benefit from it too mainly
> because the obsstore becomes preloaded:
>
>                              Before  After
>   chg bookmark               0.56    0.06
>   chg log -r . -T '{node}'   0.56    0.07
>   chg log -r . -p            0.86    0.08
>   chg sl                     0.84    0.21
>   chg id                     0.54    0.06
>
> So I think it's nice to get a formal implementation upstreamed. It's also
> easier to solve the perf issues individually like building the hidden
> bitmap, building a serialization format for the radix tree, etc.
> Although we can also build them later.

Nice. I want this perf number today. :)

> So what state do we store?
>
> {repopath: {name: (hash, content)}}. For example:
>
>   cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
>                                'bookmarks': ('hash', bookmarks),
>                               },
>            '/home/foo/repo2': { }, }
>
> The main ideas here are:
> 1) Store the lowest level objects, like the C changelog index.
>    Because higher level objects could be changed by extensions in
>    unpredictable ways. (this is not true in my hacky prototype though)
> 2) Hash everything. For changelog, it's like the file stat of
>    changelog.i. There must be a strong guarantee that the hash matches
>    the content, which could be challenging, but not impossible. I'll
>    cover more details below.
>
> The cache is scoped by repo to make the API simpler/easy to use. It may
> be interesting to have some global state (like passing back the extension
> path to import them at runtime).

Regarding this and "2) Side-effect-free repo", can or should we design the API as something like a low-level storage interface? That will allow us to not make repo/revlog know too much about chg.

I don't have any concrete idea, but that would work as follows:

1. chg injects an object to select storage backends,
   e.g. repo.storage = chgpreloadable(repo.storage, cache)
2. repo passes it to revlog, etc.
3. revlog uses it to read indexfile, where an in-memory cache may be
   returned, e.g. storage.parserevlog(indexfile)

Perhaps, this 'storage' object is similar to the one you call 'baserepository'.

> Then, repo.chgcache['index'] becomes available in worker processes. When
> initializing the changelog index, try to use the chg cache:
>
>   # inside changelog's revlog.__init__:
>   # note: repo.chgcache is empty for non-chg cases, a fallback is needed
>   self.index = repo.chgcache.get('index') or _loadindex(...)
>
> The API is the simplest that I can think of, while being also reasonably
> flexible (for example, we can add additional steps like forcing building
> the radix tree etc). But I'm open to suggestions.
>
> Some implementation details
>
> (This is the part that I want feedback the most)
>
> 1) IPC between chg master and forked worker
>
> This is how the worker tells the master about what (ex. repo paths) to
> preload.
>
> I think it does not need to be 100% reliable. So I use shared memory in
> the hacky prototype [2]. Pipes are reliable and may notify the master
> quicker, while having risks of blocking. shm seems much easier to
> implement and I think the latency of the master getting the information
> is not a big deal. I prefer shm, but other ideas are welcomed.

I like using pipes and it won't be complex so long as an IPC message is short enough to write atomically (< ~4k). But pipe-vs-shm is merely an implementation detail, so we can switch them later if needed.

> 2) Side-effect-free repo
>
> This is the most "painful" part for the preloading framework to be
> confident.
>
> The preload function has a signature that takes a repo object. But chg
> could not provide a real repo object. So it's best-effort.
>
> On the other hand, most preload functions only need to hash file stats,
> i.e. just use repo.vfs, repo.svfs etc. They do not need a full-featured
> repo object.
>
> Therefore I think the choices are:
>
> a) Provide a repo object that is: localrepository - side effects
>    (maximum compatibility)
> b) Build a fresh new repo object that has only the minimal part, ex.
>    vfs, svfs etc. (less compatible)
> c) Do not provide a repo object. Just provide "repo path".
>    (move the burden to the preload function writers)
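Choice b) could look roughly like the following: a stripped-down, side-effect-free repo exposing only root/vfs/svfs, which is all a stat-based hash function needs. This is only an illustration of the idea, not Mercurial's actual classes:

```python
# Sketch of choice b): a minimal pseudo repo for preload functions that
# only hash file stats and read store files. 'pseudovfs'/'pseudorepo'
# are invented names, not real Mercurial API.
import os

class pseudovfs(object):
    """Bare-bones vfs: stat and open files under a base directory."""
    def __init__(self, base):
        self.base = base

    def stat(self, name):
        return os.stat(os.path.join(self.base, name))

    def __call__(self, name, mode='rb'):
        return open(os.path.join(self.base, name), mode)

class pseudorepo(object):
    """Side-effect-free repo: constructing it touches no global state."""
    def __init__(self, root):
        self.root = root
        self.vfs = pseudovfs(os.path.join(root, '.hg'))
        self.svfs = pseudovfs(os.path.join(root, '.hg', 'store'))
```

Anything needing repo.changelog or repo.revs (the obsstore case) would still be missing here, which is why the thread leans toward carving a richer "baserepository" out of localrepository instead.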
Re: Making chg stateful
On 02/02/2017 09:34, Jun Wu wrote:
> What's the API?
>
> (This is an existing implementation detail. I'm open to any ideas)
>
> For example, let's say we want to preload the changelog index (code
> simplified so it does not take care of all corner cases).
>
> First, tell chg how to hash and load something:
>
>   from mercurial.chgserver import repopreload
>
>   @repopreload('index')
>   def foopreloader(repo):
>       # use the size of changelog.i as the hash. note: the hash function
>       # must be very fast.
>       hash = repo.svfs.stat('00changelog.i').st_size
>       # tell chg about the current hash. if the hash matches, the
>       # generator function stops here.
>       yield hash
>
>       # if the hash mismatches, load the changelog in a slower way.
>       with repo.svfs('00changelog.i') as f:
>           data = f.read()
>           hash = len(data)
>           index = _buildindex(data)
>           index.partialmatch('')  # force build the radix tree
>           # tell chg about the loading result and the hash that
>           # absolutely matches the result.
>           yield index, hash
>
> Then, repo.chgcache['index'] becomes available in worker processes. When
> initializing the changelog index, try to use the chg cache:
>
>   # inside changelog's revlog.__init__:
>   # note: repo.chgcache is empty for non-chg cases, a fallback is needed
>   self.index = repo.chgcache.get('index') or _loadindex(...)

Can we ensure that chgcache.get(key) always gets an answer (i.e. no fallback needed)? My concern is divergence between the fallback and chgcache versions of the data fetch.

It seems like it should be trivial to implement, too - repo.chgcache for the non-chg case would do:

  g = preloadfunctable[name](self._prepo)
  hash = next(g)
  content, hash = next(g)
  return g

knowing that computing hash is cheap.

This would mean that you'd just do:

  self.index = repo.chgcache.get('index')

and if chg can speed you up, it does, while if it's not in use, you get the old slow path. It also means that chg and hg will give the same results, and that there's pressure to use a cheap hash for invalidations (because the repo is effectively always hashed).

> The API is the simplest that I can think of, while being also reasonably
> flexible (for example, we can add additional steps like forcing building
> the radix tree etc). But I'm open to suggestions.
>
> Some implementation details
>
> (This is the part that I want feedback the most)
>
> 1) IPC between chg master and forked worker
>
> This is how the worker tells the master about what (ex. repo paths) to
> preload.
>
> I think it does not need to be 100% reliable. So I use shared memory in
> the hacky prototype [2]. Pipes are reliable and may notify the master
> quicker, while having risks of blocking. shm seems much easier to
> implement and I think the latency of the master getting the information
> is not a big deal. I prefer shm, but other ideas are welcomed.

Did you try O_NONBLOCK pipes, out of interest?

Having said that, given that the worker is forked from the master, shm seems perfectly reasonable to me - the worker and master will, by definition, be running the same code.

However, rather than coding it ourselves, I'd be inclined to use multiprocessing (https://docs.python.org/2/library/multiprocessing.html) to provide the shared memory primitives you need.

> 2) Side-effect-free repo
>
> This is the most "painful" part for the preloading framework to be
> confident.
>
> The preload function has a signature that takes a repo object. But chg
> could not provide a real repo object. So it's best-effort.
>
> On the other hand, most preload functions only need to hash file stats,
> i.e. just use repo.vfs, repo.svfs etc. They do not need a full-featured
> repo object.
>
> Therefore I think the choices are:
>
> a) Provide a repo object that is: localrepository - side effects
>    (maximum compatibility)
> b) Build a fresh new repo object that has only the minimal part, ex.
>    vfs, svfs etc. (less compatible)
> c) Do not provide a repo object. Just provide "repo path".
>    (move the burden to the preload function writers)
>
> I currently prefer a).
>
> The plan is to move part of "localrepository" to a side-effect free
> "baserepository" and use "baserepository" in the background preloading
> thread.

I like this plan.

> 3) Side effect of extensions
>
> The new chg framework will not run uisetup()s. But the preloading
> framework does sometimes depend on some side effects of extensions'
> uisetup()s. For example, the *manifest extension could greatly change
> the manifest structure, so the default manifest preloading won't work
> as expected.
>
> I think there are different ways to address this:
>
> a) Add a top-level "chgsetup()" which only gets called by chg, and is
>    meant to be side-effect free.
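The "trivial to implement" non-chg fallback suggested earlier in this message could be made concrete roughly as follows; preloadfunctable and the two-step generator protocol are assumptions carried over from this thread, not actual Mercurial API:

```python
# Sketch of the non-chg repo.chgcache: drive the same preload generator
# synchronously, so hg and chg share one code path for loading. Without
# a master process there is no cache, so the slow path always runs.
preloadfunctable = {}  # name -> generator function taking a repo

class synccache(object):
    """Stand-in for repo.chgcache when no chg master is present."""
    def __init__(self, repo):
        self._repo = repo

    def get(self, name):
        g = preloadfunctable[name](self._repo)
        next(g)                    # cheap hash; unused without a cache
        content, newhash = next(g) # always take the slow load path
        return content
```

With this in place, `self.index = repo.chgcache.get('index')` works identically with and without chg, addressing the divergence concern at the cost of always computing the (cheap) hash.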
Re: Making chg stateful
Fixes a mysterious typo that affects reading.

Excerpts from Jun Wu's message of 2017-02-02 09:34:47 +:
> This is mainly to discuss more details about chg repo preloading idea [1].
>
> Perf Numbers
>
> I wrote a hacky prototype [2] that shows significant improvements on
> various commands in our repo:
>
>                              Before  After (in seconds)
>   chg bookmark               0.40    0.08
>   chg log -r . -T '{node}'   0.50    0.08
>   chg sl                     0.71    0.27
>   chg id                     0.71    0.37
>
> And hg-committed (with 12M obsstore) could benefit from it too mainly
> because the obsstore becomes preloaded:
>
>                              Before  After
>   chg bookmark               0.56    0.06
>   chg log -r . -T '{node}'   0.56    0.07
>   chg log -r . -p            0.86    0.08
>   chg sl                     0.84    0.21
>   chg id                     0.54    0.06
>
> So I think it's nice to get a formal implementation upstreamed. It's also
> easier to solve the perf issues individually like building the hidden

"to solve" should be "than solving"

> bitmap, building a serialization format for the radix tree, etc.
> Although we can also build them later.

___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Making chg stateful
This is mainly to discuss more details about chg repo preloading idea [1].

Perf Numbers

I wrote a hacky prototype [2] that shows significant improvements on various commands in our repo:

                             Before  After (in seconds)
  chg bookmark               0.40    0.08
  chg log -r . -T '{node}'   0.50    0.08
  chg sl                     0.71    0.27
  chg id                     0.71    0.37

And hg-committed (with 12M obsstore) could benefit from it too mainly because the obsstore becomes preloaded:

                             Before  After
  chg bookmark               0.56    0.06
  chg log -r . -T '{node}'   0.56    0.07
  chg log -r . -p            0.86    0.08
  chg sl                     0.84    0.21
  chg id                     0.54    0.06

So I think it's nice to get a formal implementation upstreamed. It's also easier to solve the perf issues individually like building the hidden bitmap, building a serialization format for the radix tree, etc. Although we can also build them later.

Stateful?

"Stateful" refers to the chg master server process. Currently the chg master process is stateless: after forking, the client talks to the forked worker directly without affecting the master, and the worker does not talk back to the master either:

  client            master         worker
  | connect() ->    | accept()
  :                 | fork() ->    |
  | send() -------------------->  | recv()
  # client and worker no longer talk to master

Therefore the master is currently stateless - it can only preload extensions, but nothing about the repo state. The general direction is to make the worker tell the master what needs to be preloaded (like repo paths), and have the master run a background thread preloading them. Then, at fork(), the new worker gets the cache for free.

How about just preloading the repo object?

There was a failed experiment: [3]. The repo object depends on too many side effects and gets invalidated too easily. chg will not run uisetup()s, so it'll become harder to get a proper repo object.

So what state do we store?

{repopath: {name: (hash, content)}}.
For example:

  cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
                               'bookmarks': ('hash', bookmarks),
                              },
           '/home/foo/repo2': { }, }

The main ideas here are:
1) Store the lowest level objects, like the C changelog index.
   Because higher level objects could be changed by extensions in
   unpredictable ways. (this is not true in my hacky prototype though)
2) Hash everything. For changelog, it's like the file stat of
   changelog.i. There must be a strong guarantee that the hash matches
   the content, which could be challenging, but not impossible. I'll
   cover more details below.

The cache is scoped by repo to make the API simpler/easier to use. It may be interesting to have some global state (like passing back the extension path to import them at runtime).

What's the API?

(This is an existing implementation detail. I'm open to any ideas)

For example, let's say we want to preload the changelog index (code simplified so it does not take care of all corner cases).

First, tell chg how to hash and load something:

  from mercurial.chgserver import repopreload

  @repopreload('index')
  def foopreloader(repo):
      # use the size of changelog.i as the hash. note: the hash function
      # must be very fast.
      hash = repo.svfs.stat('00changelog.i').st_size
      # tell chg about the current hash. if the hash matches, the
      # generator function stops here.
      yield hash

      # if the hash mismatches, load the changelog in a slower way.
      with repo.svfs('00changelog.i') as f:
          data = f.read()
          hash = len(data)
          index = _buildindex(data)
          index.partialmatch('')  # force build the radix tree
          # tell chg about the loading result and the hash that
          # absolutely matches the result.
          yield index, hash

Then, repo.chgcache['index'] becomes available in worker processes. When initializing the changelog index, try to use the chg cache:

  # inside changelog's revlog.__init__:
  # note: repo.chgcache is empty for non-chg cases, a fallback is needed
  self.index = repo.chgcache.get('index') or _loadindex(...)
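The fallback line above implies that the worker-side cache must validate its stored hash before answering, so a stale entry falls through to the normal load path. A hypothetical sketch of that validation (all names invented, not the real chgserver code):

```python
# Sketch of the worker-side repo.chgcache: return a preloaded object only
# if its stored hash still matches the current on-disk state; otherwise
# answer None so the caller falls back to the slow load path.
class chgcache(object):
    def __init__(self, store, hashfuncs):
        self._store = store          # {name: (hash, content)} from master
        self._hashfuncs = hashfuncs  # {name: callable -> current hash}

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None              # nothing preloaded
        oldhash, content = entry
        if self._hashfuncs[name]() != oldhash:
            return None              # stale: caller runs _loadindex(...)
        return content
```

This keeps the strong "hash matches content" guarantee local to the cache: the caller never sees a preloaded object whose cheap hash disagrees with the repo on disk.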
The API is the simplest that I can think of, while being also reasonably flexible (for example, we can add additional steps like forcing building the radix tree etc). But I'm open to suggestions.

Some implementation details

(This is the part that I want feedback the most)

1) IPC between chg master and forked worker

This is how the worker tells the master about what (ex. repo paths) to preload.

I think it does not need to be 100% reliable. So I use shared memory in the hacky prototype [2].