Re: Making chg stateful
Excerpts from Yuya Nishihara's message of 2017-02-05 13:59:32 +0900:
> Perhaps that is a matter of taste, I like the explicit one. But I heard
> Python 3 has a generator-based coroutine, so using "yield" for that purpose
> is more common in Python 3?
>
>   @preload('index')
>   def preloadindex(repo):
>       f = repo.svfs('changelog.i')  # "f" can be shared naturally
>       hash = fstat(f).st_size
>       def loadindex():
>           indexdata = f.read()
>           hash = len(indexdata)
>           return parseindex(indexdata), hash

The inner "loadindex" function makes things unnecessarily complex in my
opinion. It makes the final return type harder to read, and introduces extra
indentation and a function definition.

If the goal is to get rid of "yield", it could be done in different ways.
For example,

Choice A: explicitly pass the old hash and the old object. We probably need
the old object anyway to allow incremental updates:

  @preload('obsstore', depend=['changelog'])
  def preloadobsstore(repo, oldhash, oldobsstore):
      obsstore = oldobsstore
      newhash = ...
      if newhash != oldhash:
          # note: this is an indentation level that does not exist with yield
          obsstore = ...
      return newhash, obsstore

Choice B: just give the preload function access to the cache object:

  @preload(['obsstore', 'changelog'])
  def preloadmultiple(repo, cache):
      oldhash, oldobsstore = cache['obsstore']

Choice A looks simple. Choice B is more expressive, but the function body
could get out of control - ex. it could skip setting cache['obsstore'] even
if it says so in its decorator.

While Choice A may serve most requirements just fine, I think "yield" is
more expressive and avoids unnecessary indentation. For example, the
following could not be expressed cleanly and easily using "return":

- Optional dependency:

    if fastpath_cannot_work:
        yield preload.require('changelog')

  Maybe "raise MissingRequirement('changelog')" could work, but that makes
  the function "restart", while a generator could continue executing without
  losing context.
- Optionally change the cache key: as mentioned in "4)". [1]

Therefore I think the generator way avoids all kinds of indentation (it
reminds me of nodejs callback hell) and has better extensibility, so I
currently prefer it. Otherwise, "Choice A" may be good enough for now. The
code base already uses "yield" this way in its "future" implementation for
batched remote peer calls.

[1] https://www.mercurial-scm.org/pipermail/mercurial-devel/2017-February/092584.html

>       return hash, loadindex
>
> BTW, when will 'f' be closed if we take the generator-based abstraction?
>
>   def preloadindex(repo):
>       f = repo.svfs('changelog.i')
>       try:
>           yield ...
>           yield ...
>       finally:
>           f.close()

___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
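To make the trade-off concrete, here is a runnable toy sketch of "Choice A".
All names here (fakerepo, preloadbookmarks, runpreload) are invented for
illustration and are not real Mercurial APIs; the point is only the
(repo, oldhash, oldobject) -> (newhash, newobject) protocol.

```python
def preloadbookmarks(repo, oldhash, oldbookmarks):
    newhash = repo['bookmarks_mtime']       # cheap hash, e.g. a file mtime
    obj = oldbookmarks
    if newhash != oldhash:                  # the extra indentation level
        obj = dict(repo['bookmarks_data'])  # that "yield" would avoid
    return newhash, obj

def runpreload(repo, cache, name, func):
    # the framework passes the old (hash, object) pair in explicitly
    oldhash, oldobj = cache.get(name, (None, None))
    cache[name] = func(repo, oldhash, oldobj)
    return cache[name]

fakerepo = {'bookmarks_mtime': 1, 'bookmarks_data': {'master': 'abc123'}}
cache = {}
h1, b1 = runpreload(fakerepo, cache, 'bookmarks', preloadbookmarks)
h2, b2 = runpreload(fakerepo, cache, 'bookmarks', preloadbookmarks)
assert b1 is b2   # unchanged hash: the old object is reused, no reload
```

The same structure extends to incremental updates: on a hash mismatch the
function may patch `oldbookmarks` in place instead of rebuilding it.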
Re: Making chg stateful
Excerpts from Yuya Nishihara's message of 2017-02-05 13:31:29 +0900:
> > > Since we shouldn't pass repo to revlog (it's a layering violation), I
> > > think we'll need a thin wrapper for chgcache anyway.
> >
> > I mentioned this in the second mail, "4) Where to get preloaded results
> > (in worker)"; we could just expose some kind of global state, like a
> > "globalcache" module.
>
> Does it mean any low-level objects will directly access the global cache?
> That seems ugly (but maybe I'm biased, as I'm really allergic to global
> data.)

Yes. But I don't think the alternative would be better. To avoid using
global state, existing functions need to take extra arguments. That's also
contagious - to preload some low-level objects, all functions through the
call stack need to change.

I think global state is considered bad mainly because it's mutable by random
code, which could make things hard to predict or debug. We can enforce that
it is only mutable by the chg preloading API, so people cannot modify its
state outside the preloading framework. That sounds good enough to me.

> > > > Things to consider:
> > > >
> > > > a) Objects being preloaded have dependencies - ex. the obsstore
> > > >    depends on the changelog and other things. The preload functions
> > > >    run in a defined order.
> > >
> > > Maybe dependencies could be passed as arguments?
> >
> > Ideally, this expensive calculation (ex. the obsstore) could be moved to
> > the index object. In reality, that requires too much work, and obsstore
> > preloading requires a subset of "repo", including "repo.revs".
> >
> > It's possible to decouple obsstore preloading from the repo object, but
> > that could be a lot of work too.
>
> My opinion for the obsstore is that the most costly part would be
> populating 100k+ objects from the file, and the other costly parts could
> be mitigated by some higher-level cache in repoview.py.
> But I think this topic was discussed thoroughly between you and pyd before.
> I don't intend to bring it up again.

Not discussed. I did mention a plan to move it to the index, but that'll
surely take a very long time.

There are multiple levels of logic related to the obsstore, namely
obsstore.getrevs, where the current logic uses revsets (thus needs a heavy
repo object), and repoview.compute*, which calls obsstore.getrevs and does
some extra simple caching and calculation.

Even without the obsstore, other things (phases, bookmarks) depend on the
changelog. Therefore I think the dependency problem must be solved if we
want to have more things preloaded to achieve the best perf.

> > > > b) The index file is not always a single file, depending on "vfs".
> > >
> > > Yes. vfs could be owned by the storage/chgcache class.
> > >
> > > > c) The user may want to control what to preload. For example, if they
> > > >    have an incompatible manifest, they could make the changelog
> > > >    preloaded, but not the manifest.
> > >
> > > No idea about (c).
> > >
> > > > d) Users can add other preloading items easily, not only just the
> > > >    predefined ones.
> > >
> > > So probably we'll need an extensible table of preloadable items.
> >
> > If you check my prototype code, it's using a registrar to collect all
> > @preload functions.
>
> Yes. I wanted to say we would need this kind of abstraction anyway.
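The "global state that only the preloading API may mutate" idea can be
sketched as a tiny module-level cache with a controlled interface. The
`_store`, `update`, and `lookup` names are hypothetical, not part of chg;
real enforcement would need more than convention, but the shape is this:

```python
# Module-level cache: {repopath: {name: (hash, content)}}.
# By convention (and ideally by API design), only the chg preloading
# framework calls update(); everything else goes through lookup().
_store = {}

def update(repopath, name, hash_, content):
    """Mutation entry point reserved for the preload framework."""
    _store.setdefault(repopath, {})[name] = (hash_, content)

def lookup(repopath, name, expectedhash):
    """Return cached content only if the recorded hash still matches."""
    entry = _store.get(repopath, {}).get(name)
    if entry is not None and entry[0] == expectedhash:
        return entry[1]
    return None  # stale or missing: caller falls back to a normal load

update('/home/foo/repo1', 'index', 'h1', ['rev0', 'rev1'])
assert lookup('/home/foo/repo1', 'index', 'h1') == ['rev0', 'rev1']
assert lookup('/home/foo/repo1', 'index', 'stale') is None
```

Because reads are hash-checked, random code can at worst see a cache miss,
never a silently stale object, which addresses part of the "global data"
concern.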
Re: Making chg stateful
On Fri, 3 Feb 2017 19:50:48 +, Jun Wu wrote: > Excerpts from Yuya Nishihara's message of 2017-02-04 00:31:45 +0900: > > On Fri, 3 Feb 2017 09:33:32 +, Jun Wu wrote: > > > 1) What does the worker talk to the master via IPC? > > > > > > Current: worker sends "possible_dirty_repo: $REPO_PATH" after runcommand > > > > > > It's coupled with the (pseudo) repo object. But the master could > > > preload > > > things in correct order - first changelog, then phase, obsstore, etc. > > > > > > What if: worker sends "possible_dirty_index: $INDEX_PATH", > > > "possible_dirty_obsstore $OBSSTORE_PATH" etc. > > > > > > It looks more flexible at first sight. However the dependency issue > > > could be tricky to solve - for example, the obsstore requires > > > changelog > > > (and revset). It's much harder to preload obsstore without a pseudo > > > repo object with repo.changelog, repo.revs available. > > > > > > Therefore, probably just send the repo path. > > > > Given this, I think pipes would be simpler than using shared memory. The > > master > > IO can be multiplexed by select(). > > If it's one pipe per worker, the code change (I tried) is too much. I think > a single non-blocking pipe could be an option. Yep. As we'll rely on atomic write(), we won't have to create pipe per worker. > > > 2) Preload function signature (used by master's background thread). > > > > > > This function is to hash and load things. It runs in the master, and is > > > basically a generator (so the "loading" part could be skipped on cache > > > hit, and it's easier to make content and hash_with_content consistent, > > > if > > > they share states): > > > > > > @preloadregistar(something) > > > def preloadfunc(...): > > > yield quick_hash > > > yield content, hash_with_content > > > > Isn't it simpler if preloadfunc() returns quick_hash and a delayed function > > to build content? I feel this generator is kinda abuse and makes it less > > clear > > where the costly part starts. 
> I have thought about this. "Generator" is used as a lightweight process
> that could be terminated cheaply. It shares states (local variables) for
> hashing and loading, making certain things easier.
>
> Compare:
>
>   @preload('index')
>   def preloadindex(repo):
>       f = repo.svfs('changelog.i')  # "f" can be shared naturally
>       hash = fstat(f).st_size
>       yield hash
>
>       indexdata = f.read()
>       hash = len(indexdata)
>       yield parseindex(indexdata), hash
>
> vs.
>
>   def loadindex(f):
>       indexdata = f.read()
>       hash = len(indexdata)
>       return parseindex(indexdata), hash
>
>   @preload('index')
>   def preloadindex(repo):
>       f = repo.svfs('changelog.i')
>       hash = repo.svfs('changelog.i').st_size
>       loadfunc = lambda: loadindex(f)  # pass "f" explicitly
>       return hash, loadfunc
>
> The generator seems to be the shortest way to express the logic.
> I think it's easy to say that things after the first "yield" are costly.
>
> For readability, I think the improved form mentioned at the end of the
> previous mail, "yield preload.foo(...)", makes things obvious.

Perhaps that is a matter of taste, I like the explicit one. But I heard
Python 3 has a generator-based coroutine, so using "yield" for that purpose
is more common in Python 3?

  @preload('index')
  def preloadindex(repo):
      f = repo.svfs('changelog.i')  # "f" can be shared naturally
      hash = fstat(f).st_size
      def loadindex():
          indexdata = f.read()
          hash = len(indexdata)
          return parseindex(indexdata), hash
      return hash, loadindex

BTW, when will 'f' be closed if we take the generator-based abstraction?

  def preloadindex(repo):
      f = repo.svfs('changelog.i')
      try:
          yield ...
          yield ...
      finally:
          f.close()
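On the "when will 'f' be closed" question: Python runs a generator's
`finally` block when the generator is closed or exhausted, so a master-side
driver that calls `gen.close()` answers it. The following toy (with a fake
file name standing in for `f`, and invented `preloadindex`/`drive` names)
shows both paths; it is a sketch of the protocol, not chg code:

```python
closed = []  # records which "files" had their cleanup run

def preloadindex(fakefile):
    try:
        yield 'quickhash'                   # cheap hash first
        yield ('parsedindex', 'fullhash')   # costly part, only on cache miss
    finally:
        closed.append(fakefile)             # stands in for f.close()

def drive(gen, cachedhash):
    quickhash = next(gen)
    if quickhash == cachedhash:
        gen.close()   # cache hit: GeneratorExit is raised at the first
        return None   # yield, so the finally block runs right here
    content, fullhash = next(gen)
    gen.close()       # cache miss: generator is done; finally runs too
    return content, fullhash

assert drive(preloadindex('f1'), 'quickhash') is None  # hit: load skipped
assert closed == ['f1']
assert drive(preloadindex('f2'), 'stale') == ('parsedindex', 'fullhash')
assert closed == ['f1', 'f2']
```

So with the generator form, `f.close()` runs as soon as the framework
decides it is done with the preload function, on either path.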
Re: Making chg stateful
On Fri, 3 Feb 2017 20:03:18 +, Jun Wu wrote: > Excerpts from Yuya Nishihara's message of 2017-02-04 00:11:22 +0900: > > On Thu, 2 Feb 2017 16:56:11 +, Jun Wu wrote: > > > Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900: > > > > On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote: > > > > > So what state do we store? > > > > > > > > > > {repopath: {name: (hash, content)}}. For example: > > > > > > > > > > cache = {'/home/foo/repo1': {'index': ('hash', changelogindex), > > > > > 'bookmarks': ('hash', bookmarks), > > > > > }, > > > > > '/home/foo/repo2': { }, } > > > > > > > > > > The main ideas here are: > > > > > 1) Store the lowest level objects, like the C changelog index. > > > > >Because higher level objects could be changed by extensions in > > > > >unpredictable ways. (this is not true in my hacky prototype > > > > > though) > > > > > 2) Hash everything. For changelog, it's like the file stat of > > > > >changelog.i. There must be a strong guarantee that the hash > > > > > matches > > > > >the content, which could be challenging, but not impossible. > > > > > I'll > > > > >cover more details below. > > > > > > > > > > The cache is scoped by repo to make the API simpler/easy to use. It > > > > > may > > > > > be interesting to have some global state (like passing back the > > > > > extension > > > > > path to import them at runtime). > > > > > > > > Regarding this and "2) Side-effect-free repo", can or should we design > > > > the API > > > > as something like a low-level storage interface? That will allow us to > > > > not > > > > make repo/revlog know too much about chg. > > > > > > > > I don't have any concrete idea, but that would work as follows: > > > > > > > > 1. chg injects an object to select storage backends > > > > e.g. repo.storage = chgpreloadable(repo.storage, cache) > > > > 2. repo passes it to revlog, etc. > > > > 3. revlog uses it to read indexfile, where in-memory cache may be > > > > returned > > > > e.g. 
storage.parserevlog(indexfile) > > > > > > > > Perhaps, this 'storage' object is similar to the one you call > > > > 'baserepository'. > > > > > > I'm not sure if I get the idea (probably not). How does the implementation > > > in the master server look like? > > > > I was just thinking about how to hack the real repo object without > > introducing > > much mess. Perhaps the master server wouldn't be that different from your > > idea. > > > > > It feels more like "repo.chgcache" to me and the difference is that the > > > vanilla hg will be changed to access objects via it (so the interface > > > looks > > > more consistent). > > > > Yeah, it might be like repo.chgcache. > > > > Since we shouldn't pass repo to revlog (it's layering violation), I think > > we'll need a thin wrapper for chgcache anyway. > > I mentioned this in the second mail, "4) Where to get preloaded results (in > worker)", we could just expose some kind of global state, like a > "globalcache" module. Does it mean any low-level objects will directly access to the global cache? That seems ugly (but maybe I'm biased as I'm really allergic to global data.) > > > Things to consider: > > > > > > a) Objects being preloaded have dependency - ex. the obsstore depends on > > > changelog and other things. The preload functions run in a defined > > > order. > > > > Maybe dependencies could be passed as arguments? > > Ideally, these expensive calculating (ex. obsstore) could be moved to the > index object. In the reality, that requires too much work, and obsstore > preloading requires a subset of "repo", including "repo.revs". > > It's possible to decouple obsstore preloading from the repo object, but that > could be a lot of work too. My opinion for obsstore is that the most costly part would be populating 100k+ objects from file, and the other costly parts could be mitigated by some higher- level cache in repoview.py. But I think this topic was discussed thoroughly between you and pyd before. 
I don't intend to bring it up again.

> > > b) The index file is not always a single file, depending on "vfs".
> >
> > Yes. vfs could be owned by the storage/chgcache class.
> >
> > > c) The user may want to control what to preload. For example, if they
> > >    have an incompatible manifest, they could make the changelog
> > >    preloaded, but not the manifest.
> >
> > No idea about (c).
> >
> > > d) Users can add other preloading items easily, not only just the
> > >    predefined ones.
> >
> > So probably we'll need an extensible table of preloadable items.
>
> If you check my prototype code, it's using a registrar to collect all
> @preload functions.

Yes. I wanted to say we would need this kind of abstraction anyway.
Re: Making chg stateful
Excerpts from Yuya Nishihara's message of 2017-02-04 00:11:22 +0900: > On Thu, 2 Feb 2017 16:56:11 +, Jun Wu wrote: > > Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900: > > > On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote: > > > > So what state do we store? > > > > > > > > {repopath: {name: (hash, content)}}. For example: > > > > > > > > cache = {'/home/foo/repo1': {'index': ('hash', changelogindex), > > > > 'bookmarks': ('hash', bookmarks), > > > > }, > > > > '/home/foo/repo2': { }, } > > > > > > > > The main ideas here are: > > > > 1) Store the lowest level objects, like the C changelog index. > > > >Because higher level objects could be changed by extensions in > > > >unpredictable ways. (this is not true in my hacky prototype > > > > though) > > > > 2) Hash everything. For changelog, it's like the file stat of > > > >changelog.i. There must be a strong guarantee that the hash > > > > matches > > > >the content, which could be challenging, but not impossible. I'll > > > >cover more details below. > > > > > > > > The cache is scoped by repo to make the API simpler/easy to use. It > > > > may > > > > be interesting to have some global state (like passing back the > > > > extension > > > > path to import them at runtime). > > > > > > Regarding this and "2) Side-effect-free repo", can or should we design > > > the API > > > as something like a low-level storage interface? That will allow us to not > > > make repo/revlog know too much about chg. > > > > > > I don't have any concrete idea, but that would work as follows: > > > > > > 1. chg injects an object to select storage backends > > > e.g. repo.storage = chgpreloadable(repo.storage, cache) > > > 2. repo passes it to revlog, etc. > > > 3. revlog uses it to read indexfile, where in-memory cache may be > > > returned > > > e.g. storage.parserevlog(indexfile) > > > > > > Perhaps, this 'storage' object is similar to the one you call > > > 'baserepository'. 
> > > > I'm not sure if I get the idea (probably not). How does the implementation > > in the master server look like? > > I was just thinking about how to hack the real repo object without introducing > much mess. Perhaps the master server wouldn't be that different from your > idea. > > > It feels more like "repo.chgcache" to me and the difference is that the > > vanilla hg will be changed to access objects via it (so the interface looks > > more consistent). > > Yeah, it might be like repo.chgcache. > > Since we shouldn't pass repo to revlog (it's layering violation), I think > we'll need a thin wrapper for chgcache anyway. I mentioned this in the second mail, "4) Where to get preloaded results (in worker)", we could just expose some kind of global state, like a "globalcache" module. > > Things to consider: > > > > a) Objects being preloaded have dependency - ex. the obsstore depends on > > changelog and other things. The preload functions run in a defined > > order. > > Maybe dependencies could be passed as arguments? Ideally, these expensive calculating (ex. obsstore) could be moved to the index object. In the reality, that requires too much work, and obsstore preloading requires a subset of "repo", including "repo.revs". It's possible to decouple obsstore preloading from the repo object, but that could be a lot of work too. > > b) The index file is not always a single file, depending on "vfs". > > Yes. vfs could be owned by storage/chgcache class. > > > c) The user may want to control what to preload. For example, if they have > > an incompatible manifest, they could make changelog preloaded, but not > > manifest. > > No idea about (c). > > > d) Users can add other preloading items easily, not only just the > > predefined ones. > > So probably we'll need an extensible table of preloadable items. If you check my prototype code, it's using a registrar to collect all @preload functions. 
> > I think "storage.parserevlog(indexfile)" (treating the index separately
> > from a repo object) may have trouble dealing with "a)".
Re: Making chg stateful
Excerpts from Yuya Nishihara's message of 2017-02-04 00:31:45 +0900: > On Fri, 3 Feb 2017 09:33:32 +, Jun Wu wrote: > > 1) What does the worker talk to the master via IPC? > > > > Current: worker sends "possible_dirty_repo: $REPO_PATH" after runcommand > > > > It's coupled with the (pseudo) repo object. But the master could preload > > things in correct order - first changelog, then phase, obsstore, etc. > > > > What if: worker sends "possible_dirty_index: $INDEX_PATH", > > "possible_dirty_obsstore $OBSSTORE_PATH" etc. > > > > It looks more flexible at first sight. However the dependency issue > > could be tricky to solve - for example, the obsstore requires changelog > > (and revset). It's much harder to preload obsstore without a pseudo > > repo object with repo.changelog, repo.revs available. > > > > Therefore, probably just send the repo path. > > Given this, I think pipes would be simpler than using shared memory. The > master > IO can be multiplexed by select(). If it's one pipe per worker, the code change (I tried) is too much. I think a single non-blocking pipe could be an option. > (But I don't have strong opinion to disagree with using shm.) > > > 2) Preload function signature (used by master's background thread). > > > > This function is to hash and load things. It runs in the master, and is > > basically a generator (so the "loading" part could be skipped on cache > > hit, and it's easier to make content and hash_with_content consistent, if > > they share states): > > > > @preloadregistar(something) > > def preloadfunc(...): > > yield quick_hash > > yield content, hash_with_content > > Isn't it simpler if preloadfunc() returns quick_hash and a delayed function > to build content? I feel this generator is kinda abuse and makes it less clear > where the costly part starts. I have thought about this. "Generator" is used as a lightweight process that could be terminated cheaply. 
It shares states (local variables) for hashing and loading, making certain
things easier.

Compare:

  @preload('index')
  def preloadindex(repo):
      f = repo.svfs('changelog.i')  # "f" can be shared naturally
      hash = fstat(f).st_size
      yield hash

      indexdata = f.read()
      hash = len(indexdata)
      yield parseindex(indexdata), hash

vs.

  def loadindex(f):
      indexdata = f.read()
      hash = len(indexdata)
      return parseindex(indexdata), hash

  @preload('index')
  def preloadindex(repo):
      f = repo.svfs('changelog.i')
      hash = repo.svfs('changelog.i').st_size
      loadfunc = lambda: loadindex(f)  # pass "f" explicitly
      return hash, loadfunc

The generator seems to be the shortest way to express the logic. I think
it's easy to say that things after the first "yield" are costly.

For readability, I think the improved form mentioned at the end of the
previous mail, "yield preload.foo(...)", makes things obvious.

> > Problem: What's "..." ? It could be "repopath" as that's exactly what the
> > worker sends to us. But that will make the preload function more
> > complicated to be correct - repo has "requirements", etc. and the index
> > is not always simply path.join(repo_path, 'store', '00changelog.i').
> >
> > Similar to 1, if we decouple from the repo object, index preloading
> > would be fine:
> >
> >   @preload('index')
> >   def preloadindex(indexpath):
> >
> > But things with dependency will have trouble:
> >
> >   @preload('obsstore')
> >   def preloadobsstore(obsstorepath):
> >       # how to access repo.revs('') ?
> >
> > Therefore, the preload function needs a repo (even if it's "pseudo") to
> > be convenient to implement.
>
> For the "...", I agree repopath isn't enough.
Re: Making chg stateful
On Fri, 3 Feb 2017 09:33:32 +, Jun Wu wrote: > 1) What does the worker talk to the master via IPC? > > Current: worker sends "possible_dirty_repo: $REPO_PATH" after runcommand > > It's coupled with the (pseudo) repo object. But the master could preload > things in correct order - first changelog, then phase, obsstore, etc. > > What if: worker sends "possible_dirty_index: $INDEX_PATH", > "possible_dirty_obsstore $OBSSTORE_PATH" etc. > > It looks more flexible at first sight. However the dependency issue > could be tricky to solve - for example, the obsstore requires changelog > (and revset). It's much harder to preload obsstore without a pseudo > repo object with repo.changelog, repo.revs available. > > Therefore, probably just send the repo path. Given this, I think pipes would be simpler than using shared memory. The master IO can be multiplexed by select(). (But I don't have strong opinion to disagree with using shm.) > 2) Preload function signature (used by master's background thread). > > This function is to hash and load things. It runs in the master, and is > basically a generator (so the "loading" part could be skipped on cache > hit, and it's easier to make content and hash_with_content consistent, if > they share states): > > @preloadregistar(something) > def preloadfunc(...): > yield quick_hash > yield content, hash_with_content Isn't it simpler if preloadfunc() returns quick_hash and a delayed function to build content? I feel this generator is kinda abuse and makes it less clear where the costly part starts. > Problem: What's "..." ? It could be "repopath" as that's exactly what the > worker sends to us. But that will make the preload function more > complicated to be correct - repo has "requirements", etc and the index is > not always simply path.join(repo_path, 'store', '00changelog.i'). 
> Similar to 1, if we decouple from the repo object, index preloading would
> be fine:
>
>   @preload('index')
>   def preloadindex(indexpath):
>
> But things with dependency will have trouble:
>
>   @preload('obsstore')
>   def preloadobsstore(obsstorepath):
>       # how to access repo.revs('') ?
>
> Therefore, the preload function needs a repo (even if it's "pseudo") to be
> convenient to implement.

For the "...", I agree repopath isn't enough.
Re: Making chg stateful
On Thu, 2 Feb 2017 16:56:11 +, Jun Wu wrote: > Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900: > > On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote: > > > So what state do we store? > > > > > > {repopath: {name: (hash, content)}}. For example: > > > > > > cache = {'/home/foo/repo1': {'index': ('hash', changelogindex), > > > 'bookmarks': ('hash', bookmarks), > > > }, > > > '/home/foo/repo2': { }, } > > > > > > The main ideas here are: > > > 1) Store the lowest level objects, like the C changelog index. > > >Because higher level objects could be changed by extensions in > > >unpredictable ways. (this is not true in my hacky prototype though) > > > 2) Hash everything. For changelog, it's like the file stat of > > >changelog.i. There must be a strong guarantee that the hash matches > > >the content, which could be challenging, but not impossible. I'll > > >cover more details below. > > > > > > The cache is scoped by repo to make the API simpler/easy to use. It may > > > be interesting to have some global state (like passing back the > > > extension > > > path to import them at runtime). > > > > Regarding this and "2) Side-effect-free repo", can or should we design the > > API > > as something like a low-level storage interface? That will allow us to not > > make repo/revlog know too much about chg. > > > > I don't have any concrete idea, but that would work as follows: > > > > 1. chg injects an object to select storage backends > > e.g. repo.storage = chgpreloadable(repo.storage, cache) > > 2. repo passes it to revlog, etc. > > 3. revlog uses it to read indexfile, where in-memory cache may be returned > > e.g. storage.parserevlog(indexfile) > > > > Perhaps, this 'storage' object is similar to the one you call > > 'baserepository'. > > I'm not sure if I get the idea (probably not). How does the implementation > in the master server look like? I was just thinking about how to hack the real repo object without introducing much mess. 
Perhaps the master server wouldn't be that different from your idea.

> It feels more like "repo.chgcache" to me and the difference is that the
> vanilla hg will be changed to access objects via it (so the interface
> looks more consistent).

Yeah, it might be like repo.chgcache.

Since we shouldn't pass repo to revlog (it's a layering violation), I think
we'll need a thin wrapper for chgcache anyway.

> Things to consider:
>
> a) Objects being preloaded have dependencies - ex. the obsstore depends on
>    the changelog and other things. The preload functions run in a defined
>    order.

Maybe dependencies could be passed as arguments?

> b) The index file is not always a single file, depending on "vfs".

Yes. vfs could be owned by the storage/chgcache class.

> c) The user may want to control what to preload. For example, if they have
>    an incompatible manifest, they could make the changelog preloaded, but
>    not the manifest.

No idea about (c).

> d) Users can add other preloading items easily, not only just the
>    predefined ones.

So probably we'll need an extensible table of preloadable items.

> I think "storage.parserevlog(indexfile)" (treating the index separately
> from a repo object) may have trouble dealing with "a)".
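The "extensible table of preloadable items" plus the dependency ordering
from a) can be sketched with a registrar-style decorator and a topological
sort. This is illustrative only: the `preload` decorator, `depend` keyword,
and `preloadorder` helper are invented here, and Mercurial's real registrar
module is organized differently.

```python
preloadtable = {}  # name -> (func, depends)

def preload(name, depend=()):
    """Register a preload function under `name`, with optional deps."""
    def register(func):
        preloadtable[name] = (func, tuple(depend))
        return func
    return register

def preloadorder():
    """Topologically sort registered items so dependencies load first."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in preloadtable[name][1]:
            visit(dep)
        order.append(name)
    for name in preloadtable:
        visit(name)
    return order

@preload('changelog')
def loadchangelog():
    return 'index'

@preload('obsstore', depend=['changelog'])
def loadobsstore():
    return 'markers'

# the changelog always comes before anything depending on it
order = preloadorder()
assert order.index('changelog') < order.index('obsstore')
```

A real table would also need cycle detection and per-repo scoping, but the
decorator plus ordering pass is the core of the abstraction.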
Re: Making chg stateful
To be clear, the whole preloading framework needs design decisions made in
different places. I'll go through them one by one:

1) What does the worker say to the master via IPC?

   Current: the worker sends "possible_dirty_repo: $REPO_PATH" after
   runcommand.

   It's coupled with the (pseudo) repo object. But the master could preload
   things in the correct order - first changelog, then phases, obsstore,
   etc.

   What if: the worker sends "possible_dirty_index: $INDEX_PATH",
   "possible_dirty_obsstore: $OBSSTORE_PATH", etc.?

   It looks more flexible at first sight. However, the dependency issue
   could be tricky to solve - for example, the obsstore requires the
   changelog (and revsets). It's much harder to preload the obsstore
   without a pseudo repo object with repo.changelog and repo.revs
   available.

   Therefore, probably just send the repo path.

2) Preload function signature (used by the master's background thread).

   This function is to hash and load things. It runs in the master, and is
   basically a generator (so the "loading" part could be skipped on a cache
   hit, and it's easier to keep content and hash_with_content consistent if
   they share states):

     @preloadregistrar(something)
     def preloadfunc(...):
         yield quick_hash
         yield content, hash_with_content

   Problem: What's "..." ? It could be "repopath" as that's exactly what
   the worker sends to us. But that will make the preload function more
   complicated to be correct - repo has "requirements", etc. and the index
   is not always simply path.join(repo_path, 'store', '00changelog.i').

   Similar to 1, if we decouple from the repo object, index preloading
   would be fine:

     @preload('index')
     def preloadindex(indexpath):

   But things with dependencies will have trouble:

     @preload('obsstore')
     def preloadobsstore(obsstorepath):
         # how to access repo.revs('') ?

   Therefore, the preload function needs a repo (even if it's "pseudo") to
   be convenient to implement. What to yield may be changed, see "4)".

3) Where to store preloaded results (in the master)?

   The storage can only be a global state.
Currently it's a two-level dict: repo_path is the 1st-level key, while "name"s (ex. "changelog", "obsstore") are the 2nd level. I think the global state could be either public or private.

4) Where to get preloaded results (in the worker)?

repo.chgcache (or repo.whatever) mentioned previously is just a shortcut to "globalstate[repo.path]". In the real world, globalstate[repo.path]['index'], globalstate[repo.path]['obsstore'], etc. will be accessed. Note: when they are accessed, the framework should know how to calculate the latest hash and detect cache invalidation.

If we want to decouple from repo, we may want to swap the two keys of "globalstate", ex. globalstate['index'][indexpath]. That requires the function to yield the indexpath out so the framework knows where to check:

  @preload('index')
  def preloadindex(repo):
      yield indexpath
      yield hash
      yield content, hash

This is helpful if something is shared, like a shared repo. Otherwise, it may make the code a bit longer than necessary. So maybe we make what to "yield" typed:

  @preload('index')
  def preloadindex(repo):
      yield preload.key(indexpath)  # optional, default to repo.root
      yield preload.checkhash(hash)
      yield preload.update(content, hash)

This also looks more flexible - we may extend the preloading framework by adding more content types.

Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900:
> On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote:
> > Perf Numbers
> >
> > I wrote a hacky prototype [2] that shows significant improvements on
> > various commands in our repo:
> >
> >                              Before  After (in seconds)
> >   chg bookmark               0.40    0.08
> >   chg log -r . -T '{node}'   0.50    0.08
> >   chg sl                     0.71    0.27
> >   chg id                     0.71    0.37
> >
> > And hg-committed (with 12M obsstore) could benefit from it too mainly
> > because the obsstore becomes preloaded:
> >
> >                              Before  After
> >   chg bookmark               0.56    0.06
> >   chg log -r . -T '{node}'   0.56    0.07
> >   chg log -r . -p            0.86    0.08
> >   chg sl                     0.84    0.21
> >   chg id                     0.54    0.06
> >
> > So I think it's nice to get a formal implementation upstreamed. It's also
> > easier to solve the perf issues individually like building the hidden
> > bitmap, building a serialization format for the radix tree, etc.
> > Although we can also build them later.
>
> Nice. I want this perf number today. :)
>
> > So what state do we store?
> >
> > {repopath: {name: (hash, content)}}. For example:
> >
> >   cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
> >                                'bookmarks': ('hash', bookmarks),
> >                               },
> >            '/home/foo/repo2': { }, }
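The "typed yield" protocol discussed in this message could be driven by a small dispatcher in the master. A rough sketch under assumed names - the key/checkhash/update wrappers are modeled as namedtuples, and none of this is actual Mercurial or chg API:

```python
# Hypothetical driver for a typed-yield preload generator: each yielded
# value is wrapped so the framework knows whether it is a cache key, a
# hash to verify, or the final loaded content.
import collections

key = collections.namedtuple('key', 'path')
checkhash = collections.namedtuple('checkhash', 'hash')
update = collections.namedtuple('update', 'content hash')

def runpreload(gen, cache, defaultkey):
    """Drive a preload generator against a {key: (hash, content)} cache."""
    cachekey = defaultkey
    for step in gen:
        if isinstance(step, key):
            cachekey = step.path       # optional, overrides the default
        elif isinstance(step, checkhash):
            old = cache.get(cachekey)
            if old is not None and old[0] == step.hash:
                gen.close()            # cache hit: skip the slow load
                return old[1]
        elif isinstance(step, update):
            cache[cachekey] = (step.hash, step.content)
            return step.content
```

One attraction of the typed form is visible here: new step types can be added to the dispatcher without changing existing preload functions.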
Re: Making chg stateful
Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900:
> On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote:
> > Perf Numbers
> >
> > I wrote a hacky prototype [2] that shows significant improvements on
> > various commands in our repo:
> >
> >                              Before  After (in seconds)
> >   chg bookmark               0.40    0.08
> >   chg log -r . -T '{node}'   0.50    0.08
> >   chg sl                     0.71    0.27
> >   chg id                     0.71    0.37
> >
> > And hg-committed (with 12M obsstore) could benefit from it too mainly
> > because the obsstore becomes preloaded:
> >
> >                              Before  After
> >   chg bookmark               0.56    0.06
> >   chg log -r . -T '{node}'   0.56    0.07
> >   chg log -r . -p            0.86    0.08
> >   chg sl                     0.84    0.21
> >   chg id                     0.54    0.06
> >
> > So I think it's nice to get a formal implementation upstreamed. It's also
> > easier to solve the perf issues individually like building the hidden
> > bitmap, building a serialization format for the radix tree, etc.
> > Although we can also build them later.
>
> Nice. I want this perf number today. :)
>
> > So what state do we store?
> >
> > {repopath: {name: (hash, content)}}. For example:
> >
> >   cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
> >                                'bookmarks': ('hash', bookmarks),
> >                               },
> >            '/home/foo/repo2': { }, }
> >
> > The main ideas here are:
> > 1) Store the lowest level objects, like the C changelog index.
> >    Because higher level objects could be changed by extensions in
> >    unpredictable ways. (this is not true in my hacky prototype though)
> > 2) Hash everything. For changelog, it's like the file stat of
> >    changelog.i. There must be a strong guarantee that the hash matches
> >    the content, which could be challenging, but not impossible. I'll
> >    cover more details below.
> >
> > The cache is scoped by repo to make the API simpler/easy to use. It may
> > be interesting to have some global state (like passing back the extension
> > path to import them at runtime).
> Regarding this and "2) Side-effect-free repo", can or should we design the API
> as something like a low-level storage interface? That will allow us to not
> make repo/revlog know too much about chg.
>
> I don't have any concrete idea, but that would work as follows:
>
> 1. chg injects an object to select storage backends,
>    e.g. repo.storage = chgpreloadable(repo.storage, cache)
> 2. repo passes it to revlog, etc.
> 3. revlog uses it to read indexfile, where an in-memory cache may be
>    returned, e.g. storage.parserevlog(indexfile)
>
> Perhaps, this 'storage' object is similar to the one you call
> 'baserepository'.

I'm not sure I get the idea (probably not). What does the implementation in the master server look like? It feels more like "repo.chgcache" to me, and the difference is that vanilla hg would be changed to access objects via it (so the interface looks more consistent).

Things to consider:

a) Objects being preloaded have dependencies - ex. the obsstore depends on
   the changelog and other things. The preload functions run in a defined
   order.
b) The index file is not always a single file, depending on "vfs".
c) The user may want to control what to preload. For example, if they have
   an incompatible manifest, they could have the changelog preloaded, but
   not the manifest.
d) Users can add other preloading items easily, not only the predefined
   ones.

I think "storage.parserevlog(indexfile)" (treating the index separately, without a repo object) may have trouble dealing with "a)".

> > Then, repo.chgcache['index'] becomes available in worker processes. When
> > initializing the changelog index, try to use the chg cache:
> >
> >   # inside changelog's revlog.__init__:
> >   # note: repo.chgcache is empty for non-chg cases, a fallback is needed
> >   self.index = repo.chgcache.get('index') or _loadindex(...)
> > The API is the simplest that I can think of, while being also reasonably
> > flexible (for example, we can add additional steps like forcing building
> > the radix tree etc). But I'm open to suggestions.
> >
> > Some implementation details
> >
> > (This is the part that I want feedback the most)
> >
> > 1) IPC between chg master and forked worker
> >
> > This is how the worker tells the master about what (ex. repo paths) to
> > preload.
> >
> > I think it does not need to be 100% reliable. So I use shared memory in
> > the hacky prototype [2]. Pipes are reliable and may notify the master
> > quicker, while having risks of blocking. shm seems much easier to
> > implement and I think the latency of
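The ordering concern in a) above - preload functions running in a defined order - could be handled by a registry that topologically sorts functions by their declared dependencies. A sketch under assumed names (the decorator and its depend= argument are illustrations from this thread, not actual chg code):

```python
# Hypothetical preload registry: functions declare dependencies, and the
# master's background thread runs them in topological order, so e.g.
# 'obsstore' is only preloaded after 'changelog'.
_registry = {}   # name -> (func, dependencies)

def preload(name, depend=()):
    """Decorator registering a preload function under 'name'."""
    def register(func):
        _registry[name] = (func, tuple(depend))
        return func
    return register

def preloadorder():
    """Return registered names, dependencies before dependents."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in _registry[name][1]:
            visit(dep)       # depth-first: emit dependencies first
        order.append(name)
    for name in _registry:
        visit(name)
    return order
```

This also gives point c) a natural hook: a user-facing config could simply filter which registered names get visited.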
Re: Making chg stateful
Excerpts from Simon Farnsworth's message of 2017-02-02 15:25:45 +:
> On 02/02/2017 09:34, Jun Wu wrote:
> >
> > What's the API?
> >
> > (This is an existing implementation detail. I'm open to any ideas)
> >
> > For example, let's say we want to preload the changelog index (code
> > simplified so it does not take care of all corner cases).
> >
> > First, tell chg how to hash and load something:
> >
> >   from mercurial.chgserver import repopreload
> >
> >   @repopreload('index')
> >   def foopreloader(repo):
> >       # use the size of changelog.i as the hash. note: the hash function
> >       # must be very fast.
> >       hash = repo.svfs.stat('00changelog.i').st_size
> >       # tell chg about the current hash. if the hash matches, the
> >       # generator function stops here.
> >       yield hash
> >
> >       # if the hash mismatches, load the changelog in a slower way.
> >       with repo.svfs('00changelog.i') as f:
> >           data = f.read()
> >           hash = len(data)
> >           index = _buildindex(data)
> >           index.partialmatch('')  # force build the radix tree
> >           # tell chg about the loading result and the hash that
> >           # absolutely matches the result.
> >           yield index, hash
> >
> > Then, repo.chgcache['index'] becomes available in worker processes. When
> > initializing the changelog index, try to use the chg cache:
> >
> >   # inside changelog's revlog.__init__:
> >   # note: repo.chgcache is empty for non-chg cases, a fallback is needed
> >   self.index = repo.chgcache.get('index') or _loadindex(...)
>
> Can we ensure that chgcache.get(key) always gets an answer (i.e. no
> fallback needed)? My concern is divergence between the fallback and
> chgcache versions of the data fetch.

It's diverged intentionally, mainly because:

1. The chg version could have more internal states to preload. Take the
   changelog index for example: chg would like to force pre-populating the
   partialmatch radix tree, and other indexes we will have in the future.
2. If chg is not used (ex. Windows), it's not necessary to calculate the
   hash, although that's cheap.
I think the chg preload function could call the vanilla load function, plus some extra work. That will ensure consistency.

> It seems like it should be trivial to implement, too - repo.chgcache for
> the non-chg case would do:
>
>   g = preloadfunctable[name](self._prepo)
>   hash = next(g)
>   content, hash = next(g)
>   return g
>
> knowing that computing hash is cheap.
>
> This would mean that you'd just do:
>
>   self.index = repo.chgcache.get('index')
>
> and if chg can speed you up, it does, while if it's not in use, you get
> the old slow path. It also means that chg and hg will give the same
> results, and that there's pressure to use a cheap hash for invalidations
> (because the repo is effectively always hashed).
>
> > The API is the simplest that I can think of, while being also reasonably
> > flexible (for example, we can add additional steps like forcing building
> > the radix tree etc). But I'm open to suggestions.
> >
> > Some implementation details
> >
> > (This is the part that I want feedback the most)
> >
> > 1) IPC between chg master and forked worker
> >
> > This is how the worker tells the master about what (ex. repo paths) to
> > preload.
> >
> > I think it does not need to be 100% reliable. So I use shared memory in
> > the hacky prototype [2]. Pipes are reliable and may notify the master
> > quicker, while having risks of blocking. shm seems much easier to
> > implement and I think the latency of the master getting the information
> > is not a big deal. I prefer shm, but other ideas are welcomed.
>
> Did you try O_NONBLOCK pipes, out of interest?

That could be an option. The ugly part is that "write"ing more than PIPE_BUF bytes could succeed with only partial content written. So it requires special logic to deal with message boundaries, maybe by adding '\0's at both ends.

> Having said that, given that the worker is forked from the master, shm
> seems perfectly reasonable to me - the worker and master will, by
> definition, be running the same code.
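For the pipe option, the boundary problem above disappears if every message is kept under PIPE_BUF, since POSIX guarantees a write of at most PIPE_BUF bytes to a pipe is atomic. A hypothetical framing sketch, not the chg implementation (Unix-only; names invented):

```python
# Sketch: length-prefixed messages on an O_NONBLOCK pipe. Each message is
# kept under PIPE_BUF (POSIX guarantees at least 512 bytes) so a single
# write() is atomic and no boundary-repair logic is needed.
import fcntl
import os
import struct

PIPE_BUF = 512  # conservative POSIX minimum

def makechannel():
    """Create a pipe with a non-blocking read end for the master."""
    r, w = os.pipe()
    fcntl.fcntl(r, fcntl.F_SETFL, os.O_NONBLOCK)
    return r, w

def send(w, path):
    """Worker side: post a repo path in a single atomic write."""
    data = path.encode('utf-8')
    msg = struct.pack('H', len(data)) + data
    assert len(msg) <= PIPE_BUF  # atomicity guaranteed below PIPE_BUF
    os.write(w, msg)

def recv(r):
    """Master side: poll for one message; None if nothing is pending."""
    try:
        n, = struct.unpack('H', os.read(r, 2))
        return os.read(r, n).decode('utf-8')
    except OSError:              # EAGAIN: pipe currently empty
        return None
```

Repo paths comfortably fit under 512 bytes, so this avoids the partial-write corner entirely rather than repairing it with sentinel bytes.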
> However, rather than coding it ourselves, I'd be inclined to use
> multiprocessing (https://docs.python.org/2/library/multiprocessing.html)
> to provide the shared memory primitives you need.

I actually checked multiprocessing before. It requires you to use its special fork() implementation instead of just "os.fork()" so that its fork hooks get executed. It has a complex abstraction around fork hooks, raw pointers, serialization, etc., and does not actually support Windows well. Given the unnecessary complexity and overhead of multiprocessing, I think a 50-line implementation is better.

> > 2) Side-effect-free repo
> >
> > This is the most "painful" part for the preloading framework
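The small hand-rolled alternative alluded to above could be built on an anonymous shared mmap created before fork(), which child processes inherit. A simplified sketch of that idea only - no locking or error handling, and every name here is made up:

```python
# Sketch: a fixed-size anonymous shared mmap region. Created by the
# master before fork(), it is visible to forked workers, which can leave
# a repo path in it for the master's background thread to pick up.
import mmap
import struct

REGIONSIZE = 4096

def makeregion():
    # fd=-1: anonymous mapping; mmap defaults to MAP_SHARED on Unix,
    # so the region is shared with children forked afterwards.
    return mmap.mmap(-1, REGIONSIZE)

def postpath(region, path):
    """Worker side: publish a repo path (length-prefixed)."""
    data = path.encode('utf-8')
    region.seek(0)
    region.write(struct.pack('I', len(data)) + data)

def readpath(region):
    """Master side: read the last published path, or None if empty."""
    region.seek(0)
    n, = struct.unpack('I', region.read(4))
    if not n:
        return None
    return region.read(n).decode('utf-8')
```

A real version would still need a mutex or an atomic generation counter for concurrent workers, but the core stays far smaller than what multiprocessing drags in.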
Re: Making chg stateful
On Thu, 2 Feb 2017 09:34:47 +, Jun Wu wrote:
> Perf Numbers
>
> I wrote a hacky prototype [2] that shows significant improvements on
> various commands in our repo:
>
>                              Before  After (in seconds)
>   chg bookmark               0.40    0.08
>   chg log -r . -T '{node}'   0.50    0.08
>   chg sl                     0.71    0.27
>   chg id                     0.71    0.37
>
> And hg-committed (with 12M obsstore) could benefit from it too mainly
> because the obsstore becomes preloaded:
>
>                              Before  After
>   chg bookmark               0.56    0.06
>   chg log -r . -T '{node}'   0.56    0.07
>   chg log -r . -p            0.86    0.08
>   chg sl                     0.84    0.21
>   chg id                     0.54    0.06
>
> So I think it's nice to get a formal implementation upstreamed. It's also
> easier to solve the perf issues individually like building the hidden
> bitmap, building a serialization format for the radix tree, etc.
> Although we can also build them later.

Nice. I want this perf number today. :)

> So what state do we store?
>
> {repopath: {name: (hash, content)}}. For example:
>
>   cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
>                                'bookmarks': ('hash', bookmarks),
>                               },
>            '/home/foo/repo2': { }, }
>
> The main ideas here are:
> 1) Store the lowest level objects, like the C changelog index.
>    Because higher level objects could be changed by extensions in
>    unpredictable ways. (this is not true in my hacky prototype though)
> 2) Hash everything. For changelog, it's like the file stat of
>    changelog.i. There must be a strong guarantee that the hash matches
>    the content, which could be challenging, but not impossible. I'll
>    cover more details below.
>
> The cache is scoped by repo to make the API simpler/easy to use. It may
> be interesting to have some global state (like passing back the extension
> path to import them at runtime).

Regarding this and "2) Side-effect-free repo", can or should we design the API as something like a low-level storage interface? That will allow us to not make repo/revlog know too much about chg.

I don't have any concrete idea, but that would work as follows:

1. chg injects an object to select storage backends,
   e.g. repo.storage = chgpreloadable(repo.storage, cache)
2. repo passes it to revlog, etc.
3. revlog uses it to read indexfile, where an in-memory cache may be
   returned, e.g. storage.parserevlog(indexfile)

Perhaps, this 'storage' object is similar to the one you call 'baserepository'.

> Then, repo.chgcache['index'] becomes available in worker processes. When
> initializing the changelog index, try to use the chg cache:
>
>   # inside changelog's revlog.__init__:
>   # note: repo.chgcache is empty for non-chg cases, a fallback is needed
>   self.index = repo.chgcache.get('index') or _loadindex(...)
>
> The API is the simplest that I can think of, while being also reasonably
> flexible (for example, we can add additional steps like forcing building
> the radix tree etc). But I'm open to suggestions.
>
> Some implementation details
>
> (This is the part that I want feedback the most)
>
> 1) IPC between chg master and forked worker
>
> This is how the worker tells the master about what (ex. repo paths) to
> preload.
>
> I think it does not need to be 100% reliable. So I use shared memory in
> the hacky prototype [2]. Pipes are reliable and may notify the master
> quicker, while having risks of blocking. shm seems much easier to
> implement and I think the latency of the master getting the information
> is not a big deal. I prefer shm, but other ideas are welcomed.

I like using pipes and it won't be complex so long as an IPC message is short enough to write atomically (< ~4k). But pipe-vs-shm is merely an implementation detail, so we can switch them later if needed.

> 2) Side-effect-free repo
>
> This is the most "painful" part for the preloading framework to be
> confident.
>
> The preload function has a signature that takes a repo object. But chg
> could not provide a real repo object. So it's best-effort.
>
> On the other hand, most preload functions only need to hash file stats,
> i.e. just use repo.vfs, repo.svfs etc. They do not need a full-featured
> repo object.
>
> Therefore I think the choices are:
>
> a) Provide a repo object that is: localrepository - side effects
>    (maximum compatibility)
> b) Build a fresh new repo object that has only the minimal part, ex.
>    vfs, svfs etc. (less compatible)
> c) Do not provide a repo object. Just provide "repo path".
>    (move the burden to the preload function writers)
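Choice b) could look roughly like the following: a stripped-down, side-effect-free repo exposing only root/vfs/svfs, which is all a stat-based hash function needs. This is only an illustration of the idea, not Mercurial's actual classes:

```python
# Sketch of choice b): a minimal pseudo repo for preload functions that
# only hash file stats and read store files. 'pseudovfs'/'pseudorepo'
# are invented names, not real Mercurial API.
import os

class pseudovfs(object):
    """Bare-bones vfs: stat and open files under a base directory."""
    def __init__(self, base):
        self.base = base

    def stat(self, name):
        return os.stat(os.path.join(self.base, name))

    def __call__(self, name, mode='rb'):
        return open(os.path.join(self.base, name), mode)

class pseudorepo(object):
    """Side-effect-free repo: constructing it touches no global state."""
    def __init__(self, root):
        self.root = root
        self.vfs = pseudovfs(os.path.join(root, '.hg'))
        self.svfs = pseudovfs(os.path.join(root, '.hg', 'store'))
```

Anything needing repo.changelog or repo.revs (the obsstore case) would still be missing here, which is why the thread leans toward carving a richer "baserepository" out of localrepository instead.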
Re: Making chg stateful
On 02/02/2017 09:34, Jun Wu wrote:
> What's the API?
>
> (This is an existing implementation detail. I'm open to any ideas)
>
> For example, let's say we want to preload the changelog index (code
> simplified so it does not take care of all corner cases).
>
> First, tell chg how to hash and load something:
>
>   from mercurial.chgserver import repopreload
>
>   @repopreload('index')
>   def foopreloader(repo):
>       # use the size of changelog.i as the hash. note: the hash function
>       # must be very fast.
>       hash = repo.svfs.stat('00changelog.i').st_size
>       # tell chg about the current hash. if the hash matches, the
>       # generator function stops here.
>       yield hash
>
>       # if the hash mismatches, load the changelog in a slower way.
>       with repo.svfs('00changelog.i') as f:
>           data = f.read()
>           hash = len(data)
>           index = _buildindex(data)
>           index.partialmatch('')  # force build the radix tree
>           # tell chg about the loading result and the hash that
>           # absolutely matches the result.
>           yield index, hash
>
> Then, repo.chgcache['index'] becomes available in worker processes. When
> initializing the changelog index, try to use the chg cache:
>
>   # inside changelog's revlog.__init__:
>   # note: repo.chgcache is empty for non-chg cases, a fallback is needed
>   self.index = repo.chgcache.get('index') or _loadindex(...)

Can we ensure that chgcache.get(key) always gets an answer (i.e. no fallback needed)? My concern is divergence between the fallback and chgcache versions of the data fetch.

It seems like it should be trivial to implement, too - repo.chgcache for the non-chg case would do:

  g = preloadfunctable[name](self._prepo)
  hash = next(g)
  content, hash = next(g)
  return g

knowing that computing hash is cheap.

This would mean that you'd just do:

  self.index = repo.chgcache.get('index')

and if chg can speed you up, it does, while if it's not in use, you get the old slow path. It also means that chg and hg will give the same results, and that there's pressure to use a cheap hash for invalidations (because the repo is effectively always hashed).

> The API is the simplest that I can think of, while being also reasonably
> flexible (for example, we can add additional steps like forcing building
> the radix tree etc). But I'm open to suggestions.
>
> Some implementation details
>
> (This is the part that I want feedback the most)
>
> 1) IPC between chg master and forked worker
>
> This is how the worker tells the master about what (ex. repo paths) to
> preload.
>
> I think it does not need to be 100% reliable. So I use shared memory in
> the hacky prototype [2]. Pipes are reliable and may notify the master
> quicker, while having risks of blocking. shm seems much easier to
> implement and I think the latency of the master getting the information
> is not a big deal. I prefer shm, but other ideas are welcomed.

Did you try O_NONBLOCK pipes, out of interest?

Having said that, given that the worker is forked from the master, shm seems perfectly reasonable to me - the worker and master will, by definition, be running the same code.

However, rather than coding it ourselves, I'd be inclined to use multiprocessing (https://docs.python.org/2/library/multiprocessing.html) to provide the shared memory primitives you need.

> 2) Side-effect-free repo
>
> This is the most "painful" part for the preloading framework to be
> confident.
>
> The preload function has a signature that takes a repo object. But chg
> could not provide a real repo object. So it's best-effort.
>
> On the other hand, most preload functions only need to hash file stats,
> i.e. just use repo.vfs, repo.svfs etc. They do not need a full-featured
> repo object.
>
> Therefore I think the choices are:
>
> a) Provide a repo object that is: localrepository - side effects
>    (maximum compatibility)
> b) Build a fresh new repo object that has only the minimal part, ex.
>    vfs, svfs etc. (less compatible)
> c) Do not provide a repo object. Just provide "repo path".
>    (move the burden to the preload function writers)
>
> I currently prefer a).
>
> The plan is to move part of "localrepository" to a side-effect free
> "baserepository" and use "baserepository" in the background preloading
> thread.

I like this plan.

> 3) Side effect of extensions
>
> The new chg framework will not run uisetup()s. But the preloading
> framework does sometimes depend on some side effects of extensions'
> uisetup()s. For example, the *manifest extension could greatly change
> the manifest structure, so the default manifest preloading won't work
> as expected.
>
> I think there are different ways to address this:
>
> a) Add a top-level "chgsetup()" which only gets called by chg, and is
>    meant to be side-effect free.
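The "trivial to implement" non-chg fallback suggested earlier in this message could be made concrete roughly as follows; preloadfunctable and the two-step generator protocol are assumptions carried over from this thread, not actual Mercurial API:

```python
# Sketch of the non-chg repo.chgcache: drive the same preload generator
# synchronously, so hg and chg share one code path for loading. Without
# a master process there is no cache, so the slow path always runs.
preloadfunctable = {}  # name -> generator function taking a repo

class synccache(object):
    """Stand-in for repo.chgcache when no chg master is present."""
    def __init__(self, repo):
        self._repo = repo

    def get(self, name):
        g = preloadfunctable[name](self._repo)
        next(g)                    # cheap hash; unused without a cache
        content, newhash = next(g) # always take the slow load path
        return content
```

With this in place, `self.index = repo.chgcache.get('index')` works identically with and without chg, addressing the divergence concern at the cost of always computing the (cheap) hash.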
Re: Making chg stateful
Fixes a mysterious typo that affects reading.

Excerpts from Jun Wu's message of 2017-02-02 09:34:47 +:
> This is mainly to discuss more details about chg repo preloading idea [1].
>
> Perf Numbers
>
> I wrote a hacky prototype [2] that shows significant improvements on
> various commands in our repo:
>
>                              Before  After (in seconds)
>   chg bookmark               0.40    0.08
>   chg log -r . -T '{node}'   0.50    0.08
>   chg sl                     0.71    0.27
>   chg id                     0.71    0.37
>
> And hg-committed (with 12M obsstore) could benefit from it too mainly
> because the obsstore becomes preloaded:
>
>                              Before  After
>   chg bookmark               0.56    0.06
>   chg log -r . -T '{node}'   0.56    0.07
>   chg log -r . -p            0.86    0.08
>   chg sl                     0.84    0.21
>   chg id                     0.54    0.06
>
> So I think it's nice to get a formal implementation upstreamed. It's also
> easier to solve the perf issues individually like building the hidden

"to solve" should be "than solving"

> bitmap, building a serialization format for the radix tree, etc.
> Although we can also build them later.

___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Making chg stateful
This is mainly to discuss more details about chg repo preloading idea [1].

Perf Numbers

I wrote a hacky prototype [2] that shows significant improvements on various commands in our repo:

                             Before  After (in seconds)
  chg bookmark               0.40    0.08
  chg log -r . -T '{node}'   0.50    0.08
  chg sl                     0.71    0.27
  chg id                     0.71    0.37

And hg-committed (with 12M obsstore) could benefit from it too mainly because the obsstore becomes preloaded:

                             Before  After
  chg bookmark               0.56    0.06
  chg log -r . -T '{node}'   0.56    0.07
  chg log -r . -p            0.86    0.08
  chg sl                     0.84    0.21
  chg id                     0.54    0.06

So I think it's nice to get a formal implementation upstreamed. It's also easier to solve the perf issues individually like building the hidden bitmap, building a serialization format for the radix tree, etc. Although we can also build them later.

Stateful?

"Stateful" refers to the chg master server process. Currently the chg master process is stateless: after forking, the client talks to the forked worker directly without affecting the master, and the worker does not talk back to the master either:

  client            master         worker
  | connect() ->    | accept()
  :                 | fork() ->    |
  | send() -------------------->  | recv()
  # client and worker no longer talk to master

Therefore the master is currently stateless - it can only preload extensions, but nothing about the repo state. The general direction is to make the worker tell the master what needs to be preloaded (like repo paths), and have the master run a background thread preloading them. Then, at fork(), the new worker gets the cache for free.

How about just preloading the repo object?

There was a failed experiment: [3]. The repo object depends on too many side effects and gets invalidated too easily. chg will not run uisetup()s, so it'll become harder to get a proper repo object.

So what state do we store?

{repopath: {name: (hash, content)}}.
For example:

  cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
                               'bookmarks': ('hash', bookmarks),
                              },
           '/home/foo/repo2': { }, }

The main ideas here are:
1) Store the lowest level objects, like the C changelog index.
   Because higher level objects could be changed by extensions in
   unpredictable ways. (this is not true in my hacky prototype though)
2) Hash everything. For changelog, it's like the file stat of
   changelog.i. There must be a strong guarantee that the hash matches
   the content, which could be challenging, but not impossible. I'll
   cover more details below.

The cache is scoped by repo to make the API simpler/easier to use. It may be interesting to have some global state (like passing back the extension path to import them at runtime).

What's the API?

(This is an existing implementation detail. I'm open to any ideas)

For example, let's say we want to preload the changelog index (code simplified so it does not take care of all corner cases).

First, tell chg how to hash and load something:

  from mercurial.chgserver import repopreload

  @repopreload('index')
  def foopreloader(repo):
      # use the size of changelog.i as the hash. note: the hash function
      # must be very fast.
      hash = repo.svfs.stat('00changelog.i').st_size
      # tell chg about the current hash. if the hash matches, the
      # generator function stops here.
      yield hash

      # if the hash mismatches, load the changelog in a slower way.
      with repo.svfs('00changelog.i') as f:
          data = f.read()
          hash = len(data)
          index = _buildindex(data)
          index.partialmatch('')  # force build the radix tree
          # tell chg about the loading result and the hash that
          # absolutely matches the result.
          yield index, hash

Then, repo.chgcache['index'] becomes available in worker processes. When initializing the changelog index, try to use the chg cache:

  # inside changelog's revlog.__init__:
  # note: repo.chgcache is empty for non-chg cases, a fallback is needed
  self.index = repo.chgcache.get('index') or _loadindex(...)
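The fallback line above implies that the worker-side cache must validate its stored hash before answering, so a stale entry falls through to the normal load path. A hypothetical sketch of that validation (all names invented, not the real chgserver code):

```python
# Sketch of the worker-side repo.chgcache: return a preloaded object only
# if its stored hash still matches the current on-disk state; otherwise
# answer None so the caller falls back to the slow load path.
class chgcache(object):
    def __init__(self, store, hashfuncs):
        self._store = store          # {name: (hash, content)} from master
        self._hashfuncs = hashfuncs  # {name: callable -> current hash}

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None              # nothing preloaded
        oldhash, content = entry
        if self._hashfuncs[name]() != oldhash:
            return None              # stale: caller runs _loadindex(...)
        return content
```

This keeps the strong "hash matches content" guarantee local to the cache: the caller never sees a preloaded object whose cheap hash disagrees with the repo on disk.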
The API is the simplest that I can think of, while being also reasonably flexible (for example, we can add additional steps like forcing building the radix tree etc). But I'm open to suggestions.

Some implementation details

(This is the part that I want feedback the most)

1) IPC between chg master and forked worker

This is how the worker tells the master about what (ex. repo paths) to preload.

I think it does not need to be 100% reliable. So I use shared memory in the hacky prototype [2].