Re: [PATCH rfc] manifest: write a more efficient version of lazymanifest, in pure python
On Fri, 2016-08-26 at 16:57 -0700, Sean Farley wrote: > Matt Mackallwrites: > > > > > On Wed, 2016-08-24 at 23:55 -0400, Augie Fackler wrote: > > > > > > > > > > > > > > > On Aug 20, 2016, at 5:02 PM, Maciej Fijalkowski > > > > wrote: > > > > > > > > # HG changeset patch > > > > # User Maciej Fijalkowski > > > > # Date 1471726818 -7200 > > > > # Sat Aug 20 23:00:18 2016 +0200 > > > > # Node ID 21b2401d468d6b24c1658468e4fc5ce8744f925b > > > > # Parent 300f14ea21432face8d7e6cdcf92ba9d2f1f92dc > > > > manifest: write a more efficient version of lazymanifest, in pure python > > > > > > > > Questions outsdanding: > > > > * who calls filtercopy? noone in tests at the very least > > > manifest.py line 287 or thereabouts: it’s used for matches() on a > > > manifest. > > > test-status-rev.t will exercise that codepath if that helps. > > > > > > > > > > > > > > > * are the performance tradeoffs ok here, notably __delitem__? > > > Probably. It looks very similar to the C version. Deleting a ton of > > > entries > > > from a large manifest is probably still tragic, since it’ll be a lot of > > > copies, but for a first pass this is already so much better than the naive > > > version that was there... > > > > > > > > > > > > > > > * should we use mmap instead of reading stuff from a file? > > > This I can’t answer. Maybe? > > > > > > > > > > > > > > > * current version does not support 21 or 22 long hashes, why are they > > > > necessary? > > > There are a handful of (weird) places that do something like poke a “+” on > > > the > > > end of a hash in the manifest to mark it as dirty, but that is never saved > > > to > > > disk. > > FWIW, we could probably replace the "+" hack with an "C"*40 hack. The SHA1 > > space > > is big enough that we can punch some other small holes in it in addition to > > the > > null hash. > I thought Yuya was working on adding "F"*40 (unless I missed something > special about "C"?)? It's just a number with a 2^-160 probability of collision with a real hash. ... was chosen as a pseudo-hash for the working copy pseudo-changeset to contrast with the all-zeros representation of nullid. C... is simply a (confusing) mnemonic for "changed" since M isn't available. Having the pseudo-hashes for the file-level and changeset-level be the same is slightly risky since we have nothing in Python that prevents us from ever comparing these different types. But it has the upside that they have similar interpretations. -- Mathematics is the supreme nostalgia of our time. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: [PATCH rfc] manifest: write a more efficient version of lazymanifest, in pure python
On Fri, 2016-08-26 at 16:57 -0700, Sean Farley wrote: > Matt Mackallwrites: > > > > > On Wed, 2016-08-24 at 23:55 -0400, Augie Fackler wrote: > > > > > > > > > > > > > > > On Aug 20, 2016, at 5:02 PM, Maciej Fijalkowski > > > > wrote: > > > > > > > > # HG changeset patch > > > > # User Maciej Fijalkowski > > > > # Date 1471726818 -7200 > > > > # Sat Aug 20 23:00:18 2016 +0200 > > > > # Node ID 21b2401d468d6b24c1658468e4fc5ce8744f925b > > > > # Parent 300f14ea21432face8d7e6cdcf92ba9d2f1f92dc > > > > manifest: write a more efficient version of lazymanifest, in pure python > > > > > > > > Questions outsdanding: > > > > * who calls filtercopy? noone in tests at the very least > > > manifest.py line 287 or thereabouts: it’s used for matches() on a > > > manifest. > > > test-status-rev.t will exercise that codepath if that helps. > > > > > > > > > > > > > > > * are the performance tradeoffs ok here, notably __delitem__? > > > Probably. It looks very similar to the C version. Deleting a ton of > > > entries > > > from a large manifest is probably still tragic, since it’ll be a lot of > > > copies, but for a first pass this is already so much better than the naive > > > version that was there... > > > > > > > > > > > > > > > * should we use mmap instead of reading stuff from a file? > > > This I can’t answer. Maybe? > > > > > > > > > > > > > > > * current version does not support 21 or 22 long hashes, why are they > > > > necessary? > > > There are a handful of (weird) places that do something like poke a “+” on > > > the > > > end of a hash in the manifest to mark it as dirty, but that is never saved > > > to > > > disk. > > FWIW, we could probably replace the "+" hack with an "C"*40 hack. The SHA1 > > space > > is big enough that we can punch some other small holes in it in addition to > > the > > null hash. > I thought Yuya was working on adding "F"*40 (unless I missed something > special about "C"?)? It's just a number with a 2^-160 probability of collision with a real hash. ... was chosen as a pseudo-hash for the working copy pseudo-changeset to contrast with the all-zeros representation of nullid. C... is simply a (confusing) mnemonic for "changed" since M isn't available. Having the pseudo-hashes for the file-level and changeset-level be the same is slightly risky since we have nothing in Python that prevents us from ever comparing these different types. But it has the upside that they have similar interpretations. -- Mathematics is the supreme nostalgia of our time. ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: [PATCH rfc] manifest: write a more efficient version of lazymanifest, in pure python
On Fri, 26 Aug 2016 16:57:25 -0700, Sean Farley wrote: > Matt Mackallwrites: > > On Wed, 2016-08-24 at 23:55 -0400, Augie Fackler wrote: > >> > On Aug 20, 2016, at 5:02 PM, Maciej Fijalkowski wrote: > >> > * current version does not support 21 or 22 long hashes, why are they > >> > necessary? > >> There are a handful of (weird) places that do something like poke a “+” on > >> the > >> end of a hash in the manifest to mark it as dirty, but that is never saved > >> to > >> disk. > > > > FWIW, we could probably replace the "+" hack with an "C"*40 hack. The SHA1 > > space > > is big enough that we can punch some other small holes in it in addition to > > the > > null hash. > > I thought Yuya was working on adding "F"*40 (unless I missed something > special about "C"?)? Yes, "F"*40 for wdir revision. I have experimental patches. Is that related to the "+" hack in manifest? ___ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
Re: [PATCH rfc] manifest: write a more efficient version of lazymanifest, in pure python
> On Aug 20, 2016, at 5:02 PM, Maciej Fijalkowskiwrote: > > # HG changeset patch > # User Maciej Fijalkowski > # Date 1471726818 -7200 > # Sat Aug 20 23:00:18 2016 +0200 > # Node ID 21b2401d468d6b24c1658468e4fc5ce8744f925b > # Parent 300f14ea21432face8d7e6cdcf92ba9d2f1f92dc > manifest: write a more efficient version of lazymanifest, in pure python > > Questions outsdanding: > * who calls filtercopy? noone in tests at the very least manifest.py line 287 or thereabouts: it’s used for matches() on a manifest. test-status-rev.t will exercise that codepath if that helps. > * are the performance tradeoffs ok here, notably __delitem__? Probably. It looks very similar to the C version. Deleting a ton of entries from a large manifest is probably still tragic, since it’ll be a lot of copies, but for a first pass this is already so much better than the naive version that was there... > * should we use mmap instead of reading stuff from a file? This I can’t answer. Maybe? > * current version does not support 21 or 22 long hashes, why are they > necessary? There are a handful of (weird) places that do something like poke a “+” on the end of a hash in the manifest to mark it as dirty, but that is never saved to disk. > > diff --git a/mercurial/manifest.py b/mercurial/manifest.py > --- a/mercurial/manifest.py > +++ b/mercurial/manifest.py > @@ -104,68 +104,277 @@ > _checkforbidden(files) > return ''.join(lines) > > -class _lazymanifest(dict): > -"""This is the pure implementation of lazymanifest. > - > -It has not been optimized *at all* and is not lazy. > -""" > +class LazyManifestIter(object): > +def __init__(self, lm): > +self.pos = 0 > +self.lm = lm > > -def __init__(self, data): > -dict.__init__(self) > -for f, n, fl in _parse(data): > -self[f] = n, fl > +def next(self): > +try: > +data, pos = self.lm._get(self.pos) > +except IndexError: > +raise StopIteration > +if pos == -1: > +self.pos += 1 > +return data[0] > +self.pos += 1 > +zeropos = data.find('\x00', pos) > +return data[pos:zeropos] > > -def __setitem__(self, k, v): > -node, flag = v > -assert node is not None > -if len(node) > 21: > -node = node[:21] # match c implementation behavior > -dict.__setitem__(self, k, (node, flag)) > +class LazyManifestIterEntries(object): > +def __init__(self, lm): > +self.lm = lm > +self.pos = 0 > > def __iter__(self): > -return iter(sorted(dict.keys(self))) > +return self > > -def iterkeys(self): > -return iter(sorted(dict.keys(self))) > +def next(self): > +try: > +data, pos = self.lm._get(self.pos) > +except IndexError: > +raise StopIteration > +if pos == -1: > +self.pos += 1 > +return data > +zeropos = data.find('\x00', pos) > +hashval = unhexlify(data, self.lm.extrainfo[self.pos], > +zeropos + 1, 40) > +flags = self.lm._getflags(data, self.pos, zeropos) > +self.pos += 1 > +return (data[pos:zeropos], hashval, flags) > > -def iterentries(self): > -return ((f, e[0], e[1]) for f, e in sorted(self.iteritems())) > +def unhexlify(data, extra, pos, length): > +s = data[pos:pos+length] > +if extra: > +s += extra & 0xff > +return s.decode("hex") > + > +class _lazymanifest(object): > +def __init__(self, data, positions=None, extrainfo=None, extradata=None): > +if positions is None: > +self.positions = self.find_lines(data) > +self.extrainfo = [0] * len(self.positions) > +self.data = data > +self.extradata = [] > +else: > +self.positions = positions[:] > +self.extrainfo = extrainfo[:] > +self.extradata = extradata[:] > +self.data = data > + > +def find_lines(self, data): > +if not data: > +return [] > +pos = data.find("\n") > +positions = [0] > +while pos < len(data) - 1 and pos != -1: > +positions.append(pos + 1) > +pos = data.find("\n", pos + 1) > +return positions > + > +def _get(self, index): > +# get the position encoded in pos: > +# positive number is an index in 'data' > +# negative number is in extrapieces > +pos = self.positions[index] > +if pos >= 0: > +return self.data, pos > +return self.extradata[-pos-1], -1 > + > +def _getkey(self, pos): > +if pos >= 0: > +return self.data[pos:self.data.find('\x00', pos + 1)] > +return self.extradata[-pos-1][0] > + > +def bsearch(self, key): > +first = 0 > +last =