Re: [PATCH rfc] manifest: write a more efficient version of lazymanifest, in pure python

2016-08-29 Thread Matt Mackall
On Fri, 2016-08-26 at 16:57 -0700, Sean Farley wrote:
> Matt Mackall  writes:
> 
> > 
> > On Wed, 2016-08-24 at 23:55 -0400, Augie Fackler wrote:
> > > 
> > > > 
> > > > 
> > > > On Aug 20, 2016, at 5:02 PM, Maciej Fijalkowski 
> > > > wrote:
> > > > 
> > > > # HG changeset patch
> > > > # User Maciej Fijalkowski 
> > > > # Date 1471726818 -7200
> > > > #  Sat Aug 20 23:00:18 2016 +0200
> > > > # Node ID 21b2401d468d6b24c1658468e4fc5ce8744f925b
> > > > # Parent  300f14ea21432face8d7e6cdcf92ba9d2f1f92dc
> > > > manifest: write a more efficient version of lazymanifest, in pure python
> > > > 
> > > > Questions outsdanding:
> > > > * who calls filtercopy? noone in tests at the very least
> > > manifest.py line 287 or thereabouts: it’s used for matches() on a
> > > manifest.
> > > test-status-rev.t will exercise that codepath if that helps.
> > > 
> > > > 
> > > > 
> > > > * are the performance tradeoffs ok here, notably __delitem__?
> > > Probably. It looks very similar to the C version. Deleting a ton of
> > > entries
> > > from a large manifest is probably still tragic, since it’ll be a lot of
> > > copies, but for a first pass this is already so much better than the naive
> > > version that was there...
> > > 
> > > > 
> > > > 
> > > > * should we use mmap instead of reading stuff from a file?
> > > This I can’t answer. Maybe?
> > > 
> > > > 
> > > > 
> > > > * current version does not support 21 or 22 long hashes, why are they
> > > > necessary?
> > > There are a handful of (weird) places that do something like poke a “+” on
> > > the
> > > end of a hash in the manifest to mark it as dirty, but that is never saved
> > > to
> > > disk.
> > FWIW, we could probably replace the "+" hack with an "C"*40 hack. The SHA1
> > space
> > is big enough that we can punch some other small holes in it in addition to
> > the
> > null hash.
> I thought Yuya was working on adding "F"*40 (unless I missed something
> special about "C"?)?

It's just a number with a 2^-160 probability of collision with a real hash.
... was chosen as a pseudo-hash for the working copy pseudo-changeset to
contrast with the all-zeros representation of nullid. C... is simply a
(confusing) mnemonic for "changed" since M isn't available.

Having the pseudo-hashes for the file-level and changeset-level be the same is
slightly risky since we have nothing in Python that prevents us from ever
comparing these different types. But it has the upside that they have similar
interpretations.

-- 
Mathematics is the supreme nostalgia of our time.

___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: [PATCH rfc] manifest: write a more efficient version of lazymanifest, in pure python

2016-08-27 Thread Matt Mackall
On Fri, 2016-08-26 at 16:57 -0700, Sean Farley wrote:
> Matt Mackall  writes:
> 
> > 
> > On Wed, 2016-08-24 at 23:55 -0400, Augie Fackler wrote:
> > > 
> > > > 
> > > > 
> > > > On Aug 20, 2016, at 5:02 PM, Maciej Fijalkowski 
> > > > wrote:
> > > > 
> > > > # HG changeset patch
> > > > # User Maciej Fijalkowski 
> > > > # Date 1471726818 -7200
> > > > #  Sat Aug 20 23:00:18 2016 +0200
> > > > # Node ID 21b2401d468d6b24c1658468e4fc5ce8744f925b
> > > > # Parent  300f14ea21432face8d7e6cdcf92ba9d2f1f92dc
> > > > manifest: write a more efficient version of lazymanifest, in pure python
> > > > 
> > > > Questions outsdanding:
> > > > * who calls filtercopy? noone in tests at the very least
> > > manifest.py line 287 or thereabouts: it’s used for matches() on a
> > > manifest.
> > > test-status-rev.t will exercise that codepath if that helps.
> > > 
> > > > 
> > > > 
> > > > * are the performance tradeoffs ok here, notably __delitem__?
> > > Probably. It looks very similar to the C version. Deleting a ton of
> > > entries
> > > from a large manifest is probably still tragic, since it’ll be a lot of
> > > copies, but for a first pass this is already so much better than the naive
> > > version that was there...
> > > 
> > > > 
> > > > 
> > > > * should we use mmap instead of reading stuff from a file?
> > > This I can’t answer. Maybe?
> > > 
> > > > 
> > > > 
> > > > * current version does not support 21 or 22 long hashes, why are they
> > > > necessary?
> > > There are a handful of (weird) places that do something like poke a “+” on
> > > the
> > > end of a hash in the manifest to mark it as dirty, but that is never saved
> > > to
> > > disk.
> > FWIW, we could probably replace the "+" hack with an "C"*40 hack. The SHA1
> > space
> > is big enough that we can punch some other small holes in it in addition to
> > the
> > null hash.
> I thought Yuya was working on adding "F"*40 (unless I missed something
> special about "C"?)?

It's just a number with a 2^-160 probability of collision with a real hash.
... was chosen as a pseudo-hash for the working copy pseudo-changeset to
contrast with the all-zeros representation of nullid. C... is simply a
(confusing) mnemonic for "changed" since M isn't available.

Having the pseudo-hashes for the file-level and changeset-level be the same is
slightly risky since we have nothing in Python that prevents us from ever
comparing these different types. But it has the upside that they have similar
interpretations.

-- 
Mathematics is the supreme nostalgia of our time.

___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: [PATCH rfc] manifest: write a more efficient version of lazymanifest, in pure python

2016-08-27 Thread Yuya Nishihara
On Fri, 26 Aug 2016 16:57:25 -0700, Sean Farley wrote:
> Matt Mackall  writes:
> > On Wed, 2016-08-24 at 23:55 -0400, Augie Fackler wrote:
> >> > On Aug 20, 2016, at 5:02 PM, Maciej Fijalkowski  wrote:
> >> > * current version does not support 21 or 22 long hashes, why are they
> >> > necessary?
> >> There are a handful of (weird) places that do something like poke a “+” on 
> >> the
> >> end of a hash in the manifest to mark it as dirty, but that is never saved 
> >> to
> >> disk.
> >
> > FWIW, we could probably replace the "+" hack with an "C"*40 hack. The SHA1 
> > space
> > is big enough that we can punch some other small holes in it in addition to 
> > the
> > null hash.
> 
> I thought Yuya was working on adding "F"*40 (unless I missed something
> special about "C"?)?

Yes, "F"*40 for wdir revision. I have experimental patches. Is that related to
the "+" hack in manifest?
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: [PATCH rfc] manifest: write a more efficient version of lazymanifest, in pure python

2016-08-24 Thread Augie Fackler

> On Aug 20, 2016, at 5:02 PM, Maciej Fijalkowski  wrote:
> 
> # HG changeset patch
> # User Maciej Fijalkowski 
> # Date 1471726818 -7200
> #  Sat Aug 20 23:00:18 2016 +0200
> # Node ID 21b2401d468d6b24c1658468e4fc5ce8744f925b
> # Parent  300f14ea21432face8d7e6cdcf92ba9d2f1f92dc
> manifest: write a more efficient version of lazymanifest, in pure python
> 
> Questions outsdanding:
> * who calls filtercopy? noone in tests at the very least

manifest.py line 287 or thereabouts: it’s used for matches() on a manifest. 
test-status-rev.t will exercise that codepath if that helps.

> * are the performance tradeoffs ok here, notably __delitem__?

Probably. It looks very similar to the C version. Deleting a ton of entries 
from a large manifest is probably still tragic, since it’ll be a lot of copies, 
but for a first pass this is already so much better than the naive version that 
was there...

> * should we use mmap instead of reading stuff from a file?

This I can’t answer. Maybe?

> * current version does not support 21 or 22 long hashes, why are they 
> necessary?

There are a handful of (weird) places that do something like poke a “+” on the 
end of a hash in the manifest to mark it as dirty, but that is never saved to 
disk.

> 
> diff --git a/mercurial/manifest.py b/mercurial/manifest.py
> --- a/mercurial/manifest.py
> +++ b/mercurial/manifest.py
> @@ -104,68 +104,277 @@
> _checkforbidden(files)
> return ''.join(lines)
> 
> -class _lazymanifest(dict):
> -"""This is the pure implementation of lazymanifest.
> -
> -It has not been optimized *at all* and is not lazy.
> -"""
> +class LazyManifestIter(object):
> +def __init__(self, lm):
> +self.pos = 0
> +self.lm = lm
> 
> -def __init__(self, data):
> -dict.__init__(self)
> -for f, n, fl in _parse(data):
> -self[f] = n, fl
> +def next(self):
> +try:
> +data, pos = self.lm._get(self.pos)
> +except IndexError:
> +raise StopIteration
> +if pos == -1:
> +self.pos += 1
> +return data[0]
> +self.pos += 1
> +zeropos = data.find('\x00', pos)
> +return data[pos:zeropos]
> 
> -def __setitem__(self, k, v):
> -node, flag = v
> -assert node is not None
> -if len(node) > 21:
> -node = node[:21] # match c implementation behavior
> -dict.__setitem__(self, k, (node, flag))
> +class LazyManifestIterEntries(object):
> +def __init__(self, lm):
> +self.lm = lm
> +self.pos = 0
> 
> def __iter__(self):
> -return iter(sorted(dict.keys(self)))
> +return self
> 
> -def iterkeys(self):
> -return iter(sorted(dict.keys(self)))
> +def next(self):
> +try:
> +data, pos = self.lm._get(self.pos)
> +except IndexError:
> +raise StopIteration
> +if pos == -1:
> +self.pos += 1
> +return data
> +zeropos = data.find('\x00', pos)
> +hashval = unhexlify(data, self.lm.extrainfo[self.pos],
> +zeropos + 1, 40)
> +flags = self.lm._getflags(data, self.pos, zeropos)
> +self.pos += 1
> +return (data[pos:zeropos], hashval, flags)
> 
> -def iterentries(self):
> -return ((f, e[0], e[1]) for f, e in sorted(self.iteritems()))
> +def unhexlify(data, extra, pos, length):
> +s = data[pos:pos+length]
> +if extra:
> +s += extra & 0xff
> +return s.decode("hex")
> +
> +class _lazymanifest(object):
> +def __init__(self, data, positions=None, extrainfo=None, extradata=None):
> +if positions is None:
> +self.positions = self.find_lines(data)
> +self.extrainfo = [0] * len(self.positions)
> +self.data = data
> +self.extradata = []
> +else:
> +self.positions = positions[:]
> +self.extrainfo = extrainfo[:]
> +self.extradata = extradata[:]
> +self.data = data
> +
> +def find_lines(self, data):
> +if not data:
> +return []
> +pos = data.find("\n")
> +positions = [0]
> +while pos < len(data) - 1 and pos != -1:
> +positions.append(pos + 1)
> +pos = data.find("\n", pos + 1)
> +return positions
> +
> +def _get(self, index):
> +# get the position encoded in pos:
> +#   positive number is an index in 'data'
> +#   negative number is in extrapieces
> +pos = self.positions[index]
> +if pos >= 0:
> +return self.data, pos
> +return self.extradata[-pos-1], -1
> +
> +def _getkey(self, pos):
> +if pos >= 0:
> +return self.data[pos:self.data.find('\x00', pos + 1)]
> +return self.extradata[-pos-1][0]
> +
> +def bsearch(self, key):
> +first = 0
> +last =