Re: [PATCH RFC] changegroup: leave out all parent file revisions when creating bundles

2019-10-14 Thread Mads Kiilerich

On 10/11/19 7:57 PM, Mads Kiilerich wrote:

# HG changeset patch
# User Mads Kiilerich 
# Date 1570804632 -7200
#  Fri Oct 11 16:37:12 2019 +0200
# Node ID 72d12ee773795edc163f73b9160e5d29022878dd
# Parent  52781d57313d512efb7150603104bea3ca11d0eb
changegroup: leave out all parent file revisions when creating bundles


I'm not sure the analysis nailed the root cause. It is tricky to 
reproduce good test cases.


But one problem we could fix by pruning ancestors instead of using linkrevs:

Create a repo with aliasing:

  $ hg init repo1
  $ cd repo1
  $ touch f
  $ hg ci -Aqm 0
  $ echo 1 > f
  $ hg ci -m 1f1
  $ hg up -cqr 0
  $ hg branch -q b
  $ echo 1 > f  # linkrev aliasing to rev 1
  $ hg ci -m 2f1

When bundling rev 2 for a repo that has rev 1, f will be skipped even 
though it

isn't an ancestor:

  $ hg bundle -v -r 2 --base 1 bundle.hg
  1 changesets found
  uncompressed size of bundle content:
   185 (changelog)
   163 (manifests)

A bundle with missing ancestor revisions would fail unbundling with "abort:
00changelog.i@d681519c3ea7: unknown parent!". But if using 
fastpathlinkrev and
the ancestors are present but the file revs are not, the bundle can be 
applied

but will break on use:

  $ hg clone -qU . -r 0 repo2
  $ hg -R repo2 pull bundle.hg
  pulling from bundle.hg
  searching for changes
  adding changesets
  adding manifests
  adding file changes
  added 1 changesets with 0 changes to 0 files
  new changesets 5e690c649d09 (1 drafts)
  (run 'hg update' to get a working copy)
  $ hg -R repo2 up -r tip
  abort: data/f.i@d0c79e1d3309: no match found!
  [255]


/Mads

___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


[PATCH RFC] changegroup: leave out all parent file revisions when creating bundles

2019-10-11 Thread Mads Kiilerich
# HG changeset patch
# User Mads Kiilerich 
# Date 1570804632 -7200
#  Fri Oct 11 16:37:12 2019 +0200
# Node ID 72d12ee773795edc163f73b9160e5d29022878dd
# Parent  52781d57313d512efb7150603104bea3ca11d0eb
changegroup: leave out all parent file revisions when creating bundles

When creating a bundle, we get the revisions to bundle, and also the set of
revisions that have been discovered to be common between sender and receiver of
the bundle - which by definition will include all the ancestor revisions. File
revisions where linkrev points at common revisions will be pruned from the
bundle. The common revisions can be represented lazy and efficiently.

That is making the optimistic assumption that the linkrev revision is the
common one. Sometimes it isn't. In that case, the bundle will contain file
revisions that already are common. The increased size of the bundle might be a
problem, but it also makes the bundle size a useless approximation of
information content, and of how much it actually will increase repository size.

The linkrev based pruning does, however, have the advantage that it can leave
out common file revisions, even without being ancestors. Such aggressive
pruning will create bundles that only apply in a specific context, and
effectively make them depend on additional parents than the actual bundle
parents. It is debatable if that is a good feature and how much it actually
gains, but let's leave that out of scope and keep it for now.

We want to avoid that bundles contain filelog entries that are present in
parent revisions of the bundle content. We will do that by enumerating the
files that seem to be changed anywhere in the bundled set, find all parent
revisions of the bundled changesets, for the changed files find the set of file
revisions in the parent changesets, and leave these out when bundling.

To avoid visiting all base revisions for each file, we start by collecting
revisions of all touched files in all parents and keep that in memory. That
will be O(files_changed_in_set * revisions) ... but while bundled changesets
might be complex internally and all could have two unique parents outside the
bundle, they rarely have a lot of parents outside the set. It is also rare to
change a lot of files ... but even then it seems reasonable to be able to hold
the names of all files and some hashes in memory.) (Alternatively, we could
keep all manifests around and look files up on demand ... but manifests might
be big and changes are usually sparse, so it seems more efficient to compute
the sets upfront.)

Testing with the current verbose logging shows where this change makes a
difference ... including a few "because parents, not linkrev" places where this
change will have a positive impact. The changes generally seem reasonable to me
...

TODO:
gather feedback
more testing
investigate the
test-remotefilelog-bgprefetch.t instability
reduce the current logging

diff --git a/hgext/remotefilelog/shallowbundle.py 
b/hgext/remotefilelog/shallowbundle.py
--- a/hgext/remotefilelog/shallowbundle.py
+++ b/hgext/remotefilelog/shallowbundle.py
@@ -71,7 +71,8 @@ class shallowcg1packer(changegroup.cgpac
 try:
 linknodes, commonrevs, source = args
 except ValueError:
-commonrevs, source, mfdicts, fastpathlinkrev, fnodes, clrevs = args
+(commonrevs, source, mfdicts, fastpathlinkrev, fnodes, clrevs,
+ setparents) = args
 if shallowutil.isenabled(self._repo):
 repo = self._repo
 if isinstance(repo, bundlerepo.bundlerepository):
diff --git a/mercurial/changegroup.py b/mercurial/changegroup.py
--- a/mercurial/changegroup.py
+++ b/mercurial/changegroup.py
@@ -931,6 +931,9 @@ class cgpacker(object):
 clrevorder = clstate[b'clrevorder']
 manifests = clstate[b'manifests']
 changedfiles = clstate[b'changedfiles']
+setparents = sorted(set(clstate[b'parentrevs'])
+.difference(clstate[b'changerevs'])
+.difference([-1]))
 
 # We need to make sure that the linkrev in the changegroup refers to
 # the first changeset that introduced the manifest or file revision.
@@ -1005,6 +1008,7 @@ class cgpacker(object):
 fastpathlinkrev,
 fnodes,
 clrevs,
+setparents,
 )
 
 for path, deltas in it:
@@ -1044,12 +1048,16 @@ class cgpacker(object):
 mfl = self._repo.manifestlog
 changedfiles = set()
 clrevtomanifestrev = {}
+changerevs = set()
+parentrevs = set()
 
 state = {
 b'clrevorder': clrevorder,
 b'manifests': manifests,
 b'changedfiles': changedfiles,
 b'clrevtomanifestrev': clrevtomanifestrev,
+b'changerevs': changerevs,
+b'parentrevs': parentrevs,
 }
 
 if not (generate or self._ellipses):
@@ -1106,6 +1114,9 @@ class cgpacker(object):