Hello all,

TL;DR: A prototype covering incoming/pull can be found at the end, after the hyphen line.
This mail is about a prototype for a discovery mechanism based on Invertible Bloom Lookup Tables (IBLTs). An IBLT is a generalized form of a Bloom filter and can be used to efficiently compute set differences. The only requirement is that the set items are canonical and have the same size; e.g. a cryptographic hash over a record can be used as identifier. It is efficient in the following senses:

(1) The data to be exchanged between two parties scales linearly with the size of the difference, i.e. O(|A\B| + |B\A|) given sets A and B.

(2) The number of round trips is expected to be O(1): normally one for the estimation step and a second for the actual exchange. Depending on the details, both steps can be part of one round trip in opposite directions.

For changeset discovery, this approach is interesting because it works independently of the structure of the commit graph, as long as the two graphs are similar enough. This is different from the currently used sampling protocol, which scales with the total and disjoint number of DAG heads. This property also makes it attractive for exchanging additional data attached to commits, like changeset signatures or provenance records.

The expected O(1) derives from the probabilistic nature of the algorithm, e.g. from properties of random 3-graphs. The same principles are found in modern construction mechanisms for perfect hash tables and error-correcting codes. The success rate increases with size, and practical simulations of the hypergraphs show that decoding almost always succeeds, even for moderate sizes in the thousand-entry range.

Based on the natural and justified choices of (1) deterministically keyed hash functions, (2) fixed size categories with exponential growth and (3) retrying with the next larger size as the normal handling for failure cases, the data structure can be pre-computed and easily updated incrementally.
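To make the mechanics concrete, here is a from-scratch sketch of an IBLT with subtraction and peeling-based decoding. This is an illustration, not the prototype's code: the hash derivation, the cell layout (one subtable per hash position to keep positions distinct) and the fixed sizes are simplified assumptions.

```python
import hashlib

K = 3  # hash positions per key (random 3-graph)

def _positions(key, m):
    # Derive K deterministic, distinct cell indices: one per subtable.
    sub = m // K
    h = hashlib.sha256(key).digest()
    return [i * sub + int.from_bytes(h[4*i:4*i+4], 'big') % sub
            for i in range(K)]

def _checksum(key):
    # Per-key checksum so pure cells can be recognized during peeling.
    return hashlib.sha256(b'chk' + key).digest()[:8]

class IBLT:
    def __init__(self, m, keylen):
        self.m, self.keylen = m, keylen
        self.count = [0] * m
        self.keysum = [bytes(keylen)] * m
        self.chksum = [bytes(8)] * m

    def _update(self, key, delta):
        c = _checksum(key)
        for i in _positions(key, self.m):
            self.count[i] += delta
            self.keysum[i] = bytes(a ^ b for a, b in zip(self.keysum[i], key))
            self.chksum[i] = bytes(a ^ b for a, b in zip(self.chksum[i], c))

    def insert(self, key): self._update(key, 1)
    def delete(self, key): self._update(key, -1)

    def subtract(self, other):
        # Cell-wise difference; common keys cancel, so the result
        # encodes only the symmetric difference of the two sets.
        r = IBLT(self.m, self.keylen)
        for i in range(self.m):
            r.count[i] = self.count[i] - other.count[i]
            r.keysum[i] = bytes(a ^ b for a, b in
                                zip(self.keysum[i], other.keysum[i]))
            r.chksum[i] = bytes(a ^ b for a, b in
                                zip(self.chksum[i], other.chksum[i]))
        return r

    def decode(self):
        # Peel cells holding exactly one key; each peel may expose
        # new pure cells, like 2-core peeling on the hypergraph.
        only_a, only_b = set(), set()
        queue = [i for i in range(self.m) if self.count[i] in (1, -1)]
        while queue:
            i = queue.pop()
            if self.count[i] not in (1, -1):
                continue
            key = self.keysum[i]
            if self.chksum[i] != _checksum(key):
                continue  # mixed cell, not pure; may peel later
            (only_a if self.count[i] == 1 else only_b).add(key)
            self._update(key, -self.count[i])
            queue.extend(_positions(key, self.m))
        if any(self.count):
            raise ValueError('decoding failed; retry with a larger table')
        return only_a, only_b
```

In practice the table size would be picked from the estimator's difference estimate, and a decode() failure would be answered by retrying with the next larger size class, as described above.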
Without trying for optimal encodings, the space requirement would be O(d * log d), with d being the maximum allowed difference size, and the update cost O(log d) when adding a new key.

-------

A working prototype as an alternative to the current set discovery algorithm can be found at

https://www.netbsd.org/~joerg/iblt-exchange.diff

There are a couple of hacks and shortcuts in the code, but it works well enough to illustrate the point. It is written to be minimally intrusive. For example, there is no cache (in)validation or incremental update implemented.

From a protocol perspective, it is also much more inefficient than necessary. It reuses the current getbundle wire command. For this, it needs to compute (or at least approximate a superset of) the common heads and the remote sets. This is done by using both the node id of a commit and its parent nodes as keys, so it triples the number of keys and therefore the exchanged data. A better but more intrusive approach would replace the parameters of getbundle directly with the IBLT.

The test case is my local pkgsrc repository with ~776k commits and a clone without my local changesets. Incoming shows 71 changes. The test uses ssh over localhost. All values are best of three runs.

                    New protocol    Old protocol
  Run time          2.40s           2.77s
  Data to server    2,784           24,338
  Data from server  19,218,995      19,181,398

For the no-change case of incoming against the same repository:

                    New protocol    Old protocol
  Run time          0.62s           0.62s
  Data to server    544             9,965
  Data from server  38,796          11,028

For reference, the estimator step uses 35,308 bytes and results in an IBLT of 181 entries for the first case and 32 entries for the second, with sizes of 11,958 bytes and 2,124 bytes respectively.

Joerg

_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel