I have no idea about the schema changes, but choosing a digest for detecting identity reverts is pretty simple. The really difficult part is choosing a locality-sensitive hash (LSH) or fingerprint that works for very similar revisions with a lot of content.
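
To make the distinction concrete, here is a minimal sketch, not a proposal for the actual implementation. It assumes SHA-1 as the digest and uses word-shingle sets with Jaccard similarity as a simple stand-in for a fingerprint (it is not a true locality-sensitive hash). All function names are hypothetical.

```python
import hashlib


def digest(text: str) -> str:
    """Exact-content digest of a revision (SHA-1 is an assumption here)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_identity_reverts(revisions):
    """Return (reverting_index, restored_index) pairs.

    A revision counts as an identity revert when its digest matches an
    earlier revision that is not its immediate predecessor.
    """
    last_seen = {}  # digest -> index of the latest revision with that content
    reverts = []
    for i, text in enumerate(revisions):
        d = digest(text)
        if d in last_seen and last_seen[d] < i - 1:
            reverts.append((i, last_seen[d]))
        last_seen[d] = i
    return reverts


def shingles(text: str, k: int = 3):
    """Set of k-word shingles; a crude fingerprint of the content."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

The digest is a fixed size no matter how long the revision is, so it is cheap to store. The shingle set grows with the content, which is the reason a fingerprint of this kind is awkward to precompute and store.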
I would propose that the digest be stored in the database and that an LSH or fingerprint be calculated on the fly by the API, unless someone can find a really good way to make and store an LSH or fingerprint that has all the necessary properties. For all the purposes I know (and care) about, the digest will be used for detecting identity reverts, while the LSH/fingerprint will be used for resynchronization after difficult partial reverts. In addition, it seems likely that fingerprints are necessary for more fine-grained analysis. It seems that the required properties of the LSH and the fingerprint scale with increasing content, which makes it difficult to precompute a value.

John

On Mon, Nov 28, 2011 at 2:28 AM, Tim Starling <[email protected]> wrote:
> On 28/11/11 08:29, Brion Vibber wrote:
>> So... this seems to have snuck back in a month ago:
>> https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
>>
>> https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
>
> I don't think it really "snuck", Rob has been talking about it for a
> while, see e.g. comment 27.
>
>> Have we resolved the deployment questions on how to actually do the change?
>> Just want to make sure ops has plenty of warning before 1.19 comes down the
>> pipe. (Especially if we have to revert anything back to 1.18 during/after!)
>
> It can be deployed like any column addition to a large table: on the
> slaves first, then switch masters, then on the old masters. For 1.17
> we changed categorylinks (60M rows on enwiki), and that caused no
> problems. In 1.18 the schema changes were done by ops (Asher), and
> included flaggedrevs which is 30M rows on dewiki.
>
> The revision table is 320M rows on enwiki, but it doesn't pose any
> special challenges, as long as there's enough disk space. The snapshot
> host db26 is the only host which may possibly be in danger of running
> out of space, but if its snapshots are deleted and the space
> reallocated to /a then it won't have any trouble.
>
> Like the previous schema changes, this schema change will be done in
> advance of the software version change. The old version will work with
> the new schema, and the default value is harmless, so reverting back
> to 1.18 or restarting the populate script won't be a problem.
>
> -- Tim Starling
>
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
