I have no idea about the schema changes, but choosing a digest for
detecting identity reverts is pretty simple. The really difficult
part is choosing a locality-sensitive hash (LSH) or fingerprint that
works for very similar revisions with a lot of content.

I would propose that the digest is stored in the database, and that an
LSH or fingerprint is calculated on the fly by the API, unless someone
can find a really good way to compute and store an LSH or fingerprint
that has all the necessary properties.
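
As a minimal sketch of the digest side: the code below assumes SHA-1
as the digest (nothing in this thread fixes the choice) and uses
hypothetical helper names. It flags a revision as an identity revert
when its digest matches that of an earlier revision of the same page.

```python
import hashlib


def digest(text: str) -> str:
    """Digest of the full revision text (SHA-1 is an assumption here)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def find_identity_reverts(revisions):
    """Return (index, earlier_index) pairs for exact restorations.

    `revisions` is a list of revision texts in chronological order.
    A pair (i, j) means revision i has the same digest as the earliest
    revision j < i with that content.
    """
    first_seen = {}  # digest -> index of the first revision with it
    reverts = []
    for i, text in enumerate(revisions):
        d = digest(text)
        if d in first_seen:
            reverts.append((i, first_seen[d]))
        else:
            first_seen[d] = i
    return reverts


history = ["a", "b", "a", "c", "b"]
pairs = find_identity_reverts(history)
```

With a stored digest column, the same lookup needs only the digests,
not the revision texts.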

For all the purposes I know (and care) about, the digest will be used
for detecting identity reverts, while the LSH/fingerprint will be
used for resynchronization after difficult partial reverts. In
addition, it seems likely that fingerprints are necessary for more
fine-grained analysis.

It seems that the necessary properties of the LSH and the fingerprint
scale with increasing content, which makes it difficult to precompute
a value.
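
To illustrate why computing on the fly is attractive, here is a rough
sketch of one possible fingerprint: hashed word shingles compared by
Jaccard similarity. This is an assumed technique, not something
proposed in this thread, and all names are hypothetical. The shingle
size k is chosen at query time, which a precomputed value could not
offer.

```python
import hashlib


def shingles(text, k=3):
    """Set of overlapping k-word windows in the text."""
    words = text.split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=3):
    """Set of shingle hashes; k can be tuned to the content size."""
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()
            for s in shingles(text, k)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def closest_earlier(revisions, index, k=3):
    """Most similar earlier revision: a candidate resync point."""
    target = fingerprint(revisions[index], k)
    best, best_score = None, -1.0
    for j in range(index):
        score = jaccard(target, fingerprint(revisions[j], k))
        if score > best_score:
            best, best_score = j, score
    return best, best_score


revisions = [
    "the quick brown fox jumps over the lazy dog",
    "vandalism here",
    "the quick brown fox jumps over the lazy cat",
]
best, score = closest_earlier(revisions, 2)
```

Here the last revision is not an exact copy of the first, so a digest
comparison misses it, while the fingerprint still points back to it.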

John

On Mon, Nov 28, 2011 at 2:28 AM, Tim Starling <[email protected]> wrote:
> On 28/11/11 08:29, Brion Vibber wrote:
>> So... this seems to have snuck back in a month ago:
>> https://www.mediawiki.org/wiki/Special:Code/MediaWiki/101021
>>
>> https://bugzilla.wikimedia.org/show_bug.cgi?id=21860
>
> I don't think it really "snuck", Rob has been talking about it for a
> while, see e.g. comment 27.
>
>> Have we resolved the deployment questions on how to actually do the change?
>> Just want to make sure ops has plenty of warning before 1.19 comes down the
>> pipe. (Especially if we have to revert anything back to 1.18 during/after!)
>
> It can be deployed like any column addition to a large table: on the
> slaves first, then switch masters, then on the old masters. For 1.17
> we changed categorylinks (60M rows on enwiki), and that caused no
> problems. In 1.18 the schema changes were done by ops (Asher), and
> included flaggedrevs which is 30M rows on dewiki.
>
> The revision table is 320M rows on enwiki, but it doesn't pose any
> special challenges, as long as there's enough disk space. The snapshot
> host db26 is the only host which may possibly be in danger of running
> out of space, but if its snapshots are deleted and the space
> reallocated to /a then it won't have any trouble.
>
> Like the previous schema changes, this schema change will be done in
> advance of the software version change. The old version will work with
> the new schema, and the default value is harmless, so reverting back
> to 1.18 or restarting the populate script won't be a problem.
>
> -- Tim Starling
>
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
