Computing the hashes on the fly for the offline analysis doesn't work for 
Wikistats 1.0, as it only parses the stub dumps, which contain just metadata 
and no article content.
Parsing the full archive dumps is quite expensive, time-wise.

This may change with Wikistats 2.0, which has a totally different process flow. 
That I can't tell.

Erik Zachte

-----Original Message-----
From: Wikitech-l [mailto:[email protected]] On Behalf Of 
Daniel Kinzler
Sent: Friday, September 15, 2017 12:52
To: Wikimedia developers <[email protected]>
Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

Hi all!

I'm working on the database schema for Multi-Content-Revisions (MCR) 
<https://www.mediawiki.org/wiki/Multi-Content_Revisions/Database_Schema> and 
I'd like to get rid of the rev_sha1 field:

Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more 
expensive with MCR. With multiple content objects per revision, we need to 
track the hash for each slot, and then re-calculate the sha1 for each revision.

That's expensive especially in terms of bytes-per-database-row, which impacts 
query performance.

So, what do we need the rev_sha1 field for? As far as I know, nothing in core 
uses it, and I'm not aware of any extension using it either. It seems to be 
used primarily in offline analysis for detecting (manual) reverts by looking 
for revisions with the same hash.

Is that reason enough for dragging all the hashes around the database with 
every revision update? Or can we just compute the hashes on the fly for the 
offline analysis? Computing hashes is slow since the content needs to be loaded 
first, but it would only have to be done for pairs of revisions of the same 
page with the same size, which should be a pretty good optimization.
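
A minimal sketch of that optimization, under stated assumptions: the 
(rev_id, page_id, size) tuples and the load_content callable are hypothetical 
stand-ins for whatever the analysis tooling reads from a dump or the database, 
and the hex digest used here is not necessarily the encoding stored in 
rev_sha1. Content is loaded and hashed only for revisions that share a page 
and a size with at least one other revision.

```python
import hashlib
from collections import defaultdict


def find_identical_revisions(revisions, load_content):
    """Group revisions of the same page that have byte-identical content.

    revisions:    iterable of (rev_id, page_id, size) tuples (assumed shape).
    load_content: hypothetical callable mapping rev_id -> content as bytes.
    Returns a list of lists of rev_ids; each inner list shares one SHA-1.
    """
    # Cheap pass: bucket by (page, size) using metadata only.
    by_page_and_size = defaultdict(list)
    for rev_id, page_id, size in revisions:
        by_page_and_size[(page_id, size)].append(rev_id)

    groups = []
    for rev_ids in by_page_and_size.values():
        if len(rev_ids) < 2:
            continue  # size is unique on this page, so no hash is needed

        # Expensive pass: load and hash only the remaining candidates.
        by_hash = defaultdict(list)
        for rev_id in rev_ids:
            digest = hashlib.sha1(load_content(rev_id)).hexdigest()
            by_hash[digest].append(rev_id)

        groups.extend(ids for ids in by_hash.values() if len(ids) > 1)
    return groups
```

Revisions whose size is unique within their page never have their content 
loaded, which is where the saving would come from.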

Also, I believe Roan is currently looking for a better mechanism for tracking 
all kinds of reverts directly.

So, can we drop rev_sha1?

--
Daniel Kinzler
Principal Platform Engineer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

