What I wonder is: does this *need* to be part of the database table, or could it be a dataset generated from each revision and published separately? That way each user wouldn't have to compute the hashes individually, while we'd still get the (ostensible) benefit of getting them out of the table.
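Concretely, the generator could be as small as the Python sketch below. It assumes a dump reader that yields (rev_id, content) pairs, which is hypothetical here; the base-36 encoding zero-padded to 31 characters is meant to mirror how MediaWiki stores rev_sha1, if I read the code right.

    import hashlib

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

    def rev_sha1_base36(content: bytes) -> str:
        # SHA-1 of the raw content, re-encoded in base 36 and
        # zero-padded to 31 characters, matching the rev_sha1 format.
        n = int(hashlib.sha1(content).hexdigest(), 16)
        digits = ""
        while n:
            n, rem = divmod(n, 36)
            digits = ALPHABET[rem] + digits
        return (digits or "0").rjust(31, "0")

    def publish_hash_dataset(revisions, out):
        # `revisions` is any iterable of (rev_id, content_bytes) pairs,
        # e.g. from a hypothetical dump reader; writes one TSV row each.
        for rev_id, content in revisions:
            out.write(f"{rev_id}\t{rev_sha1_base36(content)}\n")

Of course, a dump-wide run of this is still a full pass over the content, which is exactly the cost Erik points out below; the difference is that it would be paid once, centrally, rather than by every analyst.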
On September 15, 2017 at 12:41:03 PM, Andrew Otto ([email protected]) wrote:

We should hear from Joseph, Dan, Marcel, and Aaron H on this, I think, but from the little I know: most analytical computations (for things like reverts, as you say) don't have easy access to content, so computing SHAs on the fly is pretty hard.

MediaWiki history reconstruction relies on the SHA to figure out which revisions revert other revisions, as there is no reliable way to know whether something is a revert other than by comparing SHAs. See https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history (particularly the *revert* fields).

On Fri, Sep 15, 2017 at 1:49 PM, Erik Zachte <[email protected]> wrote:

> Computing the hashes on the fly for offline analysis doesn't work for
> Wikistats 1.0, as it only parses the stub dumps, without article content,
> just metadata. Parsing the full archive dumps is quite expensive,
> time-wise.
>
> This may change with Wikistats 2.0, which has a totally different process
> flow. That I can't tell yet.
>
> Erik Zachte
>
> -----Original Message-----
> From: Wikitech-l [mailto:[email protected]] On
> Behalf Of Daniel Kinzler
> Sent: Friday, September 15, 2017 12:52
> To: Wikimedia developers <[email protected]>
> Subject: [Wikitech-l] Can we drop revision hashes (rev_sha1)?
>
> Hi all!
>
> I'm working on the database schema for Multi-Content-Revisions (MCR)
> <https://www.mediawiki.org/wiki/Multi_Content_Revisions/Database_Schema>
> and I'd like to get rid of the rev_sha1 field.
>
> Maintaining revision hashes (the rev_sha1 field) is expensive, and
> becomes more expensive with MCR: with multiple content objects per
> revision, we need to track the hash for each slot, and then re-calculate
> the SHA-1 for each revision. That's expensive especially in terms of
> bytes per database row, which impacts query performance.
>
> So, what do we need the rev_sha1 field for? As far as I know, nothing in
> core uses it, and I'm not aware of any extension using it either. It
> seems to be used primarily in offline analysis, for detecting (manual)
> reverts by looking for revisions with the same hash.
>
> Is that reason enough for dragging all the hashes around the database
> with every revision update? Or can we just compute the hashes on the fly
> for the offline analysis? Computing hashes is slow, since the content
> needs to be loaded first, but it would only have to be done for pairs of
> revisions of the same page with the same size, which should be a pretty
> good optimization.
>
> Also, I believe Roan is currently looking for a better mechanism for
> tracking all kinds of reverts directly.
>
> So, can we drop rev_sha1?
>
> --
> Daniel Kinzler
> Principal Platform Engineer
>
> Wikimedia Deutschland
> Gesellschaft zur Förderung Freien Wissens e.V.
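To make Daniel's "same page, same size" optimization above concrete, here is a rough Python sketch of offline identity-revert detection without stored hashes. The `load_content` loader is hypothetical (and is the expensive part); the point is that content only has to be fetched and hashed for revisions whose (page, size) pair collides.

    import hashlib
    from collections import defaultdict

    def find_identity_reverts(revision_metadata, load_content):
        # `revision_metadata` yields (rev_id, page_id, size) tuples,
        # e.g. from the stub dumps; load_content(rev_id) -> bytes is a
        # hypothetical, expensive content fetch.
        candidates = defaultdict(list)
        for rev_id, page_id, size in revision_metadata:
            candidates[(page_id, size)].append(rev_id)

        reverts = []
        for group in candidates.values():
            if len(group) < 2:
                continue  # a unique (page, size) pair can't match anything
            seen = {}  # content hash -> earliest rev_id with that content
            for rev_id in sorted(group):  # rev_id order as a proxy for time
                digest = hashlib.sha1(load_content(rev_id)).digest()
                if digest in seen:
                    reverts.append((rev_id, seen[digest]))
                else:
                    seen[digest] = rev_id
        return reverts  # (reverting_rev, restored_rev) pairs

The win is that hashing stays confined to (page, size) collisions, so most revisions never need their content loaded at all; the loss, as Andrew and Erik note above, is that anyone doing this analysis needs content access in the first place, which the stub-dump-based pipelines don't have.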
