[Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
Hi all! I'm working on the database schema for Multi-Content-Revisions (MCR) and I'd like to get rid of the rev_sha1 field: Maintaining revision hashes (the rev_sha1 field) is expensive, and becomes more expensive with MCR.

[Wikitech-l] TechCom Radar, 2017-09-13

2017-09-15 Thread Daniel Kinzler
Hello all! Find below the minutes of the last meeting of the Technical Committee. * Wednesday’s IRC discussion was about all things JobQueue: https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/Job_Queue * RFC declined as proposed: restructuring the MediaWiki repo

[Wikitech-l] [Outreachy Round 15] Applications are now open

2017-09-15 Thread Srishti Sethi
Hello all, Applications for the Outreachy December 2017 to March 2018 internships are now open. If you are interested in participating, you can find the application process and ideas for Wikimedia projects here: https://www.mediawiki.org/wiki/Outreachy/Round_15 Help us spread the word among your

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Andrew Otto
We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but from the little I know: Most analytical computations (for things like reverts, as you say) don’t have easy access to content, so computing SHAs on the fly is pretty hard. MediaWiki history reconstruction relies on the SHA
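
The SHA-based revert detection mentioned here can be sketched as follows, assuming only stub-style metadata (a revision id plus its content hash) is available. This is an illustration only, not the actual analytics pipeline; the function name and the tuple layout are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def find_identity_reverts(
    revisions: Iterable[Tuple[int, str]],
) -> List[Tuple[int, int]]:
    """Return (reverting_rev_id, restored_rev_id) pairs for one page.

    `revisions` is assumed to be in chronological order, each item being
    (rev_id, sha1). No revision content is needed, only the hashes.
    """
    first_seen: Dict[str, int] = {}  # sha1 -> earliest rev_id with that content
    reverts: List[Tuple[int, int]] = []
    prev_sha: Optional[str] = None

    for rev_id, sha1 in revisions:
        # Content identical to an earlier revision counts as a revert,
        # unless it merely repeats the immediately preceding revision.
        if sha1 in first_seen and sha1 != prev_sha:
            reverts.append((rev_id, first_seen[sha1]))
        first_seen.setdefault(sha1, rev_id)
        prev_sha = sha1

    return reverts
```

Without a stored hash, the same loop would need the full text of every revision, which is the cost the thread is weighing.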

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread James Hare
What I wonder is – does this *need* to be a part of the database table, or can it be a dataset generated from each revision and then published separately? This way each user wouldn’t have to individually compute the hashes while we also get the (ostensible) benefit of getting them out of the

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Stas Malyshev
Hi! > We should hear from Joseph, Dan, Marcel, and Aaron H on this I think, but > from the little I know: > > Most analytical computations (for things like reverts, as you say) don’t > have easy access to content, so computing SHAs on the fly is pretty hard. > MediaWiki history reconstruction

Re: [Wikitech-l] Web application to upload csv file to wiktionary

2017-09-15 Thread Shrinivasan T
> A web version of the command-line tool you wrote probably won't just > magically exist; if you don't have the time to write it, you can direct > people to PAWS which provides a browser-based console for running scripts. > Not as nice but still easier than setting up Python locally. >

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Erik Zachte
Computing the hashes on the fly for offline analysis doesn’t work for Wikistats 1.0, as it only parses the stub dumps, which contain just metadata and no article content. Parsing the full archive dumps is quite expensive, time-wise. This may change with Wikistats 2.0, which has a totally different

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Gergo Tisza
At a quick glance, EventBus and FlaggedRevs are the two extensions using the hashes. EventBus just puts them into the emitted data; FlaggedRevs detects reverts to the latest stable revision that way (so there is no rev_sha1 based lookup in either case, although in the case of FlaggedRevs I could

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread C. Scott Ananian
Alternatively, perhaps "hash" could be an optional part of an MCR chunk? We could keep it for the wikitext, but drop the hash for the metadata, and drop any support for a "combined" hash over wikitext + all-other-pieces. ...which raises the question of how reverts work in MCR. Is it just the

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Chad
We could keep it in the XML dumps (it's part of the XSD after all)...just compute it at export time. Not terribly hard, I don't think, we should have the parsed content already on hand -Chad On Fri, Sep 15, 2017 at 12:51 PM James Hare wrote: > What I wonder is – does

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Andrew Otto
> can it be a dataset generated from each revision and then published separately? Perhaps it could be generated asynchronously via a job? It could be stored either in the revision table or in a separate table. On Fri, Sep 15, 2017 at 4:06 PM, Andrew Otto wrote: > > As a random idea - would it be

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
A revert restores a previous revision. It covers all slots. The fact that reverts, watching, protecting, etc. still work per page, while you can have multiple different kinds of content on the page, is indeed the point of MCR. On 15.09.2017 at 22:23, C. Scott Ananian wrote: > Alternatively,

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Stas Malyshev
Hi! On 9/15/17 1:06 PM, Andrew Otto wrote: >> As a random idea - would it be possible to calculate the hashes > when data is transitioned from SQL to Hadoop storage? > > We take monthly snapshots of the entire history, so every month we’d > have to pull the content of every revision ever made :o

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
On 15.09.2017 at 19:49, Erik Zachte wrote: > Compute the hashes on the fly for the offline analysis doesn’t work for > Wikistats 1.0, as it only parses the stub dumps, without article content, > just metadata. > Parsing the full archive dumps is a quite expensive, time-wise. We can always

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Daniel Kinzler
Ok, a little more detail here: For MCR, we would have to keep around the hash of each content object ("slot") AND of each revision. This makes the revision and content tables "wider", which is a problem because they grow quite "tall", too. It also means we have to compute a hash of hashes for
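
To illustrate the two levels of hashing described here (one hash per slot, plus one per revision), a rough sketch follows. It assumes rev_sha1-style values are SHA-1 digests written in base 36 and zero-padded to 31 characters; the folding scheme for the revision-level hash and the names slot_sha1 and revision_sha1 are hypothetical, not something this message specifies.

```python
import hashlib
from typing import Dict, Optional

DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def slot_sha1(content: bytes) -> str:
    """SHA-1 of one content object, as a 31-character base-36 string (assumed format)."""
    n = int(hashlib.sha1(content).hexdigest(), 16)
    out = ""
    while n:
        n, r = divmod(n, 36)
        out = DIGITS[r] + out
    return out.rjust(31, "0")


def revision_sha1(slots: Dict[str, bytes]) -> str:
    """Hypothetical "hash of hashes" over all slots of a revision.

    Slots are visited in sorted role order so the result does not depend
    on insertion order. With a single slot, the revision hash equals the
    slot hash.
    """
    combined: Optional[str] = None
    for role in sorted(slots):
        h = slot_sha1(slots[role])
        if combined is None:
            combined = h
        else:
            combined = slot_sha1((combined + h).encode("ascii"))
    if combined is None:
        raise ValueError("a revision needs at least one slot")
    return combined
```

Under this scheme every slot row and every revision row carries its own 31-character value, which is the extra table width the message is concerned about.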

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Andrew Otto
> As a random idea - would it be possible to calculate the hashes when data is transitioned from SQL to Hadoop storage? We take monthly snapshots of the entire history, so every month we’d have to pull the content of every revision ever made :o On Fri, Sep 15, 2017 at 4:01 PM, Stas Malyshev

Re: [Wikitech-l] Can we drop revision hashes (rev_sha1)?

2017-09-15 Thread Matthew Flaschen
On 09/15/2017 06:51 AM, Daniel Kinzler wrote: Also, I believe Roan is currently looking for a better mechanism for tracking all kinds of reverts directly. Let's see if we want to use rev_sha1 for that better solution (a way to track reverts within MW itself) before we drop it. I know Roan