> can it be a dataset generated from each revision and then published separately?
Perhaps it could be generated asynchronously via a job? Either stored in the
revision table or in a separate table.

On Fri, Sep 15, 2017 at 4:06 PM, Andrew Otto <[email protected]> wrote:

> > As a random idea - would it be possible to calculate the hashes when data
> > is transitioned from SQL to Hadoop storage?
>
> We take monthly snapshots of the entire history, so every month we’d have
> to pull the content of every revision ever made :o
>
>
> On Fri, Sep 15, 2017 at 4:01 PM, Stas Malyshev <[email protected]>
> wrote:
>
>> Hi!
>>
>> > We should hear from Joseph, Dan, Marcel, and Aaron H on this I think,
>> > but from the little I know:
>> >
>> > Most analytical computations (for things like reverts, as you say) don’t
>> > have easy access to content, so computing SHAs on the fly is pretty
>> > hard. MediaWiki history reconstruction relies on the SHA to figure out
>> > what revisions revert other revisions, as there is no reliable way to
>> > know if something is a revert other than by comparing SHAs.
>>
>> As a random idea - would it be possible to calculate the hashes when
>> data is transitioned from SQL to Hadoop storage? I imagine that would
>> slow down the transition, but not sure if it'd be substantial or not. If
>> we're using the hash just to compare revisions, we could also use a
>> different hash (maybe a non-crypto hash?) which may be faster.
>>
>> --
>> Stas Malyshev
>> [email protected]

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
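
To make the revert-detection point above concrete, here is a minimal sketch in
plain Python of the idea the thread describes: treat a revision as a revert if
its content hash matches the hash of an earlier revision of the same page. The
field names (rev_id, content) are illustrative only, not the actual MediaWiki
or Data Lake schema.

# Revert detection by content-hash comparison (sketch).
import hashlib

def find_reverts(revisions):
    """revisions: dicts for one page, ordered oldest to newest.
    Returns (reverting_rev_id, reverted_to_rev_id) pairs."""
    seen = {}       # content hash -> rev_id that first produced that content
    reverts = []
    for rev in revisions:
        digest = hashlib.sha1(rev["content"].encode("utf-8")).hexdigest()
        if digest in seen:
            reverts.append((rev["rev_id"], seen[digest]))
        else:
            seen[digest] = rev["rev_id"]
        # A non-cryptographic hash (e.g. xxhash) would also work here if the
        # hash is only ever used for equality checks, as Stas suggests.
    return reverts

history = [
    {"rev_id": 1, "content": "original text"},
    {"rev_id": 2, "content": "vandalised text"},
    {"rev_id": 3, "content": "original text"},  # identical to rev 1
]
print(find_reverts(history))  # -> [(3, 1)]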

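For the "compute the hash during the SQL to Hadoop transition" idea, a rough
sketch of what that could look like in a Spark snapshot job follows. The paths,
table name, and column names are hypothetical, and the real Analytics pipeline
may be organised quite differently; this only illustrates adding a hash column
while the snapshot is being written.

from pyspark.sql import SparkSession
from pyspark.sql.functions import sha1, xxhash64, col

spark = SparkSession.builder.appName("revision-hash-backfill").getOrCreate()

# Hypothetical location of revision content after the monthly snapshot lands.
revisions = spark.read.parquet("/wmf/data/raw/revision_content")

hashed = (
    revisions
    # Cryptographic hash, comparable to MediaWiki's rev_sha1 (which stores
    # the same SHA-1 digest, base-36 encoded).
    .withColumn("content_sha1", sha1(col("content")))
    # Cheaper non-cryptographic alternative (Spark >= 3.0) if the hash is
    # only used to compare revisions for equality.
    .withColumn("content_xxhash", xxhash64(col("content")))
)

hashed.select("rev_id", "content_sha1", "content_xxhash") \
      .write.parquet("/wmf/data/wmf/revision_hashes")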