> can it be a dataset generated from each revision and then published
separately?

Perhaps it could be generated asynchronously via a job, and either stored in
the revision table or in a separate table?
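For background on the revert detection discussed downthread: MediaWiki history
reconstruction treats a revision as an identity revert when its content hash
matches an earlier revision's hash on the same page. A minimal sketch of that
comparison (function names here are hypothetical; MediaWiki actually stores
rev_sha1 as a base-36-encoded SHA-1 of the content, hex is used below for
simplicity):

```python
import hashlib

def content_hash(text: str) -> str:
    """Digest of the revision content, analogous in spirit to rev_sha1
    (which MediaWiki stores base-36 encoded; hex here for simplicity)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()

def find_identity_reverts(revisions):
    """revisions: iterable of (rev_id, text) in chronological order for one
    page. Returns (reverting_rev_id, reverted_to_rev_id) pairs: a revision
    whose hash matches an earlier revision restores that earlier content."""
    first_seen = {}  # content hash -> first rev_id with that content
    reverts = []
    for rev_id, text in revisions:
        h = content_hash(text)
        if h in first_seen:
            reverts.append((rev_id, first_seen[h]))
        else:
            first_seen[h] = rev_id
    return reverts
```

This is only a per-page, in-memory sketch; the point is that the whole
technique hinges on having a content hash available for every revision,
which is why recomputing them at snapshot time is expensive.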

On Fri, Sep 15, 2017 at 4:06 PM, Andrew Otto <[email protected]> wrote:

> > As a random idea - would it be possible to calculate the hashes when data
> is transitioned from SQL to Hadoop storage?
>
> We take monthly snapshots of the entire history, so every month we’d have
> to pull the content of every revision ever made :o
>
>
> On Fri, Sep 15, 2017 at 4:01 PM, Stas Malyshev <[email protected]>
> wrote:
>
>> Hi!
>>
>> > We should hear from Joseph, Dan, Marcel, and Aaron H on this I think,
>> but
>> > from the little I know:
>> >
>> > Most analytical computations (for things like reverts, as you say) don’t
>> > have easy access to content, so computing SHAs on the fly is pretty
>> hard.
>> > MediaWiki history reconstruction relies on the SHA to figure out what
>> > revisions revert other revisions, as there is no reliable way to know if
>> > something is a revert other than by comparing SHAs.
>>
>> As a random idea - would it be possible to calculate the hashes when
>> data is transitioned from SQL to Hadoop storage? I imagine that would
>> slow down the transition, but I'm not sure whether that would be
>> substantial. If we're using the hash just to compare revisions, we could
>> also use a different hash (maybe a non-crypto hash?) which may be faster.
>>
>> --
>> Stas Malyshev
>> [email protected]
>>
>
>
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l