Hi! What more do you have in mind that could be in an "augmented stream" than the current RCstream data + diffs as they are provided by the API?
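For concreteness, here is roughly the baseline I have in mind: listen on RCstream and fetch each edit's diff from the API with action=compare. A minimal Python sketch, assuming the socketIO_client and requests packages; treat it as an illustration, not production code:

import requests
import socketIO_client

API = 'https://en.wikipedia.org/w/api.php'

def fetch_diff(old_revid, new_revid):
    """Ask the API for the diff between two revisions (action=compare)."""
    r = requests.get(API, params={
        'action': 'compare',
        'fromrev': old_revid,
        'torev': new_revid,
        'format': 'json',
    })
    return r.json().get('compare', {}).get('*')  # HTML diff table

class RCNamespace(socketIO_client.BaseNamespace):
    def on_connect(self):
        self.emit('subscribe', 'en.wikipedia.org')

    def on_change(self, change):
        # Ordinary edits carry the old and new revision ids in the change event.
        if change.get('type') == 'edit':
            diff = fetch_diff(change['revision']['old'], change['revision']['new'])
            print(change['title'], len(diff or ''))

socketIO = socketIO_client.SocketIO('stream.wikimedia.org', 80)
socketIO.define(RCNamespace, '/rc')
socketIO.wait()

Everything in that sketch is already possible with the public endpoints, so the question is what the augmented stream would add on top of it.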
Mitar

On Mon, Dec 15, 2014 at 10:22 PM, Maximilian Klein <isa...@gmail.com> wrote:
> All,
> Thanks for the great responses. It seems like Andrew, Ed, DataSift, and Mitar are now all offering overlapping solutions to the real-time diff monitoring problem. The one thing I take away from that is that if the API is robust enough to serve these 4 clients in real time, then adding another is a drop in the bucket.
>
> However, as others like Yuvi pointed out, and Aaron has prototyped, we could make this better by serving an augmented RCstream. I wonder how easy it would be to allow community development on that project, since it seems that it would require access to the full databases, which only WMF developers seem to have access to at the moment.
>
> Make a great day,
> Max Klein ‽ http://notconfusing.com/
>
> On Mon, Dec 15, 2014 at 5:09 AM, Flöck, Fabian <fabian.flo...@gesis.org> wrote:
>>
>> If anyone is interested in faster processing of revision differences, you could also adapt the strategy we implemented for wikiwho [1], which is keeping track of bigger unchanged text chunks with hashes and just diffing the remaining text (usually a relatively small part of the article). We specifically introduced that technique because diffing all the text was too expensive. And in principle it can produce the same output, although we currently use it for authorship detection, which is a slightly different task. Anyway, it is on average >100 times faster than pure "traditional" diffing. Maybe that is useful for someone. Code is available at github [2].
>>
>> [1] http://f-squared.org/wikiwho
>> [2] https://github.com/maribelacosta/wikiwho
>>
>>
>> On 14.12.2014, at 07:23, Jeremy Baron <jer...@tuxmachine.com> wrote:
>>
>> On Dec 13, 2014 12:33 PM, "Aaron Halfaker" <ahalfa...@wikimedia.org> wrote:
>> > 1. It turns out that generating diffs is computationally complex, so generating them in real time is slow and lame. I'm working to generate all diffs historically using Hadoop and then have a live system listening to recent changes to keep the data up-to-date[2].
>>
>> IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for all enwiki diffs for all time. (don't remember if this is namespace limited) But also using an extraordinary amount of RAM, i.e. hundreds of GB.
>>
>> AIUI, there's no dynamic memory allocation. Revisions are loaded into fixed-size buffers larger than the largest revision.
>>
>> https://github.com/makoshark/wikiq
>>
>> -Jeremy
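The hashing strategy Fabian describes above can be sketched roughly like this. This is not the wikiwho implementation (see [2] for that), just a toy illustration using Python's hashlib and difflib; splitting on blank lines is an arbitrary choice:

import hashlib
from difflib import SequenceMatcher

def chunks(text):
    """Split a revision into coarse chunks; here simply on blank lines."""
    return [c for c in text.split('\n\n') if c]

def digest(chunk):
    return hashlib.md5(chunk.encode('utf-8')).hexdigest()

def fast_diff(old_text, new_text):
    """Diff only the stretches whose chunk hashes changed."""
    old_chunks, new_chunks = chunks(old_text), chunks(new_text)
    old_hashes = [digest(c) for c in old_chunks]
    new_hashes = [digest(c) for c in new_chunks]

    ops = []
    # First pass: match whole chunks by hash, which is cheap.
    coarse = SequenceMatcher(None, old_hashes, new_hashes, autojunk=False)
    for tag, i1, i2, j1, j2 in coarse.get_opcodes():
        if tag == 'equal':
            continue  # identical chunks never reach the expensive differ
        # Second pass: character-level diff only on the changed stretches.
        old_part = '\n\n'.join(old_chunks[i1:i2])
        new_part = '\n\n'.join(new_chunks[j1:j2])
        fine = SequenceMatcher(None, old_part, new_part, autojunk=False)
        ops.extend(op for op in fine.get_opcodes() if op[0] != 'equal')
    return ops

For a typical edit only a couple of chunks change, so the expensive character-level matching only runs over a small part of the article text, which is presumably where most of the speed-up comes from.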
>>
>> Cheers,
>> Fabian
>>
>> --
>> Fabian Flöck
>> Research Associate
>> Computational Social Science department @GESIS
>> Unter Sachsenhausen 6-8, 50667 Cologne, Germany
>> Tel: +49 (0) 221-47694-208
>> fabian.flo...@gesis.org
>>
>> www.gesis.org
>> www.facebook.com/gesis.org

--
http://mitar.tnode.com/
https://twitter.com/mitar_m

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l