Hi! What more do you have in mind that could be in an "augmented stream" than the current RCstream data + diffs as they are provided by the API?
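For concreteness, here is roughly the baseline I have in mind: listen on RCstream and fetch each edit's diff from the API with action=compare. A minimal Python sketch, assuming the socketIO_client and requests packages; treat it as an illustration, not production code:

import requests
import socketIO_client

API = 'https://en.wikipedia.org/w/api.php'

def fetch_diff(old_revid, new_revid):
    """Ask the API for the diff between two revisions (action=compare)."""
    r = requests.get(API, params={
        'action': 'compare',
        'fromrev': old_revid,
        'torev': new_revid,
        'format': 'json',
    })
    return r.json().get('compare', {}).get('*')  # HTML diff table

class RCNamespace(socketIO_client.BaseNamespace):
    def on_connect(self):
        self.emit('subscribe', 'en.wikipedia.org')

    def on_change(self, change):
        # Ordinary edits carry the old and new revision ids in the change event.
        if change.get('type') == 'edit':
            diff = fetch_diff(change['revision']['old'], change['revision']['new'])
            print(change['title'], len(diff or ''))

socketIO = socketIO_client.SocketIO('stream.wikimedia.org', 80)
socketIO.define(RCNamespace, '/rc')
socketIO.wait()

Everything in that sketch is already possible with the public endpoints, so the question is what the augmented stream would add on top of it.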
Mitar

On Mon, Dec 15, 2014 at 10:22 PM, Maximilian Klein <isa...@gmail.com> wrote:
> All,
> Thanks for the great responses. It seems like Andrew, Ed, DataSift, and Mitar are now all offering overlapping solutions to the real-time diff monitoring problem. The one thing I take away from that is that if the API is robust enough to serve these 4 clients in real time, then adding another is a drop in the bucket.
>
> However, as others like Yuvi pointed out, and Aaron has prototyped, we could make this better by serving an augmented RCstream. I wonder how easy it would be to allow community development on that project, since it seems that it would require access to the full databases, which only WMF developers seem to have access to at the moment.
>
> Make a great day,
> Max Klein ‽ http://notconfusing.com/
>
> On Mon, Dec 15, 2014 at 5:09 AM, Flöck, Fabian <fabian.flo...@gesis.org> wrote:
>>
>> If anyone is interested in faster processing of revision differences, you could also adapt the strategy we implemented for wikiwho [1], which is keeping track of bigger unchanged text chunks with hashes and just diffing the remaining text (usually a relatively small part of the article). We specifically introduced that technique because diffing all the text was too expensive. And in principle it can produce the same output, although we currently use it for authorship detection, which is a slightly different task. Anyway, it is on average >100 times faster than pure "traditional" diffing. Maybe that is useful for someone. Code is available at github [2].
>>
>> [1] http://f-squared.org/wikiwho
>> [2] https://github.com/maribelacosta/wikiwho
>>
>>
>> On 14.12.2014, at 07:23, Jeremy Baron <jer...@tuxmachine.com> wrote:
>>
>> On Dec 13, 2014 12:33 PM, "Aaron Halfaker" <ahalfa...@wikimedia.org> wrote:
>> > 1. It turns out that generating diffs is computationally complex, so generating them in real time is slow and lame. I'm working to generate all diffs historically using Hadoop and then have a live system listening to recent changes to keep the data up-to-date[2].
>>
>> IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for all enwiki diffs for all time. (don't remember if this is namespace limited) But also using an extraordinary amount of RAM, i.e. hundreds of GB.
>>
>> AIUI, there's no dynamic memory allocation. Revisions are loaded into fixed-size buffers larger than the largest revision.
>>
>> https://github.com/makoshark/wikiq
>>
>> -Jeremy
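The hashing strategy Fabian describes above can be sketched roughly like this. This is not the wikiwho implementation (see [2] for that), just a toy illustration using Python's hashlib and difflib; splitting on blank lines is an arbitrary choice:

import hashlib
from difflib import SequenceMatcher

def chunks(text):
    """Split a revision into coarse chunks; here simply on blank lines."""
    return [c for c in text.split('\n\n') if c]

def digest(chunk):
    return hashlib.md5(chunk.encode('utf-8')).hexdigest()

def fast_diff(old_text, new_text):
    """Diff only the stretches whose chunk hashes changed."""
    old_chunks, new_chunks = chunks(old_text), chunks(new_text)
    old_hashes = [digest(c) for c in old_chunks]
    new_hashes = [digest(c) for c in new_chunks]

    ops = []
    # First pass: match whole chunks by hash, which is cheap.
    coarse = SequenceMatcher(None, old_hashes, new_hashes, autojunk=False)
    for tag, i1, i2, j1, j2 in coarse.get_opcodes():
        if tag == 'equal':
            continue  # identical chunks never reach the expensive differ
        # Second pass: character-level diff only on the changed stretches.
        old_part = '\n\n'.join(old_chunks[i1:i2])
        new_part = '\n\n'.join(new_chunks[j1:j2])
        fine = SequenceMatcher(None, old_part, new_part, autojunk=False)
        ops.extend(op for op in fine.get_opcodes() if op[0] != 'equal')
    return ops

For a typical edit only a couple of chunks change, so the expensive character-level matching only runs over a small part of the article text, which is presumably where most of the speed-up comes from.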
>>
>> Cheers,
>> Fabian
>>
>> --
>> Fabian Flöck
>> Research Associate
>> Computational Social Science department @GESIS
>> Unter Sachsenhausen 6-8, 50667 Cologne, Germany
>> Tel: +49 (0) 221-47694-208
>> fabian.flo...@gesis.org
>>
>> www.gesis.org
>> www.facebook.com/gesis.org

--
http://mitar.tnode.com/
https://twitter.com/mitar_m

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l