Re: [Wiki-research-l] How to track all the diffs in real time?
If anyone is interested in faster processing of revision differences, you could also adapt the strategy we implemented for wikiwho [1]: keep track of bigger unchanged text chunks with hashes and diff only the remaining text (usually a relatively small part of the article). We introduced that technique specifically because diffing all the text was too expensive. In principle it can produce the same output, although we currently use it for authorship detection, which is a slightly different task. Anyway, it is on average 100 times faster than pure traditional diffing. Maybe that is useful for someone. Code is available on GitHub [2].

[1] http://f-squared.org/wikiwho
[2] https://github.com/maribelacosta/wikiwho

On 14.12.2014, at 07:23, Jeremy Baron <jer...@tuxmachine.com> wrote:

> On Dec 13, 2014 12:33 PM, Aaron Halfaker <ahalfa...@wikimedia.org> wrote:
>> 1. It turns out that generating diffs is computationally complex, so generating them in real time is slow and lame. I'm working to generate all diffs historically using Hadoop and then have a live system listening to recent changes to keep the data up-to-date[2].
>
> IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for all enwiki diffs for all time. (Don't remember if this is namespace limited.) But also using an extraordinary amount of RAM, i.e. hundreds of GB. AIUI, there's no dynamic memory allocation; revisions are loaded into fixed-size buffers larger than the largest revision.
> https://github.com/makoshark/wikiq
>
> -Jeremy

Cheers,
Fabian

--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: +49 (0)221-47694-208
fabian.flo...@gesis.org
www.gesis.org
www.facebook.com/gesis.org

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
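Fabian's hash-chunk strategy can be illustrated in a few lines. This is only a simplified sketch of the idea, not wikiwho's actual algorithm; the function names and chunk size are invented for illustration:

```python
import difflib
import hashlib


def chunk_hashes(lines, size):
    """Hash fixed-size runs of lines so unchanged regions can be matched cheaply."""
    return [hashlib.md5("".join(lines[i:i + size]).encode("utf-8")).hexdigest()
            for i in range(0, len(lines), size)]


def fast_diff(old_text, new_text, size=10):
    """Diff only the middle region whose chunk hashes differ.

    Identical chunks at the start and end of the revision pair are
    skipped via hash comparison, and a conventional diff runs on the
    (usually small) remainder.
    """
    old = old_text.splitlines()
    new = new_text.splitlines()
    oh, nh = chunk_hashes(old, size), chunk_hashes(new, size)

    # Skip matching chunks from the front...
    start = 0
    while start < min(len(oh), len(nh)) and oh[start] == nh[start]:
        start += 1
    # ...and from the back, without crossing the front pointer.
    end = 0
    while end < min(len(oh), len(nh)) - start and oh[-1 - end] == nh[-1 - end]:
        end += 1

    return list(difflib.unified_diff(
        old[start * size: len(old) - end * size],
        new[start * size: len(new) - end * size],
        lineterm=""))
```

Because unchanged chunks are compared by hash alone, the expensive line-level diff only touches the region that actually changed, which is where the speedup comes from.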
Re: [Wiki-research-l] How to track all the diffs in real time?
All,

Thanks for the great responses. It seems that Andrew, Ed, DataSift, and Mitar are now all offering overlapping solutions to the real-time diff monitoring problem. The one thing I take away from that is that if the API is robust enough to serve these four clients in real time, then adding another is a drop in the bucket. However, as others like Yuvi pointed out, and as Aaron has prototyped, we could make this better by serving an augmented RCStream. I wonder how easy it would be to allow community development on that project, since it seems it would require access to the full databases, which only WMF developers have at the moment.

Make a great day,

Max Klein ‽ http://notconfusing.com/

On Mon, Dec 15, 2014 at 5:09 AM, Flöck, Fabian <fabian.flo...@gesis.org> wrote:

> If anyone is interested in faster processing of revision differences, you could also adapt the strategy we implemented for wikiwho [1]: keep track of bigger unchanged text chunks with hashes and diff only the remaining text. [...]
Re: [Wiki-research-l] How to track all the diffs in real time?
Hi!

What more do you have in mind that could be in the augmented stream than the current RCStream data plus diffs as they are provided by the API?

Mitar

On Mon, Dec 15, 2014 at 10:22 PM, Maximilian Klein <isa...@gmail.com> wrote:

> All, thanks for the great responses. It seems that Andrew, Ed, DataSift, and Mitar are now all offering overlapping solutions to the real-time diff monitoring problem. [...]
--
http://mitar.tnode.com/
https://twitter.com/mitar_m
Re: [Wiki-research-l] How to track all the diffs in real time?
On Sat, Dec 13, 2014 at 2:34 PM, Yuvi Panda <yuvipa...@gmail.com> wrote:

> If a lot of people are doing this, then perhaps it makes sense to have an 'augmented real time streaming' interface that is an exact replica of the streaming interface but with diffs added.

Or rather: if I were to build such a thing, would people be interested in using it?

--
Yuvi Panda T
http://yuvi.in/blog
Re: [Wiki-research-l] How to track all the diffs in real time?
Great idea, Yuvi. Speaking as someone who just downloaded diffs for a month of data from the streaming API for a research project, I certainly could see an 'augmented stream' with diffs included being very useful for research and also for bots.

On Sat, Dec 13, 2014 at 10:52 PM, Yuvi Panda <yuvipa...@gmail.com> wrote:

> If a lot of people are doing this, then perhaps it makes sense to have an 'augmented real time streaming' interface that is an exact replica of the streaming interface but with diffs added. Or rather, if I were to build such a thing, would people be interested in using it?

--
Scott Hale
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.h...@oii.ox.ac.uk
Re: [Wiki-research-l] How to track all the diffs in real time?
Oh dear god, that would be incredible.

The non-streaming API has a wonderful bug: if you request a series of diffs and more than one diff in that series is uncached, only the first uncached diff will be returned. For the rest it returns... an error? No. Some kind of special value? No. It returns an empty string. You know: that thing it also returns if there is no difference. So instead you stream edits and compute the diffs yourself, and everything goes a bit Pete Tong. Having this service around would be a lifesaver.

On 13 December 2014 at 10:14, Scott Hale <computermacgy...@gmail.com> wrote:

> Great idea, Yuvi. Speaking as someone who just downloaded diffs for a month of data from the streaming API for a research project, I certainly could see an 'augmented stream' with diffs included being very useful for research and also for bots. [...]

--
Oliver Keyes
Research Analyst
Wikimedia Foundation
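Computing the diffs client-side, as Oliver describes, is straightforward with the standard library, and then an empty result genuinely means "no change". A sketch under assumptions: the `revision_text` helper name is invented, and it uses the modern `rvslots` parameter (the 2014-era API returned the content directly under `*` without slots):

```python
import difflib
import json
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"


def revision_text(revid):
    """Fetch the wikitext of one revision via action=query (prop=revisions)."""
    url = (f"{API}?action=query&prop=revisions&revids={revid}"
           f"&rvprop=content&rvslots=main&format=json")
    with urlopen(url) as resp:
        page = next(iter(json.load(resp)["query"]["pages"].values()))
    return page["revisions"][0]["slots"]["main"]["*"]


def local_diff(old_text, new_text):
    """Deterministic client-side diff: an empty string really means no change."""
    return "\n".join(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""))
```

The trade-off Max raised still applies: this is a plain line diff, not necessarily the same output the site's diff engine renders.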
Re: [Wiki-research-l] How to track all the diffs in real time?
+1 Yuvi

About a year ago I put together a little program that identified .uk external links in Wikipedia's changes for the web archiving folks at the British Library. Because it needed to fetch the diff for each change, I never pushed it very far, out of concern for the API traffic. I never asked though, so good on Max for bringing it up.

Rather than setting up an additional stream endpoint, I wonder if it might be feasible to add a query parameter to the existing one? So, something like:

http://stream.wikimedia.org/rc?diff=true

//Ed
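Extracting .uk external links from diffs, as in Ed's program, can be sketched over unified-diff text. The regex and helper below are illustrative guesses, not code from that program:

```python
import re

# Rough pattern for .uk URLs in wikitext; stops at whitespace, ']' and '|',
# which commonly delimit external links in wiki markup.
UK_LINK = re.compile(r"https?://[^\s\]|]*\.uk(?:/[^\s\]|]*)?", re.IGNORECASE)


def added_uk_links(unified_diff_text):
    """Collect .uk external links appearing on added ('+') lines of a diff."""
    links = []
    for line in unified_diff_text.splitlines():
        # Skip the '+++' file header; keep genuine added lines.
        if line.startswith("+") and not line.startswith("+++"):
            links.extend(UK_LINK.findall(line))
    return links
```

Running this over a per-edit diff stream is exactly the workload that makes a `?diff=true` parameter attractive: one stream connection instead of one API fetch per change.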
Re: [Wiki-research-l] How to track all the diffs in real time?
Hey folks,

I've been working on building up a revision diffs service from which you'd be able to listen to, or download a dump of, revision diffs. See https://github.com/halfak/Difference-Engine for my progress on the live system and https://github.com/halfak/MediaWiki-Streaming for my progress developing a Hadoop Streaming primer to generate old diffs[1]. See also https://github.com/halfak/Deltas for some experimental diff algorithms developed specifically to track content moves in Wikipedia revisions.

In the short term, I can share diff datasets. In the near term, I'm wondering if you folks would be interested in working on the project with me. If so, let me know and I'll give you a more complete status update.

1. It turns out that generating diffs is computationally complex, so generating them in real time is slow and lame. I'm working to generate all diffs historically using Hadoop and then have a live system listening to recent changes to keep the data up-to-date[2].
2. https://github.com/halfak/MediaWiki-events

-Aaron

On Sat, Dec 13, 2014 at 9:16 AM, Ed Summers <e...@pobox.com> wrote:

> +1 Yuvi. About a year ago I put together a little program that identified .uk external links in Wikipedia's changes for the web archiving folks at the British Library. [...]
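The historical half of a pipeline like Aaron's boils down to a Hadoop-Streaming-style reducer that diffs each revision against its predecessor on the same page. The sketch below is hypothetical (not code from MediaWiki-Streaming) and assumes records arrive one per line, sorted by page then revision order, with newlines in the text escaped as literal `\n`:

```python
import difflib


def diff_stream(lines):
    """Reducer sketch: input lines are 'page_id<TAB>rev_id<TAB>text',
    sorted by page then revision order. Yields (page_id, rev_id, diff),
    where each revision is diffed against its predecessor on the same
    page; the first revision of a page is diffed against empty text.
    A production job would need a real escaping scheme for the text
    field; here newlines are assumed encoded as literal '\\n'."""
    prev_page, prev_text = None, ""
    for line in lines:
        page_id, rev_id, raw = line.rstrip("\n").split("\t", 2)
        text = raw.replace("\\n", "\n")
        base = prev_text if page_id == prev_page else ""
        diff = "\n".join(difflib.unified_diff(
            base.splitlines(), text.splitlines(), lineterm=""))
        yield page_id, rev_id, diff
        prev_page, prev_text = page_id, text
```

Because each reducer only ever needs the previous revision's text in memory, this avoids the hundreds-of-GB fixed-buffer approach Jeremy mentions for wikiq.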
Re: [Wiki-research-l] How to track all the diffs in real time?
Hi!

I made a Meteor DDP API to the stream of recent changes on all Wikimedia wikis. Now you can simply use DDP.connect in your Meteor application to connect to the stream of changes on Wikipedia. You can use MongoDB queries to limit it to only those changes you are interested in. If there is interest, I could also add full diffs support, and then you could try to hit this API. We could probably also eventually host it on Wikimedia Labs.

http://wikimedia.meteor.com/

Mitar

On Fri, Dec 12, 2014 at 11:53 PM, Maximilian Klein <isa...@gmail.com> wrote:

> Hello Researchers,
>
> I've been playing with the Recent Changes Stream Interface recently, and have started trying to use the API's action=compare to look at every diff of every wiki in real time. The goal is to produce real-time analytics on the content that's being added or deleted. The only problem is that it will really hammer the API with lots of reads, since it doesn't have a batch interface. Can I spawn multiple network threads and do 10+ reads per second forever without the API complaining? Can I warn someone about this and get a special exemption for research purposes? The other thing to do would be to use action=query to get the revisions in batches and do the diffing myself, but then I'm not guaranteed to be diffing in the same way that the site is. What techniques would you recommend?
>
> Make a great day,
>
> Max Klein ‽ http://notconfusing.com/

--
http://mitar.tnode.com/
https://twitter.com/mitar_m
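The MongoDB-style filtering Mitar mentions amounts to matching event fields against a query document. A toy matcher covering only the exact-equality subset of Mongo's query language (nothing Meteor- or DDP-specific) might look like:

```python
def matches(event, query):
    """Check a recent-changes event (a dict) against a flat, MongoDB-style
    equality query such as {"wiki": "enwiki", "namespace": 0}.
    Operators like $in or $gt are not modeled in this sketch."""
    return all(event.get(field) == wanted for field, wanted in query.items())
```

Filtering server-side like this is the main win over raw RCStream: clients only receive the changes their query selects.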
Re: [Wiki-research-l] How to track all the diffs in real time?
On Dec 13, 2014 12:33 PM, Aaron Halfaker <ahalfa...@wikimedia.org> wrote:

> 1. It turns out that generating diffs is computationally complex, so generating them in real time is slow and lame. I'm working to generate all diffs historically using Hadoop and then have a live system listening to recent changes to keep the data up-to-date[2].

IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for all enwiki diffs for all time. (Don't remember if this is namespace limited.) But also using an extraordinary amount of RAM, i.e. hundreds of GB. AIUI, there's no dynamic memory allocation; revisions are loaded into fixed-size buffers larger than the largest revision.

https://github.com/makoshark/wikiq

-Jeremy
Re: [Wiki-research-l] How to track all the diffs in real time?
Hi Max -- let me ping the API folks. I don't think we researchers can make the final call on this.

-Toby

On Fri, Dec 12, 2014 at 2:53 PM, Maximilian Klein <isa...@gmail.com> wrote:

> Hello Researchers,
>
> I've been playing with the Recent Changes Stream Interface (https://wikitech.wikimedia.org/wiki/RCStream) recently, and have started trying to use the API's *action=compare* to look at every diff of every wiki in real time. [...]
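Until the API folks weigh in, a client-side throttle is a cheap way to keep Max's "10+ reads per second" bounded at whatever rate is agreed; a minimal sketch with an invented class name:

```python
import time


class Throttle:
    """Client-side rate limiter: at most `rate` calls per second, so a
    pool of worker threads stays under an agreed API request ceiling."""

    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.last = float("-inf")  # first call goes through immediately

    def wait(self):
        """Sleep just long enough to keep calls `interval` seconds apart."""
        delay = self.last + self.interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()
```

A worker would call `wait()` before each `action=compare` request; sharing one `Throttle` (behind a lock) across threads caps the aggregate rate rather than the per-thread rate.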