Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-15 Thread Flöck , Fabian
If anyone is interested in a faster processing of revision differences, you 
could also adapt the strategy we implemented for wikiwho [1], which is keeping 
track of bigger unchanged text chunks with hashes and just diffing the 
remaining text (usually a relatively small part oft the article). We 
specifically introduced that technique because diffing all the text was too 
expensive. And in principle, it can produce the same output, although we 
currently use it for authorship detection, which is a slightly different task.  
Anyway, it is on average 100 times faster than pure traditional diffing. 
Maybe that is useful for someone. Code is available at github [2].

[1] http://f-squared.org/wikiwho
[2] https://github.com/maribelacosta/wikiwho


On 14.12.2014, at 07:23, Jeremy Baron 
jer...@tuxmachine.commailto:jer...@tuxmachine.com wrote:


On Dec 13, 2014 12:33 PM, Aaron Halfaker 
ahalfa...@wikimedia.orgmailto:ahalfa...@wikimedia.org wrote:
 1. It turns out that generating diffs is computationally complex, so 
 generating them in real time is slow and lame.  I'm working to generate all 
 diffs historically using Hadoop and then have a live system listening to 
 recent changes to keep the data up-to-date[2].

IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for all 
enwiki diffs for all time. (don't remember if this is namespace limited) But 
also using an extraordinary amount of RAM. i.e. hundreds of GB

AIUI, there's no dynamic memory allocation. revisions are loaded into 
fixed-size buffers larger than the largest revision.

https://github.com/makoshark/wikiq

-Jeremy

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.orgmailto:Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l




Cheers,
Fabian

--
Fabian Flöck
Research Associate
Computational Social Science department @GESIS
Unter Sachsenhausen 6-8, 50667 Cologne, Germany
Tel: + 49 (0) 221-47694-208
fabian.flo...@gesis.orgmailto:fabian.flo...@gesis.org

www.gesis.org
www.facebook.com/gesis.org





___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-15 Thread Maximilian Klein
All,
Thanks for the great responses. It seems like, Andrew, Ed, DataSift, and
Mitar are now all offering overlapping solutions to the real-time diff
monitoring problem. The one thing I take away from that is that if the API
is robust enough to serve these 4 clients in real time, then adding another
is a drop in the bucket.

However, as others like Yuvi pointed out, and Aaron has prototyped we could
make this better, by serving an augmented RCstream. I wonder how easy it
would be to allow community development on that project since it seems that
it would require access to the full databases, which only WMF developers
seem to have access to at the moment.

Make a great day,
Max Klein ‽ http://notconfusing.com/

On Mon, Dec 15, 2014 at 5:09 AM, Flöck, Fabian fabian.flo...@gesis.org
wrote:

  If anyone is interested in a faster processing of revision differences,
 you could also adapt the strategy we implemented for wikiwho [1], which is
 keeping track of bigger unchanged text chunks with hashes and just diffing
 the remaining text (usually a relatively small part oft the article). We
 specifically introduced that technique because diffing all the text was too
 expensive. And in principle, it can produce the same output, although we
 currently use it for authorship detection, which is a slightly different
 task.  Anyway, it is on average 100 times faster than pure traditional
 diffing. Maybe that is useful for someone. Code is available at github [2].

  [1] http://f-squared.org/wikiwho
 [2] https://github.com/maribelacosta/wikiwho


  On 14.12.2014, at 07:23, Jeremy Baron jer...@tuxmachine.com wrote:

  On Dec 13, 2014 12:33 PM, Aaron Halfaker ahalfa...@wikimedia.org
 wrote:
  1. It turns out that generating diffs is computationally complex, so
 generating them in real time is slow and lame.  I'm working to generate all
 diffs historically using Hadoop and then have a live system listening to
 recent changes to keep the data up-to-date[2].

 IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for
 all enwiki diffs for all time. (don't remember if this is namespace
 limited) But also using an extraordinary amount of RAM. i.e. hundreds of GB

 AIUI, there's no dynamic memory allocation. revisions are loaded into
 fixed-size buffers larger than the largest revision.

 https://github.com/makoshark/wikiq

 -Jeremy
 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l





 Cheers,
 Fabian

 --
 Fabian Flöck
  Research Associate
 Computational Social Science department @GESIS
 Unter Sachsenhausen 6-8, 50667 Cologne, Germany
 Tel: + 49 (0) 221-47694-208
 fabian.flo...@gesis.org

 www.gesis.org
 www.facebook.com/gesis.org






 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-15 Thread Mitar
Hi!

What more do you have in mind that could be in augmented stream than
the current RCstream data + difffs as they are provided by the API?


Mitar

On Mon, Dec 15, 2014 at 10:22 PM, Maximilian Klein isa...@gmail.com wrote:
 All,
 Thanks for the great responses. It seems like, Andrew, Ed, DataSift, and
 Mitar are now all offering overlapping solutions to the real-time diff
 monitoring problem. The one thing I take away from that is that if the API
 is robust enough to serve these 4 clients in real time, then adding another
 is a drop in the bucket.

 However, as others like Yuvi pointed out, and Aaron has prototyped we could
 make this better, by serving an augmented RCstream. I wonder how easy it
 would be to allow community development on that project since it seems that
 it would require access to the full databases, which only WMF developers
 seem to have access to at the moment.

 Make a great day,
 Max Klein ‽ http://notconfusing.com/

 On Mon, Dec 15, 2014 at 5:09 AM, Flöck, Fabian fabian.flo...@gesis.org
 wrote:

 If anyone is interested in a faster processing of revision differences,
 you could also adapt the strategy we implemented for wikiwho [1], which is
 keeping track of bigger unchanged text chunks with hashes and just diffing
 the remaining text (usually a relatively small part oft the article). We
 specifically introduced that technique because diffing all the text was too
 expensive. And in principle, it can produce the same output, although we
 currently use it for authorship detection, which is a slightly different
 task.  Anyway, it is on average 100 times faster than pure traditional
 diffing. Maybe that is useful for someone. Code is available at github [2].

 [1] http://f-squared.org/wikiwho
 [2] https://github.com/maribelacosta/wikiwho


 On 14.12.2014, at 07:23, Jeremy Baron jer...@tuxmachine.com wrote:

 On Dec 13, 2014 12:33 PM, Aaron Halfaker ahalfa...@wikimedia.org
 wrote:
  1. It turns out that generating diffs is computationally complex, so
  generating them in real time is slow and lame.  I'm working to generate all
  diffs historically using Hadoop and then have a live system listening to
  recent changes to keep the data up-to-date[2].

 IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for
 all enwiki diffs for all time. (don't remember if this is namespace limited)
 But also using an extraordinary amount of RAM. i.e. hundreds of GB

 AIUI, there's no dynamic memory allocation. revisions are loaded into
 fixed-size buffers larger than the largest revision.

 https://github.com/makoshark/wikiq

 -Jeremy

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l





 Cheers,
 Fabian

 --
 Fabian Flöck
 Research Associate
 Computational Social Science department @GESIS
 Unter Sachsenhausen 6-8, 50667 Cologne, Germany
 Tel: + 49 (0) 221-47694-208
 fabian.flo...@gesis.org

 www.gesis.org
 www.facebook.com/gesis.org






 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l




-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-13 Thread Yuvi Panda
On Sat, Dec 13, 2014 at 2:34 PM, Yuvi Panda yuvipa...@gmail.com wrote:
 If a lot of people are doing this, then perhaps it makes sense to have
 an 'augmented real time streaming' interface that is an exact replica
 of the streaming interface but with diffs added.

Or rather, if I were to build such a thing, would people be interested
in using it?

-- 
Yuvi Panda T
http://yuvi.in/blog

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-13 Thread Scott Hale
Great idea, Yuvi. Speaking as someone who just downloaded diffs for a month
of data from the streaming API for a research project, I certainly could
see an 'augmented stream' with diffs included being very useful for
research and also for bots.


On Sat, Dec 13, 2014 at 10:52 PM, Yuvi Panda yuvipa...@gmail.com wrote:

 On Sat, Dec 13, 2014 at 2:34 PM, Yuvi Panda yuvipa...@gmail.com wrote:
  If a lot of people are doing this, then perhaps it makes sense to have
  an 'augmented real time streaming' interface that is an exact replica
  of the streaming interface but with diffs added.

 Or rather, if I were to build such a thing, would people be interested
 in using it?

 --
 Yuvi Panda T
 http://yuvi.in/blog

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



-- 
Scott Hale
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.h...@oii.ox.ac.uk
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-13 Thread Oliver Keyes
Oh dear god, that would be incredible.

The non-streaming API has a wonderful bug: if you request a series of
diffs, and there are 1 uncached diffs in that series, only the first
uncached diff will be returned. For the rest it returns...an error? No.
Some kind of special value? No. It returns an empty string. You know: that
thing it also returns if there is no difference . So instead you stream
edits and compute the diffs yourself and everything goes a bit Pete Tong.
Having this service around would be a lifesaver.

On 13 December 2014 at 10:14, Scott Hale computermacgy...@gmail.com wrote:

 Great idea, Yuvi. Speaking as someone who just downloaded diffs for a
 month of data from the streaming API for a research project, I certainly
 could see an 'augmented stream' with diffs included being very useful for
 research and also for bots.


 On Sat, Dec 13, 2014 at 10:52 PM, Yuvi Panda yuvipa...@gmail.com wrote:

 On Sat, Dec 13, 2014 at 2:34 PM, Yuvi Panda yuvipa...@gmail.com wrote:
  If a lot of people are doing this, then perhaps it makes sense to have
  an 'augmented real time streaming' interface that is an exact replica
  of the streaming interface but with diffs added.

 Or rather, if I were to build such a thing, would people be interested
 in using it?

 --
 Yuvi Panda T
 http://yuvi.in/blog

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



 --
 Scott Hale
 Oxford Internet Institute
 University of Oxford
 http://www.scotthale.net/
 scott.h...@oii.ox.ac.uk

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-13 Thread Ed Summers
+1 Yuvi

About a year ago I put together a little program that identified .uk external 
links in Wikipedia’s changes for the web archiving folks at the British 
Library. Because it needed to fetch the diff for each change I never pushed it 
very far, out of concerns for the API traffic. I never asked though, so good on 
Max for bringing it up.

Rather than setting up an additional stream endpoint I wonder if it might be 
feasible to add a query parameter to the existing one? So, something like:

http://stream.wikimedia.org/rc?diff=true

//Ed


signature.asc
Description: Message signed with OpenPGP using GPGMail
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-13 Thread Aaron Halfaker
Hey folks,

I've been working on building up a revision diffs service that you'd be
able to listen to or download a dump of revision diffs.

See https://github.com/halfak/Difference-Engine for my progress on the live
system and https://github.com/halfak/MediaWiki-Streaming for my progress
developing a Hadoop Streaming primer to generate old diffs[1].  See also
https://github.com/halfak/Deltas for some experimental diff algorithms
developed specifically to track content moves in Wikipedia revisions.

In the short term, I can share diff datasets.  In the near-term, I'm
wondering if you folks would be interested in working on the project with
me.  If so, let me know and I'll give you a more complete status update.

1. It turns out that generating diffs is computationally complex, so
generating them in real time is slow and lame.  I'm working to generate all
diffs historically using Hadoop and then have a live system listening to
recent changes to keep the data up-to-date[2].
2. https://github.com/halfak/MediaWiki-events

-Aaron

On Sat, Dec 13, 2014 at 9:16 AM, Ed Summers e...@pobox.com wrote:

 +1 Yuvi

 About a year ago I put together a little program that identified .uk
 external links in Wikipedia’s changes for the web archiving folks at the
 British Library. Because it needed to fetch the diff for each change I
 never pushed it very far, out of concerns for the API traffic. I never
 asked though, so good on Max for bringing it up.

 Rather than setting up an additional stream endpoint I wonder if it might
 be feasible to add a query parameter to the existing one? So, something
 like:

 http://stream.wikimedia.org/rc?diff=true

 //Ed

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-13 Thread Mitar
Hi!

I made a a Meteor DDP API to the stream of recent changes on all
WikiMedia wikis. Now you can simply use DDP.connect on in your Meteor
application to connect to stream of changes on Wikipedia. You can use
MongoDB queries to limit only to those changes you are interested in.
If there is interest, I could add also full diffs support and then you
could try to hit this API. We could probably also eventually host it
on Wikimedia Labs.

http://wikimedia.meteor.com/


Mitar

On Fri, Dec 12, 2014 at 11:53 PM, Maximilian Klein isa...@gmail.com wrote:
 Hello Researchers,

 I've been playing with Recent Changes Stream Interface recently, and have
 started trying to use the API's action=compare to look at every diff of
 every wiki in real time. The goal is to produce real-time analytics on the
 content that's being added or deleted. The only problem is that is will
 really hammer the API with lots of reads since it doesn't have a batch
 interface. Can I spawn multiple network threads and do 10+ reads per second
 forever without the API complaining? Can I warn someone about this and get a
 special exemption for research purposes?

 The other thing to do would be to use action=query to get the revisions in
 batches and do the diffing myself, but then i'm not guaranteed to be diffing
 in the same way that the site is.

 What techniques would you recommend?


 Make a great day,
 Max Klein ‽ http://notconfusing.com/

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l




-- 
http://mitar.tnode.com/
https://twitter.com/mitar_m

___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-13 Thread Jeremy Baron
On Dec 13, 2014 12:33 PM, Aaron Halfaker ahalfa...@wikimedia.org wrote:
 1. It turns out that generating diffs is computationally complex, so
generating them in real time is slow and lame.  I'm working to generate all
diffs historically using Hadoop and then have a live system listening to
recent changes to keep the data up-to-date[2].

IIRC Mako does that in ~4 hours (maybe outdated and takes longer now) for
all enwiki diffs for all time. (don't remember if this is namespace
limited) But also using an extraordinary amount of RAM. i.e. hundreds of GB

AIUI, there's no dynamic memory allocation. revisions are loaded into
fixed-size buffers larger than the largest revision.

https://github.com/makoshark/wikiq

-Jeremy
___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: [Wiki-research-l] How to track all the diffs in real time?

2014-12-12 Thread Toby Negrin
Hi Max -- let me ping the API folks. I don't think we researchers can make
the final call on this.

-Toby

On Fri, Dec 12, 2014 at 2:53 PM, Maximilian Klein isa...@gmail.com wrote:

 Hello Researchers,

 I've been playing with Recent Changes Stream Interface
 https://wikitech.wikimedia.org/wiki/RCStream recently, and have started
 trying to use the API's *action=compare* to look at every diff of every
 wiki in real time. The goal is to produce real-time analytics on the
 content that's being added or deleted. The only problem is that is will
 really hammer the API with lots of reads since it doesn't have a batch
 interface. Can I spawn multiple network threads and do 10+ reads per second
 forever without the API complaining? Can I warn someone about this and get
 a special exemption for research purposes?

 The other thing to do would be to use *action=query* to get the
 revisions in batches and do the diffing myself, but then i'm not guaranteed
 to be diffing in the same way that the site is.

 What techniques would you recommend?


 Make a great day,
 Max Klein ‽ http://notconfusing.com/

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


___
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l