Re: [Wikitech-l] Project Idea for GSoC 2013 - Bayesian Spam Filter

2013-04-23 Thread anubhav agarwal
Hey Quim,

I have drafted my proposal on my User page:
https://www.mediawiki.org/wiki/User:Anubhav_iitr
I have already opened a bug in Bugzilla for the extension request. Here is
the link:
https://bugzilla.wikimedia.org/show_bug.cgi?id=47207


I will be glad to have your feedback.
Can you suggest whom I should ask to mentor me?


On Mon, Apr 15, 2013 at 10:50 PM, Quim Gil q...@wikimedia.org wrote:

 On 04/14/2013 06:34 AM, anubhav agarwal wrote:

 Hey Quim,

 Thanks for such a detailed response. Sorry for being inactive for these
 few days, I was undergoing some coursework evaluations.


 I hope they went well. First things first!

 You have some homework to do here as well. It is time to start drafting
 your application, open a related feature request in Bugzilla and find a
 mentor. See

 https://www.mediawiki.org/wiki/Mentorship_programs/Application_template


 --
 Quim Gil
 Technical Contributor Coordinator @ Wikimedia Foundation
 http://www.mediawiki.org/wiki/User:Qgil





-- 
Cheers,
Anubhav


Anubhav Agarwal | 4th Year | Computer Science & Engineering | IIT Roorkee
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Project Idea for GSoC 2013 - Bayesian Spam Filter

2013-04-23 Thread Quim Gil

On 04/23/2013 05:42 AM, anubhav agarwal wrote:

Hey Quim,

I have drafted my proposal on my User page:
https://www.mediawiki.org/wiki/User:Anubhav_iitr
I have already opened a bug in Bugzilla for the extension request. Here is
the link:
https://bugzilla.wikimedia.org/show_bug.cgi?id=47207


I will be glad to have your feedback.
Can you suggest whom I should ask to mentor me?


Chris is willing to co-mentor, but not alone. I asked another potential 
co-mentor but we are still waiting for his answer. Anybody interested? 
MediaWiki extension development skills required.


In any case, please apply to GSoC formally. You don't need to have the 
mentors assigned to do this and you can keep improving your proposal 
until the deadline.


--
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil


Re: [Wikitech-l] Project Idea for GSoC 2013 - Bayesian Spam Filter

2013-04-15 Thread Platonides
On 14/04/13 15:41, anubhav agarwal wrote:
 I don't think we could take rollbacks into account for automated learning.
 A person who rolls back an edit has not necessarily done so because the
 edit was spam.

Getting the right data to train from is hard, since wiki is so flexible.
The good point of rollback is that a) it's easy to detect, b) it's
restricted (a random user can't use it) and c) on some wikis policy
restricts its use to “clearly bad edits”.

So you _should_ be training with unwanted edits. But there will be
false positives.
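
To make the trade-off concrete, here is a minimal sketch of a naive Bayes
classifier trained on rollback labels. It is an illustration only, not the
proposed extension: the class name, the labelling rule ("rolled back" means
"spam") and the sample edits are all assumptions, and it is written in Python
rather than as MediaWiki code. One training edit is deliberately mislabelled
to mimic a false positive in the training data.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayesEditFilter:
    """Multinomial naive Bayes over the text added by an edit (hypothetical)."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "ham": Counter()}
        self.edit_counts = {"spam": 0, "ham": 0}

    def train(self, added_text, rolled_back):
        # Assumed labelling rule: a rolled-back edit is treated as spam.
        label = "spam" if rolled_back else "ham"
        self.token_counts[label].update(tokenize(added_text))
        self.edit_counts[label] += 1

    def spam_probability(self, added_text):
        vocabulary = set(self.token_counts["spam"]) | set(self.token_counts["ham"])
        total_edits = sum(self.edit_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior, so an untrained class does not give log(0).
            log_score = math.log((self.edit_counts[label] + 1) / (total_edits + 2))
            total_tokens = sum(self.token_counts[label].values())
            for token in tokenize(added_text):
                count = self.token_counts[label][token]
                # Add-one smoothing; the extra +1 leaves room for unseen tokens.
                log_score += math.log((count + 1) / (total_tokens + len(vocabulary) + 1))
            log_scores[label] = log_score
        # Turn the two log scores into P(spam | tokens) without underflow.
        highest = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - highest)
        ham = math.exp(log_scores["ham"] - highest)
        return spam / (spam + ham)


edit_filter = NaiveBayesEditFilter()
edit_filter.train("buy cheap pills online now", rolled_back=True)
edit_filter.train("cheap replica watches buy now", rolled_back=True)
edit_filter.train("added a citation for the population figure", rolled_back=False)
edit_filter.train("fixed a typo in the history section", rolled_back=False)
# Noisy label: a good-faith edit that was rolled back for another reason.
edit_filter.train("updated the history section", rolled_back=True)

print(edit_filter.spam_probability("buy cheap watches online"))
print(edit_filter.spam_probability("fixed citation in the population section"))
```

In this toy data set the single mislabelled edit does not flip either
prediction, but how well that holds up would depend on how noisy the rollback
signal is on a real wiki.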



 Though a "Train as spam" checkbox is a good idea. I was thinking about a
 "Report spam" button next to the edit button at the top-right corner
 of a section.

However, that only tells you that somewhere in the page there is spam,
not what the spam is (the last revision? an edit from 2 months ago?), nor
does it encourage anyone to fix it.


 I was thinking of creating a job queue for big websites like Wikipedia:
 each edit would go into a queue that is processed offline, and the page
 would later be rolled back to the original content if the edit triggers
 the alarm.

I'm not a big fan of this. You will have edit conflicts to handle, and
it looks messy to have reverts made by an extension. I recommend that you
work on the Bayesian detection of spam, and leave the potential refactoring
to make it run through the job queue for later.

I think I could look in the archives of deleted pages from the WM-ES
wiki for spam data for you.


Re: [Wikitech-l] Project Idea for GSoC 2013 - Bayesian Spam Filter

2013-04-15 Thread Quim Gil

On 04/14/2013 06:34 AM, anubhav agarwal wrote:

Hey Quim,

Thanks for such a detailed response. Sorry for being inactive for these few
days, I was undergoing some coursework evaluations.


I hope they went well. First things first!

You have some homework to do here as well. It is time to start drafting 
your application, open a related feature request in Bugzilla and find a 
mentor. See


https://www.mediawiki.org/wiki/Mentorship_programs/Application_template

--
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil


Re: [Wikitech-l] Project Idea for GSoC 2013 - Bayesian Spam Filter

2013-04-14 Thread anubhav agarwal
Hey Quim,

Thanks for such a detailed response. Sorry for being inactive for these few
days, I was undergoing some coursework evaluations.

On Tue, Apr 9, 2013 at 9:50 PM, Quim Gil q...@wikimedia.org wrote:

 Hi Anubhav,


 On 04/07/2013 06:05 PM, anubhav agarwal wrote:

 Hi,

 I am Anubhav Agarwal, a B.Tech 4th Year student at IIT Roorkee. I wish to
 apply for GSoC 2013 and I am thinking about Bayesian Spam Filter as a
 project for the same. I have drafted the idea on my talk page:
 http://www.mediawiki.org/wiki/User:Anubhav_iitr


 I have done a first reality check with Chris Steipp, who oversees the area
 of security and also spam prevention. Your idea is interesting and it seems
 to be feasible. This is a very good first step!

 It would require adding a hook to MediaWiki core, but this could be a
 small, acceptable change. The rest could be developed as an extension of
 the ConfirmEdit extension.

 It might have a performance penalty in a site like English Wikipedia with
 plenty of concurrent edits, but for starters it could be potentially useful
 to the 99% of MediaWiki instances that have a significantly smaller number
 of daily edits and especially a very small number of editors and tools able
 / happy to deal with spam.


I was thinking of creating a job queue for big websites like Wikipedia:
each edit would go into a queue that is processed offline, and the page
would later be rolled back to the original content if the edit triggers
the alarm.



 As a next step, please

 1. Create a subpage for your proposal e.g.
 http://www.mediawiki.org/wiki/User:Anubhav_iitr/Bayesan_spam_filter

 2. File an enhancement request at
 https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions
 under Extensions requests explaining your proposal and linking to the
 related wiki page.

 3. Reply to this thread sharing the link to the bug report so anybody
 interested can watch it.


Here is the link to the bug, as you asked:
https://bugzilla.wikimedia.org/show_bug.cgi?id=47207




  I request you to go through this and give your suggestions on it.


 Yes, but you will get more feedback if you are diligent in answering the
 feedback received:

 http://www.mediawiki.org/wiki/User_talk:Anubhav_iitr :)


 --
 Quim Gil
 Technical Contributor Coordinator @ Wikimedia Foundation
 http://www.mediawiki.org/wiki/User:Qgil





-- 
Cheers,
Anubhav


Anubhav Agarwal | 4th Year | Computer Science & Engineering | IIT Roorkee

Re: [Wikitech-l] Project Idea for GSoC 2013 - Bayesian Spam Filter

2013-04-14 Thread anubhav agarwal
Hi Platonides,

On Sat, Apr 13, 2013 at 4:04 AM, Platonides platoni...@gmail.com wrote:

 On 09/04/13 18:20, Quim Gil wrote:
  Hi Anubhav,
 
  I have done a first reality check with Chris Steipp, who oversees the
  area of security and also spam prevention. Your idea is interesting and
  it seems to be feasible. This is a very good first step!
 
  It would require adding a hook to MediaWiki core, but this could be a
  small, acceptable change.
 I agree. Adding a hook is no problem.

  The rest could be developed as an extension of
  the ConfirmEdit extension.

 I'm not sure on adding it to ConfirmEdit. I would develop it as an
 independent extension, which could then hook into ConfirmEdit or
 AbuseFilter.

 Anubhav wrote:
  Tasks
 
  Create a tool for wiki users to report spam: a simple way to
  train a Bayesian DB. This should be accessible to any user
  with the permissions to undo or roll back those changes or to
  delete the new page/file. Understand the metadata (IP, links,
  user) I can extract from the data (perhaps harnessing other
  services like blacklists).

 I think it would be more interesting if it could be trained
 automatically, perhaps by automatically learning rollbacks as wrong.
 Maybe there could be a checkbox to train as spam when doing a revert,
 but I would avoid anything complex like "Go to Special:TrainSpam and
 enter the revision number to mark as spam".


I don't think we could take rollbacks into account for automated learning.
A person who rolls back an edit has not necessarily done so because the
edit was spam.

Though a "Train as spam" checkbox is a good idea. I was thinking about a
"Report spam" button next to the edit button at the top-right corner
of a section.



 Good luck!






-- 
Cheers,
Anubhav


Anubhav Agarwal | 4th Year | Computer Science & Engineering | IIT Roorkee

Re: [Wikitech-l] Project Idea for GSoC 2013 - Bayesian Spam Filter

2013-04-13 Thread Paul Selitskas
On Sat, Apr 13, 2013 at 2:42 AM, Brian Wolff bawo...@gmail.com wrote:

 Qgil wrote:
It might have a performance penalty in a site like English Wikipedia with
 plenty of concurrent edits, but for starters it could be potentially useful
 to the 99% of MediaWiki instances that have a significantly smaller number
 of daily edits and especially a very small number of editors and tools able
 / happy to deal with spam.

 Hmm. I was playing with NLP-ish automated new-page patrol recently. One
 thing that crossed my mind was that, if it becomes too expensive, one could
 run the classifier in the job queue (and hence on dedicated servers) and
 tag changes shortly after the fact.

We have Parsoid running separately, don't we? Perhaps, the same
approach could work here as well.


 -bawolff



--
Best regards,
Павел Селіцкас/Pavel Selitskas
Wizardist @ Wikimedia projects


Re: [Wikitech-l] Project Idea for GSoC 2013 - Bayesian Spam Filter

2013-04-12 Thread Platonides
On 09/04/13 18:20, Quim Gil wrote:
 Hi Anubhav,
 
 I have done a first reality check with Chris Steipp, who oversees the
 area of security and also spam prevention. Your idea is interesting and
 it seems to be feasible. This is a very good first step!
 
 It would require adding a hook to MediaWiki core, but this could be a
 small, acceptable change.
I agree. Adding a hook is no problem.

 The rest could be developed as an extension of
 the ConfirmEdit extension.

I'm not sure on adding it to ConfirmEdit. I would develop it as an
independent extension, which could then hook into ConfirmEdit or
AbuseFilter.

Anubhav wrote:
 Tasks
 
 Create a tool for wiki users to report spam: a simple way to
 train a Bayesian DB. This should be accessible to any user
 with the permissions to undo or roll back those changes or to
 delete the new page/file. Understand the metadata (IP, links,
 user) I can extract from the data (perhaps harnessing other
 services like blacklists).

I think it would be more interesting if it could be trained
automatically, perhaps by automatically learning rollbacks as wrong.
Maybe there could be a checkbox to train as spam when doing a revert,
but I would avoid anything complex like "Go to Special:TrainSpam and
enter the revision number to mark as spam".
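
As a rough illustration of the checkbox idea, the sketch below learns only
from the text that the reverted edit added, and only when the reverter opted
in. Everything here is hypothetical: `on_revert`, `added_lines` and the
in-memory `spam_token_counts` store are made-up names, and the example is
plain Python rather than a MediaWiki hook.

```python
import difflib
from collections import Counter

# Hypothetical in-memory store; a real extension would persist this.
spam_token_counts = Counter()


def added_lines(before, after):
    """Return the lines that are present in `after` but not in `before`."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


def on_revert(restored_text, reverted_text, train_as_spam):
    """Learn from a revert only if the "train as spam" box was ticked.

    Returns the number of tokens added to the spam counts.
    """
    if not train_as_spam:
        return 0
    tokens = [
        token
        for line in added_lines(restored_text, reverted_text)
        for token in line.lower().split()
    ]
    spam_token_counts.update(tokens)
    return len(tokens)


restored = "Roorkee is a city in India.\nIt hosts IIT Roorkee."
reverted = (
    "Roorkee is a city in India.\n"
    "Buy cheap pills here\n"
    "It hosts IIT Roorkee."
)
learned = on_revert(restored, reverted, train_as_spam=True)
skipped = on_revert(restored, reverted, train_as_spam=False)
```

Tying the training step to the revert itself means the tool already knows
which revision is being undone, so nobody has to look up and enter a revision
number by hand.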

Good luck!



Re: [Wikitech-l] Project Idea for GSoC 2013 - Bayesian Spam Filter

2013-04-12 Thread Brian Wolff
On 2013-04-12 7:33 PM, Platonides platoni...@gmail.com wrote:

 On 09/04/13 18:20, Quim Gil wrote:
  Hi Anubhav,
 
  I have done a first reality check with Chris Steipp, who oversees the
  area of security and also spam prevention. Your idea is interesting and
  it seems to be feasible. This is a very good first step!
 
  It would require adding a hook to MediaWiki core, but this could be a
  small, acceptable change.
 I agree. Adding a hook is no problem.


Well, a hook is obviously no problem, but I'm not sure why a new one would
be needed. Surely if AbuseFilter has all the hooks it needs, so would
this.

Qgil wrote:
It might have a performance penalty in a site like English Wikipedia with
plenty of concurrent edits, but for starters it could be potentially useful
to the 99% of MediaWiki instances that have a significantly smaller number
of daily edits and especially a very small number of editors and tools able
/ happy to deal with spam.

Hmm. I was playing with NLP-ish automated new-page patrol recently. One
thing that crossed my mind was that, if it becomes too expensive, one could
run the classifier in the job queue (and hence on dedicated servers) and
tag changes shortly after the fact.
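
A minimal sketch of that deferred approach follows, using Python's standard
`queue` and `threading` modules as a stand-in for a job queue. It is an
analogy under stated assumptions, not MediaWiki's job queue API: `save_edit`,
`run_worker`, the `possible-spam` tag, the threshold and the toy `classify`
scorer are all invented for illustration.

```python
import queue
import threading

SPAM_THRESHOLD = 0.9  # assumed cut-off; would need tuning on real data


def classify(added_text):
    """Toy scorer standing in for the trained classifier."""
    spam_words = {"buy", "cheap", "pills"}
    tokens = added_text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in spam_words for token in tokens) / len(tokens)


def run_worker(jobs, tags):
    """Drain the queue and tag suspicious revisions; nothing is reverted."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work
            jobs.task_done()
            break
        revision_id, added_text = job
        if classify(added_text) >= SPAM_THRESHOLD:
            tags[revision_id] = "possible-spam"
        jobs.task_done()


def save_edit(jobs, revision_id, added_text):
    """Save path: enqueue the edit and return without classifying it."""
    jobs.put((revision_id, added_text))


jobs = queue.Queue()
tags = {}
worker = threading.Thread(target=run_worker, args=(jobs, tags))
worker.start()

save_edit(jobs, 101, "buy cheap pills")
save_edit(jobs, 102, "fixed a typo in the history section")

jobs.put(None)  # tell the worker to stop once the queue is empty
worker.join()
print(tags)
```

Because the worker only tags revisions, the save path stays cheap and no
automatic revert has to deal with edit conflicts.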

Last of all, I would suggest you also read up on other people who have done
machine learning approaches to vandalism detection, in particular
User:ClueBot NG: http://en.wikipedia.org/wiki/User:Cluebot_NG
There is also a list of academic papers on the subject at
http://en.wikipedia.org/w/index.php?title=User:Emijrp/Anti-vandalism_bot_census
(that said, an extension like the one you are proposing does not have to be
as good as the rather complex state of the art in order to be useful. Any
effective system would probably be quite useful).

-bawolff

Re: [Wikitech-l] Project Idea for GSoC 2013 - Bayesian Spam Filter

2013-04-09 Thread Quim Gil

Hi Anubhav,

On 04/07/2013 06:05 PM, anubhav agarwal wrote:

Hi,

I am Anubhav Agarwal, a B.Tech 4th Year student at IIT Roorkee. I wish to
apply for GSoC 2013 and I am thinking about Bayesian Spam Filter as a
project for the same. I have drafted the idea on my talk page:
http://www.mediawiki.org/wiki/User:Anubhav_iitr


I have done a first reality check with Chris Steipp, who oversees the 
area of security and also spam prevention. Your idea is interesting and 
it seems to be feasible. This is a very good first step!


It would require adding a hook to MediaWiki core, but this could be a 
small, acceptable change. The rest could be developed as an extension of 
the ConfirmEdit extension.


It might have a performance penalty in a site like English Wikipedia 
with plenty of concurrent edits, but for starters it could be 
potentially useful to the 99% of MediaWiki instances that have a 
significantly smaller number of daily edits and especially a very small 
number of editors and tools able / happy to deal with spam.


As a next step, please

1. Create a subpage for your proposal e.g. 
http://www.mediawiki.org/wiki/User:Anubhav_iitr/Bayesan_spam_filter


2. File an enhancement request at 
https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions 
under Extensions requests explaining your proposal and linking to the 
related wiki page.


3. Reply to this thread sharing the link to the bug report so anybody 
interested can watch it.




I request you to go through this and give your suggestions on it.


Yes, but you will get more feedback if you are diligent in answering the
feedback received:


http://www.mediawiki.org/wiki/User_talk:Anubhav_iitr  :)


--
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil
