Brian wrote:
> This extension is very important for training machine learning
> vandalism detection bots. Recently published systems use only hundreds
> of examples of vandalism in training - not nearly enough to
> distinguish between the variety found in Wikipedia or generalize to
> new, unseen forms of vandalism. A large set of human created rules
> could be run against all previous edits in order to create a massive
> vandalism dataset.
As a machine-learning person, I find this idea somewhat problematic:
generating training examples *from a rule set* and then learning on
them is just a very roundabout way of reconstructing that rule set.
What you really want is a large dataset of human-labeled examples of
vandalism / non-vandalism that *can't* currently be distinguished
reliably by rules, so you can throw a machine-learning algorithm at the
problem of trying to come up with rules that can.
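
To make the point concrete, here is a minimal sketch, not a real
vandalism detector. Everything in it is an assumption made up for
illustration: the two features (caps_ratio, length_change), the
hand-written rule, and the decision-stump learner. It labels random
"edits" with the rule, trains on those labels, and checks what the
learner ends up with.

```python
import random

def rule(caps_ratio, length_change):
    # Hypothetical hand-written rule: flag edits that are mostly uppercase.
    return caps_ratio > 0.5

def fit_stump(rows, labels):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            errors = sum((r[f] > t) != y for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]

rng = random.Random(0)

# Synthetic "edits": (caps_ratio, length_change), both uniform in [0, 1).
train = [(rng.random(), rng.random()) for _ in range(500)]
labels = [rule(*row) for row in train]  # labels come from the rule itself

feature, threshold = fit_stump(train, labels)

# On fresh data the learned stump agrees with the rule almost everywhere.
test = [(rng.random(), rng.random()) for _ in range(500)]
agreement = sum((row[feature] > threshold) == rule(*row) for row in test) / len(test)

# An edit the rule misses (say, lowercase page blanking) is missed by the
# learner too, because no training label ever marked such an edit.
missed_by_rule = (0.1, 0.9)
learner_flags_it = missed_by_rule[feature] > threshold

print(feature, round(threshold, 3), agreement, learner_flags_it)
```

The learner recovers the rule's own feature and (approximately) its
threshold, and it inherits the rule's blind spots. Human labels on the
cases the rules get wrong are what would add new information.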

-Mark


_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
