Brian wrote:
> This extension is very important for training machine learning
> vandalism detection bots. Recently published systems use only hundreds
> of examples of vandalism in training - not nearly enough to
> distinguish between the variety found in Wikipedia or generalize to
> new, unseen forms of vandalism. A large set of human created rules
> could be run against all previous edits in order to create a massive
> vandalism dataset.

Speaking as a machine-learning person, this seems like a somewhat
problematic idea: generating training examples *from a rule set* and
then learning on them is just a very roundabout way of reconstructing
that rule set. The best a learner can do on such data is agree with the
rules that produced the labels, so it gains nothing the rules did not
already encode.

What you really want is a large dataset of human-labeled examples of
vandalism / non-vandalism that *can't* currently be distinguished
reliably by rules, so you can throw a machine-learning algorithm at the
problem of trying to come up with rules of its own.
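
To make the circularity concrete, here is a minimal sketch of the argument. Everything in it is an assumption for illustration only: the toy edits, the two hypothetical features `profanity` and `blanking`, the hand-written `rule`, and the one-feature "stump" learner standing in for a real classifier. None of it comes from an actual vandalism system.

```python
from itertools import product

FEATURES = ("profanity", "blanking")  # hypothetical edit features

# Toy corpus: every combination of the two features, five copies each.
edits = [dict(zip(FEATURES, combo))
         for combo in product((False, True), repeat=2)] * 5

def rule(edit):
    """Hand-written rule: flags an edit only if it contains profanity."""
    return edit["profanity"]

def human_label(edit):
    """What a human reviewer would say: page blanking is vandalism too."""
    return edit["profanity"] or edit["blanking"]

def train_stump(edits, labels):
    """Pick the single feature (possibly negated) that best fits the labels."""
    candidates = [(f, neg) for f in FEATURES for neg in (False, True)]

    def hits(candidate):
        f, neg = candidate
        return sum((e[f] != neg) == y for e, y in zip(edits, labels))

    feature, negate = max(candidates, key=hits)
    return feature, (lambda e: e[feature] != negate)

# Train on labels produced by the rule, not by humans.
rule_labels = [rule(e) for e in edits]
learned_feature, model = train_stump(edits, rule_labels)

# The learner simply rediscovers the rule...
agreement_with_rule = sum(model(e) == rule(e) for e in edits) / len(edits)

# ...and inherits its blind spot: blanking-only vandalism goes undetected.
missed = [e for e in edits if human_label(e) and not model(e)]

print(learned_feature, agreement_with_rule, len(missed))
```

In this toy setup the learner ends up keyed on the same feature the rule uses and misses exactly the edits the rule misses. Swapping `rule_labels` for `[human_label(e) for e in edits]` gives the learner something the rule does not already know, although a single-feature stump is too weak to capture it fully.
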
-Mark
