On Thu, Mar 19, 2009 at 1:03 PM, Delirium <[email protected]> wrote:
> Brian wrote:
> > This extension is very important for training machine learning
> > vandalism detection bots. Recently published systems use only hundreds
> > of examples of vandalism in training - not nearly enough to
> > distinguish between the variety found in Wikipedia or generalize to
> > new, unseen forms of vandalism. A large set of human created rules
> > could be run against all previous edits in order to create a massive
> > vandalism dataset.
>
> As a machine-learning person, this seems like a somewhat problematic
> idea--- generating training examples *from a rule set* and then learning
> on them is just a very roundabout way of reconstructing that rule set.
> What you really want is a large dataset of human-labeled examples of
> vandalism / non-vandalism that *can't* currently be distinguished
> reliably by rules, so you can throw a machine-learning algorithm at the
> problem of trying to come up with some.
>

Since there's already a database, it sounds like this could be done by
flagging edits as "vandalism" and then reading the existing database
records to extract the details: the IP, a diff of the change, etc.

That way, humans define what "vandalism" is, and the machine can learn
what it means.

This may need a button or something similar, so that users can report an
edit and the database flags it.
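
The flagging idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Edit` record, its field names, and the two helper functions are hypothetical and do not correspond to the actual MediaWiki schema or to any existing extension. It shows a human flag being stored on an edit, and a labeled training example (IP, added and removed lines, label) being derived from data that is already recorded.

```python
import difflib
from dataclasses import dataclass


@dataclass
class Edit:
    # Hypothetical record of one edit; field names are illustrative only
    # and are not taken from the real MediaWiki database schema.
    rev_id: int
    user_ip: str
    old_text: str
    new_text: str
    flagged_vandalism: bool = False


def flag_as_vandalism(edit: Edit) -> None:
    """What a 'report vandalism' button might do: set the human label."""
    edit.flagged_vandalism = True


def extract_example(edit: Edit) -> dict:
    """Build one labeled training example from the stored edit data."""
    diff = difflib.ndiff(edit.old_text.splitlines(), edit.new_text.splitlines())
    added, removed = [], []
    for line in diff:
        # ndiff prefixes each line with a two-character code:
        # "+ " for added, "- " for removed, "  " for unchanged, "? " for hints.
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return {
        "rev_id": edit.rev_id,
        "ip": edit.user_ip,
        "added": added,
        "removed": removed,
        "label": int(edit.flagged_vandalism),  # 1 = human-flagged vandalism
    }


if __name__ == "__main__":
    edit = Edit(
        rev_id=1,
        user_ip="192.0.2.1",  # documentation-only address
        old_text="Paris is the capital of France.\nIt is on the Seine.",
        new_text="Paris is the capital of France.\nLOL LOL LOL",
    )
    flag_as_vandalism(edit)
    print(extract_example(edit))
```

The label here comes only from the human flag, never from a rule, which is the point Delirium raises: the learner sees examples that people judged to be vandalism, not examples a rule set already knows how to catch.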
