On Thu, Mar 19, 2009 at 1:03 PM, Delirium <[email protected]> wrote:

> Brian wrote:
> > This extension is very important for training machine learning
> > vandalism detection bots. Recently published systems use only hundreds
> > of examples of vandalism in training - not nearly enough to
> > distinguish between the variety found in Wikipedia or generalize to
> > new, unseen forms of vandalism. A large set of human created rules
> > could be run against all previous edits in order to create a massive
> > vandalism dataset.
> As a machine-learning person, this seems like a somewhat problematic
> idea--- generating training examples *from a rule set* and then learning
> on them is just a very roundabout way of reconstructing that rule set.
> What you really want is a large dataset of human-labeled examples of
> vandalism / non-vandalism that *can't* currently be distinguished
> reliably by rules, so you can throw a machine-learning algorithm at the
> problem of trying to come up with some.
>

Since there's already a database, this sounds like it could be done by
flagging edits as "vandalism" and then reading the existing database
information to extract the details (IP, a diff of the change, etc.). That
way, humans define what counts as "vandalism", and the machine can learn the
meaning.

This may need a button or something, so users can report vandalism and the
database can flag the edit.
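
To make the idea concrete, here is a rough sketch in Python of how a flagged edit might be turned into a labeled training example. Everything in it is an assumption for illustration: the `LabeledEdit` record, the `build_example` helper, and the revision dict keys (`rev_id`, `ip`, `old_text`, `new_text`) are hypothetical and are not the actual MediaWiki schema or API. Only `difflib` from the standard library is used to produce the diff.

```python
import difflib
from dataclasses import dataclass


@dataclass
class LabeledEdit:
    """Hypothetical training record; field names are illustrative only."""
    rev_id: int
    ip: str
    diff: str
    is_vandalism: bool


def make_diff(old_text: str, new_text: str) -> str:
    """Return a unified diff between two revisions of a page."""
    lines = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm=""
    )
    return "\n".join(lines)


def build_example(revision: dict, flagged_ids: set) -> LabeledEdit:
    """Combine stored revision data with the human-supplied flag.

    `revision` stands in for whatever the database already stores about
    an edit; `flagged_ids` is the set of revision ids that users reported
    as vandalism (for example through a "report" button).
    """
    return LabeledEdit(
        rev_id=revision["rev_id"],
        ip=revision["ip"],
        diff=make_diff(revision["old_text"], revision["new_text"]),
        is_vandalism=revision["rev_id"] in flagged_ids,
    )


# Made-up sample data, not taken from any real wiki.
sample = {
    "rev_id": 101,
    "ip": "192.0.2.7",
    "old_text": "Paris is the capital of France.",
    "new_text": "lol",
}
example = build_example(sample, flagged_ids={101})
```

The label comes only from the human flag, so the learner sees what people consider vandalism rather than what an existing rule set already catches.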


-- 
--
ℱin del ℳensaje.
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l