I presented a talk at Wikimania 2007 that espoused the virtues of
combining human measures of content with automatically determined
measures in order to generalize to unseen instances. Unfortunately, all
those Wikimania talks seem to have been lost. It was related to this
article on predicting the quality ratings provided by the Wikipedia
Editorial Team:

Rassbach, L., Pincock, T., & Mingus, B. (2007). "Exploring the
Feasibility of Automatically Rating Online Article Quality"
http://upload.wikimedia.org/wikipedia/wikimania2007/d/d3/RassbachPincockMingus07.pdf

Delirium, you do make it sound as if merely having the tagged dataset
solves the entire problem. But there are really multiple problems. One
is learning to classify what you have been told is in the dataset
(e.g., that all instances of this rule in the edit history *really
are* vandalism). The other is learning about new reasons that this
edit is vandalism based on all the other occurrences of vandalism and
non-vandalism and a sophisticated pre-parse of all the content that
breaks it down into natural language features.  Finally, you then wish
to use this system to bootstrap a vandalism detection system that can
generalize to entirely new instances of vandalism.
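To make that pipeline concrete, here is a minimal sketch of the first
step, training a classifier on labeled edits so it can score unseen
ones. It uses a toy bag-of-words Naive Bayes in pure Python; the sample
edits and the bag-of-words features are invented for illustration, not
the richer natural-language features described above:

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase bag-of-words; a real system would use a richer
    # pre-parse of the content into natural language features.
    return text.lower().split()

def train(examples):
    # examples: list of (edit_text, is_vandalism) pairs
    counts = {True: Counter(), False: Counter()}
    priors = Counter()
    for text, label in examples:
        priors[label] += 1
        counts[label].update(tokenize(text))
    return counts, priors

def classify(model, text):
    counts, priors = model
    vocab = set(counts[True]) | set(counts[False])
    scores = {}
    for label in (True, False):
        total = sum(counts[label].values())
        score = math.log(priors[label] / sum(priors.values()))
        for tok in tokenize(text):
            # Laplace smoothing so unseen words do not zero out a class
            score += math.log((counts[label][tok] + 1) / (total + len(vocab)))
        scores[label] = score
    return scores[True] > scores[False]

# Toy training set with positive *and* negative examples
edits = [
    ("JOSH IS AWESOME lol lol", True),
    ("buy viagra cheap here", True),
    ("added citation to 2006 census data", False),
    ("fixed typo in infobox population figure", False),
]
model = train(edits)
print(classify(model, "lol lol JOSH rules"))
print(classify(model, "updated census citation"))
```

The point of the sketch is only the shape of the problem: once the
classifier exists, its confident predictions on previously unlabeled
edits can seed the bootstrapping step.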

The primary way of doing this is to use positive and *negative*
examples of vandalism in conjunction with their features. A good set
of example features is an article or an edit's conformance with the
Wikipedia Manual of Style. I never implemented the entire MoS, but I
did do quite a bit of it and it is quite indicative of quality.
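For instance, a few MoS-flavoured features can be computed with nothing
more than string inspection. The specific checks below are illustrative
stand-ins, not the ones I actually implemented:

```python
import re

def mos_features(text):
    """Crude style-conformance signals; each is a cheap proxy for one
    Manual of Style expectation (illustrative only)."""
    letters = [c for c in text if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        # Shouting in all caps is strongly discouraged by the MoS
        "caps_ratio": upper / len(letters) if letters else 0.0,
        # The MoS avoids exclamation marks in article prose
        "bangs": text.count("!"),
        # Repeated characters ("soooo cool") rarely survive review
        "char_runs": len(re.findall(r"(.)\1{3,}", text)),
        # Very short edits skew toward vandalism
        "n_words": len(text.split()),
    }

print(mos_features("THIS PAGE IS SOOOO LAME!!!"))
```

Feature vectors like these, computed for both vandalism and
non-vandalism examples, are what the learner actually consumes.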

Generally speaking, it is not true that you can only draw conclusions
about what is immediately available in your dataset. It is true,
however, that machine learning systems, unlike people, struggle to
generalize.

On Thu, Mar 19, 2009 at 6:03 AM, Delirium <[email protected]> wrote:
> Brian wrote:
>> This extension is very important for training  machine learning
>> vandalism detection bots. Recently published systems use only hundreds
>> of examples of vandalism in training - not nearly enough to
>> distinguish between the variety found in Wikipedia or generalize to
>> new, unseen forms of vandalism. A large set of human created rules
>> could be run against all previous edits in order to create a massive
>> vandalism dataset.
> As a machine-learning person, this seems like a somewhat problematic
> idea--- generating training examples *from a rule set* and then learning
> on them is just a very roundabout way of reconstructing that rule set.
> What you really want is a large dataset of human-labeled examples of
> vandalism / non-vandalism that *can't* currently be distinguished
> reliably by rules, so you can throw a machine-learning algorithm at the
> problem of trying to come up with some.
>
> -Mark
>
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
