Hi,
as Luca already mentioned, we (my colleagues Maribel Acosta and Felix Keppmann 
and I) are also working on an algorithm for authorship detection. Our approach 
differs somewhat from Luca and Michael's: we rebuild authorship information for 
words in paragraphs and sentences via MD5 hashes (i.e., we check whether a 
paragraph or sentence has existed before at any time in the article) and use a 
diff algorithm to detect the changes in those parts of the article that have 
not been seen before.
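
To make the idea more concrete, here is a minimal sketch in Python. It is an 
illustration under simplifying assumptions, not the implementation described 
in this mail: it hashes paragraphs only (no sentence level), treats a blank 
line as the paragraph separator, and diffs each unseen paragraph against the 
words of the previous revision with difflib. The names md5_key and attribute 
are hypothetical.

```python
import difflib
import hashlib


def md5_key(text):
    """Return the MD5 hex digest used to recognise previously seen text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def attribute(revisions):
    """Attribute each word of the latest revision to an author.

    revisions: list of (author, text) tuples in chronological order.
    Returns a list of (word, author) tuples for the last revision.
    """
    seen = {}   # paragraph hash -> list of (word, author) tuples
    prev = []   # (word, author) tuples of the previous revision
    for author, text in revisions:
        prev_words = [word for word, _ in prev]
        current = []
        for paragraph in text.split("\n\n"):
            key = md5_key(paragraph)
            if key not in seen:
                # Unseen paragraph: new words belong to the current author,
                # words matched by the diff keep their earlier author.
                words = paragraph.split()
                labels = [author] * len(words)
                matcher = difflib.SequenceMatcher(
                    None, prev_words, words, autojunk=False
                )
                for block in matcher.get_matching_blocks():
                    for offset in range(block.size):
                        labels[block.b + offset] = prev[block.a + offset][1]
                seen[key] = list(zip(words, labels))
            # Paragraphs seen at any earlier time reuse their stored authorship.
            current.extend(seen[key])
        prev = current
    return prev
```

In this sketch a paragraph that is deleted and later restored gets its 
original authorship back through the hash lookup, without running the diff 
again.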

We build on an older, more basic model of ours, described in the paper Luca 
already included in his mail [1]. Currently we are at 0.04 seconds per revision 
for the pure calculation, without writing/reading the hashes to/from a 
database. That database step is what we are working on now, in order to make 
the method incremental. We will make the code publicly available soon. We would 
like to contribute as much as we can to the Wikipedia authorship project with 
our solution and are open to any collaboration.

Another issue is, of course, the accuracy of the detected word origins, which 
we will ask the community to help evaluate. We have set up a small gold 
standard set of 184 words and their origin (who wrote them in which revision), 
which can be found here: [2]. The words were randomly selected and their origin 
determined manually. I invite everyone to look at this set, comment on whether 
the postulated revisions of origin seem right, and perhaps extend it. Although 
we will run an evaluation with a bigger user base, this set serves as a useful 
starting point for preliminary testing. Right now we reach an accuracy of ~85% 
on this set (compared to ~50% for the old WikiTrust algorithm, see [1]), 
although there is still a lot of room for tuning in our algorithm.
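
For anyone who wants to check an algorithm against such a gold standard, the 
accuracy figure can be computed roughly as follows. This is only a sketch: the 
function name accuracy and the dictionary layout (a word identifier mapped to 
its revision of origin) are assumptions for illustration and do not reflect 
the actual layout of the spreadsheet in [2].

```python
def accuracy(gold, predicted):
    """Fraction of gold-standard words whose predicted origin is correct.

    gold, predicted: dicts mapping a word identifier (assumed here to be any
    hashable key, e.g. a (article, position) tuple) to a revision ID.
    """
    if not gold:
        raise ValueError("gold standard is empty")
    # A word missing from the predictions counts as wrong.
    correct = sum(
        1 for key, revision in gold.items() if predicted.get(key) == revision
    )
    return correct / len(gold)
```

A word counts as correct only if the predicted revision of origin is exactly 
the one recorded in the gold standard.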

Best,

Fabian

[1] http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_Andriy_Rodchenko.pdf
[2] https://docs.google.com/spreadsheet/ccc?key=0An7RIRiLIXD5dENITFpmU0c1RVZaU1NYeXZ0UEVVaEE#gid=0






--
Karlsruhe Institute of Technology (KIT)
Institute of Applied Informatics and Formal Description Methods

Dipl.-Medwiss. Fabian Flöck
Research Associate

Building 11.40, Room 222
KIT-Campus South
D-76128 Karlsruhe

Phone: +49 721 608 4 6584
Fax: +49 721 608 4 6580
Skype: f.floeck_work
E-Mail: [email protected]
WWW: http://www.aifb.kit.edu/web/Fabian_Flöck

KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
