https://bugzilla.wikimedia.org/show_bug.cgi?id=47956
Web browser: ---
Bug ID: 47956
Summary: Authorship Tracking
Product: MediaWiki extensions
Version: unspecified
Hardware: All
OS: All
Status: UNCONFIRMED
Severity: enhancement
Priority: Unprioritized
Component: WikidataRepo
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected]
Classification: Unclassified
Mobile Platform: ---
We propose to implement authorship tracking for the text of Wikipedia. The
goal is to annotate every word of Wikipedia content with the revision in which
it was inserted and the author who created it.
We have been developing robust and efficient algorithms for computing
authorship information. The algorithms compare each new revision of a page with
all previous revisions and attribute any new content in the latest revision to
its earliest plausible match in previous content. As a result, if content is
deleted (e.g. by a vandal, or in the course of a dispute) and later
re-inserted, it is still correctly attributed to its original author.
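As a rough illustration of the earliest-match rule (not the actual algorithm,
whose matching criteria are not described here), the sketch below compares a
new revision against every stored revision, oldest first, and lets each word
inherit the origin of the first match found. The function name `attribute`,
the `MIN_MATCH` threshold, and the use of Python's `difflib` for matching are
all assumptions made for the example.

```python
from difflib import SequenceMatcher

MIN_MATCH = 2  # assumed threshold: shortest run of words accepted as a match


def attribute(history, new_words, new_rev_id):
    """Return the origin revision id for each word of new_words.

    history is a list of (words, origins) pairs, oldest revision first,
    where origins[i] is the revision credited with words[i].
    """
    origins = [None] * len(new_words)
    for old_words, old_origins in history:  # oldest first: earliest match wins
        matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
        for a, b, size in matcher.get_matching_blocks():
            if size < MIN_MATCH:
                continue  # too short to count as a plausible match
            for k in range(size):
                if origins[b + k] is None:
                    origins[b + k] = old_origins[a + k]
    # Words with no match in any earlier revision belong to the new revision.
    return [o if o is not None else new_rev_id for o in origins]


# Revision 2 blanks the page; revision 3 restores the text and adds a word.
history = []
revisions = [(1, "the quick brown fox"), (2, "spam"),
             (3, "the quick brown fox jumps")]
for rev_id, text in revisions:
    words = text.split()
    history.append((words, attribute(history, words, rev_id)))

print(history[-1][1])  # [1, 1, 1, 1, 3]
```

The restored words are credited to revision 1 rather than to revision 3,
which is the behaviour described above for deleted and re-inserted content.
Comparing against every full revision, as done here, is exactly the cost that
a compact page summary is meant to avoid.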
To achieve an efficient implementation, the algorithm keeps a specially-encoded
summary of the history of each wiki page. The size of this summary is
proportional to the amount of change that the page has undergone. Because we
drop information on content that has been absent for longer than 90 days and
longer than 100 edits, the summary is on average about 10 times the size of a
typical revision. When a user creates a new revision, the algorithm:
1. Reads the page summary.
2. Computes the authorship for the new revision, and stores it.
3. Stores an updated summary of the history that also includes the new
   revision.
The process takes about one second of processing time per revision. This
includes the time to serialize and un-serialize the summary, which generally
dominates.
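The per-revision steps and the retention rule can be sketched as follows. This
is a minimal illustration under stated assumptions, not the real encoding: the
summary is stored as JSON in a plain dict standing in for the text store, the
names `process_revision` and `prune` are hypothetical, and the attribution
step is a placeholder that simply looks up each word in the summary.

```python
import json
from datetime import datetime, timedelta

MAX_ABSENT_AGE = timedelta(days=90)  # from the text: absent longer than 90 days
MAX_ABSENT_EDITS = 100               # from the text: and longer than 100 edits


def prune(seen, edit_no, now):
    """Drop entries absent for more than 90 days and more than 100 edits."""
    kept = {}
    for word, entry in seen.items():
        absent_age = now - datetime.fromisoformat(entry["last_seen_time"])
        absent_edits = edit_no - entry["last_seen_edit"]
        if absent_age > MAX_ABSENT_AGE and absent_edits > MAX_ABSENT_EDITS:
            continue  # both limits exceeded: forget this content
        kept[word] = entry
    return kept


def process_revision(store, page_id, rev_id, words, now):
    # Step 1: read and un-serialize the page summary.
    summary = json.loads(store.get(page_id, '{"edits": 0, "words": {}}'))
    edit_no = summary["edits"] + 1
    seen = summary["words"]

    # Step 2: compute authorship for the new revision. Placeholder rule:
    # a word already in the summary keeps its recorded origin revision.
    authorship = []
    for word in words:
        entry = seen.setdefault(word, {"origin": rev_id})
        entry["last_seen_edit"] = edit_no
        entry["last_seen_time"] = now.isoformat()
        authorship.append(entry["origin"])

    # Step 3: prune long-absent content, then serialize and store the summary.
    store[page_id] = json.dumps(
        {"edits": edit_no, "words": prune(seen, edit_no, now)})
    return authorship


store = {}  # stand-in for the real text store
start = datetime(2013, 5, 1)
first = process_revision(store, "Page", 1, "the quick brown fox".split(), start)
blanked = process_revision(store, "Page", 2, ["spam"],
                           start + timedelta(days=1))
restored = process_revision(store, "Page", 3,
                            "the quick brown fox jumps".split(),
                            start + timedelta(days=2))
print(restored)  # [1, 1, 1, 1, 3]
```

Only the stored summary is read on each edit, never the full revision
history, and both serialization steps are explicit because they are where
most of the processing time goes.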
The algorithm code is already available, and it works. In this Summer of Code
project we propose to make the algorithm run on the live Wikipedia, integrating
it with the production environment, text store, temporary database tables,
etc., as required to make it work for as many language editions of Wikipedia
as possible or desired.
Detailed Information:
https://www.mediawiki.org/wiki/User:Mshavlovsky/Authorship_Tracking