Thank you Joseph; great to hear there is interest in building such a dataset. You say that the link information would need to be parsed from wikitext, which is complicated; would the pagelinks table help as an alternative source of data?
*Giovanni Luca Ciampaglia* ∙ glciampaglia.com
Assistant Professor
Computer Science and Engineering <https://www.usf.edu/engineering/cse/> ∙ University of South Florida <https://www.usf.edu/>

*Due to Florida’s broad open records law, email to or from university employees is public record, available to the public and the media upon request.*

On Thu, Feb 13, 2020 at 9:27 AM Joseph Allemandou <[email protected]> wrote:

> Hi Giovanni,
> Thank you for your message :)
> You are correct in that there is no information on page-to-page links as of
> today, nor, for instance, on historical values of revisions being redirects.
> We share with you the idea that such information is extremely valuable, and
> we have in mind to be able to extract it at some point.
> The reason it has not yet been done is that those pieces of information
> are only available through parsing the wikitext of every revision, which is
> not only resource intensive but also technically complicated (templates,
> version changes, etc.).
> You can be sure we will send another announcement when we release that
> data :)
> Best,
>
> On Tue, Feb 11, 2020 at 10:30 PM Giovanni Luca Ciampaglia <[email protected]> wrote:
>
> > Hi Joseph,
> >
> > Thanks a lot for creating and sharing such a valuable resource. I went
> > through the schema and from what I understand there is no information about
> > page-to-page links, correct? Are there any resources that would provide
> > such historical data?
> >
> > Best,
> >
> > *Giovanni Luca Ciampaglia* ∙ glciampaglia.com
> > Assistant Professor
> > Computer Science and Engineering <https://www.usf.edu/engineering/cse/> ∙ University of South Florida <https://www.usf.edu/>
> >
> > On Mon, Feb 10, 2020 at 11:28 AM Joseph Allemandou <[email protected]> wrote:
> >
> > > Hi Analytics People,
> > >
> > > The Wikimedia Analytics Team is pleased to announce the release of the most
> > > complete dataset we have to date to analyze content and contributors
> > > metadata: Mediawiki History [1] [2].
> > >
> > > Data is in TSV format, usually released monthly around the 3rd of the month,
> > > and every new release contains the full history of metadata.
> > >
> > > The dataset contains an enhanced [3] and historified [4] version of user,
> > > page and revision metadata and serves as a base to Wikistats API on edits,
> > > users and pages [5] [6].
> > >
> > > We hope you will have as much fun playing with the data as we have building
> > > it, and we're eager to hear from you [7], whether for issues, ideas or
> > > usage of the data.
> > >
> > > Analytically yours,
> > >
> > > --
> > > Joseph Allemandou (joal) (he / him)
> > > Sr Data Engineer
> > > Wikimedia Foundation
> > >
> > > [1] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
> > > [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history_dumps
> > > [3] Many pre-computed fields are present in the dataset, from edit-counts
> > > by user and page to reverts and reverted information, as well as time
> > > between events.
> > > [4] As accurate as possible historical usernames and page-titles (as well
> > > as user-groups and blocks) are available in addition to current values, and
> > > are provided in a denormalized way to every event of the dataset.
> > > [5] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
> > > [6] https://wikimedia.org/api/rest_v1/
> > > [7] https://phabricator.wikimedia.org/maniphest/task/edit/?title=Mediawiki%20History%20Dumps&projectPHIDs=Analytics-Wikistats,Analytics
> > > _______________________________________________
> > > Wiki-research-l mailing list
> > > [email protected]
> > > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
> --
> Joseph Allemandou (joal) (he / him)
> Sr Data Engineer
> Wikimedia Foundation
