We've just published the full dataset of <ref> (citation) and map usage across wikis. The metadata is here:
https://figshare.com/articles/dataset/Reference_and_map_usage_across_Wikimedia_wiki_pages/24064941
and the raw data is here:
https://analytics.wikimedia.org/published/datasets/one-off/html-dump-scraper-refs-and-maps/2023-06-01/
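If you want to dip into the raw data directly, the NDJSON files can be streamed one record at a time. Here's a minimal Python sketch; the file name is hypothetical and the per-record fields are not shown (see the column overview in https://phabricator.wikimedia.org/T341751 for the actual schema):

    import json

    # Minimal sketch: stream an NDJSON dump file one record at a time.
    # "enwiki.ndjson" is a hypothetical name; check the dump directory
    # listing above for the real file names.
    count = 0
    with open("enwiki.ndjson", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)  # one JSON object per line
            count += 1
    print(count, "records")

Because each line is an independent JSON object, you can process files of any size this way without loading them fully into memory.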
Feel free to reply with questions or suggestions. I hope you find the results useful in your own work!

Kind regards,
Adam W.
[[mw:Adamw]] for https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes

On Wed, Aug 23, 2023 at 12:22 PM Adam Wight <[email protected]> wrote:
> As part of Wikimedia Germany's work around reference reuse, we wrote a
> tool which processes the HTML dumps of all articles and produces
> detailed information about how Cite references (and Kartographer maps)
> are used on each page.
>
> I'm writing to this list for advice on how to publish the results so that
> the data can be easily discovered and consumed by researchers.
> Currently, the data is contained in 3,100 JSON and NDJSON files hosted
> on a Wikimedia Cloud VPS server, with a total size of 3.4 GB. The
> outputs can be split or merged into whatever form will make them more
> usable.
>
> For an overview of the columns and sample rows, please see this task:
> https://phabricator.wikimedia.org/T341751
>
> We plan to run the scraper again in the future, and its modular
> architecture makes it simple to include or exclude additional
> information if anyone has suggestions about what else we might want to
> extract from rendered articles. To read more about the tool itself and
> why we decided to process HTML dumps directly, see this post:
> https://mw.ludd.net/wiki/Elixir/HTML_dump_scraper
>
> -Adam Wight
> [[mw:Adamw]]
> https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes
>
> --
> Adam Wight - Developer - Wikimedia Deutschland e.V. - https://wikimedia.de
