We've just published the full dataset of <ref> (citation) and map usage
across wikis. Please find the metadata here:

https://figshare.com/articles/dataset/Reference_and_map_usage_across_Wikimedia_wiki_pages/24064941

and the raw data here:

https://analytics.wikimedia.org/published/datasets/one-off/html-dump-scraper-refs-and-maps/2023-06-01/
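
In case it helps anyone get started, here is a minimal Python sketch for
reading newline-delimited JSON (NDJSON), one of the two formats the
outputs were in when I first wrote to this list. It assumes each
non-empty line holds one JSON object. The helper name iter_ndjson and
the sample field names ("page", "ref_count") are hypothetical and only
for illustration; please check the metadata for the real columns and
file layout.

```python
import io
import json
from typing import Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


# Hypothetical sample rows, NOT taken from the real dataset.
sample = io.StringIO(
    '{"page": "A", "ref_count": 3}\n'
    '\n'
    '{"page": "B", "ref_count": 0}\n'
)
rows = list(iter_ndjson(sample))
print(len(rows), "rows parsed")
```

For a downloaded file, the same function should work on an open file
handle, e.g. open(path, encoding="utf-8"), so that rows are read one at
a time rather than loading the whole file into memory.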

Feel free to reply with questions or suggestions. I hope you find the
results useful in your own work!

Kind regards,
Adam W.
[[mw:Adamw]]

for https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes

On Wed, Aug 23, 2023 at 12:22 PM Adam Wight <[email protected]> wrote:

> As part of Wikimedia Germany's work around reference reuse, we wrote a
> tool which processes the HTML dumps of all articles and produces
> detailed information about how Cite references (and Kartographer maps)
> are used on each page.
>
> I'm writing to this list for advice on how to publish the results so that
> the data can be easily discovered and consumed by researchers.
> Currently, the data is contained in 3,100 JSON and NDJSON files hosted
> on a Wikimedia Cloud VPS server, with a total size of 3.4GB.  The
> outputs can be split or merged into whatever form will make them more
> useable.
>
> For an overview of the columns and sample rows, please see this task:
> https://phabricator.wikimedia.org/T341751
>
> We plan to run the scraper again in the future, and its modular
> architecture makes it simple to include or exclude additional
> information if anyone has suggestions about what else we might want to
> extract from rendered articles.  To read more about the tool itself and
> why we decided to process HTML dumps directly, see this post:
> https://mw.ludd.net/wiki/Elixir/HTML_dump_scraper
>
> -Adam Wight
> [[mw:Adamw]]
>
> https://meta.wikimedia.org/wiki/WMDE_Technical_Wishes
>
>

-- 
Adam Wight - Developer - Wikimedia Deutschland e.V. - https://wikimedia.de
_______________________________________________
Wiki-research-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
