Hi all,
I wanted to flag an update from the Wikimedia Data Engineering team in case
it’s relevant to your work, especially for those of you who may rely on
Wikimedia XML dumps for research.
In short:
- A new set of MediaWiki Content File Exports is now available, providing
unparsed content from Wikimedia’s public wikis in XML format.
- There are two monthly datasets:
  - mediawiki_content_history (full revision history for all pages):
    https://dumps.wikimedia.org/other/mediawiki_content_history/
  - mediawiki_content_current (latest revision only for each page):
    https://dumps.wikimedia.org/other/mediawiki_content_current/
- This change was made because the legacy dump infrastructure at
https://dumps.wikimedia.org/backup-index.html has struggled to reliably
generate XML exports for larger wikis.
- The older XML dump pipeline is now deprecated; SQL dumps will continue,
and some legacy XML generation may persist temporarily.
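For those who script their downloads, here is a minimal sketch of fetching the
new export index pages. Only the two base URLs above come from this
announcement; the directory layout and file names beneath them are not
described here, so check the documentation before relying on any parsing:

```python
# Sketch: retrieving the index listings for the MediaWiki Content File Exports.
# The two base URLs are from the announcement; anything beyond them (per-wiki
# subdirectories, file naming) is an assumption -- consult the docs at
# https://wikitech.wikimedia.org/wiki/MediaWiki_Content_File_Exports.
from urllib.request import urlopen

BASE_URLS = {
    "history": "https://dumps.wikimedia.org/other/mediawiki_content_history/",
    "current": "https://dumps.wikimedia.org/other/mediawiki_content_current/",
}


def fetch_index(dataset: str) -> str:
    """Return the raw HTML directory listing for 'history' or 'current'."""
    with urlopen(BASE_URLS[dataset]) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Example usage (performs a network request):
#   html = fetch_index("current")
#   print(html[:500])  # inspect the listing for per-wiki entries
```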
You can read the full announcement here:
https://lists.wikimedia.org/hyperkitty/list/[email protected]/thread/E6D5EU4PMSTSOI2J7A46HJ3YW2W554CS/,
and view the full documentation at:
https://wikitech.wikimedia.org/wiki/MediaWiki_Content_File_Exports.
Best,
Kinneret
--
Kinneret Gordon
Lead Research Community Officer
Wikimedia Foundation <https://wikimediafoundation.org/>
*Learn more about Wikimedia Research <https://research.wikimedia.org/>*
_______________________________________________
Wiki-research-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]