https://bugzilla.wikimedia.org/show_bug.cgi?id=45974
Web browser: ---
Bug ID: 45974
Summary: Publish a metadata file for each multipart dump
Product: Datasets
Version: unspecified
Hardware: All
OS: All
Status: NEW
Severity: enhancement
Priority: Unprioritized
Component: General/Unknown
Assignee: [email protected]
Reporter: [email protected]
CC: [email protected]
Classification: Unclassified
Mobile Platform: ---
Currently there is no supported way to programmatically determine the names of
all the parts of a multipart dump.
As far as I know, only the English Wikipedia currently uses multipart dump
files.
Most such dumps are split into exactly 27 parts with names in the following
format:
enwiki-20130204-pages-meta-current1.xml-p000000010p000010000.bz2
enwiki-20130204-pages-articles27.xml-p029625017p038424363.bz2
Even if we assume that each such dump will always have exactly 27 parts, we can
predict only the portion of each file name before the .xml suffix. We still
have no way to know the portion between the .xml suffix and the .bz2 suffix.
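To illustrate which fields are fixed and which are not, here is a rough sketch that splits a part name into its components. The pattern is inferred only from the example names in this report, not from any official specification, and the field names (wiki, date, kind, part, first, last) are made up for illustration. The two numbers after the .xml suffix look like a range of page IDs, but that is an assumption.

```python
import re

# Pattern inferred from the example file names in this report (an assumption,
# not an official naming specification).
PART_NAME = re.compile(
    r"^(?P<wiki>[a-z_]+)-(?P<date>\d{8})-(?P<kind>[a-z-]+)(?P<part>\d+)"
    r"\.xml-p(?P<first>\d+)p(?P<last>\d+)\.bz2$"
)

def parse_part_name(name):
    """Split a multipart dump file name into its fields, or return None."""
    match = PART_NAME.match(name)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so that leading zeros are dropped.
    for key in ("part", "first", "last"):
        fields[key] = int(fields[key])
    return fields

info = parse_part_name(
    "enwiki-20130204-pages-articles27.xml-p029625017p038424363.bz2"
)
print(info)
```

Everything up to the part number can be generated in advance; the first and last values cannot, which is the problem described above.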
But then we have the full history dumps, for which each of the 27 parts is
itself split into further parts. Examples:
enwiki-20130204-pages-meta-history1.xml-p000000010p000002141.bz2
enwiki-20130204-pages-meta-history1.xml-p000002142p000004315.bz2
enwiki-20130204-pages-meta-history1.xml-p000004318p000005912.bz2
enwiki-20130204-pages-meta-history1.xml-p000005913p000008179.bz2
enwiki-20130204-pages-meta-history1.xml-p000008180p000009875.bz2
enwiki-20130204-pages-meta-history1.xml-p000009877p000010000.bz2
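The ranges in these names cannot be derived from one another either. As a quick check (a sketch only, again assuming the two numbers are the first and last page ID of each part), the following looks for places where one part does not start immediately after the previous one ends:

```python
import re

# Matches the trailing "-pNNNpNNN.bz2" portion seen in the examples above.
RANGE = re.compile(r"\.xml-p(\d+)p(\d+)\.bz2$")

names = [
    "enwiki-20130204-pages-meta-history1.xml-p000000010p000002141.bz2",
    "enwiki-20130204-pages-meta-history1.xml-p000002142p000004315.bz2",
    "enwiki-20130204-pages-meta-history1.xml-p000004318p000005912.bz2",
    "enwiki-20130204-pages-meta-history1.xml-p000005913p000008179.bz2",
    "enwiki-20130204-pages-meta-history1.xml-p000008180p000009875.bz2",
    "enwiki-20130204-pages-meta-history1.xml-p000009877p000010000.bz2",
]

def find_gaps(file_names):
    """Return (previous_last, next_first) pairs where the ranges are not contiguous."""
    ranges = sorted(
        (int(m.group(1)), int(m.group(2)))
        for m in (RANGE.search(name) for name in file_names)
        if m is not None
    )
    return [
        (prev_last, next_first)
        for (_, prev_last), (next_first, _) in zip(ranges, ranges[1:])
        if next_first != prev_last + 1
    ]

print(find_gaps(names))  # [(4315, 4318), (9875, 9877)]
```

Because of gaps like these, a client cannot compute the next file name from the previous one even if it knows where each part ends.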
The only way to automate downloading all the parts of a dump today is to parse
the HTML pages that describe the dumps, such as
http://dumps.wikimedia.org/enwiki/20130204/
But this is not officially supported. To support it, we would have to
officially standardize the HTML format of those pages and ensure that it
doesn't change.
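For reference, the scraping workaround looks roughly like the sketch below. It assumes the page links to each part with an ordinary anchor tag whose href ends in .bz2; the sample markup is invented for illustration and is not copied from the real page, which is exactly why this approach is fragile.

```python
from html.parser import HTMLParser

class DumpLinkCollector(HTMLParser):
    """Collect the file names of all links whose target ends in .bz2."""

    def __init__(self):
        super().__init__()
        self.file_names = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # An attribute without a value is reported as None, hence the "or".
        href = dict(attrs).get("href") or ""
        if href.endswith(".bz2"):
            # Keep only the last path component, i.e. the file name.
            self.file_names.append(href.rsplit("/", 1)[-1])

def list_bz2_files(html_text):
    collector = DumpLinkCollector()
    collector.feed(html_text)
    return collector.file_names

# Hypothetical markup: the real page layout is not documented and may differ.
sample_page = """
<ul>
  <li><a href="/enwiki/20130204/enwiki-20130204-pages-meta-history1.xml-p000000010p000002141.bz2">part</a></li>
  <li><a href="/enwiki/20130204/enwiki-20130204-pages-meta-history1.xml-p000002142p000004315.bz2">part</a></li>
  <li><a href="/enwiki/20130204/index.html">index</a></li>
</ul>
"""

print(list_bz2_files(sample_page))
```

Any change to how the links are rendered would silently break a client written this way.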
A much more stable and future-proof option would be to publish a simple XML or
other plain-text file for each multipart dump that lists at least the full file
name of each part. Other helpful information could also be included.
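As a starting point for discussion, here is one possible shape for such a file, with a sketch of how it could be written and read. The element and attribute names (dump, file, wiki, date, name) are placeholders I made up, not an existing format.

```python
import xml.etree.ElementTree as ET

def build_manifest(wiki, date, file_names):
    """Serialize the list of part names as a small XML document (hypothetical format)."""
    root = ET.Element("dump", {"wiki": wiki, "date": date})
    for name in file_names:
        ET.SubElement(root, "file", {"name": name})
    return ET.tostring(root, encoding="unicode")

def read_manifest(xml_text):
    """Return the part names listed in a manifest produced by build_manifest."""
    root = ET.fromstring(xml_text)
    return [entry.get("name") for entry in root.findall("file")]

parts = [
    "enwiki-20130204-pages-meta-history1.xml-p000000010p000002141.bz2",
    "enwiki-20130204-pages-meta-history1.xml-p000002142p000004315.bz2",
]
manifest = build_manifest("enwiki", "20130204", parts)
print(manifest)
print(read_manifest(manifest))
```

A client would then need to fetch only this one file to learn every part name, and extra attributes such as sizes or checksums could be added later without breaking readers that ignore them.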