Summary: do we really need to recombine stub and page file
chunks into single huge files?
Product: XML Snapshots
We run the en wikipedia dumps by producing multiple stub and page text files,
instead of one huge stub file and one huge page/meta/history file.
Recombining these into one file takes a long time; for the stubs it's not
horrible, as those files are smaller, but for the history files it is extremely
time-intensive (about 2 weeks). We could shorten that for the bz2 files by
working on dbzip2, Brion's parallel bzip2 project from 2008, but we probably
can't do anything to speed up the recombine of the 7z files.
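One reason the bz2 recombine may be cheaper than the 7z one: concatenated bz2 streams form a valid multi-stream bz2 file, which standard tools (bzcat, Python's bz2 module) decompress transparently, whereas 7z archives cannot be joined by simple concatenation. A minimal sketch of that property (this ignores the separate issue that each chunk's XML header/footer would need stripping to yield one well-formed document):

```python
import bz2

# Two separately compressed "chunks", as independent dump jobs would produce.
chunk1 = bz2.compress(b"<page>one</page>\n")
chunk2 = bz2.compress(b"<page>two</page>\n")

# Plain byte concatenation yields a valid multi-stream bz2 file;
# bz2.decompress (and bzcat) read all streams back to back.
combined = chunk1 + chunk2
assert bz2.decompress(combined) == b"<page>one</page>\n<page>two</page>\n"
```

So for bz2 output, "recombining" could in principle be little more than a cat, with most of the real work being the XML cleanup between chunks.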
Do we really need to provide one huge file for these things? Example: the
combined bz2 history file is around 300 GB, and the combined 7z file is around
32 GB. And it will only get worse. Are several smaller files OK? Maybe we can
just skip this step.
This needs community discussion: are the whole files useful? What happens if
we wind up running 50 jobs and producing 50 pieces? Is that just too annoying?
Or is it actually better, because people can process those 50 files in parallel
at home? Would it be better if we served up, say, no more than 20 separate
pieces? Do people care at all, as long as they get the data on a regular basis?
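To make the "process the pieces in parallel at home" argument concrete, here is a hedged sketch of what a downstream consumer could do with N pieces instead of one huge file: fan the files out over worker processes, each streaming its own bz2 piece. The filename pattern and the page-counting task are hypothetical, just stand-ins for real per-piece work:

```python
import bz2
import glob
from multiprocessing import Pool

def count_pages(path):
    # Stream-decompress one dump piece and count its <page> elements,
    # without ever holding the whole file in memory.
    n = 0
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            n += line.count("<page>")
    return n

if __name__ == "__main__":
    # Hypothetical naming scheme for the split history files.
    pieces = sorted(glob.glob("pages-meta-history*.xml.bz2"))
    with Pool() as pool:
        # Each worker handles one piece; results are summed at the end.
        print(sum(pool.map(count_pages, pieces)))
```

With one monolithic 300 GB file, this kind of trivially parallel consumption is not possible without first splitting the file again, which is a point in favor of shipping the pieces.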
Wikibugs-l mailing list