https://bugzilla.wikimedia.org/show_bug.cgi?id=27114

           Summary: do we really need to recombine stub and page file
                    chunks into single huge files?
           Product: XML Snapshots
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: Normal
         Component: General
        AssignedTo: ar...@wikimedia.org
        ReportedBy: ar...@wikimedia.org
                CC: tf...@wikimedia.org
            Blocks: 27110


We run the English Wikipedia dumps by producing multiple stub and page text
files, instead of one huge stub file and one huge page/meta/history file.

Recombining these into one file takes a long time. For the stubs it's not
horrible, since those files are smaller, but for the history files it is
extremely time-intensive (about two weeks).  We could shorten that for the bz2
files by working on dbzip2, Brion's parallel bzip2 project from 2008, but we
probably can't do anything to speed up the recombine of the 7z files.
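
For context, here is a minimal Python sketch of roughly what the recombine
step amounts to, assuming each chunk is a complete <mediawiki> XML document
whose duplicate header and footer must be stripped (the file names are
hypothetical and the real stripping logic is more careful). The point is that
every byte of every chunk gets decompressed and recompressed, serially:

import bz2

# Hypothetical chunk names; the real dump files use a different scheme.
chunks = ["history1.xml.bz2", "history2.xml.bz2", "history3.xml.bz2"]

with bz2.open("history-all.xml.bz2", "wt", encoding="utf-8") as out:
    for i, name in enumerate(chunks):
        first = (i == 0)
        last = (i == len(chunks) - 1)
        in_header = True
        with bz2.open(name, "rt", encoding="utf-8") as chunk:
            for line in chunk:
                if in_header:
                    # Keep the <mediawiki>/<siteinfo> header only once.
                    if first:
                        out.write(line)
                    if "</siteinfo>" in line:
                        in_header = False
                    continue
                # Keep the closing </mediawiki> only from the last chunk.
                if "</mediawiki>" in line and not last:
                    continue
                out.write(line)

Even with the stripping made cheap, the single recompression stream is the
bottleneck, which is where something like dbzip2 would help for bz2.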

Do we really need to provide one huge file for these things?  For example, the
combined bz2 history file is around 300 GB and the combined 7z file is around
32 GB, and those numbers will only grow.  Would several smaller files be OK?
Maybe we can just skip this step.

This needs community discussion: are the combined files actually useful?  What
happens if we wind up running 50 jobs and producing 50 pieces?  Is that just
too annoying, or is it actually better, because people can process those 50
files in parallel at home (see the sketch below)?  Would it be better if we
served up, say, no more than 20 separate pieces?  Do people care at all, as
long as they get the data on a regular basis?
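
For what it's worth, the split layout is easy to exploit on the consumer side.
A hedged Python sketch (the chunk naming and the per-chunk work here are
hypothetical stand-ins) that processes one chunk per CPU core:

import bz2
import glob
from multiprocessing import Pool

def count_revisions(path):
    # Stand-in for real per-chunk work: count <revision> elements.
    n = 0
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            n += line.count("<revision>")
    return path, n

if __name__ == "__main__":
    # Hypothetical chunk names; one worker per chunk, up to CPU count.
    paths = sorted(glob.glob("pages-meta-history*.xml.bz2"))
    with Pool() as pool:
        for path, n in pool.map(count_revisions, paths):
            print(path, n)

With one combined file, the same job has to stream through a single 300 GB
bz2 archive sequentially instead.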
