On 26-10-2010, Tue, at 16:25 +0200, Platonides wrote:
> Robert Rohde wrote:
> > Many of the things done for the statistical analysis of database dumps
> > should be suitable for parallelization (e.g. break the dump into
> > chunks, process the chunks in parallel and sum the results).  You
> > could talk to Erik Zachte.  I don't know if his code has already been
> > designed for parallel processing though.
> 
> I don't think it's a good candidate, since you are presumably using
> compressed files, and their decompression forces a linear pass (and is
> most likely the bottleneck, too).

If one were clever (and I have some code that would enable one to be
clever), one could seek to some point in the (bzip2-compressed) file and
decompress from there before processing.  Running a bunch of jobs, each
decompressing only its own small piece, then becomes feasible.  I don't
have code that does this for gz or 7z; as far as I know, those formats
do not compress in discrete blocks.
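
For what it's worth, here is a minimal sketch of the idea in Python 3
(not the code referred to above): it assumes you already know a byte
offset at which a fresh bzip2 stream begins, e.g. from a separately
built index, and decompresses just that piece.  The file name, the
offsets, and the count_pages job are hypothetical placeholders.

    import bz2
    from multiprocessing import Pool

    def read_piece(path, offset):
        # Decompress one bzip2 stream starting at a known byte offset.
        decomp = bz2.BZ2Decompressor()
        chunks = []
        with open(path, 'rb') as f:
            f.seek(offset)
            while not decomp.eof:
                data = f.read(1024 * 1024)
                if not data:
                    break
                chunks.append(decomp.decompress(data))
        return b''.join(chunks)

    def count_pages(path, offset):
        # Hypothetical per-piece job: count <page> tags in this piece.
        return read_piece(path, offset).count(b'<page>')

    if __name__ == '__main__':
        path = 'pages-articles.xml.bz2'
        offsets = [0, 1048576, 2097152]  # would come from an index in practice
        with Pool() as pool:
            counts = pool.starmap(count_pages, [(path, o) for o in offsets])
        print(sum(counts))

Each worker decompresses only its own piece and the per-piece results
are summed at the end, so decompression is no longer a single serial
pass over the whole dump.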

Ariel


