On Tue, 26-10-2010 at 16:25 +0200, Platonides wrote:
> Robert Rohde wrote:
> > Many of the things done for the statistical analysis of database dumps
> > should be suitable for parallelization (e.g. break the dump into
> > chunks, process the chunks in parallel and sum the results). You
> > could talk to Erik Zachte. I don't know if his code has already been
> > designed for parallel processing though.
>
> I don't think it's a good candidate, since you are presumably using
> compressed files, and decompression linearises the work (and is most
> likely the bottleneck, too).
If one were clever (and I have some code that would enable one to be
clever), one could seek to some point in the (bzip2-compressed) file and
uncompress from there before processing. Running a bunch of jobs, each
decompressing only its own small piece, then becomes feasible.

I don't have code that does this for gz or 7z; afaik those formats do not
do compression in discrete blocks.

Ariel

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
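[For anyone curious how the seeking trick above can work: every bzip2 block
begins with the 48-bit magic number 0x314159265359, so candidate block
boundaries can be located by scanning the compressed file for that byte
pattern. A minimal Python sketch, assuming the function name and constant are
mine; it only finds byte-aligned matches, whereas blocks after the first can
start at arbitrary *bit* offsets, so real seek-and-decompress code must also
test bit-shifted copies of the pattern:]

```python
import bz2

# 48-bit magic that starts every bzip2 block (the BCD digits of pi).
# NOTE: name and approach are illustrative, not the code from the post.
BLOCK_MAGIC = b"\x31\x41\x59\x26\x53\x59"


def find_block_starts(data: bytes) -> list:
    """Return byte offsets of byte-aligned bzip2 block magic occurrences.

    Blocks after the first may begin mid-byte, so this sketch can miss
    them; a complete scanner would also match the magic at each of the
    seven possible bit shifts.
    """
    offsets = []
    pos = data.find(BLOCK_MAGIC)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(BLOCK_MAGIC, pos + 1)
    return offsets


# Quick sanity check: in a freshly compressed stream the first block
# follows the 4-byte "BZh9" stream header, so the first hit is at offset 4.
compressed = bz2.compress(b"wiki dump test payload " * 200)
print(find_block_starts(compressed)[0])
```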