On Wed, 27-10-2010, at 00:05 +0200, Ángel González wrote:
> Ariel T. Glenn wrote:
> > If one were clever (and I have some code that would enable one to be
> > clever), one could seek to some point in the (bzip2-compressed) file and
> > uncompress from there before processing.  Running a bunch of jobs each
> > decompressing only their small piece then becomes feasible.  I don't
> > have code that does this for gz or 7z; afaik these do not do compression
> > in discrete blocks.
> > 
> > Ariel
> 
> The bzip2recover approach?
> I am not sure how much gain there will be after so much bit shifting.
> Also, I was unable to continue from a flushed point; it may not be so easy.
> OTOH, if you already have an index and the blocks end at page boundaries
> (which is what I was doing), it becomes trivial.
> Remember that the reader of the previous block MUST continue past the
> block boundary, up to the point inside the next block where the next
> reader started processing. And unlike what ttsiod said, you do encounter
> tags split between blocks in normally compressed output.

I am able (using Python bindings to the bzip2 library and some fiddling)
to seek to an arbitrary point, find the first block after the seek
point, and uncompress it and the following blocks in sequence.  That is
sufficient for our work when we are dealing with compressed files around
250 GB in size.
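
For the curious, a rough sketch of the idea in pure Python.  This is
not the actual code; the function names are made up, a production
version would stream rather than read the whole tail into memory, and
the bit scan below is far too slow for real use:

    import bz2

    BLOCK_MAGIC = 0x314159265359   # 48-bit signature starting every bzip2 block


    def find_block_bit_offset(data):
        # bzip2 blocks are bit-aligned, not byte-aligned, so scan bit by
        # bit for the 48-bit block signature.  (The signature can in
        # principle also occur by chance inside compressed data.)
        window, mask = 0, (1 << 48) - 1
        for i, byte in enumerate(data):
            for b in range(7, -1, -1):
                window = ((window << 1) | ((byte >> b) & 1)) & mask
                bit_end = i * 8 + (7 - b)
                if bit_end >= 47 and window == BLOCK_MAGIC:
                    return bit_end - 47    # bit offset where the signature starts
        return None


    def shift_left(data, nbits):
        # Shift a byte string left by 0-7 bits so a block becomes byte-aligned.
        if nbits == 0:
            return bytes(data)
        out = bytearray()
        for i in range(len(data) - 1):
            out.append(((data[i] << nbits) | (data[i + 1] >> (8 - nbits))) & 0xFF)
        out.append((data[-1] << nbits) & 0xFF)
        return bytes(out)


    def decompress_from(path, byte_offset):
        # Seek to an arbitrary byte offset, find the first block after it,
        # and decompress from there to the end of the file.
        with open(path, "rb") as f:
            header = f.read(4)             # b"BZh" plus the block-size digit
            f.seek(byte_offset)
            tail = f.read()                # sketch only; real code would stream

        bit = find_block_bit_offset(tail)
        if bit is None:
            return b""
        aligned = shift_left(tail[bit // 8:], bit % 8)

        out = bytearray()
        decomp = bz2.BZ2Decompressor()
        stream = header + aligned
        try:
            for i in range(0, len(stream), 1 << 20):
                out += decomp.decompress(stream[i:i + (1 << 20)])
        except OSError:
            # Expected at end of input: the stream-level combined CRC covers
            # the blocks we skipped.  Per-block CRCs have already been checked.
            pass
        return bytes(out)

Prepending the file's own 4-byte header to the byte-aligned block data
lets the stock decompressor pick up at the block boundary; only the
final whole-stream CRC check fails, which we tolerate.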

We process everything by pages, so we ensure that each reader handles
only its assigned page range from the file.  This avoids overlaps.
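
A toy illustration of the no-overlap rule (the same point Ángel makes
above).  The helper pages_owned and the use of byte offsets as range
bounds are inventions for this sketch, not the actual dump scripts:

    PAGE_OPEN, PAGE_CLOSE = b"<page>", b"</page>"


    def pages_owned(text, start, end):
        # Yield every complete <page> element whose opening tag begins in
        # [start, end) of the decompressed dump text.  A page straddling
        # `end` is finished by this reader and skipped by the next one,
        # whose range starts at `end`, so no page is read twice or lost.
        pos = text.find(PAGE_OPEN, start)
        while pos != -1 and pos < end:
            close = text.find(PAGE_CLOSE, pos)
            if close == -1:
                break                      # truncated input
            close += len(PAGE_CLOSE)
            yield text[pos:close]          # may extend past `end`
            pos = text.find(PAGE_OPEN, close)

Worker k would then process pages_owned(text, bounds[k], bounds[k + 1])
for disjoint, contiguous bounds.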

We don't build an index; we're only parallelizing 10-20 jobs at once,
not all 21 million pages, so an index would not be worth the effort.

Ariel


