On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen <[email protected]> wrote:
> ERSEK Laszlo wrote:
>> ** 4. Thanassis Tsiodras' offline reader, available under
>>
>> http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
>>
>> uses, according to section "Seeking in the dump file", bzip2recover to
>> split the bzip2 blocks out of the single bzip2 stream. The page states
>>
>>       This process is fast (since it involves almost no CPU calculations
>>
>> While this may be true relative to other dump-processing operations,
>> bzip2recover is, in fact, not much more than a huge single-threaded
>> bit-shifter, which even makes two passes over the dump. (IIRC, the first
>> pass shifts over the whole dump to find bzip2 block delimiters, then the
>> second pass shifts the blocks found previously into byte-aligned, separate
>> bzip2 streams.)
>
> Hmm?  Admittedly, I don't know the bzip2 format very well, but as far as
> I understand it, there should be no bit-shifting involved: each block in
> the stream is a completely independent, self-contained sequence of bytes.

I believe the point is that each block is a self-contained sequence of
bits, not bytes, so a block can terminate in the middle of a byte.  The
next block is appended immediately (if I understand correctly), so
block boundaries do not necessarily align to byte boundaries.  Hence
the need to do bit shifting.
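
To make the alignment point concrete, here is a minimal sketch (not
bzip2recover's actual code) of what a bit-level scan for block starts
could look like.  It assumes the 48-bit block-header magic
0x314159265359 and MSB-first bit packing; the name
find_block_bit_offsets is made up for illustration.

```python
import bz2

# 48-bit magic that opens every bzip2 block (the BCD digits of pi).
BLOCK_MAGIC = 0x314159265359
MAGIC_BITS = 48
MASK = (1 << MAGIC_BITS) - 1


def find_block_bit_offsets(data):
    """Return the bit offset at which each block-header magic starts.

    Hypothetical helper: it slides a 48-bit window over the input one
    bit at a time, because a block may begin at any bit position.
    """
    offsets = []
    window = 0
    for bit_index in range(len(data) * 8):
        byte = data[bit_index >> 3]
        # bzip2 packs bits most-significant bit first within each byte.
        bit = (byte >> (7 - (bit_index & 7))) & 1
        window = ((window << 1) | bit) & MASK
        if bit_index >= MAGIC_BITS - 1 and window == BLOCK_MAGIC:
            offsets.append(bit_index - MAGIC_BITS + 1)
    return offsets
```

For a small input such as bz2.compress(b"hello"), the first offset
should be 32, right after the 4-byte "BZh" stream header.  For input
spanning several blocks, the later offsets are generally not multiples
of 8, which is why each block has to be shifted before it can be written
out as a byte-aligned stream of its own.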

-Robert Rohde

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
