On Thu, Mar 26, 2009 at 12:09 PM, Ilmari Karonen <[email protected]> wrote:
> ERSEK Laszlo wrote:
>> ** 4. Thanassis Tsiodras' offline reader, available under
>>
>> http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html
>>
>> uses, according to section "Seeking in the dump file", bzip2recover to
>> split the bzip2 blocks out of the single bzip2 stream. The page states
>>
>> This process is fast (since it involves almost no CPU calculations
>>
>> While this may be true relative to other dump-processing operations,
>> bzip2recover is, in fact, not much more than a huge single-threaded
>> bit-shifter, which even makes two passes over the dump. (IIRC, the first
>> pass shifts over the whole dump to find bzip2 block delimiters, then the
>> second pass shifts the blocks found previously into byte-aligned, separate
>> bzip2 streams.)
>
> Hmm? Admittedly, I don't know the bzip2 format very well, but as far as
> I understand it, there should be no bit-shifting involved: each block in
> the stream is a completely independent, self-contained sequence of bytes.
I believe the point is that each block is a self-contained sequence of
bits, not bytes, so a block can end in the middle of a byte. The next
block is appended immediately (if I understand correctly), so block
boundaries do not necessarily align to byte boundaries. Hence the need
for bit shifting.

-Robert Rohde

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
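
To illustrate the alignment point, here is a rough Python sketch that
scans a bzip2 stream one bit at a time for the 48-bit block marker
0x314159265359 and reports where each block starts. It is not how
bzip2recover itself is written, only the same idea in miniature. The
function name find_block_offsets and the random test payload are made up
for the example, and compresslevel=1 is chosen only so that a small
input spans more than one block.

```python
import bz2
import random

BLOCK_MAGIC = 0x314159265359   # 48-bit marker at the start of every bzip2 block
MASK48 = (1 << 48) - 1


def find_block_offsets(data):
    """Return the bit offset of every block marker found in `data`."""
    offsets = []
    window = 0
    consumed = 0                       # number of bits read so far
    for byte in data:
        for bit in range(7, -1, -1):   # bzip2 packs bits most-significant first
            window = ((window << 1) | ((byte >> bit) & 1)) & MASK48
            consumed += 1
            if consumed >= 48 and window == BLOCK_MAGIC:
                offsets.append(consumed - 48)
    return offsets


# Hypothetical test input: incompressible bytes, so the 100k block size of
# compresslevel=1 is exceeded and the stream holds more than one block.
rng = random.Random(0)
payload = bytes(rng.getrandbits(8) for _ in range(150_000))
stream = bz2.compress(payload, compresslevel=1)

offsets = find_block_offsets(stream)
for off in offsets:
    print(f"block at bit {off}: byte {off // 8}, bit {off % 8} within that byte")
```

The first block sits right after the 4-byte stream header, so it is
byte-aligned. Later blocks start wherever the previous one happened to
end, which is why a tool that wants each block as a separate
byte-aligned stream has to shift the bits of every block after the
first.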
