Robert Rohde wrote:
> However, to work with ruwiki, for example, one generally needs to
> decompress it to the full 170 GB.  To work with enwiki's full revision
> history, if such a dump is ever to exist again, would probably
> decompress to ~2 TB.  7z and bz2 are not great formats if one wants to
> extract only portions of the dump since there are few tools that would
> allow one to do so without first reinflating the whole file.  Hence,
> one of the advantages I see in my format is being able to have a dump
> that is still <10% the full inflated size while also being able to
> parse out selected articles or selected revisions in a straightforward
> manner.
> 
> -Robert Rohde

Bzipping the pages in blocks, as I did for my offline reader, produces a
file size similar to the original's.*
There may be ways to get similar results without having to rebuild the
revisions.
Also note that in both cases you still need an intermediate app to
provide input dumps for those tools.

*112% when measuring enwiki-20081008-pages-meta-current. Looking at
ruwiki-20081228-history, both the original bz2 and my faster-access
version are 8.2 GB.
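For anyone curious how block-wise bzipping enables random access: because each bz2 block is an independent stream, you can record byte offsets while writing and later seek straight to one page's block without inflating the rest of the file. A minimal sketch (the function names and the in-memory index are my own illustration, not the offline reader's actual format):

```python
import bz2

def build_blocked_dump(pages, out_path):
    """Compress each page as an independent bz2 stream, concatenated
    into one file. Returns an index of (title, offset, length) so a
    single page can later be read without inflating the whole dump."""
    index = []
    offset = 0
    with open(out_path, "wb") as f:
        for title, text in pages:
            block = bz2.compress(text.encode("utf-8"))
            f.write(block)
            index.append((title, offset, len(block)))
            offset += len(block)
    return index

def read_page(path, index, title):
    """Seek directly to the page's block and decompress only that block."""
    for t, off, length in index:
        if t == title:
            with open(path, "rb") as f:
                f.seek(off)
                return bz2.decompress(f.read(length)).decode("utf-8")
    raise KeyError(title)
```

Compressing per page (or per small group of pages) is what costs the extra ~12% over one big stream: bzip2 can no longer exploit redundancy across block boundaries, and each block carries its own header. In a real tool the index would of course be persisted alongside the dump rather than kept in memory.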

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l