One thing I can't understand is why we are extracting page summaries for
Yahoo. Is it our job to do that? The dumps are really huge, e.g. for
Wikidata <http://dumps.wikimedia.org/wikidatawiki/20140106/>:

wikidatawiki-20140106-abstract.xml (14.1 GB)
<http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-abstract.xml>

Compare that to the full history:

wikidatawiki-20140106-pages-meta-history.xml.bz2 (8.8 GB)
<http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-pages-meta-history.xml.bz2>

So why are we doing this?
Best


On Wed, Jan 22, 2014 at 4:10 AM, Anthony <o...@theendput.com> wrote:

> If you're going to use xz then you wouldn't even have to recompress the
> blocks that haven't changed and are already well compressed.
>
>
> On Tue, Jan 21, 2014 at 5:26 PM, Randall Farmer <rand...@wawd.com> wrote:
>
> > Ack, sorry for the (no subject); again in the right thread:
> >
> > > For external uses like XML dumps integrating the compression
> > > strategy into LZMA would however be very attractive. This would also
> > > benefit other users of LZMA compression like HBase.
> >
> > For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
> >
> > That has a 4 MB buffer, compression ratios within 15-25% of
> > current 7zip (or histzip), and goes at 30MB/s on my box,
> > which is still 8x faster than the status quo (going by a 1GB
> > benchmark).
> >
> > Trying to get quick-and-dirty long-range matching into LZMA isn't
> > feasible for me personally and there may be inherent technical
> > difficulties. Still, I left a note on the 7-Zip boards as folks
> > suggested; feel free to add anything there:
> > https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
> >
> > Thanks for the reply,
> > Randall
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >
>



-- 
Amir
