One thing I can't understand is why we are extracting page summaries for Yahoo. Is that our job? The abstract dumps are really huge, e.g. for Wikidata <http://dumps.wikimedia.org/wikidatawiki/20140106/>:

wikidatawiki-20140106-abstract.xml <http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-abstract.xml>: 14.1 GB

Compare that to the full history:

wikidatawiki-20140106-pages-meta-history.xml.bz2 <http://dumps.wikimedia.org/wikidatawiki/20140106/wikidatawiki-20140106-pages-meta-history.xml.bz2>: 8.8 GB
So why are we doing this?

Best

On Wed, Jan 22, 2014 at 4:10 AM, Anthony <o...@theendput.com> wrote:

> If you're going to use xz then you wouldn't even have to recompress the
> blocks that haven't changed and are already well compressed.
>
>
> On Tue, Jan 21, 2014 at 5:26 PM, Randall Farmer <rand...@wawd.com> wrote:
>
> > Ack, sorry for the (no subject); again in the right thread:
> >
> > > For external uses like XML dumps integrating the compression
> > > strategy into LZMA would however be very attractive. This would also
> > > benefit other users of LZMA compression like HBase.
> >
> > For dumps or other uses, 7za -mx=3 / xz -3 is your best bet.
> >
> > That has a 4 MB buffer, compression ratios within 15-25% of
> > current 7zip (or histzip), and goes at 30MB/s on my box,
> > which is still 8x faster than the status quo (going by a 1GB
> > benchmark).
> >
> > Trying to get quick-and-dirty long-range matching into LZMA isn't
> > feasible for me personally and there may be inherent technical
> > difficulties. Still, I left a note on the 7-Zip boards as folks
> > suggested; feel free to add anything there:
> > https://sourceforge.net/p/sevenzip/discussion/45797/thread/73ed3ad7/
> >
> > Thanks for the reply,
> > Randall
> > _______________________________________________
> > Wikitech-l mailing list
> > Wikitech-l@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

--
Amir
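
For anyone who wants to try the "xz -3" setting quoted in the thread on their own data, here is a minimal sketch using Python's standard lzma module, whose preset argument takes the same 0 to 9 levels as the xz command line. The helper name compress_with_preset and the sample data are made up for illustration, and whatever numbers it prints on a given machine will not match the 30MB/s or the 15-25% figures quoted in the thread.

```python
import lzma
import time
from typing import Tuple


def compress_with_preset(data: bytes, preset: int) -> Tuple[bytes, float]:
    """Compress data as .xz at the given preset; return (output, seconds)."""
    start = time.perf_counter()
    packed = lzma.compress(data, format=lzma.FORMAT_XZ, preset=preset)
    return packed, time.perf_counter() - start


# Hypothetical stand-in for dump content: highly repetitive XML-like text.
sample = b"<page><title>Example</title><text>Some revision text.</text></page>\n" * 20000

if __name__ == "__main__":
    # Preset 3 is the level suggested in the thread; 6 is the xz default.
    for preset in (3, 6):
        packed, seconds = compress_with_preset(sample, preset)
        ratio = len(packed) / len(sample)
        print(f"preset={preset}: {len(packed)} bytes, ratio {ratio:.4f}, {seconds:.3f}s")
```

Real dump files are far larger and less uniform than this sample, so a fair comparison would need a slice of an actual dump.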