Re: [Wikitech-l] [Xmldatadumps-l] Compressing full-history dumps faster

Gabriel Wicke Tue, 21 Jan 2014 09:43:34 -0800

On 01/21/2014 01:23 AM, Randall Farmer wrote:
> Anyway, I'm saying too many fundamentally unimportant words. If the status
> quo re: compression in fact causes enough pain to give histzip a fuller
> look, or if there's some way to redirect the tech in it towards a useful
> end, it would be great to hear from interested folks; if not, it was fun
> work but there may not be much more to do or say.


Efficient compression with large match windows is also very interesting
for storing history in databases like Cassandra. When storing a wikitext
dump in Cassandra, gzip with its 32 KiB sliding window yields a database
size of about 16-18% of the input text size. This could be much better
if repetitions more than 32 KiB apart could be caught. With more verbose
HTML this matters even more, as more articles will be larger than 32 KiB.
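
To illustrate the window limit, here is a minimal sketch using only the
Python standard library. It is not how Cassandra or histzip work
internally; the input is a synthetic stand-in for two near-identical
revisions, and the helper names (make_repeated_input, compressed_sizes)
are made up for this example. A 64 KiB block of pseudo-random bytes is
repeated once, so the only redundancy lies 64 KiB back, outside DEFLATE's
32 KiB window but well inside the default LZMA dictionary.

```python
import lzma
import random
import zlib

BLOCK_SIZE = 64 * 1024  # deliberately larger than DEFLATE's 32 KiB window


def make_repeated_input(block_size: int = BLOCK_SIZE, seed: int = 0) -> bytes:
    """Return an incompressible block followed by an exact copy of itself.

    Hypothetical stand-in for two consecutive revisions of a large article.
    """
    rng = random.Random(seed)
    block = rng.getrandbits(8 * block_size).to_bytes(block_size, "big")
    return block + block


def compressed_sizes(data: bytes) -> tuple[int, int]:
    """Return (zlib_size, lzma_size) in bytes for the same input."""
    zlib_size = len(zlib.compress(data, 9))  # same DEFLATE format gzip uses
    lzma_size = len(lzma.compress(data))     # default preset, multi-MiB dictionary
    return zlib_size, lzma_size


if __name__ == "__main__":
    data = make_repeated_input()
    zlib_size, lzma_size = compressed_sizes(data)
    print(f"input: {len(data)} bytes")
    print(f"zlib:  {zlib_size} bytes")
    print(f"lzma:  {lzma_size} bytes")
```

On this input zlib cannot see the second copy as a repeat and stores
both halves nearly verbatim, while lzma encodes the second half as
back-references to the first. Real wikitext is far more compressible
than random bytes, so the absolute numbers will differ, but the effect
of the match distance is the same.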

For internal uses, tool support is not very important, so a port of
histzip / rzip could work well. For external uses like XML dumps,
however, integrating the compression strategy into LZMA would be very
attractive. This would also benefit other users of LZMA compression,
such as HBase.

Gabriel

