>
> For storing updateable indexes, Berkeley DB 4-5, GDBM, and higher-level
> options like SQLite are widely used. LevelDB
> <https://code.google.com/p/leveldb/> is pretty cool too.
>

I think that with the amount of data we're dealing with, it makes sense to
have the file format under tight control. For example, saving a single byte
on each revision means total savings of ~500 MB for enwiki.
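
To spell out that arithmetic: the ~500 MB figure implies on the order of
500 million revisions. Here is a rough sketch; the revision count is an
assumed round number taken from that estimate, not an exact enwiki
statistic, and bytes_saved is just an illustrative helper name.

```python
# Assumed round figure implied by the ~500 MB estimate; not an exact count.
ENWIKI_REVISIONS = 500_000_000


def bytes_saved(per_revision_bytes, revisions=ENWIKI_REVISIONS):
    """Total bytes saved when every revision record shrinks by a fixed amount."""
    return per_revision_bytes * revisions


if __name__ == "__main__":
    saved_mb = bytes_saved(1) / 10**6  # decimal megabytes
    print(f"{saved_mb:.0f} MB")  # prints "500 MB"
```

The same multiplication applies to any fixed per-revision overhead, which
is why small changes to the record layout matter at this scale.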

In any case, at this point it would be more work to switch to one of those
than to keep using the format I created.


> For delta coding, there's xdelta3 <http://xdelta.org/>,
> open-vcdiff <https://code.google.com/p/open-vcdiff/>, and Git's
> <http://stackoverflow.com/questions/9478023/is-the-git-binary-diff-algorithm-delta-storage-standardized>
> delta <https://github.com/git/git/blob/master/diff-delta.c>
> code <https://github.com/git/git/blob/master/patch-delta.c>.
> (rzip <http://rzip.samba.org/>/rsync are wicked awesome, but not as easy
> to just drop in as a library.)
>

I'm certainly going to try using one of those libraries for delta
compression, because they seem to do pretty much exactly what's needed
here. Thanks for the suggestions.
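
For anyone following along, here is a minimal sketch of what delta coding
between two revisions looks like. It is not the code of xdelta3,
open-vcdiff, or Git, and it is not what the dump format does; it uses
Python's standard difflib as a stand-in, and the names make_delta and
apply_delta are made up for illustration. The new revision is stored as
"copy this range from the old revision" and "insert these literal bytes"
instructions.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Encode `new` as copy/insert instructions against `old`."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the old revision: store only (offset, length).
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # Bytes that exist only in the new revision: store them literally.
            ops.append(("insert", new[j1:j2]))
        # "delete": nothing to emit, the old bytes are simply never copied.
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new revision from the old one plus the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += old[start:start + length]
        else:
            out += op[1]
    return bytes(out)


if __name__ == "__main__":
    old = b"The quick brown fox jumps over the lazy dog."
    new = b"The quick brown fox leaps over the lazy dog!"
    delta = make_delta(old, new)
    assert apply_delta(old, delta) == new
```

A real library would also pack the instructions into a compact binary
form and find matches much faster than difflib does; the sketch only
shows the copy/insert idea.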

Petr Onderka
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
