> For storing updateable indexes, Berkeley DB 4-5, GDBM, and higher-level
> options like SQLite are widely used.
> LevelDB <https://code.google.com/p/leveldb/> is pretty cool too.

I think that with the amount of data we're dealing with, it makes sense to
have the file format under tight control. For example, saving a single byte
on each revision means total savings of ~500 MB for enwiki. In any case, at
this point it would be more work to switch to one of those than to keep
using the format I created.

> For delta coding, there's xdelta3 <http://xdelta.org/>,
> open-vcdiff <https://code.google.com/p/open-vcdiff/>, and Git's
> <http://stackoverflow.com/questions/9478023/is-the-git-binary-diff-algorithm-delta-storage-standardized>
> delta <https://github.com/git/git/blob/master/diff-delta.c>
> code <https://github.com/git/git/blob/master/patch-delta.c>.
> (rzip <http://rzip.samba.org/>/rsync are wicked awesome, but not as easy
> to just drop in as a library.)

I'm certainly going to try to use some library for delta compression,
because they seem to do pretty much exactly what's needed here. Thanks for
the suggestions.

Petr Onderka
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
