Protocol Buffers are not a bad idea, but I'm not sure about their overhead.

AFAIK, PB have an overhead of at least 1 byte per field (the field tag).
If I'm counting correctly, with enwiki's 600M revisions and 8 fields per
revision, that means a total overhead of about 4.8 GB.
The fixed-size part of all revisions (i.e. without comment and text)
amounts to ~22 GB, so the tags alone would add roughly 20% on top of it.
I think this means PB have too much overhead.
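
A rough sketch of that estimate in Python, in case someone wants to check the
arithmetic. It counts only the tag byte(s) of each field, assumes the 8 fields
use field numbers 1 to 8, and uses the approximate figures quoted above rather
than measured ones. The helper names are made up for illustration, and
length-delimited fields would add a length prefix that is not counted here.

```python
# Back-of-the-envelope estimate of Protocol Buffers tag overhead.
# All figures are the rough ones from this thread, not measurements.

def varint_size(value: int) -> int:
    """Bytes needed to encode a non-negative int as a base-128 varint."""
    size = 1
    while value >= 0x80:
        value >>= 7
        size += 1
    return size

def tag_size(field_number: int, wire_type: int = 0) -> int:
    """Size of a field tag: (field_number << 3) | wire_type, varint-encoded."""
    return varint_size((field_number << 3) | wire_type)

REVISIONS = 600_000_000          # approximate enwiki revision count
FIELDS_PER_REVISION = 8          # assumption: field numbers 1..8
FIXED_SIZE_BYTES = 22 * 10**9    # ~22 GB of fixed-size metadata

per_revision = sum(tag_size(n) for n in range(1, FIELDS_PER_REVISION + 1))
total_overhead = REVISIONS * per_revision

print(f"tag bytes per revision: {per_revision}")                     # 8
print(f"total tag overhead: {total_overhead / 10**9:.1f} GB")        # 4.8 GB
print(f"relative to fixed-size data: {total_overhead / FIXED_SIZE_BYTES:.0%}")  # 22%
```

Tags stay at 1 byte only for field numbers up to 15; from 16 on they take
2 bytes, so the estimate above is a lower bound for messages with more fields.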

The overhead could be alleviated by using compression, but I didn't intend
to compress metadata.

So, I think I will start without PB. If I later decide to compress
metadata, I will also try to use PB and see if it works.

Also, I think that reading the binary format isn't going to be the biggest
issue if you're implementing your own library for incremental dumps,
especially if I'm going to use delta compression of revision texts.

Petr Onderka


On Mon, Jul 1, 2013 at 9:16 PM, Daniel Friesen
<dan...@nadir-seen-fire.com> wrote:

> Instead of XML "or" a proprietary binary format could we try using a
> standard binary format such as Protocol Buffers as a base to reduce the
> issues with having to implement the reading/writing in multiple languages?
>
> --
> ~Daniel Friesen (Dantman, Nadir-Seen-Fire) [http://danielfriesen.name/]
>
>
> On Mon, 01 Jul 2013 11:56:50 -0700, Tyler Romeo <tylerro...@gmail.com>
> wrote:
>
>> Petr is right on par with this one. The purpose of this version 2 for
>> dumps is to allow protocol-specific incremental updating of the dump,
>> which would be significantly more difficult in non-binary format.
>>
>> --
>> Tyler Romeo
>> Stevens Institute of Technology, Class of 2016
>> Major in Computer Science
>> www.whizkidztech.com | tylerro...@gmail.com
>>
>
_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
