Compressed XML is what the current dumps use, and it doesn't work well
because:
* it can't be edited
* it doesn't support seeking
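
To illustrate the seeking point, here is a minimal sketch of a layout that does
allow jumping straight to one record. It is not the proposed dump format; the
layout (a record count, then a table of offset/length pairs, then the payloads)
and the names write_indexed and read_record are made up for this example.

```python
import io
import struct


def write_indexed(records):
    """Serialize records as: count, (offset, length) table, payloads.

    All integers are unsigned 32-bit little-endian (an assumption of
    this sketch, not part of any real dump specification).
    """
    buf = io.BytesIO()
    header_size = 4 + 8 * len(records)  # count + one 8-byte entry per record
    buf.write(struct.pack("<I", len(records)))
    offset = header_size
    for record in records:
        buf.write(struct.pack("<II", offset, len(record)))
        offset += len(record)
    for record in records:
        buf.write(record)
    return buf.getvalue()


def read_record(f, index):
    """Read one record by seeking to its table entry, then to its payload."""
    f.seek(4 + 8 * index)
    offset, length = struct.unpack("<II", f.read(8))
    f.seek(offset)
    return f.read(length)


# Hypothetical sample data: 1000 tiny XML-like records.
records = [b"<page><id>%d</id></page>" % i for i in range(1000)]
blob = write_indexed(records)
print(read_record(io.BytesIO(blob), 900))
```

Reading record 900 touches only its table entry and its payload. With a single
compressed stream, everything before the record would have to be decompressed
first.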

I think any way of solving these problems will be "obscure" and will require
special code to read and write.
(And endianness is not a problem if the specification says which byte order
the format uses and the implementation sticks to it.)
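
As a small sketch of the endianness point: if the specification fixes the byte
order, the code states it explicitly and the result no longer depends on the
machine. The choice of little-endian and the helper names encode_u32 and
decode_u32 are assumptions made for this example only.

```python
import struct


def encode_u32(value):
    """Encode an unsigned 32-bit integer, always little-endian ("<")."""
    return struct.pack("<I", value)


def decode_u32(data):
    """Decode an unsigned 32-bit integer, always little-endian ("<")."""
    return struct.unpack("<I", data)[0]


value = 0x01020304
encoded = encode_u32(value)

# The byte layout is the same on every platform because the format
# string names the byte order instead of using the native one.
assert encoded == b"\x04\x03\x02\x01"
assert decode_u32(encoded) == value

# Reading the same bytes with the other byte order gives a different
# number, which is the problem a fixed specification avoids.
assert struct.unpack(">I", encoded)[0] == 0x04030201
```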

Theoretically, I could use compressed XML in internal data structures, but
I think that would just combine the disadvantages of both approaches.

So size is not the main reason not to use XML; it's just one of the
reasons.

Petr Onderka


On Mon, Jul 1, 2013 at 7:26 PM, <[email protected]> wrote:

> On 07/01/2013 12:48:11 PM, Petr Onderka - [email protected] wrote:
>
>> >
>> > What is the intended format of the dump files? The page makes it sound
>> > like it will have a binary format, which I'm not opposed to, but is
>> > definitely something you should decide on.
>> >
>>
>> Yes, it is a binary format, I will make that clearer on the page.
>>
>> The advantage of a binary format is that it's smaller, which I think is
>> quite important.
>>
>
> In my experience binary formats have very little to recommend them.
>
> They are definitely more obscure. They sometimes suffer from endian
> problems. They require special code to read and write.
>
> In my experience I have found that the notion that they offer an advantage
> by being "smaller" is somewhat misguided.
>
> In particular, with XML, there is generally a very high degree of
> redundancy in the text, far more than in normal writing.
>
> The consequence of this regularity is that text based XML often compresses
> very, very well.
>
> I remember one particular instance where we were generating 30-50
> Megabytes of XML a day and needed to send it from the USA to the UK every
> day, in a situation where our leased data rate was really limiting. We were
> surprised and pleased to discover that zipping the files reduced them to
> only 1-2 MB. I have been skeptical of claims that binary formats are more
> efficient on the wire (where it matters most) ever since.
>
> I think you should do some experiments versus compressed XML to justify
> your claimed benefits of using a binary format.
>
> Jim
>
> <snip>
>
> --
> Jim Laurino
> [email protected]
> Please direct any reply to the list.
> Only mail from the listserver reaches this address.
>
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>