On Mon, Jul 8, 2013 at 6:53 AM, Randall Farmer <[email protected]> wrote:
> > Keeping the dumps in a text-based format doesn't make sense, because
> > that can't be updated efficiently, which is the whole reason for the new
> > dumps.
>
> First, glad to see there's motion here.
>
> It's definitely true that recompressing the entire history to .bz2 or .7z
> goes very, very slowly. Also, I don't know of an existing tool that lets
> you just insert new data here and there without compressing all of the
> unchanged data as well. Those point towards some sort of format change.
>
> I'm not sure a new format has to be sparse or indexed to get around those
> two big problems.
>
> For full-history dumps, delta coding (or the related idea of long-range
> redundancy compression) runs faster than bzip2 or 7z and produces good
> compression ratios on full-history dumps, based on some tests
> <https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3>. (I'm going to
> focus mostly on full-history dumps here because they're the hard case and
> one Ariel said is currently painful--not everything here will apply to
> latest-revs dumps.)
>
> For inserting data, you do seemingly need to break the file up into
> independently-compressed sections containing just one page's revision
> history or a fragment of it, so you can add new diff(s) to a page's
> revision history without decompressing and recompressing the previous
> revisions. (Removing previously-dumped revisions is another story, but
> it's rarer.) You'd be in new territory just doing that; I don't know of
> existing compression tools that really allow that.
>
> You could do those two things, though, while still keeping full-history
> dumps a once-every-so-often batch process that produces a sorted file.
> The time to rewrite the file, stripped of the big compression steps,
> could be bearable--a disk can read or write about 100 MB/s, so just
> copying the 70G of the .7z enwiki dumps is well under an hour; if the
> part bound by CPU and other steps is smallish, you're OK.
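To make the "independently-compressed sections" idea above concrete, here is a toy sketch in Python. It is only an illustration of the general technique under my own assumptions, not code from any actual dump tool: each page's revision history is compressed as its own bz2 block, and a separate index maps page ids to byte offsets, so appending a page never touches previously written data. (The class and method names are made up for this sketch.)

```python
import bz2

class BlockDump:
    """Toy dump writer: each page's revisions are one independent bz2
    block; the index maps page_id -> (offset, length) into the data."""

    def __init__(self):
        self.data = bytearray()
        self.index = {}  # page_id -> (offset, length)

    def append_page(self, page_id, revisions):
        # Compress only this page's revisions; existing blocks are
        # never decompressed or rewritten when new pages arrive.
        block = bz2.compress("\n".join(revisions).encode("utf-8"))
        self.index[page_id] = (len(self.data), len(block))
        self.data.extend(block)

    def read_page(self, page_id):
        # Random access: the index lets us seek straight to one
        # page's block instead of decompressing the whole file.
        offset, length = self.index[page_id]
        block = bytes(self.data[offset:offset + length])
        return bz2.decompress(block).decode("utf-8").split("\n")

dump = BlockDump()
dump.append_page(1, ["revision 1 text", "revision 2 text"])
dump.append_page(2, ["other page, revision 1"])
assert dump.read_page(1) == ["revision 1 text", "revision 2 text"]
```

A real on-disk format would write blocks to a file and persist the index, but the tradeoff is the same: per-page blocks give cheap appends and random access at some cost in overall compression ratio versus one solid archive.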
> A format like the proposed one, with revisions inserted wherever there's
> free space when they come in, will also eventually fragment the revision
> history for one page (I think Ariel alluded to this in some early notes).
> Unlike sequential reads/writes, seeks are something HDDs are sadly pretty
> slow at (hence the excitement about solid-state disks); if thousands of
> revisions are coming in a day, it eventually becomes slow to read things
> in the old page/revision order, and you need fancy techniques to defrag
> (maybe a big external-memory sort
> <http://en.wikipedia.org/wiki/External_sorting>) or you need to only read
> the dump on fast hardware that can handle the seeks. Doing occasional
> batch jobs that produce sorted files could help avoid the fragmentation
> question.

These are some interesting ideas. You're right that copying the whole dump
is fast enough (it would probably add about an hour to a process that
currently takes several days). But it would also pretty much force the use
of delta compression. And while I would like to use delta compression, I
don't think it's a good idea to be forced to use it, because I might not
have the time for it or it might not be good enough. Because of that, I
decided to stay with my indexed approach.

> There's a great quote about the difficulty of "constructing a software
> design...to make it so simple that there are obviously no deficiencies."
> (Wikiquote came through with the full text/attribution, of course
> <http://en.wikiquote.org/wiki/C._A._R._Hoare>.)
>
> I admit it's tricky and people can disagree about what's simple enough or
> even what approach is simpler of two choices, but it's something to
> strive for.
>
> Anyway, I'm wary about going into the technical weeds of other folks'
> projects, because, hey, it's your project!
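Going back to the defragmentation point above: the external-memory sort Randall links would rewrite a fragmented dump back into page/revision order using only sequential I/O. A minimal sketch of the technique, with in-memory lists standing in for the temporary run files a real implementation would use (the function and parameter names are my own, purely illustrative):

```python
import heapq

def external_sort(records, run_size=2):
    """Sketch of an external merge sort: split the input into runs
    that each fit in memory, sort each run, then k-way merge them."""
    runs = []
    for i in range(0, len(records), run_size):
        # In a real tool each sorted run would be a temp file on disk.
        runs.append(sorted(records[i:i + run_size]))
    # heapq.merge streams the merge, reading every run sequentially --
    # this is what avoids the seek-heavy access pattern on HDDs.
    return list(heapq.merge(*runs))

# Revisions arrived out of order; re-sort into (page_id, rev_id) order.
incoming = [(2, 5), (1, 1), (3, 9), (1, 2), (2, 4)]
assert external_sort(incoming) == [(1, 1), (1, 2), (2, 4), (2, 5), (3, 9)]
```

With disk-backed runs this sorts data far larger than RAM while keeping every read and write sequential, which is the property that matters for the defrag scenario described above.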
> I'm trying to map out the options in the hope that you could get a
> product you're happier with and maybe give you more time in a tight
> three-month schedule to improve on your work and not just complete it.
> Whatever you do, good luck and I'm interested to see the results!

Feel free to comment more. I am the one implementing the project, but
that's all. Input from others is always welcome.

Petr Onderka

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
