On Mon, Jul 8, 2013 at 6:53 AM, Randall Farmer <[email protected]> wrote:
> > Keeping the dumps in a text-based format doesn't make sense, because
> > that can't be updated efficiently, which is the whole reason for the new
> > dumps.
>
> First, glad to see there's motion here.
>
> It's definitely true that recompressing the entire history to .bz2 or .7z
> goes very, very slowly. Also, I don't know of an existing tool that lets
> you just insert new data here and there without compressing all of the
> unchanged data as well. Those point towards some sort of format change.
>
> I'm not sure a new format has to be sparse or indexed to get around those
> two big problems.
>
> For full-history dumps, delta coding (or the related idea of long-range
> redundancy compression) runs faster than bzip2 or 7z and produces good
> compression ratios on full-history dumps, based on some tests
> <https://www.mediawiki.org/wiki/Dbzip2#rzip_and_xdelta3>. (I'm going to
> focus mostly on full-history dumps here because they're the hard case and
> one Ariel said is currently painful--not everything here will apply to
> latest-revs dumps.)
>
> For inserting data, you do seemingly need to break the file up into
> independently-compressed sections containing just one page's revision
> history or a fragment of it, so you can add new diff(s) to a page's
> revision history without decompressing and recompressing the previous
> revisions. (Removing previously-dumped revisions is another story, but
> it's rarer.) You'd be in new territory just doing that; I don't know of
> existing compression tools that really allow that.
>
> You could do those two things, though, while still keeping full-history
> dumps a once-every-so-often batch process that produces a sorted file.
> The time to rewrite the file, stripped of the big compression steps,
> could be bearable--a disk can read or write about 100 MB/s, so just
> copying the 70G of the .7z enwiki dumps is well under an hour; if the
> part bound by CPU and other steps is smallish, you're OK.
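To make the "independently-compressed sections" idea above concrete, here is a toy sketch in Python. It is only an illustration of the general technique under my own assumptions, not code from any actual dump tool: each page's revision history is compressed as its own bz2 block, and a separate index maps page ids to byte offsets, so appending a page never touches previously written data. (The class and method names are made up for this sketch.)

```python
import bz2

class BlockDump:
    """Toy dump writer: each page's revisions are one independent bz2
    block; the index maps page_id -> (offset, length) into the data."""

    def __init__(self):
        self.data = bytearray()
        self.index = {}  # page_id -> (offset, length)

    def append_page(self, page_id, revisions):
        # Compress only this page's revisions; existing blocks are
        # never decompressed or rewritten when new pages arrive.
        block = bz2.compress("\n".join(revisions).encode("utf-8"))
        self.index[page_id] = (len(self.data), len(block))
        self.data.extend(block)

    def read_page(self, page_id):
        # Random access: the index lets us seek straight to one
        # page's block instead of decompressing the whole file.
        offset, length = self.index[page_id]
        block = bytes(self.data[offset:offset + length])
        return bz2.decompress(block).decode("utf-8").split("\n")

dump = BlockDump()
dump.append_page(1, ["revision 1 text", "revision 2 text"])
dump.append_page(2, ["other page, revision 1"])
assert dump.read_page(1) == ["revision 1 text", "revision 2 text"]
```

A real on-disk format would write blocks to a file and persist the index, but the tradeoff is the same: per-page blocks give cheap appends and random access at some cost in overall compression ratio versus one solid archive.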
> A format like the proposed one, with revisions inserted wherever there's
> free space when they come in, will also eventually fragment the revision
> history for one page (I think Ariel alluded to this in some early notes).
> Unlike sequential reads/writes, seeks are something HDDs are sadly pretty
> slow at (hence the excitement about solid-state disks); if thousands of
> revisions are coming in a day, it eventually becomes slow to read things
> in the old page/revision order, and you need fancy techniques to defrag
> (maybe a big external-memory sort
> <http://en.wikipedia.org/wiki/External_sorting>) or you need to only read
> the dump on fast hardware that can handle the seeks. Doing occasional
> batch jobs that produce sorted files could help avoid the fragmentation
> question.

These are some interesting ideas. You're right that copying the whole dump
is fast enough (it would probably add about an hour to a process that
currently takes several days). But it would also pretty much force the use
of delta compression. And while I would like to use delta compression, I
don't think it's a good idea to be forced to use it, because I might not
have the time for it or it might not be good enough. Because of that, I
decided to stay with my indexed approach.

> There's a great quote about the difficulty of "constructing a software
> design...to make it so simple that there are obviously no deficiencies."
> (Wikiquote came through with the full text/attribution, of course
> <http://en.wikiquote.org/wiki/C._A._R._Hoare>.)
>
> I admit it's tricky and people can disagree about what's simple enough or
> even what approach is simpler of two choices, but it's something to
> strive for.
>
> Anyway, I'm wary about going into the technical weeds of other folks'
> projects, because, hey, it's your project!
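Going back to the defragmentation point above: the external-memory sort Randall links would rewrite a fragmented dump back into page/revision order using only sequential I/O. A minimal sketch of the technique, with in-memory lists standing in for the temporary run files a real implementation would use (the function and parameter names are my own, purely illustrative):

```python
import heapq

def external_sort(records, run_size=2):
    """Sketch of an external merge sort: split the input into runs
    that each fit in memory, sort each run, then k-way merge them."""
    runs = []
    for i in range(0, len(records), run_size):
        # In a real tool each sorted run would be a temp file on disk.
        runs.append(sorted(records[i:i + run_size]))
    # heapq.merge streams the merge, reading every run sequentially --
    # this is what avoids the seek-heavy access pattern on HDDs.
    return list(heapq.merge(*runs))

# Revisions arrived out of order; re-sort into (page_id, rev_id) order.
incoming = [(2, 5), (1, 1), (3, 9), (1, 2), (2, 4)]
assert external_sort(incoming) == [(1, 1), (1, 2), (2, 4), (2, 5), (3, 9)]
```

With disk-backed runs this sorts data far larger than RAM while keeping every read and write sequential, which is the property that matters for the defrag scenario described above.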
> I'm trying to map out the options in the hope that you could get a
> product you're happier with and maybe give you more time in a tight
> three-month schedule to improve on your work and not just complete it.
> Whatever you do, good luck and I'm interested to see the results!

Feel free to comment more. I am the one implementing the project, but
that's all. Input from others is always welcome.

Petr Onderka

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
