> That does not sound like much economically. Do keep in mind the cost of
> porting, deploying, maintaining, obtaining, and so on, new tools.
Briefly, yes, CPU-hours don't cost much, but I don't think the potential win is limited to the direct CPU-hours saved. In more detail: for Wikimedia, a quicker-running task is probably easier to manage and perhaps less likely to fail and need human attention; dump users get more up-to-date content if dump processing is quicker; and users who get histzip also get a tool they can use, for example, to quickly pack a modified XML file in a pipeline.

It's a relatively small (500-line), hackable tool and could serve as a base for later work: for instance, I've tried to rig the format so that future compressors can make backwards-compatible archives they can insert into without recompressing all the TBs of input.

There are pages on Meta going back a few years about ideas for improving compression speed, and there have been past format changes for operational reasons (chunking the full-history dumps) and other dump-related proposals in Wikimedia-land (a project this past summer about a new dump tool), so I don't think I'm entirely swatting at gnats by trying to work up another possible tool.

I'm talking about keeping at least one of the current, widely supported formats around, which I think would limit hardship for existing users. I'm curious how many full-history-dump users there are and whether they have anything to say.

You mentioned porting; histzip is a Go program that's easy to cross-compile for different OSes/architectures (as I have for Windows/Mac/Linux on the GitHub page, though not for the various BSDs).

> I would definitely recommend talking to Igor Pavlov (7-Zip) about this,
> he might be interested in having this as part of 7-Zip as some kind of
> "fast" option, and also the developers of the `xz` tools. There might
> even be ways this could fit within existing extensibility mechanisms of
> the formats.

7-Zip is definitely a very cool and flexible program.
I think it can actually run faster than it does in the current dumps setup: -mx=3 maintains ratios better than bzip2's, but runs faster than bzip2. That's a few times slower than histzip|bzip2, with slightly larger output, but it would be a boost over the status quo. (There's an argument for maintaining that, rather than bzip2, as the widely supported format, which I'd mentioned in the xmldatadumps-l branch of this thread, or for just changing the 7z settings and calling it a day.)

Interesting to hear from Nemo that Pavlov was interested in long-range zipping. histzip doesn't have source he could drop into his C program (it's in Go), and it's really aimed at a narrow niche (long repetitions at a certain distance), so I doubt I could get it integrated there.

Anyway, I'm saying too many fundamentally unimportant words. If the status quo on compression in fact causes enough pain to give histzip a fuller look, or if there's some way to redirect the tech in it towards a useful end, it would be great to hear from interested folks; if not, it was fun work, but there may not be much more to do or say.

On Mon, Jan 20, 2014 at 4:49 PM, Bjoern Hoehrmann <[email protected]> wrote:

> * Randall Farmer wrote:
> > As I understand, compressing full-history dumps for English Wikipedia and
> > other big wikis takes a lot of resources: enwiki history is about 10TB
> > unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's
> > over a day of server time. There's been talk about ways to speed that up
> > in the past. [1]
>
> That does not sound like much economically. Do keep in mind the cost of
> porting, deploying, maintaining, obtaining, and so on, new tools. There
> might be hundreds of downstream users and if every one of them has to
> spend a couple of minutes adapting to a new format, that can quickly
> outweigh any savings, as a simple example.
>
> > Technical data-dump aside: *How could I get this more thoroughly tested,
> > then maybe added to the dump process, perhaps with an eye to eventually
> > replacing 7zip as the alternate, non-bzip2 compressor?* Who do I talk to
> > to get started? (I'd dealt with Ariel Glenn before, but haven't seen
> > activity from Ariel lately, and in any case maybe playing with a new tool
> > falls under Labs or some other heading than dumps devops.) Am I nuts to
> > be even asking about this? Are there things that would definitely need
> > to change for integration to be possible? Basically, I'm trying to get
> > this from a tech demo to something with real-world utility.
>
> I would definitely recommend talking to Igor Pavlov (7-Zip) about this,
> he might be interested in having this as part of 7-Zip as some kind of
> "fast" option, and also the developers of the `xz` tools. There might
> even be ways this could fit within existing extensibility mechanisms of
> the formats. Igor Pavlov tends to be quite responsive through the SF.net
> bug tracker. In any case, they might be able to give directions how this
> might become, or not, part of standard tools.
>
> --
> Björn Höhrmann · mailto:[email protected] · http://bjoern.hoehrmann.de
> Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
> 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
>
> _______________________________________________
> Xmldatadumps-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
