>
> That does not sound like much economically. Do keep in mind the cost of
> porting, deploying, maintaining, obtaining, and so on, new tools.


Briefly: yes, CPU-hours don't cost much, but I don't think the potential
win is limited to the direct CPU-hours saved.

In more detail: for Wikimedia, a quicker-running task is probably easier to
manage and maybe less likely to fail and need human attention; dump users
get more up-to-date content if dump processing is quicker; and users who
get histzip also get a tool they can use, for example, to quickly pack a
modified XML file in a pipeline. It's a relatively small (500-line),
hackable tool and could serve as a base for later work: for instance, I've
tried to rig the format so future compressors can make backwards-compatible
archives they can insert into without recompressing all the TBs of input.
There are pages on meta going back a few years about ideas for improving
compression speed, and there have been past format changes for operational
reasons (chunking full-history dumps) and other dump-related proposals in
Wikimedia-land (a project this past summer about a new dump tool), so I
don't think I'm entirely swatting at gnats by trying to work up another
possible tool.

I'm talking about keeping at least one of the current, widely supported
formats around, which I think would limit hardship for existing users. I'm
sort of curious how many full-history-dump users there are and whether they
have anything to say. You mentioned porting: histzip is a Go program, so
it's easy to cross-compile for different OSes and architectures (as I have
for Windows/Mac/Linux on the GitHub page, though not for the various BSDs).

> I would definitely recommend talking to Igor Pavlov (7-Zip) about this,
> he might be interested in having this as part of 7-Zip as some kind of
> "fast" option, and also the developers of the `xz` tools. There might
> even be ways this could fit within existing extensibility mechanisms of
> the formats.


7-Zip is definitely a very cool and flexible program. I think it can
actually run faster than it does in the current dumps setup: -mx=3 keeps
ratios better than bzip2's but runs faster than bzip2. That's still a few
times slower than histzip|bzip2, with slightly larger output, but it's a
boost over the status quo. (There's an argument for keeping that, rather
than bzip2, as the widely supported format, which I mentioned in the
xmldatadumps-l branch of this thread, or for just changing the 7z settings
and calling it a day.)

It was interesting to hear from Nemo that Pavlov was interested in
long-range zipping. histzip doesn't have source he could drop into his C
program (it's in Go), and it's aimed at a fairly narrow niche (long
repetitions at a certain distance), so I doubt I could get it integrated
there.

Anyway, I'm saying too many fundamentally unimportant words. If the status
quo around compression in fact causes enough pain to give histzip a fuller
look, or if there's some way to redirect its tech toward a useful end, it
would be great to hear from interested folks; if not, it was fun work, but
there may not be much more to do or say.


On Mon, Jan 20, 2014 at 4:49 PM, Bjoern Hoehrmann <[email protected]> wrote:

> * Randall Farmer wrote:
> >As I understand, compressing full-history dumps for English Wikipedia and
> >other big wikis takes a lot of resources: enwiki history is about 10TB
> >unpacked, and 7zip only packs a few MB/s/core. Even with 32 cores, that's
> >over a day of server time. There's been talk about ways to speed that up
> in
> >the past.[1]
>
> That does not sound like much economically. Do keep in mind the cost of
> porting, deploying, maintaining, obtaining, and so on, new tools. There
> might be hundreds of downstream users and if every one of them has to
> spend a couple of minutes adopting to a new format, that can quickly
> outweigh any savings, as a simple example.
>
> >Technical datadaump aside: *How could I get this more thoroughly tested,
> >then maybe added to the dump process, perhaps with an eye to eventually
> >replacing for 7zip as the alternate, non-bzip2 compressor?* Who do I talk
> >to to get started? (I'd dealt with Ariel Glenn before, but haven't seen
> >activity from Ariel lately, and in any case maybe playing with a new tool
> >falls under Labs or some other heading than dumps devops.) Am I nuts to be
> >even asking about this? Are there things that would definitely need to
> >change for integration to be possible? Basically, I'm trying to get this
> >from a tech demo to something with real-world utility.
>
> I would definitely recommend talking to Igor Pavlov (7-Zip) about this,
> he might be interested in having this as part of 7-Zip as some kind of
> "fast" option, and also the developers of the `xz` tools. There might
> even be ways this could fit within existing extensibility mechanisms of
> the formats. Igor Pavlov tends to be quite response through the SF.net
> bug tracker. In any case, they might be able to give directions how this
> might become, or not, part of standard tools.
> --
> Björn Höhrmann · mailto:[email protected] · http://bjoern.hoehrmann.de
> Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
> 25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
>
> _______________________________________________
> Xmldatadumps-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/xmldatadumps-l
>
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
