https://bugzilla.wikimedia.org/show_bug.cgi?id=26499
--- Comment #12 from Ariel T. Glenn <[email protected]> 2011-08-29 18:07:24 UTC ---

(In response to comment 11) No, they aren't, but I have a C library that could be used to build such an index for bzip2 files without a ton of work; specifically, it includes a utility to find the offset of the block containing a specific pageID. Since 7z and gzip aren't block-oriented, it isn't possible to generate such an index for those files.

However, this feature is not as useful as you might think. For dump files that contain all revisions, it can take quite a while to locate a given pageID. That's because a few pages are ginormous (up to 163 GB), and if the guesser happens to land in the middle of one of them, reading through it can take up to an hour. If one prebuilt an index mapping revision IDs to page IDs and kept it in memory, things could be sped up a fair amount; alternatively, one could work only with the current revisions.

(In response to comment 9) Moving to xz would mean rewriting my bz2 library and utilities and all the bits that rely on them, so that's not likely to happen until Dumps 2.0.

(In response to comment 8) The easiest way to provide metadata of this nature is, like the md5 sums, to provide it in a separate file.
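The block orientation of bzip2 that the comment relies on can be illustrated with a small sketch. This is not the C library Ariel mentions; it is a minimal, assumed implementation in Python that locates candidate bzip2 block boundaries by scanning the compressed bit stream for the 48-bit block magic 0x314159265359 (blocks are bit-aligned, not byte-aligned). A real indexer would seek to such an offset and decompress from there to find the pageIDs the block contains.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit bzip2 block header (bit-aligned)

def find_block_bit_offsets(data: bytes) -> list[int]:
    """Return bit offsets of candidate bzip2 block starts in `data`.

    Naive bit-by-bit scan with a 48-bit rolling window. Note: the magic
    can in principle occur by chance inside compressed payload, so a
    robust indexer must verify each candidate by decompressing from it.
    """
    offsets = []
    window = 0
    mask = (1 << 48) - 1
    nbits = 0
    for byte in data:
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            nbits += 1
            if nbits >= 48 and window == BLOCK_MAGIC:
                offsets.append(nbits - 48)
    return offsets

# In a single-stream file, the first block magic follows the 4-byte
# "BZh<level>" file header, i.e. it sits at bit offset 32.
compressed = bz2.compress(b"some dump text")
print(find_block_bit_offsets(compressed)[0])  # → 32
```

Since gzip and 7z (LZMA) compress as one continuous stream with no such per-block resync points, the same trick has no analogue there, which is the limitation the comment points out.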
