[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 --- Comment #17 from Adam Wight 2011-10-06 18:06:02 UTC --- What about saving several indexes of data, each in its own file? For illustration:
tlwiki-20110926-pages-meta-history.xml.bz2.index-on-revision.sqlite3
tlwiki-20110926-pages-meta-history.xml.bz2.index-on-page.sqlite3
tlwiki-20110926-pages-meta-history.xml.bz2.index-on-title.sqlite3
-- Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are on the CC list for the bug. ___ Wikibugs-l mailing list Wikibugs-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
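A minimal sketch of what one such per-dump index file might look like, using Python's standard sqlite3 module. The table name `page_index`, its columns, and the helpers `build_page_index` and `lookup_offset` are hypothetical names chosen for illustration; the thread does not specify a schema or how the offsets would be obtained.

```python
import sqlite3


def build_page_index(db_path, entries):
    """Store (page_id, title, block_offset) rows in a standalone SQLite file.

    `entries` is any iterable of (page_id, title, byte_offset) tuples.
    Producing those offsets is outside the scope of this sketch.
    """
    conn = sqlite3.connect(db_path)
    with conn:  # commits the transaction on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS page_index ("
            "page_id INTEGER PRIMARY KEY, "
            "title TEXT NOT NULL, "
            "block_offset INTEGER NOT NULL)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO page_index VALUES (?, ?, ?)", entries
        )
    return conn


def lookup_offset(conn, page_id):
    """Return the stored offset for page_id, or None if it is not indexed."""
    row = conn.execute(
        "SELECT block_offset FROM page_index WHERE page_id = ?", (page_id,)
    ).fetchone()
    return row[0] if row else None
```

Keeping each index in its own file, as suggested above, means a consumer only downloads the lookup it needs; an index on title would be a second file with the same shape and a different key.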
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 --- Comment #16 from Ariel T. Glenn 2011-08-29 22:19:33 UTC --- See Administrators'_noticeboard/Incidents, a total of 561938 revs last time I looked (which was over a month ago, surely even worse now).
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 Ángel González changed: What|Removed |Added CC||keis...@gmail.com --- Comment #15 from Ángel González 2011-08-29 22:04:36 UTC --- I have a similar one, too. Although in this case it recompressed the bzip2 files with given parameters. I didn't expect it to work efficiently with history dumps, but nonetheless I'm surprised that the pages get *that* big.
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 --- Comment #14 from Ariel T. Glenn 2011-08-29 19:39:55 UTC --- Yeah, I'm familiar with seek-bzip2, but it didn't do what I needed for my use case. I wanted to be able to easily locate a given XML page in a dump file without an index. The gzip tool appears to read through the entire file (and then keep it in memory) for random access, something we wouldn't want to do for large files like the en wikipedia dumps. Another approach is to make each page a separate bzip2 stream; I haven't decided whether that's a good thing or not (and it too would require reworking a bunch of things that aren't designed to handle multiple streams).
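The "each page a separate bzip2 stream" idea can be sketched roughly as follows with Python's standard bz2 module. This is only an illustration under stated assumptions: the helper names `write_page_streams` and `read_page` are hypothetical, and the sketch keeps a small offset table in memory for simplicity, which is not the index-free lookup described in the comment.

```python
import bz2
import io


def write_page_streams(out, pages):
    """Write each page as its own bzip2 stream.

    `pages` is an iterable of (page_id, xml_text) pairs. Returns a dict
    mapping page_id to (offset, length) of that page's compressed stream.
    """
    index = {}
    for page_id, xml_text in pages:
        blob = bz2.compress(xml_text.encode("utf-8"))
        index[page_id] = (out.tell(), len(blob))
        out.write(blob)
    return index


def read_page(f, index, page_id):
    """Seek straight to one page's stream and decompress only that stream."""
    offset, length = index[page_id]
    f.seek(offset)
    return bz2.decompress(f.read(length)).decode("utf-8")
```

Because concatenated bzip2 streams still form a valid bzip2 file, a plain sequential reader that supports multiple streams sees the same data. The cost is a worse compression ratio, since every small stream starts without any shared context, and tools that stop after the first stream would need reworking, as the comment notes.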
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 Andrew Dunbar changed: What|Removed |Added CC||hippytr...@gmail.com --- Comment #13 from Andrew Dunbar 2011-08-29 18:54:26 UTC --- There is a little tool for indexing the blocks in bzip2: http://bitbucket.org/james_taylor/seek-bzip2 There is a more complicated one for gzip too: http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 --- Comment #12 from Ariel T. Glenn 2011-08-29 18:07:24 UTC ---
(In response to comment 11) No they aren't but I have a C library that could be used to build such an index without a ton of work, for bzip2 files; specifically, there is a utility to find the offset to a block containing a specific pageID. Since 7z and gzip aren't block-oriented it's not possible to generate an index for those files.
However, this feature is not as useful as you might think. For dump files that contain all revisions, it can take quite a while to locate a given pageID. That's because there are a few pages which, if the guesser happens to land in the middle of them, are ginormous (up to 163 GB) and take up to an hour to read through. If one prebuilt an index that mapped revision IDs to page IDs and kept this in memory, things could be speeded up a fair amount; alternatively one could work just with the current revisions.
(In response to comment 9) Moving to xz will mean a rewrite of my bz2 library and utils and all the bits that rely on them, so that's not likely to happen until Dumps 2.0.
(In response to comment 8) The easiest way to provide metadata of this nature is, like the md5 sums, to provide it in a separate file.
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 --- Comment #11 from Adam Wight 2011-06-04 11:07:57 UTC --- Make it a requirement that the compression library is able to report compressed block boundaries as it is working, so an index can be generated. This will open many possibilities for mediawiki on mobile, DVD, and other resource-limited scenarios. n.b. -- the libbzip2 counters are not accessible from php.
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 --- Comment #10 from Diederik van Liere 2011-06-03 22:04:31 UTC --- xz compression sounds good to me!
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 --- Comment #9 from Platonides 2011-06-03 22:00:31 UTC --- Diederik, they are not created uncompressed in memory. I think we should just move to xz (mainly for the space benefits), which would provide the uncompressed size as an added value.
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 Diederik van Liere changed: What|Removed |Added Keywords||analytics
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 --- Comment #8 from Diederik van Liere 2011-06-02 22:40:04 UTC --- Or alternatively, first create the page XML elements and once that's done and you have collected metadata like number of articles, uncompressed size, etc. prepend the metadata, and XML element to the xml file. A simple cat operation would do that, and finally append at the end of the XML document the closing tag.
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 --- Comment #7 from Platonides 2011-06-02 22:35:03 UTC --- Sorry, I didn't pay enough attention to the first post; I was thinking of giving that metadata separately.
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 --- Comment #6 from Brion Vibber 2011-06-02 21:54:24 UTC --- (In reply to comment #5)
> > Dump files are generated directly to their compressed form, so these exact
> > things aren't really possible to put in.
> You can just keep the count when writing it (eg, libbzip2 has counters just
> for giving the applications that convenience).
Well yes, but you won't have that final count until you've finished writing the entire file, so you can't really include it in the header of the file. You can put it in another file, or maybe you can append it as some kind of metadata at the *end* of the compressed file, or a second file directory entry or something depending on the format.
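To illustrate the point about counting while writing: a rough sketch, using Python's standard bz2 module rather than libbzip2's own counters, of a writer that compresses directly to its output and only knows the totals once the last chunk has been flushed. The function name `compress_with_count` and the key names in the returned dict are assumptions made for this example.

```python
import bz2
import io


def compress_with_count(chunks, out):
    """Stream-compress `chunks` (an iterable of bytes) into `out`.

    Returns the byte totals, which are only final after flush(); this is
    why they cannot be placed in a header at the start of the same file.
    """
    compressor = bz2.BZ2Compressor()
    uncompressed = 0
    compressed = 0
    for chunk in chunks:
        uncompressed += len(chunk)
        data = compressor.compress(chunk)  # may be empty until a block fills
        compressed += len(data)
        out.write(data)
    tail = compressor.flush()
    compressed += len(tail)
    out.write(tail)
    return {"uncompressed_size": uncompressed, "compressed_size": compressed}
```

The returned totals could then be written to a separate file next to the dump, or appended after the compressed data, which are the options mentioned in the comment.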
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 Platonides changed: What|Removed |Added CC||platoni...@gmail.com --- Comment #5 from Platonides 2011-06-02 21:50:25 UTC ---
> Dump files are generated directly to their compressed form, so these exact
> things aren't really possible to put in.
You can just keep the count when writing it (eg, libbzip2 has counters just for giving the applications that convenience).
[Bug 26499] Include uncompressed size and other metadata in each dump file
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499 Adam Wight changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Include size of the dump    |Include uncompressed size
                   |file in each dump file      |and other metadata in each
                   |                            |dump file

--- Comment #4 from Adam Wight 2011-02-24 22:23:39 UTC --- A rough proposal for the metadata, please help elaborate:
(page_id_start, page_id_end, generator_id_string, snapshot_timestamp, namespaces, history_selector, uncompressed_size ...)
If one of the job outputs is corrupted, for example, this will make it easy to diagnose and recover.
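One possible way to write down the proposed metadata tuple as a record, sketched in Python. The field names come from the proposal above; the field types, the JSON serialization, and the names `DumpMetadata` and `to_sidecar_json` are assumptions made for illustration, since the proposal leaves them open.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DumpMetadata:
    # Field names follow the proposal; the types are guesses.
    page_id_start: int
    page_id_end: int
    generator_id_string: str
    snapshot_timestamp: str
    namespaces: List[int]
    history_selector: str
    uncompressed_size: int


def to_sidecar_json(meta):
    """Serialize the record so it can be stored alongside a dump file."""
    return json.dumps(asdict(meta), sort_keys=True, indent=2)
```

With the page ID range and generator string recorded per output file, a corrupted job output could be identified and regenerated on its own, which is the recovery case the proposal has in mind.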