[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-10-06 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

--- Comment #17 from Adam Wight s...@ludd.net 2011-10-06 18:06:02 UTC ---
What about saving several indexes of data each in their own file?

For illustration,

  tlwiki-20110926-pages-meta-history.xml.bz2.index-on-revision.sqlite3
  tlwiki-20110926-pages-meta-history.xml.bz2.index-on-page.sqlite3
  tlwiki-20110926-pages-meta-history.xml.bz2.index-on-title.sqlite3
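A minimal sketch of what one such per-key index file could hold, assuming it maps a page ID to the byte offset of the compressed block where that page starts. The table name `page_index`, the column names, and both helper functions are hypothetical; nothing in this thread fixes a schema.

```python
import sqlite3


def build_page_index(db_path, entries):
    """Write (page_id, block_offset) pairs into a SQLite index file.

    The schema here is an assumption made for illustration only.
    """
    con = sqlite3.connect(db_path)
    with con:  # commits on success, rolls back on error
        con.execute(
            "CREATE TABLE IF NOT EXISTS page_index ("
            "page_id INTEGER PRIMARY KEY, "
            "block_offset INTEGER NOT NULL)"
        )
        con.executemany(
            "INSERT OR REPLACE INTO page_index (page_id, block_offset) "
            "VALUES (?, ?)",
            entries,
        )
    con.close()


def lookup_offset(db_path, page_id):
    """Return the stored offset for page_id, or None if it is not indexed."""
    con = sqlite3.connect(db_path)
    try:
        row = con.execute(
            "SELECT block_offset FROM page_index WHERE page_id = ?",
            (page_id,),
        ).fetchone()
    finally:
        con.close()
    return row[0] if row else None
```

An index on revision or on title would follow the same pattern with a different key column, which is one argument for keeping each index in its own file.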

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-08-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

--- Comment #12 from Ariel T. Glenn ar...@wikimedia.org 2011-08-29 18:07:24 
UTC ---
(In response to comment 11) 
No, they aren't, but I have a C library that could be used to build such an index
without a ton of work, for bzip2 files; specifically, there is a utility to
find the offset to a block containing a specific pageID.  Since 7z and gzip
aren't block-oriented it's not possible to generate an index for those files.

However, this feature is not as useful as you might think.  For dump files that
contain all revisions, it can take quite a while to locate a given pageID. 
That's because there are a few pages which, if the guesser happens to land in
the middle of them, are ginormous (up to 163 GB) and take up to an hour to read
through.  If one prebuilt an index that mapped revision IDs to page IDs and
kept this in memory, things could be sped up a fair amount; alternatively
one could work just with the current revisions.
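To illustrate the revision-to-page idea: when a probe lands in the middle of a very large page, the decompressed text contains revision IDs long before the next page header. With a prebuilt in-memory map, the first revision ID seen is enough to identify the page. This is only a sketch; `owning_page_id`, the `rev_to_page` dict, and the regular expressions (which assume the usual `<page>` / `<revision>` / `<id>` layout of the XML dumps) are illustrative and are not the C library mentioned above.

```python
import re

# First <id> directly inside a <revision> element.
REV_ID = re.compile(r"<revision>\s*<id>(\d+)</id>")
# First <id> after a <page> opening tag, which is the page ID.
PAGE_ID = re.compile(r"<page>.*?<id>(\d+)</id>", re.S)


def owning_page_id(chunk, rev_to_page):
    """Guess which page a decompressed chunk of dump XML belongs to.

    If a revision appears before any page header, the chunk starts
    mid-page, so the page is looked up through the revision ID.
    Otherwise the page header itself gives the answer.
    """
    rev = REV_ID.search(chunk)
    page = PAGE_ID.search(chunk)
    if rev and (page is None or rev.start() < page.start()):
        return rev_to_page.get(int(rev.group(1)))
    if page:
        return int(page.group(1))
    return None
```

A caller doing a binary search over block offsets could use the returned page ID to decide which half of the file to probe next, without reading to the end of the oversized page.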

(In response to comment 9)
Moving to xz will mean a rewrite of my bz2 library and utils and all the bits
that rely on them, so that's not likely to happen until Dumps 2.0.

(In response to comment 8)
The easiest way to provide metadata of this nature is, like the md5 sums, to
provide it in a separate file.
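A rough sketch of such a sidecar file, assuming JSON as the format and a `.meta.json` suffix (both are my assumptions, as are the function names). It reads a finished bzip2 dump once for the MD5 of the compressed bytes and once more to count the uncompressed bytes.

```python
import bz2
import hashlib
import json


def describe_dump(path, chunk_size=1 << 20):
    """Return the MD5 of the compressed file and its uncompressed size."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)

    uncompressed_size = 0
    with bz2.open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            uncompressed_size += len(chunk)

    return {"md5": md5.hexdigest(), "uncompressed_size": uncompressed_size}


def write_sidecar(path):
    """Write the metadata next to the dump and return it."""
    meta = describe_dump(path)
    with open(path + ".meta.json", "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```

Counting after the fact costs a full decompression pass; a dump writer that tallies bytes as it goes would avoid that.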



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-08-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

Andrew Dunbar hippytr...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hippytr...@gmail.com

--- Comment #13 from Andrew Dunbar hippytr...@gmail.com 2011-08-29 18:54:26 
UTC ---
There is a little tool for indexing the blocks in bzip2:
http://bitbucket.org/james_taylor/seek-bzip2

There is a more complicated one for gzip too:
http://svn.ghostscript.com/ghostscript/tags/zlib-1.2.3/examples/zran.c



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-08-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

--- Comment #14 from Ariel T. Glenn ar...@wikimedia.org 2011-08-29 19:39:55 
UTC ---
Yeah, I'm familiar with seek-bzip2, but it didn't do what I needed for my use
case.  I wanted to be able to easily locate a given XML page in a dump file
without an index. The gzip tool appears to read through the entire file (and
then keep it in memory) for random access, something we wouldn't want to do for
large files like the en wikipedia dumps. 

Another approach is to make each page a separate bzip2 stream; I haven't
decided whether that's a good thing or not (and it too would require reworking
a bunch of things that aren't designed to handle multiple streams).
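For what the one-stream-per-page approach could look like, here is a small sketch using Python's standard `bz2` module. Each page is compressed as an independent bzip2 stream and its offset and length are recorded, so a single page can be read back with one seek. The function names and the shape of the index are assumptions for illustration.

```python
import bz2


def write_multistream(pages, out):
    """Compress each (page_id, xml_bytes) pair as its own bzip2 stream.

    Returns a dict mapping page_id to (offset, length) within `out`.
    """
    index = {}
    for page_id, xml in pages:
        blob = bz2.compress(xml)
        index[page_id] = (out.tell(), len(blob))
        out.write(blob)
    return index


def read_page(f, index, page_id):
    """Seek to one page's stream and decompress only that stream."""
    offset, length = index[page_id]
    f.seek(offset)
    return bz2.decompress(f.read(length))
```

The resulting file is still a valid concatenation of bzip2 streams, so a decompressor that handles multiple streams yields the pages in order. The trade-off is a worse compression ratio, since every small stream starts from scratch.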



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-08-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

Ángel González keis...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |keis...@gmail.com

--- Comment #15 from Ángel González keis...@gmail.com 2011-08-29 22:04:36 UTC 
---
I have a similar one, too. Although in this case it recompressed the bzip2
files with given parameters.

I didn't expect it to work efficiently with history dumps, but nonetheless I'm
surprised that the pages get *that* big.



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-08-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

--- Comment #16 from Ariel T. Glenn ar...@wikimedia.org 2011-08-29 22:19:33 
UTC ---
See Administrators'_noticeboard/Incidents, a total of 561938 revs last time I
looked (which was over a month ago, surely even worse now).



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-06-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

--- Comment #11 from Adam Wight s...@ludd.net 2011-06-04 11:07:57 UTC ---
Make it a requirement that the compression library is able to report compressed
block boundaries as it is working, so an index can be generated.  This will
open many possibilities for mediawiki on mobile, DVD, and other
resource-limited scenarios.

n.b. -- the libbzip2 counters are not accessible from php.



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-06-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

Diederik van Liere dvanli...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |analytics



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-06-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

--- Comment #9 from Platonides platoni...@gmail.com 2011-06-03 22:00:31 UTC 
---
Diederik, they are not created uncompressed in memory.

I think we should just move to xz (mainly for the space benefits), which would
provide the uncompressed size as an added value.



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-06-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

--- Comment #10 from Diederik van Liere dvanli...@gmail.com 2011-06-03 
22:04:31 UTC ---
xz compression sounds good to me!



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-06-02 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

Platonides platoni...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |platoni...@gmail.com

--- Comment #5 from Platonides platoni...@gmail.com 2011-06-02 21:50:25 UTC 
---
> Dump files are generated directly to their compressed form, so these exact
> things aren't really possible to put in.
You can just keep the count when writing it (eg, libbzip2 has counters just for
giving the applications that convenience).
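As a sketch of keeping the count while writing: the wrapper below streams data through a bzip2 compressor and tallies both uncompressed and compressed bytes as it goes. It uses Python's `bz2.BZ2Compressor` and keeps its own counters rather than reading the ones inside libbzip2; the class name and attributes are hypothetical.

```python
import bz2


class CountingBZ2Writer:
    """Write bzip2-compressed data to `raw` while counting bytes."""

    def __init__(self, raw):
        self._raw = raw
        self._compressor = bz2.BZ2Compressor()
        self.uncompressed_size = 0
        self.compressed_size = 0

    def write(self, data):
        # Count the input first, then emit whatever the compressor releases.
        self.uncompressed_size += len(data)
        self._emit(self._compressor.compress(data))

    def close(self):
        # flush() returns the remaining compressed data and ends the stream.
        self._emit(self._compressor.flush())

    def _emit(self, blob):
        self.compressed_size += len(blob)
        self._raw.write(blob)
```

The totals are only final after `close()`, so they are available for a separate metadata file or a trailer, but not for anything that has to be written at the start of the stream.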



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-06-02 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

--- Comment #6 from Brion Vibber br...@wikimedia.org 2011-06-02 21:54:24 UTC 
---
(In reply to comment #5)
> > Dump files are generated directly to their compressed form, so these exact
> > things aren't really possible to put in.
> You can just keep the count when writing it (eg, libbzip2 has counters just
> for giving the applications that convenience).

Well yes, but you won't have that final count until you've finished writing the
entire file, so you can't really include it in the header of the file. You can
put it in another file, or maybe you can append it as some kind of metadata at
the *end* of the compressed file, or a second file directory entry or something
depending on the format.



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-06-02 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

--- Comment #7 from Platonides platoni...@gmail.com 2011-06-02 22:35:03 UTC 
---
Sorry, I didn't pay enough attention to the first post; I was thinking of
giving that metadata separately.



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-06-02 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

--- Comment #8 from Diederik van Liere dvanli...@gmail.com 2011-06-02 
22:40:04 UTC ---
Or alternatively, first create the page XML elements and, once that's done and
you have collected metadata like number of articles, uncompressed size, etc.,
prepend the metadata, <siteinfo> and <mediawiki> XML elements to the XML file. A
simple cat operation would do that; finally, append the closing </mediawiki> tag
at the end of the XML document.
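A sketch of the prepend-by-concatenation idea, applied to compressed output: separately compressed bzip2 streams can be concatenated, and a multi-stream-aware decompressor returns the concatenated payloads. So a header stream written last can still be placed first. The `<dumpinfo>` element and its attributes are invented here for illustration, and `assemble_dump` is a hypothetical helper, not part of the dump scripts.

```python
import bz2


def assemble_dump(pages_bz2, page_count, uncompressed_size):
    """Wrap an already-compressed run of <page> elements with a header
    and footer, each compressed as its own bzip2 stream.
    """
    header = (
        "<mediawiki>\n"
        f'  <dumpinfo pages="{page_count}" '
        f'uncompressed_size="{uncompressed_size}"/>\n'
    ).encode("utf-8")
    footer = b"</mediawiki>\n"
    # Equivalent to: cat header.bz2 pages.bz2 footer.bz2 > dump.xml.bz2
    return bz2.compress(header) + pages_bz2 + bz2.compress(footer)
```

This avoids recompressing the page data, at the cost of producing a multi-stream file that some older readers may not handle.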



[Bug 26499] Include uncompressed size and other metadata in each dump file

2011-02-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=26499

Adam Wight s...@ludd.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
            Summary|Include size of the dump    |Include uncompressed size
                   |file in each dump file      |and other metadata in each
                   |                            |dump file

--- Comment #4 from Adam Wight s...@ludd.net 2011-02-24 22:23:39 UTC ---
A rough proposal for the metadata, please help elaborate: (page_id_start,
page_id_end, generator_id_string, snapshot_timestamp, namespaces,
history_selector, uncompressed_size ...)

If one of the job outputs is corrupted, for example, this will make it easy to
diagnose and recover.
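To make the proposal concrete, here is one possible shape for that record, with a helper showing how it could help locate the job output that should contain a given page. The field names come from the list above; the types, the JSON serialization, and `find_part` are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class DumpMetadata:
    page_id_start: int
    page_id_end: int
    generator_id_string: str
    snapshot_timestamp: str
    namespaces: List[int]
    history_selector: str
    uncompressed_size: int

    def to_json(self) -> str:
        """Serialize the record, for example into a sidecar file."""
        return json.dumps(asdict(self), sort_keys=True)


def find_part(parts: List[DumpMetadata], page_id: int) -> Optional[DumpMetadata]:
    """Return the part whose page ID range covers page_id, if any."""
    for part in parts:
        if part.page_id_start <= page_id <= part.page_id_end:
            return part
    return None
```

With records like these, a corrupted output can be matched to its page ID range and regenerated on its own, and the recorded `uncompressed_size` gives a quick sanity check after decompression.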
