[Bug 27618] Backup dumps could contain a title index

2011-11-14 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

--- Comment #8 from Ariel T. Glenn ar...@wikimedia.org 2011-11-14 15:23:35 UTC ---
Out of curiosity, what do the various bz2 offline readers need: a byte offset,
a byte and bit offset, or a bzip2 boundary plus an offset?

I expect the offline readers don't really use namespace or page ids for
anything, so adding the full page title (i.e. namespace:title) should suffice. 
If we're talking only about things in the main article space then it doesn't
matter at all (but what about images?)...

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 27618] Backup dumps could contain a title index

2011-11-14 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

Ángel González keis...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |keis...@gmail.com

--- Comment #9 from Ángel González keis...@gmail.com 2011-11-14 15:46:09 UTC ---
I used bzip2 boundary + title hash.
If your index is 315 MB, then even if you drop the ability to perform random
searches, it will hardly be efficient on a consumer PC with maybe just 512 MB
of RAM.
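
A "bzip2 boundary + title hash" index can be kept small by storing fixed-size
records. The following is only a minimal sketch of that idea, not the format
actually used by any reader: the record layout (two 64-bit integers), the use
of truncated SHA-1 as the hash, and the function names are all assumptions
made for illustration.

```python
import hashlib
import struct

# Hypothetical record layout: (64-bit title hash, 64-bit compressed offset
# of the bzip2 boundary that contains the page), both big-endian.
RECORD = struct.Struct(">QQ")


def title_hash(title: str) -> int:
    """First 8 bytes of the SHA-1 of the UTF-8 title, as an unsigned int."""
    digest = hashlib.sha1(title.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_index(entries):
    """entries: iterable of (title, boundary_offset). Returns packed records
    sorted by hash so they can be binary-searched."""
    rows = sorted((title_hash(t), off) for t, off in entries)
    return b"".join(RECORD.pack(h, off) for h, off in rows)


def lookup(index: bytes, title: str):
    """Return every boundary offset whose record matches the title's hash."""
    want = title_hash(title)
    count = len(index) // RECORD.size
    lo, hi = 0, count
    while lo < hi:  # find the leftmost record with hash >= want
        mid = (lo + hi) // 2
        h, _ = RECORD.unpack_from(index, mid * RECORD.size)
        if h < want:
            lo = mid + 1
        else:
            hi = mid
    found = []
    while lo < count:
        h, off = RECORD.unpack_from(index, lo * RECORD.size)
        if h != want:
            break
        found.append(off)
        lo += 1
    return found
```

Each entry costs 16 bytes regardless of title length, and the packed index
could be memory-mapped rather than loaded whole. Because different titles can
share a hash, a reader would still have to decompress from the returned
boundary and compare the real title.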



[Bug 27618] Backup dumps could contain a title index

2011-11-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

Diederik van Liere dvanli...@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dvanli...@gmail.com

--- Comment #7 from Diederik van Liere dvanli...@gmail.com 2011-11-13 01:14:58 UTC ---
I like this idea and I think two things need to be added to this patch:
1) Currently only the title is written to the index file, but the index should
also include the namespace, or use the page_id instead of the title.
2) As Ariel mentioned, we are generating the dumps in multiple parts, so the
index file should also keep track of which file each article can be found in.

Best,

Diederik
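
The two additions suggested in comment #7 could be captured in a single index
line per page. The sketch below is an illustration only: the tab-separated
layout, the field order, the example file name and the names `IndexEntry`,
`format_entry` and `parse_entry` are hypothetical and not taken from the
patch. It assumes titles never contain a tab character.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    part: str      # name of the dump file (part) holding the page
    offset: int    # offset of the page within that part
    page_id: int   # page_id, unambiguous even if a title is reused
    title: str     # full title including namespace, e.g. "Talk:Main Page"


def format_entry(entry: IndexEntry) -> str:
    """Serialise one entry as a tab-separated line (no trailing newline)."""
    return f"{entry.part}\t{entry.offset}\t{entry.page_id}\t{entry.title}"


def parse_entry(line: str) -> IndexEntry:
    """Parse a line produced by format_entry; the title is the last field,
    so colons or spaces inside it need no escaping."""
    part, offset, page_id, title = line.rstrip("\n").split("\t", 3)
    return IndexEntry(part, int(offset), int(page_id), title)
```

Tabs are used as separators because full titles contain colons
(namespace:title), which would make a colon-separated line ambiguous to split.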



[Bug 27618] Backup dumps could contain a title index

2011-11-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

Sumana Harihareswara suma...@panix.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |need-review
                 CC|                            |suma...@panix.com

--- Comment #6 from Sumana Harihareswara suma...@panix.com 2011-11-10 00:16:38 UTC ---
Adding the need-review keyword because my impression is that Adam wanted other
developers to check his approach and give feedback.  Thanks for the patch,
Adam!



[Bug 27618] Backup dumps could contain a title index

2011-03-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

--- Comment #4 from Adam Wight s...@ludd.net 2011-03-18 06:45:44 UTC ---
Created attachment 8310
  -- https://bugzilla.wikimedia.org/attachment.cgi?id=8310
ROUGH

Not much to show yet, but in case someone wants to lend a hand...
My intention is that:
* each backup job records the arguments with which it was invoked
* an index entry is recorded for each page, giving its offset into the
compressed data being generated

Problems:
1) there is no convention for saving to a second file stream (the index file)
2) the PHP bz2 library does not expose the libbz2.so tell function, nor could
that function work without flushing buffers.  Perhaps the recorded offset can
be expressed as a bz2 chunk, plus an uncompressed offset within that chunk.
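
The "bz2 chunk, then uncompressed offset" addressing can be sketched as
follows. This is an illustration in Python rather than the PHP of the patch,
and it rests on one assumption: each chunk is written as a complete,
independent bz2 stream, so chunk boundaries fall on byte offsets that can be
tracked by counting bytes written, without any tell function. The names
`write_chunked` and `read_page` and the `pages_per_chunk` parameter are
hypothetical.

```python
import bz2


def write_chunked(pages, pages_per_chunk=2):
    """pages: list of (title, text) pairs.

    Returns (blob, index), where blob is the concatenation of independent
    bz2 streams and index maps title -> (chunk_offset, inner_offset, length):
    chunk_offset is a byte offset into the compressed blob, inner_offset and
    length are byte positions in that chunk's uncompressed data.
    """
    blob = bytearray()
    index = {}
    for i in range(0, len(pages), pages_per_chunk):
        chunk_offset = len(blob)  # where this chunk starts in compressed data
        raw = bytearray()
        for title, text in pages[i:i + pages_per_chunk]:
            data = text.encode("utf-8")
            index[title] = (chunk_offset, len(raw), len(data))
            raw += data
        blob += bz2.compress(bytes(raw))  # one complete stream per chunk
    return bytes(blob), index


def read_page(blob, index, title):
    """Decompress only the chunk holding the page, then slice the page out."""
    chunk_offset, inner_offset, length = index[title]
    decompressor = bz2.BZ2Decompressor()
    # Decompression stops at the end of the first stream; any bytes of later
    # chunks are left in decompressor.unused_data.
    raw = decompressor.decompress(blob[chunk_offset:])
    return raw[inner_offset:inner_offset + length].decode("utf-8")
```

A reader working on a real file would seek to `chunk_offset` instead of
slicing an in-memory blob. Larger chunks compress better but mean more data
has to be decompressed to reach a single page.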



[Bug 27618] Backup dumps could contain a title index

2011-02-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

Mark A. Hershberger m...@everybody.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|Normal                      |High
                 CC|                            |ar...@wikimedia.org,
                   |                            |m...@everybody.org
         AssignedTo|wikibugs-l@lists.wikimedia. |s...@ludd.net
                   |org                         |

--- Comment #1 from Mark A. Hershberger m...@everybody.org 2011-02-22 17:59:03 UTC ---
(In reply to comment #0)
> The simplest remedy would be to register a dump filter which creates a text
> file mapping article title -> byte offset.  If this is done during the backup
> process, there is almost no resource overhead.
>
> I can write a patch if other developers agree this would be a worthwhile
> pursuit.

I'm interested.  CCing Ariel for input and assigning to you.  Let's have a
patch!



[Bug 27618] Backup dumps could contain a title index

2011-02-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

--- Comment #2 from Ariel T. Glenn ar...@wikimedia.org 2011-02-23 00:12:29 UTC ---
How will this work for runs that do parts in parallel?  I still don't know if
those pieces should be recombined later but at present we are running on the
assumption that they should be.  Not a big issue, it's just that you'll need to
write a little script to recalculate the byte offsets for the combined dump
when that phase runs, keeping track of the bit alignment to get the page start
byte in later pieces right.
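
For the simplest case, the recalculation could look like the sketch below. It
assumes the pieces are concatenated as whole files on byte boundaries, so an
offset in a later piece only needs to be shifted by the total size of the
pieces before it. It does not handle the bit-alignment case mentioned above,
where bz2 blocks from different pieces are merged into one stream and no
longer start on a byte boundary. The function name `rebase_offsets` and the
input layout are hypothetical.

```python
def rebase_offsets(parts):
    """parts: list of (part_size_in_bytes, entries) in dump order, where
    entries is a list of (title, offset_within_part).

    Returns one list of (title, offset_in_combined_file).
    """
    combined = []
    base = 0  # total size of all parts already placed in the combined file
    for size, entries in parts:
        for title, offset in entries:
            if not 0 <= offset < size:
                raise ValueError(f"offset {offset} outside part of size {size}")
            combined.append((title, base + offset))
        base += size
    return combined
```

With two parts of 100 and 250 bytes, an entry at offset 120 in the second part
ends up at offset 220 in the combined file.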

This would be handy for a number of things actually, so I'd like to see it
happen.



[Bug 27618] Backup dumps could contain a title index

2011-02-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=27618

--- Comment #3 from Adam Wight s...@ludd.net 2011-02-23 00:16:44 UTC ---
Interesting--
Also, the byte offsets are into the compressed data of course, ftell(STDOUT),
and the boundaries between bz2 chunks also become very relevant.

Thanks, I'll have a patch for review this week!
