Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-12 Thread Carl (CBM)
On Sun, Dec 11, 2011 at 10:47 AM, Platonides platoni...@gmail.com wrote:
> You seem to think that piping the output from bzip2 will hold the XML
> dump uncompressed in memory until your script processes it. That's wrong.
> bzip2 will begin uncompressing and writing to the pipe; when the pipe
> fills, it will get blocked. As your perl script reads from there,
> space is freed and the decompression can progress.

This is correct, but the overall memory usage depends on the XML
library and programming technique being used. For XML that is too
large to comfortably fit in memory, there are techniques that allow
the script to process the data before the entire XML file is parsed
(google SAX or stream-oriented parsing). But this requires more
advanced programming techniques, such as callbacks, compared to the
more naive method of parsing all the XML into a data structure and
then returning that data structure. The naive technique can result in
large memory use if, say, the program tries to build an in-memory
array of every page revision on enwiki.

Of course if the perl script is doing the parsing itself, by just
matching regular expressions, this is not hard to do in a
stream-oriented way.
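
For the record, a minimal sketch of the callback-based, stream-oriented
approach described above; it assumes XML::Twig is installed, and the
dump filename is a placeholder:

use strict;
use warnings;
use XML::Twig;

# Decompress on the fly and stream the XML through a pipe.
open my $fh, '-|', 'bzip2', '-dc', 'dewiki-pages-articles.xml.bz2'
    or die "Cannot start bzip2: $!\n";

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called once per <page> element; only the current page
        # is held in memory at any time.
        page => sub {
            my ($twig, $page) = @_;
            print $page->first_child_text('title'), "\n";
            $twig->purge;    # discard everything parsed so far
        },
    },
);
$twig->parse($fh);
close $fh;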

- Carl

___
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette


Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-12 Thread Platonides
On 12/12/11 13:59, Carl (CBM) wrote:
> This is correct, but the overall memory usage depends on the XML
> library and programming technique being used. For XML that is too
> large to comfortably fit in memory, there are techniques that allow
> the script to process the data before the entire XML file is parsed
> (google SAX or stream-oriented parsing). But this requires more
> advanced programming techniques, such as callbacks, compared to the
> more naive method of parsing all the XML into a data structure and
> then returning that data structure. The naive technique can result in
> large memory use if, say, the program tries to build an in-memory
> array of every page revision on enwiki.
>
> Of course if the perl script is doing the parsing itself, by just
> matching regular expressions, this is not hard to do in a
> stream-oriented way.
>
> - Carl

Obviously. No matter whether it's read from a .xml or a .xml.bz2, if it
tried to build an XML tree in memory, the memory usage would be incredibly
huge. I would expect such an app to get killed for that.
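
To make that concrete, a sketch of the naive tree-building approach being
warned against here (assuming XML::Simple; the filename is a placeholder);
XMLin() parses the whole document into one Perl data structure, so on a
full dump it would exhaust memory long before returning:

use strict;
use warnings;
use XML::Simple qw(XMLin);

# Anti-pattern: XMLin() loads the ENTIRE file into nested hashes/arrays.
# Fine for small config files, fatal for a multi-gigabyte dump.
my $tree  = XMLin('dewiki-pages-articles.xml', ForceArray => ['page']);
my @pages = @{ $tree->{page} };    # every page held in RAM at once
print scalar(@pages), " pages loaded\n";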




Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-11 Thread Stefan Kühn
On 10.12.2011 20:52, Jeremy Baron wrote:
> Is it sufficient to receive the XML on stdin or do you need to be able to
> seek?
>
> It is trivial to give you XML on stdin, e.g.
> $ bzip2 -d < path/to/bz2 | perl script.pl

Hmm, stdin is possible, but I think this will need a lot of RAM on the
server. I think this is no option for the future. Every language grows
every day and the dumps will also grow. The next problem is the parallel
use of a compressed file. If more users use this compressed file the way
you suggest, then bzip2 will crash the server IMHO.

I think it is no problem to store the uncompressed XML files for easy
use. We should make rules about where they have to stay and for how long,
or we need a list where every user can say "I need only the two newest
dumps of enwiki, dewiki, ...". If a dump is not needed, then we can
delete the file.

Stefan (sk)




Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-11 Thread Mike Dupont
Hi there,
I have experience with this topic.

here is a simple read function:

use strict;
use warnings;
use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
use IO::File;

# Slurp a (possibly bzip2-compressed) file into one string.
# Note: this holds the whole uncompressed content in memory.
sub ReadFile
{
    my $filename = shift;
    my $html = "";
    my $fh;
    if ($filename =~ /\.bz2$/)
    {
        $fh = IO::Uncompress::Bunzip2->new($filename)
            or die "Couldn't open bzipped input file: $Bunzip2Error\n";
    }
    else
    {
        $fh = IO::File->new($filename)
            or die "Couldn't open input file: $!\n";
    }
    while (<$fh>)
    {
        $html .= $_;
    }
    return $html;
}

I have examples of how to process the huge bz2 file in parts here,
without downloading the whole thing:
http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/view/head:/GetPart.pl
Basically you can download a partial file over HTTP:
http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/view/head:/GetPart.pl#L122
$req->init_header('Range' => sprintf("bytes=%s-%s",
                                     $startpos,
                                     $endpos - 1));

then use bzip2recover to extract data from that block.
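
A self-contained sketch of that Range-request technique using
LWP::UserAgent; the URL and byte offsets here are placeholders, not
values from GetPart.pl:

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Placeholder values -- adjust to a real dump URL and block offsets.
my $url      = 'http://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2';
my $startpos = 0;
my $endpos   = 1_000_000;    # fetch roughly the first megabyte

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => $url);
$req->init_header('Range' => sprintf("bytes=%s-%s", $startpos, $endpos - 1));

my $res = $ua->request($req);
die "Request failed: ", $res->status_line, "\n" unless $res->is_success;

# On a 206 Partial Content response the body is just the requested
# byte range; save it and feed it to bzip2recover to salvage the
# complete bzip2 blocks it contains.
open my $out, '>', 'part.bz2' or die "Cannot write part.bz2: $!\n";
binmode $out;
print $out $res->content;
close $out;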

let me know if you have any questions



On Sat, Dec 10, 2011 at 8:52 PM, Jeremy Baron jer...@tuxmachine.com wrote:
> On Sat, Dec 10, 2011 at 14:18, Stefan Kühn kueh...@gmx.net wrote:
>> I work with perl and need the
>> uncompressed file in XML to read the dump. I have no idea how to read
>> with perl a compressed file.
>
> Is it sufficient to receive the XML on stdin or do you need to be able to
> seek?
>
> It is trivial to give you XML on stdin, e.g.
> $ bzip2 -d < path/to/bz2 | perl script.pl
>
> -Jeremy




-- 
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org



Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-11 Thread Stefan Kühn
On 11.12.2011 11:02, Mike Dupont wrote:

> let me know if you have any questions

Thanks for this script. I will try this.

Stefan (sk)



Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-11 Thread Lars Aronsson
On 12/11/2011 10:45 AM, Stefan Kühn wrote:
> Hmm, stdin is possible, but I think this will need a lot of RAM on the
> server. I think this is no option for the future. Every language grows
> every day and the dumps will also grow.

No, Stefan, it's not a matter of RAM, but of CPU. When your program
reads from a pipe, the decompression program (bunzip2 or gunzip)
consumes a few extra processor cycles every time your program
reads the next kilobyte or megabyte of input. Most often, these CPU
cycles are cheaper than storing the uncompressed XML file on disk.

Sometimes reading compressed data and decompressing it is also
faster than reading the larger uncompressed data from disk.

If you read the entire compressed file into RAM and decompress it
in RAM before starting to use it, then a lot of RAM will be needed.
But there is no reason to do this for an XML file, which is always
processed like a stream or sequence. (Remember that UNIX pipes
were invented in a time when streaming data from one tape station
to another was common, and a PDP-11 had 32 Kbyte of RAM.)

Here's how I read the *.sql.gz files in Perl:

    my $page = "enwiki-2028-page.sql.gz";
    if ($page =~ /\.gz$/) {
        open(PAGE, "gunzip < $page |");
    } else {
        open(PAGE, $page);
    }
    while (<PAGE>) {
        chomp;
        ...


-- 
   Lars Aronsson (l...@aronsson.se)
   Aronsson Datateknik - http://aronsson.se





Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-10 Thread Stefan Kühn
On 09.12.2011 17:52, Platonides wrote:
> While /mnt/user-store/dump is a mess, we have a bit of organization at
> /mnt/user-store/dumps, where they are inside folders by dbname, although
> they should additionally be categorised into folders by date.
>
> I'm surprised by the number of uncompressed files there (i.e. .xml or
> .sql). Many times it wouldn't even be needed to decompress them.

When I created the directory "dump" there was no directory "dumps".
Today we can easily merge these two directories. In the future I will
download the dumps into the directory "dumps" under the right project
directory, like "dewiki" or so. I work with perl and need the
uncompressed XML file to read the dump. I have no idea how to read
a compressed file with perl. I need only the newest dump, so at the
moment my script deletes all other dumps of a project and keeps only
the newest and the second-newest in the directory "dump".

Stefan (sk)




Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Peter Körner
On 09.12.2011 12:41, Danny B. wrote:
> Questions, comments, suggestions?

When you have data to share, the main problem is usually finding someone
who is able and willing to store multi-gigabyte files on their server
and provide the necessary bandwidth for downloaders.

Collecting all dumps in one place begins with building a
hosting location with some terabytes of storage and a fast connection.

Peter



Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Danny B .
-------- Original message --------
> From: Peter Körner osm-li...@mazdermind.de
>
> When you have data to share, the main problem is usually finding someone
> who is able and willing to store multi-gigabyte files on their server
> and provide the necessary bandwidth for downloaders.
>
> Collecting all dumps in one place begins with building a
> hosting location with some terabytes of storage and a fast connection.

We already have these dumps stored in various places under /mnt/user-storage,
and a lot of people also have them in their ~.

The purpose is to have them in only one place, since right now they are very
often duplicated across many places.

Also, only those dumps which are being used by TS users are supposed to be
stored; the proposal is not about mirroring dumps.wikimedia.org...


Danny B.



Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Peter Körner
On 09.12.2011 12:52, Danny B. wrote:
> We already have these dumps stored in various places under /mnt/user-storage,
> and a lot of people also have them in their ~.
I'm sorry, I missed that this mail was on the Toolserver mailing list.

Never mind.

Peter




Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Lars Aronsson
On 12/09/2011 12:46 PM, Peter Körner wrote:
> Collecting all dumps in one place begins with building a
> hosting location with some terabytes of storage and a fast connection.

To me, that sounds like -- the toolserver!
I'm sorry if this suggestion is naive.
Why is the toolserver short on disk space?
When I downloaded some dumps, why did I sometimes
get only 200 kbytes/second? Are we on an ADSL line?


-- 
   Lars Aronsson (l...@aronsson.se)
   Aronsson Datateknik - http://aronsson.se





Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Lars Aronsson
On 12/09/2011 12:52 PM, Danny B. wrote:
> Also, only those dumps which are being used by TS users are supposed to be
> stored; the proposal is not about mirroring dumps.wikimedia.org...

This is stupid. I suggest we change the ambition and start
to actually mirror all of dumps.wikimedia.org.


-- 
   Lars Aronsson (l...@aronsson.se)
   Aronsson Datateknik - http://aronsson.se





Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Platonides
On 09/12/11 12:52, Danny B. wrote:
> We already have these dumps stored in various places under /mnt/user-storage,
> and a lot of people also have them in their ~.
>
> The purpose is to have them in only one place, since right now they are very
> often duplicated across many places.
>
> Also, only those dumps which are being used by TS users are supposed to be
> stored; the proposal is not about mirroring dumps.wikimedia.org...
>
>
> Danny B.

While /mnt/user-store/dump is a mess, we have a bit of organization at
/mnt/user-store/dumps, where they are inside folders by dbname, although
they should additionally be categorised into folders by date.

I'm surprised by the number of uncompressed files there (i.e. .xml or
.sql). Many times it wouldn't even be needed to decompress them.



Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Lars Aronsson
On 12/09/2011 05:52 PM, Platonides wrote:
> I'm surprised by the number of uncompressed files there (i.e. .xml or
> .sql). Many times it wouldn't even be needed to decompress them.

The popular pywikipediabot framework has an -xml: option, and
I used to believe that it required the filename of an uncompressed
XML file. But I was wrong. The following works just fine:

python replace.py -lang:da \
   -xml:../dumps/dawiki/dawiki-20110404-pages-articles.xml.bz2 \
   "dansk" "svensk"

If the following would also work (but it does not), we wouldn't
have to worry about disk space at all:

python replace.py -lang:da \
   -xml:http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xml.bz2 \
   "dansk" "svensk"



-- 
   Lars Aronsson (l...@aronsson.se)
   Aronsson Datateknik - http://aronsson.se





Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Darkdadaah
> On 12/09/2011 05:52 PM, Platonides wrote:
> If the following would also work (but it does not), we wouldn't
> have to worry about disk space at all:
> python replace.py -lang:da \
>    -xml:http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xml.bz2 \
>    "dansk" "svensk"

Would that not put a burden on the bandwidth, especially with repeated
use of the same file? Unless the files were automatically cached... in
the user-store?

Darkdadaah
