Re: [Toolserver-l] Dumps handling / storage / updating etc...
On Sun, Dec 11, 2011 at 10:47 AM, Platonides <platoni...@gmail.com> wrote:
> You seem to think that piping the output from bzip2 will hold the xml
> dump uncompressed in memory until your script processes it. That's
> wrong. bzip2 will begin uncompressing and writing to the pipe; when the
> pipe fills, it will get blocked. As your perl script reads from there,
> space is freed and the unbzipping can progress.

This is correct, but the overall memory usage depends on the XML library and programming technique being used. For XML that is too large to fit comfortably in memory, there are techniques that allow the script to process the data before the entire XML file is parsed (google "SAX" or "stream-oriented parsing"). But this requires more advanced programming techniques, such as callbacks, compared to the more naive method of parsing all the XML into a data structure and then returning the data structure. That naive technique can result in large memory use if, say, the program tries to build an in-memory array of every page revision on enwiki.

Of course, if the perl script is doing the parsing itself, by just matching regular expressions, this is not hard to do in a stream-oriented way.

- Carl
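For illustration, here is a minimal sketch of the stream-oriented approach Carl describes, using Perl's XML::Parser with callbacks. The title-printing logic is a hypothetical example, not code from this thread:

    use strict;
    use warnings;
    use XML::Parser;

    # SAX-style parsing: the callbacks fire as elements stream past, so
    # memory use stays roughly constant no matter how large the dump is.
    my $in_title = 0;
    my $title    = '';

    my $parser = XML::Parser->new(Handlers => {
        Start => sub { $in_title = 1 if $_[1] eq 'title' },
        Char  => sub { $title .= $_[1] if $in_title },
        End   => sub {
            if ($_[1] eq 'title') {
                print "$title\n";    # handle one page title, then forget it
                ($in_title, $title) = (0, '');
            }
        },
    });

    # e.g.  bzip2 -dc dewiki-pages-articles.xml.bz2 | perl titles.pl
    $parser->parse(*STDIN);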
Re: [Toolserver-l] Dumps handling / storage / updating etc...
On 12/12/11 13:59, Carl (CBM) wrote:
> This is correct, but the overall memory usage depends on the XML
> library and programming technique being used. [...] That naive
> technique can result in large memory use if, say, the program tries to
> build an in-memory array of every page revision on enwiki.

Obviously. No matter whether it is read from a .xml or a .xml.bz2, if the script tried to build an XML tree in memory, the memory usage would be incredibly large. I would expect such an app to get killed for that.
Re: [Toolserver-l] Dumps handling / storage / updating etc...
Am 10.12.2011 20:52, schrieb Jeremy Baron:
> Is it sufficient to receive the XML on stdin or do you need to be able
> to seek? It is trivial to give you XML on stdin, e.g.
>
> $ <path/to/bz2 bzip2 -d | perl script.pl

Hmm, reading from stdin is possible, but I think this will need a lot of RAM on the server. I don't think this is an option for the future: every language grows every day, and the dumps will also grow. The next problem is parallel use of a compressed file. If several users read the same compressed file the way you suggest, bzip2 will crash the server, IMHO. I think it is no problem to store the uncompressed XML files for easy use. We should make rules about where they have to stay and for how long, or we need a list where every user can say "I need only the two newest dumps of enwiki, dewiki, ...". If a dump is not needed, then we can delete the file.

Stefan (sk)
Re: [Toolserver-l] Dumps handling / storage / updating etc...
Hi there, I have experience with this topic. Here is a simple read function:

    use Compress::Bzip2 qw(:all);
    use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
    use IO::File;

    sub ReadFile {
        my $filename = shift;
        my $html = '';
        my $fh;
        if ($filename =~ /\.bz2$/) {
            $fh = IO::Uncompress::Bunzip2->new($filename)
                or die "Couldn't open bzipped input file: $Bunzip2Error\n";
        } else {
            $fh = IO::File->new($filename)
                or die "Couldn't open input file: $!\n";
        }
        while (<$fh>) {
            $html .= $_;
        }
        return $html;
    }

I have examples of how to process the huge bz2 file in parts here, without downloading the whole thing: http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/view/head:/GetPart.pl

Basically you can download a partial file over HTTP: http://bazaar.launchpad.net/~jamesmikedupont/+junk/openstreetmap-wikipedia/view/head:/GetPart.pl#L122

    $req->init_header('Range' => sprintf("bytes=%s-%s", $startpos, $endpos - 1));

then use bzip2recover to extract data from that block. Let me know if you have any questions.

On Sat, Dec 10, 2011 at 8:52 PM, Jeremy Baron <jer...@tuxmachine.com> wrote:
> On Sat, Dec 10, 2011 at 14:18, Stefan Kühn <kueh...@gmx.net> wrote:
>> I work with perl and need the uncompressed file in XML to read the
>> dump. I have no idea how to read with perl a compressed file.
>
> Is it sufficient to receive the XML on stdin or do you need to be able
> to seek? It is trivial to give you XML on stdin, e.g.
>
> $ <path/to/bz2 bzip2 -d | perl script.pl
>
> -Jeremy

--
James Michael DuPont
Member of Free Libre Open Source Software Kosova http://flossk.org
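For reference, the recover step Mike mentions can look like this on the command line. This is a sketch under the assumption that a byte range of the dump has already been saved as part.bz2; bzip2recover splits out the intact compressed blocks into files named rec00001part.bz2, rec00002part.bz2, and so on, which can then be decompressed normally (blocks cut off at either end of the range will fail to decompress and can be discarded):

    $ bzip2recover part.bz2
    $ bzip2 -dc rec*part.bz2 | perl script.pl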
Re: [Toolserver-l] Dumps handling / storage / updating etc...
Am 11.12.2011 11:02, schrieb Mike Dupont:
> let me know if you have any questions

Thanks for this script. I will try this.

Stefan (sk)
Re: [Toolserver-l] Dumps handling / storage / updating etc...
On 12/11/2011 10:45 AM, Stefan Kühn wrote:
> Hmm, reading from stdin is possible, but I think this will need a lot
> of RAM on the server. I don't think this is an option for the future:
> every language grows every day, and the dumps will also grow.

No, Stefan, it's not a matter of RAM, but of CPU. When your program reads from a pipe, the decompression program (bunzip2 or gunzip) consumes a few extra processor cycles every time your program reads the next kilobyte or megabyte of input. Most often, these CPU cycles are cheaper than storing the uncompressed XML file on disk. Sometimes reading compressed data and decompressing it is even faster than reading the larger uncompressed data from disk.

If you read the entire compressed file into RAM and decompress it in RAM before starting to use it, then a lot of RAM will be needed. But there is no reason to do this for an XML file, which is always processed like a stream or sequence. (Remember that UNIX pipes were invented in a time when streaming data from one tape station to another was common, and a PDP-11 had 32 Kbyte of RAM.)

Here's how I read the *.sql.gz files in Perl:

    my $page = "enwiki-2028-page.sql.gz";
    if ($page =~ /\.gz$/) {
        open(PAGE, "gunzip < $page |");
    } else {
        open(PAGE, $page);
    }
    while (<PAGE>) {
        chomp;
        ...

--
Lars Aronsson (l...@aronsson.se)
Aronsson Datateknik - http://aronsson.se
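The same loop can also stay entirely inside Perl, without spawning gunzip, by using the core IO::Uncompress::Gunzip module. A minimal sketch; the argument handling is mine, not Lars's:

    use strict;
    use warnings;
    use IO::Uncompress::Gunzip qw($GunzipError);

    my $page = shift @ARGV;    # path to a *.sql.gz file
    my $fh = IO::Uncompress::Gunzip->new($page)
        or die "Couldn't open $page: $GunzipError\n";
    while (<$fh>) {
        chomp;
        # ... process one line of SQL here ...
    }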
Re: [Toolserver-l] Dumps handling / storage / updating etc...
Am 09.12.2011 17:52, schrieb Platonides:
> While /mnt/user-store/dump is a mess, we have a bit of organization at
> /mnt/user-store/dumps, where the files are inside folders by dbname,
> although they should additionally be categorised into folders by date.
> I'm surprised by the number of uncompressed files there (i.e. .xml or
> .sql); many times it wouldn't even be needed to decompress them.

When I created the directory "dump" there was no directory "dumps". Today we can easily merge these two directories. In the future I will download the dumps into the directory "dumps" under the right project directory, like "dewiki".

I work with perl and need the uncompressed XML file to read the dump. I have no idea how to read a compressed file with perl. I need only the newest dump, so at the moment my script deletes all other dumps of a project and keeps only the newest and the second newest in the directory "dump".

Stefan (sk)
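As the other replies in this thread show, perl can read a compressed dump through a pipe, with no uncompressed copy on disk. A minimal sketch (the file name handling is illustrative):

    use strict;
    use warnings;

    my $dumpfile = shift @ARGV;    # e.g. a pages-articles.xml.bz2 file
    open(my $fh, "-|", "bzip2", "-dc", $dumpfile)
        or die "Couldn't start bzip2: $!\n";
    while (my $line = <$fh>) {
        # each line of the uncompressed XML arrives here, one at a time
    }
    close($fh);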
Re: [Toolserver-l] Dumps handling / storage / updating etc...
Am 09.12.2011 12:41, schrieb Danny B.:
> Questions, comments, suggestions?

When you have data to share, the main problem is usually finding someone who is able and willing to store multi-gigabyte files on their server and provide the necessary bandwidth for downloaders. Collecting all dumps in one place begins with building a hosting location with some terabytes of storage and a fast connection.

Peter
Re: [Toolserver-l] Dumps handling / storage / updating etc...
---- Original message ----
From: Peter Körner <osm-li...@mazdermind.de>
> When you have data to share, the main problem is usually finding
> someone who is able and willing to store multi-gigabyte files on their
> server and provide the necessary bandwidth for downloaders. Collecting
> all dumps in one place begins with building a hosting location with
> some terabytes of storage and a fast connection.

We already have these dumps stored in various places under /mnt/user-storage, and a lot of people also have them in their ~. The purpose is to have them in only one place, since now they are very often duplicated in many places. Also, only those dumps which are being used by TS users are supposed to be stored; the proposal is not about mirroring dumps.wikimedia.org...

Danny B.
Re: [Toolserver-l] Dumps handling / storage / updating etc...
Am 09.12.2011 12:52, schrieb Danny B.:
> We already have these dumps stored in various places under
> /mnt/user-storage, and a lot of people also have them in their ~.

I'm sorry, I had missed that this mail was on the Toolserver mailing list. Never mind.

Peter
Re: [Toolserver-l] Dumps handling / storage / updating etc...
On 12/09/2011 12:46 PM, Peter Körner wrote:
> Collecting all dumps in one place begins with building a hosting
> location with some terabytes of storage and a fast connection.

To me, that sounds like -- the toolserver! I'm sorry if this suggestion is naive. Why is the toolserver short on disk space? When I downloaded some dumps, why did I sometimes get only 200 kbytes/second? Are we on an ADSL line?

--
Lars Aronsson (l...@aronsson.se)
Aronsson Datateknik - http://aronsson.se
Re: [Toolserver-l] Dumps handling / storage / updating etc...
On 12/09/2011 12:52 PM, Danny B. wrote:
> Also, only those dumps which are being used by TS users are supposed
> to be stored; the proposal is not about mirroring dumps.wikimedia.org...

This is stupid. I suggest we change the ambition and start to actually mirror all of dumps.wikimedia.org.

--
Lars Aronsson (l...@aronsson.se)
Aronsson Datateknik - http://aronsson.se
Re: [Toolserver-l] Dumps handling / storage / updating etc...
On 09/12/11 12:52, Danny B. wrote:
> We already have these dumps stored in various places under
> /mnt/user-storage, and a lot of people also have them in their ~. The
> purpose is to have them in only one place, since now they are very
> often duplicated in many places. Also, only those dumps which are
> being used by TS users are supposed to be stored; the proposal is not
> about mirroring dumps.wikimedia.org...
>
> Danny B.

While /mnt/user-store/dump is a mess, we have a bit of organization at /mnt/user-store/dumps, where the files are inside folders by dbname, although they should additionally be categorised into folders by date. I'm surprised by the number of uncompressed files there (i.e. .xml or .sql); many times it wouldn't even be needed to decompress them.
Re: [Toolserver-l] Dumps handling / storage / updating etc...
On 12/09/2011 05:52 PM, Platonides wrote:
> I'm surprised by the number of uncompressed files there (i.e. .xml or
> .sql); many times it wouldn't even be needed to decompress them.

The popular pywikipediabot framework has an -xml: option, and I used to believe that it required the filename of an uncompressed XML file. But I was wrong. The following works just fine:

    python replace.py -lang:da \
        -xml:../dumps/dawiki/dawiki-20110404-pages-articles.xml.bz2 \
        dansk svensk

If the following would also work (but it does not), we wouldn't have to worry about disk space at all:

    python replace.py -lang:da \
        -xml:http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xml.bz2 \
        dansk svensk

--
Lars Aronsson (l...@aronsson.se)
Aronsson Datateknik - http://aronsson.se
Re: [Toolserver-l] Dumps handling / storage / updating etc...
On 12/09/2011, Lars Aronsson wrote:
> If the following would also work (but it does not), we wouldn't have
> to worry about disk space at all:
>
> python replace.py -lang:da \
>     -xml:http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xml.bz2 \
>     dansk svensk

Would that not put a burden on the bandwidth, especially with repeated use of the same file? Unless the files were automatically cached... in the user-store?

Darkdadaah
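One way to avoid both repeated downloads and permanent uncompressed copies would be a small cache-on-first-use wrapper in the user-store. This is a hypothetical sketch, not an existing Toolserver service; the directory layout and the use of wget are assumptions:

    use strict;
    use warnings;
    use File::Basename qw(basename);
    use File::Path qw(make_path);

    # Usage: perl cachedump.pl dawiki http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xml.bz2
    my ($db, $url) = @ARGV;
    my $dir  = "/mnt/user-store/dumps/$db";
    my $file = "$dir/" . basename($url);

    unless (-e $file) {
        make_path($dir);    # create the cache directory if needed
        system("wget", "-q", "-O", $file, $url) == 0
            or die "Download failed: $?\n";
    }
    print "$file\n";    # other tools read the cached .bz2 from here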