Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-12 Thread Carl (CBM)
On Sun, Dec 11, 2011 at 10:47 AM, Platonides platoni...@gmail.com wrote: You seem to think that piping the output from bzip2 will hold the XML dump uncompressed in memory until your script processes it. That's wrong. bzip2 will begin decompressing and writing to the pipe; when the pipe fills,
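The buffering behaviour Carl describes, decompression proceeding in step with the consumer rather than materialising the whole dump, can be sketched in Python with the standard `bz2` module (a minimal illustration, not part of the original thread; file names are made up):

```python
import bz2
import os
import tempfile

# Write a small compressed file, then stream it back. At no point does
# the full decompressed content exist in memory at once.
path = os.path.join(tempfile.mkdtemp(), "sample.xml.bz2")
with bz2.open(path, "wt") as f:
    for i in range(1000):
        f.write(f"<page><title>Page {i}</title></page>\n")

# bz2.open decompresses incrementally, much like `bzip2 -d` writing
# into a pipe: each read pulls only the next chunk of output.
count = 0
with bz2.open(path, "rt") as f:
    for line in f:  # one decompressed line at a time
        count += 1
print(count)
```

The same back-pressure applies to a real `bzip2 -d | script` pipeline: when the reading script stalls, the pipe fills and bzip2 blocks until space frees up.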

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-12 Thread Platonides
On 12/12/11 13:59, Carl (CBM) wrote: This is correct, but the overall memory usage depends on the XML library and programming technique being used. For XML that is too large to comfortably fit in memory, there are techniques to allow the script to process the data before the entire XML
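One such technique, not named in the preview but standard practice for oversized XML, is event-driven (SAX-style) parsing. A sketch using Python's `xml.etree.ElementTree.iterparse` over a compressed stream (the tiny dump built here is illustrative, not a real Wikimedia dump):

```python
import bz2
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a small compressed XML "dump" to parse.
path = os.path.join(tempfile.mkdtemp(), "mini.xml.bz2")
with bz2.open(path, "wt") as f:
    f.write("<mediawiki>")
    for i in range(100):
        f.write(f"<page><title>T{i}</title></page>")
    f.write("</mediawiki>")

pages = 0
with bz2.open(path, "rb") as f:
    # iterparse reads the stream incrementally; clearing each element
    # after use keeps memory bounded regardless of dump size.
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "page":
            pages += 1
            elem.clear()
print(pages)
```

With this pattern, memory usage is proportional to one `<page>` element, not to the whole dump.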

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-11 Thread Stefan Kühn
On 10.12.2011 20:52, Jeremy Baron wrote: Is it sufficient to receive the XML on stdin or do you need to be able to seek? It is trivial to give you XML on stdin, e.g. $ bzip2 -d < path/to/bz2 | perl script.pl Hmm, stdin is possible, but I think this will need a lot of RAM on the

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-11 Thread Mike Dupont
Hi there, I have experience with this topic. Here is a simple read function: use Compress::Bzip2 qw(:all); use IO::Uncompress::Bunzip2 qw($Bunzip2Error); use IO::File; sub ReadFile { my $filename = shift; my $html = ""; my $fh; if ($filename =~ /\.bz2/) {
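The preview cuts Mike's Perl `ReadFile` off mid-function, but its shape is clear: pick a decompressing reader when the name ends in `.bz2`, a plain one otherwise. A Python sketch of the same idea (my reconstruction, not the rest of his script; like his version it slurps the whole file, so it suits small inputs rather than full dumps):

```python
import bz2

def read_file(filename):
    """Return the full text of `filename`, transparently
    decompressing it when the name ends in .bz2."""
    opener = bz2.open if filename.endswith(".bz2") else open
    with opener(filename, "rt") as fh:
        return fh.read()
```

Usage is identical for both kinds of file: `read_file("dump.xml")` and `read_file("dump.xml.bz2")` return the same text.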

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-11 Thread Stefan Kühn
On 11.12.2011 11:02, Mike Dupont wrote: let me know if you have any questions. Thanks for this script. I will try this. Stefan (sk) Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-11 Thread Lars Aronsson
On 12/11/2011 10:45 AM, Stefan Kühn wrote: Hmm, stdin is possible, but I think this will need a lot of RAM on the server. I think this is no option for the future. Every language grows every day and the dumps will also grow. No, Stefan, it's not a matter of RAM, but of CPU. When your

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-10 Thread Stefan Kühn
On 09.12.2011 17:52, Platonides wrote: While /mnt/user-store/dump is a mess, we have a bit of organization at /mnt/user-store/dumps where they are inside folders by dbname. Although they should additionally be categorised into folders by date. I'm surprised by the number of

[Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Danny B .
Hi, currently there is quite a big mess in how dump files are handled. They are located in several locations without any system to it; some locations are public, some private, so there are obviously duplicates which eat up space, etc. Also their naming differs. Hence I've got this proposal:

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Peter Körner
On 09.12.2011 12:41, Danny B. wrote: Questions, comments, suggestions? When you have data to share, the main problem is usually finding someone who is able and willing to store multi-gigabyte files on their server and provide the necessary bandwidth for downloaders. Collecting all dumps in

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Danny B .
---- Original message ---- From: Peter Körner osm-li...@mazdermind.de When you have data to share, the main problem is usually finding someone who is able and willing to store multi-gigabyte files on their server and provide the

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Peter Körner
On 09.12.2011 12:52, Danny B. wrote: We already have these dumps stored in /mnt/user-storage/various places, as well as a lot of people have them in their ~. I'm sorry, I missed that this mail was on the Toolserver mailing list. Never mind. Peter

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Lars Aronsson
On 12/09/2011 12:46 PM, Peter Körner wrote: Collecting all dumps in one place begins with building a hosting-location with some terabytes of storage and a fast connection. To me, that sounds like -- the toolserver! I'm sorry if this suggestion is naive. Why is the toolserver short on disk

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Lars Aronsson
On 12/09/2011 12:52 PM, Danny B. wrote: Also, only those dumps which are being used by TS users are supposed to be stored; the proposal is not about mirroring dumps.wikimedia.org... This is stupid. I suggest we change the ambition and start to actually mirror all of dumps.wikimedia.org.

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Platonides
On 09/12/11 12:52, Danny B. wrote: We already have these dumps stored in /mnt/user-storage/various places, as well as a lot of people have them in their ~. The purpose is to have them in only one place, since now they are very often duplicated in many places. Also, only those dumps,

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Lars Aronsson
On 12/09/2011 05:52 PM, Platonides wrote: I'm surprised by the number of uncompressed files there (i.e. .xml or .sql). Many times it wouldn't even be needed to decompress them. The popular pywikipediabot framework has an -xml: option, and I used to believe that it required the filename of an

Re: [Toolserver-l] Dumps handling / storage / updating etc...

2011-12-09 Thread Darkdadaah
On 12/09/2011 05:52 PM, Platonides wrote: If the following would also work (but it does not), we wouldn't have to worry about disk space at all: python replace.py -lang:da \ -xml:http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xml.bz2 \ dansk svensk Would that
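For Darkdadaah's wish to work, passing a dump URL straight to `-xml:`, the tool would have to decompress the HTTP stream on the fly instead of seeking in a local file. Whether pywikipediabot could be taught this is the open question in the thread; the core mechanism itself is simple with an incremental decompressor. A sketch (the function and the URL wiring are my illustration, not pywikipediabot code):

```python
import bz2

def stream_decompress(chunks):
    """Yield decompressed byte blocks from an iterable of
    bz2-compressed chunks, e.g. chunks read off an HTTP
    response, so a dump URL could be processed without
    ever saving the file to disk."""
    decomp = bz2.BZ2Decompressor()
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out

# With urllib, the chunk source could look like this (sketch only):
#   resp = urllib.request.urlopen(dump_url)
#   chunks = iter(lambda: resp.read(64 * 1024), b"")
#   for block in stream_decompress(chunks): ...
```

The decompressed blocks could then be fed to a streaming XML parser, so disk space would indeed stop being a concern, at the cost of re-downloading on every run.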