On Sun, Dec 11, 2011 at 10:47 AM, Platonides platoni...@gmail.com wrote:
You seem to think that piping the output from bzip2 will hold the XML
dump uncompressed in memory until your script processes it. That's wrong:
bzip2 begins uncompressing and writing to the pipe, and when the pipe
fills, it simply blocks until your script reads more.
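The streaming behaviour described here can be sketched with Python's bz2 module standing in for the external bzip2 process (the data below is synthetic, just to make the example self-contained):

```python
import bz2
import io

# A small stand-in for a dump file, compressed in memory.
raw = b"<page>hello</page>\n" * 1000
blob = bz2.compress(raw)

# Decompress in fixed-size chunks: only one chunk's worth of
# uncompressed data exists at any moment, mirroring how a pipe
# hands the consumer small buffers instead of the whole dump.
dec = bz2.BZ2Decompressor()
src = io.BytesIO(blob)
total = 0
for chunk in iter(lambda: src.read(4096), b""):
    total += len(dec.decompress(chunk))

print(total == len(raw))  # the full stream arrived, chunk by chunk
```

The same backpressure applies to a real pipe: when the consumer lags, the producer blocks, so memory use stays bounded.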
On 12/12/11 13:59, Carl (CBM) wrote:
This is correct, but the overall memory usage depends on the XML
library and programming technique being used. For XML that is too
large to comfortably fit in memory, there are techniques that allow
the script to process the data before the entire XML document has
been read.
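One such technique is incremental parsing; here is a minimal sketch using ElementTree's iterparse, with an illustrative <page> tag rather than the real dump schema:

```python
import bz2
import xml.etree.ElementTree as ET

def iter_pages(path):
    """Yield <page> elements from a .bz2 XML file without ever
    holding the whole document in memory."""
    with bz2.open(path, "rb") as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == "page":
                yield elem
                elem.clear()  # free the subtree we just processed
```

Each yielded element is discarded right after the caller is done with it, so memory use stays flat regardless of dump size.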
On 10.12.2011 20:52, Jeremy Baron wrote:
Is it sufficient to receive the XML on stdin or do you need to be able to
seek?
It is trivial to give you XML on stdin, e.g.
$ bzip2 -d < path/to/bz2 | perl script.pl
Hmm, stdin is possible, but I think this will need a lot of RAM on
the server.
Hi there,
I have experience with this topic.
Here is a simple read function:

use IO::Uncompress::Bunzip2 qw($Bunzip2Error);
use IO::File;

sub ReadFile
{
    my $filename = shift;
    my $html     = '';
    my $fh;
    if ($filename =~ /\.bz2$/) {
        $fh = IO::Uncompress::Bunzip2->new($filename)
            or die "bunzip2 failed: $Bunzip2Error\n";
    }
    else {
        $fh = IO::File->new($filename, 'r') or die "open: $!\n";
    }
    $html .= $_ while <$fh>;
    $fh->close;
    return $html;
}
On 11.12.2011 11:02, Mike Dupont wrote:
let me know if you have any questions
Thanks for this script. I will try this.
Stefan (sk)
___
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
On 12/11/2011 10:45 AM, Stefan Kühn wrote:
Hmm, stdin is possible, but I think this will need a lot of RAM on
the server. I don't think this is an option for the future. Every
language grows every day, and the dumps will also grow.
No, Stefan, it's not a matter of RAM, but of CPU. When your
On 09.12.2011 17:52, Platonides wrote:
While /mnt/user-store/dump is a mess, we have a bit of organization at
/mnt/user-store/dumps, where they are inside folders by dbname. They
should additionally be categorised into folders by date, though.
I'm surprised by the number of uncompressed files there.
Hi,
Currently there is quite a big mess in how dump files are handled. They
are located in several locations without any system; some locations are
public, some private, so there are obviously duplicates which waste
space, etc. Their naming also differs.
Hence I've got this proposal:
On 09.12.2011 12:41, Danny B. wrote:
Questions, comments, suggestions?
When you have data to share, the main problem is usually finding someone
who is able and willing to store multi-gigabyte files on their server
and provide the necessary bandwidth for downloaders.
Collecting all dumps in one place begins with building a hosting
location with some terabytes of storage and a fast connection.
Original message
From: Peter Körner osm-li...@mazdermind.de
When you have data to share, the main problem is usually finding someone
who is able and willing to store multi-gigabyte files on their server
and provide the necessary bandwidth for downloaders.
On 09.12.2011 12:52, Danny B. wrote:
We already have these dumps stored in various places under
/mnt/user-storage/, and a lot of people have them in their ~.
I'm sorry, I missed that this mail was on the Toolserver mailing list.
Never mind.
Peter
On 12/09/2011 12:46 PM, Peter Körner wrote:
Collecting all dumps in one place begins with building a
hosting-location with some terabytes of storage and a fast connection.
To me, that sounds like -- the toolserver!
I'm sorry if this suggestion is naive.
Why is the toolserver short on disk space?
On 12/09/2011 12:52 PM, Danny B. wrote:
Also, only those dumps which are actually used by TS users are supposed
to be stored; the proposal is not about mirroring dumps.wikimedia.org...
This is stupid. I suggest we change the ambition and start
to actually mirror all of dumps.wikimedia.org.
On 09/12/11 12:52, Danny B. wrote:
We already have these dumps stored in various places under
/mnt/user-storage/, and a lot of people have them in their ~.
The purpose is to have them in only one place, since now they are very
often duplicated and in many places.
Also, only those dumps which are actually used by TS users are supposed
to be stored.
On 12/09/2011 05:52 PM, Platonides wrote:
I'm surprised by the number of uncompressed files there (ie .xml or
.sql). Many times it wouldn't even be needed to decompress them.
The popular pywikipediabot framework has an -xml: option, and
I used to believe that it required the filename of an uncompressed
file.
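For reference, reading a compressed dump directly is trivial in Python's bz2 module, so no uncompressed copy ever needs to exist on disk (a minimal sketch; the helper name is illustrative):

```python
import bz2

def read_compressed(path):
    """Read a .bz2 file transparently, line by line, exactly as
    one would read a plain .xml file."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]
```

A framework that accepts a filename can apply the same trick internally whenever the name ends in .bz2.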
On 12/09/2011 05:52 PM, Platonides wrote:
If the following would also work (but it does not), we wouldn't
have to worry about disk space at all:
python replace.py -lang:da \
    -xml:http://dumps.wikimedia.org/dawiki/20111202/dawiki-20111202-pages-articles.xml.bz2 \
    dansk svensk
Would that
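A hypothetical sketch of what such URL support would need underneath (this is not something pywikipediabot does; the function name and chunking are assumptions): fetch the .bz2 over HTTP and feed it through an incremental decompressor, so neither the compressed nor the uncompressed dump ever touches local disk.

```python
import bz2
from urllib.request import urlopen

def stream_lines(url, chunk_size=1 << 16):
    """Yield decompressed lines from a remote .bz2 without saving it."""
    dec = bz2.BZ2Decompressor()
    buf = b""
    with urlopen(url) as resp:
        while chunk := resp.read(chunk_size):
            buf += dec.decompress(chunk)
            *lines, buf = buf.split(b"\n")  # keep the partial last line
            yield from lines
    if buf:
        yield buf
```

Only one compressed chunk plus one partial line is buffered at a time, so the approach scales to multi-gigabyte dumps.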