On 11/12/11 10:45, Stefan Kühn wrote:
> On 10.12.2011 20:52, Jeremy Baron wrote:
>> Is it sufficient to receive the XML on stdin or do you need to be able to 
>> seek?
>>
>> It is trivial to give you XML on stdin e.g.
>> $ < path/to/bz2 bzip2 -d | perl script.pl
> 
> Hmm, stdin is possible, but I think this will need a lot of RAM on the
> server. I think this is not an option for the future. Every language
> grows every day, and the dumps will grow too. The next problem is
> parallel use of a compressed file. If more users read the compressed
> file this way, then bzip2 will crash the server IMHO.
> 
> I think it is no problem to store the uncompressed XML files for ease
> of use. We should make rules about where they are kept and for how long,
> or we need a list where every user can say "I need only the two newest
> dumps of enwiki, dewiki,...". If a dump is no longer needed, we can
> delete the file.
> 
> Stefan (sk)

You seem to think that piping the output of bzip2 will hold the whole
XML dump uncompressed in memory until your script processes it. That's
wrong. bzip2 begins decompressing and writing to the pipe; when the pipe
buffer fills, the write blocks. As your perl script reads from the pipe,
space is freed and the decompression can progress, so only a small,
fixed amount of data is in flight at any time, however large the dump is.
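
If you want to see the blocking for yourself, here is a minimal sketch in
Python (not the perl script from this thread). A writer thread stands in
for `bzip2 -d` and a deliberately idle reader stands in for the script.
The names demo_backpressure, CHUNK and TOTAL_BYTES are made up for this
illustration, and the 4 MiB total is just an assumed size that is larger
than a typical pipe buffer.

```python
import os
import threading
import time

CHUNK = b"x" * 65536            # 64 KiB per write, standing in for decompressed XML
TOTAL_BYTES = 4 * 1024 * 1024   # assumed total, far more than a pipe buffer holds


def demo_backpressure(pause=0.5):
    """Return (bytes written while the reader was idle, bytes finally read)."""
    read_fd, write_fd = os.pipe()
    written = [0]  # bytes the writer has handed to the pipe so far

    def writer():
        # Plays the role of `bzip2 -d`: writes as fast as the pipe allows.
        try:
            while written[0] < TOTAL_BYTES:
                remaining = TOTAL_BYTES - written[0]
                written[0] += os.write(write_fd, CHUNK[:remaining])
        finally:
            os.close(write_fd)  # signals end-of-file to the reader

    thread = threading.Thread(target=writer)
    thread.start()

    # Plays the role of a slow script: it reads nothing for a while.
    time.sleep(pause)
    stalled_at = written[0]  # the writer is blocked once the pipe buffer is full

    # Now drain the pipe; each read frees space and unblocks the writer.
    received = 0
    while True:
        data = os.read(read_fd, 65536)
        if not data:
            break
        received += len(data)
    os.close(read_fd)
    thread.join()
    return stalled_at, received


if __name__ == "__main__":
    stalled_at, received = demo_backpressure()
    print("written while reader was idle:", stalled_at)
    print("received in total:", received)
```

The first number should stay far below the total (the exact value
depends on the operating system's pipe buffer size), which is the same
thing that happens between bzip2 and a script reading stdin.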

_______________________________________________
Toolserver-l mailing list (Toolserver-l@lists.wikimedia.org)
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette
