On Sun, Dec 11, 2011 at 10:47 AM, Platonides <[email protected]> wrote:
> You seem to think that piping the output from bzip2 will hold the xml
> dump uncompressed in memory until your script processes it. That's wrong.
> bzip2 will begin uncompressing and writing to the pipe; when the pipe
> fills, it will block. As your perl script reads from there, space is
> freed and the unbzipping can progress.

This is correct, but the overall memory usage depends on the XML
library and the programming technique being used. For XML that is too
large to fit comfortably in memory, there are techniques that let the
script process the data before the entire XML file has been parsed
(google "SAX" or "stream-oriented parsing"). These require more
advanced programming techniques, such as callbacks, than the naive
method of parsing all the XML into a data structure and then returning
that structure. The naive technique can use a large amount of memory
if, say, the program tries to build an in-memory array of every page
revision on enwiki.
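For illustration, here is a minimal stream-oriented sketch in Python
(the thread concerns a Perl script, but the idea is identical). It uses
ElementTree's iterparse and a small in-memory stand-in for the dump;
with a real dump the file object would be the decompression pipe, and
the <page>/<title> element names follow the MediaWiki export format:

```python
# Stream-oriented parsing: each <page> subtree is processed and then
# discarded with elem.clear(), so memory stays bounded no matter how
# many pages the dump contains.
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the huge dump; in practice this would be the
# file object reading from the bzip2 pipe.
dump = io.BytesIO(b"""<mediawiki>
  <page><title>Foo</title><revision><text>first</text></revision></page>
  <page><title>Bar</title><revision><text>second</text></revision></page>
</mediawiki>""")

titles = []
for event, elem in ET.iterparse(dump, events=("end",)):
    if elem.tag == "page":
        titles.append(elem.findtext("title"))
        elem.clear()  # free the subtree we just finished with

print(titles)  # ['Foo', 'Bar']
```

The call to elem.clear() is the important part: without it, iterparse
still builds the whole tree incrementally and the memory saving is lost.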

Of course, if the perl script does the parsing itself by just matching
regular expressions, it is not hard to do this in a stream-oriented
way.
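A sketch of that regex approach, again in Python for illustration, with
an in-memory stand-in for the decompressed stream (in practice the loop
would read stdin fed by bzip2):

```python
# Regex-based streaming: read the decompressed dump one line at a time
# and match titles as they go by -- constant memory, no XML parser.
import io
import re

# Stand-in for the pipe from bzip2 feeding the script's stdin.
stream = io.StringIO(
    "<page>\n  <title>Foo</title>\n</page>\n"
    "<page>\n  <title>Bar</title>\n</page>\n"
)

title_re = re.compile(r"<title>(.*?)</title>")
titles = []
for line in stream:
    m = title_re.search(line)
    if m:
        titles.append(m.group(1))

print(titles)  # ['Foo', 'Bar']
```

This only works as long as the markup you care about never spans lines,
which is why the SAX-style approach is safer for anything nontrivial.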

- Carl

_______________________________________________
Toolserver-l mailing list ([email protected])
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
Posting guidelines for this list: 
https://wiki.toolserver.org/view/Mailing_list_etiquette