[note that this has also been posted to comp.lang.python and discussed separately over there]

Steven D'Aprano, 20.12.2010 22:19:
ashish makani wrote:

Goal: I am trying to parse a ginormous (~1 GB) XML file.

I sympathize with you. I wonder who thought that building a 1GB XML file
was a good thing.

Forget about using any XML parser that reads the entire file into memory.
By the time that 1GB of text is read and parsed, you will probably have
something about 6-8GB (estimated) in size.

The in-memory size depends heavily on the data, specifically on the text-to-structure ratio. If the file is mostly text content, the tree will not be much larger than its serialised form. If it is mostly structure with tiny bits of text content, the in-memory tree will be a lot larger.
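
A rough way to see the effect is to parse two synthetic documents of the same serialised size and compare what stays allocated afterwards. This is only a sketch: the helper name tree_to_text_ratio and both sample documents are made up for illustration, and tracemalloc only sees allocations made through Python's allocator, so the numbers are indicative rather than exact.

```python
import tracemalloc
import xml.etree.ElementTree as ET


def tree_to_text_ratio(xml_bytes):
    """Return (bytes still allocated after parsing) / (size of the input)."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    root = ET.fromstring(xml_bytes)  # keep a reference so the tree stays alive
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (after - before) / len(xml_bytes)


# Hypothetical inputs: one big text node vs. many tiny empty elements.
text_heavy = b"<doc>" + b"x" * 1_000_000 + b"</doc>"
structure_heavy = b"<doc>" + b"<a/>" * 250_000 + b"</doc>"

print("text-heavy ratio:     ", tree_to_text_ratio(text_heavy))
print("structure-heavy ratio:", tree_to_text_ratio(structure_heavy))
```

The structure-heavy document should come out with a much larger ratio, since every element costs far more in memory than the few bytes it takes up in the file.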


I am guessing that, as this happens (over the course of 20-30 mins), the tree
representing the file is slowly being built in memory, but even after 30-40 mins,
nothing happens.

It's probably not finished. Leave it another hour or so and you'll get an
out of memory error.

Right. Once it gets into wild swapping, it can slow down almost to a halt. XML parsing itself tends to have pretty good memory locality, but the ever-growing in-memory tree obviously doesn't.


4. I then investigated some streaming libraries, but am confused - there is
SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML], the iterparse
interface [http://effbot.org/zone/element-iterparse.htm], & several other
options (minidom)

Which one is the best for my situation?

You absolutely need to use a streaming library. element-iterparse still
builds the tree, so that's no use to you.

Wrong. iterparse() lets you cut branches off the tree while it's growing; that's exactly what it's there for.
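
A minimal sketch of that branch cutting, assuming (hypothetically) that the big file is a flat sequence of <record> elements sitting directly under the root. The tag name and the count_records helper are made up for illustration; the real file will need its own tag and its own processing step.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag="record"):
    """Stream through `source`, handling each <tag> element and then dropping it."""
    count = 0
    # Request "start" events as well, so the very first event hands us the root.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            count += 1    # do the real per-record work here
            root.clear()  # cut the finished branch so the tree stays small
    return count


# Hypothetical sample data standing in for the 1 GB file.
sample = io.BytesIO(
    b"<root>" + b"<record><name>x</name></record>" * 1000 + b"</root>"
)
print(count_records(sample))
```

For a real file, pass the file name (or an open binary file) instead of the BytesIO object. Clearing the root rather than only the record keeps empty leftover elements from piling up, but it relies on the assumption that records are direct children of the root.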


I believe you should use SAX or
minidom, but that's about my limit of knowledge of streaming XML parsers.

With "minidom" being advice that's even worse than SAX: SAX would at least solve the problem, whereas minidom wouldn't, because of its intolerable memory requirements.
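
For comparison, a rough sketch of what the SAX route looks like, assuming a hypothetical layout in which each record carries a <name> child whose text is wanted. NameCollector and collect_names are illustrative names, not anything from the original question.

```python
import io
import xml.sax


class NameCollector(xml.sax.ContentHandler):
    """Collect the text of every <name> element without building a tree."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._buffer = None  # list of text chunks while inside a <name>

    def startElement(self, name, attrs):
        if name == "name":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate until the end tag.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name" and self._buffer is not None:
            self.names.append("".join(self._buffer))
            self._buffer = None


def collect_names(source):
    handler = NameCollector()
    xml.sax.parse(source, handler)  # accepts a file name or a file object
    return handler.names


# Hypothetical sample data.
sample = io.BytesIO(
    b"<root><record><name>a</name></record><record><name>b</name></record></root>"
)
print(collect_names(sample))
```

Memory use stays flat because nothing is kept apart from what the handler stores itself, but all the state tracking (which element we are in, which text chunks belong together) has to be written by hand.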

Stefan

_______________________________________________
Tutor maillist  -  Tutor@python.org
