ashish makani wrote:
Goal: I am trying to parse a ginormous (~1 GB) XML file.
I sympathize with you. I wonder who thought that building a 1GB XML file was a good thing.
Forget about using any XML parser that reads the entire file into memory. By the time that 1 GB of text is read and parsed, the in-memory tree will probably be somewhere around 6-8 GB (estimated), because every element becomes a Python object with its own overhead.
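As a rough illustration of why the tree dwarfs the raw text (a sketch of my own; the exact numbers vary by Python version and platform):

    import sys
    import xml.etree.ElementTree as ET

    elem = ET.fromstring("<item id='1'>hello</item>")
    # getsizeof reports only the shallow size of the Element object;
    # its tag string, attribute dict and text add more on top, so a
    # ~25-byte snippet of XML costs far more than 25 bytes in memory.
    print(sys.getsizeof(elem))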
[...]
My hardware setup: I have a Win7 Pro box with 8 GB of RAM & an Intel Core 2 Quad Q9400 CPU.
In order to access 8 GB of RAM, you'll be running a 64-bit OS, correct? In that case pointers are twice as wide, so you should expect the memory usage of the XML object to roughly double, to an estimated 12-16 GB.
I am guessing that, as this happens (over the course of 20-30 mins), the tree representing the XML is being slowly built in memory, but even after 30-40 mins, nothing happens.
It's probably not finished. Leave it another hour or so and you'll get an out of memory error.
4. I then investigated some streaming libraries, but am confused - there is SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML], the iterparse interface [http://effbot.org/zone/element-iterparse.htm], & several other options (minidom). Which one is the best for my situation?
You absolutely need to use a streaming approach. Be careful with minidom: despite the name, it is a DOM parser that reads the whole document into memory, so it's no use to you. iterparse does build a tree, but it builds it incrementally, and if you clear each element once you've handled it, memory use stays bounded. SAX is the classic event-driven streaming parser. Either SAX or iterparse-with-clearing should work, but that's about my limit of knowledge of streaming XML parsers.
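To make that concrete, here is a minimal SAX sketch (my own example, not from the thread; the file name "huge.xml" and the element name "record" are placeholders for whatever your data actually contains). The handler receives events one at a time and never holds the whole document:

    import xml.sax

    class ElementCounter(xml.sax.ContentHandler):
        """Count occurrences of one element without building a tree."""
        def __init__(self, wanted):
            xml.sax.ContentHandler.__init__(self)
            self.wanted = wanted
            self.count = 0

        def startElement(self, name, attrs):
            # Called once per opening tag; nothing is retained afterwards.
            if name == self.wanted:
                self.count += 1

    handler = ElementCounter("record")
    xml.sax.parse("huge.xml", handler)
    print(handler.count)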
Should I instead just open the file & use a regex to look for the element I need?
That's likely to need less memory than building a parse tree, though if you read the whole file in to search it, you're still holding over a gigabyte of text. And you don't know how complex the XML is: in general you *can't* correctly parse arbitrary XML with regular expressions (although you can for simple examples). Stick with the right tool for the job, a streaming XML library - see the iterparse sketch below.
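For completeness, here is the usual iterparse idiom (again a minimal sketch with placeholder names: "huge.xml", "record", and the process() stub are illustrative, not from the original post):

    import xml.etree.ElementTree as ET

    def process(elem):
        # Stand-in for whatever work you actually need to do.
        print(elem.tag, elem.attrib)

    # iterparse yields each element as its end tag is seen; clearing
    # the element afterwards keeps memory use roughly constant.
    for event, elem in ET.iterparse("huge.xml", events=("end",)):
        if elem.tag == "record":
            process(elem)
            elem.clear()   # discard the element's children and text

The elem.clear() call is what keeps the incremental tree small; without it, iterparse gradually accumulates the whole document just like a normal parse.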
-- Steven