ashish makani wrote:
Goal: I am trying to parse a ginormous (~1 GB) XML file.
I sympathize with you. I wonder who thought that building a 1GB XML file was a good thing.
Forget about using any XML parser that reads the entire file into memory. By the time that 1 GB of text is read and parsed, the in-memory tree will probably be somewhere around 6-8 GB (estimated), because every element becomes a Python object with its own overhead.
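As a rough illustration of why the tree dwarfs the raw text (a sketch of my own; the exact numbers vary by Python version and platform):

    import sys
    import xml.etree.ElementTree as ET

    elem = ET.fromstring("<item id='1'>hello</item>")
    # getsizeof reports only the shallow size of the Element object;
    # its tag string, attribute dict and text add more on top, so a
    # ~25-byte snippet of XML costs far more than 25 bytes in memory.
    print(sys.getsizeof(elem))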
[...]
My hardware setup: I have a Win7 Pro box with 8 GB of RAM & an Intel Core 2 Quad Q9400 CPU.
In order to access 8 GB of RAM, you'll be running a 64-bit OS, correct? In that case pointers are twice as wide, so you should expect the memory usage of the XML object to roughly double, to an estimated 12-16 GB.
I am guessing that, as this happens (over the course of 20-30 mins), the tree representing the XML is being slowly built in memory, but even after 30-40 mins, nothing happens.
It's probably not finished. Leave it another hour or so and you'll get an out of memory error.
4. I then investigated some streaming libraries, but am confused - there is SAX [http://en.wikipedia.org/wiki/Simple_API_for_XML], the iterparse interface [http://effbot.org/zone/element-iterparse.htm], & several other options (minidom). Which one is the best for my situation?
You absolutely need to use a streaming approach. Be careful with minidom: despite the name, it is a DOM parser that reads the whole document into memory, so it's no use to you. iterparse does build a tree, but it builds it incrementally, and if you clear each element once you've handled it, memory use stays bounded. SAX is the classic event-driven streaming parser. Either SAX or iterparse-with-clearing should work, but that's about my limit of knowledge of streaming XML parsers.
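To make that concrete, here is a minimal SAX sketch (my own example, not from the thread; the file name "huge.xml" and the element name "record" are placeholders for whatever your data actually contains). The handler receives events one at a time and never holds the whole document:

    import xml.sax

    class ElementCounter(xml.sax.ContentHandler):
        """Count occurrences of one element without building a tree."""
        def __init__(self, wanted):
            xml.sax.ContentHandler.__init__(self)
            self.wanted = wanted
            self.count = 0

        def startElement(self, name, attrs):
            # Called once per opening tag; nothing is retained afterwards.
            if name == self.wanted:
                self.count += 1

    handler = ElementCounter("record")
    xml.sax.parse("huge.xml", handler)
    print(handler.count)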
Should I instead just open the file & use a regex to look for the element I need?
That's likely to need less memory than building a parse tree, though if you read the whole file in to search it, you're still holding over a gigabyte of text. And you don't know how complex the XML is: in general you *can't* correctly parse arbitrary XML with regular expressions (although you can for simple examples). Stick with the right tool for the job, a streaming XML library - see the iterparse sketch below.
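For completeness, here is the usual iterparse idiom (again a minimal sketch with placeholder names: "huge.xml", "record", and the process() stub are illustrative, not from the original post):

    import xml.etree.ElementTree as ET

    def process(elem):
        # Stand-in for whatever work you actually need to do.
        print(elem.tag, elem.attrib)

    # iterparse yields each element as its end tag is seen; clearing
    # the element afterwards keeps memory use roughly constant.
    for event, elem in ET.iterparse("huge.xml", events=("end",)):
        if elem.tag == "record":
            process(elem)
            elem.clear()   # discard the element's children and text

The elem.clear() call is what keeps the incremental tree small; without it, iterparse gradually accumulates the whole document just like a normal parse.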
-- Steven