On Tue, Dec 21, 2010 at 3:44 AM, Stefan Behnel <stefan...@behnel.de> wrote:
> [note that this has also been posted to comp.lang.python and discussed
> separately over there]
>
> Steven D'Aprano, 20.12.2010 22:19:
>>
>> ashish makani wrote:
>>
>>> Goal : I am trying to parse a ginormous ( ~ 1gb) xml file.
>>
>> I sympathize with you. I wonder who thought that building a 1GB XML file
>> was a good thing.

David Mertz, Ph.D.
Comparator, Gnosis Software, Inc.
June 2003
http://gnosis.cx/publish/programming/xml_matters_29.html

that was just the first listing:
http://www.google.com/search?client=ubuntu&channel=fs&q=parsing+gigabyte+xml+python&ie=utf-8&oe=utf-8

>>
>> Forget about using any XML parser that reads the entire file into memory.
>> By the time that 1GB of text is read and parsed, you will probably have
>> something about 6-8GB (estimated) in size.
>
> The in-memory size is highly dependent on the data, specifically the
> text-to-structure ratio. If it's a lot of text content, the difference to
> the serialised tree will be small. If it's a lot of structure with tiny bits
> of text content, the in-memory size of the tree will be a lot larger.
>
>
>>> I am guessing, as this happens (over the course of 20-30 mins), the tree
>>> representing is being slowly built in memory, but even after 30-40 mins,
>>> nothing happens.
>>
>> It's probably not finished. Leave it another hour or so and you'll get an
>> out of memory error.
>
> Right, if it gets into wild swapping, it can slow down almost to a halt,
> even though the XML parsing itself tends to have pretty good memory locality
> (but the ever growing in-memory tree obviously doesn't).
>
>
>>> 4. I then investigated some streaming libraries, but am confused - there is
>>> SAX[http://en.wikipedia.org/wiki/Simple_API_for_XML] , the iterparse
>>> interface[http://effbot.org/zone/element-iterparse.htm], & several other
>>> options ( minidom)
>>>
>>> Which one is the best for my situation ?
>>
>> You absolutely need to use a streaming library. element-iterparse still
>> builds the tree, so that's no use to you.
>
> Wrong. iterparse() allows you to cut branches in the tree while it's
> growing, that's exactly what it's there for.
>
>
>> I believe you should use SAX or
>> minidom, but that's about my limit of knowledge of streaming XML parsers.
>
> With "minidom" being an advice that's even worse than SAX - SAX would at
> least solve the problem, whereas minidom wouldn't because of its intolerable
> memory requirements.
>
> Stefan
>
> _______________________________________________
> Tutor maillist - tu...@python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
>

--
They're installing the breathalyzer on my email account next week.
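
For what it's worth, here is a minimal sketch of the branch-cutting approach Stefan describes for iterparse(). It assumes the standard library's xml.etree.ElementTree and that the records of interest are direct children of the root element. The function name iter_records, the tag name "entry" and the sample data are made up for illustration; they are not taken from the original poster's file.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield each <tag> element from source, then drop it from the tree.

    Assumption: matching elements are direct children of the root, so
    clearing the root after each one keeps the in-memory tree small.
    """
    # Request "start" events as well, so the root element can be captured.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root

    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Cut the finished branch (and anything before it) off the root.
            root.clear()


# Tiny stand-in for the real file (hypothetical data).
sample = (
    b"<log>"
    b"<entry id='1'>a</entry>"
    b"<entry id='2'>b</entry>"
    b"<entry id='3'>c</entry>"
    b"</log>"
)

ids = [entry.get("id") for entry in iter_records(io.BytesIO(sample), "entry")]
print(ids)
```

For a real file, pass a filename or an open binary file object instead of the BytesIO buffer. Each element has to be processed before the generator resumes, because it is cleared from the tree at that point.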