Hi,

Bill Kinnersley wrote:
> Can anyone add any substance to this remark? With today's typical
> system RAM of 2GB to 3GB, is it even worth consideration any more that a
> document might not fit in memory?

At least, it allows you to parse pretty large documents. But think of
handling more than one document in parallel. In that case, you'd still want
to make sure things don't hit the swap disk.

> Offhand I'd guess the size of the XML file and the size of the DOM tree
> would be in the same ballpark. So unless I've got more than 500MB of
> XML to read, I'm clear. Right or wrong?

Wrong. The stdlib's minidom in particular is terribly memory hungry. Fredrik
has some benchmarks and memory size hints on his cElementTree page.

http://effbot.org/zone/celementtree.htm#benchmarks

Here are some other benchmarks from Ian Bicking on HTML parsers:

http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Sadly, I do not know of any direct comparison of lxml.etree and cElementTree
regarding memory usage, but my guess is that cET is still a bit better than
lxml.etree (which is impressively memory friendly already).

A quick comparison for a 3.4MB XML file with a lot of text and very short tag
names (the Old Testament in English) gave me almost exactly the same parsing
time for both. When done, I had a 17MB Python interpreter for lxml.etree and
a 10MB interpreter for cET. Depending on your XML, this may shift in either
direction, as the two optimise their time and memory usage very differently.

For minidom, I get about 60MB, where Fredrik got 80MB. That's still about a
factor of 17-23 compared to the serialised XML file, whereas lxml and cET end
up with a factor of 3-5.

Your assumption that you can use a system with 3GB of RAM to parse a 500MB
XML file into an in-memory tree can easily turn out wrong for XML files with
more tags and shorter text content (say, numbers), or for documents in
non-European languages.

Stefan

_______________________________________________
XML-SIG maillist  -  XML-SIG@python.org
http://mail.python.org/mailman/listinfo/xml-sig
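
For anyone who wants to check the memory factor on their own documents, here
is a minimal sketch, not the method behind the numbers above (those were
interpreter sizes after parsing). It assumes a current Python 3, where
xml.etree.ElementTree stands in for cElementTree, and it uses tracemalloc,
which only counts allocations made through Python's own allocator. That would
under-report lxml.etree, whose tree lives in libxml2, so lxml is left out.
The helper names make_sample_xml, peak_parse_memory and memory_factors are
hypothetical, and the synthetic document (short tags, short numeric text) is
just one possible worst-case shape.

```python
import tracemalloc
import xml.dom.minidom
import xml.etree.ElementTree as ET


def make_sample_xml(n_items=2000):
    """Build a synthetic document with short tags and short numeric text."""
    items = "".join("<v>%d</v>" % i for i in range(n_items))
    return ("<root>%s</root>" % items).encode("utf-8")


def peak_parse_memory(parse, data):
    """Return the peak bytes allocated through Python while parsing data."""
    tracemalloc.start()
    try:
        tree = parse(data)  # hold a reference so the tree is alive at the peak
        _, peak = tracemalloc.get_traced_memory()
        del tree
    finally:
        tracemalloc.stop()
    return peak


def memory_factors(data):
    """Map parser name to (peak traced memory / serialised size)."""
    parsers = {
        "ElementTree": ET.fromstring,
        "minidom": xml.dom.minidom.parseString,
    }
    return {
        name: peak_parse_memory(parse, data) / len(data)
        for name, parse in parsers.items()
    }


if __name__ == "__main__":
    data = make_sample_xml()
    for name, factor in memory_factors(data).items():
        print("%-12s %.1fx the serialised size" % (name, factor))
```

Swapping make_sample_xml for the bytes of a real file gives a rough factor
for that document. The absolute figures will differ from process-level
measurements, so treat them as a relative comparison only.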