Re: python xml DOM? pulldom? SAX?
Thanks a lot for all your replies, that was really great and helpful. Now I have some problems with the indexing: it takes too much memory and takes too long. I have to look into it. -- http://mail.python.org/mailman/listinfo/python-list
python xml DOM? pulldom? SAX?
Hi, I want to get text out of some nodes of a huge xml file (1.5 GB). The architecture of the xml file is something like this:

<parent>
  <page>
    <title>bla</title>
    <id></id>
    <revision>
      <id></id>
      <text>blablabla</text>
    </revision>
  </page>
  <page>
  ...
  </page>
</parent>

I want to combine the text out of page:title and page:revision:text for every single page element. One by one I want to index these combined texts (so for each page one index). What is the most efficient API for that: SAX (I don't think so), DOM or pulldom? Or should I just use XPath somehow? I don't want to do anything else with this xml file afterwards. I hope someone will understand me. Thank you very much, Jog
Re: python xml DOM? pulldom? SAX?
Hi, I'd advocate using SAX, as DOM-related methods imply loading the complete XML content into memory, whereas SAX grabs things on the fly. The SAX method should therefore be faster and less memory-consuming. By the way, if your goal is just to combine the text out of page:title and page:revision:text for every single page element, maybe you should also consider an XSLT filter. Regards, Thierry
Re: python xml DOM? pulldom? SAX?
On 29 Aug 2005 08:17:04 -0700 jog [EMAIL PROTECTED] wrote:

I want to get text out of some nodes of a huge xml file (1.5 GB). The architecture of the xml file is something like this [structure snipped] I want to combine the text out of page:title and page:revision:text for every single page element. One by one I want to index these combined texts (so for each page one index). What is the most efficient API for that: SAX (I don't think so), DOM or pulldom?

Definitely SAX IMHO, or xml.parsers.expat. For what you're doing, an event-driven interface is ideal. DOM parses the *entire* XML tree into memory at once, before you can do anything - highly inefficient for a large data set like this. I've never used pulldom; it might have potential, but from my (limited and flawed) understanding of it, I think it may also wind up loading most of the file into memory by the time you're done. SAX will not build any memory structures other than the ones you explicitly create (SAX is commonly used to build DOM trees). With SAX, you can just watch for any tags of interest (and perhaps some surrounding tags to provide context), extract the desired data, and do all that very efficiently. It took me a bit to get the hang of SAX, but once I did, I haven't looked back. Event-driven parsing is a brilliant solution to this problem domain.

Or should I just use Xpath somehow.

XPath usually requires a DOM tree on which it can operate. The Python XPath implementation (in PyXML) requires DOM objects. I see this as being a highly inefficient solution. Another potential solution, if the data file has extraneous information: run the source file through an XSLT transform that strips it down to only the data you need, and then apply SAX to parse it.

- Michael
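Michael's mention of xml.parsers.expat can be sketched as follows. This is a minimal editor's illustration, not code from the thread: the handler names, the path-matching scheme, and the tiny inline document are all invented for the example.

```python
import xml.parsers.expat

pages = []      # collected results, one dict per page
stack = []      # current element path
current = {}    # fields of the page being parsed
buf = []        # accumulated character data for the current element

def start(name, attrs):
    stack.append(name)
    if name == 'page':
        current.clear()
    del buf[:]   # new element: start collecting text afresh

def chars(data):
    buf.append(data)   # expat may deliver one text node in several calls

def end(name):
    path = '/'.join(stack)
    if path.endswith('page/title'):
        current['title'] = ''.join(buf)
    elif path.endswith('revision/text'):
        current['text'] = ''.join(buf)
    if name == 'page':
        pages.append(dict(current))
    stack.pop()

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start
p.EndElementHandler = end
p.CharacterDataHandler = chars

p.Parse("<parent><page><title>T1</title><id>p1</id>"
        "<revision><id>r1</id><text>body one</text></revision>"
        "</page></parent>", True)

print(pages)
```

In a real run the document would be fed to the parser in chunks (e.g. via ParseFile), which is what keeps memory flat regardless of file size.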
Re: python xml DOM? pulldom? SAX?
jog wrote:

I want to get text out of some nodes of a huge xml file (1.5 GB). The architecture of the xml file is something like this [structure snipped] I want to combine the text out of page:title and page:revision:text for every single page element. One by one I want to index these combined texts (so for each page one index)

here's one way to do it:

try:
    import cElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET

for event, elem in ET.iterparse(file):
    if elem.tag == "page":
        title = elem.findtext("title")
        revision = elem.findtext("revision/text")
        print title, revision
        elem.clear()  # won't need this any more

references:

http://effbot.org/zone/element-index.htm
http://effbot.org/zone/celementtree.htm (for best performance)
http://effbot.org/zone/element-iterparse.htm

/F
Re: python xml DOM? pulldom? SAX?
[jog]
I want to get text out of some nodes of a huge xml file (1.5 GB). The architecture of the xml file is something like this [snip] I want to combine the text out of page:title and page:revision:text for every single page element. One by one I want to index these combined texts (so for each page one index). What is the most efficient API for that: SAX (I don't think so)

SAX is perfect for the job. See code below.

DOM

If your XML file is 1.5G, you'll need *lots* of RAM and virtual memory to load it into a DOM.

or pulldom?

Not sure how pulldom does its pull optimizations, but I think it still builds an in-memory object structure for your document, which will still take buckets of memory for such a big document. I could be wrong though.

Or should I just use Xpath somehow.

Using xpath normally requires building a (D)OM, which will consume *lots* of memory for your document, regardless of how efficient the OM is. Best to use SAX and XPATH-style expressions. You can get a limited subset of xpath using a SAX handler and a stack. Your problem is particularly well suited to that kind of solution. Code that does a basic job of this for your specific problem is given below.

Note that there are a number of caveats with this code:

1. Character data handlers may get called multiple times for a single xml text() node. This is permitted in the SAX spec, and is basically a consequence of using buffered IO to read the contents of the xml file, e.g. the start of a text node is at the end of the last buffer read, and the rest of the text node is at the beginning of the next buffer.

2. This code assumes that your revision/text nodes do not contain mixed content, i.e. a mixture of elements and text, such as <revision><text>This is a piece of <b>revision</b> text</text></revision>. The code below will fail to extract all character data in that case.
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax

class Page:

    def append(self, field_name, new_value):
        old_value = ''
        if hasattr(self, field_name):
            old_value = getattr(self, field_name)
        setattr(self, field_name, '%s%s' % (old_value, new_value))

class page_matcher(xml.sax.handler.ContentHandler):

    def __init__(self, page_handler=None):
        xml.sax.handler.ContentHandler.__init__(self)
        self.page_handler = page_handler
        self.stack = []

    def check_stack(self):
        stack_expr = '/' + '/'.join(self.stack)
        if '/parent/page' == stack_expr:
            self.page = Page()
        elif '/parent/page/title/text()' == stack_expr:
            self.page.append('title', self.chardata)
        elif '/parent/page/revision/id/text()' == stack_expr:
            self.page.append('revision_id', self.chardata)
        elif '/parent/page/revision/text/text()' == stack_expr:
            self.page.append('revision_text', self.chardata)
        else:
            pass

    def startElement(self, elemname, attrs):
        self.stack.append(elemname)
        self.check_stack()

    def endElement(self, elemname):
        if elemname == 'page' and self.page_handler:
            self.page_handler(self.page)
            self.page = None
        self.stack.pop()

    def characters(self, data):
        self.chardata = data
        self.stack.append('text()')
        self.check_stack()
        self.stack.pop()

testdoc = """<parent>
<page>
<title>Page number 1</title>
<id>p1</id>
<revision>
<id>r1</id>
<text>revision one</text>
</revision>
</page>
<page>
<title>Page number 2</title>
<id>p2</id>
<revision>
<id>r2</id>
<text>revision two</text>
</revision>
</page>
</parent>"""

def page_handler(new_page):
    print 'New page'
    print 'title\t\t%s' % new_page.title
    print 'revision_id\t%s' % new_page.revision_id
    print 'revision_text\t%s' % new_page.revision_text
    print

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    parser.setContentHandler(page_matcher(page_handler))
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

HTH,

-- alan kennedy
-- email alan: http://xhaus.com/contact/alan
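As a footnote to caveat 1 above, one common way to cope with character data arriving in several pieces is to accumulate it in a buffer and only inspect it once the element ends. The sketch below is an editor's illustration of that buffering idea (the class name and sample document are invented), not part of the original post:

```python
import xml.sax

class buffered_matcher(xml.sax.handler.ContentHandler):
    """Collect title text, tolerating split characters() calls."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.stack = []
        self.buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.buffer = []           # start collecting fresh text

    def characters(self, data):
        self.buffer.append(data)   # may fire several times per text node

    def endElement(self, name):
        # join the pieces: the complete text, however the parser split it
        text = ''.join(self.buffer)
        if '/'.join(self.stack) == 'parent/page/title':
            self.titles.append(text)
        self.stack.pop()

handler = buffered_matcher()
xml.sax.parseString(
    b"<parent><page><title>Page one</title></page></parent>", handler)
print(handler.titles)
```

A fuller version would also reset the buffer in endElement and handle the other matched paths, but the accumulate-then-join pattern is the essential change.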
Re: python xml DOM? pulldom? SAX?
Alan Kennedy wrote:

SAX is perfect for the job. See code below.

depends on your definition of perfect... using a 20 MB version of jog's sample, and having replaced the print statements with local variable assignments, I get the following timings:

5 lines of cElementTree code: 7.2 seconds
60+ lines of xml.sax code: 63 seconds

(Python 2.4.1, Windows XP, Pentium 3 GHz)

/F
Re: python xml DOM? pulldom? SAX?
[Alan Kennedy]
SAX is perfect for the job. See code below.

[Fredrik Lundh]
depends on your definition of perfect...

Obviously, perfect is in the eye of the beholder ;-)

[Fredrik Lundh]
using a 20 MB version of jog's sample, and having replaced the print statements with local variable assignments, I get the following timings: 5 lines of cElementTree code: 7.2 seconds; 60+ lines of xml.sax code: 63 seconds (Python 2.4.1, Windows XP, Pentium 3 GHz)

Impressive! At first, I thought your code sample was building a tree for the entire document, so I checked the API docs. It appeared to me that an event processing model *couldn't* obtain the text for the node when notified of the node: the text node is still in the future. That's when I understood the nature of iterparse, which must generate an event *after* the node is complete, and its subdocument reified. That's also when I understood the meaning of the elem.clear() call at the end. Only the required section of the tree is modelled in memory at any given time. Nice.

There are some minor inefficiencies in my pure python sax code, e.g. building the stack expression for every evaluation, but I left them in for didactic reasons. But even if every possible speed optimisation was added to my python code, I doubt it would be able to match your code. I'm guessing that a lot of the reason why cElementTree performs best is because the model-building is primarily implemented in C: both of our solutions run python code for every node in the tree, i.e. are O(N), but yours also avoids the overhead of having function-calls/stack-frames for every single node event, by processing all events inside a single function. If the SAX algorithm were implemented in C (or Java, for that matter), I wonder if it might give comparable performance to the cElementTree code, primarily because the data structures it is building are simpler, compared to the tree-subsections being reified and discarded by cElementTree.
But that's not of relevance, because we're looking for python solutions. (Aside: I can't wait to run my solution on a fully-optimising PyPy :-) That's another nice thing I didn't know (c)ElementTree could do. enlightened-ly y'rs, -- alan kennedy -- email alan: http://xhaus.com/contact/alan
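Alan's reading of iterparse can be checked with a small self-contained run. This is an illustrative sketch, not from the thread: the sample document is invented, and the final import fallback covers later Pythons where ElementTree ships in the standard library as xml.etree.

```python
import io

try:
    import cElementTree as ET
except ImportError:
    try:
        from elementtree import ElementTree as ET
    except ImportError:
        import xml.etree.ElementTree as ET  # stdlib location in later Pythons

doc = (b"<parent>"
       b"<page><title>A</title><revision><text>one</text></revision></page>"
       b"<page><title>B</title><revision><text>two</text></revision></page>"
       b"</parent>")

pairs = []
for event, elem in ET.iterparse(io.BytesIO(doc)):
    # the default "end" event fires only once the element's subtree is
    # complete, so title and revision/text are already available here
    if elem.tag == 'page':
        pairs.append((elem.findtext('title'), elem.findtext('revision/text')))
        elem.clear()  # drop the finished subtree so memory stays bounded

print(pairs)
```

On a real 1.5 GB file the same loop would be fed the open file object directly, with elem.clear() keeping only one page's subtree alive at a time.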
Re: python xml DOM? pulldom? SAX?
jog [EMAIL PROTECTED] wrote:

Hi, I want to get text out of some nodes of a huge xml file (1.5 GB). The architecture of the xml file is something like this:

<parent>
  <page>
    <title>bla</title>
    <id></id>
    <revision>
      <id></id>
      <text>blablabla</text>
    </revision>
  </page>
  <page>
  ...
  </page>
</parent>

I want to combine the text out of page:title and page:revision:text for every single page element. One by one I want to index these combined texts (so for each page one index). What is the most efficient API for that: SAX (I don't think so), DOM or pulldom? Or should I just use XPath somehow? I don't want to do anything else with this xml file afterwards. I hope someone will understand me. Thank you very much Jog

I would use the Expat interface from Python, Awk, or even Bash shell. I'm most familiar with the shell interface to Expat, which would go something like

start()  # Usage: start tag att=value ...
{
    case $1 in
        page) unset title text ;;
    esac
}

data()  # Usage: data text
{
    case ${XML_TAG_STACK[0]}.${XML_TAG_STACK[1]}.${XML_TAG_STACK[2]} in
        title.page.*) title=$1 ;;
        text.revision.page) text=$1 ;;
    esac
}

end()  # Usage: end tag
{
    case $1 in
        page) echo "title=$title text=$text" ;;
    esac
}

expat -s start -d data -e end file.xml

-- William Park [EMAIL PROTECTED], Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/