I'm using cElementTree.iterparse to iterate over an XML file. I think iterparse is a wonderful idea - I've found it to be much more convenient than SAX for iterative processing. I have come across a problem though...
For the majority of my elements, both the start and end events contain the text of the element (i.e., element.text). For a handful of elements, the text is only available in the end event (i.e., element.text is None in the start event but is not None in the end event). The text is found without any problem when using cElementTree.parse on the file instead.

A small test to reproduce this behavior is at the end of this note, and an 80KB sample XML file is at http://www.averdevelopment.com/python/test.xml. The test file is whittled down from a much larger file, which had the problem with several more elements (but only a very small percentage of the total). I couldn't seem to delete any elements before the element in question without changing the behavior.

Am I misunderstanding something, or is this perhaps a bug?

I'm using:
http://effbot.org/downloads/cElementTree-0.9.8-20050123.win32-py2.3.exe
http://effbot.org/downloads/elementtree-1.2.4-20041228.win32.exe
http://python.org/ftp/python/2.3.4/Python-2.3.4.exe
Windows XP SP2

Thanks,
Jimmy

####################################################
import sets
from cElementTree import dump, iterparse, parse

# Text values seen for <ele> elements, grouped by event type
values = dict(start=sets.Set(), end=sets.Set())
i = 0
for event, element in iterparse('test.xml', ('start', 'end')):
    if element.tag.endswith('}ele') and element.text:
        values[event].add(element.text)
    # Report any <ele> element whose text is missing for this event
    if element.tag.endswith('}ele') and element.text is None:
        print i, event + ' '
        dump(element)
    if element.text == '297.257582':
        print i, event + ' '
        dump(element)
    i += 1

print 'In start but not end:', values['start'] - values['end']
print 'In end but not start:', values['end'] - values['start']
print

# Finding the same text with ElementTree is no problem
gpx = parse('test.xml').getroot()
# Search from the parsed root rather than the leftover loop variable
trk = gpx.findall('{http://www.topografix.com/GPX/1/1}trk')[-1]
trkseg = trk.findall('{http://www.topografix.com/GPX/1/1}trkseg')[-1]
trkpt = trkseg.findall('{http://www.topografix.com/GPX/1/1}trkpt')[-2]
ele = trkpt.findall('{http://www.topografix.com/GPX/1/1}ele')[0]
print ele.text
####################################################

Output:

3622 start
<ns0:ele xmlns:ns0="http://www.topografix.com/GPX/1/1" />
3623 end
<ns0:ele xmlns:ns0="http://www.topografix.com/GPX/1/1">297.257582</ns0:ele>
In start but not end: Set([])
In end but not start: Set(['297.257582'])

297.257582
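A minimal sketch of one way the start/end difference could arise, assuming the parser receives its input in chunks and a chunk boundary happens to fall between a start tag and its text. This is not a confirmed diagnosis of test.xml. It uses the standard library's xml.etree.ElementTree.XMLPullParser (Python 3) rather than cElementTree. The tiny inline document, the chunk split, and the helper name ele_text_by_event are made up for illustration.

```python
from xml.etree.ElementTree import XMLPullParser

# Hypothetical stand-in for one <trkpt> of the GPX file.
DOC_HEAD = '<trkpt xmlns="http://www.topografix.com/GPX/1/1"><ele>'
DOC_TAIL = '297.257582</ele></trkpt>'

# Same document, delivered either in one piece or split right after <ele>.
WHOLE_CHUNKS = [DOC_HEAD + DOC_TAIL]
SPLIT_CHUNKS = [DOC_HEAD, DOC_TAIL]


def ele_text_by_event(chunks):
    """Feed chunks one at a time and record (event, text) for each <ele>.

    Events are read after every chunk, which mimics a consumer that looks
    at element.text as soon as the event is reported.
    """
    parser = XMLPullParser(events=("start", "end"))
    seen = []

    def drain():
        for event, element in parser.read_events():
            if element.tag.endswith("}ele"):
                seen.append((event, element.text))

    for chunk in chunks:
        parser.feed(chunk)
        drain()
    parser.close()
    drain()  # pick up any events that only became available on close()
    return seen


if __name__ == "__main__":
    print("whole:", ele_text_by_event(WHOLE_CHUNKS))
    print("split:", ele_text_by_event(SPLIT_CHUNKS))
```

With the split input, the start event should report None for the text, because the character data has not been fed yet. With the single-chunk input, the text should already be present at both events. If that is what is happening here, reading element.text only on the end event would avoid the problem.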