Hello all, I'm trying to merge and filter some XML. This is mostly working, but the output includes one node that's not in my include list. Python version is 3.4.0.
The goal is to merge multiple xml files and then write a new one based on whether or not <pid> is in an include list. In the mock data below, the 3 xml files have a total of 8 <rec> nodes, and I have 4 <pid> values in my list. The output is correctly formed xml, but it includes 5 <rec> nodes: the 4 in the list, plus 89012 from input1.xml.

It runs without error. I've used type() to compare rec.find('part').find('pid').text and the items in the list; both are strings. When the first for loop is done, xmlet has 8 rec nodes. Is there a problem in the iteration in the second for? Any other recommendations also welcome. Thanks!

The code itself was cobbled together from two sources:
http://stackoverflow.com/questions/9004135/merge-multiple-xml-files-from-command-line/11315257#11315257
http://bryson3gps.wordpress.com/tag/elementtree/

Here's the code and data:

#!/usr/bin/env python3
import os, glob
from xml.etree import ElementTree as ET
xmls = glob.glob('input*.xml')
ilf = os.path.join(os.path.expanduser('~'),'include_list.txt')
xo = os.path.join(os.path.expanduser('~'),'mergedSortedOutput.xml')
il = [x.strip() for x in open(ilf)]
xmlet = None
for xml in xmls:
    d = ET.parse(xml).getroot()
    for rec in d.iter('inv'):
        if xmlet is None:
            xmlet = d
        else:
            xmlet.extend(rec)
for rec in xmlet:
    if rec.find('part').find('pid').text not in il:
        xmlet.remove(rec)
ET.ElementTree(xmlet).write(xo)
quit()

include_list.txt
12345
34567
56789
67890

input1.xml
<inv>
  <rec>
    <part>
      <pid>67890</pid>
      <tid>67890t</tid>
    </part>
    <detail>
      <did>67890d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>78901</pid>
      <tid>78901t</tid>
    </part>
    <detail>
      <did>78901d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>89012</pid>
      <tid>89012t</tid>
    </part>
    <detail>
      <did>89012d</did>
    </detail>
  </rec>
</inv>

input2.xml
<inv>
  <rec>
    <part>
      <pid>45678</pid>
      <tid>45678t</tid>
    </part>
    <detail>
      <did>45678d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>56789</pid>
      <tid>56789t</tid>
    </part>
    <detail>
      <did>56789d</did>
    </detail>
  </rec>
</inv>

input3.xml
<inv>
  <rec>
    <part>
      <pid>12345</pid>
      <tid>12345t</tid>
    </part>
    <detail>
      <did>12345d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>23456</pid>
      <tid>23456t</tid>
    </part>
    <detail>
      <did>23456d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>34567</pid>
      <tid>34567t</tid>
    </part>
    <detail>
      <did>34567d</did>
    </detail>
  </rec>
</inv>

mergedSortedOutput.xml:
<inv>
  <rec>
    <part>
      <pid>67890</pid>
      <tid>67890t</tid>
    </part>
    <detail>
      <did>67890d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>89012</pid>
      <tid>89012t</tid>
    </part>
    <detail>
      <did>89012d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>12345</pid>
      <tid>12345t</tid>
    </part>
    <detail>
      <did>12345d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>34567</pid>
      <tid>34567t</tid>
    </part>
    <detail>
      <did>34567d</did>
    </detail>
  </rec>
  <rec>
    <part>
      <pid>56789</pid>
      <tid>56789t</tid>
    </part>
    <detail>
      <did>56789d</did>
    </detail>
  </rec>
</inv>
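
On the question about the second for loop: a minimal sketch, assuming the extra node comes from removing children of an Element while iterating over that same Element. As with a plain list, each removal shifts the remaining children left, so the child right after a removed one is never examined. The tree here is a hypothetical in-memory stand-in for the merged data (same pid values, in the order the output suggests the files were read), and the names build_tree, kept_pids, filter_in_place and filter_over_copy are illustrative, not from the script above.

```python
from xml.etree import ElementTree as ET

# Hypothetical stand-in for the merged tree: pids in the order
# input1.xml, input3.xml, input2.xml (an assumption based on the output).
PIDS = ['67890', '78901', '89012', '12345', '23456', '34567', '45678', '56789']
INCLUDE = {'12345', '34567', '56789', '67890'}


def build_tree(pids):
    """Build an <inv> root with one <rec><part><pid> child per pid."""
    root = ET.Element('inv')
    for pid in pids:
        rec = ET.SubElement(root, 'rec')
        part = ET.SubElement(rec, 'part')
        ET.SubElement(part, 'pid').text = pid
    return root


def kept_pids(root):
    """Return the pid text of every remaining <rec>, in document order."""
    return [rec.findtext('part/pid') for rec in root]


def filter_in_place(root, include):
    # Same pattern as the second loop in the post: the tree is modified
    # while it is being iterated, so some children are skipped.
    for rec in root:
        if rec.findtext('part/pid') not in include:
            root.remove(rec)


def filter_over_copy(root, include):
    # list(root) takes a snapshot of the children, so removing from the
    # live tree no longer disturbs the iteration.
    for rec in list(root):
        if rec.findtext('part/pid') not in include:
            root.remove(rec)


buggy = build_tree(PIDS)
filter_in_place(buggy, INCLUDE)
print(kept_pids(buggy))

fixed = build_tree(PIDS)
filter_over_copy(fixed, INCLUDE)
print(kept_pids(fixed))
```

If that assumption holds, changing the second loop to iterate over list(xmlet) (or xmlet.findall('rec')) would be the smallest change to the original script. Using a set for the include list is optional; it only makes the membership test cheaper.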