Re: XML Parsing
Sambit Samal wrote:

> Hi ,
>
> Need help in Python Script using xml.etree.ElementTree to update the
> value of any element in below XML ( e.g SETNPI to be 5 ) based on some
> constraint ( e.g ) .

Something along the lines

    from xml.etree import ElementTree as ET
    tree = ET.parse("original.xml")
    for e in tree.findall(".//ruleset[@id='2']//SETNPI"):
        e.text = "5"
    tree.write("modified.xml")

might work for you.

> [XML sample; its markup was lost in the archive and only the element
> text survives: DRATRN, 1, 1, 0, ORIG, CFORIG, TERM, 1, 1, CONTINUE,
> 2, 1, 0, TERM, 1, 1, CONTINUE]

-- https://mail.python.org/mailman/listinfo/python-list
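The suggestion above can be tried without any files on disk. This is only a minimal sketch: the `rules`/`ruleset`/`action` layout in `SAMPLE` is a hypothetical stand-in, since the poster's real document structure did not survive in the archive. Only the `ruleset` id and the `SETNPI` element name come from the thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in document; the real layout is unknown.
SAMPLE = """\
<rules>
  <ruleset id="1"><action><SETNPI>1</SETNPI></action></ruleset>
  <ruleset id="2"><action><SETNPI>1</SETNPI></action></ruleset>
</rules>"""

root = ET.fromstring(SAMPLE)

# Select every SETNPI element anywhere below the ruleset whose id is "2".
for elem in root.findall(".//ruleset[@id='2']//SETNPI"):
    elem.text = "5"

# Serialize the modified tree back to a string.
updated = ET.tostring(root, encoding="unicode")
print(updated)
```

The constraint lives entirely in the path expression, so a different constraint (another attribute, another id) should only need a different predicate.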
Re: xml parsing with lxml
On Friday, October 7, 2016 at 3:21:43 PM UTC-5, John Gordon wrote:

> root = doc.getroot()
> for child in root:
>     print(child.tag)

Excellent! Thank you, sir! That'll get me started. Appreciate the reply.

Doug O'Leary
Re: xml parsing with lxml
In <622ea3b0-88b4-420b-89e0-9e7c6e866...@googlegroups.com> Doug O'Leary writes:

> >>> from lxml import etree
> >>> doc = etree.parse('config.xml')

> Now what? For instance, how do I list the top level children of
> ?

    root = doc.getroot()
    for child in root:
        print(child.tag)

--
John Gordon                   A is for Amy, who fell down the stairs
gor...@panix.com              B is for Basil, assaulted by bears
                                -- Edward Gorey, "The Gashlycrumb Tinies"
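For anyone who wants to try that loop without a config.xml at hand, here is a small sketch. The `CONFIG` document and its element names are made up for illustration, and the stdlib fallback import is an assumption added so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree  # third-party parser used in the thread
except ImportError:
    # Stdlib fallback; it offers the same calls used below.
    import xml.etree.ElementTree as etree

# Hypothetical document standing in for config.xml.
CONFIG = b"<config><server/><clients/><logging/></config>"

root = etree.fromstring(CONFIG)

# Iterating over an element yields its direct children in document order.
top_level_tags = [child.tag for child in root]
print(top_level_tags)
```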
Re: XML Parsing
Sepideh Ghanavati <sepideh...@gmail.com> writes:

> I know basic of python and I have an xml file created from csv

What XML schema defines the document's format? Without knowing the
schema, parsing will be unreliable.

What created the document? Why is it relevant that the document was
“created from CSV”?

> which has three attributes category, definition and definition
> description.

What do you mean by “attributes”? In Python, an attribute has a specific
meaning. In XML, an attribute has a rather different meaning. Neither of
those meanings seems to apply to “the XML document has three
attributes”.

XML documents don't have attributes; different XML elements in a
document have different attributes.

> I want to parse through xml file and identify actors, constraints,
> principal from the text.

How are those defined in the document's schema?

> However, I am not sure what is the best way to go. Any suggestion?

You should:

* Learn some more about XML
  <URL:http://www.xmlobjective.com/the-basic-principles-of-xml/>.

* Learn exactly what formal document schema defines the document
  <URL:https://en.wikipedia.org/wiki/XML_schema>.

If the document isn't accompanied by a specification of exactly what its
schema is, you're going to have a difficult time.

--
“If I melt dry ice, can I swim without getting wet?” (Steven Wright)
Ben Finney
Re: XML Parsing
Sepideh Ghanavati wrote on 06.04.2015 at 04:26:

> I know basic of python and I have an xml file created from csv which
> has three attributes category, definition and definition description.
> I want to parse through xml file and identify actors, constraints,
> principal from the text. However, I am not sure what is the best way
> to go. Any suggestion?

If it's really generated from a CSV file, you could also parse that
instead:

https://docs.python.org/3/library/csv.html

Admittedly, CSV files are simple, but they also have major problems,
especially when it comes to detecting their character encoding and their
specific format (tab/comma/semicolon/space/whatever separated, with or
without escaping, quoted values, ...). Meaning, you can easily end up
reading nonsense from the file instead of the content that was
originally put into it.

So, if you want to parse from XML instead, use ElementTree:

https://docs.python.org/3/library/xml.etree.elementtree.html

Stefan
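As a rough illustration of the CSV route, the sketch below reads rows with the stdlib csv module. The column names follow the poster's description (category, definition, description), but the sample row, the comma delimiter, and the quoting style are all assumptions; a real file may differ on every one of those points, which is the risk described above.

```python
import csv
import io

# Hypothetical CSV content; note the quoted field containing a comma.
RAW = (
    "category,definition,description\n"
    'actor,User,"A person, or another system"\n'
)

# The delimiter is stated explicitly rather than guessed.
reader = csv.DictReader(io.StringIO(RAW), delimiter=",")
rows = list(reader)

for row in rows:
    print(row["category"], "|", row["definition"], "|", row["description"])
```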
Re: XML parsing ExpatError with xml.dom.minidom at line 1, column 0
ming wrote:

> Hi,
> i've a Python script which stopped working about a month ago. But until
> then, it worked flawlessly for months (if not years). A tiny
> self-contained 7-line script is provided below.
>
> i ran into an XML parsing problem with xml.dom.minidom and the error
> message is included below.
>
> The weird thing is if i used an XML validator on the web to validate
> against this particular URL directly, it is all good. Moreover, i saved
> the page source in Firefox or Chrome then validated against the saved
> XML file, it's also all good.
>
> Since the error happened at the very beginning of the input (line 1,
> column 0) as indicated below, i was wondering if this is an encoding
> mismatch. However, according to the saved page source in FireFox or
> Chrome, there is the following at the beginning:
>     <?xml version="1.0" encoding="UTF-8"?>
>
> = program =
> #!/usr/bin/env python
>
> import urllib2
> from xml.dom.minidom import parseString
>
> fd = urllib2.urlopen('http://api.worldbank.org/countries')
> data = fd.read()
> fd.close()
> dom = parseString(data)
>
> = error msg =
> Traceback (most recent call last):
>   File "./bugReport.py", line 9, in <module>
>     dom = parseString(data)
>   File "/usr/lib/python2.7/xml/dom/minidom.py", line 1931, in parseString
>     return expatbuilder.parseString(string)
>   File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in parseString
>     return builder.parseString(string)
>   File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in parseString
>     parser.Parse(string, True)
> xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 0
> =
>
> i'm running Python 2.7.5+ on Ubuntu 13.10.
>
> Thanks.

Looking into the data returned from the server:

    >>> data = urllib2.urlopen("http://api.worldbank.org/countries").read()
    >>> with open("tmp.dat", "w") as f: f.write(data)
    ...
    [1]+  Stopped                 python
    $ file tmp.dat
    tmp.dat: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)

OK, let's expand:

    $ fg
    python
    >>> import gzip, StringIO
    >>> expanded_data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
    >>> import xml.dom.minidom
    >>> xml.dom.minidom.parseString(expanded_data)
    <xml.dom.minidom.Document instance at 0x19a1320>

There may be a way to uncompress the gzipped data transparently, but I'm
too lazy to look it up...
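The same diagnosis can be folded into the script itself. The sketch below is a Python 3 rendering of the idea, not the original poster's code: `parse_maybe_gzipped` is a hypothetical helper name, and the payload is generated locally so the example does not depend on the server. It checks for the two-byte gzip magic number before handing the bytes to minidom.

```python
import gzip
import xml.dom.minidom


def parse_maybe_gzipped(data: bytes):
    """Parse XML bytes, decompressing first if they start with the gzip magic number."""
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)
    return xml.dom.minidom.parseString(data)


# Simulated response body: a gzip-compressed XML document.
payload = gzip.compress(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<countries><country id="ABW"/></countries>'
)

dom = parse_maybe_gzipped(payload)
print(dom.documentElement.tagName)
```

Feeding the compressed bytes straight to the parser is what produces an error at line 1, column 0, because the very first byte is not valid XML.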
Re: XML parsing ExpatError with xml.dom.minidom at line 1, column 0
On 2014-02-13 20:10, Peter Otten wrote:

> ming wrote:
>> [...]
>
> Looking into the data returned from the server:
> [...]
> tmp.dat: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)
>
> [...]
>
> There may be a way to uncompress the gzipped data transparently, but
> I'm too lazy to look it up...

From a brief look at the docs, it looks like you can specify the format.
For example, for JSON:

    fd = urlopen('http://api.worldbank.org/countries?format=json')
Re: XML parsing: SAX/expat yield
kj wrote:

> I want to write code that parses a file that is far bigger than
> the amount of memory I can count on. Therefore, I want to stay as
> far away as possible from anything that produces a memory-resident
> DOM tree.
>
> The top-level structure of this xml is very simple: it's just a
> very long list of records. All the complexity of the data is at
> the level of the individual records, but these records are tiny in
> size (relative to the size of the entire file).
>
> So the ideal would be a parser-iterator, which parses just enough
> of the file to yield (in the generator sense) the next record,
> thereby returning control to the caller; the caller can process
> the record, delete it from memory, and return control to the
> parser-iterator; once parser-iterator regains control, it repeats
> this sequence starting where it left off.

How about

http://effbot.org/zone/element-iterparse.htm#incremental-parsing

Peter
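The parser-iterator described in the question maps fairly directly onto a generator around iterparse(). The following is a sketch under two assumptions that are not from the thread: the records are direct children of the root element, and the names `iter_records`, `records`, `record`, and `name` are invented for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield each completed <tag> element, then release it from the root."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Runs when the caller asks for the next record: drop the
            # children already handed out so the tree does not grow.
            root.clear()


# Hypothetical input standing in for the very large file.
data = io.BytesIO(
    b"<records>"
    b"<record><name>a</name></record>"
    b"<record><name>b</name></record>"
    b"</records>"
)

names = [rec.findtext("name") for rec in iter_records(data, "record")]
print(names)
```

Only one record is attached to the root at a time, so memory use should stay roughly proportional to the size of a single record rather than the whole file.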
Re: XML parsing: SAX/expat yield
In <i3c7lc$e6v$0...@news.t-online.com> Peter Otten <__pete...@web.de> writes:

> How about
> http://effbot.org/zone/element-iterparse.htm#incremental-parsing

Exactly! Thanks!

~K
Re: XML parsing with python
inder wrote:

> On Aug 17, 8:31 pm, John Posner <jjpos...@optimum.net> wrote:
>>> Use the iterparse() function of the xml.etree.ElementTree package.
>>>
>>> http://effbot.org/zone/element-iterparse.htm
>>> http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
>>>
>>> Stefan
>>
>> iterparse() is too big a hammer for this purpose, IMO. How about this:
>>
>>     from xml.etree.ElementTree import ElementTree
>>     tree = ElementTree(None, "myfile.xml")
>>     for elem in tree.findall('//book/title'):
>>         print elem.text
>>
>> -John
>
> Thanks for the prompt reply. I feel let me try using iterparse. Will
> it be slower compared to SAX parsing ... ultimately I will have a huge
> xml file to parse ?

If you use the cElementTree module, it may even be faster.

> Another question, I will also need to validate my xml against xsd. I
> would like to do this validation through the parsing tool itself.

In that case, you can use lxml instead of ElementTree.

http://codespeak.net/lxml/

Stefan
Re: XML parsing with python
John Posner wrote:

>> Use the iterparse() function of the xml.etree.ElementTree package.
>
> iterparse() is too big a hammer for this purpose, IMO. How about this:
>
>     from xml.etree.ElementTree import ElementTree
>     tree = ElementTree(None, "myfile.xml")
>     for elem in tree.findall('//book/title'):
>         print elem.text

Is that really so much better than an iterparse() version?

    from xml.etree import ElementTree

    # iterparse() is a function of the ElementTree module, not a method
    # of the ElementTree class.
    for _, elem in ElementTree.iterparse("myfile.xml"):
        if elem.tag == 'book':
            print elem.findtext('title')
            elem.clear()

Stefan
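To compare the two approaches side by side on current Python, here is a small sketch. The `LIBRARY` sample is a cut-down, partly made-up version of the document from the thread (the second ISBN is invented), and it is parsed from memory rather than from myfile.xml.

```python
import io
import xml.etree.ElementTree as ET

# Reduced, partly hypothetical sample of the library document.
LIBRARY = b"""<library>
  <category code="SciFi">
    <book id="ISBN001"><title>I,Robot</title></book>
    <book id="ISBN002"><title>Blade Runner</title></book>
  </category>
</library>"""

# Whole-tree version: build everything, then query with a path.
tree = ET.parse(io.BytesIO(LIBRARY))
titles_findall = [t.text for t in tree.findall(".//book/title")]

# Streaming version: handle each book as soon as its end tag is seen.
titles_iterparse = []
for _, elem in ET.iterparse(io.BytesIO(LIBRARY)):
    if elem.tag == "book":
        titles_iterparse.append(elem.findtext("title"))
        elem.clear()  # free the book's children once processed

print(titles_findall)
print(titles_iterparse)
```

Both loops are about the same length; the difference is that the streaming version never needs every book in memory at once.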
Re: XML parsing with python
On Aug 18, 11:24 am, Stefan Behnel <stefan...@behnel.de> wrote:

> inder wrote:
>> [...]
>> Thanks for the prompt reply. I feel let me try using iterparse. Will
>> it be slower compared to SAX parsing ... ultimately I will have a
>> huge xml file to parse ?
>
> If you use the cElementTree module, it may even be faster.
>
>> Another question, I will also need to validate my xml against xsd. I
>> would like to do this validation through the parsing tool itself.
>
> In that case, you can use lxml instead of ElementTree.
>
> http://codespeak.net/lxml/
>
> Stefan

Hi,

Is lxml part of standard python package? I am having python 2.5. I
might not be able to use any additional package other than the standard
python. Could you please suggest something part of standard python
package?

Thanks
Re: XML parsing with python
inder wrote:

> Is lxml part of standard python package ? I am having python 2.5 .

No, that's why I suggested ElementTree first.

> I might not be able to use any additional package other than the
> standard python . Could you please suggest something part of standard
> python package ?

No, there isn't any XMLSchema support in the stdlib. However, you may
still be able to use lxml locally for development and with validation
enabled, and switch to non-validating ElementTree on
distribution/pre-prod-testing/whatever. Just use a conditional import
and write a bit of setup code.

Stefan
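The "conditional import plus a bit of setup code" could look roughly like the sketch below. It is one possible reading of the advice, not Stefan's code: `HAVE_LXML` and `parse_document` are hypothetical names, and the tiny schema only declares a `library` root element. Validation runs only when lxml is importable; otherwise the document is parsed without it.

```python
try:
    from lxml import etree  # validating parser, if installed
    HAVE_LXML = True
except ImportError:
    import xml.etree.ElementTree as etree  # non-validating stdlib fallback
    HAVE_LXML = False


def parse_document(xml_bytes, xsd_bytes=None):
    """Parse xml_bytes; validate against xsd_bytes only when lxml is available."""
    root = etree.fromstring(xml_bytes)
    if HAVE_LXML and xsd_bytes is not None:
        schema = etree.XMLSchema(etree.fromstring(xsd_bytes))
        schema.assertValid(root)  # raises etree.DocumentInvalid on failure
    return root


# Minimal hypothetical schema: a root element named "library" of any content.
XSD = (
    b'<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    b'<xs:element name="library"/>'
    b"</xs:schema>"
)

root = parse_document(b"<library><book/></library>", XSD)
print(root.tag, "validated" if HAVE_LXML else "not validated")
```

The rest of the program only touches the shared ElementTree-style API, so it should not need to know which parser was picked.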
Re: XML parsing with python
inder wrote:

> I am new to xml. I need to parse the xml file. After reading and
> browsing on the web, I could get much help. I guess SAX would be
> better suited for my requirement.

That's a common misconception.

> Could someone just provide me a sample python code so that I can
> execute it and see how the parsing actually happens.
>
> Lets say my xml file -
>
> <?xml version="1.0"?>
> <library>
>   <category code="SciFi" room="1">
>     <!--if you want to test invalid document against schema you can
>         just cut the mandatory id attribute -->
>     <book id="ISBN001">
>       <title>I,Robot</title>
>       <pages>100</pages>
>       <author>Isaac Asimov</author>
>     </book>
>     <book id="ISBN001" damaged="true">
>       <title>Blade Runner</title>
>       <pages>400</pages>
>       <author>Philip K. Dick</author>
>     </book>
>   </category>
>   <category code="Boring" room="2">
>     <book id="ISBN003">
>       <title>Lord Of The Rings</title>
>       <pages>2</pages>
>       <author>Tolkien</author>
>     </book>
>     <book id="ISBN004" damaged="true">
>       <title>XML-Schema Specification</title>
>       <pages>5000</pages>
>       <author>W3C</author>
>     </book>
>   </category>
>   <category code="Fantasy">
>     <book id="ISBN005" damaged="true">
>       <title>Aladin</title>
>       <pages>150</pages>
>       <author>Don't know</author>
>     </book>
>   </category>
> </library>
>
> --
> I need the output to be - (elements containing 'title')
>
> I,Robot
> Blade Runner
> Lord Of The Rings
> XML-Schema Specification
> Aladin

Use the iterparse() function of the xml.etree.ElementTree package.

http://effbot.org/zone/element-iterparse.htm
http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk

Stefan
Re: XML parsing with python
> Use the iterparse() function of the xml.etree.ElementTree package.
>
> http://effbot.org/zone/element-iterparse.htm
> http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
>
> Stefan

iterparse() is too big a hammer for this purpose, IMO. How about this:

    from xml.etree.ElementTree import ElementTree
    tree = ElementTree(None, "myfile.xml")
    for elem in tree.findall('//book/title'):
        print elem.text

-John
Re: XML parsing with python
On Aug 17, 8:31 pm, John Posner <jjpos...@optimum.net> wrote:

>> Use the iterparse() function of the xml.etree.ElementTree package.
>>
>> http://effbot.org/zone/element-iterparse.htm
>> http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
>>
>> Stefan
>
> iterparse() is too big a hammer for this purpose, IMO. How about this:
>
>     from xml.etree.ElementTree import ElementTree
>     tree = ElementTree(None, "myfile.xml")
>     for elem in tree.findall('//book/title'):
>         print elem.text
>
> -John

Thanks for the prompt reply.

I feel let me try using iterparse. Will it be slower compared to SAX
parsing ... ultimately I will have a huge xml file to parse ?

Another question, I will also need to validate my xml against xsd. I
would like to do this validation through the parsing tool itself.
Does there exist an Unix utility which validates or even a python
library call would be fine ?

Thanks in advance
RE: XML Parsing
You flatter me sir (or madam? can't tell from your name...), but I
wouldn't presume to so lofty a title among this crowd. I'd save that for
the likes of Alan Gauld and Kent Johnson, who are much more prolific and
informative contributors to this list than I.

-- Paul

-----Original Message-----
From: hrishy [mailto:hris...@yahoo.co.uk]
Sent: Wednesday, February 25, 2009 11:36 PM
To: python-list@python.org; Paul McGuire
Subject: Re: XML Parsing

Ha the guru himself responding :-)

[...]
Re: XML Parsing
Hi Lie

I am not a python guy but very interested in the language, and I
consider the people on this list to be intelligent, so I was wondering
why you people did not suggest xpath for this kind of a problem. Just
curious and willing to learn.

I am searching for an answer, but the question is: why not use xpath to
extract xml text from a xml doc?

regards
Hrishy

--- On Wed, 25/2/09, Lie Ryan <lie.1...@gmail.com> wrote:

> From: Lie Ryan <lie.1...@gmail.com>
> Subject: Re: XML Parsing
> To: python-list@python.org
> Date: Wednesday, 25 February, 2009, 7:33 AM
>
> Are you searching for answer or searching for another people that
> have the same answer as you? :)
>
> Many roads lead to Rome is a very famous quotation...
Re: XML Parsing
Probably because you responded an hour after the question was posted,
and in the dead of night. Newsgroups often move slower than that. But
now we have posted a solution like that, so all's well in the world. :)

Cheers,
Cliff

On Wed, 2009-02-25 at 08:20 +, hrishy wrote:

> Hi Lie
>
> I am not a python guy but very interested in the langauge and i
> consider the people on this list to be intelligent and was wundering
> why you people did not suggest xpath for this kind of a problem just
> curious and willing to learn.
>
> I am searching for a answer but the question is why not use xpath to
> extract xml text from a xml doc ?
>
> regards
> Hrishy
>
> --- On Wed, 25/2/09, Lie Ryan <lie.1...@gmail.com> wrote:
>
>> Are you searching for answer or searching for another people that
>> have the same answer as you? :)
>>
>> Many roads lead to Rome is a very famous quotation...
Re: XML Parsing
On Feb 25, 1:17 am, hrishy <hris...@yahoo.co.uk> wrote:

> Hi
>
> Something like this
>
> <snip solution using ElementTree>
>
> Note i am not a python programmer just a enthusiast and i was curious
> why people on the list didnt suggest a code like above

You just beat the rest of us to it - good example of ElementTree for
parsing XML (and I learned the '//' shortcut for one or more intervening
tag levels).

To the OP: if you are parsing XML, I would look hard at the modules
(esp. ElementTree) that are written explicitly for XML, before
considering using regular expressions. There are just too many potential
surprises when trying to match XML tags - presence/absence/order of
attributes, namespaces, whitespace inside tags, to name a few.

-- Paul
Re: XML Parsing
Ha the guru himself responding :-)

--- On Wed, 25/2/09, Paul McGuire <pt...@austin.rr.com> wrote:

> From: Paul McGuire <pt...@austin.rr.com>
> Subject: Re: XML Parsing
> To: python-list@python.org
> Date: Wednesday, 25 February, 2009, 2:04 PM
>
> On Feb 25, 1:17 am, hrishy <hris...@yahoo.co.uk> wrote:
>> Hi
>>
>> Something like this
>>
>> <snip solution using ElementTree>
>>
>> Note i am not a python programmer just a enthusiast and i was
>> curious why people on the list didnt suggest a code like above
>
> You just beat the rest of us to it - good example of ElementTree for
> parsing XML (and I learned the '//' shortcut for one or more
> intervening tag levels).
>
> To the OP: if you are parsing XML, I would look hard at the modules
> (esp. ElementTree) that are written explicitly for XML, before
> considering using regular expressions. There are just too many
> potential surprises when trying to match XML tags -
> presence/absence/order of attributes, namespaces, whitespace inside
> tags, to name a few.
>
> -- Paul
Re: XML Parsing
Hi Cliff

Thanks. So using ElementTree is the right way to handle this problem.

regards
Hrishy

--- On Wed, 25/2/09, J. Clifford Dyer <j...@sdf.lonestar.org> wrote:

> From: J. Clifford Dyer <j...@sdf.lonestar.org>
> Subject: Re: XML Parsing
> To: hris...@yahoo.co.uk
> Cc: python-list@python.org, Lie Ryan <lie.1...@gmail.com>
> Date: Wednesday, 25 February, 2009, 12:37 PM
>
> Probably because you responded an hour after the question was posted,
> and in the dead of night. Newsgroups often move slower than that. But
> now we have posted a solution like that, so all's well in the
> world. :)
>
> Cheers,
> Cliff
>
> [...]
Re: XML Parsing
On Feb 25, 2:50 pm, Girish <girish@gmail.com> wrote:

> Can anyone please tell me how to get content of <Signal> tag..
> that is, how to extract the data <![CDATA[Parameter Identifiers
> Supported - $01 to $20]]>

Was there something in particular about Jean-Paul Calderone's solution
that didn't satisfy you?

http://tinyurl.com/azgo5j
Re: XML Parsing
On Tue, 24 Feb 2009 20:50:20 -0800, Girish wrote:

> Hello,
>
> I have a xml file which is as follows:
>
> <pids>
>   <Parameter_Class>
>     <Parameter Id="pid_031605_093137_283">
>       <Identifier>$</Identifier>
>       <Type>PID</Type>
>       <Signal><![CDATA[Parameter Identifiers Supported - $01 to $20]]></Signal>
>       <Description><![CDATA[This PID indicates which legislated PIDs]]></Description>
> ..
> ...
>
> Can anyone please tell me how to get content of <Signal> tag.. that
> is, how to extract the data <![CDATA[Parameter Identifiers Supported
> - $01 to $20]]>
>
> Thanks,
> Girish...

The easy one is to use re module (Regular expression).

    # untested
    import re
    signal_pattern = re.compile('<Signal>(.*)</Signal>')
    signals = signal_pattern.findall(xmlstring)

also, you may also use the xml module, which will be more reliable if
you have data like this: <foo attr="<Signal>blooo</Signal>">blah</foo>,

    import xml.dom.minidom
    xmldata = xml.dom.minidom.parse(open('myfile.xml'))
    for node in xmldata.getElementsByTagName('Signal'):
        print node.toxml()
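The reliability difference between the two approaches is easy to see on a document that contains a decoy. In this sketch the XML is a completed version of the fragment from the question; the closing tags and the comment holding an old `<Signal>` value are assumptions added for the demonstration, and `signal_texts` is a hypothetical helper name.

```python
import re
import xml.dom.minidom
from xml.dom import Node

# Completed sample; the comment is a made-up decoy for the regex.
XML = r"""<pids><Parameter_Class>
<Parameter Id="pid_031605_093137_283">
<!-- <Signal>old value</Signal> -->
<Signal><![CDATA[Parameter Identifiers Supported - $01 to $20]]></Signal>
</Parameter></Parameter_Class></pids>"""

# Regex approach: matches raw text, so it also picks up the commented-out tag
# and returns the CDATA wrapper verbatim.
regex_hits = re.findall(r"<Signal>(.*?)</Signal>", XML, re.DOTALL)


def signal_texts(xml_text):
    """Return the character data of every real <Signal> element."""
    dom = xml.dom.minidom.parseString(xml_text)
    texts = []
    for node in dom.getElementsByTagName("Signal"):
        # Join plain text and CDATA children, skipping anything else.
        texts.append("".join(
            child.data for child in node.childNodes
            if child.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE)
        ))
    return texts


print(regex_hits)
print(signal_texts(XML))
```

Reading `child.data` instead of calling `toxml()` returns the text itself rather than the serialized element with its CDATA markers.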
Re: XML Parsing
Hi

I am just a python enthusiast and not a python user, but I was wondering
why the list members didn't come up with or recommend an XPath-based
solution, which I think is very elegant for this type of problem, isn't
it?

regards
Hrishy

--- On Wed, 25/2/09, Lie Ryan <lie.1...@gmail.com> wrote:

> From: Lie Ryan <lie.1...@gmail.com>
> Subject: Re: XML Parsing
> To: python-list@python.org
> Date: Wednesday, 25 February, 2009, 5:43 AM
>
> On Tue, 24 Feb 2009 20:50:20 -0800, Girish wrote:
>> [...]
>> Can anyone please tell me how to get content of <Signal> tag..
>> [...]
>
> The easy one is to use re module (Regular expression).
> [...]
> also, you may also use the xml module, which will be more reliable
> [...]
Re: XML Parsing
On Wed, 2009-02-25 at 06:09 +, hrishy wrote:

> Hi
>
> I am just a python enthusiast and not a python user but was just
> wundering why didnt the list members come up with or recommen XPATH
> based solution which i think is very elegant for this type of a
> problem isnt it ?

Did you mean XQuery? Depending on the need, XQuery might be an
overkill. And don't forget that XQuery is still an obscure, unknown
language for most people (the de facto standard for querying is still
SQL).
Re: XML Parsing
Hi

Something like this

<pids>
  <Parameter_Class>
    <Parameter Id="pid_031605_093137_283">
      <Identifier>$</Identifier>
      <Type>PID</Type>
      <Signal><![CDATA[Parameter Identifiers Supported - $01 to $20]]></Signal>
      <Description><![CDATA[This PID indicates which legislated PIDs]]>
      </Description>

    from elementtree.ElementTree import ElementTree
    doc = ElementTree(file='tst.xml')
    # Tag names are case-sensitive, and the text of a CDATA section is
    # available as the element's .text.
    for e in doc.findall('.//Signal'):
        print e.text

Note i am not a python programmer just a enthusiast and i was curious
why people on the list didnt suggest a code like above

willing to hear and learn from experienced python gurus

regards
Hrishy
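For reference, a Python 3 sketch of the same idea with the stdlib ElementTree is given below. The document is the fragment from the question with closing tags added as an assumption, parsed from a string rather than from tst.xml. ElementTree supports only a subset of XPath, but that subset covers descendant searches and attribute predicates like the ones used here.

```python
import xml.etree.ElementTree as ET

# Fragment from the question, completed with assumed closing tags.
XML = r"""<pids>
  <Parameter_Class>
    <Parameter Id="pid_031605_093137_283">
      <Identifier>$</Identifier>
      <Type>PID</Type>
      <Signal><![CDATA[Parameter Identifiers Supported - $01 to $20]]></Signal>
      <Description><![CDATA[This PID indicates which legislated PIDs]]></Description>
    </Parameter>
  </Parameter_Class>
</pids>"""

root = ET.fromstring(XML)

# ".//Signal" finds Signal elements at any depth below the root.
signals = [s.text for s in root.findall(".//Signal")]

# An attribute predicate narrows the search to one Parameter.
one_signal = root.findtext(".//Parameter[@Id='pid_031605_093137_283']/Signal")

print(signals)
print(one_signal)
```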
Re: XML Parsing
Are you searching for an answer, or searching for other people who have
the same answer as you? :)

"Many roads lead to Rome" is a very famous quotation...
Re: XML Parsing
On Tue, Apr 1, 2008 at 10:42 PM, Alok Kothari [EMAIL PROTECTED] wrote:

> Hello,
> I am new to XML parsing. Could you kindly tell me whats the problem
> with the following code:
>
> import xml.dom.minidom
> import xml.parsers.expat
> document = """<token pos="nn">Letterman</token><token pos="bez">is</
> token><token pos="jjr">better</token><token pos="cs">than</
> token><token pos="np">Jay</token><token pos="np">Leno</token>"""

This document is not well-formed. It doesn't have root element.

> ...
>
> Traceback (most recent call last):
>   File "C:/Python25/Programs/eg.py", line 20, in <module>
>     p.Parse(document, 1)
> ExpatError: junk after document element: line 1, column 33

Told ya :)

Try wrapping your document in root element, like
"<tokens><token>...</token><token>...</token></tokens>"

--
kv
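The effect of the missing root element can be reproduced in a few lines. This sketch uses the stdlib ElementTree on Python 3 instead of the raw expat handlers from the question, and the fragment is shortened to three tokens. The `tokens` wrapper name follows the suggestion above.

```python
import xml.etree.ElementTree as ET

fragment = (
    '<token pos="nn">Letterman</token>'
    '<token pos="bez">is</token>'
    '<token pos="jjr">better</token>'
)

# Several top-level elements: the parser stops after the first one closes.
try:
    ET.fromstring(fragment)
    parsed_bare = True
except ET.ParseError as exc:
    parsed_bare = False
    print("bare fragment rejected:", exc)

# Wrapped in a single root element, the same content parses fine.
root = ET.fromstring("<tokens>" + fragment + "</tokens>")
pairs = [(t.get("pos"), t.text) for t in root.findall("token")]
print(pairs)
```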
Re: XML Parsing
Thanks! It worked!

On Wed, Apr 2, 2008 at 1:31 AM, Konstantin Veretennicov [EMAIL PROTECTED] wrote:

> [...]
> This document is not well-formed. It doesn't have root element.
> [...]
> Try wrapping your document in root element, like
> "<tokens><token>...</token><token>...</token></tokens>"
>
> --
> kv
Re: XML Parsing
On Apr 1, 12:42 pm, Alok Kothari [EMAIL PROTECTED] wrote:

> Hello,
> I am new to XML parsing. Could you kindly tell me whats the problem
> with the following code:
>
> import xml.dom.minidom
> import xml.parsers.expat
> document = """<token pos="nn">Letterman</token><token pos="bez">is</
> token><token pos="jjr">better</token><token pos="cs">than</
> token><token pos="np">Jay</token><token pos="np">Leno</token>"""
>
> # 3 handler functions
> def start_element(name, attrs):
>     print 'Start element:', name, attrs
> def end_element(name):
>     print 'End element:', name
> def char_data(data):
>     print 'Character data:', repr(data)
>
> p = xml.parsers.expat.ParserCreate()
>
> p.StartElementHandler = start_element
> p.EndElementHandler = end_element
> p.CharacterDataHandler = char_data
> p.Parse(document, 1)
>
> OUTPUT:
>
> Start element: token {u'pos': u'nn'}
> Character data: u'Letterman'
> End element: token
>
> Traceback (most recent call last):
>   File "C:/Python25/Programs/eg.py", line 20, in <module>
>     p.Parse(document, 1)
> ExpatError: junk after document element: line 1, column 33

Your XML is wrong. Don't put line breaks between </ and token>.
Re: XML Parsing
On Apr 1, 1:42 pm, Alok Kothari [EMAIL PROTECTED] wrote:

> Hello,
> I am new to XML parsing. Could you kindly tell me whats the problem
> with the following code:
>
> import xml.dom.minidom
> import xml.parsers.expat
> [...]
> ExpatError: junk after document element: line 1, column 33

I don't know if you are aware of the BeautifulSoup module:

    import BeautifulSoup as bs

    xml = """<token pos="nn">Letterman</token><token pos="bez">is</
    token><token pos="jjr">better</token><token pos="cs">than</
    token><token pos="np">Jay</token><token pos="np">Leno</token>"""

    doc = bs.BeautifulStoneSoup(xml)

    tokens = doc.findAll("token")
    for token in tokens:
        for attr in token.attrs:
            print "%s : %s" % attr
        print token.string

--output:--
pos : nn
Letterman
pos : bez
is
pos : jjr
better
pos : cs
than
pos : np
Jay
pos : np
Leno
Re: XML Parsing
On Tue, 01 Apr 2008 20:44:41 -0300, 7stud [EMAIL PROTECTED] wrote:

>> I am new to XML parsing. Could you kindly tell me whats the problem
>> with the following code:
>>
>> import xml.dom.minidom
>> import xml.parsers.expat
>
> I don't know if you are aware of the BeautifulSoup module:

Or ElementTree:

    import xml.etree.ElementTree as ET

    doctext = """<tokens><token pos="nn">Letterman</token><token
    pos="bez">is</token><token pos="jjr">better</token><token
    pos="cs">than</token><token pos="np">Jay</token><token
    pos="np">Leno</token></tokens>"""

    doc = ET.fromstring(doctext)
    for token in doc.findall("token"):
        print 'pos:', token.get('pos')
        print 'text:', token.text

--
Gabriel Genellina
Re: XML Parsing
On 28 Mar 2007 00:38:38 -0700, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
> I want to parse this XML file:
>
> <?xml version="1.0" ?>
> <text>
>   <text:one>
>     <file>filename</file>
>     <contents>
>       Hello
>     </contents>
>   </text:one>
>   <text:two>
>     <file>filename2</file>
>     <contents>
>       Hello2
>     </contents>
>   </text:two>
> </text>
>
> This XML will be in a file called filecreate.xml
>
> As you might have guessed, I want to create files from this XML file
> contents, so how can I do this?
> What modules should I use? What options do I have? Where can I find
> tutorials? Will I be able to put this on the internet (on a
> googlepages server)?

http://effbot.org/zone/celementtree.htm

HTH,
--
Amit Khemka -- onyomo.com
Home Page: www.cse.iitd.ernet.in/~csd00377
Endless the world's turn, endless the sun's Spinning, Endless the quest;
I turn again, back to my own beginning, And here, find rest.
Re: XML Parsing
[EMAIL PROTECTED] wrote:
> I want to parse this XML file:
>
> <?xml version="1.0" ?>
> <text>
>   <text:one>
>     <file>filename</file>
>     <contents>
>       Hello
>     </contents>
>   </text:one>
>   <text:two>
>     <file>filename2</file>
>     <contents>
>       Hello2
>     </contents>
>   </text:two>
> </text>
>
> This XML will be in a file called filecreate.xml
>
> As you might have guessed, I want to create files from this XML file
> contents, so how can I do this?
> What modules should I use? What options do I have? Where can I find
> tutorials? Will I be able to put this on the internet (on a
> googlepages server)?
>
> Thanks in advance to everyone who helps me.
>
> And yes I have used Google but I am unsure what to use.

The above file is not valid XML. It is missing an xmlns:text namespace
declaration, so you won't be able to parse it regardless of what parser
you use.

Diez
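With a namespace declaration added, the original goal (one file per record) takes only a few lines of ElementTree. This is a Python 3 sketch and not something posted in the thread: the urn:example:text URI is a placeholder, and the create_files helper and the temporary output directory are assumptions made for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Assumed input: the thread's document plus an xmlns:text declaration
# (placeholder URI) so that namespace-aware parsers accept it.
DOC = """<?xml version="1.0"?>
<text xmlns:text="urn:example:text">
  <text:one><file>filename</file><contents>Hello</contents></text:one>
  <text:two><file>filename2</file><contents>Hello2</contents></text:two>
</text>"""


def create_files(xml_text, target_dir):
    """Write one file per child record and return the paths created."""
    created = []
    for record in ET.fromstring(xml_text):
        name = record.findtext("file", "").strip()
        body = record.findtext("contents", "").strip()
        if not name:
            continue  # skip records without a usable file name
        # basename() keeps names such as "../x" inside target_dir
        path = os.path.join(target_dir, os.path.basename(name))
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(body)
        created.append(path)
    return created


out_dir = tempfile.mkdtemp()
paths = create_files(DOC, out_dir)
print(sorted(os.path.basename(p) for p in paths))
```

The child elements file and contents carry no prefix, so they can be looked up by their plain names; only the record elements live in the declared namespace.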
Re: XML Parsing
On Mar 28, 10:51 am, Diez B. Roggisch [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> I want to parse this XML file:
>>
>> <?xml version="1.0" ?>
>> <text>
>>   <text:one>
>>     <file>filename</file>
>>     <contents>
>>       Hello
>>     </contents>
>>   </text:one>
>>   <text:two>
>>     <file>filename2</file>
>>     <contents>
>>       Hello2
>>     </contents>
>>   </text:two>
>> </text>
>>
>> This XML will be in a file called filecreate.xml
>>
>> As you might have guessed, I want to create files from this XML file
>> contents, so how can I do this?
>> What modules should I use? What options do I have? Where can I find
>> tutorials? Will I be able to put this on the internet (on a
>> googlepages server)?
>>
>> Thanks in advance to everyone who helps me.
>>
>> And yes I have used Google but I am unsure what to use.
>
> The above file is not valid XML. It misses a xmlns:text namespace
> declaration. So you won't be able to parse it regardless of what parser
> you use.
>
> Diez

The example is valid well-formed XML. It is permitted to use the :
character in element names. Whether one should in a non-namespace
context is a different matter.

Harvey
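Both posters have a point, and a quick experiment shows why. Plain XML 1.0 allows a colon in a name, so a parser with namespace processing switched off accepts the document; a namespace-aware parser rejects the undeclared prefix. The Python 3 sketch below is an illustration only: the two helper function names are invented, and the document is a shortened reconstruction of the one in the thread.

```python
import xml.etree.ElementTree as ET
import xml.parsers.expat

# Shortened reconstruction of the thread's document; the "text" prefix
# is deliberately left undeclared.
doc = (
    '<?xml version="1.0"?>'
    "<text><text:one><file>filename</file>"
    "<contents>Hello</contents></text:one></text>"
)


def accepted_by_plain_expat(xml_text):
    """Parse with namespace processing off, which is expat's default."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


def accepted_by_elementtree(xml_text):
    """ElementTree enables namespace processing, so prefixes must be bound."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print("plain expat:", accepted_by_plain_expat(doc))
print("ElementTree:", accepted_by_elementtree(doc))
```

So the file is well-formed in the XML 1.0 sense, yet most of the tools suggested in this thread will still refuse it until the prefix is declared or removed.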
Re: XML Parsing
[EMAIL PROTECTED] wrote:
> I want to parse this XML file:

<zip>

> As you might have guessed, I want to create files from this XML file
> contents, so how can I do this?
> What modules should I use? What options do I have? Where can I find
> tutorials? Will I be able to put this on the internet (on a
> googlepages server)?

See the urllib2 module and its missing guide.

> Thanks in advance to everyone who helps me.
>
> And yes I have used Google but I am unsure what to use.

About XML: to complement Amit's link to ElementTree, you may take a look at:
http://www.diveintopython.org/xml_processing/index.html (learn by example)

And look at:
http://pyxml.sourceforge.net/
http://www.rexx.com/~dkuhlman/pyxmlfaq.html
http://shellsage.com/?q=node/12
http://www.python.org/community/sigs/current/xml-sig/
http://docs.python.org/lib/markup.html
Re: XML Parsing
[EMAIL PROTECTED] wrote:
> I want to parse this XML file:
>
> <?xml version="1.0" ?>
> <text>
>   <text:one>
>     <file>filename</file>
>     <contents>
>       Hello
>     </contents>
>   </text:one>
>   <text:two>
>     <file>filename2</file>
>     <contents>
>       Hello2
>     </contents>
>   </text:two>
> </text>
>
> This XML will be in a file called filecreate.xml
>
> As you might have guessed, I want to create files from this XML file
> contents, so how can I do this?

Using a SAX parser might be the best solution here.
Re: XML Parsing
> The example is valid well-formed XML. It is permitted to use the :
> character in element names. Whether one should in a non-namespace
> context is a different matter.

It is? I was always under the impression one has to declare a namespace.
But this impression might come from using XSLT and W3C Schema, which
require these declarations.

Diez
Re: XML Parsing
[EMAIL PROTECTED] wrote:
> I want to parse this XML file:
>
> <?xml version="1.0" ?>
> <text>
>   <text:one>
>     <file>filename</file>
>     <contents>
>       Hello
>     </contents>
>   </text:one>
>   <text:two>
>     <file>filename2</file>
>     <contents>
>       Hello2
>     </contents>
>   </text:two>
> </text>
>
> This XML will be in a file called filecreate.xml
>
> As you might have guessed, I want to create files from this XML file
> contents, so how can I do this?
> What modules should I use? What options do I have? Where can I find
> tutorials? Will I be able to put this on the internet (on a
> googlepages server)?
>
> Thanks in advance to everyone who helps me.
>
> And yes I have used Google but I am unsure what to use.

Try this: http://www.python.org/doc/2.4.1/lib/expat-example.html

Christian
Re: XML Parsing
Hi,

I would suggest using the minidom XML parser from the xml module. Your
XML schema does not seem too complicated. You will find detailed
descriptions and working code in the book Dive Into Python. Google for
it :-))

Gabor Urban
NMC - ART
Re: XML parsing and writing
c00i90wn wrote:
> Stefan Behnel wrote:
>> c00i90wn wrote:
>>> Hey, I'm having a problem with the xml.dom.minidom package. I want to
>>> generate a simple XML file for storing configuration variables; for
>>> that purpose I've written the following code, but before pasting it
>>> I'll tell you what my problem is. On first write of the XML
>>> everything goes as it should, but on subsequent writes it starts to
>>> add more and more unneeded newlines to it, making it hard to read
>>> and ugly.
>>
>> Maybe you should try to get your code a little cleaner first, that
>> usually helps in finding these kinds of bugs. Try rewriting it with
>> ElementTree or lxml, that usually helps you in getting your work done.
>>
>> http://effbot.org/zone/element-index.htm
>> http://codespeak.net/lxml/
>
> Nice package ElementTree is, but sadly it doesn't have a pretty print.
> Well, guess I'll have to do it myself; if you have one already, can you
> please give it to me? Thanks :)

lxml's output functions all accept a pretty_print keyword argument.

Stefan
Re: XML parsing and writing
someone wrote: Nice package ElementTree is but sadly it doesn't have a pretty print, well, guess I'll have to do it myself, if you have one already can you please give it to me? thanks :) http://effbot.python-hosting.com/file/stuff/sandbox/elementlib/indent.py /F -- http://mail.python.org/mailman/listinfo/python-list
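For readers who cannot reach the linked indent.py, the usual approach is to rewrite the .text and .tail whitespace of each element before serializing. The helper below is a Python 3 sketch in that spirit and is not a copy of the linked file; the config/name/port element names are invented sample data. Python 3.9 and later also ship xml.etree.ElementTree.indent, which covers the same need.

```python
import xml.etree.ElementTree as ET


def indent(elem, level=0, step="  "):
    """Rewrite .text/.tail whitespace in place so the tree prints indented."""
    pad = "\n" + level * step
    children = list(elem)
    if not children:
        return
    # whitespace before the first child
    if not (elem.text and elem.text.strip()):
        elem.text = pad + step
    for index, child in enumerate(children):
        indent(child, level + 1, step)
        if not (child.tail and child.tail.strip()):
            is_last = index == len(children) - 1
            # the last child is followed by the parent's closing tag
            child.tail = pad if is_last else (pad + step)


# Hypothetical configuration document used only for the demonstration.
root = ET.fromstring("<config><name>John Smith</name><port>8080</port></config>")
indent(root)
print(ET.tostring(root, encoding="unicode"))
```

Because existing whitespace-only text is overwritten rather than extended, calling the helper again on an already indented tree leaves it unchanged.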
Re: XML parsing and writing
c00i90wn wrote:
> Nice package ElementTree is, but sadly it doesn't have a pretty print.
> Well, guess I'll have to do it myself; if you have one already, can you
> please give it to me? Thanks :)

FWIW Amara and plain old 4Suite both support pretty-print, canonical XML
print and more such options.

http://uche.ogbuji.net/tech/4suite/amara/
http://4Suite.org

--
Uche Ogbuji                      Fourthought, Inc.
http://uche.ogbuji.net           http://fourthought.com
http://copia.ogbuji.net          http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/
Re: XML parsing and writing
c00i90wn wrote:
> On first write of the XML everything goes as it should, but on
> subsequent writes it starts to add more and more unneeded newlines to
> it, making it hard to read and ugly.

Pretty-printing makes the output pretty by putting in newlines (and
spaces) that are not in the original data. That is, if you have the text
"John Smith" associated with the element "name", then pretty-printing
gives you something like

<name>
  John Smith
</name>

here with an extra two newlines and some whitespace indentation. (I
don't recall 100% when it puts in stuff, but the point of pretty-printing
is to put in extra stuff.) When the file is read back, that extra
whitespace is part of the data, so the next pretty-print adds more on
top of it.

You need to strip out the extra stuff (or print it out not pretty; can
you get a viewer that buffs up a not-buff file, so you are seeing pretty
but the data isn't actually pretty?).

Jim
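One way to do the stripping Jim describes with minidom is to drop whitespace-only text nodes after parsing and before calling toprettyxml. The Python 3 sketch below is an illustration, not the poster's code (which was never shown); strip_whitespace_nodes and pretty are made-up helper names and the config document is invented sample data.

```python
from xml.dom import minidom


def strip_whitespace_nodes(node):
    """Remove text nodes that contain only whitespace, recursively."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_whitespace_nodes(child)
    return node


def pretty(xml_text):
    """Parse, drop indentation left by earlier runs, then pretty-print."""
    doc = minidom.parseString(xml_text)
    strip_whitespace_nodes(doc)
    return doc.toprettyxml(indent="  ")


# Hypothetical configuration file content.
first = pretty("<config><name>John Smith</name><port>8080</port></config>")
second = pretty(first)  # simulate reading the file back and rewriting it
print(second)
```

This is only safe for data-style documents such as configuration files; in mixed content, whitespace between elements can be significant and should not be discarded.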
Re: XML parsing and writing
c00i90wn wrote: Hey, I'm having a problem with the xml.dom.minidom package, I want to generate a simple xml for storing configuration variables, for that purpose I've written the following code, but before pasting it I'll tell you what my problem is. On first write of the xml everything goes as it should but on subsequent writes it starts to add more and more unneeded newlines to it making it hard to read and ugly. Maybe you should try to get your code a little cleaner first, that usually helps in finding these kinds of bugs. Try rewriting it with ElementTree or lxml, that usually helps you in getting your work done. http://effbot.org/zone/element-index.htm http://codespeak.net/lxml/ Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: XML parsing and writing
Nice package ElementTree is but sadly it doesn't have a pretty print, well, guess I'll have to do it myself, if you have one already can you please give it to me? thanks :) Stefan Behnel wrote: c00i90wn wrote: Hey, I'm having a problem with the xml.dom.minidom package, I want to generate a simple xml for storing configuration variables, for that purpose I've written the following code, but before pasting it I'll tell you what my problem is. On first write of the xml everything goes as it should but on subsequent writes it starts to add more and more unneeded newlines to it making it hard to read and ugly. Maybe you should try to get your code a little cleaner first, that usually helps in finding these kinds of bugs. Try rewriting it with ElementTree or lxml, that usually helps you in getting your work done. http://effbot.org/zone/element-index.htm http://codespeak.net/lxml/ Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: XML parsing per record
Willem Ligtenberg wrote:
> Is there an easy way to couple data together? Because I have discovered
> an irritating feature in the XML file.
> Sometimes this is a database reference:
>
> <Dbtag>
>   <Dbtag_db>UCSC</Dbtag_db>
>   <Dbtag_tag>
>     <Object-id>
>       <Object-id_str>1234</Object-id_str>
>     </Object-id>
>   </Dbtag_tag>
> </Dbtag>
>
> And sometimes:
>
> <Dbtag>
>   <Dbtag_db>UCSC</Dbtag_db>
>   <Dbtag_tag>
>     <Object-id>
>       <Object-id_id>1234</Object-id_id>
>     </Object-id>
>   </Dbtag_tag>
> </Dbtag>
>
> So I get a list of database names and two (!) lists of IDs, and those
> two are in no way related. Is there an easy way to create a dictionary
> like this: DBname --> ID
> If not, I still might need to revert to SAX... :(

None of your requirements sound particularly difficult to implement. If
you would post a complete example of the data you want to parse and the
data you would like to end up with, it would be easier to help you. The
sample data you posted originally does not have many of the fields you
want to extract, and your example of what you want to end up with is not
too clear either.

If you are having trouble with ElementTree I expect you will be
completely lost with SAX; ElementTree is much easier to work with, and
cElementTree is very fast.

Kent
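The pairing asked about above stays simple if each Dbtag element is handled as a unit, so the database name and its ID are read from the same subtree instead of from two separate lists. This is a Python 3 sketch under assumptions: the refs wrapper element and the dbtag_pairs helper are invented for the example, and the sample values are the ones that appear elsewhere in the thread.

```python
import xml.etree.ElementTree as ET

# Sample data shaped like the thread's snippets; the <refs> wrapper is
# hypothetical and only gives the two Dbtag elements a common root.
SAMPLE = """<refs>
  <Dbtag><Dbtag_db>UCSC</Dbtag_db><Dbtag_tag><Object-id>
    <Object-id_str>1234</Object-id_str></Object-id></Dbtag_tag></Dbtag>
  <Dbtag><Dbtag_db>GO</Dbtag_db><Dbtag_tag><Object-id>
    <Object-id_id>5524</Object-id_id></Object-id></Dbtag_tag></Dbtag>
</refs>"""


def dbtag_pairs(root):
    """Return (database name, id) pairs, one per Dbtag element."""
    pairs = []
    for dbtag in root.iter("Dbtag"):
        name = dbtag.findtext("Dbtag_db")
        obj = dbtag.find("Dbtag_tag/Object-id")
        ident = None
        if obj is not None:
            # the ID is stored under one of two element names
            ident = obj.findtext("Object-id_str") or obj.findtext("Object-id_id")
        pairs.append((name, ident))
    return pairs


pairs = dbtag_pairs(ET.fromstring(SAMPLE))
print(pairs)
```

A list of pairs is used rather than a dictionary because the same database name can occur more than once in a record; dict(pairs) works when names are known to be unique.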
Re: XML parsing per record
Willem Ligtenberg [EMAIL PROTECTED] wrote:
> On Sun, 17 Apr 2005 02:16:04 +, William Park wrote:
>> Care to post more details?
>
> The XML file I need to parse contains information about genes.
> So the first element is a gene and then there are a lot of sub-elements
> with sub-elements. I only need some of the information and want to
> store it in an object called gene. Later on this information will be
> printed into a file, which in its turn will be fed into some other
> program.

You have to help us a little more here. Which info do you want to
extract from the example below?

<Entrezgene-Set>
...
</Entrezgene-Set>

--
William Park [EMAIL PROTECTED], Toronto, Canada
Slackware Linux -- because it works.
Re: XML parsing per record
This is all the info I need from the XML file:

ID --> <Gene-track_geneid>320632</Gene-track_geneid>

Name --> <Gene-ref>
  <Gene-ref_locus>Pzp</Gene-ref_locus>

Startbase --> <Gene-commentary_seqs>
  <Seq-loc>
    <Seq-loc_int>
      <Seq-interval>
        <Seq-interval_from>126957426</Seq-interval_from>
        <Seq-interval_to>126989473</Seq-interval_to>
        <Seq-interval_strand>
          <Na-strand value="plus"/>
        </Seq-interval_strand>
        <Seq-interval_id>
          <Seq-id>
            <Seq-id_gi>51860766</Seq-id_gi>
          </Seq-id>
        </Seq-interval_id>
      </Seq-interval>
    </Seq-loc_int>
  </Seq-loc>
</Gene-commentary_seqs>
Endbase

Function --> <Prot-ref_name>
  <Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa</Prot-ref_name_E>
  <Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa (DEXH RNA helicase family)</Prot-ref_name_E>
</Prot-ref_name>

DBLink --> <Gene-ref_locus-tag>MGI:201</Gene-ref_locus-tag>
<Gene-commentary_source>
  <Other-source>
    <Other-source_src>
      <Dbtag>
        <Dbtag_db>GO</Dbtag_db>
        <Dbtag_tag>
          <Object-id>
            <Object-id_id>5524</Object-id_id>
          </Object-id>
        </Dbtag_tag>
      </Dbtag>
    </Other-source_src>
    <Other-source_anchor>ATP binding</Other-source_anchor>
    <Other-source_post-text>evidence: ISS</Other-source_post-text>
  </Other-source>
</Gene-commentary_source>

Product-type --> <Entrezgene_type value="protein-coding">6</Entrezgene_type>

gene-comment --> <Gene-ref_desc>activating signal cointegrator 1 complex subunit 3-like 1</Gene-ref_desc>

synonym --> <Gene-ref_syn>
  <Gene-ref_syn_E>HELIC2</Gene-ref_syn_E>
  <Gene-ref_syn_E>KIAA0788</Gene-ref_syn_E>
  <Gene-ref_syn_E>U5-200KD</Gene-ref_syn_E>
  <Gene-ref_syn_E>U5-200-KD</Gene-ref_syn_E>
  <Gene-ref_syn_E>A330064G03Rik</Gene-ref_syn_E>
</Gene-ref_syn>

EC --> <Prot-ref_ec>
  <Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
  <Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
</Prot-ref_ec>

Chromosome:
<SubSource>
  <SubSource_subtype value="chromosome">1</SubSource_subtype>
  <SubSource_name>6</SubSource_name>
</SubSource>

Some can happen more than once in a record.

On Fri, 22 Apr 2005 02:41:46 -0400, William Park wrote:
> Willem Ligtenberg [EMAIL PROTECTED] wrote:
>> On Sun, 17 Apr 2005 02:16:04 +, William Park wrote:
>>> Care to post more details?
>>
>> The XML file I need to parse contains information about genes.
>> So the first element is a gene and then there are a lot of
>> sub-elements with sub-elements. I only need some of the information
>> and want to store it in an object called gene. Later on this
>> information will be printed into a file, which in its turn will be
>> fed into some other program.
>
> You have to help us a little more here. Which info do you want to
> extract from the example below?
>
> <Entrezgene-Set>
> ...
> </Entrezgene-Set>
Re: XML parsing per record
As I'm trying to write the code using cElementTree, I stumble across one
problem. Sometimes there are multiple values to retrieve from one record
for the same element. Like this:

<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>

How do you get not only the first, but the rest as well, so that I can
store it in a list?

Thanks in advance,
Willem Ligtenberg

On Fri, 22 Apr 2005 13:48:15 +0200, Willem Ligtenberg wrote:
> This is all the info I need from the XML file:
>
> ID --> <Gene-track_geneid>320632</Gene-track_geneid>
>
> Name --> <Gene-ref>
>   <Gene-ref_locus>Pzp</Gene-ref_locus>
>
> Startbase --> <Gene-commentary_seqs>
>   <Seq-loc>
>     <Seq-loc_int>
>       <Seq-interval>
>         <Seq-interval_from>126957426</Seq-interval_from>
>         <Seq-interval_to>126989473</Seq-interval_to>
>         <Seq-interval_strand>
>           <Na-strand value="plus"/>
>         </Seq-interval_strand>
>         <Seq-interval_id>
>           <Seq-id>
>             <Seq-id_gi>51860766</Seq-id_gi>
>           </Seq-id>
>         </Seq-interval_id>
>       </Seq-interval>
>     </Seq-loc_int>
>   </Seq-loc>
> </Gene-commentary_seqs>
> Endbase
>
> Function --> <Prot-ref_name>
>   <Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa</Prot-ref_name_E>
>   <Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa (DEXH RNA helicase family)</Prot-ref_name_E>
> </Prot-ref_name>
>
> DBLink --> <Gene-ref_locus-tag>MGI:201</Gene-ref_locus-tag>
> <Gene-commentary_source>
>   <Other-source>
>     <Other-source_src>
>       <Dbtag>
>         <Dbtag_db>GO</Dbtag_db>
>         <Dbtag_tag>
>           <Object-id>
>             <Object-id_id>5524</Object-id_id>
>           </Object-id>
>         </Dbtag_tag>
>       </Dbtag>
>     </Other-source_src>
>     <Other-source_anchor>ATP binding</Other-source_anchor>
>     <Other-source_post-text>evidence: ISS</Other-source_post-text>
>   </Other-source>
> </Gene-commentary_source>
>
> Product-type --> <Entrezgene_type value="protein-coding">6</Entrezgene_type>
>
> gene-comment --> <Gene-ref_desc>activating signal cointegrator 1 complex subunit 3-like 1</Gene-ref_desc>
>
> synonym --> <Gene-ref_syn>
>   <Gene-ref_syn_E>HELIC2</Gene-ref_syn_E>
>   <Gene-ref_syn_E>KIAA0788</Gene-ref_syn_E>
>   <Gene-ref_syn_E>U5-200KD</Gene-ref_syn_E>
>   <Gene-ref_syn_E>U5-200-KD</Gene-ref_syn_E>
>   <Gene-ref_syn_E>A330064G03Rik</Gene-ref_syn_E>
> </Gene-ref_syn>
>
> EC --> <Prot-ref_ec>
>   <Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
>   <Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
> </Prot-ref_ec>
>
> Chromosome:
> <SubSource>
>   <SubSource_subtype value="chromosome">1</SubSource_subtype>
>   <SubSource_name>6</SubSource_name>
> </SubSource>
>
> Some can happen more than once in a record.
>
> On Fri, 22 Apr 2005 02:41:46 -0400, William Park wrote:
>> Willem Ligtenberg [EMAIL PROTECTED] wrote:
>>> On Sun, 17 Apr 2005 02:16:04 +, William Park wrote:
>>>> Care to post more details?
>>>
>>> The XML file I need to parse contains information about genes.
>>> So the first element is a gene and then there are a lot of
>>> sub-elements with sub-elements. I only need some of the information
>>> and want to store it in an object called gene. Later on this
>>> information will be printed into a file, which in its turn will be
>>> fed into some other program.
>>
>> You have to help us a little more here. Which info do you want to
>> extract from the example below?
>>
>> <Entrezgene-Set>
>> ...
>> </Entrezgene-Set>
Re: XML parsing per record
By the way, I know about findall, but when I iterate through it like:

for x in function:
    print 'function', x

I get:

function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
function <Element 'Prot-ref_name_E' at 0xb7d10d10>

But of course I want the information in there...

On Fri, 22 Apr 2005 15:22:17 +0200, Willem Ligtenberg wrote:
> As I'm trying to write the code using cElementTree, I stumble across
> one problem. Sometimes there are multiple values to retrieve from one
> record for the same element. Like this:
>
> <Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
> <Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>
>
> How do you get not only the first, but the rest as well, so that I can
> store it in a list?
>
> Thanks in advance,
> Willem Ligtenberg
Re: XML parsing per record
Willem Ligtenberg wrote:
> As I'm trying to write the code using cElementTree, I stumble across
> one problem. Sometimes there are multiple values to retrieve from one
> record for the same element. Like this:
>
> <Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
> <Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>
>
> How do you get not only the first, but the rest as well, so that I can
> store it in a list?

findall returns a list of matching elements. If elem is the parent
element, this gives you a list of the text inside all Prot-ref_name_E
child elements:

[e.text for e in elem.findall("Prot-ref_name_E")]

(you have read the elementtree documentation, I hope?)

/F
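A self-contained Python 3 run of that list comprehension, using the two values quoted in the question, might look like the sketch below. The record variable and the enclosing Prot-ref_name element are taken from the structure shown elsewhere in the thread; the standard-library xml.etree.ElementTree stands in for the cElementTree package used at the time.

```python
import xml.etree.ElementTree as ET

# Parent element with two children of the same name, as in the question.
record = ET.fromstring(
    "<Prot-ref_name>"
    "<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>"
    "<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>"
    "</Prot-ref_name>"
)

# findall() returns Element objects; .text holds the character data
names = [e.text for e in record.findall("Prot-ref_name_E")]
print(names)
```

Printing an element itself shows only its repr, which is why the text has to be read from the .text attribute.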
Re: XML parsing per record
As you can read in the other post of mine, my problem was with the iterating through the list. didn't know that you should do. e.text. I did only print e, not print e.text Did read documentation, but must admit not everything. Anyway, thank you very much! On Fri, 22 Apr 2005 15:47:08 +0200, Fredrik Lundh wrote: Willem Ligtenberg wrote: As I'm trying to write the code using cElementTree. I stumble across one problem. Sometimes there are multiple values to retrieve from one record for the same element. Like this: Prot-ref_name_EATP-binding cassette, subfamily G, member 1/Prot-ref_name_E Prot-ref_name_EATP-binding cassette 8/Prot-ref_name_E How do you get not only the first, but the rest as well, so that I can store it in a list. findall returns a list of matching elements. if elem is the paretnt element, this gives you a list of the text inside all Prot-ref_name_E child elements: [e.text for e in elem.findall(Prot-ref_name_E)] (you have read the elementtree documentation, I hope?) /F -- http://mail.python.org/mailman/listinfo/python-list
Re: XML parsing per record
Willem Ligtenberg wrote:
> By the way, I know about findall, but when I iterate through it like:
>
> for x in function:
>     print 'function', x
>
> I get:
>
> function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
> function <Element 'Prot-ref_name_E' at 0xb7d10d10>
>
> But of course I want the information in there...

for x in function:
    print 'function', x.text

/F
Re: XML parsing per record
I'll first try it using SAX, because I want to have as few dependencies
as possible. I already have BioPython as a dependency, and I personally
don't like to install lots of packages for a program to work, so I don't
want to impose that on other people.
But thanks anyway; I might go for cElementTree later on, if ordinary
SAX proves too slow...

On Wed, 20 Apr 2005 08:03:00 -0400, Kent Johnson wrote:
> Willem Ligtenberg wrote:
>> Willem Ligtenberg [EMAIL PROTECTED] wrote:
>>> I want to parse a very large (2.4 gig) XML file (bioinformatics
>>> of course :)) But I have no clue how to do that. Most things I see
>>> read the entire XML file at once. That isn't going to work here
>>> of course.
>>>
>>> So I would like to parse an XML file one record at a time and then
>>> be able to store the information in another object. How should I do
>>> that?
>>
>> The XML file I need to parse contains information about genes.
>> So the first element is a gene and then there are a lot of
>> sub-elements with sub-elements. I only need some of the information
>> and want to store it in an object called gene. Later on this
>> information will be printed into a file, which in its turn will be
>> fed into some other program.
>> This is an example of the XML:
>>
>> <?xml version="1.0"?>
>> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
>> <Entrezgene-Set>
>>   <Entrezgene>
>>     <snip>
>>   </Entrezgene>
>> </Entrezgene-Set>
>
> This should get you started with cElementTree:
>
> import cElementTree as ElementTree
>
> source = 'Entrezgene.xml'
>
> for event, elem in ElementTree.iterparse(source):
>     if elem.tag == 'Entrezgene':
>         # Process the Entrezgene element
>         geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
>         print 'Gene id', geneid
>
>         # Throw away the element, we're done with it
>         elem.clear()
>
> Kent
Re: XML parsing per record
Sorry, I just decided that I want to use your solution, but I am
wondering: is cElementTree in expat, or is that something different?

On Wed, 20 Apr 2005 08:03:00 -0400, Kent Johnson wrote:
> Willem Ligtenberg wrote:
>> Willem Ligtenberg [EMAIL PROTECTED] wrote:
>>> I want to parse a very large (2.4 gig) XML file (bioinformatics
>>> of course :)) But I have no clue how to do that. Most things I see
>>> read the entire XML file at once. That isn't going to work here
>>> of course.
>>>
>>> So I would like to parse an XML file one record at a time and then
>>> be able to store the information in another object. How should I do
>>> that?
>>
>> The XML file I need to parse contains information about genes.
>> So the first element is a gene and then there are a lot of
>> sub-elements with sub-elements. I only need some of the information
>> and want to store it in an object called gene. Later on this
>> information will be printed into a file, which in its turn will be
>> fed into some other program.
>> This is an example of the XML:
>>
>> <?xml version="1.0"?>
>> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
>> <Entrezgene-Set>
>>   <Entrezgene>
>>     <snip>
>>   </Entrezgene>
>> </Entrezgene-Set>
>
> This should get you started with cElementTree:
>
> import cElementTree as ElementTree
>
> source = 'Entrezgene.xml'
>
> for event, elem in ElementTree.iterparse(source):
>     if elem.tag == 'Entrezgene':
>         # Process the Entrezgene element
>         geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
>         print 'Gene id', geneid
>
>         # Throw away the element, we're done with it
>         elem.clear()
>
> Kent
Re: XML parsing per record
On 4/21/05, Willem Ligtenberg [EMAIL PROTECTED] wrote: Sorry I just decided that I want to use your solution, but I am wondering is cElemenTree in expat or is that something different? Nope, cElemenTree is very much its own man. See http://effbot.org/zone/celementtree.htm. -- Cheers, Simon B, [EMAIL PROTECTED], http://www.brunningonline.net/simon/blog/ -- http://mail.python.org/mailman/listinfo/python-list
Re: XML parsing per record
Don't assume that just because you have a 2.4G XML file that you have 2.4G of data. Looking at these verbose tags, plus the fact that the XML is pretty-printed (all those leading spaces - not even tabs! - add up), I'm guessing you only have about 5-10% actual data, and the rest is just XML tagging/untagging and spaces. (For example, 373 characters used to represent a date/time - this is a sin!) As XML goes, this looks pretty dead easy to parse with non-XML parser means. It looks like all of your leaf nodes open and close on the same line, which would be easy to extract with regexp's or pyparsing. Especially since you mention I only need some of the informtion, you don't even have to build a full document tree representation. SAX parsers would also be good, since you could only trigger on the matching subset of tags that you are really interested in. Lastly, you could even try a pyparsing approach. I usually don't recommend pyparsing for XML since there are already many good XML-targeted tools out there, but it is very easy to throw together something in pyparsing that extracts, say, all of the object-id_id entries, or all of the gene-source structures. What is the subset of information you are looking to extract? -- Paul -- http://mail.python.org/mailman/listinfo/python-list
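The line-oriented idea can be tried with nothing more than the re module. The Python 3 sketch below is an illustration of that suggestion, not a general XML parser: it assumes each leaf element opens and closes on one line with no attributes on the opening tag, ignores entities and comments, and the leaf_values helper and sample lines are invented for the example.

```python
import re

# Matches <Tag>text</Tag> where the opening and closing names agree.
# Assumption: no attributes on the opening tag and no nested markup.
LEAF = re.compile(r"<([\w.-]+)>([^<]*)</\1>")


def leaf_values(lines, wanted):
    """Return (tag, text) for every single-line leaf whose tag is wanted."""
    found = []
    for line in lines:
        for match in LEAF.finditer(line):
            if match.group(1) in wanted:
                found.append((match.group(1), match.group(2)))
    return found


# Sample lines shaped like the pretty-printed file in this thread.
sample = [
    "        <Gene-track_geneid>9996</Gene-track_geneid>",
    "        <Gene-ref_locus>Pzp</Gene-ref_locus>",
    "                <Object-id_id>320632</Object-id_id>",
]
values = leaf_values(sample, {"Gene-track_geneid", "Gene-ref_locus"})
print(values)
```

Leaves that carry attributes, such as Entrezgene_type with its value attribute, would need a wider pattern, and the approach loses the nesting context that tells one Object-id_id from another.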
Re: XML parsing per record
On Sun, 17 Apr 2005 02:16:04 +, William Park wrote:
> Willem Ligtenberg [EMAIL PROTECTED] wrote:
>> I want to parse a very large (2.4 gig) XML file (bioinformatics
>> of course :)) But I have no clue how to do that. Most things I see
>> read the entire XML file at once. That isn't going to work here
>> of course.
>>
>> So I would like to parse an XML file one record at a time and then be
>> able to store the information in another object. How should I do
>> that?
>>
>> Thanks in advance,
>>
>> Willem Ligtenberg
>> A total newbie to python by the way.
>
> You may want to try Expat (www.libexpat.org) or the Python wrapper to
> it. You can feed a small piece at a time, say by lines or whatever.
>
> Of course, it all depends on what kind of parsing you have in mind. :-)
>
> Care to post more details?

The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot of sub-elements
with sub-elements. I only need some of the information and want to store
it in an object called gene. Later on this information will be printed
into a file, which in its turn will be fed into some other program.
This is an example of the XML:

<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>9996</Gene-track_geneid>
        <Gene-track_status value="secondary">1</Gene-track_status>
        <Gene-track_current-id>
          <Dbtag>
            <Dbtag_db>LocusID</Dbtag_db>
            <Dbtag_tag>
              <Object-id>
                <Object-id_id>320632</Object-id_id>
              </Object-id>
            </Dbtag_tag>
          </Dbtag>
          <Dbtag>
            <Dbtag_db>GeneID</Dbtag_db>
            <Dbtag_tag>
              <Object-id>
                <Object-id_id>320632</Object-id_id>
              </Object-id>
            </Dbtag_tag>
          </Dbtag>
        </Gene-track_current-id>
        <Gene-track_create-date>
          <Date>
            <Date_std>
              <Date-std>
                <Date-std_year>2003</Date-std_year>
                <Date-std_month>8</Date-std_month>
                <Date-std_day>28</Date-std_day>
                <Date-std_hour>21</Date-std_hour>
                <Date-std_minute>39</Date-std_minute>
                <Date-std_second>0</Date-std_second>
              </Date-std>
            </Date_std>
          </Date>
        </Gene-track_create-date>
        <Gene-track_update-date>
          <Date>
            <Date_std>
              <Date-std>
                <Date-std_year>2005</Date-std_year>
                <Date-std_month>2</Date-std_month>
                <Date-std_day>17</Date-std_day>
                <Date-std_hour>12</Date-std_hour>
                <Date-std_minute>54</Date-std_minute>
                <Date-std_second>0</Date-std_second>
              </Date-std>
            </Date_std>
          </Date>
        </Gene-track_update-date>
      </Gene-track>
    </Entrezgene_track-info>
    <Entrezgene_type value="protein-coding">6</Entrezgene_type>
    <Entrezgene_source>
      <BioSource>
        <BioSource_genome value="genomic">1</BioSource_genome>
        <BioSource_origin value="natural">1</BioSource_origin>
        <BioSource_org>
          <Org-ref>
            <Org-ref_taxname>Mus musculus</Org-ref_taxname>
            <Org-ref_common>house mouse</Org-ref_common>
            <Org-ref_db>
              <Dbtag>
                <Dbtag_db>taxon</Dbtag_db>
                <Dbtag_tag>
                  <Object-id>
                    <Object-id_id>10090</Object-id_id>
                  </Object-id>
                </Dbtag_tag>
              </Dbtag>
            </Org-ref_db>
            <Org-ref_syn>
              <Org-ref_syn_E>mouse</Org-ref_syn_E>
            </Org-ref_syn>
            <Org-ref_orgname>
              <OrgName>
                <OrgName_name>
                  <OrgName_name_binomial>
                    <BinomialOrgName>
                      <BinomialOrgName_genus>Mus</BinomialOrgName_genus>
                      <BinomialOrgName_species>musculus</BinomialOrgName_species>
                    </BinomialOrgName>
                  </OrgName_name_binomial>
                </OrgName_name>
                <OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muridae; Murinae; Mus</OrgName_lineage>
                <OrgName_gcode>1</OrgName_gcode>
                <OrgName_mgcode>2</OrgName_mgcode>
                <OrgName_div>ROD</OrgName_div>
              </OrgName>
            </Org-ref_orgname>
          </Org-ref>
        </BioSource_org>
      </BioSource>
    </Entrezgene_source>
    <Entrezgene_gene>
      <Gene-ref>
      </Gene-ref>
    </Entrezgene_gene>
    <Entrezgene_gene-source>
      <Gene-source>
        <Gene-source_src>LocusLink</Gene-source_src>
        <Gene-source_src-int>9996</Gene-source_src-int>
        <Gene-source_src-str2>9996</Gene-source_src-str2>
        <Gene-source_gene-display value="false"/>
        <Gene-source_locus-display value="false"/>
        <Gene-source_extra-terms value="false"/>
      </Gene-source>
    </Entrezgene_gene-source>
Re: XML parsing per record
Willem Ligtenberg wrote:
> Willem Ligtenberg [EMAIL PROTECTED] wrote:
>> I want to parse a very large (2.4 gig) XML file (bioinformatics
>> of course :)) But I have no clue how to do that. Most things I see
>> read the entire XML file at once. That isn't going to work here
>> of course.
>>
>> So I would like to parse an XML file one record at a time and then be
>> able to store the information in another object. How should I do
>> that?
>
> The XML file I need to parse contains information about genes.
> So the first element is a gene and then there are a lot of sub-elements
> with sub-elements. I only need some of the information and want to
> store it in an object called gene. Later on this information will be
> printed into a file, which in its turn will be fed into some other
> program.
> This is an example of the XML:
>
> <?xml version="1.0"?>
> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
> <Entrezgene-Set>
>   <Entrezgene>
>     <snip>
>   </Entrezgene>
> </Entrezgene-Set>

This should get you started with cElementTree:

import cElementTree as ElementTree

source = 'Entrezgene.xml'

for event, elem in ElementTree.iterparse(source):
    if elem.tag == 'Entrezgene':
        # Process the Entrezgene element
        geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
        print 'Gene id', geneid

        # Throw away the element, we're done with it
        elem.clear()

Kent
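The same record-at-a-time pattern is available in the Python 3 standard library as xml.etree.ElementTree.iterparse. The sketch below is a variation on the code above rather than a replacement for it: it clears the root element after each record so finished records are detached from the tree, the iter_gene_ids generator is a made-up name, and the sample is a cut-down document shaped like the one in the thread.

```python
import io
import xml.etree.ElementTree as ET

# Cut-down sample in the shape of the thread's data (two records).
SAMPLE = b"""<?xml version="1.0"?>
<Entrezgene-Set>
  <Entrezgene><Entrezgene_track-info><Gene-track>
    <Gene-track_geneid>9996</Gene-track_geneid>
  </Gene-track></Entrezgene_track-info></Entrezgene>
  <Entrezgene><Entrezgene_track-info><Gene-track>
    <Gene-track_geneid>320632</Gene-track_geneid>
  </Gene-track></Entrezgene_track-info></Entrezgene>
</Entrezgene-Set>"""


def iter_gene_ids(source):
    """Yield the gene id of each Entrezgene record, one record at a time."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == "Entrezgene":
            yield elem.findtext(
                "Entrezgene_track-info/Gene-track/Gene-track_geneid"
            )
            root.clear()  # detach finished records so memory stays flat


gene_ids = list(iter_gene_ids(io.BytesIO(SAMPLE)))
print(gene_ids)
```

For a real file, a file name or an open binary file object can be passed in place of the BytesIO wrapper.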
Re: XML parsing per record
William Park wrote: You may want to try Expat (www.libexpat.org) or Python wrapper to it. Python comes with a low-level expat wrapper (pyexpat). however, if you want performance, cElementTree (which also uses expat) is a lot faster than pyexpat. (see my other post for links to benchmarks and code). /F -- http://mail.python.org/mailman/listinfo/python-list
Re: XML parsing per record
Willem Ligtenberg wrote:
> I want to parse a very large (2.4 gig) XML file (bioinformatics
> of course :)) But I have no clue how to do that. Most things I see read
> the entire XML file at once. That isn't going to work here of course.
>
> So I would like to parse an XML file one record at a time and then be
> able to store the information in another object. How should I do that?
>
> Thanks in advance,
>
> Willem Ligtenberg
> A total newbie to python by the way.

Read about SAX parsers.
This may be of help:
http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/

Out of curiosity, why is the data stored in an XML file?
XML is not known for its efficiency.

--Irmen
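For comparison, this is roughly what the SAX route looks like with the standard library's xml.sax module in Python 3. It is a sketch under assumptions: only the Gene-track_geneid element is collected, the GeneIdHandler class name is invented, and the sample is a shortened document in the shape shown in the thread.

```python
import io
import xml.sax

SAMPLE = b"""<?xml version="1.0"?>
<Entrezgene-Set>
  <Entrezgene><Entrezgene_track-info><Gene-track>
    <Gene-track_geneid>9996</Gene-track_geneid>
  </Gene-track></Entrezgene_track-info></Entrezgene>
  <Entrezgene><Entrezgene_track-info><Gene-track>
    <Gene-track_geneid>320632</Gene-track_geneid>
  </Gene-track></Entrezgene_track-info></Entrezgene>
</Entrezgene-Set>"""


class GeneIdHandler(xml.sax.ContentHandler):
    """Collect the text of every Gene-track_geneid element."""

    def __init__(self):
        super().__init__()
        self.gene_ids = []
        self._buffer = None  # a list while inside a wanted element

    def startElement(self, name, attrs):
        if name == "Gene-track_geneid":
            self._buffer = []

    def characters(self, content):
        # text may arrive in several pieces, so accumulate it
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Gene-track_geneid":
            self.gene_ids.append("".join(self._buffer))
            self._buffer = None


handler = GeneIdHandler()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.gene_ids)
```

The handler never holds more than the current text in memory, which is the attraction for a multi-gigabyte file; the cost is that all nesting state has to be tracked by hand.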
Re: XML parsing per record
Irmen de Jong wrote: XML is not known for its efficiency <sarcasm> Surely you are blaspheming, sir! XML's the greatest thing since peanut butter! </sarcasm> I'm just *waiting* for the day someone finds its use on the rolls of toilet paper... oh the glorious day... -- http://mail.python.org/mailman/listinfo/python-list
Re: XML parsing per record
Willem Ligtenberg wrote: I want to parse a very large (2.4 gig) XML file (bioinformatics of course :)) But I have no clue how to do that. Most things I see read the entire XML file at once. That isn't going to work here, of course. So I would like to parse an XML file one record at a time and then be able to store the information in another object. How should I do that? You might be interested in this recipe using ElementTree: http://online.effbot.org/2004_12_01_archive.htm#element-generator Kent -- http://mail.python.org/mailman/listinfo/python-list
Re: XML parsing per record
Kent Johnson wrote: So I would like to parse an XML file one record at a time and then be able to store the information in another object. You might be interested in this recipe using ElementTree: http://online.effbot.org/2004_12_01_archive.htm#element-generator if you have ElementTree 1.2.5 or later, the iterparse function provides a more efficient implementation of that pattern: http://effbot.org/zone/element-iterparse.htm the cElementTree implementation of iterparse is a lot faster than SAX; see the second table under http://effbot.org/zone/celementtree.htm#benchmarks for some figures. /F -- http://mail.python.org/mailman/listinfo/python-list
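One way to picture the element-generator pattern on top of iterparse is the sketch below. It is not the recipe from the linked pages; iter_records and the sample document are hypothetical names. It grabs the root from the first "start" event and clears it after each record so that finished records do not pile up in memory.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield each complete element named `tag`, freeing it afterwards."""
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is always the start of the root element.
    event, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Remove the finished record (and anything before it) from the tree.
            root.clear()


# Hypothetical sample data.
SAMPLE = b"<genes><gene><id>1</id></gene><gene><id>2</id></gene><gene><id>3</id></gene></genes>"

ids = [rec.findtext("id") for rec in iter_records(io.BytesIO(SAMPLE), "gene")]
print(ids)
```

Each yielded element must be fully used before asking for the next one, because it is emptied as soon as the generator resumes.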
Re: XML parsing per record
Willem Ligtenberg [EMAIL PROTECTED] wrote: I want to parse a very large (2.4 gig) XML file (bioinformatics of course :)) But I have no clue how to do that. Most things I see read the entire XML file at once. That isn't going to work here, of course. So I would like to parse an XML file one record at a time and then be able to store the information in another object. How should I do that? Thanks in advance, Willem Ligtenberg A total newbie to python by the way. You may want to try Expat (www.libexpat.org) or the Python wrapper for it. You can feed it a small piece at a time, say by lines or whatever. Of course, it all depends on what kind of parsing you have in mind. :-) Care to post more details? -- William Park [EMAIL PROTECTED], Toronto, Canada Slackware Linux -- because it works. -- http://mail.python.org/mailman/listinfo/python-list
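Feeding Expat "a small piece at a time" might look like the following sketch, which uses the standard xml.parsers.expat wrapper. The count_records helper, the 16-byte chunk size and the sample document are assumptions chosen only to show that chunk boundaries may fall anywhere, even in the middle of a tag.

```python
import xml.parsers.expat


def count_records(chunks, tag):
    """Count start tags named `tag` while feeding the parser chunk by chunk."""
    count = 0

    def start(name, attrs):
        nonlocal count
        if name == tag:
            count += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    for chunk in chunks:
        parser.Parse(chunk, False)  # more data will follow
    parser.Parse(b"", True)  # signal end of input
    return count


# Hypothetical sample, cut into 16-byte pieces regardless of tag boundaries.
SAMPLE = b"<Entrezgene-Set><Entrezgene>a</Entrezgene><Entrezgene>b</Entrezgene></Entrezgene-Set>"
chunks = [SAMPLE[i:i + 16] for i in range(0, len(SAMPLE), 16)]

total = count_records(chunks, "Entrezgene")
print(total)
```

With a real file, the chunks would come from repeated read() calls on the open file instead of slicing a bytes object.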
Re: xml parsing escape characters
From your experience, do you think that if this wrong XML code was meant to be read only by some kind of Microsoft parser, the error would not occur? I'll try to explain: the XML producer writes the code on a Windows platform and 'thinks' that every client will read/parse the code with a specific Windows parser. Could that (wrong) XML code parse correctly in that kind of specific Windows client? Or in other words: do you know any Windows parser that could turn that erroneous encoding into an XML tree, with four or five inner levels of tags? I'd like to thank everyone for taking the time to answer me. Luis -- http://mail.python.org/mailman/listinfo/python-list
Re: xml parsing escape characters
Luis P. Mendes wrote: xml producer writes the code in Windows platform and 'thinks' that every client will read/parse the code with a specific Windows parser. Could that (wrong) XML code parse correctly in that kind of specific Windows client? not if it's an XML parser. Do you know any windows parser that could turn that erroneous encoding to a xml tree, with four or five inner levels of tags? any parser *can* do that, but I doubt many parsers will do it unless you ask them to (by extracting the string and parsing it again). here's the elementtree version:

    import urllib
    from elementtree.ElementTree import parse, XML

    # parse the outer document, then parse the text of its string element again
    wrapper = parse(urllib.urlopen(url))
    dataset = XML(wrapper.findtext("{http://www..}string"))

/F -- http://mail.python.org/mailman/listinfo/python-list
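The "extract the string and parse it again" idea can be sketched with today's standard library as follows. This is an illustration under assumptions: RESPONSE is a made-up stand-in for what the web service returns, the namespace URI is invented (the real one is elided in the thread), and the string element is taken to be the document root, so its text is read directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the web service response; the namespace is invented.
RESPONSE = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<string xmlns="http://example.invalid/ns">'
    b"&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;"
    b"&lt;/Order&gt;&lt;/DataSet&gt;</string>"
)

# First pass: the outer document has one element whose content is plain text.
wrapper = ET.fromstring(RESPONSE)
embedded = wrapper.text  # the parser has already turned &lt; into <

# Second pass: parse that text as a document in its own right.
dataset = ET.fromstring(embedded)
customer = dataset.findtext("Order/Customer")
print(wrapper.tag, len(wrapper), dataset.tag, customer)
```

Only after the second pass do DataSet, Order and Customer exist as elements.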
Re: xml parsing escape characters
Luis P. Mendes wrote: From your experience, do you think that if this wrong XML code could be meant to be read only by somekind of Microsoft parser, the error will not occur? This is very unlikely. MSXML would never do this incorrectly. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list
Re: xml parsing escape characters
this is the xml document:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www..">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
    (... others ...)
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>

When I do:
print xmldoc.toxml()
it prints:
<?xml version="1.0" ?> <string xmlns="http://www...">&lt;DataSet&gt; &lt;Order&gt; &lt;Customer&gt;439&lt;/Customer&gt; &lt;/Order&gt; &lt;/DataSet&gt;</string>

with:
stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
I get:
<string xmlns="http://www...">&lt;DataSet&gt; &lt;Order&gt; &lt;Customer&gt;439&lt;/Customer&gt; &lt;/Order&gt; &lt;/DataSet&gt;</string>

with:
DataSetNode = stringNode.childNodes[0]
print DataSetNode.toxml()
I get:
&lt;DataSet&gt; &lt;Order&gt; &lt;Customer&gt;439&lt;/Customer&gt; &lt;/Order&gt; &lt;/DataSet&gt;

so far so good, but when I issue the command:
print DataSetNode.childNodes[0]
I get:
IndexError: tuple index out of range

Why the error, and why does it return a tuple? Why doesn't it return: &lt;Order&gt; &lt;Customer&gt;439&lt;/Customer&gt; &lt;/Order&gt; ?? -- http://mail.python.org/mailman/listinfo/python-list
Re: xml parsing escape characters
Luis P. Mendes wrote: this is the xml document: <?xml version="1.0" encoding="utf-8"?> <string xmlns="http://www..">&lt;DataSet&gt; &lt;Order&gt; &lt;Customer&gt;439&lt;/Customer&gt; (... others ...) &lt;/Order&gt; &lt;/DataSet&gt;</string> This is an XML document containing a single tag, <string>, whose content is text containing entity-escaped XML. This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc. All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the <string> tag to be able to treat it as structured XML. Kent -- http://mail.python.org/mailman/listinfo/python-list
Re: xml parsing escape characters
Kent Johnson wrote: [...] This is an XML document containing a single tag, <string>, whose content is text containing entity-escaped XML. This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc. All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the <string> tag to be able to treat it as structured XML. The unescaping is usually done for you by the xml parser that you use. --Irmen -- http://mail.python.org/mailman/listinfo/python-list
Re: xml parsing escape characters
Irmen de Jong wrote: Kent Johnson wrote: [...] This is an XML document containing a single tag, <string>, whose content is text containing entity-escaped XML. This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc. All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the <string> tag to be able to treat it as structured XML. The unescaping is usually done for you by the xml parser that you use. Yes, so if your XML contains for example <stuff>&lt;not a tag&gt;</stuff> and you parse this and ask for the *text* content of the <stuff> tag, you will get the string "<not a tag>" but it's still *not* a tag. If you try to get child elements of the <stuff> element there will be none. This is exactly the confusion the OP has. --Irmen -- http://mail.python.org/mailman/listinfo/python-list
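The point about text content versus child elements is easy to check with a short sketch. It uses the stuff example from the message above and the standard xml.etree.ElementTree module, which is a choice made here and not something the thread prescribes.

```python
import xml.etree.ElementTree as ET

# An element whose content is escaped markup, i.e. plain character data.
stuff = ET.fromstring("<stuff>&lt;not a tag&gt;</stuff>")

text = stuff.text       # the parser has unescaped the entities
children = list(stuff)  # but it has not created any child elements

print(repr(text), children)
```

The unescaped text looks like a tag, yet the element still has no children.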
Re: xml parsing escape characters
Luis P. Mendes wrote: with: DataSetNode = stringNode.childNodes[0] print DataSetNode.toxml() I get: &lt;DataSet&gt; &lt;Order&gt; &lt;Customer&gt;439&lt;/Customer&gt; &lt;/Order&gt; &lt;/DataSet&gt; so far so good, but when I issue the command: print DataSetNode.childNodes[0] I get: IndexError: tuple index out of range Why the error, and why does it return a tuple? The DataSetNode has no children, because it is not an Element node, but a Text node. In XML, an element is denoted by <DataSet>...</DataSet> and *not* by &lt;DataSet&gt;...&lt;/DataSet&gt; The latter is just a single string, represented in XML as a Text node. It does not give you any hierarchy whatsoever. As a text node does not have any children, its childNodes member is an empty tuple; indexing that tuple gives you an IndexError. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list
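Martin's explanation can be reproduced with xml.dom.minidom in a few lines. This is a sketch on a made-up document (the namespace URI is invented, since the real one is elided in the thread); it shows that the first child of the string element is a Text node whose childNodes is an empty tuple.

```python
from xml.dom import minidom

# Hypothetical document with escaped markup inside <string>.
doc = minidom.parseString(
    b'<string xmlns="http://example.invalid/ns">'
    b"&lt;DataSet&gt;&lt;Order/&gt;&lt;/DataSet&gt;</string>"
)

string_node = doc.childNodes[0]
text_node = string_node.childNodes[0]

is_text = text_node.nodeType == text_node.TEXT_NODE
child_count = len(text_node.childNodes)

try:
    text_node.childNodes[0]
    raised = False
except IndexError:
    # Same failure as in the original question: nothing to index.
    raised = True

print(is_text, child_count, raised)
```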
Re: xml parsing escape characters
Irmen de Jong wrote: The unescaping is usually done for you by the xml parser that you use. Usually, but not in this case. If you have a text that looks like XML, and you want to put it into an XML element, the XML file uses &lt; and &gt;. The XML parser unescapes that as < and >. However, it does not then consider the < and > as markup, and it shouldn't. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list
Re: xml parsing escape characters
Martin v. Löwis wrote: Irmen de Jong wrote: The unescaping is usually done for you by the xml parser that you use. Usually, but not in this case. If you have a text that looks like XML, and you want to put it into an XML element, the XML file uses &lt; and &gt;. The XML parser unescapes that as < and >. However, it does not then consider the < and > as markup, and it shouldn't. That's also what I said? The unescaping of the XML entities in the contents of the OP's <string> element is done for you by the parser, so you will get a text node with the <,>,&,whatever in there. The OP probably wants to feed that to a new xml parser instance to process it as markup. Or perhaps the way the original XML document is constructed is flawed. --Irmen -- http://mail.python.org/mailman/listinfo/python-list
Re: xml parsing escape characters
Irmen de Jong wrote: Usually, but not in this case. If you have a text that looks like XML, and you want to put it into an XML element, the XML file uses &lt; and &gt;. The XML parser unescapes that as < and >. However, it does not then consider the < and > as markup, and it shouldn't. That's also what I said? You said it in response to "All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the <string> tag to be able to treat it as structured XML." In that context, I interpreted "The unescaping is usually done for you by the xml parser that you use." as "The parser should have done what you want; if the parser didn't, that is a bug in the parser." The OP probably wants to feed that to a new xml parser instance to process it as markup. Or perhaps the way the original XML document is constructed is flawed. Either of these, indeed - probably the latter. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list
Re: xml parsing escape characters
I would like to thank everyone for your answers, but I'm not seeing the light yet! When I access the url via the Firefox browser and look into the source code, I also get: <?xml version="1.0" encoding="utf-8"?> <string xmlns="http">&lt;DataSet&gt; &lt;Order&gt; &lt;Customer&gt;439&lt;/Customer&gt; &lt;/Order&gt; &lt;/DataSet&gt;</string> should I take the contents of the string tag that is text and replace all '&lt;' with '<' and '&gt;' with '>' and then read it with xml.minidom? how to do it? or should I use another parser that accomplishes the task with no need to replace the escaped characters? -- http://mail.python.org/mailman/listinfo/python-list
Re: xml parsing escape characters
Luis P. Mendes wrote: When I access the url via the Firefox browser and look into the source code, I also get: <?xml version="1.0" encoding="utf-8"?> <string xmlns="http">&lt;DataSet&gt; &lt;Order&gt; &lt;Customer&gt;439&lt;/Customer&gt; &lt;/Order&gt; &lt;/DataSet&gt;</string>

Please do try to understand what you are seeing. This is crucial for understanding what happens. You may have the understanding that XML can be represented as a tree. This would be good - if not, please read a book that explains why XML can be considered as a tree. In the tree, you have inner nodes, and leaf nodes. For example, the document

<a> <b>Hello</b> <c>World</c> </a>

has 5 nodes (ignoring whitespace content):

Element:a
+-- Element:b
|   \-- Text:Hello
\-- Element:c
    \-- Text:World

So the leaf nodes are typically Text nodes (unless you have an empty element). Your document has this structure:

Element:string
\-- Text:<DataSet> <Order> <Customer>439</Customer> </Order> </DataSet>

So the ***TEXT*** contains the letter "<", just like it contains the letters "O" and "r". There IS no element <Order> in your document, no matter how hard you look. If you want a DataSet *element* in your document, it should read

<string xmlns="..."> <DataSet> <Order> <Customer>439</Customer> </Order> </DataSet> </string>

As this is the document you apparently want to process, complain to whoever gave you that other document.

should I take the contents of the string tag that is text and replace all '&lt;' with '<' and '&gt;' with '>' and then read it with xml.minidom? No. We still don't know what you want to achieve, so it is difficult to advise you what to do. My best advice is that whoever generates the XML document should fix it.

or should I use another parser that accomplishes the task with no need to replace the escaped characters? No. The parser is working correctly. The document you got can also be interpreted as containing another XML document as a text. This is evil, but apparently people are doing it, anyway. If you really want that embedded document, you need first to extract it. To see what I mean, do print DataSetNode.data The .data attribute gives you the string contents of a text node. You could use this as an XML document, and pass it again to an XML parser. This would be ugly, but might be your only choice if the producer of the document is unwilling to adjust. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list
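A minimal sketch of that last resort with xml.dom.minidom is shown below: read the .data of the text inside the string element, then hand it to the parser a second time. The OUTER document and its namespace URI are made up for illustration, and joining all Text children is a defensive choice made here in case a parser splits the text into several nodes.

```python
from xml.dom import minidom

# Hypothetical response; the real namespace URI is elided in the thread.
OUTER = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<string xmlns="http://example.invalid/ns">'
    b"&lt;DataSet&gt; &lt;Order&gt; &lt;Customer&gt;439&lt;/Customer&gt;"
    b" &lt;/Order&gt; &lt;/DataSet&gt;</string>"
)

outer = minidom.parseString(OUTER)
string_node = outer.documentElement

# .data holds the already-unescaped characters of each Text node.
embedded = "".join(
    node.data for node in string_node.childNodes
    if node.nodeType == node.TEXT_NODE
)

# Second parse: now DataSet, Order and Customer become real elements.
inner = minidom.parseString(embedded)
customer = inner.getElementsByTagName("Customer")[0].firstChild.data
print(embedded)
print(customer)
```

No manual replacing of &lt; and &gt; is involved; the first parse already did the unescaping.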
Re: xml parsing escape characters
On Thu, 20 Jan 2005 21:54:30 +0100, Martin v. Löwis wrote: Luis P. Mendes wrote: When I access the url via the Firefox browser and look into the source code, I also get: <?xml version="1.0" encoding="utf-8"?> <string xmlns="http">&lt;DataSet&gt; &lt;Order&gt; &lt;Customer&gt;439&lt;/Customer&gt; &lt;/Order&gt; &lt;/DataSet&gt;</string> Please do try to understand what you are seeing. This is crucial for understanding what happens. From extremely painful and lengthy personal experience, Luis, I ***extremely*** strongly recommend taking the time to nail this down until you really, really, really understand what is going on. Until you can explain it to somebody else coherently, ideally. Mixing escaping levels like this absolutely, positively *must* be done correctly, or extremely-painful-to-debug problems will result. (My painful experience was layering an RPC implementation in plain text on top of IM messages, where I was dealing with everything from the socket level up except the XML parser. Ultimately it turned out there was a problem in the XML parser, it rendered &amp;amp; as &, which is wrong wrong wrong. But that took a *long* time to find, especially as I had other bugs in the way.) Since you're layering XML in XML, test &amp;amp; and &amp;amp;amp; to make sure they work correctly; those usually show encoding errors. And, given your current understanding of the issue, do not write your own decoding function unless you absolutely can't avoid it. -- http://mail.python.org/mailman/listinfo/python-list
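The suggested check on nested ampersands can be written as a small round trip. This is a sketch with made-up sample data: the inner document is escaped once more with the standard xml.sax.saxutils.escape before being embedded, and each parse should then peel off exactly one level of escaping.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical inner document that itself contains escaped ampersands.
inner_doc = "<note>R&amp;D &amp;amp; more</note>"

# Embedding it as text adds one more level of escaping.
outer_doc = "<string>" + escape(inner_doc) + "</string>"

# Level 1: parsing the outer document must give back the inner document unchanged.
recovered = ET.fromstring(outer_doc).text

# Level 2: parsing the inner document removes one further level only.
note_text = ET.fromstring(recovered).text

print(outer_doc)
print(recovered)
print(note_text)
```

A parser or hand-written decoder that collapses two levels at once would fail the first comparison, which is the kind of bug described in the message above.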
Re: xml parsing escape characters
Luis P. Mendes wrote: I get the following result: <?xml version="1.0" encoding="utf-8"?> <string xmlns="http://www..">&lt;DataSet&gt; &lt;Order&gt; Most likely, this result is correct, and your document really does contain &lt;Order&gt; I don't get any elements. But, if I access the same url via a browser, the result in the browser window is something like: <string xmlns="http://www.."> <DataSet> Most likely, your browser is incorrect (or at least confusing), and renders &lt; as <, even though this is not markup. I already browsed the web, I know it's about the escape characters, but I didn't find a simple solution for this. Not sure what "this" is. AFAICT, everything works correctly. Regards, Martin -- http://mail.python.org/mailman/listinfo/python-list