Re: XML Parsing

2017-09-04 Thread Peter Otten
Sambit Samal wrote:

> Hi ,
> 
> Need help in Python Script using xml.etree.ElementTree  to update the
> value of any element in below XML ( e.g SETNPI to be 5 ) based on some
> constraint ( e.g  ) .

Something along the lines

from xml.etree import ElementTree as ET

tree = ET.parse("original.xml")
for e in tree.findall(".//ruleset[@id='2']//SETNPI"):
    e.text = "5"
tree.write("modified.xml")

might work for you.
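
A self-contained variant of the same idea, as a rough sketch only: the sample document is hypothetical (only "ruleset", its "id" attribute and "SETNPI" come from the XPath above), and set_npi() is an illustrative helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: only "ruleset", its "id" attribute and "SETNPI"
# come from the XPath above; everything else is made up for illustration.
SAMPLE = """\
<rules>
  <ruleset id="1"><action><SETNPI>1</SETNPI></action></ruleset>
  <ruleset id="2"><action><SETNPI>1</SETNPI></action></ruleset>
</rules>
"""

def set_npi(xml_text, ruleset_id, new_value):
    """Return xml_text with SETNPI set to new_value inside one ruleset."""
    root = ET.fromstring(xml_text)
    path = ".//ruleset[@id='%s']//SETNPI" % ruleset_id
    for elem in root.findall(path):
        elem.text = new_value
    return ET.tostring(root, encoding="unicode")

updated = set_npi(SAMPLE, "2", "5")
print(updated)
```

The constraint lives entirely in the path expression, so a different condition would only need a different predicate.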

> 
>   
> DRATRN
> 
> 
> 
>  1
>  
>
>  1
>  
>  0
>   ORIG
>  CFORIG
> TERM
>
> 
> 
>   
>1
>   1
>   
>  CONTINUE
> 
>
> 
>   
>   2
> 
> 1
>  
>   0
>   TERM
>   
> 
> 
>   
>1
>1
>
>CONTINUE
>
>


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: xml parsing with lxml

2016-10-07 Thread Doug OLeary
On Friday, October 7, 2016 at 3:21:43 PM UTC-5, John Gordon wrote:
> root = doc.getroot()
> for child in root:
>     print(child.tag)
> 

Excellent!  Thank you, sir!  That'll get me started.

Appreciate the reply.

Doug O'Leary


Re: xml parsing with lxml

2016-10-07 Thread John Gordon
In <622ea3b0-88b4-420b-89e0-9e7c6e866...@googlegroups.com> Doug OLeary 
 writes:

> >>> from lxml import etree
> >>> doc = etree.parse('config.xml')

> Now what?  For instance, how do I list the top level children of
> ?

root = doc.getroot()
for child in root:
    print(child.tag)

-- 
John Gordon   A is for Amy, who fell down the stairs
gor...@panix.com  B is for Basil, assaulted by bears
-- Edward Gorey, "The Gashlycrumb Tinies"



Re: XML Parsing

2015-04-05 Thread Ben Finney
Sepideh Ghanavati sepideh...@gmail.com writes:

> I know basic of python and I have an xml file created from csv

What XML schema defines the document's format?

Without knowing the schema, parsing will be unreliable.

What created the document? Why is it relevant that the document was
“created from CSV”?

> which has three attributes category, definition and definition
> description.

What do you mean by “attributes”?

In Python, an attribute has a specific meaning.

In XML, an attribute has a rather different meaning.

Neither of those meanings seems to apply to “the XML document has three
attributes”. XML documents don't have attributes; different XML elements
in a document have different attributes.
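
To make the two meanings concrete, here is a small sketch using the standard library's ElementTree. The element and attribute names are hypothetical; they are not taken from the original poster's file.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the element and attribute names are illustrative,
# not taken from the original poster's file.
elem = ET.fromstring('<entry category="privacy">Some definition text</entry>')

# XML attributes are the name="value" pairs in the element's start tag...
xml_attrs = elem.attrib              # a dict of XML attributes
category = elem.get("category")

# ...while Python attributes are names bound on the Element object itself.
python_attr_tag = elem.tag
python_attr_text = elem.text

print(xml_attrs, category, python_attr_tag, python_attr_text)
```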

> I want to parse through xml file and identify actors, constraints,
> principal from the text.

How are those defined in the document's schema?

> However, I am not sure what is the best way to go. Any suggestion?

You should:

* Learn some more about XML
  URL:http://www.xmlobjective.com/the-basic-principles-of-xml/.

* Learn exactly what formal document schema defines the document
  URL:https://en.wikipedia.org/wiki/XML_schema. If the document isn't
  accompanied by a specification of exactly what its schema is, you're
  going to have a difficult time.

-- 
 \“If I melt dry ice, can I swim without getting wet?” —Steven |
  `\Wright |
_o__)  |
Ben Finney



Re: XML Parsing

2015-04-05 Thread Stefan Behnel
Sepideh Ghanavati schrieb am 06.04.2015 um 04:26:
 I know basic of python and I have an xml file created from csv which has
 three attributes category, definition and definition description.
 I want to parse through xml file and identify actors, constraints,
 principal from the text. However, I am not sure what is the best way to
 go. Any suggestion?

If it's really generated from a CSV file, you could also parse that instead:

https://docs.python.org/3/library/csv.html

Admittedly, CSV files are simple, but they also have major problems,
especially when it comes to detecting their character encoding and their
specific format (tab/comma/semicolon/space/whatever separated, with or
without escaping, quoted values, ...). Meaning, you can easily end up
reading nonsense from the file instead of the content that was originally
put into it.
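
If the CSV route is taken, stating the format explicitly avoids some of that guessing. A minimal sketch, assuming a semicolon-separated file; the column names follow the description in the question, while the row content and the read_rows() helper are made up for illustration.

```python
import csv
import io

# Hypothetical data: the column names follow the original poster's description
# (category, definition, definition description); the row is made up.
RAW = (
    "category;definition;definition description\n"
    'actor;"User";"A person, or system, that uses the service"\n'
)

def read_rows(text, delimiter=";"):
    """Parse CSV text with an explicitly stated delimiter, no guessing."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

rows = read_rows(RAW)
print(rows[0]["definition description"])
```

For a real file, the encoding would also have to be passed to open() explicitly, since the csv module cannot detect it.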

So, if you want to parse from XML instead, use ElementTree:

https://docs.python.org/3/library/xml.etree.elementtree.html

Stefan



Re: XML parsing ExpatError with xml.dom.minidom at line 1, column 0

2014-02-13 Thread Peter Otten
ming wrote:

 Hi,
 i've a Python script which stopped working about a month ago.   But until
 then, it worked flawlessly for months (if not years).   A tiny
 self-contained 7-line script is provided below.
 
 i ran into an XML parsing problem with xml.dom.minidom and the error
 message is included below.  The weird thing is if i used an XML validator
 on the web to validate against this particular URL directly, it is all
 good.   Moreover, i saved the page source in Firefox or Chrome then
 validated against the saved XML file, it's also all good.
 
 Since the error happened at the very beginning of the input (line 1,
 column 0) as indicated below, i was wondering if this is an encoding
 mismatch.  However, according to the saved page source in FireFox or
 Chrome, there is the following at the beginning:
<?xml version="1.0" encoding="UTF-8"?>
 
 
 program
 =
 #!/usr/bin/env python
 
 import urllib2
 from xml.dom.minidom import parseString
 
 fd = urllib2.urlopen('http://api.worldbank.org/countries')
 data = fd.read()
 fd.close()
 dom = parseString(data)
 =
 
 
 error msg
 =
 Traceback (most recent call last):
   File "./bugReport.py", line 9, in <module>
     dom = parseString(data)
   File "/usr/lib/python2.7/xml/dom/minidom.py", line 1931, in parseString
     return expatbuilder.parseString(string)
   File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 940, in
   parseString
     return builder.parseString(string)
   File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 223, in
   parseString
     parser.Parse(string, True)
 xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1,
 column 0
 =
 
 
 i'm running Python 2.7.5+ on Ubuntu 13.10.
 
 Thanks.

Looking into the data returned from the server:

>>> data = urllib2.urlopen("http://api.worldbank.org/countries").read()
>>> with open("tmp.dat", "w") as f: f.write(data)
... 
>>> 
[1]+  Stopped  python
$ file tmp.dat
tmp.dat: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)

OK, let's expand:

$ fg
python


>>> import gzip, StringIO
>>> expanded_data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
>>> import xml.dom.minidom
>>> xml.dom.minidom.parseString(expanded_data)
<xml.dom.minidom.Document instance at 0x19a1320>

There may be a way to uncompress the gzipped data transparently, but I'm too 
lazy to look it up...
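
Short of a transparent option, one way to keep the decompression in a single place is to check for the gzip magic bytes before parsing. This is only a sketch in Python 3 syntax: parse_maybe_gzipped() is a made-up helper name, and the payload below is a stand-in built locally, not the server's actual response.

```python
import gzip
from xml.dom.minidom import parseString

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def parse_maybe_gzipped(data):
    """Parse XML bytes, decompressing first if they look gzip-compressed."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    return parseString(data)

# Stand-in for the server response; the real payload is not reproduced here.
payload = gzip.compress(b'<?xml version="1.0" encoding="UTF-8"?><countries/>')
dom = parse_maybe_gzipped(payload)
print(dom.documentElement.tagName)
```

Uncompressed input passes straight through, so the same call keeps working if the server stops compressing its responses.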



Re: XML parsing ExpatError with xml.dom.minidom at line 1, column 0

2014-02-13 Thread MRAB

On 2014-02-13 20:10, Peter Otten wrote:

ming wrote:


Hi,
i've a Python script which stopped working about a month ago.   But until
then, it worked flawlessly for months (if not years).   A tiny
self-contained 7-line script is provided below.

i ran into an XML parsing problem with xml.dom.minidom and the error
message is included below.  The weird thing is if i used an XML validator
on the web to validate against this particular URL directly, it is all
good.   Moreover, i saved the page source in Firefox or Chrome then
validated against the saved XML file, it's also all good.

Since the error happened at the very beginning of the input (line 1,
column 0) as indicated below, i was wondering if this is an encoding
mismatch.  However, according to the saved page source in FireFox or
Chrome, there is the following at the beginning:
   ?xml version=1.0 encoding=UTF-8?


program
=
#!/usr/bin/env python

import urllib2
from xml.dom.minidom import parseString

fd = urllib2.urlopen('http://api.worldbank.org/countries')
data = fd.read()
fd.close()
dom = parseString(data)
=


error msg
=
Traceback (most recent call last):
  File ./bugReport.py, line 9, in module
dom = parseString(data)
  File /usr/lib/python2.7/xml/dom/minidom.py, line 1931, in parseString
return expatbuilder.parseString(string)
  File /usr/lib/python2.7/xml/dom/expatbuilder.py, line 940, in
  parseString
return builder.parseString(string)
  File /usr/lib/python2.7/xml/dom/expatbuilder.py, line 223, in
  parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1,
column 0 =


i'm running Python 2.7.5+ on Ubuntu 13.10.

Thanks.


Looking into the data returned from the server:


data = urllib2.urlopen(http://api.worldbank.org/countries;).read()
with open(tmp.dat, w) as f: f.write(data)

...



[1]+  Angehalten  python
$ file tmp.dat
tmp.dat: gzip compressed data, from FAT filesystem (MS-DOS, OS/2, NT)

OK, let's expand:

$ fg
python



import gzip, StringIO
expanded_data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
import xml.dom.minidom
xml.dom.minidom.parseString(expanded_data)

xml.dom.minidom.Document instance at 0x19a1320

There may be a way to uncompress the gzipped data transparently, but I'm too
lazy to look it up...


From a brief look at the docs, it looks like you can specify the
format. For example, for JSON:

fd = urlopen('http://api.worldbank.org/countries?format=json')



Re: XML parsing: SAX/expat yield

2010-08-04 Thread Peter Otten
kj wrote:

 I want to write code that parses a file that is far bigger than
 the amount of memory I can count on.  Therefore, I want to stay as
 far away as possible from anything that produces a memory-resident
 DOM tree.
 
 The top-level structure of this xml is very simple: it's just a
 very long list of records.  All the complexity of the data is at
 the level of the individual records, but these records are tiny in
 size (relative to the size of the entire file).
 
 So the ideal would be a parser-iterator, which parses just enough
 of the file to yield (in the generator sense) the next record,
 thereby returning control to the caller; the caller can process
 the record, delete it from memory, and return control to the
 parser-iterator; once parser-iterator regains control, it repeats
 this sequence starting where it left off.

How about

http://effbot.org/zone/element-iterparse.htm#incremental-parsing
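
Roughly, the pattern described there can be wrapped in a generator like the one below. It is a sketch, not a tuned implementation: iter_records() is an illustrative name, and the "record" and "name" tags in the sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, record_tag):
    """Yield each <record_tag> element as soon as its end tag is parsed."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)          # first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            root.clear()             # drop finished records from memory

# Hypothetical input; "record" and "name" are illustrative tag names.
SAMPLE = (b"<records><record><name>a</name></record>"
          b"<record><name>b</name></record></records>")
names = [rec.findtext("name")
         for rec in iter_records(io.BytesIO(SAMPLE), "record")]
print(names)
```

The caller must finish with each record before asking for the next one, because the record is cleared as soon as the generator resumes.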

Peter


Re: XML parsing: SAX/expat yield

2010-08-04 Thread kj
In i3c7lc$e6v$0...@news.t-online.com Peter Otten __pete...@web.de writes:

How about

http://effbot.org/zone/element-iterparse.htm#incremental-parsing

Exactly!

Thanks!

~K



Re: XML parsing with python

2009-08-18 Thread Stefan Behnel
inder wrote:
 On Aug 17, 8:31 pm, John Posner jjpos...@optimum.net wrote:
 Use the iterparse() function of the xml.etree.ElementTree package.
 http://effbot.org/zone/element-iterparse.htm
 http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
 Stefan
 iterparse() is too big a hammer for this purpose, IMO. How about this:

   from xml.etree.ElementTree import ElementTree
   tree = ElementTree(None, myfile.xml)
   for elem in tree.findall('//book/title'):
   print elem.text

 -John
 
 Thanks for the prompt reply .
 
 I feel let me try using iterparse. Will it be slower compared to SAX
 parsing ... ultimately I will have a huge xml file to parse ?

If you use the cElementTree module, it may even be faster.


 Another question , I will also need to validate my xml against xsd . I
 would like to do this validation through the parsing tool  itself .

In that case, you can use lxml instead of ElementTree.

http://codespeak.net/lxml/

Stefan


Re: XML parsing with python

2009-08-18 Thread Stefan Behnel
John Posner wrote:
 Use the iterparse() function of the xml.etree.ElementTree package.
 
 iterparse() is too big a hammer for this purpose, IMO. How about this:
 
  from xml.etree.ElementTree import ElementTree
  tree = ElementTree(None, "myfile.xml")
  for elem in tree.findall('.//book/title'):
      print(elem.text)

Is that really so much better than an iterparse() version?

  from xml.etree import ElementTree

  for _, elem in ElementTree.iterparse("myfile.xml"):
      if elem.tag == 'book':
          print(elem.findtext('title'))
          elem.clear()

Stefan


Re: XML parsing with python

2009-08-18 Thread inder
On Aug 18, 11:24 am, Stefan Behnel stefan...@behnel.de wrote:
 inder wrote:
  On Aug 17, 8:31 pm, John Posner jjpos...@optimum.net wrote:
  Use the iterparse() function of the xml.etree.ElementTree package.
 http://effbot.org/zone/element-iterparse.htm
 http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk
  Stefan
  iterparse() is too big a hammer for this purpose, IMO. How about this:

    from xml.etree.ElementTree import ElementTree
    tree = ElementTree(None, myfile.xml)
    for elem in tree.findall('//book/title'):
        print elem.text

  -John

  Thanks for the prompt reply .

  I feel let me try using iterparse. Will it be slower compared to SAX
  parsing ... ultimately I will have a huge xml file to parse ?

 If you use the cElementTree module, it may even be faster.

  Another question , I will also need to validate my xml against xsd . I
  would like to do this validation through the parsing tool  itself .

 In that case, you can use lxml instead of ElementTree.

 http://codespeak.net/lxml/

 Stefan

Hi ,

Is lxml part of standard python package ? I am having python 2.5 .

I might not be able to use any additional package other than the
standard python . Could you please suggest something part of standard
python package ?

Thanks


Re: XML parsing with python

2009-08-18 Thread Stefan Behnel
inder wrote:
 Is lxml part of standard python package ? I am having python 2.5 .

No, that's why I suggested ElementTree first.


 I might not be able to use any additional package other than the
 standard python . Could you please suggest something part of standard
 python package ?

No, there isn't any XMLSchema support in the stdlib.

However, you may still be able to use lxml locally for development and with
validation enabled, and switch to non-validating ElementTree on
distribution/pre-prod-testing/whatever. Just use a conditional import and
write a bit of setup code.
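
A conditional import along those lines might look like this. Treat it as a sketch: HAVE_LXML and parse_document() are made-up names, and the sample document is hypothetical.

```python
try:
    # lxml is a third-party package and may not be installed.
    from lxml import etree
    HAVE_LXML = True
except ImportError:
    # Fall back to the standard library, which cannot validate.
    import xml.etree.ElementTree as etree
    HAVE_LXML = False

def parse_document(xml_bytes, xsd_bytes=None):
    """Parse XML; validate against an XSD only when lxml is available."""
    root = etree.fromstring(xml_bytes)
    if HAVE_LXML and xsd_bytes is not None:
        schema = etree.XMLSchema(etree.fromstring(xsd_bytes))
        schema.assertValid(root)     # raises DocumentInvalid on failure
    return root

root = parse_document(b"<library><book id='ISBN001'/></library>")
print(root.tag)
```

The rest of the code can then stick to the API subset that both libraries share.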

Stefan


Re: XML parsing with python

2009-08-17 Thread Stefan Behnel
inder wrote:
 I am new to xml . I need to parse the xml file . After reading and
 browsing on the web , I could get much help .
 
 I guess SAX would be better suited for my requirement .

That's a common misconception.


 Could some juct provide me a sample python code so that I can execute
 it and see how the parsing actually happens .
 
 Lets say my xml file -
 <?xml version="1.0"?>
 <library>
 <category code="SciFi" room="1"> <!--if you want to test invalid
 document against schema you can just cut the mandatory id attribute -->
 <book id="ISBN001">
 <title>I,Robot</title>
 <pages>100</pages>
 <author>Isaac Asimov</author>
 </book>
 <book id="ISBN001" damaged="true">
 <title>Blade Runner</title>
 <pages>400</pages>
 <author>Philip K. Dick</author>
 </book>
 </category>
 <category code="Boring" room="2">
 <book id="ISBN003">
 <title>Lord Of The Rings</title>
 <pages>2</pages>
 <author>Tolkien</author>
 </book>
 <book id="ISBN004" damaged="true">
 <title>XML-Schema Specification</title>
 <pages>5000</pages>
 <author>W3C</author>
 </book>
 </category>
 <category code="Fantasy">
 <book id="ISBN005" damaged="true">
 <title>Aladin</title>
 <pages>150</pages>
 <author>Don't know</author>
 </book>
 </category>
 </library>
 
 
 --
 
 I need the output to be - (elements containing 'title' )
 
 I,Robot
 Blade Runner
 Lord Of The Rings
 XML-Schema Specification
 Aladin

Use the iterparse() function of the xml.etree.ElementTree package.

http://effbot.org/zone/element-iterparse.htm
http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk

Stefan


Re: XML parsing with python

2009-08-17 Thread John Posner



Use the iterparse() function of the xml.etree.ElementTree package.

http://effbot.org/zone/element-iterparse.htm
http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk

Stefan
  


iterparse() is too big a hammer for this purpose, IMO. How about this:

 from xml.etree.ElementTree import ElementTree
 tree = ElementTree(None, "myfile.xml")
 for elem in tree.findall('.//book/title'):
     print(elem.text)


-John



Re: XML parsing with python

2009-08-17 Thread inder
On Aug 17, 8:31 pm, John Posner jjpos...@optimum.net wrote:
  Use the iterparse() function of the xml.etree.ElementTree package.

 http://effbot.org/zone/element-iterparse.htm
 http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk

  Stefan

 iterparse() is too big a hammer for this purpose, IMO. How about this:

   from xml.etree.ElementTree import ElementTree
   tree = ElementTree(None, myfile.xml)
   for elem in tree.findall('//book/title'):
       print elem.text

 -John

Thanks for the prompt reply .

I feel let me try using iterparse. Will it be slower compared to SAX
parsing ... ultimately I will have a huge xml file to parse ?


Another question , I will also need to validate my xml against xsd . I
would like to do this validation through the parsing tool  itself .

Does there exist an Unix utility which validates or even a python
library call would be fine  ?

Thanks in advance


RE: XML Parsing

2009-02-26 Thread Paul McGuire
You flatter me sir (or madam? can't tell from your name...), but I wouldn't
presume to so lofty a title among this crowd.  I'd save that for the likes
of Alan Gauld and Kent Johnson, who are much more prolific  and informative
contributors to this list than I.

-- Paul


-Original Message-
From: hrishy [mailto:hris...@yahoo.co.uk] 
Sent: Wednesday, February 25, 2009 11:36 PM
To: python-list@python.org; Paul McGuire
Subject: Re: XML Parsing

Ha the guru himself responding :-)


--- On Wed, 25/2/09, Paul McGuire pt...@austin.rr.com wrote:

 From: Paul McGuire pt...@austin.rr.com
 Subject: Re: XML Parsing
 To: python-list@python.org
 Date: Wednesday, 25 February, 2009, 2:04 PM On Feb 25, 1:17 am, hrishy 
 hris...@yahoo.co.uk
 wrote:
  Hi
 
  Something like this
 
 snip solution using ElementTree
 
  Note i am not a python programmer just a enthusiast
 and i was curious why people on the list didnt suggest a code like 
 above
 
 
 You just beat the rest of us to it - good example of ElementTree for 
 parsing XML (and I Iearned the '//' shortcut for one or more 
 intervening tag levels).
 
 To the OP: if you are parsing XML, I would look hard at the modules 
 (esp. ElementTree) that are written explicitly for XML, before 
 considering using regular expressions.  There are just too many 
 potential surprises when trying to match XML tags - presence/absence/ 
 order of attributes, namespaces, whitespace inside tags, to name a 
 few.
 
 -- Paul
 --
 http://mail.python.org/mailman/listinfo/python-list


  



Re: XML Parsing

2009-02-25 Thread hrishy
Hi Lie

I am not a Python guy, but I am very interested in the language, and I consider
the people on this list to be intelligent. I was wondering why you did not
suggest XPath for this kind of problem; I am just curious and willing to learn.

I am searching for an answer, but the question is:
why not use XPath to extract XML text from an XML doc?

regards
Hrishy


--- On Wed, 25/2/09, Lie Ryan lie.1...@gmail.com wrote:

 From: Lie Ryan lie.1...@gmail.com
 Subject: Re: XML Parsing
 To: python-list@python.org
 Date: Wednesday, 25 February, 2009, 7:33 AM
 Are you searching for answer or searching for another people
 that have 
 the same answer as you? :)
 
 Many roads lead to Rome is a very famous
 quotation...
 
 --
 http://mail.python.org/mailman/listinfo/python-list


  


Re: XML Parsing

2009-02-25 Thread J. Clifford Dyer
Probably because you responded an hour after the question was posted,
and in the dead of night.  Newsgroups often move slower than that.  But
now we have posted a solution like that, so all's well in the world.  :)

Cheers,
Cliff


On Wed, 2009-02-25 at 08:20 +, hrishy wrote:
 Hi Lie
 
 I am not a python guy but very interested in the langauge and i consider the 
 people on this list to be intelligent and was wundering why you people did 
 not suggest xpath for this kind of a problem just curious and willing to 
 learn.
 
 I am searching for a answer but the question is 
 why not use xpath to extract xml text from a xml doc ?
 
 regards
 Hrishy
 
 
 --- On Wed, 25/2/09, Lie Ryan lie.1...@gmail.com wrote:
 
  From: Lie Ryan lie.1...@gmail.com
  Subject: Re: XML Parsing
  To: python-list@python.org
  Date: Wednesday, 25 February, 2009, 7:33 AM
  Are you searching for answer or searching for another people
  that have 
  the same answer as you? :)
  
  Many roads lead to Rome is a very famous
  quotation...
  
  --
  http://mail.python.org/mailman/listinfo/python-list
 
 
   
 --
 http://mail.python.org/mailman/listinfo/python-list
 



Re: XML Parsing

2009-02-25 Thread Paul McGuire
On Feb 25, 1:17 am, hrishy hris...@yahoo.co.uk wrote:
 Hi

 Something like this

snip solution using ElementTree

 Note i am not a python programmer just a enthusiast and i was curious why 
 people on the list didnt suggest a code like above


You just beat the rest of us to it - good example of ElementTree for
parsing XML (and I learned the '//' shortcut for one or more
intervening tag levels).

To the OP: if you are parsing XML, I would look hard at the modules
(esp. ElementTree) that are written explicitly for XML, before
considering using regular expressions.  There are just too many
potential surprises when trying to match XML tags - presence/absence/
order of attributes, namespaces, whitespace inside tags, to name a
few.
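
A small illustration of that kind of surprise, as a sketch on a hypothetical snippet in which the same element is spelled in two equally valid ways:

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical snippet: the same element written in two equally valid ways.
DOC = '<pids><Signal>one</Signal><Signal unit="x" >two</Signal ></pids>'

# A naive pattern only matches the exact spelling it was written for...
by_regex = re.findall(r"<Signal>(.*?)</Signal>", DOC)

# ...whereas an XML parser sees both elements regardless of attributes
# or whitespace inside the tags.
by_parser = [e.text for e in ET.fromstring(DOC).iter("Signal")]

print(by_regex, by_parser)
```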

-- Paul


Re: XML Parsing

2009-02-25 Thread hrishy
Ha the guru himself responding :-)


--- On Wed, 25/2/09, Paul McGuire pt...@austin.rr.com wrote:

 From: Paul McGuire pt...@austin.rr.com
 Subject: Re: XML Parsing
 To: python-list@python.org
 Date: Wednesday, 25 February, 2009, 2:04 PM
 On Feb 25, 1:17 am, hrishy hris...@yahoo.co.uk
 wrote:
  Hi
 
  Something like this
 
 snip solution using ElementTree
 
  Note i am not a python programmer just a enthusiast
 and i was curious why people on the list didnt suggest a
 code like above
 
 
 You just beat the rest of us to it - good example of
 ElementTree for
 parsing XML (and I Iearned the '//' shortcut for
 one or more
 intervening tag levels).
 
 To the OP: if you are parsing XML, I would look hard at the
 modules
 (esp. ElementTree) that are written explicitly for XML,
 before
 considering using regular expressions.  There are just too
 many
 potential surprises when trying to match XML tags -
 presence/absence/
 order of attributes, namespaces, whitespace inside tags, to
 name a
 few.
 
 -- Paul
 --
 http://mail.python.org/mailman/listinfo/python-list


  


Re: XML Parsing

2009-02-25 Thread hrishy
Hi Cliff

Thanks, so using ElementTree is the right way to handle this problem.

regards
Hrishy


--- On Wed, 25/2/09, J. Clifford Dyer j...@sdf.lonestar.org wrote:

 From: J. Clifford Dyer j...@sdf.lonestar.org
 Subject: Re: XML Parsing
 To: hris...@yahoo.co.uk
 Cc: python-list@python.org, Lie Ryan lie.1...@gmail.com
 Date: Wednesday, 25 February, 2009, 12:37 PM
 Probably because you responded an hour after the question
 was posted,
 and in the dead of night.  Newsgroups often move slower
 than that.  But
 now we have posted a solution like that, so all's well
 in the world.  :)
 
 Cheers,
 Cliff
 
 
 On Wed, 2009-02-25 at 08:20 +, hrishy wrote:
  Hi Lie
  
  I am not a python guy but very interested in the
 langauge and i consider the people on this list to be
 intelligent and was wundering why you people did not suggest
 xpath for this kind of a problem just curious and willing to
 learn.
  
  I am searching for a answer but the question is 
  why not use xpath to extract xml text from a xml doc ?
  
  regards
  Hrishy
  
  
  --- On Wed, 25/2/09, Lie Ryan
 lie.1...@gmail.com wrote:
  
   From: Lie Ryan lie.1...@gmail.com
   Subject: Re: XML Parsing
   To: python-list@python.org
   Date: Wednesday, 25 February, 2009, 7:33 AM
   Are you searching for answer or searching for
 another people
   that have 
   the same answer as you? :)
   
   Many roads lead to Rome is a very
 famous
   quotation...
   
   --
  
 http://mail.python.org/mailman/listinfo/python-list
  
  

  --
  http://mail.python.org/mailman/listinfo/python-list
 


  


Re: XML Parsing

2009-02-24 Thread alex23
On Feb 25, 2:50 pm, Girish girish@gmail.com wrote:
 Can anyone please tell me how to get content of Signal tag.. that
 is, how to extract the data ![CDATA[Parameter Identifiers Supported -
 $01 to $20]]

Was there something in particular about Jean-Paul Calderone's solution
that didn't satisfy you? http://tinyurl.com/azgo5j




Re: XML Parsing

2009-02-24 Thread Lie Ryan
On Tue, 24 Feb 2009 20:50:20 -0800, Girish wrote:

 Hello,
 
 I have a xml file which is as follows:
 
 <pids>
 <Parameter_Class>
 <Parameter Id="pid_031605_093137_283">
 <Identifier>$</Identifier>
 <Type>PID</Type>
 <Signal><![CDATA[Parameter Identifiers Supported - $01
 to $20]]></Signal>
 <Description><![CDATA[This PID indicates which
 legislated PIDs]]></Description>
  ..
  ...
 
 Can anyone please tell me how to get content of <Signal> tag.. that is,
 how to extract the data <![CDATA[Parameter Identifiers Supported - $01
 to $20]]>
 
 Thanks,
 Girish...

The easy one is to use the re module (regular expressions).

# untested
import re
signal_pattern = re.compile('<Signal>(.*)</Signal>')
signals = signal_pattern.findall(xmlstring)

Also, you may use the xml module, which will be more reliable if you
have data like this: <foo attr="<Signal>blooo</Signal>">blah</foo>,

>>> import xml.dom.minidom
>>> xmldata = xml.dom.minidom.parse(open('myfile.xml'))
>>> for node in xmldata.getElementsByTagName('Signal'): print node.toxml()
... 
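
If only the text inside the CDATA section is wanted, rather than the serialized node, ElementTree returns it directly. A sketch in Python 3 syntax, using a document reconstructed from the fragment quoted above; the closing tags are assumed, since the original post was truncated.

```python
import xml.etree.ElementTree as ET

# Reconstructed from the fragment quoted above; the closing tags are assumed.
DOC = """\
<pids>
  <Parameter_Class>
    <Parameter Id="pid_031605_093137_283">
      <Type>PID</Type>
      <Signal><![CDATA[Parameter Identifiers Supported - $01 to $20]]></Signal>
    </Parameter>
  </Parameter_Class>
</pids>
"""

root = ET.fromstring(DOC)
# The parser unwraps CDATA sections, so .text is the plain character data.
signals = [sig.text for sig in root.iter("Signal")]
print(signals)
```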



Re: XML Parsing

2009-02-24 Thread hrishy
Hi 

I am just a Python enthusiast and not a Python user, but I was wondering why
the list members didn't come up with or recommend an XPath-based solution,
which I think is very elegant for this type of problem, isn't it?

regards
Hrishy



--- On Wed, 25/2/09, Lie Ryan lie.1...@gmail.com wrote:

 From: Lie Ryan lie.1...@gmail.com
 Subject: Re: XML Parsing
 To: python-list@python.org
 Date: Wednesday, 25 February, 2009, 5:43 AM
 On Tue, 24 Feb 2009 20:50:20 -0800, Girish wrote:
 
  Hello,
  
  I have a xml file which is as follows:
  
  pids
  Parameter_Class
  Parameter
 Id=pid_031605_093137_283
 
 Identifier$/Identifier
  TypePID/Type
  Signal![CDATA[Parameter
 Identifiers Supported - $01
  to $20]]/Signal
  Description![CDATA[This
 PID indicates which
  legislated PIDs]]/Description
   ..
   ...
  
  Can anyone please tell me how to get content of
 Signal tag.. that is,
  how to extract the data ![CDATA[Parameter
 Identifiers Supported - $01
  to $20]]
  
  Thanks,
  Girish...
 
 The easy one is to use re module (Regular expression). 
 
 # untested
 import re
 signal_pattern =
 re.compile('Signal(.*)/Signal')
 signals = signal_pattern.findall(xmlstring)
 
 also, you may also use the xml module, which will be more
 reliable if you 
 have data like this: foo
 attr=Signalblooo/Signalblah/foo,
 
  import xml.dom.minidom
  xmldata =
 xml.dom.minidom.parse(open('myfile.xml'))
  for node in
 xmldata.getElementsByTagName('Signal'): print
 node.toxml()
 ... 
 
 --
 http://mail.python.org/mailman/listinfo/python-list


  


Re: XML Parsing

2009-02-24 Thread Lie Ryan
On Wed, 2009-02-25 at 06:09 +, hrishy wrote:
 Hi 
 
 I am just a python enthusiast and not a python user but was just wundering 
 why didnt the list members come up with or recommen XPATH based solution
 which i think is very elegant for this type of a problem isnt it ?

Did you mean XQuery?

Depending on the need, XQuery might be an overkill. And don't forget
that XQuery is still an obscure, unknown language for most people (the
de facto standard for querying is still SQL).



Re: XML Parsing

2009-02-24 Thread hrishy
Hi 

Something like this

<pids>
  <Parameter_Class>
    <Parameter Id="pid_031605_093137_283">
      <Identifier>$</Identifier>
      <Type>PID</Type>
      <Signal><![CDATA[Parameter Identifiers Supported - $01 to $20]]></Signal>
      <Description><![CDATA[This PID indicates which legislated PIDs]]></Description>

from xml.etree.ElementTree import ElementTree
doc = ElementTree(file='tst.xml')
# './/Signal' matches Signal elements at any depth below the root
for e in doc.findall('.//Signal'):
    print(e.text)

Note i am not a python programmer just a enthusiast and i was curious why 
people on the list didnt suggest a code like above

willing to hear and learn from experienced python gurus

regards
Hrishy


  


Re: XML Parsing

2009-02-24 Thread Lie Ryan
Are you searching for an answer, or searching for other people who have
the same answer as you? :)

"Many roads lead to Rome" is a very famous quotation...



Re: XML Parsing

2008-04-01 Thread Konstantin Veretennicov
On Tue, Apr 1, 2008 at 10:42 PM, Alok Kothari [EMAIL PROTECTED]
wrote:

 Hello,
  I am new to XML parsing.Could you kindly tell me whats the
 problem with the following code:

 import xml.dom.minidom
 import xml.parsers.expat
 document = """<token pos="nn">Letterman</token><token pos="bez">is</
 token><token pos="jjr">better</token><token pos="cs">than</
 token><token pos="np">Jay</token><token pos="np">Leno</token>"""


This document is not well-formed. It doesn't have a root element.

...



 Traceback (most recent call last):
  File "C:/Python25/Programs/eg.py", line 20, in <module>
    p.Parse(document, 1)
 ExpatError: junk after document element: line 1, column 33


Told ya :)


Try wrapping your document in a root element, like
<tokens><token>...</token><token>...</token></tokens>
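
In code, the wrapping could be sketched as follows (Python 3 syntax). The fragment is a shortened version of the one in the original post, and "tokens" is simply the root name suggested above.

```python
import xml.dom.minidom

# A shortened version of the fragment from the original post.
fragment = ('<token pos="nn">Letterman</token><token pos="bez">is</token>'
            '<token pos="jjr">better</token>')

# A well-formed document needs exactly one root element, so wrap the
# sibling elements in one ("tokens" is just the name suggested above).
dom = xml.dom.minidom.parseString("<tokens>" + fragment + "</tokens>")
words = [t.firstChild.data for t in dom.getElementsByTagName("token")]
print(words)
```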

--
kv

Re: XML Parsing

2008-04-01 Thread Alok Kothari
Thanks ! it worked !

On Wed, Apr 2, 2008 at 1:31 AM, Konstantin Veretennicov 
[EMAIL PROTECTED] wrote:

 On Tue, Apr 1, 2008 at 10:42 PM, Alok Kothari [EMAIL PROTECTED]
 wrote:

  Hello,
   I am new to XML parsing.Could you kindly tell me whats the
  problem with the following code:
 
  import xml.dom.minidom
  import xml.parsers.expat
  document = token pos=nnLetterman/tokentoken pos=bezis/
  tokentoken pos=jjrbetter/tokentoken pos=csthan/
  tokentoken pos=npJay/tokentoken pos=npLeno/token
 

 This document is not well-formed. It doesn't have root element.

 ...


 
  Traceback (most recent call last):
   File C:/Python25/Programs/eg.py, line 20, in module
 p.Parse(document, 1)
  ExpatError: junk after document element: line 1, column 33
 

 Told ya :)


 Try wrapping your document in root element, like
 tokenstoken.../tokentoken.../token/tokens

 --
 kv


Re: XML Parsing

2008-04-01 Thread Jason Scheirer
On Apr 1, 12:42 pm, Alok Kothari [EMAIL PROTECTED] wrote:
 Hello,
   I am new to XML parsing.Could you kindly tell me whats the
 problem with the following code:

 import xml.dom.minidom
 import xml.parsers.expat
 document = """<token pos="nn">Letterman</token><token pos="bez">is</
 token><token pos="jjr">better</token><token pos="cs">than</
 token><token pos="np">Jay</token><token pos="np">Leno</token>"""

 # 3 handler functions
 def start_element(name, attrs):
 print 'Start element:', name, attrs
 def end_element(name):
 print 'End element:', name
 def char_data(data):
 print 'Character data:', repr(data)

 p = xml.parsers.expat.ParserCreate()

 p.StartElementHandler = start_element
 p.EndElementHandler = end_element
 p.CharacterDataHandler = char_data
 p.Parse(document, 1)

 OUTPUT:

 Start element: token {u'pos': u'nn'}
 Character data: u'Letterman'
 End element: token

 Traceback (most recent call last):
   File "C:/Python25/Programs/eg.py", line 20, in <module>
     p.Parse(document, 1)
 ExpatError: junk after document element: line 1, column 33

Your XML is wrong. Don't put line breaks between / and token.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML Parsing

2008-04-01 Thread 7stud
On Apr 1, 1:42 pm, Alok Kothari [EMAIL PROTECTED] wrote:
 Hello,
           I am new to XML parsing.Could you kindly tell me whats the
 problem with the following code:

 import xml.dom.minidom
 import xml.parsers.expat
 document = """<token pos="nn">Letterman</token><token pos="bez">is</
 token><token pos="jjr">better</token><token pos="cs">than</
 token><token pos="np">Jay</token><token pos="np">Leno</token>"""

 # 3 handler functions
 def start_element(name, attrs):
     print 'Start element:', name, attrs
 def end_element(name):
     print 'End element:', name
 def char_data(data):
     print 'Character data:', repr(data)

 p = xml.parsers.expat.ParserCreate()

 p.StartElementHandler = start_element
 p.EndElementHandler = end_element
 p.CharacterDataHandler = char_data
 p.Parse(document, 1)

 OUTPUT:

 Start element: token {u'pos': u'nn'}
 Character data: u'Letterman'
 End element: token

 Traceback (most recent call last):
   File "C:/Python25/Programs/eg.py", line 20, in <module>
     p.Parse(document, 1)
 ExpatError: junk after document element: line 1, column 33


I don't know if you are aware of the BeautifulSoup module:


import BeautifulSoup as bs

xml = """<token pos="nn">Letterman</token><token pos="bez">is</
token><token pos="jjr">better</token><token pos="cs">than</
token><token pos="np">Jay</token><token pos="np">Leno</token>"""

doc = bs.BeautifulStoneSoup(xml)

tokens = doc.findAll("token")
for token in tokens:
    for attr in token.attrs:
        print "%s : %s" % attr

    print token.string

--output:--
pos : nn
Letterman
pos : bez
is
pos : jjr
better
pos : cs
than
pos : np
Jay
pos : np
Leno
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML Parsing

2008-04-01 Thread Gabriel Genellina
On Tue, 01 Apr 2008 20:44:41 -0300, 7stud [EMAIL PROTECTED] wrote:

           I am new to XML parsing.Could you kindly tell me whats the
 problem with the following code:

 import xml.dom.minidom
 import xml.parsers.expat

 I don't know if you are aware of the BeautifulSoup module:

Or ElementTree:

import xml.etree.ElementTree as ET

doctext = """<tokens><token pos="nn">Letterman</token><token
pos="bez">is</token><token pos="jjr">better</token><token
pos="cs">than</token><token pos="np">Jay</token><token
pos="np">Leno</token></tokens>"""

doc = ET.fromstring(doctext)
for token in doc.findall("token"):
    print 'pos:', token.get('pos')
    print 'text:', token.text

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML Parsing

2007-03-28 Thread Amit Khemka
On 28 Mar 2007 00:38:38 -0700, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote:
 I want to parse this XML file:

 <?xml version="1.0" ?>

 <text>

 <text:one>
 <file>filename</file>
 <contents>
 Hello
 </contents>
 </text:one>

 <text:two>
 <file>filename2</file>
 <contents>
 Hello2
 </contents>
 </text:two>

 </text>

 This XML will be in a file called filecreate.xml

 As you might have guessed, I want to create files from this XML file
 contents, so how can I do this?
 What modules should I use? What options do I have? Where can I find
 tutorials? Will I be able to put
 this on the internet (on a googlepages server)?

http://effbot.org/zone/celementtree.htm


HTH,
-- 

Amit Khemka -- onyomo.com
Home Page: www.cse.iitd.ernet.in/~csd00377
Endless the world's turn, endless the sun's Spinning, Endless the quest;
I turn again, back to my own beginning, And here, find rest.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML Parsing

2007-03-28 Thread Diez B. Roggisch
[EMAIL PROTECTED] wrote:

 I want to parse this XML file:
 
 <?xml version="1.0" ?>

 <text>

 <text:one>
 <file>filename</file>
 <contents>
 Hello
 </contents>
 </text:one>

 <text:two>
 <file>filename2</file>
 <contents>
 Hello2
 </contents>
 </text:two>

 </text>
 
 This XML will be in a file called filecreate.xml
 
 As you might have guessed, I want to create files from this XML file
 contents, so how can I do this?
 What modules should I use? What options do I have? Where can I find
 tutorials? Will I be able to put
 this on the internet (on a googlepages server)?
 
 Thanks in advance to everyone who helps me.
 And yes I have used Google but I am unsure what to use.

The above file is not valid XML. It is missing an xmlns:text namespace
declaration. So you won't be able to parse it regardless of what parser you
use.

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML Parsing

2007-03-28 Thread harvey . thomas
On Mar 28, 10:51 am, Diez B. Roggisch [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
  I want to parse this XML file:

  <?xml version="1.0" ?>

  <text>

  <text:one>
  <file>filename</file>
  <contents>
  Hello
  </contents>
  </text:one>

  <text:two>
  <file>filename2</file>
  <contents>
  Hello2
  </contents>
  </text:two>

  </text>

  This XML will be in a file called filecreate.xml

  As you might have guessed, I want to create files from this XML file
  contents, so how can I do this?
  What modules should I use? What options do I have? Where can I find
  tutorials? Will I be able to put
  this on the internet (on a googlepages server)?

  Thanks in advance to everyone who helps me.
  And yes I have used Google but I am unsure what to use.

 The above file is not valid XML. It misses a xmlns:text namespace
 declaration. So you won't be able to parse it regardless of what parser you
 use.

 Diez

The example is valid well-formed XML. It is permitted to use the :
character in element names. Whether one should in a non namespace
context is a different matter.
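
Both positions in this exchange can be observed with the standard library, at least on a current Python 3. The sketch below uses a shortened version of the posted document. A bare expat parser (no namespace processing) accepts the colon as part of the element name, while `xml.etree.ElementTree`, which turns namespace processing on, is expected to reject the undeclared `text:` prefix. The helper name `element_names` is made up for the example.

```python
import xml.parsers.expat
import xml.etree.ElementTree as ET

doc = "<text><text:one><file>filename</file></text:one></text>"

def element_names(xml_text):
    """Collect element names with a plain expat parser (namespaces off)."""
    names = []
    parser = xml.parsers.expat.ParserCreate()  # no namespace_separator given
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(xml_text, True)
    return names

print(element_names(doc))

# ElementTree is namespace-aware, so the unbound prefix is an error there.
try:
    ET.fromstring(doc)
    et_result = "parsed"
except ET.ParseError as exc:
    et_result = "ParseError: %s" % exc
print(et_result)
```

So whether the file "parses" depends on whether the tool at hand is namespace-aware, which may be why the two replies disagree.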

Harvey

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML Parsing

2007-03-28 Thread Laurent Pointal
[EMAIL PROTECTED] wrote:
 I want to parse this XML file:
<snip>
 As you might have guessed, I want to create files from this XML file
 contents, so how can I do this?
 What modules should I use? What options do I have? Where can I find
 tutorials? Will I be able to put
 this on the internet (on a googlepages server)?

See urllib2 module and its missing guide.

 Thanks in advance to everyone who helps me.
 And yes I have used Google but I am unsure what to use.
 

About XML: to complement Amit's link to ElementTree, you may take a look at:

http://www.diveintopython.org/xml_processing/index.html
(learn by example)

And look at:
http://pyxml.sourceforge.net/
http://www.rexx.com/~dkuhlman/pyxmlfaq.html
http://shellsage.com/?q=node/12
http://www.python.org/community/sigs/current/xml-sig/
http://docs.python.org/lib/markup.html
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML Parsing

2007-03-28 Thread Bruno Desthuilliers
[EMAIL PROTECTED] wrote:
 I want to parse this XML file:
 
 <?xml version="1.0" ?>

 <text>

 <text:one>
 <file>filename</file>
 <contents>
 Hello
 </contents>
 </text:one>

 <text:two>
 <file>filename2</file>
 <contents>
 Hello2
 </contents>
 </text:two>

 </text>
 
 This XML will be in a file called filecreate.xml
 
 As you might have guessed, I want to create files from this XML file
 contents, so how can I do this?

Using a SAX parser might be the best solution here.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML Parsing

2007-03-28 Thread Diez B. Roggisch
 
 The example is valid well-formed XML. It is permitted to use the :
 character in element names. Whether one should in a non namespace
 context is a different matter.

It is? I was always under the impression that one has to declare a namespace.
But that impression may have been shaped by using XSLT and W3C Schema, which
require these.

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML Parsing

2007-03-28 Thread Christian
[EMAIL PROTECTED] wrote:
 I want to parse this XML file:
 
 <?xml version="1.0" ?>

 <text>

 <text:one>
 <file>filename</file>
 <contents>
 Hello
 </contents>
 </text:one>

 <text:two>
 <file>filename2</file>
 <contents>
 Hello2
 </contents>
 </text:two>

 </text>
 
 This XML will be in a file called filecreate.xml
 
 As you might have guessed, I want to create files from this XML file
 contents, so how can I do this?
 What modules should I use? What options do I have? Where can I find
 tutorials? Will I be able to put
 this on the internet (on a googlepages server)?
 
 Thanks in advance to everyone who helps me.
 And yes I have used Google but I am unsure what to use.
 

Try this:

http://www.python.org/doc/2.4.1/lib/expat-example.html


Christian

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML Parsing

2007-03-28 Thread Urban, Gabor
Hi,

I would suggest using the minidom XML parser from the xml module. Your
XML schema does not seem too complicated. You will find detailed
descriptions and working code in the book Dive Into Python. Google for
it :-))
 

Gabor Urban
NMC - ART 

 
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: XML parsing and writing

2006-08-29 Thread Stefan Behnel
c00i90wn wrote:
 Stefan Behnel wrote:
 c00i90wn wrote:
 Hey, I'm having a problem with the xml.dom.minidom package, I want to
 generate a simple xml for storing configuration variables, for that
 purpose I've written the following code, but before pasting it I'll
 tell you what my problem is. On first write of the xml everything goes
 as it should but on subsequent writes it starts to add more and more
 unneeded newlines to it making it hard to read and ugly.
 Maybe you should try to get your code a little cleaner first, that usually
 helps in finding these kinds of bugs. Try rewriting it with ElementTree or
 lxml, that usually helps you in getting your work done.

 http://effbot.org/zone/element-index.htm
 http://codespeak.net/lxml/

 Nice package ElementTree is but sadly it doesn't have a pretty print,
 well, guess I'll have to do it myself, if you have one already can you
 please give it to me? thanks :)

lxml's output functions all accept a pretty_print keyword argument.

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing and writing

2006-08-29 Thread Fredrik Lundh
someone wrote:

 Nice package ElementTree is but sadly it doesn't have a pretty print,
 well, guess I'll have to do it myself, if you have one already can you
 please give it to me? thanks :)

http://effbot.python-hosting.com/file/stuff/sandbox/elementlib/indent.py
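
For readers who cannot reach that link, an in-place indenter for ElementTree can be sketched as below. This is an independent illustration written for current Python 3, not a copy of the linked file; the function name, the `step` parameter and the sample document are assumptions made for the example. (Python 3.9 and later also ship `xml.etree.ElementTree.indent`, which does the same job.)

```python
import xml.etree.ElementTree as ET

def indent(elem, level=0, step="  "):
    """Add whitespace to elem in place so it serializes one element per line."""
    pad = "\n" + level * step
    if len(elem):
        # Whitespace before the first child.
        if not elem.text or not elem.text.strip():
            elem.text = pad + step
        for child in elem:
            indent(child, level + 1, step)
            # Whitespace after each child, lined up with its siblings.
            if not child.tail or not child.tail.strip():
                child.tail = pad + step
        # The last child is followed by the parent's closing tag.
        last = elem[-1]
        if not last.tail or not last.tail.strip():
            last.tail = pad

root = ET.fromstring(
    "<config><window><width>800</width><height>600</height></window>"
    "<user>guest</user></config>"
)
indent(root)
pretty = ET.tostring(root, encoding="unicode")
print(pretty)
```

Because only whitespace-only `text` and `tail` values are overwritten, running `indent` again on an already indented tree should leave the output unchanged.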

/F 



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing and writing

2006-08-28 Thread uche . ogbuji
c00i90wn wrote:
 Nice package ElementTree is but sadly it doesn't have a pretty print,
 well, guess I'll have to do it myself, if you have one already can you
 please give it to me? thanks :)

FWIW Amara and plain old 4Suite both support pretty-print, canonical
XML print and more such options.

http://uche.ogbuji.net/tech/4suite/amara/
http://4Suite.org

--
Uche Ogbuji               Fourthought, Inc.
http://uche.ogbuji.net    http://fourthought.com
http://copia.ogbuji.net   http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing and writing

2006-08-01 Thread Jim

c00i90wn wrote:
On first write of the xml everything goes
 as it should but on subsequent writes it starts to add more and more
 unneeded newlines to it making it hard to read and ugly.
Pretty-printing makes the output pretty by putting in newlines (and spaces)
that are not in the original data.  That is, if you have the text "John Smith"
associated with the element <name>, then pretty-printing gives you something
like

  <name>
    John Smith
  </name>

here with an extra two newlines and some whitespace indentation.  (I
don't recall 100% when it puts in stuff, but the point of pretty-printing is
to put in extra stuff.)  When you read that file back in and write it pretty
again, the added whitespace counts as data and more gets added on top.  You
need to strip out the extra stuff (or print it out not pretty; can you get a
viewer that buffs up a non-buffed file so you are seeing pretty but the data
isn't actually pretty?).
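
One way to "strip out the extra stuff" with minidom is sketched below, assuming a current Python 3 and a config file in which whitespace-only text is never meaningful. The helper names `strip_whitespace_nodes` and `pretty` and the sample document are made up for the example.

```python
from xml.dom import minidom

def strip_whitespace_nodes(node):
    """Remove text nodes that contain only whitespace, recursively."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_whitespace_nodes(child)

def pretty(xml_text):
    """Parse, drop whitespace left by an earlier pretty-print, then pretty-print."""
    dom = minidom.parseString(xml_text)
    strip_whitespace_nodes(dom)
    return dom.toprettyxml(indent="  ")

first = pretty("<config><name>John Smith</name></config>")
second = pretty(first)  # read the pretty output back in and write it again

# The same round trip without stripping, for comparison.
naive = minidom.parseString(first).toprettyxml(indent="  ")

print(first == second)
print(len(naive) > len(first))
```

With the stripping step the second write should match the first; without it, each round trip picks up another layer of blank lines.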

Jim

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing and writing

2006-07-31 Thread Stefan Behnel
c00i90wn wrote:
 Hey, I'm having a problem with the xml.dom.minidom package, I want to
 generate a simple xml for storing configuration variables, for that
 purpose I've written the following code, but before pasting it I'll
 tell you what my problem is. On first write of the xml everything goes
 as it should but on subsequent writes it starts to add more and more
 unneeded newlines to it making it hard to read and ugly.

Maybe you should try to get your code a little cleaner first, that usually
helps in finding these kinds of bugs. Try rewriting it with ElementTree or
lxml, that usually helps you in getting your work done.

http://effbot.org/zone/element-index.htm
http://codespeak.net/lxml/

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing and writing

2006-07-31 Thread c00i90wn
Nice package ElementTree is but sadly it doesn't have a pretty print,
well, guess I'll have to do it myself, if you have one already can you
please give it to me? thanks :)

Stefan Behnel wrote:
 c00i90wn wrote:
  Hey, I'm having a problem with the xml.dom.minidom package, I want to
  generate a simple xml for storing configuration variables, for that
  purpose I've written the following code, but before pasting it I'll
  tell you what my problem is. On first write of the xml everything goes
  as it should but on subsequent writes it starts to add more and more
  unneeded newlines to it making it hard to read and ugly.

 Maybe you should try to get your code a little cleaner first, that usually
 helps in finding these kinds of bugs. Try rewriting it with ElementTree or
 lxml, that usually helps you in getting your work done.

 http://effbot.org/zone/element-index.htm
 http://codespeak.net/lxml/
 
 Stefan

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-23 Thread Kent Johnson
Willem Ligtenberg wrote:
Is there an easy way to couple data together? Because I have discovered an
irritating feature in the xml file.
Sometimes this is a database reference:
<Dbtag>
  <Dbtag_db>UCSC</Dbtag_db>
  <Dbtag_tag>
    <Object-id>
      <Object-id_str>1234</Object-id_str>
    </Object-id>
  </Dbtag_tag>
</Dbtag>
And sometimes:
<Dbtag>
  <Dbtag_db>UCSC</Dbtag_db>
  <Dbtag_tag>
    <Object-id>
      <Object-id_id>1234</Object-id_id>
    </Object-id>
  </Dbtag_tag>
</Dbtag>
So I get a list of database names and two (!) lists of IDs,
and those two are in no way related. Is there an easy way to create a
dictionary like this: DBname --> ID?
If not, I still might need to revert to SAX... :(

None of your requirements sound particularly difficult to implement. If you
would post a complete example of the data you want to parse and the data you
would like to end up with, it would be easier to help you. The sample data
you posted originally does not have many of the fields you want to extract,
and your example of what you want to end up with is not too clear either.

If you are having trouble with ElementTree, I expect you will be completely
lost with SAX. ElementTree is much easier to work with, and cElementTree is
very fast.
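
As a rough sketch of the "DBname --> ID" dictionary asked about above, written for current Python 3: the `<refs>` wrapper element and the function name `db_links` are made up for the example, and the two `Dbtag` samples are based on fragments posted in this thread. The idea is to visit each `Dbtag` and accept whichever of `Object-id_id` or `Object-id_str` is present, so the name and the ID stay paired.

```python
import xml.etree.ElementTree as ET

# <refs> is a hypothetical wrapper so the sample has a single root element.
sample = """
<refs>
  <Dbtag>
    <Dbtag_db>UCSC</Dbtag_db>
    <Dbtag_tag><Object-id><Object-id_str>1234</Object-id_str></Object-id></Dbtag_tag>
  </Dbtag>
  <Dbtag>
    <Dbtag_db>GO</Dbtag_db>
    <Dbtag_tag><Object-id><Object-id_id>5524</Object-id_id></Object-id></Dbtag_tag>
  </Dbtag>
</refs>
"""

def db_links(parent):
    """Map each database name under parent to its ID, whichever form it takes."""
    links = {}
    for dbtag in parent.iter("Dbtag"):
        name = dbtag.findtext("Dbtag_db")
        object_id = dbtag.find("Dbtag_tag/Object-id")
        if name is None or object_id is None:
            continue
        value = object_id.findtext("Object-id_id")
        if value is None:
            value = object_id.findtext("Object-id_str")
        links[name] = value
    return links

links = db_links(ET.fromstring(sample))
print(links)
```

A plain dict keeps only the last ID per database name; if one record can reference the same database several times, a dict of lists would be the safer choice.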

Kent
--
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-22 Thread William Park
Willem Ligtenberg [EMAIL PROTECTED] wrote:
 On Sun, 17 Apr 2005 02:16:04 +, William Park wrote:
  Care to post more details?
 
 The XML file I need to parse contains information about genes.
 So the first element is a gene and then there are a lot sub-elements with
 sub-elements. I only need some of the informtion and want to store it in
 my an object called gene. Lateron this information will be printed into a
 file, which in it's turn will be fed into some other program.

You have to help us a little more here.  Which info do you want to
extract from below example?

 <Entrezgene-Set>
 ...
 </Entrezgene-Set>

-- 
William Park [EMAIL PROTECTED], Toronto, Canada
Slackware Linux -- because it works.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-22 Thread Willem Ligtenberg
This is all the info I need from the xml file:
ID --> <Gene-track_geneid>320632</Gene-track_geneid>

Name --> <Gene-ref>
  <Gene-ref_locus>Pzp</Gene-ref_locus>

Startbase --> <Gene-commentary_seqs>
  <Seq-loc>
    <Seq-loc_int>
      <Seq-interval>
        <Seq-interval_from>126957426</Seq-interval_from>
        <Seq-interval_to>126989473</Seq-interval_to>
        <Seq-interval_strand>
          <Na-strand value="plus"/>
        </Seq-interval_strand>
        <Seq-interval_id>
          <Seq-id>
            <Seq-id_gi>51860766</Seq-id_gi>
          </Seq-id>
        </Seq-interval_id>
      </Seq-interval>
    </Seq-loc_int>
  </Seq-loc>
</Gene-commentary_seqs>
Endbase

Function --> <Prot-ref_name>
  <Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa</Prot-ref_name_E>
  <Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa (DEXH RNA helicase
family)</Prot-ref_name_E>
</Prot-ref_name>

DBLink --> <Gene-ref_locus-tag>MGI:201</Gene-ref_locus-tag>
<Gene-commentary_source>
  <Other-source>
    <Other-source_src>
      <Dbtag>
        <Dbtag_db>GO</Dbtag_db>
        <Dbtag_tag>
          <Object-id>
            <Object-id_id>5524</Object-id_id>
          </Object-id>
        </Dbtag_tag>
      </Dbtag>
    </Other-source_src>
    <Other-source_anchor>ATP binding</Other-source_anchor>
    <Other-source_post-text>evidence: ISS</Other-source_post-text>
  </Other-source>
</Gene-commentary_source>

Product-type --> <Entrezgene_type value="protein-coding">6</Entrezgene_type>

gene-comment --> <Gene-ref_desc>activating signal cointegrator 1 complex
subunit 3-like
1</Gene-ref_desc>

synonym --> <Gene-ref_syn>
  <Gene-ref_syn_E>HELIC2</Gene-ref_syn_E>
  <Gene-ref_syn_E>KIAA0788</Gene-ref_syn_E>
  <Gene-ref_syn_E>U5-200KD</Gene-ref_syn_E>
  <Gene-ref_syn_E>U5-200-KD</Gene-ref_syn_E>
  <Gene-ref_syn_E>A330064G03Rik</Gene-ref_syn_E>
</Gene-ref_syn>

EC --> <Prot-ref_ec>
  <Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
  <Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
</Prot-ref_ec>

Chromosome: <SubSource>
  <SubSource_subtype value="chromosome">1</SubSource_subtype>
  <SubSource_name>6</SubSource_name>
</SubSource>

Some can happen more than once in a record.


On Fri, 22 Apr 2005 02:41:46 -0400, William Park wrote:

 Willem Ligtenberg [EMAIL PROTECTED] wrote:
 On Sun, 17 Apr 2005 02:16:04 +, William Park wrote:
  Care to post more details?
 
 The XML file I need to parse contains information about genes.
 So the first element is a gene and then there are a lot sub-elements with
 sub-elements. I only need some of the informtion and want to store it in
 my an object called gene. Lateron this information will be printed into a
 file, which in it's turn will be fed into some other program.
 
 You have to help us a little more here.  Which info do you want to
 extract from below example?
 
 Entrezgene-Set
 ...
 /Entrezgene-Set

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-22 Thread Willem Ligtenberg
I'm trying to write the code using cElementTree, but I've stumbled across
one problem. Sometimes there are multiple values to retrieve from one record
for the same element, like this:
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>

How do you get not only the first but the rest as well, so that I can
store them in a list?

Thanks in advance,

Willem Ligtenberg

On Fri, 22 Apr 2005 13:48:15 +0200, Willem Ligtenberg wrote:

 This is all the info I need from the xml file:
 ID --Gene-track_geneid320632/Gene-track_geneid
 
 Name --  Gene-ref
 Gene-ref_locusPzp/Gene-ref_locus
 
 Startbase -- Gene-commentary_seqs
 Seq-loc
   Seq-loc_int
 Seq-interval
   Seq-interval_from126957426/Seq-interval_from
   Seq-interval_to126989473/Seq-interval_to
   Seq-interval_strand
 Na-strand value=plus/
   /Seq-interval_strand
   Seq-interval_id
 Seq-id
   Seq-id_gi51860766/Seq-id_gi
 /Seq-id
   /Seq-interval_id
 /Seq-interval
   /Seq-loc_int
 /Seq-loc
   /Gene-commentary_seqs
 Endbase
 
 Function -- Prot-ref_name
 Prot-ref_name_EU5 snRNP-specific protein, 200 kDa/Prot-ref_name_E
 Prot-ref_name_EU5 snRNP-specific protein, 200 kDa (DEXH RNA helicase
 family)/Prot-ref_name_E
   /Prot-ref_name
 
 DBLink -- Gene-ref_locus-tagMGI:201/Gene-ref_locus-tag
 Gene-commentary_source
 Other-source
   Other-source_src
 Dbtag
   Dbtag_dbGO/Dbtag_db
   Dbtag_tag
 Object-id
   Object-id_id5524/Object-id_id
 /Object-id
   /Dbtag_tag
 /Dbtag
   /Other-source_src
   Other-source_anchorATP binding/Other-source_anchor
   Other-source_post-textevidence: 
 ISS/Other-source_post-text
 /Other-source
   /Gene-commentary_source
 
 Product-type -- Entrezgene_type value=protein-coding6/Entrezgene_type
 
 gene-comment -- Gene-ref_descactivating signal cointegrator 1 complex 
 subunit 3-like
 1/Gene-ref_desc
 
 synonym -- Gene-ref_syn
 Gene-ref_syn_EHELIC2/Gene-ref_syn_E
 Gene-ref_syn_EKIAA0788/Gene-ref_syn_E
 Gene-ref_syn_EU5-200KD/Gene-ref_syn_E
 Gene-ref_syn_EU5-200-KD/Gene-ref_syn_E
 Gene-ref_syn_EA330064G03Rik/Gene-ref_syn_E
   /Gene-ref_syn
   
 EC -- Prot-ref_ec
 Prot-ref_ec_E1.5.1.5/Prot-ref_ec_E
 Prot-ref_ec_E3.5.4.9/Prot-ref_ec_E
   /Prot-ref_ec
 
 Chromosome: SubSource
 SubSource_subtype value=chromosome1/SubSource_subtype
 SubSource_name6/SubSource_name
   /SubSource
 
 Some can happen more than once in a record.
 
 
 On Fri, 22 Apr 2005 02:41:46 -0400, William Park wrote:
 
 Willem Ligtenberg [EMAIL PROTECTED] wrote:
 On Sun, 17 Apr 2005 02:16:04 +, William Park wrote:
  Care to post more details?
 
 The XML file I need to parse contains information about genes.
 So the first element is a gene and then there are a lot sub-elements with
 sub-elements. I only need some of the informtion and want to store it in
 my an object called gene. Lateron this information will be printed into a
 file, which in it's turn will be fed into some other program.
 
 You have to help us a little more here.  Which info do you want to
 extract from below example?
 
 Entrezgene-Set
 ...
 /Entrezgene-Set

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-22 Thread Willem Ligtenberg
By the way, I know about findall, but when I iterate through it like:
for x in function:
    print 'function', x

I get:
function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
function <Element 'Prot-ref_name_E' at 0xb7d10d10>

But of course I want the information in there...

On Fri, 22 Apr 2005 15:22:17 +0200, Willem Ligtenberg wrote:

 As I'm trying to write the code using cElementTree.
 I stumble across one problem. Sometimes there are multiple values to
 retrieve from one record for the same element. Like this:
 Prot-ref_name_EATP-binding cassette, subfamily G, member 1/Prot-ref_name_E
 Prot-ref_name_EATP-binding cassette 8/Prot-ref_name_E
 
 How do you get not only the first, but the rest as well, so that I can
 store it in a list.
 
 Thanks in advance,
 
 Willem Ligtenberg
 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-22 Thread Fredrik Lundh
Willem Ligtenberg wrote:
As I'm trying to write the code using cElementTree.
I stumble across one problem. Sometimes there are multiple values to
retrieve from one record for the same element. Like this:
Prot-ref_name_EATP-binding cassette, subfamily G, member 1/Prot-ref_name_E
Prot-ref_name_EATP-binding cassette 8/Prot-ref_name_E
How do you get not only the first, but the rest as well, so that I can
store it in a list.

findall returns a list of matching elements.  if elem is the parent element,
this gives you a list of the text inside all Prot-ref_name_E child elements:

    [e.text for e in elem.findall("Prot-ref_name_E")]

(you have read the elementtree documentation, I hope?)
/F
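
The same expression on current Python 3, as a small self-contained sketch. The `Prot-ref_name` parent and the two values are taken from the fragments posted in this thread; the variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

record = ET.fromstring(
    "<Prot-ref_name>"
    "<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>"
    "<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>"
    "</Prot-ref_name>"
)

# findall returns Element objects; .text is where the character data lives.
names = [e.text for e in record.findall("Prot-ref_name_E")]
print(names)
```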
--
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-22 Thread Willem Ligtenberg
As you can read in my other post, my problem was with iterating through
the list. I didn't know that you should use e.text; I only printed e, not
e.text. I did read the documentation, but I must admit not all of it.

Anyway, thank you very much!

On Fri, 22 Apr 2005 15:47:08 +0200, Fredrik Lundh wrote:

 Willem Ligtenberg wrote:
 
 As I'm trying to write the code using cElementTree.
 I stumble across one problem. Sometimes there are multiple values to
 retrieve from one record for the same element. Like this:
 Prot-ref_name_EATP-binding cassette, subfamily G, member 
 1/Prot-ref_name_E
 Prot-ref_name_EATP-binding cassette 8/Prot-ref_name_E
 
 How do you get not only the first, but the rest as well, so that I can
 store it in a list.
 
 findall returns a list of matching elements.  if elem is the paretnt 
 element,
 this gives you a list of the text inside all Prot-ref_name_E child elements:
 
 [e.text for e in elem.findall(Prot-ref_name_E)]
 
 (you have read the elementtree documentation, I hope?)
 
 /F

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-22 Thread Fredrik Lundh
Willem Ligtenberg wrote:
By the way, I know about findall, but when I iterate thruogh it like:
for x in function:
print 'function', x
I get:
function Element 'Prot-ref_name_E' at 0xb7d10cf8
function Element 'Prot-ref_name_E' at 0xb7d10d10
But of course I want the information in there...

    for x in function:
        print 'function', x.text
/F
--
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-21 Thread Willem Ligtenberg
I'll first try it using SAX, because I want to have as few dependencies
as possible. I already have BioPython as a dependency, and I personally
don't like to install lots of packages for a program to work, so I don't
want to impose that on other people.
But thanks anyway, and I might go for cElementTree later on if
ordinary SAX proves too slow...

On Wed, 20 Apr 2005 08:03:00 -0400, Kent Johnson wrote:

 Willem Ligtenberg wrote:
Willem Ligtenberg [EMAIL PROTECTED] wrote:

I want to parse a very large (2.4 gig) XML file (bioinformatics
ofcourse :)) But I have no clue how to do that. Most things I see read
the entire xml file at once. That isn't going to work here ofcourse.

So I would like to parse a XML file one record at a time and then be
able to store the information in another object.  How should I do
that?
 
 The XML file I need to parse contains information about genes.
 So the first element is a gene and then there are a lot sub-elements with
 sub-elements. I only need some of the informtion and want to store it in
 my an object called gene. Lateron this information will be printed into a
 file, which in it's turn will be fed into some other program.
 This is an example of the XML
 ?xml version=1.0?
 !DOCTYPE Entrezgene-Set PUBLIC -//NCBI//NCBI Entrezgene/EN 
 NCBI_Entrezgene.dtd
 Entrezgene-Set
   Entrezgene
 snip
   /Entrezgene
 /Entrezgene-Set
 
 This should get you started with cElementTree:
 
 import cElementTree as ElementTree
 
 source = 'Entrezgene.xml'
 
 for event, elem in ElementTree.iterparse(source):
     if elem.tag == 'Entrezgene':
         # Process the Entrezgene element
         geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
         print 'Gene id', geneid

         # Throw away the element, we're done with it
         elem.clear()
 
 Kent
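
For readers on current Python, where the separate cElementTree package is no longer needed and the same API lives in `xml.etree.ElementTree`, the quoted loop can be sketched as below. The two-record sample and the generator name `gene_ids` are stand-ins for the real 2.4 GB file, assumed only for the example.

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for the real file; only the gene id path is filled in.
sample = b"""<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info><Gene-track>
      <Gene-track_geneid>9996</Gene-track_geneid>
    </Gene-track></Entrezgene_track-info>
  </Entrezgene>
  <Entrezgene>
    <Entrezgene_track-info><Gene-track>
      <Gene-track_geneid>320632</Gene-track_geneid>
    </Gene-track></Entrezgene_track-info>
  </Entrezgene>
</Entrezgene-Set>"""

def gene_ids(source):
    """Yield the gene id of each <Entrezgene> record, one record at a time."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Entrezgene":
            yield elem.findtext(
                "Entrezgene_track-info/Gene-track/Gene-track_geneid"
            )
            # Drop the record's children so memory use stays bounded.
            elem.clear()

ids = list(gene_ids(io.BytesIO(sample)))
print(ids)
```

`iterparse` also accepts a file name, so `gene_ids('Entrezgene.xml')` would be the call for a file on disk. The cleared `Entrezgene` elements remain attached to the root as empty shells; for a very large file it may be worth removing them from the root as well.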

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-21 Thread Willem Ligtenberg
Sorry, I just decided that I want to use your solution, but I am wondering:
is cElementTree part of expat, or is that something different?

On Wed, 20 Apr 2005 08:03:00 -0400, Kent Johnson wrote:

 Willem Ligtenberg wrote:
Willem Ligtenberg [EMAIL PROTECTED] wrote:

I want to parse a very large (2.4 gig) XML file (bioinformatics
ofcourse :)) But I have no clue how to do that. Most things I see read
the entire xml file at once. That isn't going to work here ofcourse.

So I would like to parse a XML file one record at a time and then be
able to store the information in another object.  How should I do
that?
 
 The XML file I need to parse contains information about genes.
 So the first element is a gene and then there are a lot sub-elements with
 sub-elements. I only need some of the informtion and want to store it in
 my an object called gene. Lateron this information will be printed into a
 file, which in it's turn will be fed into some other program.
 This is an example of the XML
 ?xml version=1.0?
 !DOCTYPE Entrezgene-Set PUBLIC -//NCBI//NCBI Entrezgene/EN 
 NCBI_Entrezgene.dtd
 Entrezgene-Set
   Entrezgene
 snip
   /Entrezgene
 /Entrezgene-Set
 
 This should get you started with cElementTree:
 
 import cElementTree as ElementTree
 
 source = 'Entrezgene.xml'
 
 for event, elem in ElementTree.iterparse(source):
     if elem.tag == 'Entrezgene':
         # Process the Entrezgene element
         geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
         print 'Gene id', geneid

         # Throw away the element, we're done with it
         elem.clear()
 
 Kent

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-21 Thread Simon Brunning
On 4/21/05, Willem Ligtenberg [EMAIL PROTECTED] wrote:
 Sorry I just decided that I want to use your solution, but I am wondering
 is cElemenTree in expat or is that something different?

Nope, cElementTree is very much its own man. See
http://effbot.org/zone/celementtree.htm.

-- 
Cheers,
Simon B,
[EMAIL PROTECTED],
http://www.brunningonline.net/simon/blog/
--
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-21 Thread Paul McGuire
Don't assume that just because you have a 2.4G XML file that you have
2.4G of data.  Looking at these verbose tags, plus the fact that the
XML is pretty-printed (all those leading spaces - not even tabs! - add
up), I'm guessing you only have about 5-10% actual data, and the rest
is just XML tagging/untagging and spaces.  (For example, 373 characters
used to represent a date/time - this is a sin!)

As XML goes, this looks pretty dead easy to parse with non-XML parser
means.  It looks like all of your leaf nodes open and close on the same
line, which would be easy to extract with regexp's or pyparsing.
Especially since you mention "I only need some of the information", you
don't even have to build a full document tree representation.  SAX
parsers would also be good, since you could only trigger on the
matching subset of tags that you are really interested in.  Lastly, you
could even try a pyparsing approach.  I usually don't recommend
pyparsing for XML since there are already many good XML-targeted tools
out there, but it is very easy to throw together something in pyparsing
that extracts, say, all of the object-id_id entries, or all of the
gene-source structures.  What is the subset of information you are
looking to extract?
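
The SAX idea of triggering only on a chosen subset of tags might look like the sketch below on current Python 3. The class name `WantedTags`, the choice of tags and the one-line sample record are assumptions made for the example, and the handler assumes the wanted tags are leaf elements that are not nested inside each other.

```python
import io
import xml.sax

class WantedTags(xml.sax.ContentHandler):
    """Collect (tag, text) pairs for a chosen set of leaf elements."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.values = []
        self._buffer = None  # None means "not inside a wanted element"

    def startElement(self, name, attrs):
        if name in self.wanted:
            self._buffer = []

    def characters(self, content):
        # characters() may be called several times per element.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name in self.wanted and self._buffer is not None:
            self.values.append((name, "".join(self._buffer)))
            self._buffer = None

sample = (
    b"<Entrezgene>"
    b"<Gene-track_geneid>320632</Gene-track_geneid>"
    b"<Gene-ref_locus>Pzp</Gene-ref_locus>"
    b"<Gene-ref_desc>not collected</Gene-ref_desc>"
    b"</Entrezgene>"
)

handler = WantedTags(["Gene-track_geneid", "Gene-ref_locus"])
xml.sax.parse(io.BytesIO(sample), handler)
print(handler.values)
```

Since nothing is kept except the collected pairs, memory use should not grow with the size of the input, which is the attraction of SAX for a file this large.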

-- Paul

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-20 Thread Willem Ligtenberg
On Sun, 17 Apr 2005 02:16:04 +, William Park wrote:

 Willem Ligtenberg [EMAIL PROTECTED] wrote:
 I want to parse a very large (2.4 gig) XML file (bioinformatics
 ofcourse :)) But I have no clue how to do that. Most things I see read
 the entire xml file at once. That isn't going to work here ofcourse.
 
 So I would like to parse a XML file one record at a time and then be
 able to store the information in another object.  How should I do
 that?
 
 Thanks in advance,
 
 Willem Ligtenberg A total newbie to python by the way.
 
 You may want to try Expat (www.libexpat.org) or Python wrapper to it.
 You can feed small piece at a time, say by lines or whatever.  Of
 course, it all depends on what kind of parsing you have in mind. :-)
 
 Care to post more details?

The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>9996</Gene-track_geneid>
        <Gene-track_status value="secondary">1</Gene-track_status>
        <Gene-track_current-id>
          <Dbtag>
            <Dbtag_db>LocusID</Dbtag_db>
            <Dbtag_tag>
              <Object-id>
                <Object-id_id>320632</Object-id_id>
              </Object-id>
            </Dbtag_tag>
          </Dbtag>
          <Dbtag>
            <Dbtag_db>GeneID</Dbtag_db>
            <Dbtag_tag>
              <Object-id>
                <Object-id_id>320632</Object-id_id>
              </Object-id>
            </Dbtag_tag>
          </Dbtag>
        </Gene-track_current-id>
        <Gene-track_create-date>
          <Date>
            <Date_std>
              <Date-std>
                <Date-std_year>2003</Date-std_year>
                <Date-std_month>8</Date-std_month>
                <Date-std_day>28</Date-std_day>
                <Date-std_hour>21</Date-std_hour>
                <Date-std_minute>39</Date-std_minute>
                <Date-std_second>0</Date-std_second>
              </Date-std>
            </Date_std>
          </Date>
        </Gene-track_create-date>
        <Gene-track_update-date>
          <Date>
            <Date_std>
              <Date-std>
                <Date-std_year>2005</Date-std_year>
                <Date-std_month>2</Date-std_month>
                <Date-std_day>17</Date-std_day>
                <Date-std_hour>12</Date-std_hour>
                <Date-std_minute>54</Date-std_minute>
                <Date-std_second>0</Date-std_second>
              </Date-std>
            </Date_std>
          </Date>
        </Gene-track_update-date>
      </Gene-track>
    </Entrezgene_track-info>
    <Entrezgene_type value="protein-coding">6</Entrezgene_type>
    <Entrezgene_source>
      <BioSource>
        <BioSource_genome value="genomic">1</BioSource_genome>
        <BioSource_origin value="natural">1</BioSource_origin>
        <BioSource_org>
          <Org-ref>
            <Org-ref_taxname>Mus musculus</Org-ref_taxname>
            <Org-ref_common>house mouse</Org-ref_common>
            <Org-ref_db>
              <Dbtag>
                <Dbtag_db>taxon</Dbtag_db>
                <Dbtag_tag>
                  <Object-id>
                    <Object-id_id>10090</Object-id_id>
                  </Object-id>
                </Dbtag_tag>
              </Dbtag>
            </Org-ref_db>
            <Org-ref_syn>
              <Org-ref_syn_E>mouse</Org-ref_syn_E>
            </Org-ref_syn>
            <Org-ref_orgname>
              <OrgName>
                <OrgName_name>
                  <OrgName_name_binomial>
                    <BinomialOrgName>
                      <BinomialOrgName_genus>Mus</BinomialOrgName_genus>
                      <BinomialOrgName_species>musculus</BinomialOrgName_species>
                    </BinomialOrgName>
                  </OrgName_name_binomial>
                </OrgName_name>
                <OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata;
Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires;
Rodentia; Sciurognathi; Muridae; Murinae; Mus</OrgName_lineage>
                <OrgName_gcode>1</OrgName_gcode>
                <OrgName_mgcode>2</OrgName_mgcode>
                <OrgName_div>ROD</OrgName_div>
              </OrgName>
            </Org-ref_orgname>
          </Org-ref>
        </BioSource_org>
      </BioSource>
    </Entrezgene_source>
    <Entrezgene_gene>
      <Gene-ref>
      </Gene-ref>
    </Entrezgene_gene>
    <Entrezgene_gene-source>
      <Gene-source>
        <Gene-source_src>LocusLink</Gene-source_src>
        <Gene-source_src-int>9996</Gene-source_src-int>
        <Gene-source_src-str2>9996</Gene-source_src-str2>
        <Gene-source_gene-display value="false"/>
        <Gene-source_locus-display value="false"/>
        <Gene-source_extra-terms value="false"/>
      </Gene-source>
    </Entrezgene_gene-source>

Re: XML parsing per record

2005-04-20 Thread Kent Johnson
Willem Ligtenberg wrote:
>> Willem Ligtenberg [EMAIL PROTECTED] wrote:
>>> I want to parse a very large (2.4 gig) XML file (bioinformatics
>>> of course :)) But I have no clue how to do that. Most things I see read
>>> the entire xml file at once. That isn't going to work here of course.
>>> So I would like to parse an XML file one record at a time and then be
>>> able to store the information in another object.  How should I do
>>> that?
>
> The XML file I need to parse contains information about genes.
> So the first element is a gene, and then there are a lot of sub-elements with
> sub-elements of their own. I only need some of the information and want to
> store it in an object called gene. Later on this information will be printed
> into a file, which in its turn will be fed into some other program.
> This is an example of the XML:
>
> <?xml version="1.0"?>
> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN"
> "NCBI_Entrezgene.dtd">
> <Entrezgene-Set>
>   <Entrezgene>
> <snip>
>   </Entrezgene>
> </Entrezgene-Set>

This should get you started with cElementTree:

    import cElementTree as ElementTree

    source = 'Entrezgene.xml'

    for event, elem in ElementTree.iterparse(source):
        if elem.tag == 'Entrezgene':
            # Process the Entrezgene element
            geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
            print 'Gene id', geneid

            # Throw away the element, we're done with it
            elem.clear()

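
For readers on current Python 3, here is a sketch of the same loop, assuming the standard-library xml.etree.ElementTree (the separate cElementTree package is no longer needed there). The helper name iter_gene_ids and the tiny in-memory sample are illustrative; a real run would pass the path of the large file instead.

```python
import io
from xml.etree import ElementTree

def iter_gene_ids(source):
    """Yield the gene id of each Entrezgene record, one record at a time."""
    # "end" events fire once an element and all of its children are complete.
    for event, elem in ElementTree.iterparse(source, events=("end",)):
        if elem.tag == "Entrezgene":
            yield elem.findtext("Entrezgene_track-info/Gene-track/Gene-track_geneid")
            # Drop the children of the finished record to keep memory low.
            elem.clear()

# Hypothetical cut-down sample in the same shape as the posted XML.
sample = io.BytesIO(b"""<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>9996</Gene-track_geneid>
      </Gene-track>
    </Entrezgene_track-info>
  </Entrezgene>
</Entrezgene-Set>""")

gene_ids = list(iter_gene_ids(sample))
print(gene_ids)
```

Note that the cleared Entrezgene elements still hang off the root as empty shells; for a 2.4 gig file that is normally tolerable, but it is worth knowing.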
Kent
--
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-17 Thread Fredrik Lundh
William Park wrote:
> You may want to try Expat (www.libexpat.org) or Python wrapper to it.

Python comes with a low-level expat wrapper (pyexpat).

however, if you want performance, cElementTree (which also uses expat) is a
lot faster than pyexpat.  (see my other post for links to benchmarks and code).

/F
--
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-16 Thread Irmen de Jong
Willem Ligtenberg wrote:
> I want to parse a very large (2.4 gig) XML file (bioinformatics of course :))
> But I have no clue how to do that. Most things I see read the entire xml
> file at once. That isn't going to work here of course.
> So I would like to parse an XML file one record at a time and then be able
> to store the information in another object.
> How should I do that?
> Thanks in advance,
> Willem Ligtenberg
> A total newbie to python by the way.

Read about SAX parsers.
This may be of help:
http://www.devarticles.com/c/a/XML/Parsing-XML-with-SAX-and-Python/
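
A minimal SAX sketch, assuming Python 3's standard xml.sax module. The class name GeneIdHandler is made up, and the element name Gene-track_geneid is borrowed from the sample posted elsewhere in this thread; the handler only buffers text while inside that one element, so nothing else is kept in memory.

```python
import io
import xml.sax

class GeneIdHandler(xml.sax.ContentHandler):
    """Collect the text of every Gene-track_geneid element."""

    def __init__(self):
        super().__init__()
        self.gene_ids = []
        self._buffer = None  # a list while inside the element, else None

    def startElement(self, name, attrs):
        if name == "Gene-track_geneid":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Gene-track_geneid":
            self.gene_ids.append("".join(self._buffer))
            self._buffer = None

# Hypothetical one-record sample; a real run would pass a file name.
sample = io.BytesIO(
    b"<Entrezgene-Set><Entrezgene>"
    b"<Gene-track_geneid>9996</Gene-track_geneid>"
    b"</Entrezgene></Entrezgene-Set>"
)
handler = GeneIdHandler()
xml.sax.parse(sample, handler)
print(handler.gene_ids)
```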

Out of curiosity, why is the data stored in an XML file?
XML is not known for its efficiency
--Irmen
--
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-16 Thread Ivan Voras
Irmen de Jong wrote:
> XML is not known for its efficiency

<sarcasm> Surely you are blaspheming, sir! XML's the greatest thing
since peanut butter! </sarcasm>

I'm just *waiting* for the day someone finds its use on the rolls of 
toilet paper... oh the glorious day...

--
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-16 Thread Kent Johnson
Willem Ligtenberg wrote:
> I want to parse a very large (2.4 gig) XML file (bioinformatics of course :))
> But I have no clue how to do that. Most things I see read the entire xml
> file at once. That isn't going to work here of course.
> So I would like to parse an XML file one record at a time and then be able
> to store the information in another object.
> How should I do that?

You might be interested in this recipe using ElementTree:
http://online.effbot.org/2004_12_01_archive.htm#element-generator
Kent
--
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-16 Thread Fredrik Lundh
Kent Johnson wrote:
>> So I would like to parse an XML file one record at a time and then be able
>> to store the information in another object.
>
> You might be interested in this recipe using ElementTree:
> http://online.effbot.org/2004_12_01_archive.htm#element-generator

if you have ElementTree 1.2.5 or later, the iterparse function provides a
more efficient implementation of that pattern:

    http://effbot.org/zone/element-iterparse.htm

the cElementTree implementation of iterparse is a lot faster than SAX; see
the second table under
   http://effbot.org/zone/celementtree.htm#benchmarks
for some figures.
/F
--
http://mail.python.org/mailman/listinfo/python-list


Re: XML parsing per record

2005-04-16 Thread William Park
Willem Ligtenberg [EMAIL PROTECTED] wrote:
> I want to parse a very large (2.4 gig) XML file (bioinformatics
> of course :)) But I have no clue how to do that. Most things I see read
> the entire xml file at once. That isn't going to work here of course.
>
> So I would like to parse an XML file one record at a time and then be
> able to store the information in another object.  How should I do
> that?
>
> Thanks in advance,
>
> Willem Ligtenberg A total newbie to python by the way.

You may want to try Expat (www.libexpat.org) or Python wrapper to it.
You can feed small piece at a time, say by lines or whatever.  Of
course, it all depends on what kind of parsing you have in mind. :-)

Care to post more details?
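
To illustrate feeding small pieces at a time, here is a sketch using Python 3's bundled expat wrapper, xml.parsers.expat. The function name count_elements, the chunk size and the element names set/rec are assumptions chosen for the example only.

```python
import io
import xml.parsers.expat

def count_elements(stream, tag, chunk_size=64 * 1024):
    """Count start tags named `tag`, reading the stream chunk by chunk."""
    count = 0

    def on_start(name, attrs):
        nonlocal count
        if name == tag:
            count += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)   # more data will follow
    parser.Parse(b"", True)          # signal end of document
    return count

# Hypothetical tiny document; the small chunk size forces several feeds.
sample = io.BytesIO(b"<set><rec>1</rec><rec>2</rec><rec>3</rec></set>")
total = count_elements(sample, "rec", chunk_size=8)
print(total)
```

Chunk boundaries may fall in the middle of a tag; expat keeps the partial state between calls, so the caller does not have to split on element boundaries.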

-- 
William Park [EMAIL PROTECTED], Toronto, Canada
Slackware Linux -- because it works.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-21 Thread Luis P. Mendes
From your experience, do you think that if this wrong XML code were
meant to be read only by some kind of Microsoft parser, the error would
not occur?

I'll try to explain:
The XML producer writes the code on a Windows platform and 'thinks' that every
client will read/parse the code with a specific Windows parser.  Could
that (wrong) XML code parse correctly in that kind of specific Windows
client?

Or in other words:
Do you know any Windows parser that could turn that erroneous encoding
into an XML tree, with four or five inner levels of tags?

I'd like to thank everyone for taking the time to answer me.

Luis
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-21 Thread Fredrik Lundh
Luis P. Mendes wrote:

> The XML producer writes the code on a Windows platform and 'thinks' that every
> client will read/parse the code with a specific Windows parser.  Could
> that (wrong) XML code parse correctly in that kind of specific Windows
> client?

not if it's an XML parser.

> Do you know any Windows parser that could turn that erroneous encoding
> into an XML tree, with four or five inner levels of tags?

any parser *can* do that, but I doubt many parsers will do it unless
you ask it to (by extracting the string and parsing it again).  here's the
elementtree version:

    import urllib
    from elementtree.ElementTree import parse, XML

    wrapper = parse(urllib.urlopen(url))
    # {http://www..}string is the root element, so read its text directly
    dataset = XML(wrapper.getroot().text)
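
A self-contained sketch of the same two-pass idea for current Python 3, assuming the standard-library xml.etree.ElementTree. The namespace http://example.invalid/ns is a placeholder (the real one is elided in the thread), and the document is inlined here rather than fetched from a URL.

```python
from xml.etree import ElementTree as ET

# Outer document: a single <string> element whose text is escaped XML.
outer_xml = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<string xmlns="http://example.invalid/ns">'
    b'&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    b'&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

outer = ET.fromstring(outer_xml)    # first pass: root is the wrapper element
inner = ET.fromstring(outer.text)   # second pass: the text becomes elements
customer = inner.findtext("Order/Customer")
print(inner.tag, customer)
```

The first parse removes exactly one level of escaping, which is why the wrapper's text can be handed straight to a second parse.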

/F 



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-21 Thread Martin v. Löwis
Luis P. Mendes wrote:
> From your experience, do you think that if this wrong XML code were
> meant to be read only by some kind of Microsoft parser, the error would
> not occur?

This is very unlikely. MSXML would never do this incorrectly.

Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Luis P. Mendes
this is the xml document:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www..">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>

When I do:

print xmldoc.toxml()

it prints:

<?xml version="1.0" ?>
<string xmlns="http://www...">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>
__

with:   stringNode = xmldoc.childNodes[0]
print stringNode.toxml()

I get:

<string xmlns="http://www...">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>
__

with:   DataSetNode = stringNode.childNodes[0]
print DataSetNode.toxml()

I get:

&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
  &lt;/Order&gt;
&lt;/DataSet&gt;
___-

so far so good, but when I issue the command:

print DataSetNode.childNodes[0]

I get:
IndexError: tuple index out of range

Why the error, and why does it return a tuple?
Why doesn't it return:

&lt;Order&gt;
  &lt;Customer&gt;439&lt;/Customer&gt;
&lt;/Order&gt;
??
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Kent Johnson
Luis P. Mendes wrote:
> this is the xml document:
>
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http://www..">&lt;DataSet&gt;
>   &lt;Order&gt;
>     &lt;Customer&gt;439&lt;/Customer&gt;
> (... others ...)
>   &lt;/Order&gt;
> &lt;/DataSet&gt;</string>

This is an XML document containing a single tag, <string>, whose content is
text containing entity-escaped XML.

This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>,
etc.

All the behaviour you are seeing is a consequence of this. You need to
unescape the contents of the <string> tag to be able to treat it as
structured XML.

Kent
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Irmen de Jong
Kent Johnson wrote:
> [...]
> This is an XML document containing a single tag, <string>, whose content
> is text containing entity-escaped XML.
>
> This is *not* an XML document containing tags <DataSet>, <Order>,
> <Customer>, etc.
>
> All the behaviour you are seeing is a consequence of this. You need to
> unescape the contents of the <string> tag to be able to treat it as
> structured XML.

The unescaping is usually done for you by the xml parser that you use.

--Irmen
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Kent Johnson
Irmen de Jong wrote:
> Kent Johnson wrote:
>> [...]
>> This is an XML document containing a single tag, <string>, whose
>> content is text containing entity-escaped XML.
>>
>> This is *not* an XML document containing tags <DataSet>, <Order>,
>> <Customer>, etc.
>>
>> All the behaviour you are seeing is a consequence of this. You need to
>> unescape the contents of the <string> tag to be able to treat it as
>> structured XML.
>
> The unescaping is usually done for you by the xml parser that you use.

Yes, so if your XML contains for example

  <stuff>&lt;not a tag&gt;</stuff>

and you parse this and ask for the *text* content of the <stuff> tag, you will
get the string

  <not a tag>

but it's still *not* a tag. If you try to get child elements of the <stuff>
element there will be none.

This is exactly the confusion the OP has.
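
The point can be checked in a few lines; this is only a sketch, assuming Python 3's standard xml.etree.ElementTree.

```python
from xml.etree import ElementTree as ET

stuff = ET.fromstring("<stuff>&lt;not a tag&gt;</stuff>")

text = stuff.text        # the parser has already unescaped the entities
children = list(stuff)   # child *elements* of <stuff>
print(repr(text), children)
```

The text comes back with real angle brackets in it, yet the list of child elements stays empty.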
--Irmen
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Martin v. Löwis
Luis P. Mendes wrote:
> with:   DataSetNode = stringNode.childNodes[0]
> print DataSetNode.toxml()
>
> I get:
>
> &lt;DataSet&gt;
>   &lt;Order&gt;
>     &lt;Customer&gt;439&lt;/Customer&gt;
>   &lt;/Order&gt;
> &lt;/DataSet&gt;
> ___-
>
> so far so good, but when I issue the command:
>
> print DataSetNode.childNodes[0]
>
> I get:
> IndexError: tuple index out of range
>
> Why the error, and why does it return a tuple?

The DataSetNode has no children, because it is not
an Element node, but a Text node. In XML, an element
is denoted by

  <DataSet>...</DataSet>

and *not* by

  &lt;DataSet&gt;...&lt;/DataSet&gt;

The latter is just a single string, represented
in XML as a Text node. It does not give you any
hierarchy whatsoever.

As a text node does not have any children, its
childNodes member is an empty tuple; accessing
that tuple gives you an IndexError.
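
A small sketch that reproduces the situation, assuming Python 3's xml.dom.minidom and a cut-down, namespace-free version of the document (both are simplifications for illustration):

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<string>&lt;DataSet&gt;&lt;Order/&gt;&lt;/DataSet&gt;</string>"
)
string_node = doc.childNodes[0]          # the <string> element
data_node = string_node.childNodes[0]    # its only child

is_text = data_node.nodeType == data_node.TEXT_NODE
child_count = len(data_node.childNodes)  # empty, so [0] would raise IndexError
print(is_text, child_count, data_node.data)
```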
Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Martin v. Löwis
Irmen de Jong wrote:
> The unescaping is usually done for you by the xml parser that you use.

Usually, but not in this case. If you have a text that looks like
XML, and you want to put it into an XML element, the XML file uses
&lt; and &gt;. The XML parser unescapes that as < and >. However, it
does not then consider the < and > as markup, and it shouldn't.
Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Irmen de Jong
Martin v. Löwis wrote:
> Irmen de Jong wrote:
>> The unescaping is usually done for you by the xml parser that you use.
>
> Usually, but not in this case. If you have a text that looks like
> XML, and you want to put it into an XML element, the XML file uses
> &lt; and &gt;. The XML parser unescapes that as < and >. However, it
> does not then consider the < and > as markup, and it shouldn't.

That's also what I said?
The unescaping of the XML entities in the contents of the OP's
<string> element is done for you by the parser,
so you will get a text node with the <, >, &, whatever in there.
The OP probably wants to feed that to a new xml parser instance
to process it as markup.
Or perhaps the way the original XML document is constructed is
flawed.
--Irmen
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Martin v. Löwis
Irmen de Jong wrote:
>> Usually, but not in this case. If you have a text that looks like
>> XML, and you want to put it into an XML element, the XML file uses
>> &lt; and &gt;. The XML parser unescapes that as < and >. However, it
>> does not then consider the < and > as markup, and it shouldn't.
>
> That's also what I said?

You said it in response to

>> All the behaviour you are seeing is a consequence of this. You need
>> to unescape the contents of the <string> tag to be able to treat it
>> as structured XML.

In that context, I interpreted

> The unescaping is usually done for you by the xml parser that you
> use.

as "The parser should have done what you want; if the parser didn't,
that is a bug in the parser".

> The OP probably wants to feed that to a new xml parser instance
> to process it as markup.
> Or perhaps the way the original XML document is constructed is
> flawed.

Either of these, indeed - probably the latter.
Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Luis P. Mendes
I would like to thank everyone for your answers, but I'm not seeing the
light yet!

When I access the url via the Firefox browser and look into the source
code, I also get:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>

Should I take the contents of the string tag, which is text, and replace
all '&lt;' with '<' and '&gt;' with '>' and then read it with xml.dom.minidom?
How do I do that?

Or should I use another parser that accomplishes the task with no need
to replace the escaped characters?
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Martin v. Löwis
Luis P. Mendes wrote:
> When I access the url via the Firefox browser and look into the source
> code, I also get:
>
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http">&lt;DataSet&gt;
>   &lt;Order&gt;
>     &lt;Customer&gt;439&lt;/Customer&gt;
>   &lt;/Order&gt;
> &lt;/DataSet&gt;</string>

Please do try to understand what you are seeing. This is crucial for
understanding what happens.

You may have the understanding that XML can be represented as a tree.
This would be good - if not, please read a book that explains why
XML can be considered as a tree.

In the tree, you have inner nodes, and leaf nodes. For example,
the document

<a>
  <b>Hello</b>
  <c>World</c>
</a>

has 5 nodes (ignoring whitespace content):

Element:a ---- Element:b ---- Text:"Hello"
   |
   \-- Element:c ---- Text:"World"

So the leaf nodes are typically Text nodes (unless you
have an empty element). Your document has this structure:

Element:string ---- Text:"<DataSet>
   <Order>
  <Customer>439</Customer>
  </Order>
</DataSet>"

So the ***TEXT*** contains the letter "<", just like it contains
the letters "O" and "r". There IS no element <Order> in your document,
no matter how hard you look.

If you want a DataSet *element* in your document, it should
read

<string xmlns="...">
 <DataSet>
  <Order>
   <Customer>439</Customer>
  </Order>
 </DataSet>
</string>

As this is the document you apparently want to process, complain
to whoever gave you that other document.

> should I take the contents of the string tag that is text and replace
> all '&lt;' with '<' and '&gt;' with '>' and then read it with xml.minidom?

No. We still don't know what you want to achieve, so it is difficult to
advise you what to do. My best advice is that whoever generates the XML
document should fix it.

> or should I use another parser that accomplishes the task with no need
> to replace the escaped characters?

No. The parser is working correctly.

The document you got can also be interpreted as containing another
XML document as a text. This is evil, but apparently people are doing
it, anyway. If you really want that embedded document, you need
first to extract it.

To see what I mean, do

print DataSetNode.data

The .data attribute gives you the string contents of
a text node. You could use this as an XML document, and
pass it again to an XML parser. This would be ugly,
but might be your only choice if the producer of the
document is unwilling to adjust.
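
The "extract it and parse it again" step might look like the sketch below, assuming Python 3's xml.dom.minidom. The namespace http://example.invalid/ns is a placeholder for the elided one, and the document is inlined instead of being fetched.

```python
from xml.dom import minidom

outer = minidom.parseString(
    '<string xmlns="http://example.invalid/ns">'
    "&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;"
    "&lt;/Order&gt;&lt;/DataSet&gt;</string>"
)

# The only child of <string> is a Text node; .data is its unescaped content.
embedded_text = outer.documentElement.firstChild.data

# Second parse: now DataSet, Order and Customer are real elements.
inner = minidom.parseString(embedded_text)
customer = inner.getElementsByTagName("Customer")[0].firstChild.data
print(customer)
```

No manual replacing of '&lt;' and '&gt;' is involved; the first parse already did that.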
Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-20 Thread Jeremy Bowers
On Thu, 20 Jan 2005 21:54:30 +0100, Martin v. Löwis wrote:

> Luis P. Mendes wrote:
>> When I access the url via the Firefox browser and look into the source
>> code, I also get:
>>
>> <?xml version="1.0" encoding="utf-8"?> <string
>> xmlns="http">&lt;DataSet&gt;   &lt;Order&gt;
>>     &lt;Customer&gt;439&lt;/Customer&gt;   &lt;/Order&gt;
>> &lt;/DataSet&gt;</string>
>
> Please do try to understand what you are seeing. This is crucial for
> understanding what happens.

From extremely painful and lengthy personal experience, Luis, I
***extremely*** strongly recommend taking the time to nail this down until
you really, really, really understand what is going on. Until you can
explain it to somebody else coherently, ideally.

Mixing escaping levels like this absolutely, positively *must* be done
correctly, or extremely-painful-to-debug problems will result.

(My painful experience was layering an RPC implementation in plain text on
top of IM messages, where I was dealing with everything from the socket
level up except the XML parser. Ultimately it turned out there was a
problem in the XML parser, it rendered &amp;amp; as &, which is wrong
wrong wrong. But that took a *long* time to find, especially as I had
other bugs in the way.)

Since you're layering XML in XML, test &amp;amp; and &amp;amp;amp; to make
sure they work correctly; those usually show encoding errors. And, given
your current understanding of the issue, do not write your own decoding
function unless you absolutely can't avoid it.
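
A sketch of that kind of check, assuming Python 3's xml.etree.ElementTree and xml.sax.saxutils.escape. The helper names wrap and unwrap and the <msg> element are invented for the example; the idea is that one parse must remove exactly one level of escaping, never more.

```python
from xml.etree import ElementTree as ET
from xml.sax.saxutils import escape

def wrap(payload, levels):
    """Escape payload `levels` times and wrap it in a <msg> element."""
    for _ in range(levels):
        payload = escape(payload)
    return "<msg>%s</msg>" % payload

def unwrap(document):
    """Parse the document and return the text of <msg> (one level removed)."""
    return ET.fromstring(document).text

once = wrap("&", 1)    # <msg>&amp;</msg>
twice = wrap("&", 2)   # <msg>&amp;amp;</msg>
print(unwrap(once), unwrap(twice))
```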
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: xml parsing escape characters

2005-01-19 Thread Martin v. Löwis
Luis P. Mendes wrote:
> I get the following result:
>
> <?xml version="1.0" encoding="utf-8"?>
> <string xmlns="http://www..">&lt;DataSet&gt;
>   &lt;Order&gt;

Most likely, this result is correct, and your document
really does contain

  &lt;Order&gt;

> I don't get any elements.  But, if I access the same url via a browser,
> the result in the browser window is something like:
>
> <string xmlns="http://www..">
>   <DataSet>

Most likely, your browser is incorrect (or at least confusing), and
renders &lt; as <, even though this is not markup.

> I already browsed the web, I know it's about the escape characters, but
> I didn't find a simple solution for this.

Not sure what "this" is. AFAICT, everything works correctly.
Regards,
Martin
--
http://mail.python.org/mailman/listinfo/python-list