Re: python xml DOM? pulldom? SAX?

2005-09-06 Thread jog
Thanks a lot for all your replies, that was really great and helpful.
Now I have some problems with the indexing: it takes too much memory
and takes too long. I have to look into it.



python xml DOM? pulldom? SAX?

2005-08-29 Thread jog
Hi,
I want to get text out of some nodes of a huge xml file (1.5 GB). The
structure of the xml file is something like this:

<parent>
   <page>
      <title>bla</title>
      <id></id>
      <revision>
         <id></id>
         <text>blablabla</text>
      </revision>
   </page>
   <page>
   </page>
</parent>
I want to combine the text out of page:title and page:revision:text for
every single page element. One by one I want to index these combined
texts (so for each page, one index).
What is the most efficient API for that: SAX (I don't think so), DOM,
or pulldom?
Or should I just use XPath somehow?
I don't want to do anything else with this xml file afterwards.
I hope someone will understand me.
Thank you very much
Jog



Re: python xml DOM? pulldom? SAX?

2005-08-29 Thread tooper
Hi,

I'd advocate using SAX, as DOM-related methods imply loading the
complete XML content into memory, whereas SAX grabs things on the fly.
The SAX method should therefore be faster and consume less memory...

By the way, if your goal is just to combine the text out of page:title
and page:revision:text for every single page element, maybe you should
also consider an XSLT filter.

Regards,
Thierry



Re: python xml DOM? pulldom? SAX?

2005-08-29 Thread Michael Ekstrand
On 29 Aug 2005 08:17:04 -0700
jog [EMAIL PROTECTED] wrote:
> I want to get text out of some nodes of a huge xml file (1.5 GB). The
> structure of the xml file is something like this
> [structure snipped]
> I want to combine the text out of page:title and page:revision:text
> for every single page element. One by one I want to index these
> combined texts (so for each page, one index)
> What is the most efficient API for that: SAX (I don't think so), DOM,
> or pulldom?

Definitely SAX IMHO, or xml.parsers.expat. For what you're doing, an
event-driven interface is ideal. DOM parses the *entire* XML tree into
memory at once, before you can do anything - highly inefficient for a
large data set like this. I've never used pulldom; it might have
potential, but from my (limited and flawed) understanding of it, I
think it may also wind up loading most of the file into memory by the
time you're done.
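
For illustration, a minimal pulldom sketch for this shape of problem
(untested; "file.xml" is a stand-in path). Whether the nodes outside the
expanded pages are ever released is exactly the open question:

from xml.dom import pulldom

# Sketch only: expand just the <page> subtrees into DOM fragments.
events = pulldom.parse("file.xml")
for event, node in events:
    if event == pulldom.START_ELEMENT and node.tagName == "page":
        events.expandNode(node)   # reifies this <page> as a small DOM tree
        title = node.getElementsByTagName("title")[0]
        print title.firstChild.data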

SAX will not build any memory structures other than the ones you
explicitly create (SAX is commonly used to build DOM trees). With SAX,
you can just watch for any tags of interest (and perhaps some
surrounding tags to provide context) and extract the desired data, all
very efficiently.

It took me a bit to get the hang of SAX, but once I did, I never
looked back. Event-driven parsing is a brilliant solution to this
problem domain.

> Or should I just use XPath somehow?

XPath usually requires a DOM tree on which it can operate. The Python
XPath implementation (in PyXML) requires DOM objects. I see this as
being a highly inefficient solution.

Another potential solution, if the data file has extraneous
information: run the source file through an XSLT transform that strips
it down to only the data you need, and then apply SAX to parse it.

- Michael


Re: python xml DOM? pulldom? SAX?

2005-08-29 Thread Fredrik Lundh
jog wrote:

> I want to get text out of some nodes of a huge xml file (1.5 GB). The
> structure of the xml file is something like this
>
> I want to combine the text out of page:title and page:revision:text for
> every single page element. One by one I want to index these combined
> texts (so for each page, one index)

here's one way to do it:

try:
    import cElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET

for event, elem in ET.iterparse(file):
    if elem.tag == "page":
        title = elem.findtext("title")
        revision = elem.findtext("revision/text")
        print title, revision
        elem.clear() # won't need this any more
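
And if the goal is one index per page, the loop can hand each title/text
pair straight to an indexer. A variation on the above, with index_page
standing in (hypothetically) for whatever indexing routine is used:

try:
    import cElementTree as ET
except ImportError:
    from elementtree import ElementTree as ET

def index_page(title, revision_text):
    pass   # hypothetical stand-in for the real indexing routine

for event, elem in ET.iterparse("file.xml"):
    if elem.tag == "page":
        index_page(elem.findtext("title"), elem.findtext("revision/text"))
        elem.clear()   # discard the finished subtree to keep memory flat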

references:

http://effbot.org/zone/element-index.htm
http://effbot.org/zone/celementtree.htm (for best performance)
http://effbot.org/zone/element-iterparse.htm

/F 





Re: python xml DOM? pulldom? SAX?

2005-08-29 Thread Alan Kennedy
[jog]
> I want to get text out of some nodes of a huge xml file (1.5 GB). The
> structure of the xml file is something like this

[snip]

> I want to combine the text out of page:title and page:revision:text
> for every single page element. One by one I want to index these
> combined texts (so for each page, one index)
> What is the most efficient API for that:
> SAX (I don't think so)

SAX is perfect for the job. See code below.

> DOM

If your XML file is 1.5G, you'll need *lots* of RAM and virtual memory 
to load it into a DOM.

> or pulldom?

Not sure how pulldom does its pull optimizations, but I think it
still builds an in-memory object structure for your document, which will
take buckets of memory for such a big document. I could be wrong
though.

> Or should I just use XPath somehow?

Using XPath normally requires building a (D)OM, which will consume
*lots* of memory for your document, regardless of how efficient the OM is.

Best to use SAX and XPath-style expressions.

You can get a limited subset of XPath using a SAX handler and a stack.
Your problem is particularly well suited to that kind of solution. Code
that does a basic job of this for your specific problem is given below.

Note that there are a number of caveats with this code:

1. Character data handlers may get called multiple times for a single xml
text() node. This is permitted in the SAX spec, and is basically a
consequence of using buffered IO to read the contents of the xml file,
e.g. the start of a text node is at the end of the last buffer read, and
the rest of the text node is at the beginning of the next buffer.

2. This code assumes that your <revision>/<text> nodes do not contain
mixed content, i.e. a mixture of elements and text, e.g.
<revision><text>This is a piece of <b>revision</b>
text</text></revision>. The code below will fail to extract all
character data in that case.

#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
import xml.sax

class Page:

    def append(self, field_name, new_value):
        # accumulate, since characters() may deliver a text node in chunks
        old_value = ""
        if hasattr(self, field_name):
            old_value = getattr(self, field_name)
        setattr(self, field_name, "%s%s" % (old_value, new_value))

class page_matcher(xml.sax.handler.ContentHandler):

    def __init__(self, page_handler=None):
        xml.sax.handler.ContentHandler.__init__(self)
        self.page_handler = page_handler
        self.stack = []

    def check_stack(self):
        # match the current element stack against XPath-style expressions
        stack_expr = "/" + "/".join(self.stack)
        if '/parent/page' == stack_expr:
            self.page = Page()
        elif '/parent/page/title/text()' == stack_expr:
            self.page.append('title', self.chardata)
        elif '/parent/page/revision/id/text()' == stack_expr:
            self.page.append('revision_id', self.chardata)
        elif '/parent/page/revision/text/text()' == stack_expr:
            self.page.append('revision_text', self.chardata)
        else:
            pass

    def startElement(self, elemname, attrs):
        self.stack.append(elemname)
        self.check_stack()

    def endElement(self, elemname):
        if elemname == 'page' and self.page_handler:
            self.page_handler(self.page)
            self.page = None
        self.stack.pop()

    def characters(self, data):
        self.chardata = data
        self.stack.append('text()')
        self.check_stack()
        self.stack.pop()

testdoc = """
<parent>
 <page>
  <title>Page number 1</title>
  <id>p1</id>
  <revision>
   <id>r1</id>
   <text>revision one</text>
  </revision>
 </page>
 <page>
  <title>Page number 2</title>
  <id>p2</id>
  <revision>
   <id>r2</id>
   <text>revision two</text>
  </revision>
 </page>
</parent>
"""

def page_handler(new_page):
    print "New page"
    print "title\t\t%s" % new_page.title
    print "revision_id\t%s" % new_page.revision_id
    print "revision_text\t%s" % new_page.revision_text
    print

if __name__ == "__main__":
    parser = xml.sax.make_parser()
    parser.setContentHandler(page_matcher(page_handler))
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.feed(testdoc)
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
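
(To run the same handler over the real file instead of the embedded test
document: SAX parsers also accept a filename or file object, so something
like the following should work, with "file.xml" as a stand-in path.)

parser = xml.sax.make_parser()
parser.setContentHandler(page_matcher(page_handler))
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
parser.parse("file.xml")   # stand-in path for the real 1.5 GB document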

HTH,

-- 
alan kennedy
--
email alan:  http://xhaus.com/contact/alan


Re: python xml DOM? pulldom? SAX?

2005-08-29 Thread Fredrik Lundh
Alan Kennedy wrote:

> SAX is perfect for the job. See code below.

depends on your definition of perfect...

using a 20 MB version of jog's sample, and having replaced
the print statements with local variable assignments, I get the
following timings:

5 lines of cElementTree code: 7.2 seconds
60+ lines of xml.sax code: 63 seconds

(Python 2.4.1, Windows XP, Pentium 3 GHz)

/F 





Re: python xml DOM? pulldom? SAX?

2005-08-29 Thread Alan Kennedy
[Alan Kennedy]
> SAX is perfect for the job. See code below.

[Fredrik Lundh]
> depends on your definition of perfect...

Obviously, perfect is in the eye of the beholder ;-)

[Fredrik Lundh]
> using a 20 MB version of jog's sample, and having replaced
> the print statements with local variable assignments, I get the
> following timings:
>
> 5 lines of cElementTree code: 7.2 seconds
> 60+ lines of xml.sax code: 63 seconds
>
> (Python 2.4.1, Windows XP, Pentium 3 GHz)

Impressive!

At first, I thought your code sample was building a tree for the entire 
document, so I checked the API docs. It appeared to me that an event 
processing model *couldn't* obtain the text for the node when notified 
of the node: the text node is still in the future.

That's when I understood the nature of iterparse, which must generate an
event *after* the node is complete, with its subdocument reified. That's
also when I understood the meaning of the elem.clear() call at the
end: only the required section of the tree is modelled in memory at any
given time. Nice.

There are some minor inefficiencies in my pure python sax code, e.g.
building the stack expression for every evaluation, but I left them in
for didactic reasons. But even if every possible speed optimisation were
added to my python code, I doubt it would be able to match your code.
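
For example, one such optimisation, matching the stack against
precomputed tuples instead of joining it into a path string on every
event, might look like this (an untested sketch of a replacement for
page_matcher.check_stack above):

# Untested sketch: compare the element stack against precomputed tuples
# instead of building a "/parent/page/..." string on every SAX event.
PAGE_PATH = ('parent', 'page')
TITLE_PATH = ('parent', 'page', 'title', 'text()')
REV_ID_PATH = ('parent', 'page', 'revision', 'id', 'text()')
REV_TEXT_PATH = ('parent', 'page', 'revision', 'text', 'text()')

def check_stack(self):
    stack = tuple(self.stack)
    if stack == PAGE_PATH:
        self.page = Page()
    elif stack == TITLE_PATH:
        self.page.append('title', self.chardata)
    elif stack == REV_ID_PATH:
        self.page.append('revision_id', self.chardata)
    elif stack == REV_TEXT_PATH:
        self.page.append('revision_text', self.chardata)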

I'm guessing that a lot of the reason why cElementTree performs best is
that the model-building is primarily implemented in C: both of our
solutions run python code for every node in the tree, i.e. are O(N). But
yours also avoids the overhead of function calls/stack frames for
every single node event, by processing all events inside a single function.

If the SAX algorithm were implemented in C (or Java, for that matter), I
wonder if it might give comparable performance to the cElementTree code,
primarily because the data structures it is building are simpler,
compared to the tree-subsections being reified and discarded by
cElementTree. But that's not relevant here, because we're looking for
python solutions. (Aside: I can't wait to run my solution on a
fully-optimising PyPy :-)

That's another nice thing I didn't know (c)ElementTree could do.

enlightened-ly'yrs,

-- 
alan kennedy
--
email alan:  http://xhaus.com/contact/alan


Re: python xml DOM? pulldom? SAX?

2005-08-29 Thread William Park
jog [EMAIL PROTECTED] wrote:
> Hi,
> I want to get text out of some nodes of a huge xml file (1.5 GB). The
> structure of the xml file is something like this
>
> <parent>
>    <page>
>       <title>bla</title>
>       <id></id>
>       <revision>
>          <id></id>
>          <text>blablabla</text>
>       </revision>
>    </page>
>    <page>
>    </page>
> </parent>
>
> I want to combine the text out of page:title and page:revision:text for
> every single page element. One by one I want to index these combined
> texts (so for each page, one index)
> What is the most efficient API for that: SAX (I don't think so), DOM,
> or pulldom?
> Or should I just use XPath somehow?
> I don't want to do anything else with this xml file afterwards.
> I hope someone will understand me.
> Thank you very much
> Jog

I would use the Expat interface from Python, Awk, or even Bash shell. I'm
most familiar with the shell interface to Expat, which would go something
like

start() # Usage: start tag att=value ...
{
    case "$1" in
        page) unset title text ;;
    esac
}
data()  # Usage: data text
{
    case "${XML_TAG_STACK[0]}.${XML_TAG_STACK[1]}.${XML_TAG_STACK[2]}" in
        title.page.*) title="$1" ;;
        text.revision.page) text="$1" ;;
    esac
}
end()   # Usage: end tag
{
    case "$1" in
        page) echo "title=$title text=$text" ;;
    esac
}
expat -s start -d data -e end < file.xml
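
For comparison, a rough sketch of the same shape using Python's
xml.parsers.expat (untested; "file.xml" is a stand-in path):

import xml.parsers.expat

stack = []
page = {}

def start(name, attrs):
    stack.append(name)
    if name == "page":
        page.clear()

def data(text):
    # accumulate, since character data may arrive in several chunks
    if stack[-2:] == ["page", "title"]:
        page["title"] = page.get("title", "") + text
    elif stack[-3:] == ["page", "revision", "text"]:
        page["text"] = page.get("text", "") + text

def end(name):
    if name == "page":
        print page.get("title"), page.get("text")
    stack.pop()

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start
p.CharacterDataHandler = data
p.EndElementHandler = end
p.ParseFile(open("file.xml", "rb"))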

-- 
William Park [EMAIL PROTECTED], Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
   http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
  http://freshmeat.net/projects/bashdiff/