Thomas,

I honestly think that you should try to solve this problem without using XPath, or at the very least without the XPath API in JAXP. Xalan does not run XPath queries directly on a W3C DOM instance; instead, it builds its own internal tree called a DTM. Because of the way the XPath API is designed, this conversion happens every time you call evaluate(). The memory footprint of your application must be enormous, and increasing the heap size only helps for a while, until the VM ends up spending most of its time managing and housekeeping all that memory.
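If you cannot get away from XPath on a DOM entirely, one stopgap you could try (I have not measured it against your data, and the relative expression below is only my guess at the shape of your elements) is Xalan's own org.apache.xpath.CachedXPathAPI, which keeps its XPath context and the DTMs it has built alive between calls instead of rebuilding them for every evaluation. Roughly:

  import java.io.File;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.apache.xpath.CachedXPathAPI;
  import org.w3c.dom.Document;
  import org.w3c.dom.Node;
  import org.w3c.dom.NodeList;

  public class CachedXPathExample {
    public static void main(String[] args) throws Exception {
      // parse the document once into a DOM
      Document d = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().parse(new File(args[0]));

      // one CachedXPathAPI instance, reused for every query on this document
      CachedXPathAPI api = new CachedXPathAPI();
      NodeList users = api.selectNodeList(d, "/Waveset/User");
      for (int i = 0; i < users.getLength(); i++) {
        Node user = users.item(i);
        // the relative expression here is a guess; use your actual ones
        String firstName =
            api.eval(user, "*[@name='firstname']/@value").str();
        System.out.println(i + ": " + firstName);
      }
    }
  }

That only softens the symptom, though; my real suggestion is below.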

From your description, it seems that your queries are quite simple and do not involve reverse axes. Why can't you just stream through the document using StAX or SAX and pick up the values you need? No matter how fast the XPath implementation is, streaming will be several times faster on large documents like yours.
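Something along these lines with StAX would make a single pass over the file and never build a tree in memory at all. This is only a rough, untested sketch, and I am guessing from your expressions that the per-user attributes look like <Attribute name='firstname' value='...'/>, so adjust the element and attribute names to your actual schema:

  import java.io.FileInputStream;
  import javax.xml.stream.XMLInputFactory;
  import javax.xml.stream.XMLStreamConstants;
  import javax.xml.stream.XMLStreamReader;

  public class UserStreamReader {
    public static void main(String[] args) throws Exception {
      XMLInputFactory factory = XMLInputFactory.newInstance();
      XMLStreamReader reader =
          factory.createXMLStreamReader(new FileInputStream(args[0]));

      String firstName = null;
      String lastName = null;

      while (reader.hasNext()) {
        int event = reader.next();
        if (event == XMLStreamConstants.START_ELEMENT) {
          String name = reader.getLocalName();
          if ("User".equals(name)) {
            // starting a new /Waveset/User element
            firstName = null;
            lastName = null;
          } else if ("Attribute".equals(name)) {   // guessed element name
            String attrName = reader.getAttributeValue(null, "name");
            if ("firstname".equals(attrName)) {
              firstName = reader.getAttributeValue(null, "value");
            } else if ("lastname".equals(attrName)) {
              lastName = reader.getAttributeValue(null, "value");
            }
          }
        } else if (event == XMLStreamConstants.END_ELEMENT
                   && "User".equals(reader.getLocalName())) {
          // one user completely read: build your bean here
          System.out.println(firstName + " " + lastName);
        }
      }
      reader.close();
    }
  }

You build each bean when you hit the closing User tag, so at any point you only hold one user's worth of data in memory, no matter how big the file gets.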

-- Santiago

On Sep 4, 2007, at 5:22 AM, Thomas Maschutznig wrote:

I am using xalan-j 2.7.0 on Java 1.5 together with JAXP 1.3, similar to the xalan-j ApplyXPathJAXP sample. I have to read data from a 20 MB XML file with approx. 3000 nodes directly below the document root; each of these nodes contains some sub-nodes with attributes. I want to extract part of the data from this file and create Java beans, so I chose XPath expressions to pull out exactly the tag and attribute data I need. First, I search for all of those 3000 nodes directly below the root like this:
  XPath xPath = XPathFactory.newInstance().newXPath();
  org.w3c.dom.NodeList nodes = (NodeList) xPath.evaluate("/Waveset/User",
      inputSource, XPathConstants.NODESET);

Then I go through all matching nodes in a for-loop and extract data from each node's content using around 5 to 10 relative XPath expressions.
  for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println("Identity Count is : " + i);
    node = (org.w3c.dom.Element) nodes.item(i);
    firstName = xPath.evaluate("[EMAIL PROTECTED]'firstname']/@value", node);
    lastName = xPath.evaluate("[EMAIL PROTECTED]'lastname']/@value", node);
    // some more similar lines here...
  }

I can read "Identity Count is: x" for the first 60 to 90 lines very fast, within 2 or 3 seconds, but then processing seems to slow down, and at a count of around 1500 it takes up to 10 seconds, and later even more, to process a single node (even after JVM and GC options were tuned; before that it was significantly worse). I tuned the JVM options, maximizing heap space and resizing the eden space; I can see garbage collections happen every 20 to 30 seconds. My JVM options (on Windows 2003 x64, JDK 1.5.0_11 64-bit) right now are: -Xms4g -Xmx4g -XX:NewSize=2g -XX:ThreadStackSize=16384 -XX:+UseParallelGC -server -XX:+AggressiveOpts

Classpath is: .:IMR_Import_Lib.jar:antlr-2.7.6.jar:asm.jar:asm-attrs.jar:c3p0-0.9.1.jar:cglib-2.1.3.jar:commons-collections-2.1.1.jar:commons-logging-1.0.4.jar:dom4j-1.6.1.jar:ejb3-persistence.jar:hibernate3.jar:jdbc2_0-stdext.jar:jta.jar:log4j-1.2.14.jar:ojdbc14.jar:serializer.jar:xalan.jar:xercesImpl.jar:xml-apis.jar:hibernate-annotations.jar:hibernate-commons-annotations.jar

(there is a .properties file on .)

I also tried a modified version of the first xPath.evaluate() call, explicitly creating a Document object from the XML file, to no avail:
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document d = db.parse(new File(this.xmlFilePathName));

    XPath xPath = XPathFactory.newInstance().newXPath();
    NodeList nodes = (NodeList) xPath.evaluate("/Waveset/User", d,
        XPathConstants.NODESET);

I am a little stuck here with the drastically decreasing performance about halfway through the XML file. Did I miss anything in my code? I know that using as many XPath expressions as I do is very expensive, but why would the second half of the file take five times as long as the first half, while the first 100 /Waveset/User nodes are parsed within seconds?

 Thomas
