Thomas,
I honestly think you should try to solve this problem without
using XPath, or at the very least without the XPath API in JAXP.
Xalan does not run XPath queries directly on a W3C DOM instance;
instead it builds its own internal tree, called a DTM, from the
DOM. Because of how the XPath API is designed, this conversion can
happen every time you call evaluate(). The memory footprint of your
application must be enormous, and increasing the heap size only
helps until the VM has to spend its time managing all that garbage.
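If you do stay on the JAXP XPath API for now, at least compile each expression once and reuse it across nodes. Here is a minimal sketch; the Attribute/name/value element and attribute names are my guesses (the real ones were mangled by the list archive), and note that compiling only saves re-parsing the expression text on every call, not the DOM-to-DTM conversion:

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CompiledXPathDemo {

    // Parse an XML string into a DOM Document (stands in for the 20MB file).
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Compile the expression once and reuse it; this avoids re-parsing
    // the expression text on each call, though Xalan may still rebuild
    // its internal DTM from the DOM on every evaluate().
    public static String firstName(Document doc) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        XPathExpression expr =
                xPath.compile("/Waveset/User/Attribute[@name='firstname']/@value");
        return expr.evaluate(doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<Waveset><User>"
                + "<Attribute name='firstname' value='Ada'/>"
                + "</User></Waveset>");
        System.out.println(firstName(doc)); // prints "Ada"
    }
}
```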
From your description, it seems that your queries are quite simple
and do not involve reverse axes. Why can't you just stream through
the document using StAX or SAX and pick up the values you need? No
matter how fast the XPath implementation, streaming will be several
times faster on large documents like yours.
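For example, a streaming pass with StAX might look like the sketch below. The Attribute/name/value names are guesses based on your snippet (the archive mangled the real ones), and on Java 1.5 you would need a standalone StAX implementation jar on the classpath, since the API only ships with the JDK from Java 6:

```java
import java.io.Reader;
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class WavesetStream {

    // Single forward pass over the document: no DOM, no DTM, and memory
    // use stays flat no matter how large the file is.
    public static String extractNames(Reader in) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        StringBuilder out = new StringBuilder();
        String first = null, last = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "Attribute".equals(reader.getLocalName())) {
                String name = reader.getAttributeValue(null, "name");
                if ("firstname".equals(name)) first = reader.getAttributeValue(null, "value");
                if ("lastname".equals(name))  last  = reader.getAttributeValue(null, "value");
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "User".equals(reader.getLocalName())) {
                // one User element fully read: emit a record, reset state
                out.append(first).append(' ').append(last).append('\n');
                first = last = null;
            }
        }
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline document; the real code would wrap a FileReader.
        String xml = "<Waveset><User>"
                + "<Attribute name='firstname' value='Ada'/>"
                + "<Attribute name='lastname' value='Lovelace'/>"
                + "</User></Waveset>";
        System.out.print(extractNames(new StringReader(xml)));
    }
}
```

Here you build each bean as soon as its User element closes, so only one record's worth of state is ever held in memory.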
-- Santiago
On Sep 4, 2007, at 5:22 AM, Thomas Maschutznig wrote:
I am using xalan-j 2.7.0 on Java 1.5 together with JAXP 1.3,
similar to the xalan-j ApplyXPathJAXP sample. I have to read data
from a 20MB XML file with approx. 3000 nodes directly below the
document root; each of these nodes contains some sub-nodes with
attributes. I want to extract part of the data from this file and
create Java beans, so I chose XPath expressions to pick out exactly
the tag and attribute data I need.
First, I search for all of those 3000 nodes directly below root
like this:
XPath xPath = XPathFactory.newInstance().newXPath();
org.w3c.dom.NodeList nodes = (NodeList) xPath.evaluate("/Waveset/User",
    inputSource, XPathConstants.NODESET);
Then I go through all matching nodes in a for-loop and extract data
from each node's content using around 5 to 10 relative XPath
expressions.
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println("Identity Count is : " + i);
    node = (org.w3c.dom.Element) nodes.item(i);
    // the child element names here were mangled by the list archive;
    // '*' is a stand-in for the original element name
    firstName = xPath.evaluate("*[@name='firstname']/@value", node);
    lastName = xPath.evaluate("*[@name='lastname']/@value", node);
    // some more similar lines here...
}
I can read "Identity Count is : x" for the first 60 to 90 lines
very fast, within 2 or 3 seconds, but then it starts to slow down,
and finally, at a count of around 1500, a single node takes up to
10 seconds, and later maybe even more, to be processed (even after
JVM and GC options were tuned; before that it was significantly
worse).
I tuned JVM options, maximizing heap-space and resizing eden-space;
I can see garbage collections happen every 20 to 30 seconds. My JVM
options (on Windows 2003 x64, jdk 1.5.0_11 64bit) right now are:
-Xms4g -Xmx4g -XX:NewSize=2g -XX:ThreadStackSize=16384 -XX:+UseParallelGC -server -XX:+AggressiveOpts
Classpath is: .:IMR_Import_Lib.jar:antlr-2.7.6.jar:asm.jar:asm-attrs.jar:c3p0-0.9.1.jar:cglib-2.1.3.jar:commons-collections-2.1.1.jar:commons-logging-1.0.4.jar:dom4j-1.6.1.jar:ejb3-persistence.jar:hibernate3.jar:jdbc2_0-stdext.jar:jta.jar:log4j-1.2.14.jar:ojdbc14.jar:serializer.jar:xalan.jar:xercesImpl.jar:xml-apis.jar:hibernate-annotations.jar:hibernate-commons-annotations.jar
(there is a .properties file on .)
I also tried a modified version of the first xPath.evaluate(),
explicitly creating a Document object from the XML, to no avail:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document d = db.parse(new File(this.xmlFilePathName));
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xPath.evaluate("/Waveset/User", d,
    XPathConstants.NODESET);
I am a little stuck here with the drastically decreasing
performance about halfway through the XML file. Did I miss anything
in my code? I know using a lot of XPath expressions like I do is
very expensive, but why would the second half of the file take 5
times as long as the first one when the first 100 /Waveset/User
nodes are parsed within seconds?
Thomas