I am using Xalan-J 2.7.0 on Java 1.5 together with JAXP 1.3, similar to the Xalan-J ApplyXPathJAXP sample. I have to read data from a 20 MB XML file with approx. 3000 nodes directly below the document root; each of these nodes contains some sub-nodes with attributes. I want to extract only part of the data from this file and create Java beans from it, so I chose XPath expressions to pull out exactly the tag and attribute data I need. First, I search for all of those 3000 nodes directly below the root like this:
  XPath xPath = XPathFactory.newInstance().newXPath();
  org.w3c.dom.NodeList nodes = (NodeList) xPath.evaluate("/Waveset/User", inputSource, XPathConstants.NODESET);

Then I go through all matching nodes in a for-loop and extract data from each node's content using around 5 to 10 relative XPath expressions.
  for(int i=0; i < nodes.getLength(); i++) {
    System.out.println("Identity Count is : " + i);
    node = (org.w3c.dom.Element) nodes.item(i);
    firstName = xPath.evaluate("[EMAIL PROTECTED]'firstname']/@value", node);
    lastName = xPath.evaluate("[EMAIL PROTECTED]'lastname']/@value", node);
    // some more similar lines here...
  }
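For reference, a minimal self-contained sketch of the two steps above, run against a tiny in-memory document instead of the real file. The sub-element name `Attribute`, its `name`/`value` attributes, and the sample data are assumptions made up for illustration; the real file's structure may differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class UserExtractSketch {

    // Hypothetical sample data; element and attribute names are assumptions.
    static final String SAMPLE =
        "<Waveset>"
        + "<User><Attribute name='firstname' value='Ada'/>"
        + "<Attribute name='lastname' value='Lovelace'/></User>"
        + "<User><Attribute name='firstname' value='Alan'/>"
        + "<Attribute name='lastname' value='Turing'/></User>"
        + "</Waveset>";

    // Returns one {firstName, lastName} pair per /Waveset/User node.
    static List<String[]> extract(String xml) throws XPathExpressionException {
        XPath xPath = XPathFactory.newInstance().newXPath();

        // Step 1: select all User nodes directly below the root.
        NodeList nodes = (NodeList) xPath.evaluate(
            "/Waveset/User",
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);

        // Step 2: evaluate relative expressions with each User node as context.
        List<String[]> result = new ArrayList<String[]>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node node = nodes.item(i);
            String firstName = xPath.evaluate("Attribute[@name='firstname']/@value", node);
            String lastName = xPath.evaluate("Attribute[@name='lastname']/@value", node);
            result.add(new String[] { firstName, lastName });
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String[] user : extract(SAMPLE)) {
            System.out.println(user[0] + " " + user[1]);
        }
    }
}
```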

I can read "Identity Count is : x" for the first 60 to 90 lines very quickly, within 2 or 3 seconds, but then processing starts to slow down. At a count of around 1500 it takes up to 10 seconds, and later possibly even more, to process a single node. This is after tuning the JVM and GC options; before that it was significantly worse. I maximized the heap space and resized the eden space, and I can see garbage collections happening every 20 to 30 seconds. My JVM options (on Windows 2003 x64, JDK 1.5.0_11 64-bit) are currently: -Xms4g -Xmx4g -XX:NewSize=2g -XX:ThreadStackSize=16384 -XX:+UseParallelGC -server -XX:+AggressiveOpts
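To put numbers on the slowdown rather than watching the console, one option is to time the loop in batches. The following is only a sketch under assumptions: the document is synthetic, the `Attribute` element with `name`/`value` attributes is a made-up structure, and the class and method names are hypothetical. It makes no claim about what the timings will look like on any particular JVM or XPath implementation.

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XPathTimingSketch {

    // Builds a synthetic document with the given number of User elements
    // (hypothetical structure, not the real file's schema).
    static String syntheticXml(int users) {
        StringBuilder sb = new StringBuilder("<Waveset>");
        for (int i = 0; i < users; i++) {
            sb.append("<User><Attribute name='firstname' value='f").append(i)
              .append("'/><Attribute name='lastname' value='l").append(i)
              .append("'/></User>");
        }
        return sb.append("</Waveset>").toString();
    }

    // Returns the elapsed milliseconds for each batch of batchSize User nodes.
    static long[] timeBatches(String xml, int batchSize) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xPath.evaluate(
            "/Waveset/User", doc, XPathConstants.NODESET);

        int total = nodes.getLength();
        int batches = (total + batchSize - 1) / batchSize; // round up
        long[] millis = new long[batches];
        for (int b = 0; b < batches; b++) {
            long start = System.nanoTime();
            int end = Math.min((b + 1) * batchSize, total);
            for (int i = b * batchSize; i < end; i++) {
                Node node = nodes.item(i);
                // Same kind of relative lookups as in the loop above.
                xPath.evaluate("Attribute[@name='firstname']/@value", node);
                xPath.evaluate("Attribute[@name='lastname']/@value", node);
            }
            millis[b] = (System.nanoTime() - start) / 1000000L;
        }
        return millis;
    }

    public static void main(String[] args) throws Exception {
        long[] millis = timeBatches(syntheticXml(3000), 100);
        for (int b = 0; b < millis.length; b++) {
            System.out.println("batch " + b + ": " + millis[b] + " ms");
        }
    }
}
```

If the per-batch times grow steadily with the batch index, the cost depends on the node's position in the document rather than on its content.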

Classpath is: .:IMR_Import_Lib.jar:antlr-2.7.6.jar:asm.jar:asm-attrs.jar:c3p0-0.9.1.jar:cglib-2.1.3.jar:commons-collections-2.1.1.jar:commons-logging-1.0.4.jar:dom4j-1.6.1.jar:ejb3-persistence.jar:hibernate3.jar:jdbc2_0-stdext.jar:jta.jar:log4j-1.2.14.jar:ojdbc14.jar:serializer.jar:xalan.jar:xercesImpl.jar:xml-apis.jar:hibernate-annotations.jar:hibernate-commons-annotations.jar

(There is a .properties file in the current directory ".", which is why "." is on the classpath.)

I also tried a modified version of the first xPath.evaluate(), explicitly creating a Document object of the XML, to no avail:
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    DocumentBuilder db = dbf.newDocumentBuilder();
    Document d = db.parse(new File(this.xmlFilePathName));

    XPath xPath = XPathFactory.newInstance().newXPath();

    NodeList nodes = (NodeList) xPath.evaluate("/Waveset/User", d, XPathConstants.NODESET);

I am a little stuck here with the performance dropping so drastically about halfway through the XML file. Did I miss anything in my code? I know that using as many XPath expressions as I do is expensive, but why would the second half of the file take 5 times as long as the first, when the first 100 /Waveset/User nodes are processed within seconds?

 Thomas
