I am using Xalan-J 2.7.0 on Java 1.5 together with JAXP 1.3, similar
to the Xalan-J ApplyXPathJAXP sample. I have to read data from a 20 MB
XML file with approx. 3000 nodes directly below the document root;
each one of these nodes contains some sub-nodes with attributes. I
want to extract part of the data from this file and create Java beans,
so I chose XPath expressions to extract exactly the tag and
attribute data I need.
First, I search for all of those 3000 nodes directly below root like
this:
XPath xPath = XPathFactory.newInstance().newXPath();
org.w3c.dom.NodeList nodes = (NodeList) xPath.evaluate(
    "/Waveset/User", inputSource, XPathConstants.NODESET);
Then I go through all matching nodes in a for-loop and extract data
from each node's content using around 5 to 10 relative XPath
expressions.
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println("Identity Count is : " + i);
    node = (org.w3c.dom.Element) nodes.item(i);
    firstName = xPath.evaluate("[EMAIL PROTECTED]'firstname']/@value", node);
    lastName = xPath.evaluate("[EMAIL PROTECTED]'lastname']/@value", node);
    // some more similar lines here...
}
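For reference, the two-step approach described above can be sketched as a small self-contained program. This is only an illustration on a tiny in-memory document: the class name, the sample data, and in particular the Attribute element with its name and value attributes are assumptions made for the sketch, since the relative expressions in the snippet above are partly obscured.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RelativeXPathSketch {
    // Hypothetical sample data; the real file's structure is an assumption here.
    static final String SAMPLE =
        "<Waveset>"
      + "<User>"
      + "<Attribute name='firstname' value='Ada'/>"
      + "<Attribute name='lastname' value='Lovelace'/>"
      + "</User>"
      + "<User>"
      + "<Attribute name='firstname' value='Alan'/>"
      + "<Attribute name='lastname' value='Turing'/>"
      + "</User>"
      + "</Waveset>";

    // Returns one {firstName, lastName} pair per /Waveset/User node.
    static List<String[]> extract(String xml) throws Exception {
        // Build a DOM tree once, then evaluate all expressions against it.
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();

        // Step 1: select every User node directly below the root.
        NodeList users = (NodeList) xPath.evaluate(
            "/Waveset/User", doc, XPathConstants.NODESET);

        // Step 2: evaluate relative expressions with each User as context node.
        List<String[]> result = new ArrayList<String[]>();
        for (int i = 0; i < users.getLength(); i++) {
            Node user = users.item(i);
            String first = xPath.evaluate(
                "Attribute[@name='firstname']/@value", user);
            String last = xPath.evaluate(
                "Attribute[@name='lastname']/@value", user);
            result.add(new String[] { first, last });
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String[] name : extract(SAMPLE)) {
            System.out.println(name[0] + " " + name[1]);
        }
    }
}
```

The two-argument xPath.evaluate(expression, item) form returns the string value of the result, which is why no cast is needed inside the loop.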
I can read "Identity Count is : x" for the first 60 to 90 nodes very
fast, within 2 or 3 seconds, but then it starts slowing down, and at
a count of around 1500 it takes up to 10 seconds, later maybe even
more, for one node to be processed (and that is after the JVM and GC
options were tuned; before that it was significantly worse).
I tuned JVM options, maximizing heap-space and resizing eden-space; I
can see garbage collections happen every 20 to 30 seconds. My JVM
options (on Windows 2003 x64, jdk 1.5.0_11 64bit) right now are:
-Xms4g -Xmx4g -XX:NewSize=2g -XX:ThreadStackSize=16384
-XX:+UseParallelGC -server -XX:+AggressiveOpts
Classpath is:
.:IMR_Import_Lib.jar:antlr-2.7.6.jar:asm.jar:asm-attrs.jar:c3p0-0.9.1.jar:cglib-2.1.3.jar:commons-collections-2.1.1.jar:commons-logging-1.0.4.jar:dom4j-1.6.1.jar:ejb3-persistence.jar:hibernate3.jar:jdbc2_0-stdext.jar:jta.jar:log4j-1.2.14.jar:ojdbc14.jar:serializer.jar:xalan.jar:xercesImpl.jar:xml-apis.jar:hibernate-annotations.jar:hibernate-commons-annotations.jar
(There is a .properties file in the current directory, ".", which is
why it is on the classpath.)
I also tried a modified version of the first xPath.evaluate(),
explicitly creating a Document object of the XML, to no avail:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document d = db.parse(new File(this.xmlFilePathName));
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xPath.evaluate(
    "/Waveset/User", d, XPathConstants.NODESET);
I am a little stuck here with the drastically decreasing performance
around halfway through the XML file. Did I miss anything in my code?
I know that using a lot of XPath expressions like I do is very
expensive, but why would the second half of the file take 5 times as
long as the first one, while the first 100 /Waveset/User nodes are
processed within seconds?
Thomas