Thomas,
I honestly think you should try to solve this problem without
using XPath, or at the very least without the XPath API in JAXP.
Xalan does not run XPath queries directly on a W3C DOM instance;
instead it builds its own internal tree, called a DTM, from the
DOM. Because of how the XPath API is designed, this conversion can
happen every time you call evaluate(). The memory footprint of your
application must be enormous, and increasing the heap size only
helps until the VM has to spend its time managing all that garbage.
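If you do stay on the JAXP XPath API for now, at least compile each expression once and reuse it across nodes. Here is a minimal sketch; the Attribute/name/value element and attribute names are my guesses (the real ones were mangled by the list archive), and note that compiling only saves re-parsing the expression text on every call, not the DOM-to-DTM conversion:

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CompiledXPathDemo {

    // Parse an XML string into a DOM Document (stands in for the 20MB file).
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Compile the expression once and reuse it; this avoids re-parsing
    // the expression text on each call, though Xalan may still rebuild
    // its internal DTM from the DOM on every evaluate().
    public static String firstName(Document doc) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        XPathExpression expr =
                xPath.compile("/Waveset/User/Attribute[@name='firstname']/@value");
        return expr.evaluate(doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<Waveset><User>"
                + "<Attribute name='firstname' value='Ada'/>"
                + "</User></Waveset>");
        System.out.println(firstName(doc)); // prints "Ada"
    }
}
```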
From your description, it seems that your queries are quite simple
and do not involve reverse axes. Why can't you just stream through
the document using StAX or SAX and pick up the values you need? No
matter how fast the XPath implementation, streaming will be several
times faster on large documents like yours.
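For example, a streaming pass with StAX might look like the sketch below. The Attribute/name/value names are guesses based on your snippet (the archive mangled the real ones), and on Java 1.5 you would need a standalone StAX implementation jar on the classpath, since the API only ships with the JDK from Java 6:

```java
import java.io.Reader;
import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class WavesetStream {

    // Single forward pass over the document: no DOM, no DTM, and memory
    // use stays flat no matter how large the file is.
    public static String extractNames(Reader in) throws Exception {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
        StringBuilder out = new StringBuilder();
        String first = null, last = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "Attribute".equals(reader.getLocalName())) {
                String name = reader.getAttributeValue(null, "name");
                if ("firstname".equals(name)) first = reader.getAttributeValue(null, "value");
                if ("lastname".equals(name))  last  = reader.getAttributeValue(null, "value");
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "User".equals(reader.getLocalName())) {
                // one User element fully read: emit a record, reset state
                out.append(first).append(' ').append(last).append('\n');
                first = last = null;
            }
        }
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline document; the real code would wrap a FileReader.
        String xml = "<Waveset><User>"
                + "<Attribute name='firstname' value='Ada'/>"
                + "<Attribute name='lastname' value='Lovelace'/>"
                + "</User></Waveset>";
        System.out.print(extractNames(new StringReader(xml)));
    }
}
```

Here you build each bean as soon as its User element closes, so only one record's worth of state is ever held in memory.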
-- Santiago
On Sep 4, 2007, at 5:22 AM, Thomas Maschutznig wrote:
I am using xalan-j 2.7.0 on Java 1.5 together with JAXP 1.3,
similar to the xalan-j ApplyXPathJAXP sample. I have to read data
from a 20MB XML file with approx. 3000 nodes directly below the
document root; each of these nodes contains some sub-nodes with
attributes. I want to extract part of the data from this file and
create Java beans, so I chose XPath expressions to pick out exactly
the tag and attribute data I need.
First, I search for all of those 3000 nodes directly below root
like this:
XPath xPath = XPathFactory.newInstance().newXPath();
org.w3c.dom.NodeList nodes = (NodeList) xPath.evaluate("/Waveset/User",
    inputSource, XPathConstants.NODESET);
Then I go through all matching nodes in a for-loop and extract data
from each node's content using around 5 to 10 relative XPath
expressions.
for (int i = 0; i < nodes.getLength(); i++) {
    System.out.println("Identity Count is : " + i);
    node = (org.w3c.dom.Element) nodes.item(i);
    // the child element names here were mangled by the list archive;
    // '*' is a stand-in for the original element name
    firstName = xPath.evaluate("*[@name='firstname']/@value", node);
    lastName = xPath.evaluate("*[@name='lastname']/@value", node);
    // some more similar lines here...
}
I can read "Identity Count is : x" for the first 60 to 90 lines
very fast, within 2 or 3 seconds, but then it starts to slow down,
and finally, at a count of around 1500, a single node takes up to
10 seconds, and later maybe even more, to be processed (even after
JVM and GC options were tuned; before that it was significantly
worse).
I tuned JVM options, maximizing heap-space and resizing eden-space;
I can see garbage collections happen every 20 to 30 seconds. My JVM
options (on Windows 2003 x64, jdk 1.5.0_11 64bit) right now are:
-Xms4g -Xmx4g -XX:NewSize=2g -XX:ThreadStackSize=16384 -XX:+UseParallelGC -server -XX:+AggressiveOpts
Classpath is: .:IMR_Import_Lib.jar:antlr-2.7.6.jar:asm.jar:asm-attrs.jar:c3p0-0.9.1.jar:cglib-2.1.3.jar:commons-collections-2.1.1.jar:commons-logging-1.0.4.jar:dom4j-1.6.1.jar:ejb3-persistence.jar:hibernate3.jar:jdbc2_0-stdext.jar:jta.jar:log4j-1.2.14.jar:ojdbc14.jar:serializer.jar:xalan.jar:xercesImpl.jar:xml-apis.jar:hibernate-annotations.jar:hibernate-commons-annotations.jar
(there is a .properties file on .)
I also tried a modified version of the first xPath.evaluate(),
explicitly creating a Document object from the XML, to no avail:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document d = db.parse(new File(this.xmlFilePathName));
XPath xPath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xPath.evaluate("/Waveset/User", d,
    XPathConstants.NODESET);
I am a little stuck here with the drastically decreasing
performance about halfway through the XML file. Did I miss anything
in my code? I know using a lot of XPath expressions like I do is
very expensive, but why would the second half of the file take 5
times as long as the first one when the first 100 /Waveset/User
nodes are parsed within seconds?
Thomas