http://issues.apache.org/bugzilla/show_bug.cgi?id=28146
'out of memory' when transforming large result sets.

------- Additional Comments From [EMAIL PROTECTED] 2004-04-16 19:22 -------

----------------------------
Problem:

This particular problem involves a stylesheet, stylesheet.xsl, and an input data file, data.xml. The data.xml file is essentially like this:

<DOC>
  <RECORD num="1">
    <FIELD name="FIELDNAME1">val-a</FIELD>
    <FIELD name="FIELDNAME2">val-b</FIELD>
    ...
  </RECORD>
  <RECORD num="2">
    <FIELD name="FIELDNAME1">val-c</FIELD>
    <FIELD name="FIELDNAME2">val-d</FIELD>
    ...
  </RECORD>
  ...
</DOC>

The original problem has about 20,000 <RECORD> elements, each with about 175 <FIELD> child elements on average. The data.xml file is about 130 megabytes. The problem is that, for a large enough input data.xml, the transformation of the input XML to output HTML runs out of memory in the JVM.

----------------------------------------------
Analysis:

From this original data.xml file I created data100.xml with the first 100 <RECORD> elements, data1000.xml with the first 1000, and likewise data5000.xml and data10000.xml. Here are my findings for Xalan-J interpretive, produced through the JAXP API:

Number of RECORDs   Wall clock time (seconds)   Peak memory usage (MB)
  100                  2.0                         23
 1000                  9.8                         53
 5000                 52.0                        286
10000                109.0                        456
20356                306.0                        569

My PC:
Computer: Intel Pentium M processor, 1600 MHz, 1,047,472 KB RAM
System:   Microsoft Windows 2000, Service Pack 4

The stylesheet and the largest data.xml document take over 300 seconds on my more or less dedicated PC, and for most of that time the CPU is pegged at 100%. The output HTML file is over 300 megabytes. If this were run inside a servlet, I think there would be timeout issues with the browser. There would also be enormous network demands to send such large generated HTML files from a web server to the browsers.
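The truncated test files described above can be produced by a small streaming script, so that the 130 MB original never has to be loaded whole just to cut it down. This is a minimal sketch in Python's standard library, not what was actually used for the measurements; the element names come from the data.xml sample above, and the in-memory sample input is a tiny stand-in for the real file:

```python
import io
import xml.etree.ElementTree as ET

def truncate_records(src, max_records):
    """Stream-parse src and return a new <DOC> tree holding only the
    first max_records <RECORD> elements; parsing stops early, so the
    rest of the input file is never read."""
    out_root = ET.Element("DOC")
    kept = 0
    for _event, elem in ET.iterparse(src, events=("end",)):
        if elem.tag == "RECORD":
            out_root.append(elem)
            kept += 1
            if kept >= max_records:
                break  # stop reading; later records are never parsed
    return ET.ElementTree(out_root)

# Tiny stand-in for the 130 MB data.xml described above.
sample = io.BytesIO(
    b"<DOC>"
    b"<RECORD num='1'><FIELD name='FIELDNAME1'>val-a</FIELD></RECORD>"
    b"<RECORD num='2'><FIELD name='FIELDNAME1'>val-c</FIELD></RECORD>"
    b"<RECORD num='3'><FIELD name='FIELDNAME1'>val-e</FIELD></RECORD>"
    b"</DOC>"
)
tree = truncate_records(sample, 2)
print(len(tree.getroot().findall("RECORD")))  # 2
```

Writing the result with ET.ElementTree.write() to data100.xml, data1000.xml, and so on would reproduce the test inputs.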
Perhaps servlets and browsers are not involved, and the goal really is to generate HTML files even larger than 300 megabytes. Another XSLT processor that I tried (not Xalan) showed similar growth in processing time and memory usage as the input XML file got larger, and it ran out of memory on the largest data.xml input. In any case, I have no doubt that with a larger input file, a smaller JVM heap size, or more things running on my computer, I would have run out of memory too. I did a few runs with Xalan's XSLTC processor, which ran faster than Xalan-J interpretive but used more memory; I didn't test it extensively because memory seems to be the issue.

----------------------------------------
Conclusion:

The input data.xml file is always read into memory in an internal format so that XPath expressions can be evaluated and sequences of matching nodes returned. With a large enough input XML file, Xalan will run out of memory. This is true for both Xalan-J interpretive and Xalan-J XSLTC, and it seems to be true for the other XSLT processor that I tried. With some effort it might be possible to analyze and reduce Xalan's memory usage so that somewhat larger input XML could be processed, but without re-architecting there will always be a limit on the size of the input XML files. It is hard to guess, but from my experience, saving even 10% of CPU or 10% of memory could take considerable effort without a redesign. My initial look at object creation showed that many org.apache.xpath.objects.XNodeSet objects were consuming the memory. These objects already looked "thin" to me, so there is no obvious low-hanging fruit in that area; a rework of the stylesheet was more likely to reduce the number of these objects created.
The stylesheet has this (more or less):

<xsl:for-each select="DOC/RECORD[.]">
  <TR bgcolor="#C0C0C0">
    <xsl:if test="$recnumber='y'">
      <TD bgcolor="#C0C0C0">
        <font color="#000000" size="-1" face="sans-serif">
          <xsl:value-of select="position()"/>
        </font>
      </TD>
    </xsl:if>
    <TD bgcolor="#C0C0C0">
      <font color="#000000" size="-1" face="sans-serif">
        <xsl:value-of select="FIELD[@name='FIELDNAME1']"/>
      </font>
    </TD>
    <TD bgcolor="#C0C0C0">
      <font color="#000000" size="-1" face="sans-serif">
        <xsl:value-of select="FIELD[@name='FIELDNAME2']"/>
      </font>
    </TD>
    ...

with one <TD>...</TD> for each field name ('FIELDNAME1', 'FIELDNAME2', ...). Henry Z. suggested that I try this alternative:

<TR bgcolor="#C0C0C0">
  <xsl:if test="$recnumber='y'">
    <TD bgcolor="#C0C0C0">
      <font color="#000000" size="-1" face="sans-serif">
        <xsl:value-of select="position()"/>
      </font>
    </TD>
  </xsl:if>
  <!-- ============================================================ -->
  <xsl:for-each select="FIELD">
    <xsl:variable name="nameattr" select="string(@name)"/>
    <xsl:choose>
      <xsl:when test="$nameattr='FIELDNAME1'">
        <TD bgcolor="#C0C0C0">
          <font color="#000000" size="-1" face="sans-serif">
            <xsl:value-of select="."/>
          </font>
        </TD>
      </xsl:when>
      <xsl:when test="$nameattr='FIELDNAME2'">
        <TD bgcolor="#C0C0C0">
          <font color="#000000" size="-1" face="sans-serif">
            <xsl:value-of select="."/>
          </font>
        </TD>
      </xsl:when>
      ...

This reduced the number of org.apache.xpath.objects.XNodeSet objects that were created, but the output was not always the same as before. Henry and I both knew this was a possibility: I had hoped that each <RECORD> had exactly the same <FIELD> elements in the same order, but that was not the case, so the output differed. With more knowledge about the data, perhaps changes could be made in the stylesheet that would reduce the memory pressure.
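The reason the reworked stylesheet's output could differ is that a single pass over the <FIELD> children emits cells in document order, while the original version looked each field up by name and so emitted cells in a fixed column order. A small Python sketch of the two strategies makes the difference concrete (the field names and values here are illustrative, matching the sample data above):

```python
# Two strategies for turning one <RECORD>'s fields into table cells.
# COLUMNS is the fixed column order the original stylesheet hard-coded.
COLUMNS = ["FIELDNAME1", "FIELDNAME2"]

record = [("FIELDNAME2", "val-b"), ("FIELDNAME1", "val-a")]  # out of order

# Original approach: one lookup per column -> stable column order, but
# each lookup scans the fields (roughly what created one XNodeSet per
# <xsl:value-of> in Xalan).
by_lookup = [next((v for n, v in record if n == name), "") for name in COLUMNS]

# Reworked approach: one pass in document order -> cheaper, but the cell
# order now follows the input, which differed from record to record.
by_pass = [v for n, v in record if n in COLUMNS]

print(by_lookup)  # ['val-a', 'val-b']
print(by_pass)    # ['val-b', 'val-a']
```

Building a name-to-value dictionary once per record and then indexing it in column order would give both the single pass and the stable output, which is the kind of data-specific rework the comment above alludes to.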
For this particular stylesheet it seems that the RECORDs are just processed sequentially from the input, so one doesn't really need the whole input XML document in memory at the same time. But the XML parser cannot anticipate what a particular stylesheet will do with its XPath expressions. The end user's knowledge that the records are processed in sequence makes it possible to cut the input XML data file into pieces. If each piece is run through a separate XSLT transformation and those results are then glued back together, I think larger input XML could be processed.

Regards,
Brian Minchau

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
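The cut-and-glue idea can be sketched in a few lines of standard-library Python. This is only an illustration of the strategy, not production code: stdlib Python has no XSLT engine, so transform_piece is a hypothetical stand-in for one XSLT run over one piece (in practice each piece would go through JAXP/Xalan), and the sample input is a tiny stand-in for the real data.xml:

```python
import io
import xml.etree.ElementTree as ET

def transform_piece(recs):
    """Stand-in for one XSLT transformation over one piece: render each
    RECORD's fields as table cells in one HTML row."""
    rows = []
    for rec in recs:
        cells = "".join("<TD>%s</TD>" % (f.text or "") for f in rec)
        rows.append("<TR>%s</TR>" % cells)
    return "".join(rows)

def transform_in_pieces(src, piece_size):
    """Split the input into pieces of piece_size RECORDs, transform each
    piece independently, and glue the results back together, so only one
    piece's records are fully held in memory at a time."""
    out, piece = [], []
    for _event, elem in ET.iterparse(src, events=("end",)):
        if elem.tag != "RECORD":
            continue
        piece.append(elem)
        if len(piece) == piece_size:
            out.append(transform_piece(piece))
            for e in piece:
                e.clear()  # release this piece's children before the next
            piece = []
    if piece:
        out.append(transform_piece(piece))  # final partial piece
    return "".join(out)

# Tiny stand-in for the real data.xml.
sample = io.BytesIO(
    b"<DOC>"
    b"<RECORD><FIELD name='FIELDNAME1'>a</FIELD>"
    b"<FIELD name='FIELDNAME2'>b</FIELD></RECORD>"
    b"<RECORD><FIELD name='FIELDNAME1'>c</FIELD></RECORD>"
    b"<RECORD><FIELD name='FIELDNAME1'>e</FIELD></RECORD>"
    b"</DOC>"
)
html = transform_in_pieces(sample, piece_size=2)
print(html)
```

Because the pieces are glued with plain concatenation, this only works when the stylesheet's output for the whole document really is the concatenation of its outputs for the pieces, which is exactly the sequential-processing property observed above.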
