
http://issues.apache.org/bugzilla/show_bug.cgi?id=28146

'out of memory' when transforming large result sets.





------- Additional Comments From [EMAIL PROTECTED]  2004-04-16 19:22 -------
----------------------------
Problem:

This particular problem involves a stylesheet, stylesheet.xsl, and an input data
file, data.xml.

The data.xml file is essentially like this:
<DOC>
  <RECORD num="1">
    <FIELD name="FIELDNAME1">val-a</FIELD>
    <FIELD name="FIELDNAME2">val-b</FIELD>
    ...
  </RECORD>
  <RECORD num="2">
    <FIELD name="FIELDNAME1">val-c</FIELD>
    <FIELD name="FIELDNAME2">val-d</FIELD>
    ...
  </RECORD>
  ...
</DOC>

The original problem has about 20,000 <RECORD> elements, each with about 175 
<FIELD> child elements on average.  The data.xml file is about 130 megabytes.

The problem is that for a large enough input data.xml, the transformation of the 
input XML to output HTML runs out of memory in the JVM.


----------------------------------------------
Analysis:

From this original data.xml file I created data100.xml with the first 100 
<RECORD> elements, data1000.xml with the first 1000, and likewise data5000.xml 
and data10000.xml.

Here are my findings for Xalan-J interpretive. I produced these results using 
the JAXP API.
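A minimal JAXP harness along these lines drives such a measurement run. This is an illustrative sketch, not the exact code used for the report; the real runs read stylesheet.xsl and data.xml from disk, while inline stand-in strings are used here so the example is self-contained:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class JaxpTransformDemo {

    // Run one XSLT transformation through the standard JAXP API.
    static String transform(String xsltSource, String xmlSource) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer t = tf.newTransformer(new StreamSource(new StringReader(xsltSource)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xmlSource)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Stand-ins for stylesheet.xsl and data.xml (hypothetical content,
        // shaped like the report's DOC/RECORD/FIELD data).
        String xslt = "<xsl:stylesheet version='1.0'"
                + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:template match='/'>"
                + "<out><xsl:value-of select='count(DOC/RECORD)'/></out>"
                + "</xsl:template></xsl:stylesheet>";
        String xml = "<DOC><RECORD num='1'><FIELD name='FIELDNAME1'>val-a</FIELD></RECORD>"
                + "<RECORD num='2'><FIELD name='FIELDNAME1'>val-c</FIELD></RECORD></DOC>";
        System.out.println(transform(xslt, xml));
    }
}
```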

Number of RECORDS  Wall clock time to run(seconds) Peak Memory Usage(Megabytes)
        100             2.0                             23
        1000            9.8                             53
        5000            52.0                            286
        10000           109.0                           456
        20356           306.0                           569
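The peak-memory column was observed externally. For anyone reproducing these numbers, a crude in-process approximation is to sample the JVM heap before and after the transform; this is an illustrative sketch, not how the figures above were gathered, and a real profiler is more accurate:

```java
public class HeapProbe {

    // Crude estimate of currently used heap, in megabytes.  Call this
    // periodically during a transformation to approximate peak usage;
    // it only sees the JVM's own view of the heap, not the OS working set.
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("used heap: " + usedHeapMb() + " MB");
    }
}
```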

My PC:
Computer: Intel Pentium M processor 1600 MHz.  1,047,472 KBytes RAM
System: Microsoft Windows 2000 Service Pack 4

The stylesheet with the largest data.xml document takes over 300 seconds on my 
more or less dedicated PC, and for most of that time the CPU is pegged at 100%.  
The output HTML file is over 300 megabytes.  If this were to run inside a 
servlet, I think there would be time-out issues with the browser. There would 
also be incredible network demands to send such large generated HTML files from 
a web server to the browsers. Perhaps servlets and browsers are not involved, 
and the goal is to generate HTML files even larger than 300 megabytes.

Another XSLT processor that I tried (not Xalan) showed similar increases in 
processing time and memory usage as the input XML file got larger, but ran out 
of memory on the largest data.xml input. However, I have no doubt that with a 
larger input file, a smaller JVM heap size, or more things running on my 
computer, I would have run out of memory too.

I did a few runs with Xalan's XSLTC processor, which ran faster than Xalan-J 
interpretive but used more memory.  I didn't run it extensively because memory 
seems to be the issue.

----------------------------------------
Conclusion:

The input data.xml file is always read into memory in an internal format so 
that the XPath expressions can be evaluated and sequences of matching nodes 
can be returned. With a large enough input data XML file, Xalan will run out of 
memory.  This is true for Xalan-J interpretive and Xalan-J XSLTC, and seems to 
be true for the other XSLT processor that I tried.

It might be possible, with some analysis and effort, to reduce Xalan's memory 
usage so that somewhat larger input XML can be processed, but without re-
architecting there will be a limit on the size of input XML files. It is hard 
to guess, but from my experience, saving even 10% of the CPU or 10% of the 
memory could be quite an effort without a re-design.

My initial look at what sort of object creation was happening showed that many 
org.apache.xpath.objects.XNodeSet objects were consuming the memory. These 
objects already looked "thin" to me, so there didn't appear to be any low-
hanging fruit in this area. A re-work of the stylesheet was more likely to 
reduce the number of these objects created.



The stylesheet has this (more or less):

<xsl:for-each select="DOC/RECORD[.]">
<TR bgcolor="#C0C0C0">
        <xsl:if test="$recnumber='y'">
        <TD bgcolor="#C0C0C0">
                <font color="#000000" size="-1" face="sans-serif">
                <xsl:value-of select="position()"/>
                </font>
        </TD>
        </xsl:if>
        <TD bgcolor="#C0C0C0">
                <font color="#000000" size="-1" face="sans-serif">
                <xsl:value-of select="FIELD[@name='FIELDNAME1']"/>
                </font>
        </TD>
        <TD bgcolor="#C0C0C0">
                <font color="#000000" size="-1" face="sans-serif">
                <xsl:value-of select="FIELD[@name='FIELDNAME2']"/>
                </font>
        </TD>
...

With one <TD>...</TD> for each attribute name ('FIELDNAME1', 'FIELDNAME2', ...). 
Henry Z. suggested that I try this alternative:
<TR bgcolor="#C0C0C0">
        <xsl:if test="$recnumber='y'">
        <TD bgcolor="#C0C0C0">
                <font color="#000000" size="-1" face="sans-serif">
                <xsl:value-of select="position()"/>
                </font>
        </TD>
        </xsl:if>
<!-- ============================================================ -->
<xsl:for-each select="FIELD">
          <xsl:variable name="nameattr" select="string(@name)" />
<xsl:choose>
<xsl:when test="$nameattr='FIELDNAME1'">
        <TD bgcolor="#C0C0C0">
                <font color="#000000" size="-1" face="sans-serif">
                <xsl:value-of select="."/>
                </font>
        </TD>
</xsl:when>
<xsl:when test="$nameattr='FIELDNAME2'">
        <TD bgcolor="#C0C0C0">
                <font color="#000000" size="-1" face="sans-serif">
                <xsl:value-of select="."/>
                </font>
        </TD>
</xsl:when>
...

This reduced the number of org.apache.xpath.objects.XNodeSet objects that were 
created, but the output was not always the same as before. Henry and I both knew 
that this was a possibility. I had hoped that each <RECORD> had exactly the 
same <FIELD> elements in the same order, but this was not the case, so the 
output differed. With more knowledge about the data, perhaps changes could be 
made to the stylesheet that would reduce the memory pressure.

For this particular stylesheet it seems that the RECORDs are just processed 
sequentially from the input, so one doesn't really need the whole input XML 
document in memory at the same time. But the XML parser cannot anticipate what 
a particular stylesheet will do with XPath expressions. This end user's 
knowledge that the records are processed in sequence makes it possible to cut 
the input XML data file into pieces. If each piece is run through a separate 
XSLT transformation and those results are then glued back together, I think 
that larger input XML could be processed.
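The split-transform-glue idea can be sketched as follows. This is a hypothetical illustration, not code from the report: it assumes the chunks (each a well-formed <DOC> with a subset of the RECORDs) have already been produced, e.g. by a streaming parser, and note that position() and any cross-RECORD XPath would see only one chunk at a time, which is exactly the caveat about output differences discussed above:

```java
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;

public class ChunkedTransform {

    // Transform each chunk DOC independently and glue the outputs together.
    // Only one chunk's tree is in memory at a time, which bounds peak usage
    // by the largest chunk rather than the whole input file.
    public static String transformChunks(List<String> chunks, String xslt) throws Exception {
        TransformerFactory tf = TransformerFactory.newInstance();
        // Compile the stylesheet once; reuse it for every chunk.
        Templates templates = tf.newTemplates(new StreamSource(new StringReader(xslt)));
        StringBuilder glued = new StringBuilder();
        for (String chunk : chunks) {
            Transformer t = templates.newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(chunk)), new StreamResult(out));
            glued.append(out);
        }
        return glued.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical stylesheet and pre-split chunks, shaped like the
        // report's DOC/RECORD/FIELD data.
        String xslt = "<xsl:stylesheet version='1.0'"
                + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:output method='text'/>"
                + "<xsl:template match='/'><xsl:for-each select='DOC/RECORD'>"
                + "<xsl:value-of select='FIELD'/><xsl:text>;</xsl:text>"
                + "</xsl:for-each></xsl:template></xsl:stylesheet>";
        List<String> chunks = Arrays.asList(
                "<DOC><RECORD num='1'><FIELD name='F1'>a</FIELD></RECORD></DOC>",
                "<DOC><RECORD num='2'><FIELD name='F1'>b</FIELD></RECORD></DOC>");
        System.out.println(transformChunks(chunks, xslt));
    }
}
```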

Regards,
Brian Minchau
