Dario Laera schrieb:
Hi all,
I have xml files with thousands of sibling nodes, I want to transform
the xml so that all this nodes gets grouped in subset of n elements. I
have wrote an xsl file for this but it's very slow and I have big files:
20 minutes of cpu time for 20MB file.

It's true that this transformation takes extremely long using Xalan-J
2.7.1, or any other processor I tried. The reason is that this
transformation is not coded as efficiently as it could be done.

<xsl:template match="/report">
   <report>
       <xsl:for-each select="./object">
           <xsl:if test="position() mod 5 = 1">
               <object-collection>
                   <xsl:apply-templates select="."/>
                   <xsl:variable name="count" select="position()"/>
                   <xsl:apply-templates select="../object[$count + 1]"/>
                   <xsl:apply-templates select="../object[$count + 2]"/>
                   <xsl:apply-templates select="../object[$count + 3]"/>
                   <xsl:apply-templates select="../object[$count + 4]"/>

Obviously, it may be inefficient to apply-templates four times
to search a very large node-set. You'd want to use the following-sibling
axis instead and apply-templates only once:

  <xsl:template match="/report">
    <xsl:copy>
      <xsl:for-each select="object[ position() mod $group-size = 1 ]">
        <object-collection>
          <xsl:apply-templates select="."/>
          <xsl:apply-templates
select="following-sibling::object[position() &lt; $group-size]"/>
        </object-collection>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

This reduces execution time in *Saxon 6.5.1* from 385 to 1.4 seconds,
given a document containing 100,000 <object> nodes. Saxon is obviously
able to generate efficient instructions for this XPath expression.

But it doesn't help (not much, at least) for Xalan-J 2.7.1, neither for
Xalan-C nor LibXSLT, all of which take extremely long for the job.

For a context size of 100,000, the above *may* mean to a processor: scan
a set of 100,000 nodes 20,000 times. That is, scan 2 billion nodes,
which is quite a lot.

Maybe an XPath involving the following-sibling axis and a predicate like
the above is not optimized as it could be.

Imagine how a processor may work: There is a pointer to the context
node, and now you ask that pointer to move up to the parent node and
then examine every single one of 100,000 children to check if its
position() happens to match what you supply in the predicate. And do
this 20,000 times.

Wouldn't it be much more efficient to ask the pointer to move just one
node further on the same level? Depending on the data structure, it
might be only a couple of dereferencing operations, and no scanning at
all. It'll be much faster.

My solution is the above, but it's not time efficient. Is there a way to
do it better?

Yes, at least for Xalan-J, Saxon and LibXSLT:

<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform";>
  <xsl:output indent="yes"/>
  <xsl:variable name="group-size" select="5"/>
  <xsl:template match="report">
    <xsl:copy>
<xsl:apply-templates select="object[position() mod $group-size = 1]"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="object">
    <object-collection>
      <xsl:apply-templates select="." mode="sibrec">
        <xsl:with-param name="todo" select="$group-size - 1"/>
      </xsl:apply-templates>
    </object-collection>
  </xsl:template>
  <!-- Läuft mit Xalan-J 2.7.1 und LibXSLT bei Kontextgröße von 100.000
viel schneller. Wahrscheinlich weil following-sibling::object[1] optimiert
  werden kann. -->
  <xsl:template match="object" mode="sibrec">
    <xsl:param name="todo"/>
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
    <xsl:if test="$todo > 0">
      <xsl:apply-templates
        select="following-sibling::object[1]" mode="sibrec">
        <xsl:with-param name="todo" select="$todo - 1"/>
      </xsl:apply-templates>
    </xsl:if>
  </xsl:template>
</xsl:transform>

About 3 seconds using Xalan-J 2.7.1 on the above-mentioned 100,000
<object> node document, 1.6 seconds using Saxon 6.5.1, 1.3 seconds using
xsltproc/LibXSLT, but far too long using Xalan-C, 390 seconds. (Xalan-C,
to say something positive about it, tends to use by far the least memory
of all processors I've ever tested.)

Try googling for "sibling recursion" to learn more.

Michael Ludwig

Reply via email to