We did some fairly extensive research into the slow performance on the 15-minute file you are referencing. The investigations and results are documented on the wiki here:

https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Runtime+Performance+Improvement+Plan

While we found and merged a few improvements for 3.9, we found the biggest factor in the poor performance was the schema's heavy use of speculative parsing. For example, the schema looks something like this:

  <element name="record" maxOccurs="unbounded">
    <complexType>
      <choice>
        <element ref="record1" />
        <element ref="record2" />
        <element ref="record3" />
        ...
        <element ref="recordN" />
      </choice>
    </complexType>
  </element>

Where each record has one or more discriminators (usually multiple fields within the record) to determine whether we have found the right record.

This means that in order to successfully parse record X, we first have to attempt to parse, and fail, records 1 through X - 1. If the record happens to be far down in the choice, that could mean many failures before you actually find the right element. And this must be done for every single one of the 5 million records. All that speculative parsing and backtracking incurs overhead that adds up.
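For illustration, each record definition in a speculative schema like this typically looks something like the following sketch (element names, lengths, and the discriminator test are hypothetical, not taken from your schema). Daffodil parses the leading fields, evaluates the dfdl:discriminator, and if it fails, backtracks and tries the next branch of the choice:

```xml
<element name="record1">
  <complexType>
    <sequence>
      <!-- leading fields parsed speculatively, before we know this is record1 -->
      <element name="recordType" type="xs:string" dfdl:length="1" />
      <element name="subType" type="xs:string" dfdl:length="1" />
      <sequence>
        <annotation>
          <appinfo source="http://www.ogf.org/dfdl/">
            <!-- fails (forcing backtracking to the next choice branch)
                 unless the fields identify this as record1 -->
            <dfdl:discriminator test="{ ./recordType eq 'A' and ./subType eq '1' }" />
          </appinfo>
        </annotation>
      </sequence>
      <!-- remaining fields of record1 ... -->
    </sequence>
  </complexType>
</element>
```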

To make this efficient, you really need something like choice dispatch to tell Daffodil exactly which record of the choice to parse, effectively skipping those X - 1 failed attempts. I believe you've written a variant that does that, which we also tested, and we found it was significantly faster, on the order of a few minutes vs. 10-15 minutes. We also found the speed was within the same order of magnitude as code written specifically to parse ARINC. It was slower, but that isn't entirely unexpected given the generality Daffodil provides.

Granted, a big drawback of the choice dispatch schema is that it is much more complex. That is a fair complaint, but I'm not sure there's a whole lot we can easily do about it. One idea we've floated is adding the capability to inspect speculative parsing schemas and, during compilation, optimize them into direct dispatch. But detecting where that is possible is likely not an easy task. And it isn't even possible for many speculative parsing schemas--sometimes a schema really does need to just try something and see if it fails.

Right now, the best advice we have is to use direct dispatch to get the best performance.

Note that you might be able to more cleanly convert your existing speculative parsing schema to direct dispatch using the dfdlx:lookAhead function. For example, you could potentially do something like this:

  <element name="record" maxOccurs="unbounded">
    <complexType>
      <choice dfdl:choiceDispatchKey="{ dfdlx:lookAhead(32, 8) }">
        <element dfdl:choiceBranchKey="1" ref="record1" />
        <element dfdl:choiceBranchKey="2" ref="record2" />
        <element dfdl:choiceBranchKey="3" ref="record3" />
        ...
        <element dfdl:choiceBranchKey="N" ref="recordN" />
      </choice>
    </complexType>
  </element>

This relies on knowing the bit offsets and sizes of the discriminating fields, and I think ARINC would probably need multiple if-expressions and lookAhead calls (since different records are discriminated by multiple fields at different bit positions), but I *think* in theory it could work. This would let your schema stay fairly flat while still getting the efficiency gains of direct dispatch, with all the complexity concentrated in the single choice dispatch expression.
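For instance (with hypothetical offsets, lengths, and key values, just to sketch the idea), the dispatch expression could nest an if-expression and concatenate several lookAhead results into one key when some record families are discriminated by a second field:

```xml
<choice dfdl:choiceDispatchKey="{
    if (dfdlx:lookAhead(0, 8) eq 1)
      then fn:concat('1.', xs:string(dfdlx:lookAhead(32, 8)))
      else xs:string(dfdlx:lookAhead(0, 8)) }">
  <!-- records of family 1, distinguished by a second field at bit offset 32 -->
  <element dfdl:choiceBranchKey="1.4" ref="record1" />
  <element dfdl:choiceBranchKey="1.7" ref="record2" />
  <!-- records fully determined by the first byte -->
  <element dfdl:choiceBranchKey="2" ref="record3" />
</choice>
```

The complexity lives entirely in that one expression; the branches themselves stay flat.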


On 2024-09-13 08:49 AM, Roger L Costello wrote:
Time for Daffodil to parse a file containing 5-million records, each record has 
132 characters:

Daffodil version 3.6 took 15 minutes

Daffodil version 3.7 took 12.5 minutes

Daffodil version 3.8 took 12.6 minutes

The performance of Daffodil has declined between version 3.7 and 3.8 by 0.8%
