We did some fairly extensive research into the slow performance on the
15-minute file you are referencing. The investigations and results are
documented on the wiki here:
https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Runtime+Performance+Improvement+Plan
While we found and merged a few improvements for 3.9, we found the biggest
factor in the poor performance was the schema's heavy use of speculative
parsing. For example, the schema looks something like this:
<element name="record" maxOccurs="unbounded">
  <complexType>
    <choice>
      <element ref="record1" />
      <element ref="record2" />
      <element ref="record3" />
      ...
      <element ref="recordN" />
    </choice>
  </complexType>
</element>
Each record has one or more discriminators (usually multiple fields within the
record) that determine whether we have found the right record.
This means that in order to successfully parse record X, we must first attempt,
and fail, to parse the X - 1 records before it. If the right record happens to
be far down in the choice, that could be many failures before the element is
actually found. And this must be done for every single one of the 5 million
records. All that speculative parsing and backtracking incurs overhead that
adds up.
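As a rough back-of-envelope estimate (assuming, purely for illustration, that
the N record types occur uniformly at random in the data): each record costs
about (N - 1) / 2 failed branch attempts on average, so 5 million records incur
on the order of 5,000,000 * (N - 1) / 2 failed speculative parses, each with
its own backtracking of the input stream.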
To make this efficient, you really need something like choice dispatch to tell
Daffodil exactly which record of the choice to parse, effectively skipping
those X - 1 failed attempts. I believe you've written a variant that does that,
which we also tested, and we found it was significantly faster: on the order of
a few minutes vs 10-15 minutes. We also found the speed was within the same
order of magnitude as code written specifically to parse ARINC. It was slower,
but that is not entirely unexpected given the generality Daffodil provides.
Granted, a big drawback of the choice dispatch schema is that it is much more
complex. That is a fair complaint, but I'm not sure there's a whole lot we can
easily do about it. One idea we've floated is to add the capability to inspect
speculative parsing schemas and perform optimizations during compilation that
convert them to direct dispatch. But detecting where that is possible is likely
not an easy task, and it's not even possible for many speculative parsing
schemas--sometimes a schema really does need to just try something and see if
it fails.
Right now, the best advice we have is to use direct dispatch to get the best
performance.
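For reference, the usual direct dispatch pattern parses the discriminating
field as its own element and then dispatches on its value. A minimal sketch of
that shape (the recordType element, its length, and the key values are made up
for illustration, and the exact relative path in the dispatch expression may
need adjusting for your schema):

<element name="record" maxOccurs="unbounded">
  <complexType>
    <sequence>
      <!-- parse the discriminator explicitly first -->
      <element name="recordType" type="xs:unsignedInt"
               dfdl:representation="binary" dfdl:lengthKind="explicit"
               dfdl:length="8" dfdl:lengthUnits="bits" />
      <!-- dispatch on the value just parsed; no backtracking needed -->
      <choice dfdl:choiceDispatchKey="{ xs:string(./recordType) }">
        <element dfdl:choiceBranchKey="1" ref="record1" />
        <element dfdl:choiceBranchKey="2" ref="record2" />
        ...
        <element dfdl:choiceBranchKey="N" ref="recordN" />
      </choice>
    </sequence>
  </complexType>
</element>

This only works if the discriminator can be modeled as a standalone field at a
fixed position in each record, which may not hold for ARINC; hence the
lookAhead idea below.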
Note that you might be able to more cleanly convert your existing speculative
parsing schema to direct dispatch using the dfdlx:lookAhead function. For
example, you could potentially do something like this:
<element name="record" maxOccurs="unbounded">
  <complexType>
    <choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(32, 8)) }">
      <element dfdl:choiceBranchKey="1" ref="record1" />
      <element dfdl:choiceBranchKey="2" ref="record2" />
      <element dfdl:choiceBranchKey="3" ref="record3" />
      ...
      <element dfdl:choiceBranchKey="N" ref="recordN" />
    </choice>
  </complexType>
</element>
This relies on knowing the bit offsets and sizes of the discriminating fields,
and I think ARINC would probably need multiple if-expressions and lookAhead
calls (since different records are identified by multiple fields at different
bit positions), but I *think* in theory it could work. This would allow your
schema to stay fairly flat but still get the efficiency gains of direct
dispatch, with all the complexity confined to the single choice dispatch
expression.
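To make the multi-field case concrete, here is a sketch of what such an
expression might look like (all offsets, lengths, key values, and the
unknownRecord catch-all are invented for illustration; note that a dispatch key
matching no choiceBranchKey causes a parse error, which is why a catch-all
branch may be useful):

<choice dfdl:choiceDispatchKey="{
    if (dfdlx:lookAhead(0, 8) eq 1 and dfdlx:lookAhead(32, 8) eq 7) then 'r1'
    else if (dfdlx:lookAhead(0, 8) eq 2) then 'r2'
    else 'other'
  }">
  <element dfdl:choiceBranchKey="r1" ref="record1" />
  <element dfdl:choiceBranchKey="r2" ref="record2" />
  <element dfdl:choiceBranchKey="other" ref="unknownRecord" />
</choice>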
On 2024-09-13 08:49 AM, Roger L Costello wrote:
Time for Daffodil to parse a file containing 5 million records, each record 132
characters long:
Daffodil version 3.6 took 15 minutes
Daffodil version 3.7 took 12.5 minutes
Daffodil version 3.8 took 12.6 minutes
The performance of Daffodil declined by 0.8% between versions 3.7 and 3.8.