Re: The performance of Daffodil at the command line is horrible

Mike Beckerle Wed, 17 May 2023 11:02:59 -0700

How do you know if it is character 6 or character 13 that has the
subsection code? I assume that depends on the character 5 section code?


What is in characters 1-4 and 6-12? Is it different for every record type?

There is a pure-DFDL answer to this, which I don't yet have enough
information to explain, and there is a Daffodil extension, the
dfdlx:lookAhead() function. The latter is straightforward to use: you look
ahead at characters 5, 6, and 13, then convert your choice into a
'choice-by-dispatch', which takes constant time per record rather than
O(m) time.

https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+DFDLX+lookAhead
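
A minimal sketch of that conversion, untested and under several assumptions:
the dfdlx prefix is bound to http://www.ogf.org/dfdl/dfdl-1.0/extensions on
xs:schema; dfdlx:lookAhead(offset, bitSize) takes a bit offset from the
current position and returns the bits as a non-negative integer; and the
encoding is 8 bits per character, so character k starts at bit 8*(k-1). The
section and subsection values, and which section keeps its subsection code
at character 6 versus 13, are hypothetical placeholders, since I don't know
the real codes.

```xml
<!-- Sketch only: every choiceBranchKey value below is a made-up placeholder. -->
<xs:element name="record" maxOccurs="unbounded">
  <xs:complexType>
    <!-- Outer dispatch on the section code: character 5, bit offset 32. -->
    <xs:choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(32, 8)) }">

      <!-- Hypothetical section 'E' (byte value 69): subsection at character 6. -->
      <xs:sequence dfdl:choiceBranchKey="69">
        <xs:choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(40, 8)) }">
          <xs:element ref="A" dfdl:choiceBranchKey="65" /> <!-- 'A' -->
          <xs:element ref="B" dfdl:choiceBranchKey="82" /> <!-- 'R' -->
        </xs:choice>
      </xs:sequence>

      <!-- Hypothetical section 'P' (byte value 80): subsection at character 13. -->
      <xs:sequence dfdl:choiceBranchKey="80">
        <xs:choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(96, 8)) }">
          <xs:element ref="C" dfdl:choiceBranchKey="67" /> <!-- 'C' -->
          <xs:element ref="D" dfdl:choiceBranchKey="68" /> <!-- 'D' -->
        </xs:choice>
      </xs:sequence>

    </xs:choice>
  </xs:complexType>
</xs:element>
```

With this shape, each record evaluates at most two dispatch expressions and
goes straight to one branch, instead of attempting A, B, C, ... in turn and
backtracking on every failure. The branch keys are decimal byte values
because the dispatch key here is the string form of the integer that
lookAhead returns, not the character itself.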

This comes up often enough that I'm thinking about a layer that would let
you easily examine a part of the data stream twice: once to learn from it,
and a second time to actually parse it. In your case you want to examine
bytes 1 to 13 twice, once to learn the section code and subsection code,
and a second time when actually parsing the message.






On Wed, May 17, 2023 at 9:42 AM Roger L Costello <coste...@mitre.org> wrote:

> Hi Mike,
>
>
>
>    - how does the format determine which record type, A, B, C, .... is
>    the one in the data?
>
>
>
> The input consists of lines. Each line is exactly 132 characters.
>
>
>
> The type of a line is determined by a 1-character section code plus a
> 1-character subsection code. The section code is always located at
> character 5. The subsection code is always located either at character 6 or
> at character 13. Given that, how would I modify my DFDL schema to improve
> its performance?
>
>
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Wednesday, May 17, 2023 9:13 AM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: The performance of Daffodil at the command line is
> horrible
>
>
>
>
>
> The choice is certainly the likely suspect.
>
>
>
> What you have here is an O(n * m) algorithm where n is how many records
> and m is the number of record types.
>
>
>
> So, how does the format determine which record type, A, B, C, .... is the
> one in the data?
>
>
>
> Most formats will have one or a small handful of different criteria used,
> based on common initial parts of the data stream.
>
>
>
> The secret is to capture those in exactly one place in the schema and
> expose it before the choice, so that the choice can exploit that common
> structure.
>
>
>
>
>
>
>
> On Wed, May 17, 2023 at 8:35 AM Roger L Costello <coste...@mitre.org>
> wrote:
>
> The input file is 375 MB
> The XML file that DFDL parsing generates is 4.67 GB
>
> Time required for Daffodil to parse the input and generate the XML file is
> 16 minutes, 24 seconds.
>
> Ugh!
>
> That is too long. My customers will laugh at me if I suggest they use a
> tool that takes 16 minutes to parse their data.
>
> Below is the skeletal structure of my DFDL schema. I am pretty sure the
> "choice" is the cause of the slowness. I don't see an alternative to the
> choice; each record of the input could be one of the choices (i.e., the
> input records aren't in any order). Any suggestions for improving the
> performance?
>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>     xmlns:fn="http://www.w3.org/2005/xpath-functions"
>     xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
>
>     <xs:annotation>
>         <xs:appinfo source="http://www.ogf.org/dfdl/">
>             <dfdl:format
>                 alignment="1"
>                 alignmentUnits="bytes"
>                 choiceLengthKind="implicit"
>                 emptyValueDelimiterPolicy="none"
>                 encoding="ASCII"
>                 encodingErrorPolicy="replace"
>                 escapeSchemeRef=""
>                 fillByte="%SP;"
>                 floating="no"
>                 ignoreCase="yes"
>                 initiatedContent="no"
>                 initiator=""
>                 leadingSkip="0"
>                 lengthKind="delimited"
>                 lengthUnits="characters"
>                 nilValueDelimiterPolicy="none"
>                 occursCountKind="implicit"
>                 outputNewLine="%CR;%LF;"
>                 representation="text"
>                 separator=""
>                 separatorSuppressionPolicy="anyEmpty"
>                 sequenceKind="ordered"
>                 textBidi="no"
>                 textPadKind="none"
>                 textTrimKind="none"
>                 trailingSkip="0"
>                 truncateSpecifiedLengthString="no"
>                 terminator=""
>                 textNumberRep="standard"
>                 textStandardBase="10"
>                 textStandardZeroRep="0"
>                 textNumberRounding="pattern"
>                 textStandardExponentRep="E"
>                 textNumberCheckPolicy="strict"
>             />
>         </xs:appinfo>
>     </xs:annotation>
>
>     <xs:element name="Test">
>         <xs:complexType>
>             <xs:sequence dfdl:separator="%NL;"
> dfdl:separatorPosition="infix">
>                 <xs:element name="record" maxOccurs="unbounded" >
>                     <xs:complexType>
>                         <xs:choice>
>                             <xs:element ref="A" />
>
>                             <xs:element ref="B" />
>
>                             <xs:element ref="C" />
>
>                             <xs:element ref="D" />
>
>                             <!-- A hundred more of these element ref's -->
>                         </xs:choice>
>                     </xs:complexType>
>                 </xs:element>
>             </xs:sequence>
>         </xs:complexType>
>     </xs:element>
>
>
