Re: The performance of Daffodil at the command line is horrible

Roger L Costello Wed, 17 May 2023 11:13:33 -0700

Hi Mike,


  *   How do you know if it is character 6 or character 13 that has the 
subsection code? I assume that depends on the character 5 section code?

Correct. If the section code is ‘P’, the subsection code is in position 6. 
If the section code is ‘R’, the subsection code is in position 13, and so on.


  *   What is in characters 1-4 and 6-12 ? Different for every record type?

Correct. Different for every record type.


  *   There is a pure-DFDL answer

Yes, that’s what I want!

From: Mike Beckerle <mbecke...@apache.org>
Sent: Wednesday, May 17, 2023 2:03 PM
To: users@daffodil.apache.org
Subject: [EXT] Re: The performance of Daffodil at the command line is horrible

How do you know if it is character 6 or character 13 that has the subsection 
code? I assume that depends on the character 5 section code?

What is in characters 1-4 and 6-12 ? Different for every record type?

There is a pure-DFDL answer to this which I don't have enough info yet to 
explain, and there is a Daffodil extension, the dfdlx:lookAhead() function. How 
to use the latter is straightforward: you look ahead at characters 5, 6, and 13, then 
convert your choice into a 'choice-by-dispatch' which is constant time, not 
O(m) time.

https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+DFDLX+lookAhead
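
A rough sketch of what the lookAhead approach might look like follows. It is
an illustration under stated assumptions, not a tested schema: the wrapper
elements SectionP and SectionR and the subsection codes 'A' to 'D' are
hypothetical, the encoding is assumed to be 8-bit ASCII, and the
(offset, bitSize) arguments are bit counts as described in the proposal linked
above. Because lookAhead returns an integer, the branch keys here are ASCII
code points rather than letters.

```xml
<!-- Sketch only. Assumes the xs: and dfdl: prefixes of the enclosing schema
     and 8-bit ASCII, so character k starts (k - 1) * 8 bits into the record.
     No data has been consumed when either choice below is reached, so both
     offsets are measured from the start of the record. -->
<xs:element name="record" maxOccurs="unbounded"
    xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions">
  <xs:complexType>
    <!-- Section code: character 5, bit offset 32, 8 bits wide. -->
    <xs:choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(32, 8)) }">
      <xs:element ref="SectionP" dfdl:choiceBranchKey="80" /> <!-- 'P' -->
      <xs:element ref="SectionR" dfdl:choiceBranchKey="82" /> <!-- 'R' -->
      <!-- one branch per section code -->
    </xs:choice>
  </xs:complexType>
</xs:element>

<!-- Hypothetical global wrapper: section 'P' keeps its subsection code
     at character 6 (bit offset 40). -->
<xs:element name="SectionP">
  <xs:complexType>
    <xs:choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(40, 8)) }">
      <xs:element ref="A" dfdl:choiceBranchKey="65" /> <!-- subsection 'A' -->
      <xs:element ref="B" dfdl:choiceBranchKey="66" /> <!-- subsection 'B' -->
    </xs:choice>
  </xs:complexType>
</xs:element>

<!-- Hypothetical global wrapper: section 'R' keeps its subsection code
     at character 13 (bit offset 96). -->
<xs:element name="SectionR">
  <xs:complexType>
    <xs:choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(96, 8)) }">
      <xs:element ref="C" dfdl:choiceBranchKey="67" /> <!-- subsection 'C' -->
      <xs:element ref="D" dfdl:choiceBranchKey="68" /> <!-- subsection 'D' -->
    </xs:choice>
  </xs:complexType>
</xs:element>
```

Note that the wrapper elements add one level to the infoset; whether that is
acceptable depends on the XML your consumers expect.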

This stuff comes up often enough that I'm thinking about a layer to let you 
easily examine a part of the data stream twice - once to learn from it, a 
second time to actually parse it. In your case you want to examine bytes 1 to 
13 twice. Once to learn the section code and subsection code, a second time 
when actually parsing the message.






On Wed, May 17, 2023 at 9:42 AM Roger L Costello 
<coste...@mitre.org> wrote:
Hi Mike,


  *   how does the format determine which record type, A, B, C, .... is the one 
in the data?

The input consists of lines. Each line is exactly 132 characters.

The type of a line is determined by a 1-character section code plus a 
1-character subsection code. The section code is always located at character 5. 
The subsection code is always located either at character 6 or at character 13. 
Given that, how would I modify my DFDL schema to improve its performance?

From: Mike Beckerle <mbecke...@apache.org>
Sent: Wednesday, May 17, 2023 9:13 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: The performance of Daffodil at the command line is horrible


The choice is certainly the likely suspect.

What you have here is an O(n * m) algorithm where n is how many records and m 
is the number of record types.

So, how does the format determine which record type, A, B, C, .... is the one 
in the data?

Most formats will have one or a small handful of different criteria used, based 
on common initial parts of the data stream.

The secret is to capture those in exactly one place in the schema and expose it 
before the choice, so that the choice can exploit that common structure.
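
As a minimal sketch of that idea, assume a hypothetical format in which the
first character of every record is a one-letter tag naming the record type.
The element name "tag" and the branch keys are illustrative only. The tag is
parsed once, ahead of the choice, and dfdl:choiceDispatchKey then selects the
matching branch directly instead of trying each branch in turn.

```xml
<!-- Sketch only. Assumes the xs: and dfdl: prefixes of the enclosing schema
     and a hypothetical one-character tag at the start of each record. -->
<xs:element name="record" maxOccurs="unbounded">
  <xs:complexType>
    <xs:sequence>
      <!-- The common leading part, captured in exactly one place. -->
      <xs:element name="tag" type="xs:string"
          dfdl:lengthKind="explicit" dfdl:length="1" />
      <!-- The dispatch key is the tag just parsed; each branch declares
           the key value that selects it. -->
      <xs:choice dfdl:choiceDispatchKey="{ ./tag }">
        <xs:element ref="A" dfdl:choiceBranchKey="A" />
        <xs:element ref="B" dfdl:choiceBranchKey="B" />
        <xs:element ref="C" dfdl:choiceBranchKey="C" />
        <xs:element ref="D" dfdl:choiceBranchKey="D" />
      </xs:choice>
    </xs:sequence>
  </xs:complexType>
</xs:element>
```

In this arrangement the branch elements describe only what follows the tag,
since the tag itself has already been consumed.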



On Wed, May 17, 2023 at 8:35 AM Roger L Costello 
<coste...@mitre.org> wrote:
The input file is 375 MB
The XML file that DFDL parsing generates is 4.67 GB

Time required for Daffodil to parse the input and generate the XML file is 16 
minutes, 24 seconds.

Ugh!

That is too long. My customers will laugh at me if I suggest they use a tool 
that takes 16 minutes to parse their data.

Below is the skeletal structure of my DFDL schema. I am pretty sure the 
"choice" is the cause of the slowness. I don't see an alternative to the 
choice; each record of the input could be one of the choices (i.e., the input 
records aren't in any order). Any suggestions for improving the performance?

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:fn="http://www.w3.org/2005/xpath-functions"
    xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">

    <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
            <dfdl:format
                alignment="1"
                alignmentUnits="bytes"
                choiceLengthKind="implicit"
                emptyValueDelimiterPolicy="none"
                encoding="ASCII"
                encodingErrorPolicy="replace"
                escapeSchemeRef=""
                fillByte="%SP;"
                floating="no"
                ignoreCase="yes"
                initiatedContent="no"
                initiator=""
                leadingSkip="0"
                lengthKind="delimited"
                lengthUnits="characters"
                nilValueDelimiterPolicy="none"
                occursCountKind="implicit"
                outputNewLine="%CR;%LF;"
                representation="text"
                separator=""
                separatorSuppressionPolicy="anyEmpty"
                sequenceKind="ordered"
                textBidi="no"
                textPadKind="none"
                textTrimKind="none"
                trailingSkip="0"
                truncateSpecifiedLengthString="no"
                terminator=""
                textNumberRep="standard"
                textStandardBase="10"
                textStandardZeroRep="0"
                textNumberRounding="pattern"
                textStandardExponentRep="E"
                textNumberCheckPolicy="strict"
            />
        </xs:appinfo>
    </xs:annotation>

    <xs:element name="Test">
        <xs:complexType>
            <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="infix">
                <xs:element name="record" maxOccurs="unbounded" >
                    <xs:complexType>
                        <xs:choice>
                            <xs:element ref="A" />
                            <xs:element ref="B" />
                            <xs:element ref="C" />
                            <xs:element ref="D" />
                            <!-- A hundred more of these element ref's -->
                        </xs:choice>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
