How do you know if it is character 6 or character 13 that has the subsection code? I assume that depends on the character 5 section code?
What is in characters 1-4 and 6-12? Is it different for every record type?

There is a pure-DFDL answer to this, which I don't have enough information yet to explain, and there is a Daffodil extension, the dfdlx:lookAhead() function. How to use the latter is obvious: you look ahead at characters 5, 6, and 13, then convert your choice into a 'choice-by-dispatch', which is constant time, not O(m) time.

https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+DFDLX+lookAhead

This stuff comes up often enough that I'm thinking about a layer to let you easily examine a part of the data stream twice: once to learn from it, a second time to actually parse it. In your case you want to examine bytes 1 to 13 twice: once to learn the section code and subsection code, a second time when actually parsing the message.

On Wed, May 17, 2023 at 9:42 AM Roger L Costello <coste...@mitre.org> wrote:

> Hi Mike,
>
> - how does the format determine which record type, A, B, C, .... is the one in the data?
>
> The input consists of lines. Each line is exactly 132 characters.
>
> The type of a line is determined by a 1-character section code plus a 1-character subsection code. The section code is always located at character 5. The subsection code is always located either at character 6 or at character 13. Given that, how would I modify my DFDL schema to improve its performance?
>
> *From:* Mike Beckerle <mbecke...@apache.org>
> *Sent:* Wednesday, May 17, 2023 9:13 AM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: The performance of Daffodil at the command line is horrible
>
> The choice is certainly the likely suspect.
>
> What you have here is an O(n * m) algorithm where n is how many records and m is the number of record types.
>
> So, how does the format determine which record type, A, B, C, .... is the one in the data?
>
> Most formats will have one or a small handful of different criteria used, based on common initial parts of the data stream.
>
> The secret is to capture those in exactly one place in the schema and expose it before the choice, so that the choice can exploit that common structure.
>
> On Wed, May 17, 2023 at 8:35 AM Roger L Costello <coste...@mitre.org> wrote:
>
> The input file is 375 MB
> The XML file that DFDL parsing generates is 4.67 GB
>
> Time required for Daffodil to parse the input and generate the XML file is 16 minutes, 24 seconds.
>
> Ugh!
>
> That is too long. My customers will laugh at me if I suggest they use a tool that takes 16 minutes to parse their data.
>
> Below is the skeletal structure of my DFDL schema. I am pretty sure the "choice" is the cause of the slowness. I don't see an alternative to the choice; each record of the input could be one of the choices (i.e., the input records aren't in any order). Any suggestions for improving the performance?
>
> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>            xmlns:fn="http://www.w3.org/2005/xpath-functions"
>            xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
>
>     <xs:annotation>
>         <xs:appinfo source="http://www.ogf.org/dfdl/">
>             <dfdl:format
>                 alignment="1"
>                 alignmentUnits="bytes"
>                 choiceLengthKind="implicit"
>                 emptyValueDelimiterPolicy="none"
>                 encoding="ASCII"
>                 encodingErrorPolicy="replace"
>                 escapeSchemeRef=""
>                 fillByte="%SP;"
>                 floating="no"
>                 ignoreCase="yes"
>                 initiatedContent="no"
>                 initiator=""
>                 leadingSkip="0"
>                 lengthKind="delimited"
>                 lengthUnits="characters"
>                 nilValueDelimiterPolicy="none"
>                 occursCountKind="implicit"
>                 outputNewLine="%CR;%LF;"
>                 representation="text"
>                 separator=""
>                 separatorSuppressionPolicy="anyEmpty"
>                 sequenceKind="ordered"
>                 textBidi="no"
>                 textPadKind="none"
>                 textTrimKind="none"
>                 trailingSkip="0"
>                 truncateSpecifiedLengthString="no"
>                 terminator=""
>                 textNumberRep="standard"
>                 textStandardBase="10"
>                 textStandardZeroRep="0"
>                 textNumberRounding="pattern"
>                 textStandardExponentRep="E"
>                 textNumberCheckPolicy="strict"
>             />
>         </xs:appinfo>
>     </xs:annotation>
>
>     <xs:element name="Test">
>         <xs:complexType>
>             <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="infix">
>                 <xs:element name="record" maxOccurs="unbounded">
>                     <xs:complexType>
>                         <xs:choice>
>                             <xs:element ref="A" />
>                             <xs:element ref="B" />
>                             <xs:element ref="C" />
>                             <xs:element ref="D" />
>                             <!-- A hundred more of these element ref's -->
>                         </xs:choice>
>                     </xs:complexType>
>                 </xs:element>
>             </xs:sequence>
>         </xs:complexType>
>     </xs:element>
>
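
For reference, here is a rough, untested sketch of how the look-ahead plus choice-by-dispatch idea from the top of this thread might be applied to the quoted schema. Several things in it are assumptions rather than facts from the thread: the section codes 'E' and 'P', the subsection codes, and which sections keep their subsection code at character 6 versus character 13 are all hypothetical placeholders. It also assumes dfdlx:lookAhead takes a bit offset and a bit size and returns an unsigned integer, as I understand the linked proposal, so the branch keys are written as decimal ASCII codes rather than letters.

```xml
<!-- Sketch only: a replacement for the <xs:choice> inside the "record" element.
     Assumes the xs:schema element also declares
     xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions"
     and that each character is one 8-bit byte (encoding="ASCII" in the format). -->
<xs:choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(32, 8)) }">
  <!-- Character 5 (the section code) starts 32 bits past the start of the line.
       The look-ahead result is an integer, so each key is an ASCII code. -->

  <!-- Hypothetical section 'E' (69): subsection code at character 6, bit offset 40 -->
  <xs:sequence dfdl:choiceBranchKey="69">
    <xs:choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(40, 8)) }">
      <xs:element ref="A" dfdl:choiceBranchKey="65" /> <!-- hypothetical subsection 'A' -->
      <xs:element ref="B" dfdl:choiceBranchKey="82" /> <!-- hypothetical subsection 'R' -->
    </xs:choice>
  </xs:sequence>

  <!-- Hypothetical section 'P' (80): subsection code at character 13, bit offset 96 -->
  <xs:sequence dfdl:choiceBranchKey="80">
    <xs:choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(96, 8)) }">
      <xs:element ref="C" dfdl:choiceBranchKey="65" /> <!-- hypothetical subsection 'A' -->
      <xs:element ref="D" dfdl:choiceBranchKey="71" /> <!-- hypothetical subsection 'G' -->
    </xs:choice>
  </xs:sequence>

  <!-- One keyed branch per remaining section and subsection -->
</xs:choice>
```

The outer choice selects on the section code alone, and each section then decides where its own subsection code lives, so no record type is attempted speculatively. Whether this matches the real layout depends on the answers to the questions at the top of this message.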