Hello all,

To weigh in here... this would be a great use case for a partnership with
Apache Drill. Drill can read XML natively but not always accurately. Using
DFDL to provide a schema for Drill would be a HUGE win. If anyone is
interested in revisiting that thread, I'd be happy to resume the conversation.

Best,
-- C

> On May 17, 2023, at 2:02 PM, Mike Beckerle <mbecke...@apache.org> wrote:
>
> How do you know if it is character 6 or character 13 that has the subsection
> code? I assume that depends on the character 5 section code?
>
> What is in characters 1-4 and 6-12 ? Different for every record type?
>
> There is a pure-DFDL answer to this which I don't have enough info yet to
> explain, and there is a Daffodil extension, the dfdlx:lookAhead() function.
> The latter is obvious how to use. You look ahead at characters 5, 6, and 13,
> then convert your choice into a 'choice-by-dispatch' which is constant time,
> not O(m) time.
>
> https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+DFDLX+lookAhead
>
> This stuff comes up often enough that I'm thinking about a layer to let you
> easily examine a part of the data stream twice - once to learn from it, a
> second time to actually parse it. In your case you want to examine bytes 1 to
> 13 twice. Once to learn the section code and subsection code, a second time
> when actually parsing the message.
>
> On Wed, May 17, 2023 at 9:42 AM Roger L Costello <coste...@mitre.org> wrote:
>> Hi Mike,
>>
>> how does the format determine which record type, A, B, C, .... is the one in
>> the data?
>>
>> The input consists of lines. Each line is exactly 132 characters.
>>
>> The type of a line is determined by a 1-character section code plus a
>> 1-character subsection code. The section code is always located at character
>> 5. The subsection code is always located either at character 6 or at
>> character 13. Given that, how would I modify my DFDL schema to improve its
>> performance?
>>
>> From: Mike Beckerle <mbecke...@apache.org>
>> Sent: Wednesday, May 17, 2023 9:13 AM
>> To: users@daffodil.apache.org
>> Subject: [EXT] Re: The performance of Daffodil at the command line is horrible
>>
>> The choice is certainly the likely suspect.
>>
>> What you have here is an O(n * m) algorithm where n is how many records and
>> m is the number of record types.
>>
>> So, how does the format determine which record type, A, B, C, .... is the
>> one in the data?
>>
>> Most formats will have one or a small handful of different criteria used,
>> based on common initial parts of the data stream.
>>
>> The secret is to capture those in exactly one place in the schema and expose
>> it before the choice, so that the choice can exploit that common structure.
>>
>> On Wed, May 17, 2023 at 8:35 AM Roger L Costello <coste...@mitre.org> wrote:
>>
>> The input file is 375 MB
>> The XML file that DFDL parsing generates is 4.67 GB
>>
>> Time required for Daffodil to parse the input and generate the XML file is
>> 16 minutes, 24 seconds.
>>
>> Ugh!
>>
>> That is too long. My customers will laugh at me if I suggest they use a tool
>> that takes 16 minutes to parse their data.
>>
>> Below is the skeletal structure of my DFDL schema. I am pretty sure the
>> "choice" is the cause of the slowness. I don't see an alternative to the
>> choice; each record of the input could be one of the choices (i.e., the
>> input records aren't in any order). Any suggestions for improving the
>> performance?
>>
>> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>>            xmlns:fn="http://www.w3.org/2005/xpath-functions"
>>            xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
>>
>>   <xs:annotation>
>>     <xs:appinfo source="http://www.ogf.org/dfdl/">
>>       <dfdl:format
>>         alignment="1"
>>         alignmentUnits="bytes"
>>         choiceLengthKind="implicit"
>>         emptyValueDelimiterPolicy="none"
>>         encoding="ASCII"
>>         encodingErrorPolicy="replace"
>>         escapeSchemeRef=""
>>         fillByte="%SP;"
>>         floating="no"
>>         ignoreCase="yes"
>>         initiatedContent="no"
>>         initiator=""
>>         leadingSkip="0"
>>         lengthKind="delimited"
>>         lengthUnits="characters"
>>         nilValueDelimiterPolicy="none"
>>         occursCountKind="implicit"
>>         outputNewLine="%CR;%LF;"
>>         representation="text"
>>         separator=""
>>         separatorSuppressionPolicy="anyEmpty"
>>         sequenceKind="ordered"
>>         textBidi="no"
>>         textPadKind="none"
>>         textTrimKind="none"
>>         trailingSkip="0"
>>         truncateSpecifiedLengthString="no"
>>         terminator=""
>>         textNumberRep="standard"
>>         textStandardBase="10"
>>         textStandardZeroRep="0"
>>         textNumberRounding="pattern"
>>         textStandardExponentRep="E"
>>         textNumberCheckPolicy="strict"
>>       />
>>     </xs:appinfo>
>>   </xs:annotation>
>>
>>   <xs:element name="Test">
>>     <xs:complexType>
>>       <xs:sequence dfdl:separator="%NL;"
>>                    dfdl:separatorPosition="infix">
>>         <xs:element name="record" maxOccurs="unbounded" >
>>           <xs:complexType>
>>             <xs:choice>
>>               <xs:element ref="A" />
>>
>>               <xs:element ref="B" />
>>
>>               <xs:element ref="C" />
>>
>>               <xs:element ref="D" />
>>
>>               <!-- A hundred more of these element ref's -->
>>             </xs:choice>
>>           </xs:complexType>
>>         </xs:element>
>>       </xs:sequence>
>>     </xs:complexType>
>>   </xs:element>
>>
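
For anyone skimming the thread, here is a rough sketch of what the 'choice-by-dispatch' idea Mike mentions might look like for the "record" element. It is an illustration under assumptions, not a tested schema: it assumes every character is one 8-bit byte, so character 5 starts 32 bits past the start of the record, and that dfdlx:lookAhead(offset, bitSize) returns that byte as an integer, as described in the linked proposal. The mapping of section codes to the record elements A and B is hypothetical, since the thread does not say which code selects which record. The subsection code is not handled here.

```xml
<!-- Sketch only. The namespace declarations would normally sit on xs:schema;
     they are repeated here so the fragment is well-formed on its own. -->
<xs:element name="record" maxOccurs="unbounded"
            xmlns:xs="http://www.w3.org/2001/XMLSchema"
            xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/"
            xmlns:dfdlx="http://www.ogf.org/dfdl/dfdl-1.0/extensions">
  <xs:complexType>
    <!-- Peek at character 5 (bit offset 32, 8 bits wide) without consuming it.
         The integer byte value is converted to a string, so the dispatch key
         would be "65" for 'A', "66" for 'B', and so on (assumption). -->
    <xs:choice dfdl:choiceDispatchKey="{ xs:string(dfdlx:lookAhead(32, 8)) }">
      <!-- Hypothetical mapping: section code 'A' selects record type A -->
      <xs:element ref="A" dfdl:choiceBranchKey="65" />
      <!-- Hypothetical mapping: section code 'B' selects record type B -->
      <xs:element ref="B" dfdl:choiceBranchKey="66" />
      <!-- The remaining branches would follow the same pattern -->
    </xs:choice>
  </xs:complexType>
</xs:element>
```

With a dispatch key, the parser selects a branch by its key instead of attempting each of the hundred-plus alternatives in turn. If the subsection code is also needed, one option would be a nested choice inside each section branch that looks ahead at character 6 (bit offset 40) or character 13 (bit offset 96), depending on the section.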