Re: The performance of Daffodil at the command line is horrible

Charles Givre Wed, 17 May 2023 11:12:26 -0700

Hello all, 
To weigh in here... this would be a great use case for a partnership with 
Apache Drill.  Drill can read XML natively but not always accurately.  Using 
DFDL to provide a schema for Drill would be a HUGE win.  If anyone is 
interested in revisiting that thread, I'd be happy to resume the conversation.
Best,
-- C




> On May 17, 2023, at 2:02 PM, Mike Beckerle <mbecke...@apache.org> wrote:
> 
> How do you know if it is character 6 or character 13 that has the subsection 
> code? I assume that depends on the character 5 section code?
> 
> What is in characters 1-4 and 6-12? Is it different for every record type?
> 
> There is a pure-DFDL answer to this, which I don't have enough information 
> yet to explain, and there is a Daffodil extension, the dfdlx:lookAhead() 
> function. How to use the latter is obvious: you look ahead at characters 5, 
> 6, and 13, then convert your choice into a 'choice-by-dispatch', which takes 
> constant time per record, not O(m) time. 
> 
> https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+DFDLX+lookAhead
> 
> This stuff comes up often enough that I'm thinking about a layer to let you 
> easily examine a part of the data stream twice - once to learn from it, a 
> second time to actually parse it. In your case you want to examine bytes 1 to 
> 13 twice. Once to learn the section code and subsection code, a second time 
> when actually parsing the message. 
> 
> 
> 
> 
> 
> 
> On Wed, May 17, 2023 at 9:42 AM Roger L Costello <coste...@mitre.org> wrote:
>> Hi Mike,
>> 
>>  
>> 
>> "How does the format determine which record type, A, B, C, ..., is the one 
>> in the data?"
>>  
>> 
>> The input consists of lines. Each line is exactly 132 characters.
>> 
>>  
>> 
>> The type of a line is determined by a 1-character section code plus a 
>> 1-character subsection code. The section code is always located at character 
>> 5. The subsection code is always located either at character 6 or at 
>> character 13. Given that, how would I modify my DFDL schema to improve its 
>> performance?
>> 
>>  
>> 
>> From: Mike Beckerle <mbecke...@apache.org>
>> Sent: Wednesday, May 17, 2023 9:13 AM
>> To: users@daffodil.apache.org
>> Subject: [EXT] Re: The performance of Daffodil at the command line is 
>> horrible
>> 
>>  
>> 
>>  
>> 
>> The choice is certainly the likely suspect.
>> 
>>  
>> 
>> What you have here is an O(n * m) algorithm, where n is the number of 
>> records and m is the number of record types: each record may be tried 
>> against up to m choice branches before one succeeds. 
>> 
>>  
>> 
>> So, how does the format determine which record type, A, B, C, .... is the 
>> one in the data? 
>> 
>>  
>> 
>> Most formats will have one or a small handful of different criteria used, 
>> based on common initial parts of the data stream.
>> 
>>  
>> 
>> The secret is to capture those in exactly one place in the schema and expose 
>> it before the choice, so that the choice can exploit that common structure. 
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> On Wed, May 17, 2023 at 8:35 AM Roger L Costello <coste...@mitre.org> wrote:
>> 
>> The input file is 375 MB
>> The XML file that DFDL parsing generates is 4.67 GB
>> 
>> Time required for Daffodil to parse the input and generate the XML file is 
>> 16 minutes, 24 seconds.
>> 
>> Ugh!
>> 
>> That is too long. My customers will laugh at me if I suggest they use a tool 
>> that takes 16 minutes to parse their data.
>> 
>> Below is the skeletal structure of my DFDL schema. I am pretty sure the 
>> "choice" is the cause of the slowness. I don't see an alternative to the 
>> choice; each record of the input could be one of the choices (i.e., the 
>> input records aren't in any order). Any suggestions for improving the 
>> performance?
>> 
>> <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
>>     xmlns:fn="http://www.w3.org/2005/xpath-functions"
>>     xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
>> 
>>     <xs:annotation>
>>         <xs:appinfo source="http://www.ogf.org/dfdl/">
>>             <dfdl:format
>>                 alignment="1" 
>>                 alignmentUnits="bytes" 
>>                 choiceLengthKind="implicit"
>>                 emptyValueDelimiterPolicy="none" 
>>                 encoding="ASCII" 
>>                 encodingErrorPolicy="replace" 
>>                 escapeSchemeRef="" 
>>                 fillByte="%SP;" 
>>                 floating="no" 
>>                 ignoreCase="yes" 
>>                 initiatedContent="no" 
>>                 initiator="" 
>>                 leadingSkip="0"
>>                 lengthKind="delimited" 
>>                 lengthUnits="characters" 
>>                 nilValueDelimiterPolicy="none" 
>>                 occursCountKind="implicit" 
>>                 outputNewLine="%CR;%LF;" 
>>                 representation="text" 
>>                 separator="" 
>>                 separatorSuppressionPolicy="anyEmpty" 
>>                 sequenceKind="ordered" 
>>                 textBidi="no" 
>>                 textPadKind="none"
>>                 textTrimKind="none" 
>>                 trailingSkip="0" 
>>                 truncateSpecifiedLengthString="no" 
>>                 terminator="" 
>>                 textNumberRep="standard" 
>>                 textStandardBase="10" 
>>                 textStandardZeroRep="0" 
>>                 textNumberRounding="pattern" 
>>                 textStandardExponentRep="E" 
>>                 textNumberCheckPolicy="strict"
>>             />
>>         </xs:appinfo>
>>     </xs:annotation>
>> 
>>     <xs:element name="Test">
>>         <xs:complexType>
>>             <xs:sequence dfdl:separator="%NL;" dfdl:separatorPosition="infix">
>>                 <xs:element name="record" maxOccurs="unbounded">
>>                     <xs:complexType>
>>                         <xs:choice>
>>                             <xs:element ref="A" />
>>                             <xs:element ref="B" />
>>                             <xs:element ref="C" />
>>                             <xs:element ref="D" />
>>                             <!-- A hundred more of these element ref's -->
>>                         </xs:choice>
>>                     </xs:complexType>
>>                 </xs:element>
>>             </xs:sequence>
>>         </xs:complexType>
>>     </xs:element>
>> </xs:schema>
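
For readers following the thread, here is a minimal sketch (in Python, not DFDL) of why dispatching on a key read once from the record beats trying each choice branch in turn. It is an illustration under assumptions, not anyone's actual schema or Daffodil's implementation: the record keys ("DA", "PX", ...), the set of sections that carry the subsection code at character 13, and all function names are hypothetical. Only the 132-character line length and the positions of the codes (character 5, then character 6 or 13) come from the thread.

```python
from typing import Callable, Dict, Optional

LINE_LENGTH = 132  # every input line is exactly 132 characters (from the thread)

# Hypothetical rule: sections listed here carry the subsection code at
# character 13; all other sections carry it at character 6.
SECTIONS_WITH_SUBSECTION_AT_13 = {"P"}

Record = Dict[str, str]
Parser = Callable[[str], Optional[Record]]


def record_key(line: str) -> str:
    """Build the dispatch key from the section and subsection codes."""
    section = line[4]  # character 5, counting from 1
    sub_index = 12 if section in SECTIONS_WITH_SUBSECTION_AT_13 else 5
    return section + line[sub_index]


def make_parser(name: str, key: str) -> Parser:
    """Return a stand-in parser that fails (None) on the wrong record type."""
    def parse(line: str) -> Optional[Record]:
        if len(line) != LINE_LENGTH or record_key(line) != key:
            return None
        return {"type": name, "key": key}
    return parse


# Hypothetical record types and keys; the schema in the thread has about a hundred.
PARSERS: Dict[str, Parser] = {
    "DA": make_parser("A", "DA"),
    "DB": make_parser("B", "DB"),
    "EA": make_parser("C", "EA"),
    "PX": make_parser("D", "PX"),
}


def parse_by_trial(line: str) -> Optional[Record]:
    """Plain choice: try each branch in turn, up to m attempts per record."""
    for parse in PARSERS.values():
        record = parse(line)
        if record is not None:
            return record
    return None


def parse_by_dispatch(line: str) -> Optional[Record]:
    """Choice by dispatch: one dictionary lookup, then a single parse."""
    parse = PARSERS.get(record_key(line))
    return parse(line) if parse is not None else None


def make_line(section: str, subsection: str) -> str:
    """Build a 132-character sample line with the two codes in place."""
    chars = [" "] * LINE_LENGTH
    chars[4] = section
    index = 12 if section in SECTIONS_WITH_SUBSECTION_AT_13 else 5
    chars[index] = subsection
    return "".join(chars)
```

Both functions return the same record for a given line, but parse_by_trial may run up to m parsers per record while parse_by_dispatch runs exactly one. In DFDL terms, the dictionary lookup plays the role of the choice-by-dispatch described in the thread.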
