Roger said: So, that means it takes 130 seconds to parse the 5-million-line input file and build an internal infoset but only 96 seconds to create the 4 GB XML file. That makes no sense.
I agree with you that parsing 5M records and it taking 2 minutes means something is clearly wrong; it is much too slow. Two minutes to a modern CPU is like the Jurassic Age to a human, i.e., an almost unimaginably long time.

But parsing taking longer than writing out XML *can* make sense, depending on the format complexity. Writing out 4 GB of XML is just a bunch of text I/O, which is pretty fast per byte. The speed of DFDL parsing is proportional to *the number of decisions* the parser must make, which is linearly, but only weakly, correlated with the data size, and the constant factors vary widely with the format.

For an extreme example: there is a MIL-STD-2045 message header that is 33 bits long. It consists mostly of hidden groups of presence bits that are 0, indicating that some optional component is not present. Each such bit requires a DFDL choice of two possibilities and the evaluation of a choice-dispatch-key expression or an occursCount expression, and most of those then create *nothing* in the infoset. That is a bunch of overhead just to consume 1 bit of input and decide to add nothing to the infoset. Repeat that almost 30 times and you have consumed less than 5 bytes of input. Measured as parse speed in bytes/second, this is going to be super slow because every byte requires a bunch of parser decision making. Writing out the corresponding 956 bytes of XML text is going to be very quick in comparison to this parsing.

(FYI: this extreme example is on GitHub here:
https://github.com/DFDLSchemas/mil-std-2045/blob/master/src/test/resources/com/owlcyberdefense/mil-std-2045/milstd2045.tdml
The test is named test_2045_C_minimum_size_header.)

I realize your data and schema likely don't have such extreme behavior. We need to get your schema so we can figure out where the performance problem is, whether there is a workaround, and what kind of Daffodil features would eliminate all this guesswork about what is slow about it.
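To make the presence-bit pattern concrete, here is a minimal sketch of the occursCount-expression variant of it. This is not taken from the mil-std-2045 schema; the group name, the element names, and the ex: prefix are invented for illustration, and it assumes a dfdl:format default that supplies bit alignment, dfdl:binaryNumberRep="binary", and the other required properties:

<!-- Hypothetical sketch only; names are invented, not from mil-std-2045. -->

<!-- Hidden group holding a 1-bit field presence indicator. -->
<xs:group name="fpiGroup">
  <xs:sequence>
    <xs:element name="fpi" type="xs:unsignedInt"
                dfdl:representation="binary"
                dfdl:lengthKind="explicit"
                dfdl:length="1" dfdl:lengthUnits="bits"/>
  </xs:sequence>
</xs:group>

<xs:sequence>
  <!-- Parse the presence bit; it never appears in the output infoset. -->
  <xs:sequence dfdl:hiddenGroupRef="ex:fpiGroup"/>

  <!-- Optional component whose occurrence is driven by the hidden bit.
       When the bit is 0, the parser still evaluates the expression,
       decides on zero occurrences, and adds nothing to the infoset. -->
  <xs:element name="optionalField" type="xs:unsignedInt" minOccurs="0"
              dfdl:occursCountKind="expression"
              dfdl:occursCount="{ ../ex:fpi }"
              dfdl:representation="binary"
              dfdl:lengthKind="explicit"
              dfdl:length="8" dfdl:lengthUnits="bits"/>
</xs:sequence>

Every one of those 1-bit decisions costs a hidden-group parse, an expression evaluation, and a zero-or-one occurrence decision, which is why the bytes/second number looks terrible even though almost no data is consumed.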
On Thu, Jan 4, 2024 at 3:59 AM Roger L Costello <coste...@mitre.org> wrote:

> Steve wrote:
>
> Ø One way to get a decent approximation for how much time is used for the
> Ø former [build an internal infoset] is to use the "null" infoset
> Ø outputter, e.g.
> Ø
> Ø daffodil parse -I null ...
> Ø
> Ø This still parses data and builds the internal infoset but turns infoset
> Ø serialization into a no-op.
>
> Thanks Steve. I did as you suggested and here's the result:
>
> - 130 seconds
>
> That is super surprising. I would have expected it to take much, much less
> time.
>
> So, that means it takes 130 seconds to parse the 5-million-line input file
> and build an internal infoset but only 96 seconds to create the 4 GB XML
> file. That makes no sense.
>
> /Roger
>
> *From:* Steve Lawrence <slawre...@apache.org>
> *Sent:* Tuesday, January 2, 2024 9:18 AM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: Parsing 5 million lines of input is taking 4 minutes
> - too slow!
>
> > You are correct that daffodil builds an internal infoset and then
> > serializes that to something else (e.g. XML, EXI, JSON). One way to get
> > a decent approximation for how much time is used for the former is to
> > use the "null" infoset outputter, e.g.
> >
> > daffodil parse -I null ...
> >
> > This still parses data and builds the internal infoset but turns infoset
> > serialization into a no-op.
> >
> > On 2023-12-26 02:04 PM, Roger L Costello wrote:
> > > Hi Folks,
> > >
> > > My input file contains 5 million 132-character records.
> > >
> > > I have done everything that I can think of to make the parsing faster:
> > >
> > > 1. I precompiled the schema and used it to do the parsing
> > > 2. I set Java -Xmx40960m
> > > 3. I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer
> > >
> > > And yet it still takes 4 minutes before the (4 GB) XML file is
> > > produced. Waiting 4 minutes is not acceptable for my clients.
> > >
> > > A couple of questions:
> > >
> > > 1. Is there anything else that I can do to speed things up?
> > > 2. I believe there is time needed to do the parsing and generate an
> > >    in-memory parse tree, and there is time needed to serialize the
> > >    in-memory parse tree to an XML file. Is there a way to find those
> > >    two times? I suspect the former is a lot quicker than the latter.
> > >
> > > /Roger