Roger said: So, that means it takes 130 seconds to parse the 5-million-line
input file and build an internal infoset but only 96 seconds to create the
4 GB XML file. That makes no sense.

I agree with you: parsing 5M records taking 2 minutes means something is
clearly wrong; it is much too slow.
2 minutes to a modern CPU is like the Jurassic Age to a human, i.e., an
almost unimaginably long time.

But parsing taking longer than writing out XML *can* make sense depending
on the format complexity.  Writing out 4GB of XML is just a bunch of text
I/O. Per byte that's pretty fast.
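As a rough illustration of that point, here is a toy benchmark (plain Python, not a measurement of Daffodil's XML writer) showing that sequential text output is cheap per byte:

```python
import io
import time

# Toy benchmark: how fast is plain sequential text output?
# (Illustrative only; the record markup below is made up.)
chunk = b"<record><field>0123456789</field></record>\n" * 1024

start = time.perf_counter()
buf = io.BytesIO()
while buf.tell() < 100_000_000:   # write ~100 MB of XML-ish text
    buf.write(chunk)
elapsed = time.perf_counter() - start

print(f"{buf.tell() / 1e6:.0f} MB in {elapsed:.3f} s")
```

On typical hardware this runs in well under a second, which is why raw serialization throughput is rarely the bottleneck compared to per-record parse decisions.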

The speed of DFDL parsing is proportional to *the number of decisions* the
parser must make, which is linearly, but only weakly, correlated with the
data size. The constant factors vary widely with the format.

For an extreme example: there is a mil-std-2045 message header that is 33
bits long. It mostly consists of hidden groups of presence bits that are 0
indicating that some optional component is not present. Each such bit
requires a DFDL choice of two possibilities, evaluation of a
choice-dispatch-key expression or an occursCount expression, and most of
those then create *nothing* in the infoset. So a bunch of overhead to
consume 1 bit of input and decide to add nothing to the infoset. Repeat
almost 30 times. You have now consumed less than 5 bytes of the input. In
terms of parse speed in bytes/second this is going to be super slow because
every byte requires a bunch of parser decision making.  Writing out the
corresponding 956 bytes of XML text is going to be very quick in comparison
to this parsing.
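To make the arithmetic concrete, here is a toy sketch (plain Python, not Daffodil internals; the element names are made up) of why a header of mostly-zero presence bits costs so much per byte:

```python
# Toy model of parsing a header made of presence bits (not Daffodil code).
# Each 0 bit still costs a two-way choice plus an expression evaluation,
# yet contributes nothing to the infoset.
def parse_presence_bits(bits):
    infoset = {}
    decisions = 0
    for i, bit in enumerate(bits):
        decisions += 1            # resolve the two-way DFDL choice
        decisions += 1            # evaluate the dispatch/occursCount expression
        if bit:                   # only a 1 bit adds anything to the infoset
            infoset[f"opt{i}"] = True
    return infoset, decisions

infoset, decisions = parse_presence_bits([0] * 30)
print(decisions, len(infoset))    # 60 decisions for < 5 bytes of input, empty infoset
```

Roughly 60 decisions to consume fewer than 5 bytes, producing nothing, is why bytes/second is a misleading metric for a format like this.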

(FYI: This extreme example is on github here:
https://github.com/DFDLSchemas/mil-std-2045/blob/master/src/test/resources/com/owlcyberdefense/mil-std-2045/milstd2045.tdml
The test is named test_2045_C_minimum_size_header. )

I realize your data and schema likely don't have such extreme behavior.

We need to get your schema so we can figure out where the performance
problem is, whether there is a workaround, and what kind of Daffodil
features would eliminate all this guesswork about what's slow about it.

On Thu, Jan 4, 2024 at 3:59 AM Roger L Costello <coste...@mitre.org> wrote:

> Steve wrote:
>
> One way to get a decent approximation for how much time is used for the
> former [build an internal infoset] is to use the "null" infoset outputter,
> e.g.
>
>    daffodil parse -I null ...
>
> This still parses data and builds the internal infoset but turns infoset
> serialization into a no-op.
>
> Thanks Steve. I did as you suggested and here’s the result:
>
>    - 130 seconds
>
> That is super surprising. I would have expected it to take much, much less
> time.
>
> So, that means it takes 130 seconds to parse the 5-million-line input file
> and build an internal infoset but only 96 seconds to create the 4 GB XML
> file. That makes no sense.
>
> /Roger
>
> *From:* Steve Lawrence <slawre...@apache.org>
> *Sent:* Tuesday, January 2, 2024 9:18 AM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] Re: Parsing 5 million lines of input is taking 4 minutes
> - too slow!
>
> You are correct that daffodil builds an internal infoset and then
> serializes that to something else (e.g. XML, EXI, JSON). One way to get
> a decent approximation for how much time is used for the former is to
> use the "null" infoset outputter, e.g.
>
>    daffodil parse -I null ...
>
> This still parses data and builds the internal infoset but turns infoset
> serialization into a no-op.
>
>
>
> On 2023-12-26 02:04 PM, Roger L Costello wrote:
>
> > Hi Folks,
> >
> > My input file contains 5 million 132-character records.
> >
> > I have done everything that I can think of to make the parsing faster:
> >
> >  1. I precompiled the schema and used it to do the parsing
> >  2. I set Java -Xmx40960m
> >  3. I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer
> >
> > And yet it still takes 4 minutes before the (4 GB) XML file is produced.
> > Waiting 4 minutes is not acceptable for my clients.
> >
> > A couple of questions:
> >
> >  1. Is there anything else that I can do to speed things up?
> >  2. I believe there is time needed to do the parsing and generate an
> >     in-memory parse tree, and there is time needed to serialize the
> >     in-memory parse tree to an XML file. Is there a way to find those
> >     two times? I suspect the former is a lot quicker than the latter.
> >
> > /Roger
