Re: Parsing 5 million lines of input is taking 4 minutes - too slow!

Mike Beckerle Tue, 02 Jan 2024 06:26:40 -0800

I expect the performance issue is related to some poor behavior inside
Daffodil, but it is good to rule out the XML overhead via the EXI
experiments, so thanks for doing that.
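
To make the "rule out XML overhead" reasoning concrete, here is a rough back-of-the-envelope calculation using only the figures reported further down in this thread (5 million 132-character records, 226 seconds for XML, 206 seconds for EXI). It assumes one byte per character and ignores line terminators, so the input size is approximate. It suggests that switching to EXI saved only about 9% of the run time, so infoset serialization is unlikely to be the dominant cost.

```python
# Figures reported in this thread. The byte count assumes one byte per
# character and ignores line terminators (an assumption, not a measurement).
records = 5_000_000
record_bytes = 132
xml_seconds = 226
exi_seconds = 206

input_mb = records * record_bytes / 1_000_000

# Fraction of the run time that went away when writing EXI instead of XML.
saved_fraction = (xml_seconds - exi_seconds) / xml_seconds

# Throughput of the faster (EXI) run.
records_per_second = records / exi_seconds
mb_per_second = input_mb / exi_seconds

print(f"approximate input size: {input_mb:.0f} MB")
print(f"time saved by EXI output: {saved_fraction:.1%}")
print(f"EXI run: {records_per_second:,.0f} records/s, {mb_per_second:.1f} MB/s")
```

A throughput of a few megabytes per second for fixed-length text records is consistent with the suspicion that the time is going somewhere inside the parse itself.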


My concern is not just finding the issue; it is also enabling better
instrumentation inside Daffodil to make this sort of troubleshooting less
opaque.

I'm thinking of a way to turn on statistics reports that would tell you, as
one example, how much backtracking was done, so that you could determine
whether optimizing out backtracking choices and the like (which you have
already done) is having the effect you hope it *should* have. Just a count
of the total number of parse steps vs. the number of those steps that were
backtracked would be very helpful. A histogram of such data, showing counts
of where the backtracking happened (at which points of uncertainty), would
be another useful level of detail.
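
As a minimal sketch of what such a report could collect, the following keeps a total step count, a count of steps later discarded by backtracking, and a per-point-of-uncertainty histogram. Everything here is hypothetical: `ParseStats`, its method names, and the point labels are illustrative only and are not part of any existing Daffodil API.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ParseStats:
    """Hypothetical counters for a backtracking report (not a Daffodil API)."""
    steps: int = 0
    backtracked_steps: int = 0
    backtracks_by_point: Counter = field(default_factory=Counter)

    def record_step(self) -> None:
        # Called once per parse step attempted.
        self.steps += 1

    def record_backtrack(self, point: str, discarded_steps: int) -> None:
        # Called when the parser backtracks at a point of uncertainty,
        # throwing away `discarded_steps` steps of speculative work.
        self.backtracked_steps += discarded_steps
        self.backtracks_by_point[point] += 1

    def report(self) -> str:
        ratio = self.backtracked_steps / self.steps if self.steps else 0.0
        lines = [
            f"parse steps: {self.steps}",
            f"backtracked steps: {self.backtracked_steps} ({ratio:.1%})",
        ]
        # Histogram: most frequent backtracking points first.
        for point, count in self.backtracks_by_point.most_common():
            lines.append(f"  {point}: {count} backtracks")
        return "\n".join(lines)


# Usage with made-up numbers and made-up point names.
stats = ParseStats()
for _ in range(10):
    stats.record_step()
stats.record_backtrack("choice recordType", discarded_steps=3)
stats.record_backtrack("choice recordType", discarded_steps=2)
stats.record_backtrack("optional trailer", discarded_steps=1)
print(stats.report())
```

Comparing such a report before and after a schema change (for example, adding dfdl:choiceDispatchKey) would show directly whether the backtracked fraction dropped.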

There are also some runtime behaviors within Daffodil that are adaptive,
e.g., gradually enlarging buffers used for regex matching. There are various
ways these could be behaving badly, and we'd like to be able to rule those
out as well.
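
To illustrate one way an adaptive buffer could behave badly, here is a toy cost model. It is purely an assumption for illustration and does not describe how Daffodil's regex matching actually works: it supposes a matcher that starts with a small buffer, enlarges it by a fixed factor whenever the match runs off the end, and rescans from the start after each enlargement. The function name and default sizes are invented.

```python
def growth_cost(needed: int, initial: int = 64, factor: int = 2) -> tuple[int, int]:
    """Toy model (an assumption, not Daffodil's implementation).

    Returns (enlargements, total characters scanned) for a matcher that
    needs `needed` characters, starts with `initial`, multiplies the buffer
    by `factor` when it is too small, and rescans from the start each time.
    """
    size = initial
    enlargements = 0
    scanned = min(size, needed)      # first attempt
    while size < needed:
        size *= factor
        enlargements += 1
        scanned += min(size, needed)  # rescan after enlarging
    return enlargements, scanned


# A 132-character record against a 64-character starting buffer.
print(growth_cost(132))
# A field that fits in the starting buffer needs no enlargement.
print(growth_cost(50))
```

If something like this happened once per record rather than once per parse, the extra scanning would be multiplied by the record count, which is the kind of behavior a counter of enlargements would expose.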

We might also want to add timing instrumentation that measures how much time
is spent in things like text lexical scanning, input and output value
calculation, etc.
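
A minimal sketch of that kind of per-category timing, assuming a simple accumulate-by-label design. The `timed` helper, the category names, and the stand-in workloads are all hypothetical and only show the shape of the idea.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)  # accumulated seconds per category
calls = defaultdict(int)     # number of timed regions per category


@contextmanager
def timed(category: str):
    """Accumulate wall-clock time spent inside the with-block under `category`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[category] += time.perf_counter() - start
        calls[category] += 1


# Stand-in workloads; a real parser would wrap its own scanning and
# expression-evaluation code instead.
with timed("text scanning"):
    "a,b,c".split(",")
with timed("value calc"):
    sum(range(1000))
with timed("text scanning"):
    "d,e,f".split(",")

# Report categories from most to least expensive.
for category, seconds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category}: {seconds * 1e6:.1f} us over {calls[category]} calls")
```

Note that nested timed regions would be counted in both categories with this design, so the totals are only additive if the regions do not overlap.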

On Wed, Dec 27, 2023 at 3:38 AM Roger L Costello <coste...@mitre.org> wrote:

> Thanks John and Don.
>
>
>
> Speeding up parsing by outputting EXI is an interesting idea. I guess the
> theory behind it is: If it takes a long time to output a huge text XML
> file, then instead output a much smaller binary EXI file. Interesting idea!
>
>
>
> Okay, I gave it a go. Here are the results:
>
>
>
> Parsing to a huge text XML file took 226 seconds.
>
> Parsing to a binary EXI file took 206 seconds.
>
>
>
> Parsing to EXI was a little faster. Not a big difference.
>
>
>
> The difference in the size of the output was remarkable:
>
>
>
> The size of the text XML file is 4 GB.
>
> The size of the EXI file is 287 MB.
>
>
>
> /Roger
>
>
>
> *From:* Interrante, John A (GE Aerospace, US) <john.interra...@ge.com>
> *Sent:* Tuesday, December 26, 2023 5:08 PM
> *To:* users@daffodil.apache.org
> *Subject:* [EXT] RE: Parsing 5 million lines of input is taking 4 minutes
> - too slow!
>
>
>
> Hi Roger,
>
>
>
> If you are using Daffodil 3.4.0 or later (3.6.0 would be better), then the
> Daffodil CLI has two new infoset types (-I exi and -I exisa) which will
> output non-schema aware EXI infosets and schema aware EXI infosets,
> respectively. EXI (binary XML) infosets are significantly smaller in size
> than normal XML infosets, often even smaller than the original data format
> when made schema aware.  Daffodil uses the Exificient library to support
> these infoset types.
>
>
>
> Try adding -I exi or -I exisa to your “daffodil parse” command to see how
> much it can speed up your parsing.  You will not end up with a 4 GB XML
> infoset file; you should end up with a much smaller EXI infoset file,
> which can be used in place of the XML infoset depending on how the end-user
> application reads the infoset (the end-user application also needs to use
> an EXI library).
>
>
>
> John
>
>
>
> *From:* Roger L Costello <coste...@mitre.org>
> *Sent:* Tuesday, December 26, 2023 11:04 AM
> *To:* users@daffodil.apache.org
> *Subject:* EXT: Parsing 5 million lines of input is taking 4 minutes -
> too slow!
>
>
>
> Hi Folks,
>
>
>
> My input file contains 5 million 132-character records.
>
>
>
> I have done everything that I can think of to make the parsing faster:
>
>
>
>    1. I precompiled the schema and used it to do the parsing
>    2. I set Java -Xmx40960m
>    3. I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer
>
>
>
> And yet it still takes 4 minutes before the (4 GB) XML file is produced.
> Waiting 4 minutes is not acceptable for my clients.
>
>
>
> A couple of questions:
>
>
>
>    1. Is there anything else that I can do to speed things up?
>    2. I believe there is time needed to do the parsing and generate an
>    in-memory parse tree, and there is time needed to serialize the in-memory
>    parse tree to an XML file. Is there a way to find those two times? I
>    suspect the former is a lot quicker than the latter.
>
>
>
> /Roger
>
