It's actually not too surprising to me (not that that is a good thing).

If parsing to EXI takes 200 seconds (from a previous email) and parsing to null takes 130 seconds, then just outputting the EXI infoset takes about 70 seconds. Assuming building the internal infoset is the same order of magnitude as outputting the EXI infoset, at around 70 seconds (which might not be a safe assumption), that leaves about 60 seconds for just parsing the data. If I'm doing my math right, that's about 0.012 milliseconds per line.
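To spell out that arithmetic (figures are the ones quoted in this thread; the 70-second infoset-building number is the stated assumption, not a measurement):

```python
# Back-of-the-envelope breakdown of the reported Daffodil timings.
total_exi_s = 200       # parse + serialize to EXI (reported in a previous email)
total_null_s = 130      # parse + build internal infoset only (null outputter)
lines = 5_000_000       # 5 million 132-character records

exi_output_s = total_exi_s - total_null_s      # ~70 s spent just outputting EXI
infoset_build_s = 70                           # ASSUMED same order of magnitude
parse_only_s = total_null_s - infoset_build_s  # ~60 s of pure parsing

ms_per_line = parse_only_s / lines * 1000
print(exi_output_s, parse_only_s, round(ms_per_line, 3))  # 70 60 0.012
```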

That's still very slow in computer terms, but text parsing is pretty complex, especially if you have things like backtracking, pattern discriminators, and delimiter scanning, all of which can cause scanning the same data multiple times.

I'm sure there's room for performance improvements, but performance is something we haven't had as much time to focus on as we need to. It is also made more difficult because different formats require different optimizations; binary formats parse very differently than text formats, for example. We tend to focus on formats that we commonly come across, which tend not to be big text documents.

And as Mike points out, we don't really have the mechanisms for measuring which parts of Daffodil take most of the time and where to focus our efforts. Profilers have helped some in the past, but the results tend to just be too noisy. Something like what Mike suggested would probably help a lot.
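Lacking built-in instrumentation, a crude external measurement is still possible: run the same parse with different infoset outputters and diff the wall-clock times. A minimal sketch of that idea; the commented daffodil command lines and the schema/data file names are illustrative assumptions, only the generic timing helper is real:

```python
import subprocess
import time

def time_command(argv):
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start

# Illustrative usage (assumes the daffodil CLI is on PATH and the file
# names below exist; subtracting the two runs approximates serialization cost):
#   null_s = time_command(["daffodil", "parse", "-I", "null",
#                          "-s", "schema.dfdl.xsd", "data.dat"])
#   xml_s = time_command(["daffodil", "parse", "-I", "xml",
#                         "-s", "schema.dfdl.xsd", "data.dat"])
#   print(f"approx. XML serialization overhead: {xml_s - null_s:.1f} s")
```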

On 2024-01-04 03:58 AM, Roger L Costello wrote:
Steve wrote:

One way to get a decent approximation for how much time is used for the former [build an internal infoset] is to use the "null" infoset outputter, e.g.

    daffodil parse -I null ...

This still parses data and builds the internal infoset but turns infoset serialization into a no-op.

Thanks Steve. I did as you suggested and here’s the result:

    130 seconds

That is super surprising. I would have expected it to take much, much less time.

So, that means it takes 130 seconds to parse the 5-million-line input file and build an internal infoset but only 96 seconds to create the 4 GB XML file. That makes no sense.

/Roger

*From:* Steve Lawrence <slawre...@apache.org>
*Sent:* Tuesday, January 2, 2024 9:18 AM
*To:* users@daffodil.apache.org
*Subject:* [EXT] Re: Parsing 5 million lines of input is taking 4 minutes - too slow!

You are correct that daffodil builds an internal infoset and then

serializes that to something else (e.g. XML, EXI, JSON). One way to get

a decent approximation for how much time is used for the former is to

use the "null" infoset outputter, e.g.

    daffodil parse -I null ...

This still parses data and builds the internal infoset but turns infoset

serialization into a no-op.

On 2023-12-26 02:04 PM, Roger L Costello wrote:

Hi Folks,



My input file contains 5 million 132-character records.



I have done everything that I can think of to make the parsing faster:



  1. I precompiled the schema and used it to do the parsing

  2. I set Java -Xmx40960m

  3. I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer



And yet it still takes 4 minutes before the (4 GB) XML file is produced.

Waiting 4 minutes is not acceptable for my clients.



A couple of questions:



  1. Is there anything else that I can do to speed things up?

  2. I believe there is time needed to do the parsing and generate an

     in-memory parse tree, and there is time needed to serialize the

     in-memory parse tree to an XML file. Is there a way to find those

     two times? I suspect the former is a lot quicker than the latter.



/Roger
