It's actually not too surprising to me (not that that is a good thing).

If parsing to EXI takes 200 seconds (from a previous email) and parsing to null takes 130 seconds, then just outputting the EXI infoset takes about 70 seconds. Assuming building the internal infoset is the same order of magnitude as outputting the EXI infoset, at around 70 seconds (which might not be a safe assumption), that leaves about 60 seconds for just parsing the data. If I'm doing my math right, that's about 0.012 milliseconds per line.
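To spell out that arithmetic (figures are the ones quoted in this thread; the 70-second infoset-building number is the stated assumption, not a measurement):

```python
# Back-of-the-envelope breakdown of the reported Daffodil timings.
total_exi_s = 200       # parse + serialize to EXI (reported in a previous email)
total_null_s = 130      # parse + build internal infoset only (null outputter)
lines = 5_000_000       # 5 million 132-character records

exi_output_s = total_exi_s - total_null_s      # ~70 s spent just outputting EXI
infoset_build_s = 70                           # ASSUMED same order of magnitude
parse_only_s = total_null_s - infoset_build_s  # ~60 s of pure parsing

ms_per_line = parse_only_s / lines * 1000
print(exi_output_s, parse_only_s, round(ms_per_line, 3))  # 70 60 0.012
```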

That's still very slow in computer terms, but text parsing is pretty complex, especially if you have things like backtracking, pattern discriminators, and delimiter scanning, all of which can cause scanning the same data multiple times.

I'm sure there's room for performance improvements, but performance is something we haven't had as much time to focus on as we need to. It is also made more difficult because different formats require different optimizations; binary formats parse very differently than text formats, for example. We tend to focus on formats that we commonly come across, which tend not to be big text documents.

And as Mike points out, we don't really have the mechanisms for measuring which parts of Daffodil take most of the time and where to focus our efforts. Profilers have helped some in the past, but the results tend to just be too noisy. Something like what Mike suggested would probably help a lot.
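Lacking built-in instrumentation, a crude external measurement is still possible: run the same parse with different infoset outputters and diff the wall-clock times. A minimal sketch of that idea; the commented daffodil command lines and the schema/data file names are illustrative assumptions, only the generic timing helper is real:

```python
import subprocess
import time

def time_command(argv):
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start

# Illustrative usage (assumes the daffodil CLI is on PATH and the file
# names below exist; subtracting the two runs approximates serialization cost):
#   null_s = time_command(["daffodil", "parse", "-I", "null",
#                          "-s", "schema.dfdl.xsd", "data.dat"])
#   xml_s = time_command(["daffodil", "parse", "-I", "xml",
#                         "-s", "schema.dfdl.xsd", "data.dat"])
#   print(f"approx. XML serialization overhead: {xml_s - null_s:.1f} s")
```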

On 2024-01-04 03:58 AM, Roger L Costello wrote:
Steve wrote:

One way to get a decent approximation for how much time is used for the former [build an internal infoset] is to use the "null" infoset outputter, e.g.

    daffodil parse -I null ...

This still parses data and builds the internal infoset but turns infoset serialization into a no-op.

Thanks Steve. I did as you suggested and here’s the result:

    130 seconds

That is super surprising. I would have expected it to take much, much less time.

So, that means it takes 130 seconds to parse the 5-million-line input file and build an internal infoset but only 96 seconds to create the 4 GB XML file. That makes no sense.

/Roger

*From:* Steve Lawrence <slawre...@apache.org>
*Sent:* Tuesday, January 2, 2024 9:18 AM
*To:* users@daffodil.apache.org
*Subject:* [EXT] Re: Parsing 5 million lines of input is taking 4 minutes - too slow!

You are correct that daffodil builds an internal infoset and then

serializes that to something else (e.g. XML, EXI, JSON). One way to get

a decent approximation for how much time is used for the former is to

use the "null" infoset outputter, e.g.

    daffodil parse -I null ...

This still parses data and builds the internal infoset but turns infoset

serialization into a no-op.

On 2023-12-26 02:04 PM, Roger L Costello wrote:

Hi Folks,



My input file contains 5 million 132-character records.



I have done everything that I can think of to make the parsing faster:



  1. I precompiled the schema and used it to do the parsing

  2. I set Java -Xmx40960m

  3. I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer



And yet it still takes 4 minutes before the (4 GB) XML file is produced.

Waiting 4 minutes is not acceptable for my clients.



A couple of questions:



  1. Is there anything else that I can do to speed things up?

  2. I believe there is time needed to do the parsing and generate an

     in-memory parse tree, and there is time needed to serialize the

     in-memory parse tree to an XML file. Is there a way to find those

     two times? I suspect the former is a lot quicker than the latter.



/Roger
