It's actually not too surprising to me (not that that is a good thing).
If parsing to EXI takes 200 seconds (from a previous email) and parsing
to NULL takes 130 seconds, that means just outputting the EXI infoset
takes about 70 seconds. Assuming just building the internal infoset is
about the same order of magnitude as the EXI infoset at around 70
seconds (which might not be a safe assumption), that means roughly 60
seconds is spent just parsing the data. If I'm doing my math right,
that's about 0.012 milliseconds, i.e. 12 microseconds, per line.
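As a quick sanity check on that arithmetic (using the ~60 seconds of
parse time estimated above and the 5 million lines from the original
email):

```python
# Rough per-line parse cost: ~60 s of estimated pure parse time
# spread over the 5 million input lines.
parse_seconds = 60
lines = 5_000_000

ms_per_line = parse_seconds / lines * 1000
print(f"{ms_per_line:.3f} ms per line")  # prints "0.012 ms per line"
```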
That's still very slow in computer terms, but text parsing is pretty
complex, especially with features like backtracking, pattern
discriminators, and delimiter scanning, all of which can cause the same
data to be scanned multiple times.
I'm sure there's room for performance improvements, but that is
something we haven't had as much time to focus on as we need to. It
is also made more difficult since different formats require different
optimizations--binary formats parse very differently than text formats,
for example. We tend to focus on formats that we commonly come across,
which tends not to be big text documents.
And as Mike points out, we don't really have the mechanisms for
measuring which parts of Daffodil take most of the time and where to
focus our efforts. Profilers have helped some in the past, but the
results tend to just be too noisy. Something like what Mike suggested
would probably help a lot.
On 2024-01-04 03:58 AM, Roger L Costello wrote:
Steve wrote:
> One way to get a decent approximation for how much time is used for the
> former [build an internal infoset] is to use the "null" infoset
> outputter, e.g.
>
>     daffodil parse -I null ...
>
> This still parses data and builds the internal infoset but turns infoset
> serialization into a no-op.
Thanks Steve. I did as you suggested and here’s the result:
* 130 seconds
That is super surprising. I would have expected it to take much, much
less time.
So, that means it takes 130 seconds to parse the 5-million-line input
file and build an internal infoset but only 96 seconds to create the 4
GB XML file. That makes no sense.
/Roger
*From:* Steve Lawrence <slawre...@apache.org>
*Sent:* Tuesday, January 2, 2024 9:18 AM
*To:* users@daffodil.apache.org
*Subject:* [EXT] Re: Parsing 5 million lines of input is taking 4
minutes - too slow!
You are correct that daffodil builds an internal infoset and then
serializes that to something else (e.g. XML, EXI, JSON). One way to get
a decent approximation for how much time is used for the former is to
use the "null" infoset outputter, e.g.
daffodil parse -I null ...
This still parses data and builds the internal infoset but turns infoset
serialization into a no-op.
On 2023-12-26 02:04 PM, Roger L Costello wrote:
Hi Folks,
My input file contains 5 million 132-character records.
I have done everything that I can think of to make the parsing faster:
1. I precompiled the schema and used it to do the parsing
2. I set Java -Xmx40960m
3. I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer
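For reference, step 1 with the Daffodil CLI typically looks something
like the following (a sketch only; the schema and file names are
placeholders, and exact flags can vary by Daffodil version, so check
`daffodil --help`):

```shell
# Compile the schema once and save the compiled parser (one-time cost).
daffodil save-parser -s mySchema.dfdl.xsd mySchema.bin

# Reuse the saved parser on each run, skipping schema compilation.
daffodil parse --parser mySchema.bin -o output.xml input.dat
```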
And yet it still takes 4 minutes before the (4 GB) XML file is produced.
Waiting 4 minutes is not acceptable for my clients.
A couple of questions:
1. Is there anything else that I can do to speed things up?
2. I believe there is time needed to do the parsing and generate an
in-memory parse tree, and there is time needed to serialize the
in-memory parse tree to an XML file. Is there a way to find those
two times? I suspect the former is a lot quicker than the latter.
/Roger