Re: Parsing 5 million lines of input is taking 4 minutes - too slow!

Roger L Costello Thu, 04 Jan 2024 00:59:36 -0800

Steve wrote:

> One way to get a decent approximation for how much time is used for the
> former [build an internal infoset] is to use the "null" infoset outputter, e.g.
>
>    daffodil parse -I null ...
>
> This still parses data and builds the internal infoset but turns infoset
> serialization into a no-op.

Thanks Steve. I did as you suggested and here’s the result:

  *   130 seconds

That is super surprising. I would have expected it to take much, much less time.

So, that means it takes 130 seconds to parse the 5-million-line input file and 
build an internal infoset but only 96 seconds to create the 4 GB XML file. That 
makes no sense.
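
The split above can be written out as a small calculation. This is only a sketch of the arithmetic: the 130 and 96 second figures are the ones quoted in this message, while the 226 second total is inferred by adding them and is not a separately reported measurement (the original post said roughly 4 minutes).

```shell
# Figures from this message; the total is inferred, not measured.
null_s=130        # parse + build internal infoset (-I null run)
xml_only_s=96     # remaining time attributed to XML serialization

total_s=$((null_s + xml_only_s))
parse_pct=$((100 * null_s / total_s))   # integer division, rounds down

echo "total=${total_s}s parse_share=${parse_pct}%"
```

By this estimate, more than half of the run is spent before any XML is written.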

/Roger

From: Steve Lawrence <slawre...@apache.org>
Sent: Tuesday, January 2, 2024 9:18 AM
To: users@daffodil.apache.org
Subject: [EXT] Re: Parsing 5 million lines of input is taking 4 minutes - too 
slow!

You are correct that daffodil builds an internal infoset and then
serializes that to something else (e.g. XML, EXI, JSON). One way to get
a decent approximation for how much time is used for the former is to
use the "null" infoset outputter, e.g.

   daffodil parse -I null ...

This still parses data and builds the internal infoset but turns infoset
serialization into a no-op.
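
One way to compare the two runs is to time each with a small wrapper. The sketch below is an assumption-laden illustration, not a recommended procedure: time_run is a hypothetical helper, the file names (records.bin, input.txt, output.xml) are placeholders, and it relies on "date +%s", which is widely available but only gives whole-second resolution.

```shell
# time_run LABEL CMD...: run CMD, then print "LABEL: N s" (wall-clock seconds).
# Hypothetical helper; returns the exit status of CMD.
time_run() {
  tr_label=$1
  shift
  tr_start=$(date +%s)
  "$@"
  tr_status=$?
  tr_end=$(date +%s)
  echo "$tr_label: $((tr_end - tr_start)) s"
  return $tr_status
}

# Example usage (placeholder file names; adjust to your own setup):
# time_run "parse + infoset only" daffodil parse -P records.bin -I null -o /dev/null input.txt
# time_run "parse + XML output"   daffodil parse -P records.bin -I xml -o output.xml input.txt
```

The difference between the two reported times gives a rough estimate of the serialization cost. Repeating each run a few times should help smooth out JVM startup and disk cache effects.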

On 2023-12-26 02:04 PM, Roger L Costello wrote:

> Hi Folks,
>
> My input file contains 5 million 132-character records.
>
> I have done everything that I can think of to make the parsing faster:
>
>  1. I precompiled the schema and used it to do the parsing
>  2. I set Java -Xmx40960m
>  3. I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer
>
> And yet it still takes 4 minutes before the (4 GB) XML file is produced.
> Waiting 4 minutes is not acceptable for my clients.
>
> A couple of questions:
>
>  1. Is there anything else that I can do to speed things up?
>  2. I believe there is time needed to do the parsing and generate an
>     in-memory parse tree, and there is time needed to serialize the
>     in-memory parse tree to an XML file. Is there a way to find those
>     two times? I suspect the former is a lot quicker than the latter.
>
> /Roger
