RE: Parsing 5 million lines of input is taking 4 minutes - too slow!

Interrante, John A (GE Aerospace, US) Tue, 26 Dec 2023 14:08:46 -0800

Hi Roger,

If you are using Daffodil 3.4.0 or later (3.6.0 would be better), the Daffodil 
CLI has two new infoset types, -I exi and -I exisa, which output 
non-schema-aware and schema-aware EXI infosets, respectively. EXI (binary XML) 
infosets are significantly smaller than normal XML infosets and, when made 
schema aware, are often even smaller than the original data. Daffodil uses the 
Exificient library to support these infoset types.


Try adding -I exi or -I exisa to your "daffodil parse" command to see how much 
it speeds up your parsing. Instead of a 4 GB XML infoset file, you should end 
up with a much smaller EXI infoset file. Whether it can be used in place of the 
XML infoset depends on how the end user application reads the infoset (the 
application also needs to use an EXI library).
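
One way to compare the infoset types is to time the same parse with each one. 
The Python sketch below is only an illustration under assumptions: the file 
names are hypothetical, it assumes "daffodil" is on the PATH, and it assumes 
the CLI's -P (saved parser) and -o (output file) options alongside -I. It 
compares xml and exi only; whether -I exisa works with a saved parser or needs 
the schema itself is worth checking against "daffodil parse --help".

```python
import subprocess
import sys
import time


def build_parse_command(parser_file, input_file, output_file, infoset_type):
    """Return the argv list for one 'daffodil parse' run with a saved parser."""
    return [
        "daffodil", "parse",
        "-P", parser_file,    # precompiled (saved) parser
        "-I", infoset_type,   # infoset type, e.g. "xml" or "exi"
        "-o", output_file,    # where the infoset is written
        input_file,
    ]


def time_parse(parser_file, input_file, infoset_type):
    """Run one parse and return the elapsed wall-clock time in seconds."""
    cmd = build_parse_command(
        parser_file, input_file, f"infoset.{infoset_type}", infoset_type
    )
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    if len(sys.argv) == 3:
        # Usage: python time_parse.py <saved-parser> <input-file>
        for infoset_type in ("xml", "exi"):
            elapsed = time_parse(sys.argv[1], sys.argv[2], infoset_type)
            print(f"{infoset_type}: {elapsed:.1f} s")
    else:
        # Dry run with hypothetical file names: only show the commands.
        for infoset_type in ("xml", "exi"):
            cmd = build_parse_command(
                "records.bin", "records.txt",
                f"infoset.{infoset_type}", infoset_type,
            )
            print(" ".join(cmd))
```

The wall-clock time includes JVM startup, so on a multi-minute parse the 
difference between the two runs should mostly reflect the cost of writing the 
infoset.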

John

From: Roger L Costello <coste...@mitre.org>
Sent: Tuesday, December 26, 2023 11:04 AM
To: users@daffodil.apache.org
Subject: EXT: Parsing 5 million lines of input is taking 4 minutes - too slow!

Hi Folks,

My input file contains 5 million 132-character records.

I have done everything that I can think of to make the parsing faster:


  1.  I precompiled the schema and used it to do the parsing
  2.  I set Java -Xmx40960m
  3.  I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer

And yet it still takes 4 minutes before the (4 GB) XML file is produced. 
Waiting 4 minutes is not acceptable for my clients.
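
For scale, the figures above can be turned into throughput numbers. This is a 
rough sketch that assumes one byte per character, ignores line terminators, and 
treats "4 GB" as 4 * 10^9 bytes; none of those details are stated above.

```python
def summarize(records, record_chars, elapsed_s, xml_bytes):
    """Derive rough throughput figures from the numbers in the message."""
    input_bytes = records * record_chars  # assumes 1 byte per character
    return {
        "input_mb": input_bytes / 1e6,
        "records_per_s": records / elapsed_s,
        "input_mb_per_s": input_bytes / 1e6 / elapsed_s,
        "xml_mb_per_s": xml_bytes / 1e6 / elapsed_s,
        "expansion": xml_bytes / input_bytes,  # XML size relative to input
    }


if __name__ == "__main__":
    stats = summarize(
        records=5_000_000,
        record_chars=132,
        elapsed_s=4 * 60,
        xml_bytes=4 * 10**9,  # assumption: "4 GB" means 4e9 bytes
    )
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

Under those assumptions the input is about 660 MB, parsed at roughly 2.75 MB/s 
(about 20,800 records per second), and the XML infoset is about six times the 
size of the input.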

A couple of questions:


  1.  Is there anything else that I can do to speed things up?
  2.  I believe there is time needed to do the parsing and generate an 
in-memory parse tree, and there is time needed to serialize the in-memory parse 
tree to an XML file. Is there a way to find those two times? I suspect the 
former is a lot quicker than the latter.

/Roger
