RE: Parsing 5 million lines of input is taking 4 minutes - too slow!

Brutzman, Donald (Don) (CIV) Wed, 27 Dec 2023 11:45:46 -0800

Please advise whether you used schema-aware EXI.  We’ve had some examples where 
using numeric datatypes (rather than strings) made a big difference in results.


 

The similar times for parsing are not a surprise.  I’d expect that the load time 
for parsing the EXI would be much less than for parsing XML/text.

 

Experimental results assessing 21 properties of interest showed that EXI 
compactness and decoding speed meet or beat both XML and gzip performance *in 
all cases*, with additional tests showing ZIP closely equivalent to gzip.

 

*       Efficient XML Interchange Evaluation
*       Editor Carine Bournez, W3C Working Draft, 7 April 2009
*       This Working Group Note is an evaluation of the Efficient XML 
Interchange (EXI) Format 1.0 with reference to the Properties identified by the 
XML Binary Characterization (XBC) Working Group, relative to XML, gzipped XML 
and ASN.1 PER. It is conducted using the XBC Measurement methodology. For the 
"compactness" and "processing efficiency" Properties, the performance is 
measured with EXI Measurement framework, over the test data collected for the 
EXI measurements, representing XBC Use Cases.
*       https://www.w3.org/TR/exi-evaluation

 

An excellent thesis follows.  Bruce Hill’s final parsing experiments prior to 
graduation actually included files up to 30GB, with consistent 
compaction/performance ratios.

 

*       Evaluation of efficient XML interchange (EXI) for large datasets and as 
an alternative to binary JSON encodings
*       Hill, Bruce W., Master’s Thesis, Naval Postgraduate School (NPS), 
2015-03
*       Abstract.  Current and emerging Navy information concepts, including 
network-centric warfare and Navy Tactical Cloud, presume high network 
throughput and interoperability. The Extensible Markup Language (XML) addresses 
the latter requirement, but its verbosity is problematic for afloat networks. 
JavaScript Object Notation (JSON) is an alternative to XML common in web 
applications and some non-relational databases. Compact, binary encodings exist 
for both formats. Efficient XML Interchange (EXI) is a standardized, binary 
encoding of XML. Binary JSON (BSON) and Compact Binary Object Representation 
(CBOR) are JSON-compatible encodings. This work evaluates EXI compaction 
against both encodings, and extends evaluations of EXI for datasets up to 4 
gigabytes. Generally, a configuration of EXI exists that produces a more 
compact encoding than BSON or CBOR. Tests show EXI compacts structured, 
non-multimedia data in Microsoft Office files better than the default format. 
The Navy needs to immediately consider EXI for use in web, sensor, and office 
document applications to improve throughput over constrained networks. To 
maximize EXI benefits, future work needs to evaluate EXI’s parameters, as well 
as tune XML schema documents, on a case-by-case basis prior to EXI deployment. 
A suite of test examples and an evaluation framework also need to be developed 
to support this process. 
*       Received NPS Outstanding Thesis Award
*       https://calhoun.nps.edu/handle/10945/45196

 

all the best, Don

-- 

Don Brutzman  Naval Postgraduate School, Code USW/Br        brutz...@nps.edu

Watkins 270,  MOVES Institute, Monterey CA 93943-5000 USA    +1.831.656.2149

X3D graphics, virtual worlds, navy robotics https://faculty.nps.edu/brutzman

 

From: Roger L Costello <coste...@mitre.org> 
Sent: Wednesday, December 27, 2023 12:38 AM
To: users@daffodil.apache.org
Subject: RE: Parsing 5 million lines of input is taking 4 minutes - too slow!

 

Thanks John and Don.

 

Speeding up parsing by outputting EXI is an interesting idea. I guess the 
theory behind it is: If it takes a long time to output a huge text XML file, 
then instead output a much smaller binary EXI file. Interesting idea!

 

Okay, I gave it a go. Here are the results:

 

Parsing to a huge text XML file took 226 seconds.

Parsing to a binary EXI file took 206 seconds.

 

Parsing to EXI was a little faster. Not a big difference.

 

The difference in the size of the output was remarkable:

 

The size of the text XML file is 4 GB.

The size of the EXI file is 287 MB.
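
To put the figures above side by side, here is a small arithmetic sketch in Python. It uses only the numbers reported in this message; treating "4 GB" as roughly 4000 MB (decimal units) is an assumption.

```python
# Figures reported in the message above.
xml_seconds = 226        # parsing to a text XML file
exi_seconds = 206        # parsing to a binary EXI file
xml_size_mb = 4 * 1000   # "4 GB", assumed to mean about 4000 MB
exi_size_mb = 287

# Relative time saved by writing EXI instead of text XML.
time_saved_pct = (xml_seconds - exi_seconds) / xml_seconds * 100

# How many times larger the XML output is than the EXI output.
size_ratio = xml_size_mb / exi_size_mb

print(f"Time saved: {time_saved_pct:.1f}%")              # Time saved: 8.8%
print(f"XML is {size_ratio:.1f}x larger than EXI")       # XML is 13.9x larger than EXI
```

So the output shrinks by roughly a factor of 14 while the run time drops by under 10%.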

 

/Roger

 

From: Interrante, John A (GE Aerospace, US) <john.interra...@ge.com>
Sent: Tuesday, December 26, 2023 5:08 PM
To: users@daffodil.apache.org
Subject: [EXT] RE: Parsing 5 million lines of input is taking 4 minutes - too 
slow!

 

Hi Roger,

 

If you are using Daffodil 3.4.0 or later (3.6.0 would be better), then the 
Daffodil CLI has two new infoset types (-I exi and -I exisa) which output 
non-schema-aware EXI infosets and schema-aware EXI infosets, respectively. EXI 
(binary XML) infosets are significantly smaller than normal XML infosets, and 
when made schema aware they are often even smaller than the original data.  
Daffodil uses the EXIficient library to support these infoset types.

 

Try adding -I exi or -I exisa to your “daffodil parse” command to see how much 
it can speed up your parsing.  Instead of a 4GB XML infoset file, you should 
end up with a much smaller EXI infoset file, which can be used in place of the 
XML infoset depending on how the end-user application reads the infoset (that 
application also needs to use an EXI library).

 

John

 

From: Roger L Costello <coste...@mitre.org>
Sent: Tuesday, December 26, 2023 11:04 AM
To: users@daffodil.apache.org
Subject: EXT: Parsing 5 million lines of input is taking 4 minutes - too slow!

 

Hi Folks,

 

My input file contains 5 million 132-character records. 

 

I have done everything that I can think of to make the parsing faster:

 

1.      I precompiled the schema and used it to do the parsing
2.      I set Java -Xmx40960m
3.      I used a bunch of dfdl:choiceDispatchKey to divide-and-conquer

 

And yet it still takes 4 minutes before the (4 GB) XML file is produced. 
Waiting 4 minutes is not acceptable for my clients. 
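
For a sense of scale, the numbers above can be turned into a throughput estimate. This is a back-of-the-envelope sketch: it assumes one byte per character, ignores line terminators, and takes "4 minutes" as exactly 240 seconds.

```python
# Figures from the message above.
records = 5_000_000
chars_per_record = 132
elapsed_seconds = 4 * 60   # "4 minutes", assumed to be exactly 240 s

# Assumption: one byte per character, no line terminators counted.
input_mb = records * chars_per_record / 1_000_000

records_per_second = records / elapsed_seconds
input_mb_per_second = input_mb / elapsed_seconds

print(f"Input size: {input_mb:.0f} MB")                      # Input size: 660 MB
print(f"Records per second: {records_per_second:,.0f}")      # Records per second: 20,833
print(f"Input throughput: {input_mb_per_second:.2f} MB/s")   # Input throughput: 2.75 MB/s
```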

 

A couple of questions:

 

1.      Is there anything else that I can do to speed things up?
2.      I believe there is time needed to do the parsing and generate an 
in-memory parse tree, and there is time needed to serialize the in-memory parse 
tree to an XML file. Is there a way to find those two times? I suspect the 
former is a lot quicker than the latter.
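
The idea in question 2, timing the tree-building phase separately from the serialization phase, can be sketched in Python. This is not Daffodil and does not use its API: the standard-library ElementTree stands in for the parse tree, and the record contents, element names, and record count are made up for illustration.

```python
import io
import time
import xml.etree.ElementTree as ET

def build_tree(lines):
    """Phase 1: turn input lines into an in-memory tree."""
    root = ET.Element("records")
    for line in lines:
        ET.SubElement(root, "record").text = line
    return root

def serialize(root):
    """Phase 2: write the in-memory tree out as XML bytes."""
    buf = io.BytesIO()
    ET.ElementTree(root).write(buf, encoding="utf-8")
    return buf.getvalue()

# Made-up input: 10,000 records of 132 characters each.
lines = ["A" * 132 for _ in range(10_000)]

t0 = time.perf_counter()
tree = build_tree(lines)
t1 = time.perf_counter()
data = serialize(tree)
t2 = time.perf_counter()

print(f"build: {t1 - t0:.4f} s, serialize: {t2 - t1:.4f} s, bytes: {len(data)}")
```

The timings from this stand-in say nothing about Daffodil itself; the sketch only shows how the two phases can be bracketed with a timer.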

 

/Roger
