Ha! I just ran into this with .docx as well [1][2]. Given that we only need
to extract content in Tika, the experimental SAX parser [3] is far more
efficient, especially on very large documents. GC overhead from our current
DOM parsing was killing performance, especially when multithreaded.
There are portions of the docx SAX parser that I think will fit well within POI
as a parallel to XSSF's eventusermodel. I hope to submit a patch for review
sometime next week (or, in open-source time, January?)...
Cheers,
Tim
[1] https://issues.apache.org/jira/browse/TIKA-1321
[2] https://issues.apache.org/jira/browse/TIKA-2180
[3] Admittedly, the experimental SAX parser doesn't include all of the features
that our current DOM parser does! More work remains...
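For anyone following along who hasn't worked with both models: the memory win
comes from SAX streaming parse events instead of materializing the whole tree
the way a DOM parser does. A minimal, stdlib-only sketch of the idea (this is
illustrative only, not the POI/Tika code; the class and element names are made
up for the example):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxCellCount {
    // Stream the XML and count <c> elements as the events go by.
    // Nothing is retained per element, so memory stays flat no matter
    // how large the document is -- unlike DOM, which holds the full tree.
    static int countCells(String xml) throws Exception {
        final int[] cells = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    if ("c".equals(qName)) cells[0]++;
                }
            });
        return cells[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<sheet><row><c>1</c><c>2</c></row><row><c>3</c></row></sheet>";
        System.out.println("cells=" + countCells(xml));  // prints cells=3
    }
}
```

The real parsers are much more involved, of course, but this is the shape of
the trade-off: a DOM parse of the same document allocates a node per element,
which is what drives the GC pressure mentioned above.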
-----Original Message-----
From: Javen O'Neal [mailto:[email protected]]
Sent: Friday, December 2, 2016 2:21 PM
To: POI Users List <[email protected]>
Subject: Re: Too much memory is used when reading a xlsx-file whose size is
just 7.3M
Those numbers sound about right. I'm used to seeing 4 MB balloon to 1 GB.
We could significantly reduce memory consumption if we didn't maintain the XML
DOM in memory, but replacing that requires thousands of hours of work.