K. I watched Tika.parseToString(InputStream) in a debugger.
I'm puzzled about these lines in our OOXMLExtractorFactory:
} finally {
    if (tmpRepairedCopy != null) {
        if (pkg != null) {
            pkg.revert();
        }
If a user calls OOXMLExtractorFactory on a TikaInputStream, we load
the pkg into the OpenContainer on the TikaInputStream, and it gets
closed as a by-product of closing the TikaInputStream. But I'm
wondering now why we don't close the pkg on a regular InputStream if
it is not repaired. My guess is that closing the pkg would force a
close on the underlying zip InputStream?
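
To make that guess concrete, here is a minimal JDK-only sketch. It is an
illustration under assumptions, not the Tika or POI code path:
java.util.zip.ZipInputStream stands in for the POI/commons-compress zip
stream, and TrackingStream, CloseShield and tinyZip are hypothetical helpers
invented for the example. It checks whether closing a zip stream also closes
the stream the caller handed in, and whether a close-shielding wrapper
prevents that.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class CloseShieldSketch {

    /** Hypothetical helper: records whether close() reached the caller's stream. */
    static class TrackingStream extends FilterInputStream {
        boolean closed = false;

        TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical helper: passes reads through but swallows close(). */
    static class CloseShield extends FilterInputStream {
        CloseShield(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally does not close the wrapped stream.
        }
    }

    /** Builds a one-entry zip in memory so the example needs no files. */
    static byte[] tinyZip() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("word/document.xml"));
            zos.write("<doc/>".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    /** Closes the zip stream and reports whether the caller's stream was closed too. */
    public static boolean callerStreamClosedAfterZipClose(boolean shield) throws IOException {
        TrackingStream caller = new TrackingStream(new ByteArrayInputStream(tinyZip()));
        InputStream source = shield ? new CloseShield(caller) : caller;
        ZipInputStream zip = new ZipInputStream(source);
        while (zip.getNextEntry() != null) {
            // Skip entry contents; only the close behaviour matters here.
        }
        zip.close(); // ends the zip stream's own Inflater and closes its source
        return caller.closed;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unshielded: " + callerStreamClosedAfterZipClose(false));
        System.out.println("shielded:   " + callerStreamClosedAfterZipClose(true));
    }
}
```

With the plain JDK classes, the unshielded close propagates to the caller's
stream and the shielded one does not. commons-io ships a
CloseShieldInputStream that plays the same role as the hand-written
CloseShield here. Whether the POI package close behaves the same way is
exactly the open question above.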
On Tue, Mar 21, 2023 at 11:25 AM Tim Allison <[email protected]> wrote:
>
> Doh. You're right. I should have read our documentation on parseToString():
>
> <strong>NOTE:</strong> Unlike most other Tika methods that take an
> * {@link InputStream}, this method will close the given stream for
> * you as a convenience.
>
> I thought that you had cleared up the slow-building OOM, which was
> caused by a colleague using JNI? Are you still having problems, or are
> you just curious about the 3 openings and 2 closings? Let me break out
> the debugger and take a look.
>
> On Tue, Mar 21, 2023 at 7:49 AM Darren <[email protected]> wrote:
> >
> >
> > public static ZipArchiveThresholdInputStream openZipStream(InputStream stream)
> >         throws IOException {
> >     // Peek at the first few bytes to sanity check
> >     InputStream checkedStream = FileMagic.prepareToCheckMagic(stream);
> >     verifyZipHeader(checkedStream);
> >
> >     // Open as a proper zip stream
> >     return new ZipArchiveThresholdInputStream(
> >             new ZipArchiveInputStream(checkedStream));
> > }
> >
> >
> > When calling parseToString, the code above runs once. After that, one
> > ZipArchiveInputStream is constructed and data is fed into
> > org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#inf, but
> > org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#close is
> > called zero times.
> >
> > Shouldn't it be closed? Please help me. Thank you.
> >
> >
> > On Sun, Mar 19, 2023 at 22:41, Tim Allison <[email protected]> wrote:
> >>
> >> > it called ZipArchiveInputStream constructor three times(two for
> >> > mediatype, one for parse), but only two times calling
> >> > java.util.zip.Inflater#end()?
> >>
> >> Wait, are you calling close on your BufferedInputStream?
> >>
> >> On Sat, Mar 18, 2023 at 9:36 PM Darren <[email protected]> wrote:
> >> >
> >> > Thank you for replying on the weekend, Tim!
> >> >
> >> > In my program, both methods (detect and parseToString) are called one
> >> > after the other to get the media type and the plain text.
> >> > We test millions of file samples every day, and we notice that the Java
> >> > heap is normal but off-heap memory keeps increasing until the Java
> >> > process is killed by the Linux oom-killer.
> >> > My own program does not use off-heap memory through native code.
> >> >
> >> > Last night I tested only the detect method to get the media type, and
> >> > everything seemed normal. I will test parseToString later.
> >> >
> >> > I will also try your suggestion and test the program again. Thanks!
> >> >
> >> >
> >> > On Sat, Mar 18, 2023 at 19:46, Tim Allison <[email protected]> wrote:
> >> >>
> >> >> Do you get the off heap problem only on parseToString and not on detect?
> >> >>
> >> >> Not part of your question, but I'd recommend using
> >> >> TikaInputStream.get(file, metadata). It is far more efficient for
> >> >> zip-based files as well as PDFs and other parsers that require random
> >> >> access.
> >> >>
> >> >> On Sat, Mar 18, 2023 at 1:24 AM 朱桂锋 <[email protected]> wrote:
> >> >> >
> >> >> > Firstly, thank you for Tika; it is a great project!
> >> >> >
> >> >> > Recently I ran Tika (version 2.7.0) to extract text from
> >> >> > documents, and I found that Java off-heap memory keeps increasing
> >> >> > until memory usage reaches 100%, and then the process is killed by
> >> >> > the oom-killer.
> >> >> >
> >> >> > I then used pmap and dumped data from memory (excluding the Java
> >> >> > heap), and found content like this:
> >> >> >
> >> >> > [ Content
> >> >> >
> >> >> > Types] . xM1PK
> >> >> >
> >> >> > rels/.relsPK word/ rels/document.xm1.relsPK word /document.xm1PK
> >> >> > word/footer4.xmIPK word/header4. xm1PK word/footer2.xmIPK
> >> >> > word/header2. xm1PK word /header3.xmIPK word/footer3.xmlPK word
> >> >> > /header1.xm1PK
> >> >> >
> >> >> > word/ footer1 . xm1PK
> >> >> >
> >> >> > word / footnotes.xmlPK word/endnotes .xm1PK word/header5. xm1PK
> >> >> > word/media/ image3.pngPK word/media/imagel. jpegPK word/media/image2.
> >> >> > jpegPK word / theme/ theme 1. xm1PK word/settings. xm1PK
> >> >> >
> >> >> > customxml/ itemProps2 .xm1PK
> >> >> >
> >> >> > customXml /item2 . xm1PK docProps /custom. xm1 PK t?92
> >> >> > customXml/rels/item1.xm1.relsPK customXml/ rels/item2.xm1.relsPK
> >> >> > customXm1 /itemProps1.xm1PK
> >> >> >
> >> >> >
> >> >> >
> >> >> > This is Office document content, so why is it off-heap? I suspect
> >> >> > that parsing Office documents causes a memory leak.
> >> >> >
> >> >> > More information: when I debug the code on my own Mac, using an
> >> >> > xlsx file as the input sample,
> >> >> > tika.detect calls the ZipArchiveInputStream constructor twice and
> >> >> > calls java.util.zip.Inflater#end() the same number of times;
> >> >> > but tika.parseToString calls the ZipArchiveInputStream
> >> >> > constructor three times (two for the media type, one for the
> >> >> > parse) and calls java.util.zip.Inflater#end() only twice.
> >> >> >
> >> >> > Could that cause the off-heap memory leak, since Inflater uses
> >> >> > native code?
> >> >> >
> >> >> > I look forward to your reply. Thank you very much!
> >> >> >
> >> >> > My test code:
> >> >> >
> >> >> > public static void extractByFacade(File file) throws Exception {
> >> >> >     Tika tika = new Tika();
> >> >> >     tika.setMaxStringLength(240);
> >> >> >
> >> >> >     org.apache.poi.util.IOUtils.setByteArrayMaxOverride(200000000);
> >> >> >
> >> >> >     final BufferedInputStream buffer =
> >> >> >             new BufferedInputStream(new FileInputStream(file));
> >> >> >     final String mediaType = tika.detect(buffer, file.getName());
> >> >> >     // System.out.println("mediaType->" + mediaType);
> >> >> >
> >> >> >     final String content = tika.parseToString(buffer);
> >> >> >     // System.out.println("extractByFacade>>>>>>>>>>>>>>");
> >> >> >     // System.out.println(content + " " + content.length());
> >> >> > }
> >> >> >
> >> >> >
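
The constructor-versus-Inflater#end() counting described in the original
report can be reproduced with the JDK alone. The sketch below is only an
illustration under assumptions: CountingInflater and deflate are hypothetical
helpers, and java.util.zip.InflaterInputStream stands in for whichever stream
owns the Inflater in the real code path. It shows that an Inflater supplied
by the caller is not ended when the wrapping stream is closed, so its
resources are only released promptly if end() is called explicitly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

public class InflaterEndSketch {

    /** Hypothetical helper: counts end() calls, like a breakpoint on Inflater#end. */
    static class CountingInflater extends Inflater {
        int endCalls = 0;

        @Override
        public void end() {
            endCalls++;
            super.end();
        }
    }

    /** Compresses a small string so there is something to inflate. */
    static byte[] deflate(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    /** Returns the end() call counts: [after closing the stream, after an explicit end()]. */
    public static int[] endCallsForSuppliedInflater() throws IOException {
        CountingInflater inflater = new CountingInflater();
        try (InflaterInputStream in = new InflaterInputStream(
                new ByteArrayInputStream(deflate("hello")), inflater)) {
            while (in.read() != -1) {
                // Drain the stream; the decompressed bytes are not needed here.
            }
        }
        int afterClose = inflater.endCalls; // close() leaves a supplied Inflater alone
        inflater.end();                     // the owner has to release it explicitly
        return new int[] {afterClose, inflater.endCalls};
    }

    public static void main(String[] args) throws IOException {
        int[] counts = endCallsForSuppliedInflater();
        System.out.println("end() calls after close: " + counts[0]);
        System.out.println("end() calls after explicit end(): " + counts[1]);
    }
}
```

This says nothing by itself about where the third end() call goes missing in
Tika; it only illustrates why an Inflater that is never ended is a plausible
place to look when off-heap memory grows.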