public static ZipArchiveThresholdInputStream openZipStream(InputStream stream) throws IOException {
    // Peek at the first few bytes to sanity check
    InputStream checkedStream = FileMagic.prepareToCheckMagic(stream);
    verifyZipHeader(checkedStream);
    // Open as a proper zip stream
    return new ZipArchiveThresholdInputStream(new ZipArchiveInputStream(checkedStream));
}
When calling parseToString, the code above runs *one time*: it constructs the ZipArchiveInputStream *one time* and data ends up in org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#inf, but org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#close is called *zero times*. Shouldn't we close it? Please help me. Thank you
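
For reference, here is a minimal sketch using plain commons-compress (the ZipCloseSketch class and the path argument are made up for illustration) showing that it is ZipArchiveInputStream#close that ends the internal Inflater and releases its native zlib memory:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;

public class ZipCloseSketch {
    public static void listEntries(String path) throws IOException {
        // try-with-resources guarantees ZipArchiveInputStream#close runs;
        // close() ends the internal Inflater, releasing its native zlib buffers
        try (ZipArchiveInputStream zis = new ZipArchiveInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            ZipArchiveEntry entry;
            while ((entry = zis.getNextZipEntry()) != null) {
                System.out.println(entry.getName());
            }
        }
    }
}

If the ZipArchiveInputStream is only wrapped and never closed, its Inflater is never ended either.
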
Tim Allison <[email protected]> 于2023年3月19日周日 22:41写道:
> > it called the ZipArchiveInputStream constructor three times (two for the media type, one for the parse), but java.util.zip.Inflater#end() was only called two times?
>
> Wait, are you calling close on your BufferedInputStream?
>
> On Sat, Mar 18, 2023 at 9:36 PM Darren <[email protected]> wrote:
> >
> > Thank you for your reply on the weekend, Tim!
> >
> > In my program, both methods (detect and parseToString) are used one after another to get the media type and the plain text.
> > We process millions of file samples every day, and we notice that the Java heap stays normal but off-heap memory keeps growing until the Java process is killed by the Linux oom-killer.
> > My program itself does not use off-heap memory through native code.
> >
> > Last night I tested only the detect method to get the media type, and everything seems normal. Later I will test parseToString.
> >
> > And I will try your suggestion and test the program again. Thanks!
> >
> >
> > Tim Allison <[email protected]> wrote on Sat, Mar 18, 2023 at 19:46:
> >>
> >> Do you get the off heap problem only on parseToString and not on detect?
> >>
> >> Not part of your question, but I'd recommend using
> >> TikaInputStream.get(file, metadata). It is far more efficient for
> >> zip-based files as well as PDFs and other parsers that require random
> >> access.
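> >>
> >> A minimal sketch of what that could look like with the Tika facade (the detectAndExtract method name and the Metadata handling here are only an illustration, using the Path overload of TikaInputStream.get):
> >>
> >> import java.nio.file.Path;
> >> import org.apache.tika.Tika;
> >> import org.apache.tika.io.TikaInputStream;
> >> import org.apache.tika.metadata.Metadata;
> >> import org.apache.tika.metadata.TikaCoreProperties;
> >>
> >> public static String detectAndExtract(Path path) throws Exception {
> >>     Tika tika = new Tika();
> >>     Metadata metadata = new Metadata();
> >>     metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, path.getFileName().toString());
> >>     // TikaInputStream is backed by the file, so zip-based formats and PDFs
> >>     // get random access instead of being buffered entirely in memory
> >>     try (TikaInputStream tis = TikaInputStream.get(path, metadata)) {
> >>         String mediaType = tika.detect(tis, metadata);
> >>         System.out.println("mediaType->" + mediaType);
> >>         return tika.parseToString(tis, metadata);
> >>     }
> >> }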
> >>
> >> On Sat, Mar 18, 2023 at 1:24 AM 朱桂锋 <[email protected]> wrote:
> >> >
> >> > First of all, thank you for Tika, it is a great project!
> >> >
> >> > Recently I have been running the Tika project (version 2.7.0) to extract text from documents, and I find that Java off-heap memory keeps increasing until memory usage reaches 100%, and then the process is killed by the oom-killer.
> >> >
> >> > Then I used pmap and dumped the data from memory (excluding the Java heap), and the contents look like this:
> >> >
> >> > [Content_Types].xml PK  _rels/.rels PK  word/_rels/document.xml.rels PK
> >> > word/document.xml PK  word/footer4.xml PK  word/header4.xml PK
> >> > word/footer2.xml PK  word/header2.xml PK  word/header3.xml PK
> >> > word/footer3.xml PK  word/header1.xml PK  word/footer1.xml PK
> >> > word/footnotes.xml PK  word/endnotes.xml PK  word/header5.xml PK
> >> > word/media/image3.png PK  word/media/image1.jpeg PK  word/media/image2.jpeg PK
> >> > word/theme/theme1.xml PK  word/settings.xml PK
> >> > customXml/itemProps2.xml PK  customXml/item2.xml PK  docProps/custom.xml PK
> >> > customXml/_rels/item1.xml.rels PK  customXml/_rels/item2.xml.rels PK
> >> > customXml/itemProps1.xml PK
> >> >
> >> >
> >> >
> >> > This is Office document content, so why is it in off-heap memory? So I suspect that parsing Office documents causes a memory leak.
> >> >
> >> > Another piece of information: when I debug the code on my own Mac, using an xlsx file as the input sample,
> >> > when calling tika.detect, the ZipArchiveInputStream constructor is called twice and java.util.zip.Inflater#end() is called the same number of times;
> >> > but when calling tika.parseToString, the ZipArchiveInputStream constructor is called three times (two for the media type, one for the parse), yet java.util.zip.Inflater#end() is only called two times?
> >> >
> >> > Could that be the cause of the off-heap memory leak, since Inflater uses native code?
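> >> >
> >> > For illustration, a minimal standalone sketch of the Inflater lifecycle using only the JDK classes (the InflaterLifecycle class is made up for this example):
> >> >
> >> > import java.util.zip.Deflater;
> >> > import java.util.zip.Inflater;
> >> >
> >> > public class InflaterLifecycle {
> >> >     public static void main(String[] args) throws Exception {
> >> >         // compress a small payload so there is something to inflate
> >> >         Deflater def = new Deflater();
> >> >         def.setInput("hello".getBytes("UTF-8"));
> >> >         def.finish();
> >> >         byte[] compressed = new byte[128];
> >> >         int clen = def.deflate(compressed);
> >> >         def.end();
> >> >
> >> >         // each Inflater allocates native (off-heap) zlib buffers
> >> >         Inflater inf = new Inflater();
> >> >         inf.setInput(compressed, 0, clen);
> >> >         byte[] out = new byte[64];
> >> >         int n = inf.inflate(out);
> >> >         System.out.println(new String(out, 0, n, "UTF-8"));
> >> >
> >> >         // without end(), the native buffers are only released once the
> >> >         // Inflater is garbage collected, which can lag far behind allocation
> >> >         inf.end();
> >> >     }
> >> > }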
> >> >
> >> > Looking forward to your reply! Thank you very much!
> >> >
> >> > my test code:
> >> >
> >> > public static void extractByFacade(File file) throws Exception {
> >> >     Tika tika = new Tika();
> >> >     tika.setMaxStringLength(240);
> >> >     org.apache.poi.util.IOUtils.setByteArrayMaxOverride(200000000);
> >> >
> >> >     final BufferedInputStream buffer = new BufferedInputStream(new FileInputStream(file));
> >> >     final String mediaType = tika.detect(buffer, file.getName());
> >> >     // System.out.println("mediaType->" + mediaType);
> >> >
> >> >     final String content = tika.parseToString(buffer);
> >> >     // System.out.println("extractByFacade>>>>>>>>>>>>>>");
> >> >     // System.out.println(content + " " + content.length());
> >> > }
> >> >
> >> >
>