Re: tika-offheap-memory-leak

Tim Allison Sun, 19 Mar 2023 07:41:30 -0700

> it called the ZipArchiveInputStream constructor three times (two for media type,
> one for parse), but called java.util.zip.Inflater#end() only two times?


Wait, are you calling close on your BufferedInputStream?
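
To make the question concrete, here is a minimal sketch of the pattern I have in mind. It uses only the JDK, not Tika, and the names CloseCheck and TrackingStream are hypothetical, made up for this illustration. It shows that try-with-resources closes the BufferedInputStream, and that closing the buffer also closes the stream it wraps:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseCheck {

    /** Hypothetical helper: wraps a stream and records whether close() was called. */
    public static final class TrackingStream extends FilterInputStream {
        boolean closed = false;

        TrackingStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Reads the data inside try-with-resources and reports whether the source was closed. */
    public static boolean closedAfterUse(byte[] data) throws IOException {
        TrackingStream source = new TrackingStream(new ByteArrayInputStream(data));
        // try-with-resources calls buffer.close() even if reading throws,
        // and BufferedInputStream.close() closes the wrapped stream as well.
        try (InputStream buffer = new BufferedInputStream(source)) {
            while (buffer.read() != -1) {
                // drain the stream; a real program would detect/parse here
            }
        }
        return source.closed;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(closedAfterUse(new byte[] {1, 2, 3}));
    }
}
```

In the test code quoted below, the equivalent change would be to open the BufferedInputStream in a try-with-resources block around the detect and parseToString calls. Whether that accounts for the missing Inflater#end() call is an assumption that still needs testing.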

On Sat, Mar 18, 2023 at 9:36 PM Darren <[email protected]> wrote:
>
> Thank you for your reply on the weekend, Tim!
>
> In my program, both methods (detect and parseToString) are called one after
> another to get the media type and the plain text.
> We test millions of file samples every day, and notice that the Java heap is normal
> but off-heap memory keeps increasing until the Java process is killed by the Linux oom-killer.
> My own program does not use off-heap memory through native code.
>
> Last night I tested only the detect method to get the media type, and everything seemed
> normal. I will test parseToString later.
>
> I will also try your suggestion and test the program again. Thanks!
>
>
> Tim Allison <[email protected]> wrote on Sat, Mar 18, 2023 at 19:46:
>>
>> Do you get the off heap problem only on parseToString and not on detect?
>>
>> Not part of your question, but I'd recommend using
>> TikaInputStream.get(file, metadata).  It is far more efficient for
>> zip-based files as well as PDFs and other parsers that require random
>> access.
>>
>> On Sat, Mar 18, 2023 at 1:24 AM 朱桂锋 <[email protected]> wrote:
>> >
>> > First of all, thank you for Tika, it is a great project!
>> >
>> > Recently I ran Tika (version 2.7.0) to extract text from
>> > documents, and I found that Java off-heap memory keeps increasing until memory
>> > usage reaches 100%, and then the process is killed by the oom-killer.
>> >
>> > I then used pmap and dumped data from memory (excluding the Java heap), and
>> > found content like this:
>> >
>> > [Content_Types].xmlPK
>> >
>> > _rels/.relsPK word/_rels/document.xml.relsPK word/document.xmlPK
>> > word/footer4.xmlPK word/header4.xmlPK word/footer2.xmlPK word/header2.xmlPK
>> > word/header3.xmlPK word/footer3.xmlPK word/header1.xmlPK
>> >
>> > word/footer1.xmlPK
>> >
>> > word/footnotes.xmlPK word/endnotes.xmlPK word/header5.xmlPK
>> > word/media/image3.pngPK word/media/image1.jpegPK word/media/image2.jpegPK
>> > word/theme/theme1.xmlPK word/settings.xmlPK
>> >
>> > customXml/itemProps2.xmlPK
>> >
>> > customXml/item2.xmlPK docProps/custom.xmlPK t?92
>> > customXml/_rels/item1.xml.relsPK customXml/_rels/item2.xml.relsPK
>> > customXml/itemProps1.xmlPK
>> >
>> >
>> >
>> > This is Office document content, so why is it in off-heap memory? I suspect that
>> > parsing Office documents causes a memory leak.
>> >
>> > Some more information: when I debug the code on my own Mac, using an xlsx
>> > file as the input sample,
>> > tika.detect calls the ZipArchiveInputStream constructor
>> > twice and calls java.util.zip.Inflater#end() the same number of times;
>> > but tika.parseToString calls the ZipArchiveInputStream
>> > constructor three times (two for media type, one for parse), yet calls
>> > java.util.zip.Inflater#end() only two times.
>> >
>> > Is the off-heap memory leak caused by the Inflater using native
>> > code?
>> >
>> > I look forward to your reply! Thank you very much!
>> >
>> > My test code:
>> >
>> >     public static void extractByFacade(File file) throws Exception {
>> >         Tika tika = new Tika();
>> >         tika.setMaxStringLength(240);
>> >         org.apache.poi.util.IOUtils.setByteArrayMaxOverride(200000000);
>> >
>> >         final BufferedInputStream buffer = new BufferedInputStream(new FileInputStream(file));
>> >         final String mediaType = tika.detect(buffer, file.getName());
>> > //        System.out.println("mediaType->" + mediaType);
>> >
>> >         final String content = tika.parseToString(buffer);
>> > //        System.out.println("extractByFacade>>>>>>>>>>>>>>");
>> > //        System.out.println(content + "  " + content.length());
>> >     }
>> >
>> >
