Thank you for replying over the weekend, Tim! In my program, both methods (detect and parseToString) are called one after another to get the media type and the plain text. We test millions of file samples every day, and we notice that the Java heap stays normal while off-heap memory keeps growing until the Java process is killed by the Linux oom-killer. This is puzzling, because my program itself does not use off-heap memory through native code.
Last night I tested only the detect method to get the media type, and everything seemed normal. Next I will test parseToString. I will also try your suggestion and test the program again. Thanks!

Tim Allison <[email protected]> wrote on Sat, Mar 18, 2023 at 19:46:

> Do you get the off heap problem only on parseToString and not on detect?
>
> Not part of your question, but I'd recommend using
> TikaInputStream.get(file, metadata). It is far more efficient for
> zip-based files as well as PDFs and other parsers that require random
> access.
>
> On Sat, Mar 18, 2023 at 1:24 AM 朱桂锋 <[email protected]> wrote:
> >
> > First of all, thank you for Tika, it is a great project!
> >
> > Recently I have been running Tika (version 2.7.0) to extract text from
> > documents. I find that Java off-heap memory keeps increasing until
> > memory usage reaches 100%, and then the process is killed by the
> > oom-killer.
> >
> > Then I used pmap and dumped data from memory (excluding the Java heap),
> > and it looks like this:
> >
> > [ Content
> >
> > Types] . xM1PK
> >
> > rels/.relsPK word/ rels/document.xm1.relsPK word /document.xm1PK
> > word/footer4.xmIPK word/header4. xm1PK word/footer2.xmIPK word/header2.
> > xm1PK word /header3.xmIPK word/footer3.xmlPK word /header1.xm1PK
> >
> > word/ footer1 . xm1PK
> >
> > word / footnotes.xmlPK word/endnotes .xm1PK word/header5. xm1PK
> > word/media/ image3.pngPK word/media/imagel. jpegPK word/media/image2.
> > jpegPK word / theme/ theme 1. xm1PK word/settings. xm1PK
> >
> > customxml/ itemProps2 .xm1PK
> >
> > customXml /item2 . xm1PK docProps /custom. xm1 PK t?92
> > customXml/rels/item1.xm1.relsPK customXml/ rels/item2.xm1.relsPK customXm1
> > /itemProps1.xm1PK
> >
> > This is office document text. Why is it in off-heap memory? I suspect
> > that parsing office documents causes a memory leak.
> >
> > Another piece of information: when I debug the code on my own Mac,
> > using an xlsx file as the input sample, a call to tika.detect invokes
> > the ZipArchiveInputStream constructor twice and calls
> > java.util.zip.Inflater#end() the same number of times.
> > But a call to tika.parseToString invokes the ZipArchiveInputStream
> > constructor three times (two for the media type, one for parsing), yet
> > java.util.zip.Inflater#end() is called only twice.
> >
> > Could the off-heap memory leak be caused by this, given that Inflater
> > uses native code?
> >
> > I look forward to your reply. Thank you very much!
> >
> > My test code:
> >
> > public static void extractByFacade(File file) throws Exception {
> >     Tika tika = new Tika();
> >     tika.setMaxStringLength(240);
> >     org.apache.poi.util.IOUtils.setByteArrayMaxOverride(200000000);
> >
> >     final BufferedInputStream buffer = new BufferedInputStream(new FileInputStream(file));
> >     final String mediaType = tika.detect(buffer, file.getName());
> >     // System.out.println("mediaType->" + mediaType);
> >
> >     final String content = tika.parseToString(buffer);
> >     // System.out.println("extractByFacade>>>>>>>>>>>>>>");
> >     // System.out.println(content + " " + content.length());
> > }
> >
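
On the Inflater question above: java.util.zip.Inflater keeps its zlib state in native memory, and that memory is released when end() is called (or later, when the object is garbage collected). Below is a minimal standard-library sketch of that pattern. It is not Tika's or Commons Compress's actual code; the class name InflaterEndSketch and its helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration class; the name and methods are assumptions,
// not part of Tika or Commons Compress.
public class InflaterEndSketch {

    // Compress a byte array with zlib; end() frees the Deflater's native state.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(chunk);
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    // Decompress zlib data; the finally block guarantees that end() runs
    // even if inflate() throws, so the native buffers are released promptly.
    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    // Avoid spinning forever on truncated or dictionary-dependent data.
                    throw new DataFormatException("incomplete or unsupported input");
                }
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] original = "word/document.xml".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = decompress(compress(original));
        System.out.println("round trip ok: " + Arrays.equals(original, roundTrip));
    }
}
```

If end() is skipped, the native buffers stay allocated until the Inflater is collected, which can take a long time when the Java heap is under little pressure. Whether that is what happens in the parseToString path here is still unconfirmed.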
