Do you get the off-heap problem only on parseToString and not on detect?

Not part of your question, but I'd recommend using TikaInputStream.get(file, metadata). It is far more efficient for zip-based files, as well as for PDFs and other formats whose parsers require random access.
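To illustrate the random-access point with nothing but the JDK (this is not Tika code, and the class and method names are made up for the sketch): a zip's central directory sits at the end of the file, so a file-backed reader can jump straight to one entry, while a stream-backed reader has to walk the entries in order until it reaches the one it wants. My understanding is that handing Tika a file lets zip-based parsers take the first route, but treat that as an assumption rather than a statement about the internals.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipAccessSketch {

    // Build a small zip with three entries so both readers have something to do.
    static Path writeSampleZip() throws IOException {
        Path zip = Files.createTempFile("sketch", ".zip");
        try (OutputStream out = Files.newOutputStream(zip);
             ZipOutputStream zos = new ZipOutputStream(out)) {
            for (String name : new String[] {"a.xml", "b.xml", "c.xml"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write(("<" + name + "/>").getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return zip;
    }

    // File-backed: look the entry up in the central directory and read only it.
    static String readViaFile(Path zip, String name) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile());
             InputStream in = zf.getInputStream(zf.getEntry(name))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Stream-backed: walk the entries in order until the wanted one turns up.
    static String readViaStream(Path zip, String name) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(Files.newInputStream(zip))) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.getName().equals(name)) {
                    return new String(zis.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return null; // entry not present
    }

    public static void main(String[] args) throws IOException {
        Path zip = writeSampleZip();
        System.out.println(readViaFile(zip, "c.xml"));
        System.out.println(readViaStream(zip, "c.xml"));
        Files.delete(zip);
    }
}
```

Both calls return the same text; the difference is only in how much of the archive has to be read to get there, which grows with the size of real documents.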
On Sat, Mar 18, 2023 at 1:24 AM 朱桂锋 <[email protected]> wrote:
>
> Firstly, thank you for Tika, it is a great project!
>
> Recently I ran Tika (version 2.7.0) to extract text from documents. I found
> that Java off-heap memory keeps increasing until memory usage reaches 100%,
> and then the process is killed by the oom-killer.
>
> I then used pmap and dumped the data from memory (excluding the Java heap),
> and found content like this:
>
> [Content_Types].xmlPK _rels/.relsPK word/_rels/document.xml.relsPK
> word/document.xmlPK word/footer4.xmlPK word/header4.xmlPK word/footer2.xmlPK
> word/header2.xmlPK word/header3.xmlPK word/footer3.xmlPK word/header1.xmlPK
> word/footer1.xmlPK word/footnotes.xmlPK word/endnotes.xmlPK
> word/header5.xmlPK word/media/image3.pngPK word/media/image1.jpegPK
> word/media/image2.jpegPK word/theme/theme1.xmlPK word/settings.xmlPK
> customXml/itemProps2.xmlPK customXml/item2.xmlPK docProps/custom.xmlPK t?92
> customXml/_rels/item1.xml.relsPK customXml/_rels/item2.xml.relsPK
> customXml/itemProps1.xmlPK
>
> This is Office document text. Why is it off-heap? I suspect that parsing
> Office documents causes a memory leak.
>
> Another piece of information: when I debug the code on my own Mac, using an
> xlsx file as the sample input, tika.detect calls the ZipArchiveInputStream
> constructor twice and calls java.util.zip.Inflater#end() the same number of
> times. But tika.parseToString calls the ZipArchiveInputStream constructor
> three times (two for the media type, one for the parse) and calls
> java.util.zip.Inflater#end() only twice.
>
> Is that what causes the off-heap memory leak, since Inflater uses native
> code?
>
> I look forward to your reply. Thank you very much!
>
> My test code:
>
> public static void extractByFacade(File file) throws Exception {
>     Tika tika = new Tika();
>     tika.setMaxStringLength(240);
>     org.apache.poi.util.IOUtils.setByteArrayMaxOverride(200000000);
>
>     final BufferedInputStream buffer = new BufferedInputStream(new FileInputStream(file));
>     final String mediaType = tika.detect(buffer, file.getName());
>     // System.out.println("mediaType->" + mediaType);
>
>     final String content = tika.parseToString(buffer);
>     // System.out.println("extractByFacade>>>>>>>>>>>>>>");
>     // System.out.println(content + " " + content.length());
> }
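
On the Inflater question: java.util.zip.Inflater keeps its zlib state in native memory, and end() is what releases it promptly. One JDK detail that may be relevant when counting end() calls is that an InflaterInputStream only ends an Inflater it created itself; if the caller supplies one, close() leaves it alone. The sketch below counts end() calls the same way the debugging session above did. It is plain JDK code, not what Tika or commons-compress does internally, and CountingInflater is a hypothetical helper made up for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

public class InflaterEndSketch {

    // Hypothetical helper: an Inflater that records how often end() is called.
    static class CountingInflater extends Inflater {
        int endCalls = 0;

        @Override
        public void end() {
            endCalls++;
            super.end();
        }
    }

    // Compress a short string so there is something to inflate.
    static byte[] deflate(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
            dos.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }

    // Returns how many end() calls closing the stream alone has triggered
    // on a caller-supplied Inflater.
    static int endCallsAfterStreamClose() throws IOException {
        CountingInflater inflater = new CountingInflater();
        byte[] compressed = deflate("hello");
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed), inflater)) {
            in.readAllBytes();
        }
        int seen = inflater.endCalls;
        inflater.end(); // the caller owns this Inflater, so release it explicitly
        return seen;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("end() calls after close: " + endCallsAfterStreamClose());
    }
}
```

An Inflater that is never ended holds its native memory until the object itself is garbage collected, so with a large heap and little GC pressure the off-heap usage can keep growing even though nothing is leaked in the strict sense. Whether that is what happens with the third ZipArchiveInputStream in your case is something the call counts suggest but do not prove.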
