Thank you, Tim!

On Wed, Mar 22, 2023 at 02:16, Tim Allison <[email protected]> wrote:
> https://issues.apache.org/jira/browse/TIKA-3990
>
> On Tue, Mar 21, 2023 at 11:53 AM Tim Allison <[email protected]> wrote:
> >
> > K. I watched Tika.parseToString(InputStream) in a debugger.
> >
> > I'm puzzled about these lines in our OOXMLExtractorFactory:
> >
> > } finally {
> >     if (tmpRepairedCopy != null) {
> >         if (pkg != null) {
> >             pkg.revert();
> >         }
> >
> > if a user calls OOXMLExtractorFactory on a TikaInputStream, we load
> > the pkg into the OpenContainer on the TikaInputStream and that will
> > get closed as a by-product of closing the TikaInputStream. But, I'm
> > wondering now why we don't close the pkg on a regular inputstream if
> > it is not repaired. My guess is that closing the pkg would force a
> > close on the underlying zip inputstream?
> >
> >
> > On Tue, Mar 21, 2023 at 11:25 AM Tim Allison <[email protected]> wrote:
> > >
> > > Doh. You're right. I should have read our documentation on parseToString():
> > >
> > > <strong>NOTE:</strong> Unlike most other Tika methods that take an
> > > * {@link InputStream}, this method will close the given stream for
> > > * you as a convenience.
> > >
> > > I thought that you had cleared up the slow building oom because of a
> > > colleague using jni? Are you still having problems or are you just
> > > curious about 3 openings and 2 closings? Let me break out the
> > > debugger and take a look.
> > >
> > > On Tue, Mar 21, 2023 at 7:49 AM Darren <[email protected]> wrote:
> > > >
> > > >
> > > > public static ZipArchiveThresholdInputStream openZipStream(InputStream stream) throws IOException {
> > > >     // Peek at the first few bytes to sanity check
> > > >     InputStream checkedStream = FileMagic.prepareToCheckMagic(stream);
> > > >     verifyZipHeader(checkedStream);
> > > >
> > > >     // Open as a proper zip stream
> > > >     return new ZipArchiveThresholdInputStream(new ZipArchiveInputStream(checkedStream));
> > > > }
> > > >
> > > >
> > > > When calling parseToString, the code above runs once; it then initializes ZipArchiveInputStream once and puts data into org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#inf, but org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#close is called zero times.
> > > >
> > > > Shouldn't we close? Please help me. Thank you
> > > >
> > > >
> > > > Tim Allison <[email protected]> wrote on Sun, Mar 19, 2023 at 22:41:
> > > >>
> > > >> > it called the ZipArchiveInputStream constructor three times (two for mediatype, one for parse), but called java.util.zip.Inflater#end() only two times?
> > > >>
> > > >> Wait, are you calling close on your BufferedInputStream?
> > > >>
> > > >> On Sat, Mar 18, 2023 at 9:36 PM Darren <[email protected]> wrote:
> > > >> >
> > > >> > Thank you for your reply on the weekend, Tim!
> > > >> >
> > > >> > In my program, both methods (detect and parseToString) are used one after another to get the media type and plain text.
> > > >> > We test millions of file samples every day, and notice that the Java heap is normal but off-heap memory keeps increasing until the Java process is killed by the Linux oom-killer.
> > > >> > In my program, I don't use off-heap memory from native code.
> > > >> >
> > > >> > Last night I tested only the detect method to get the media type, and everything seemed normal. Later I will test parseToString.
> > > >> >
> > > >> > And I will try your suggestion and test the program again.
> > > >> > Thanks!
> > > >> >
> > > >> >
> > > >> > Tim Allison <[email protected]> wrote on Sat, Mar 18, 2023 at 19:46:
> > > >> >>
> > > >> >> Do you get the off heap problem only on parseToString and not on detect?
> > > >> >>
> > > >> >> Not part of your question, but I'd recommend using
> > > >> >> TikaInputStream.get(file, metadata). It is far more efficient for
> > > >> >> zip-based files as well as PDFs and other parsers that require random
> > > >> >> access.
> > > >> >>
> > > >> >> On Sat, Mar 18, 2023 at 1:24 AM 朱桂锋 <[email protected]> wrote:
> > > >> >> >
> > > >> >> > Firstly, thank you for Tika, it is a great project!
> > > >> >> >
> > > >> >> > Recently I ran Tika (version 2.7.0) to extract text from documents, and I found that Java off-heap memory keeps increasing until memory usage reaches 100%, and then the process is killed by the oom-killer.
> > > >> >> >
> > > >> >> > Then I used pmap and dumped the data from memory (excluding the Java heap), and found content like this:
> > > >> >> >
> > > >> >> > [ Content
> > > >> >> >
> > > >> >> > Types] . xM1PK
> > > >> >> >
> > > >> >> > rels/.relsPK word/ rels/document.xm1.relsPK word /document.xm1PK word/footer4.xmIPK word/header4. xm1PK word/footer2.xmIPK word/header2. xm1PK word /header3.xmIPK word/footer3.xmlPK word /header1.xm1PK
> > > >> >> >
> > > >> >> > word/ footer1 . xm1PK
> > > >> >> >
> > > >> >> > word / footnotes.xmlPK word/endnotes .xm1PK word/header5. xm1PK word/media/ image3.pngPK word/media/imagel. jpegPK word/media/image2. jpegPK word / theme/ theme 1. xm1PK word/settings. xm1PK
> > > >> >> >
> > > >> >> > customxml/ itemProps2 .xm1PK
> > > >> >> >
> > > >> >> > customXml /item2 . xm1PK docProps /custom. xm1 PK t?92 customXml/rels/item1.xm1.relsPK customXml/ rels/item2.xm1.relsPK customXm1 /itemProps1.xm1PK
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> > This is office document text; why is it in off-heap memory? So I suspect that parsing office documents causes a memory leak.
> > > >> >> >
> > > >> >> > Another piece of information: when I debug the code on my own Mac, using an xlsx file as the input sample,
> > > >> >> > calling tika.detect invokes the ZipArchiveInputStream constructor twice, and java.util.zip.Inflater#end() is called the same number of times;
> > > >> >> > but calling tika.parseToString invokes the ZipArchiveInputStream constructor three times (two for mediatype, one for parse), while java.util.zip.Inflater#end() is called only two times.
> > > >> >> >
> > > >> >> > Is the off-heap memory leak caused by the Inflater using native code?
> > > >> >> >
> > > >> >> > I look forward to your reply! Thank you very much!
> > > >> >> >
> > > >> >> > my test code:
> > > >> >> >
> > > >> >> > public static void extractByFacade(File file) throws Exception {
> > > >> >> >     Tika tika = new Tika();
> > > >> >> >     tika.setMaxStringLength(240);
> > > >> >> >     org.apache.poi.util.IOUtils.setByteArrayMaxOverride(200000000);
> > > >> >> >
> > > >> >> >     final BufferedInputStream buffer = new BufferedInputStream(new FileInputStream(file));
> > > >> >> >     final String mediaType = tika.detect(buffer, file.getName());
> > > >> >> >     // System.out.println("mediaType->" + mediaType);
> > > >> >> >
> > > >> >> >     final String content = tika.parseToString(buffer);
> > > >> >> >     // System.out.println("extractByFacade>>>>>>>>>>>>>>");
> > > >> >> >     // System.out.println(content + " " + content.length());
> > > >> >> > }
> > > >> >> >
> > > >> >> >
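
For anyone who wants to count openings and closings without stepping through a debugger, here is a minimal sketch that uses only the JDK. It wraps a stream in a counting filter and reports how many wrapped streams were never closed. Everything in it is an assumption made for illustration: CloseAudit, track, readAndClose and readOnly are hypothetical names, not Tika or commons-compress APIs, and the two consumer methods are only stand-ins for a call that closes the stream for you (as the parseToString() javadoc quoted above describes) and a call that leaves closing to the caller.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration; not part of Tika or commons-compress.
final class CloseAudit {
    private final AtomicInteger opened = new AtomicInteger();
    private final AtomicInteger closed = new AtomicInteger();

    /** Wraps a stream so that its creation and its first close() are counted. */
    InputStream track(InputStream in) {
        opened.incrementAndGet();
        return new FilterInputStream(in) {
            private boolean done;

            @Override
            public void close() throws IOException {
                if (!done) { // count repeated close() calls only once
                    done = true;
                    closed.incrementAndGet();
                }
                super.close();
            }
        };
    }

    /** Number of tracked streams that have not been closed yet. */
    int leaked() {
        return opened.get() - closed.get();
    }

    // Stand-in for a consumer that closes the stream on the caller's behalf.
    static int readAndClose(InputStream in) throws IOException {
        try (InputStream s = in) {
            return drain(s);
        }
    }

    // Stand-in for a consumer that leaves closing to the caller.
    static int readOnly(InputStream in) throws IOException {
        return drain(in);
    }

    // Reads the stream to the end and returns the number of bytes seen.
    private static int drain(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws IOException {
        CloseAudit audit = new CloseAudit();
        byte[] data = {1, 2, 3};
        readAndClose(audit.track(new ByteArrayInputStream(data)));
        readOnly(audit.track(new ByteArrayInputStream(data)));
        System.out.println("streams left open: " + audit.leaked());
    }
}
```

In a real test one would pass audit.track(new FileInputStream(file)) to the call under investigation and check leaked() afterwards. This only shows whether the stream handed in by the caller gets closed; it does not see streams or Inflater instances that a library creates internally.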
