Thank you, Tim!

On Wed, Mar 22, 2023 at 2:16 AM Tim Allison <[email protected]> wrote:

> https://issues.apache.org/jira/browse/TIKA-3990
>
> On Tue, Mar 21, 2023 at 11:53 AM Tim Allison <[email protected]> wrote:
> >
> > K. I watched Tika.parseToString(InputStream) in a debugger.
> >
> > I'm puzzled about these lines in our OOXMLExtractorFactory:
> >
> > } finally {
> >     if (tmpRepairedCopy != null) {
> >         if (pkg != null) {
> >             pkg.revert();
> >         }
> >
> > If a user calls OOXMLExtractorFactory on a TikaInputStream, we load
> > the pkg into the OpenContainer on the TikaInputStream, and it gets
> > closed as a by-product of closing the TikaInputStream.  But I'm
> > wondering now why we don't close the pkg on a regular InputStream if
> > it is not repaired.  My guess is that closing the pkg would force a
> > close on the underlying zip InputStream?
> >
> >
> > On Tue, Mar 21, 2023 at 11:25 AM Tim Allison <[email protected]> wrote:
> > >
> > > Doh.  You're right.  I should have read our documentation on
> > > parseToString():
> > >
> > > <strong>NOTE:</strong> Unlike most other Tika methods that take an
> > > * {@link InputStream}, this method will close the given stream for
> > > * you as a convenience.
> > >
> > > I thought that you had cleared up the slow-building OOM because of a
> > > colleague using JNI?  Are you still having problems, or are you just
> > > curious about the 3 openings and 2 closings?  Let me break out the
> > > debugger and take a look.
> > >
> > > On Tue, Mar 21, 2023 at 7:49 AM Darren <[email protected]> wrote:
> > > >
> > > >
> > > > public static ZipArchiveThresholdInputStream openZipStream(InputStream stream) throws IOException {
> > > >     // Peek at the first few bytes to sanity check
> > > >     InputStream checkedStream = FileMagic.prepareToCheckMagic(stream);
> > > >     verifyZipHeader(checkedStream);
> > > >
> > > >     // Open as a proper zip stream
> > > >     return new ZipArchiveThresholdInputStream(new ZipArchiveInputStream(checkedStream));
> > > > }
> > > >
> > > >
> > > > When parseToString is called, the code above runs once and constructs
> > > > one ZipArchiveInputStream, which puts data into
> > > > org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#inf, but
> > > > org.apache.commons.compress.archivers.zip.ZipArchiveInputStream#close
> > > > is called zero times.
> > > >
> > > > Shouldn't we close it?  Please help me.  Thank you.
> > > >
> > > >
> > > > On Sun, Mar 19, 2023 at 10:41 PM Tim Allison <[email protected]> wrote:
> > > >>
> > > >> > it called ZipArchiveInputStream constructor three times (two for mediatype, one for parse), but only two times calling java.util.zip.Inflater#end()?
> > > >>
> > > >> Wait, are you calling close on your BufferedInputStream?
> > > >>
> > > >> On Sat, Mar 18, 2023 at 9:36 PM Darren <[email protected]> wrote:
> > > >> >
> > > >> > Thank you for replying on the weekend, Tim!
> > > >> >
> > > >> > In my program, both methods (detect and parseToString) are called one after another to get the media type and the plain text.
> > > >> > We test millions of sample files every day and notice that the Java heap is normal, but off-heap memory keeps increasing until the Java process is killed by the Linux oom-killer.
> > > >> > My program itself does not use off-heap memory through native code.
> > > >> >
> > > >> > Last night I tested only the detect method to get the media type, and everything seemed normal. I will test parseToString later.
> > > >> >
> > > >> > I will also try your suggestion and test the program again. Thanks!
> > > >> >
> > > >> >
> > > >> > On Sat, Mar 18, 2023 at 7:46 PM Tim Allison <[email protected]> wrote:
> > > >> >>
> > > >> >> Do you get the off heap problem only on parseToString and not on detect?
> > > >> >>
> > > >> >> Not part of your question, but I'd recommend using
> > > >> >> TikaInputStream.get(file, metadata).  It is far more efficient for
> > > >> >> zip-based files as well as PDFs and other parsers that require random
> > > >> >> access.
> > > >> >>
> > > >> >> On Sat, Mar 18, 2023 at 1:24 AM 朱桂锋 <[email protected]> wrote:
> > > >> >> >
> > > >> >> > First, thank you for Tika; it is a great project!
> > > >> >> >
> > > >> >> > Recently I ran Tika (version 2.7.0) to extract text from documents, and I found that Java off-heap memory keeps increasing until memory usage reaches 100% and the process is killed by the oom-killer.
> > > >> >> >
> > > >> >> > I then used pmap and dumped data from memory (excluding the Java heap). It looks like this:
> > > >> >> >
> > > >> >> > [Content_Types].xmlPK
> > > >> >> >
> > > >> >> > _rels/.relsPK word/_rels/document.xml.relsPK word/document.xmlPK word/footer4.xmlPK word/header4.xmlPK word/footer2.xmlPK word/header2.xmlPK word/header3.xmlPK word/footer3.xmlPK word/header1.xmlPK
> > > >> >> >
> > > >> >> > word/footer1.xmlPK
> > > >> >> >
> > > >> >> > word/footnotes.xmlPK word/endnotes.xmlPK word/header5.xmlPK word/media/image3.pngPK word/media/image1.jpegPK word/media/image2.jpegPK word/theme/theme1.xmlPK word/settings.xmlPK
> > > >> >> >
> > > >> >> > customXml/itemProps2.xmlPK
> > > >> >> >
> > > >> >> > customXml/item2.xmlPK docProps/custom.xmlPK t?92 customXml/_rels/item1.xml.relsPK customXml/_rels/item2.xml.relsPK customXml/itemProps1.xmlPK
> > > >> >> >
> > > >> >> >
> > > >> >> >
> > > >> >> > This is Office document content, so why is it off-heap? I suspect that parsing Office documents causes a memory leak.
> > > >> >> >
> > > >> >> > More information: when I debug the code on my own Mac, using an xlsx file as the input sample,
> > > >> >> > a call to tika.detect invokes the ZipArchiveInputStream constructor twice and java.util.zip.Inflater#end() the same number of times;
> > > >> >> > but a call to tika.parseToString invokes the ZipArchiveInputStream constructor three times (two for the media type, one for the parse) and java.util.zip.Inflater#end() only twice.
> > > >> >> >
> > > >> >> > Could that cause the off-heap memory leak, since the Inflater uses native code?
> > > >> >> >
> > > >> >> > I look forward to your reply. Thank you very much!
> > > >> >> >
> > > >> >> > my test code:
> > > >> >> >
> > > >> >> >     public static void extractByFacade(File file) throws Exception {
> > > >> >> >         Tika tika = new Tika();
> > > >> >> >         tika.setMaxStringLength(240);
> > > >> >> >         org.apache.poi.util.IOUtils.setByteArrayMaxOverride(200000000);
> > > >> >> >
> > > >> >> >         final BufferedInputStream buffer = new BufferedInputStream(new FileInputStream(file));
> > > >> >> >         final String mediaType = tika.detect(buffer, file.getName());
> > > >> >> > //        System.out.println("mediaType->" + mediaType);
> > > >> >> >
> > > >> >> >         final String content = tika.parseToString(buffer);
> > > >> >> > //        System.out.println("extractByFacade>>>>>>>>>>>>>>");
> > > >> >> > //        System.out.println(content + "  " + content.length());
> > > >> >> >     }
> > > >> >> >
> > > >> >> >
>
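
The question running through this thread is who ends up calling
java.util.zip.Inflater#end(), since an Inflater holds native zlib state
until that call. The sketch below uses only java.util.zip, not Tika or
commons-compress code, and the names TrackedInflater and
InflaterOwnershipDemo are hypothetical, introduced only for illustration.
It suggests that closing a JDK InflaterInputStream does not end an Inflater
that the caller supplied; whoever created the Inflater has to end it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper: an Inflater that records whether end() has been called. */
class TrackedInflater extends Inflater {
    private boolean ended;

    @Override
    public void end() {
        ended = true;
        super.end(); // releases the native zlib state held by this Inflater
    }

    boolean isEnded() {
        return ended;
    }
}

/** Hypothetical demo class; not part of Tika or commons-compress. */
class InflaterOwnershipDemo {

    /** Compresses the given bytes so there is something to inflate. */
    static byte[] deflate(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bos)) {
            out.write(raw);
        }
        return bos.toByteArray();
    }

    /** Reads the stream to the end and discards the data. */
    private static void drain(InputStream in) throws IOException {
        byte[] buf = new byte[4096];
        while (in.read(buf) != -1) {
            // nothing to do, the bytes are not needed
        }
    }

    /** Closes the stream only; reports whether close() ended the supplied Inflater. */
    static boolean endedByStreamClose(byte[] compressed) throws IOException {
        TrackedInflater inf = new TrackedInflater();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed), inf)) {
            drain(in);
        }
        boolean ended = inf.isEnded();
        inf.end(); // clean up after recording the observation
        return ended;
    }

    /** Closes the stream and then ends the Inflater explicitly, as its owner should. */
    static boolean endedByOwner(byte[] compressed) throws IOException {
        TrackedInflater inf = new TrackedInflater();
        try (InputStream in = new InflaterInputStream(new ByteArrayInputStream(compressed), inf)) {
            drain(in);
        } finally {
            inf.end();
        }
        return inf.isEnded();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = deflate("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println("ended by stream close: " + endedByStreamClose(data));
        System.out.println("ended by owner:        " + endedByOwner(data));
    }
}
```

If ZipArchiveInputStream plays the owner role here and ends its inf field
only in close() (an assumption based on the debugging notes above, not
something this sketch verifies), then a stream that is constructed but never
closed would keep its native buffers until the Inflater is garbage
collected. That would be consistent with the three constructor calls and two
end() calls reported for parseToString.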