Interesting.  Ashish Basran on TIKA-2180 has identified similar behavior with 
PDF and docx/xlsx.

Would you be able to run hprof on your process and share the output?  Perhaps 
Luis Filipe Nassif would be able to help analyze that as he did on TIKA-2058?
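
For example, something like this would capture allocation sites (a sketch, assuming a JDK that still bundles the hprof agent, e.g. Java 8; YourMainClass is a placeholder for your driver class):

java -agentlib:hprof=heap=sites,depth=10 -cp <your_classpath> YourMainClass

That should leave a java.hprof.txt in the working directory that you could attach to the issue.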

Also, could you experiment with -Xmx1g or similar to see whether there are a few 
outstanding PDFs that are using more memory than expected? From your 
description so far, though, this feels like a static memory leak kind of 
problem, not necessarily one caused by a single file (as in, e.g., TIKA-2045).
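
Something like the following would fail fast on the offending file(s) and leave a heap dump to inspect (again a sketch; YourMainClass is a placeholder, and adjust HeapDumpPath for your platform):

java -Xmx1g -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -cp <your_classpath> YourMainClass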

Finally, if you could run Tika App in batch mode and confirm that you’re seeing 
the same behavior:

java -jar tika-app.jar -i <input_dir> -o <output_dir> -JXmx8g
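
One other thing worth checking in your loop: the facade's parse() returns a Reader, and the parse only runs to completion (and the Metadata only gets fully populated) as that Reader is consumed; if the Reader is never drained or closed, resources can be held longer than you'd expect. A minimal sketch of what I mean (TikaProcess/fetchMetadata are placeholder names, not anything in our API):

import java.io.Reader;
import java.nio.file.Path;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

public class TikaProcess {
    private static final Tika TIKA = new Tika();

    // Fetch metadata only, draining and closing the Reader from
    // Tika.parse() so its resources are released promptly.
    public static Metadata fetchMetadata(Path filePath) throws Exception {
        Metadata metadata = new Metadata();
        try (Reader reader = TIKA.parse(filePath, metadata)) {
            char[] buffer = new char[8192];
            // Read to EOF so the parse completes and the metadata
            // object is fully populated; the content is discarded.
            while (reader.read(buffer) != -1) {
            }
        }
        return metadata;
    }
}

If memory still climbs with the Reader drained and closed like that, that would point more firmly at a leak on our end.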


Best,

          Tim




From: Will Jones [mailto:[email protected]]
Sent: Tuesday, January 3, 2017 1:41 PM
To: [email protected]
Subject: Re: Memory issues with the Tika Facade

Hello Both,

Thanks for the reply. VisualVM shows 8GB reserved (matching the 8GB Xmx); used 
memory quickly climbs to around 6GB and eventually to 8GB, at which point the 
program crashes. Triggering garbage collection manually does not free any 
memory.

The files themselves are a mixture of PDF, JPG, and Office. The largest PDF 
file is 20MB and the largest DOCX is 600KB. I have done some testing and it is 
the PDF files that cause the issue: running only the JPG and Office files 
causes no memory problems.

I am using Tika 1.14. I had thought that creating a new Tika facade on each 
loop iteration, and letting the old one be garbage collected, would free up any 
memory used by the previous parse (sorry, I am a bit new to Java)?

Thank you



On 3 January 2017 at 17:56, Allison, Timothy B. 
<[email protected]<mailto:[email protected]>> wrote:
Concur with Markus.

Also, what type of files are these?  We know that very large .docx (think "War 
and Peace") and .pptx can use up a crazy amount of memory.  We've added new 
experimental parsers to handle those via SAX in trunk (coming in v 1.15), and 
these parsers decrease memory usage dramatically.
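
If you want to try them from trunk, the opt-in currently looks roughly like this (a sketch; the class and setter names are experimental and could change before 1.15 ships):

import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.OfficeParserConfig;

public class SaxOptIn {
    // Build a ParseContext that opts in to the experimental
    // SAX-based .docx extractor.
    public static ParseContext saxContext() {
        OfficeParserConfig officeConfig = new OfficeParserConfig();
        officeConfig.setUseSAXDocxExtractor(true);
        ParseContext context = new ParseContext();
        context.set(OfficeParserConfig.class, officeConfig);
        return context;
    }
}

Pass that context into AutoDetectParser.parse(...) as usual.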


-----Original Message-----
From: Markus Jelsma 
[mailto:[email protected]<mailto:[email protected]>]
Sent: Tuesday, January 3, 2017 12:23 PM
To: [email protected]<mailto:[email protected]>
Subject: RE: Memory issues with the Tika Facade

Hello - what is a large amount of memory, how do you determine it (make sure 
you look at RES, not VIRT), and what are your JVM settings?

It is not uncommon for programs to allocate a lot of memory if the default max 
heap is used (2 GB in my case). If your JVM eats too much, limit it by setting 
Xmx to a lower value.
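
For example, on Linux (with <pid> being your Java process id):

top -p <pid>

and read the RES column; VIRT is the virtual size and is usually much larger than the memory actually in use.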

Markus

-----Original message-----
> From: Will Jones <[email protected]>
> Sent: Tuesday 3rd January 2017 18:14
> To: [email protected]<mailto:[email protected]>
> Subject: Memory issues with the Tika Facade
>
> Hi,
>
> Big fan of what you are doing with Apache Tika. I have been using the Tika 
> facade to fetch metadata on each file in a directory containing a large 
> number of files.
>
> It returns the data I need, but the running process very quickly consumes a 
> large amount of memory as it proceeds through the files.
>
> What am I doing wrong? I have attached the code required to reproduce my 
> problem below.
>
>
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> import org.apache.tika.Tika;
> import org.apache.tika.metadata.Metadata;
>
> public class TikaTest {
>
>     public void tikaProcess(Path filePath) {
>         Tika t = new Tika();
>         try {
>             Metadata metadata = new Metadata();
>             String result = t.parse(filePath, metadata).toString();
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
>
>     public static void main(String[] args) {
>         TikaTest tt = new TikaTest();
>         try {
>             Files.list(Paths.get("g:/somedata/")).forEach(
>                     path -> tt.tikaProcess(path)
>             );
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
> }
