Interesting. Ashish Basran on TIKA-2180 has identified similar behavior with
PDF and docx/xlsx.
Would you be able to run hprof on your process and share the output? Perhaps
Luis Filipe Nassif would be able to help analyze that as he did on TIKA-2058?
Also, could you experiment with -Xmx1g or similar to see whether a few outlier
PDFs are using more memory than expected? From your description so far, though,
this feels like a static memory leak kind of problem, not necessarily one
caused by a single file (as in, e.g., TIKA-2045).
Finally, could you run Tika App in batch mode and confirm whether you’re seeing
the same behavior?
java -jar tika-app.jar -i <input_dir> -o <output_dir> -JXmx8g
Best,
Tim
From: Will Jones [mailto:[email protected]]
Sent: Tuesday, January 3, 2017 1:41 PM
To: [email protected]
Subject: Re: Memory issues with the Tika Facade
Hello Both,
Thanks for the reply. VisualVM shows that 8GB is being reserved (Xmx is 8GB).
Used memory quickly climbs to around 6GB and eventually to 8GB, at which point
the program crashes. Triggering garbage collection manually does not free any
memory.
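To put numbers on this without VisualVM, used heap can be logged from inside the process, for example after each file. The following is a minimal sketch using only the standard `Runtime` API. The names `HeapLog` and `usedHeapMb` are made up for illustration, and `System.gc()` is only a hint to the JVM, so the readings are approximate.

```java
public class HeapLog {

    /** Approximate used heap in MB, measured after asking the JVM to collect. */
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // a request, not a guarantee
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long before = usedHeapMb();
        // This array stays reachable, so a collection cannot reclaim it.
        // That is the pattern a leak produces: used heap stays high after GC.
        byte[] held = new byte[16 * 1024 * 1024];
        long after = usedHeapMb();
        System.out.println("used before=" + before + " MB, after=" + after
                + " MB, still holding " + held.length + " bytes");
    }
}
```

If the value printed after each file keeps rising even though a collection was requested, something is still holding references to earlier work.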
The files themselves are a mixture of PDF, JPG, and Office. The largest PDF
file is 20MB, the largest DOCX is 600KB. I have done some testing and it is the
PDF files that cause the issue (only running the JPG and Office files causes no
memory problems).
I am using Tika 1.14. I had thought that by discarding the Tika facade on each
loop iteration, any memory used by the previous parse would be freed (sorry, I
am a bit new to Java)?
Thank you
On 3 January 2017 at 17:56, Allison, Timothy B. <[email protected]> wrote:
Concur with Markus.
Also, what type of files are these? We know that very large .docx (think "War
and Peace") and .pptx can use up a crazy amount of memory. We've added new
experimental parsers to handle those via SAX in trunk (coming in v 1.15), and
these parsers decrease memory usage dramatically.
-----Original Message-----
From: Markus Jelsma [mailto:[email protected]]
Sent: Tuesday, January 3, 2017 12:23 PM
To: [email protected]
Subject: RE: Memory issues with the Tika Facade
Hello - what counts as a large amount of memory, how are you measuring it (make
sure you look at RES, not VIRT), and what are your JVM settings?
It is not uncommon for programs to allocate a lot of memory when the default max
heap is used (2 GB in my case). If your JVM uses too much, limit it by setting
Xmx to a lower value.
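The heap ceiling a JVM is actually running with can be checked from inside the process. The following is a small sketch using the standard `Runtime` API. The class name `MaxHeap` and the helper `describe` are illustrative only.

```java
public class MaxHeap {

    /** Formats a byte count in MB; Runtime.maxMemory() returns Long.MAX_VALUE when there is no limit. */
    static String describe(long bytes) {
        return bytes == Long.MAX_VALUE ? "no limit" : (bytes / (1024L * 1024L)) + " MB";
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory: the most the JVM will try to use (governed by -Xmx or the default).
        System.out.println("max heap:   " + describe(rt.maxMemory()));
        // totalMemory: what the JVM has currently reserved from the OS for the heap.
        System.out.println("total heap: " + describe(rt.totalMemory()));
    }
}
```

Running it with and without an explicit -Xmx value shows which limit applies to a given launch.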
Markus
-----Original message-----
> From: Will Jones <[email protected]>
> Sent: Tuesday 3rd January 2017 18:14
> To: [email protected]
> Subject: Memory issues with the Tika Facade
>
> Hi,
>
> Big fan of what you are doing with Apache Tika. I have been using the Tika
> facade to fetch metadata on each file in a directory containing a large
> number of files.
>
> It returns the data I need, but the running process very quickly consumes a
> large amount of memory as it proceeds through the files.
>
> What am I doing wrong? I have attached the code required to reproduce my
> problem below.
>
>
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> import org.apache.tika.Tika;
> import org.apache.tika.metadata.Metadata;
>
> public class TikaTest {
>
>     public void tikaProcess(Path filePath) {
>         Tika t = new Tika();
>         try {
>             Metadata metadata = new Metadata();
>
>             String result = t.parse(filePath, metadata).toString();
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
>
>     public static void main(String[] args) {
>         TikaTest tt = new TikaTest();
>         try {
>             Files.list(Paths.get("g:/somedata/")).forEach(
>                 path -> tt.tikaProcess(path)
>             );
>         } catch (Exception e) {
>             e.printStackTrace();
>         }
>     }
> }
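
One thing worth ruling out in the code above: `Tika.parse(Path, Metadata)` returns a `java.io.Reader`, and the loop neither reads nor closes it. Whether that accounts for the growth described in this thread is an assumption, not something confirmed here. The sketch below shows only the resource-handling pattern, using the standard library. `ReaderOpener` is a hypothetical stand-in for the Tika call, and `CloseEachReader` and `processAll` are illustrative names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class CloseEachReader {

    /** Hypothetical stand-in for a call that returns a Reader for a file. */
    interface ReaderOpener {
        Reader open(Path path) throws IOException;
    }

    /** Opens one Reader per regular file, drains it, and closes it before moving on. */
    static int processAll(Path dir, ReaderOpener opener) throws IOException {
        int processed = 0;
        // Files.list keeps a directory handle open, so the stream is closed as well.
        try (Stream<Path> paths = Files.list(dir)) {
            for (Path path : (Iterable<Path>) paths::iterator) {
                if (!Files.isRegularFile(path)) {
                    continue;
                }
                // try-with-resources closes the Reader even if reading fails.
                try (Reader reader = opener.open(path)) {
                    char[] buf = new char[8192];
                    while (reader.read(buf) != -1) {
                        // Content is discarded; only the side effects of parsing matter here.
                    }
                    processed++;
                } catch (IOException e) {
                    System.err.println("Failed on " + path + ": " + e.getMessage());
                }
            }
        }
        return processed;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("close-each-reader");
        Files.write(dir.resolve("a.txt"), "hello".getBytes());
        // A StringReader stands in for the real parser in this demo.
        int n = processAll(dir, p -> new StringReader(p.getFileName().toString()));
        System.out.println("processed " + n + " file(s)");
    }
}
```

With Tika on the classpath, the opener would presumably be something like `p -> tika.parse(p, new Metadata())`, with one `Tika` instance created outside the loop and reused for every file.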