Hi,

So running it with tika-app in batch mode, memory usage was typically between 1 and 1.5 GB.

I'm pretty sure I've found the issue, and it was a bug in my own code. I was calling t.parse(filePath, metadata), which returns a Reader object (local to that method). I had not explicitly closed it or wrapped it in a finally block. Once I make sure to call .close() on that reader, the memory usage is significantly lower.
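In case it helps anyone who finds this later, here is roughly what the corrected loop looks like (a sketch rather than my exact code; the drain loop and the metadata printout are only illustrative):

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

public class TikaTest {

    private final Tika tika = new Tika();

    public void tikaProcess(Path filePath) {
        Metadata metadata = new Metadata();
        // try-with-resources guarantees the Reader is closed even if
        // something goes wrong while reading; this is the fix.
        try (Reader reader = tika.parse(filePath, metadata)) {
            // Drain the reader so parsing completes and the metadata
            // is fully populated; the content itself is discarded.
            char[] buffer = new char[8192];
            while (reader.read(buffer) != -1) {
                // ignore the content, we only want the metadata
            }
            System.out.println(filePath + " -> " + metadata);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        TikaTest tt = new TikaTest();
        // Files.list returns a Stream that holds a directory handle,
        // so it is closed here as well.
        try (Stream<Path> paths = Files.list(Paths.get("g:/somedata/"))) {
            paths.forEach(tt::tikaProcess);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The try-with-resources block takes care of the .close() call even when parsing throws, which is what my original version was missing.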
Hopefully useful if anyone else has made the same mistake as me and comes across this on Google! Tweaking the Xmx down a bit also helped.

On 3 January 2017 at 19:13, Allison, Timothy B. <[email protected]> wrote:

> Y, I agree on limiting Xmx... I try to leave 1g per thread when running
> against our regression corpus. If 100m works on your docs, great! But it
> sounds like you were getting an OOM even with 8g, right?
>
> Perhaps try batch mode:
>
> java -jar tika-app.jar -i <input_dir> -o <output-dir> -JXmx1g
>
> -----Original Message-----
> From: Markus Jelsma [mailto:[email protected]]
> Sent: Tuesday, January 3, 2017 1:58 PM
> To: [email protected]
> Subject: RE: Memory issues with the Tika Facade
>
> Hello - you should set Xmx yourself, 100 MB should be ok depending on the
> size of your documents. Finding the optimal Xmx is iterative: as long as no
> OutOfMemory occurs, your Xmx is either too high or just spot on. If you
> hit an OutOfMemory regardless of Xmx there's probably a leak, but that
> rarely happens.
>
> Having 8 GB of heap is not a good idea; the JVM can easily eat it all,
> whether it needs it or not.
>
> Markus
>
> -----Original message-----
> > From: Will Jones <[email protected]>
> > Sent: Tuesday 3rd January 2017 19:41
> > To: [email protected]
> > Subject: Re: Memory issues with the Tika Facade
> >
> > Hello Both,
> >
> > Thanks for the reply. VisualVM shows me that 8GB is reserved (8GB Xmx);
> > the used memory quickly climbs to around 6GB and eventually to 8GB, at
> > which point the program crashes. Triggering garbage collections does not
> > free any memory.
> >
> > The files themselves are a mixture of PDF, JPG, and Office. The largest
> > PDF file is 20MB, the largest DOCX is 600KB. I have done some testing and
> > it is the PDF files that cause the issue (only running the JPG and Office
> > files causes no memory problems).
> >
> > I am using Tika 1.14. I had thought that by disposing of the Tika facade
> > on each loop iteration I would have freed up any memory used by the
> > previous parse (sorry, I am a bit new to Java)?
> >
> > Thank you
> >
> > On 3 January 2017 at 17:56, Allison, Timothy B. <[email protected]> wrote:
> > Concur with Markus.
> >
> > Also, what type of files are these? We know that very large .docx
> > (think "War and Peace") and .pptx can use up a crazy amount of memory.
> > We've added new experimental parsers to handle those via SAX in trunk
> > (coming in v1.15), and these parsers decrease memory usage dramatically.
> >
> > > -----Original Message-----
> > > From: Markus Jelsma [mailto:[email protected]]
> > > Sent: Tuesday, January 3, 2017 12:23 PM
> > > To: [email protected]
> > > Subject: RE: Memory issues with the Tika Facade
> > >
> > > Hello - what is a large amount of memory, how do you determine it (make
> > > sure you look at RES, not VIRT), and what are your JVM settings?
> > >
> > > It is not uncommon for programs to allocate much memory if the default
> > > max heap is used, 2 GB in my case. If your JVM eats too much, limit it
> > > by setting Xmx to a lower level.
> > >
> > > Markus
> > >
> > > -----Original message-----
> > > > From: Will Jones <[email protected]>
> > > > Sent: Tuesday 3rd January 2017 18:14
> > > > To: [email protected]
> > > > Subject: Memory issues with the Tika Facade
> > > >
> > > > Hi,
> > > >
> > > > Big fan of what you are doing with Apache Tika. I have been using the
> > > > Tika facade to fetch metadata for each file in a directory containing
> > > > a large number of files.
> > > >
> > > > It returns the data I need, but the running process very quickly
> > > > consumes a large amount of memory as it proceeds through the files.
> > > >
> > > > What am I doing wrong? I have attached the code required to reproduce
> > > > my problem below.
> > > >
> > > > public class TikaTest {
> > > >
> > > >     public void tikaProcess(Path filePath) {
> > > >         Tika t = new Tika();
> > > >         try {
> > > >             Metadata metadata = new Metadata();
> > > >
> > > >             String result = t.parse(filePath, metadata).toString();
> > > >         } catch (Exception e) {
> > > >             e.printStackTrace();
> > > >         }
> > > >     }
> > > >
> > > >     public static void main(String[] args) {
> > > >         TikaTest tt = new TikaTest();
> > > >         try {
> > > >             Files.list(Paths.get("g:/somedata/")).forEach(
> > > >                 path -> tt.tikaProcess(path)
> > > >             );
> > > >         } catch (Exception e) {
> > > >             e.printStackTrace();
> > > >         }
> > > >     }
> > > > }
