I have a remarkably similar setup to David's: I'm currently running about 50,000 
PDF files through OCR tools. I have Tika 1.18 running standalone, with a shell 
script that sends each PDF to it via curl, POSTing the file to the /meta URL to 
extract metadata before the OCR step.
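
For reference, the loop I'm using looks roughly like this. Paths are
illustrative, and I'm showing curl's -T upload form here, which the
tika-server examples use (it issues an HTTP PUT); 9998 is tika-server's
default port:

```shell
#!/bin/sh
# Sketch of the metadata loop described above (hypothetical paths).
# Assumes a standalone tika-server listening on its default port, 9998.
TIKA_URL="http://localhost:9998/meta"
for pdf in /path/to/pdfs/*.pdf; do
  [ -f "$pdf" ] || continue       # skip if the glob matched nothing
  # -T streams the file as the request body; the Accept header asks
  # Tika to return the metadata as JSON instead of the default CSV
  curl -s -T "$pdf" -H "Accept: application/json" "$TIKA_URL" \
      > "${pdf%.pdf}.meta.json"
done
```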

After 10 hours of uptime, Tika is using about 5.6 GB of memory. After 
restarting the Tika server, that appears to be about the same amount of memory 
it allocates when it starts fresh.

So whatever the issue is, it's not in anything that falls under /meta; that 
stuff is working great for me.

> On Aug 7, 2018, at 9:57 AM, Tim Allison <[email protected]> wrote:
> 
> Thank you, David!  It would be helpful to know if downgrading to 1.16
> solves the problems with .txt files, as it does (apparently) with
> pdfs.
> On Tue, Aug 7, 2018 at 9:10 AM David Pilato <[email protected]> wrote:
>> 
>> That's interesting. Someone ran some tests on a project I'm working on and 
>> likewise reported high memory usage (even for txt files only).
>> I have not dug into the issue yet, so I don't know whether it is related, 
>> but I thought I'd share it here: 
>> https://github.com/dadoonet/fscrawler/issues/566
>> 
>> 
>> On Aug 7, 2018, at 14:36 +0200, Tim Allison <[email protected]> wrote:
>> 
>> Thomas,
>>   Thank you for raising this on the Solr list. Please let us know if we can 
>> help you help us figure out what’s going on...or if you’ve already figured 
>> it out!
>>    Thank you!
>> 
>>    Best,
>>       Tim
>> 
>> ---------- Forwarded message ---------
>> From: Thomas Scheffler <[email protected]>
>> Date: Thu, Aug 2, 2018 at 6:06 AM
>> Subject: Memory Leak in 7.3 to 7.4
>> To: [email protected] <[email protected]>
>> 
>> 
>> Hi,
>> 
>> we noticed a memory leak in a rather small setup: 40,000 metadata documents, 
>> with nearly as many files that carry „literal.*“ fields. While 7.2.1 had 
>> some Tika issues (due to a beta version), the real problems started with 
>> version 7.3.0 and are currently unresolved in 7.4.0. Memory consumption has 
>> gone through the roof: where 512 MB of heap used to be enough, 6 GB is now 
>> not enough to index all files.
>> I am now at a point where I can track this down to the libraries in 
>> solr-7.4.0/contrib/extraction/lib/. If I replace them all with the libraries 
>> shipped with 7.2.1, the problem disappears. As most files are PDF documents, 
>> I tried updating pdfbox to 2.0.11 and tika to 1.18, with no solution to the 
>> problem. I will next try downgrading these two libraries back to 2.0.6 and 
>> 1.16 to see whether they are the source of the memory leak.
>> 
>> In the meantime, I would like to know whether anybody else has experienced 
>> the same problem.
>> 
>> kind regards,
>> 
>> Thomas
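
For anyone wanting to reproduce Thomas's isolation test, the jar swap could be
sketched roughly as follows (the paths are hypothetical; adjust them to where
your Solr installs actually live):

```shell
#!/bin/sh
# Sketch of the test Thomas describes: replace the 7.4.0 extraction
# jars with the set shipped in 7.2.1, keeping a backup so the change
# is easy to revert. Paths are illustrative.
NEW=solr-7.4.0/contrib/extraction/lib
OLD=solr-7.2.1/contrib/extraction/lib
if [ -d "$NEW" ] && [ -d "$OLD" ]; then
    mv "$NEW" "$NEW.bak"    # keep the original 7.4.0 jars around
    cp -R "$OLD" "$NEW"     # drop in the 7.2.1 jars
fi
```

Restart Solr after the swap so the replacement libraries are actually loaded.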
