My bad. The issue I mentioned was reported against Tika 1.16, so it's not related to the current thread. Most likely a problem in my own code :)
On 8 August 2018 at 05:25 +0200, Robert Neal Clayton <[email protected]> wrote:
> I have a remarkably similar setup to David's: I'm running about 50,000 PDF
> files through OCR tools at the moment. I have Tika 1.18 running standalone
> and a shell script sending PDFs to it via curl, POSTing each file to the
> /meta URL to extract metadata before the OCR step.
>
> After 10 hours of uptime, Tika is using about 5.6 GB of memory. After
> restarting the Tika server, that appears to be about the same amount of
> memory it uses when it starts fresh.
>
> So whatever the issue is, it's not in anything that falls under /meta;
> that part is working great for me.
>
> > On Aug 7, 2018, at 9:57 AM, Tim Allison <[email protected]> wrote:
> >
> > Thank you, David! It would be helpful to know if downgrading to 1.16
> > solves the problems with .txt files, as it does (apparently) with PDFs.
> >
> > On Tue, Aug 7, 2018 at 9:10 AM David Pilato <[email protected]> wrote:
> > >
> > > That's interesting. Someone ran some tests on a project I'm working on
> > > and also reported a lot of memory usage (even for just .txt files).
> > > I haven't dug into the issue yet, so I don't know whether it is
> > > related, but I thought I'd share it here:
> > > https://github.com/dadoonet/fscrawler/issues/566
> > >
> > > On 7 August 2018 at 14:36 +0200, Tim Allison <[email protected]> wrote:
> > >
> > > Thomas,
> > > Thank you for raising this on the Solr list. Please let us know if we
> > > can help you help us figure out what's going on... or if you've
> > > already figured it out!
> > > Thank you!
> > >
> > > Best,
> > > Tim
> > >
> > > ---------- Forwarded message ---------
> > > From: Thomas Scheffler <[email protected]>
> > > Date: Thu, Aug 2, 2018 at 6:06 AM
> > > Subject: Memory Leak in 7.3 to 7.4
> > > To: [email protected] <[email protected]>
> > >
> > > Hi,
> > >
> > > we noticed a memory leak in a rather small setup.
> > > 40,000 metadata documents, with nearly as many files that carry
> > > "literal.*" fields. While 7.2.1 brought some Tika issues (due to a
> > > beta version), the real problems started to appear with version 7.3.0
> > > and are currently unresolved in 7.4.0. Memory consumption is through
> > > the roof: where 512 MB of heap used to be enough, now 6 GB isn't
> > > enough to index all files.
> > > I am now at a point where I can track this down to the libraries in
> > > solr-7.4.0/contrib/extraction/lib/. If I replace them all with the
> > > libraries shipped with 7.2.1, the problem disappears. As most files
> > > are PDF documents, I tried updating PDFBox to 2.0.11 and Tika to 1.18,
> > > with no solution to the problem. I will next try downgrading these
> > > single libraries back to 2.0.6 and 1.16 to see whether they are the
> > > source of the memory leak.
> > >
> > > In the meantime, I would like to know whether anybody else has
> > > experienced the same problems?
> > >
> > > Kind regards,
> > >
> > > Thomas
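[Editor's note] Robert's curl-per-file loop against tika-server's /meta endpoint can be sketched roughly like this. The ./pdfs directory, port 9998 (tika-server's default), and the output naming are assumptions, not taken from the thread; note that `curl -T` issues a PUT, which tika-server's /meta endpoint accepts, even though the message says POST.

```shell
#!/bin/sh
# Hypothetical sketch of the per-file metadata extraction loop:
# upload each PDF to a running tika-server and save the metadata.
# Assumes tika-server is listening on localhost:9998 (its default port).
TIKA_META="http://localhost:9998/meta"

for f in ./pdfs/*.pdf; do
  [ -e "$f" ] || continue   # skip if the glob matched nothing
  # -T uploads the file with PUT; the Accept header asks for JSON metadata.
  curl -s -T "$f" -H "Accept: application/json" "$TIKA_META" \
    > "${f%.pdf}.meta.json"
done
```

Running the metadata pass separately like this, before OCR, keeps the cheap step isolated from the expensive one, which is also what makes the per-endpoint memory comparison in the thread possible.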
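[Editor's note] Thomas's bisection step, replacing all of solr-7.4.0/contrib/extraction/lib/ with the jars shipped in 7.2.1, could look like the following sketch. Both install paths are illustrative assumptions; the guard makes the script a no-op if they don't exist.

```shell
#!/bin/sh
set -e
# Illustrative sketch of swapping Solr's extraction libraries wholesale,
# to test whether they are responsible for the memory growth.
# The two install paths below are assumptions; adjust them to your setup.
NEW=solr-7.4.0/contrib/extraction/lib
OLD=solr-7.2.1/contrib/extraction/lib

# Only proceed if both installs are actually present.
if [ -d "$NEW" ] && [ -d "$OLD" ]; then
  mv "$NEW" "${NEW}.bak"   # keep the 7.4.0 jars so they can be restored
  cp -R "$OLD" "$NEW"      # drop in the 7.2.1 extraction jars
  echo "swapped extraction libs; restart Solr and re-run the index job"
fi
```

Swapping the whole directory first, then downgrading individual jars (pdfbox, tika) one at a time, narrows the leak down library by library, which is exactly the plan Thomas describes.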
