My bad. The issue I mentioned was reported against Tika 1.16,
so it's not related to the current thread. Most likely a problem in my own code :)


On 8 Aug 2018, at 05:25 +0200, Robert Neal Clayton <[email protected]> 
wrote:
> I have a remarkably similar setup to David's: I’m running through about 50,000 
> PDF files with OCR tools at the moment. I have Tika 1.18 running standalone 
> and a shell script that sends each PDF to it via curl, POSTing the file to 
> the /meta URL to extract metadata before the OCR step.
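[Editor's note: a minimal sketch of the per-file metadata request described above. The URL, port, and file name are assumptions (9998 is the Tika server's default port); note that Tika's documented curl usage uploads the file with `-T`, which issues an HTTP PUT rather than a POST, but it is the same one-request-per-file pattern.]

```shell
# Hypothetical per-file metadata call against a standalone Tika server.
# Assumptions: server on localhost:9998 (Tika's default), /meta endpoint,
# and a local file named sample.pdf.
TIKA_META="http://localhost:9998/meta"
FILE="sample.pdf"
# -s: silent; -T: upload the file; the Accept header asks for JSON metadata.
CMD="curl -s -T $FILE -H 'Accept: application/json' $TIKA_META"
echo "$CMD"
```

In a batch run, this command would typically be wrapped in a loop over the PDF set, one request per file.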
>
> After 10 hours of uptime, Tika is using about 5.6 GB of memory. After 
> restarting the Tika server, that appears to be about the same amount of 
> memory it uses on a fresh start.
>
> So whatever the issue is, it’s not in anything that falls under /meta; that 
> endpoint is working great for me.
>
> > On Aug 7, 2018, at 9:57 AM, Tim Allison <[email protected]> wrote:
> >
> > Thank you, David! It would be helpful to know whether downgrading to 1.16
> > solves the problems with .txt files, as it apparently does with
> > PDFs.
> > On Tue, Aug 7, 2018 at 9:10 AM David Pilato <[email protected]> wrote:
> > >
> > > That's interesting. Someone ran some tests on a project I'm working on 
> > > and also reported very high memory usage (even with only .txt files).
> > > I haven't dug into the issue yet, so I don't know whether it's related, 
> > > but I thought I'd share it here: 
> > > https://github.com/dadoonet/fscrawler/issues/566
> > >
> > >
> > > On 7 Aug 2018, at 14:36 +0200, Tim Allison <[email protected]> wrote:
> > >
> > > Thomas,
> > > Thank you for raising this on the Solr list. Please let us know if we can 
> > > help you help us figure out what’s going on...or if you’ve already 
> > > figured it out!
> > > Thank you!
> > >
> > > Best,
> > > Tim
> > >
> > > ---------- Forwarded message ---------
> > > From: Thomas Scheffler <[email protected]>
> > > Date: Thu, Aug 2, 2018 at 6:06 AM
> > > Subject: Memory Leak in 7.3 to 7.4
> > > To: [email protected] <[email protected]>
> > >
> > >
> > > Hi,
> > >
> > > We noticed a memory leak in a rather small setup: 40,000 metadata 
> > > documents, with nearly as many files carrying "literal.*" fields. 
> > > While 7.2.1 had some Tika issues (due to a beta version), the real 
> > > problems started to appear with version 7.3.0 and remain 
> > > unresolved in 7.4.0. Memory consumption is through the roof: where 
> > > 512 MB of heap was previously enough, 6 GB is now not enough to index all files.
> > > I have now tracked this down to the libraries in 
> > > solr-7.4.0/contrib/extraction/lib/: if I replace them all with the 
> > > libraries shipped with 7.2.1, the problem disappears. As most files are 
> > > PDF documents, I tried updating PDFBox to 2.0.11 and Tika to 1.18, which 
> > > did not solve the problem. Next I will try downgrading just these two 
> > > libraries to 2.0.6 and 1.16 to see whether they are the source of the 
> > > memory leak.
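[Editor's note: a rough sketch of the library-swap experiment Thomas describes, i.e. backing up the 7.4.0 extraction libraries and dropping in the set shipped with 7.2.1. The directory layout follows the standard Solr distribution; the mkdir/touch lines are demo scaffolding so the sketch runs standalone, and the jar name is illustrative.]

```shell
set -e

# Demo scaffolding only: stand in for unpacked Solr 7.2.1 and 7.4.0 trees.
mkdir -p solr-7.4.0/contrib/extraction/lib solr-7.2.1/contrib/extraction/lib
touch solr-7.2.1/contrib/extraction/lib/tika-core-1.16.jar

# The actual experiment: back up the 7.4.0 extraction libs, then copy in
# the full set shipped with 7.2.1 and re-run the indexing job against them.
mv solr-7.4.0/contrib/extraction/lib solr-7.4.0/contrib/extraction/lib.bak
cp -r solr-7.2.1/contrib/extraction/lib solr-7.4.0/contrib/extraction/lib
```

Swapping the whole directory first, then bisecting down to individual jars (pdfbox, tika-core, etc.), is the usual way to isolate which library introduced the regression.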
> > >
> > > In the meantime, I would like to know whether anybody else has 
> > > experienced the same problems.
> > >
> > > kind regards,
> > >
> > > Thomas
>
