Oh, also, the maximum number of Tika connections should be limited to the number of worker threads, so that you're not wasting memory on extra Tika instances (which can be expensive).
Karl

On Thu, Jan 18, 2018 at 10:52 AM, Karl Wright <[email protected]> wrote:

> Hmm, it might be worth asking this question on the Tika user list. We've
> not seen this kind of issue before with Tika transformation.
>
> Also, I think it's worth downloading MCF 2.9.1, which updates the Tika
> version to 1.17 from 1.16. There were issues in 2.9 with incompatibilities
> between our Tika version and the Apache POI version. This is now publicly
> available but the web site has not yet been updated, so modify the download
> URL to 2.9.1 from 2.9 to get the point release.
>
> Thanks,
> Karl
>
>
> On Thu, Jan 18, 2018 at 10:41 AM, Shashank Raj <[email protected]> wrote:
>
>> Hi Karl,
>> I changed the number of worker threads to 6, but the problem still
>> persists when I use ManifoldCF's Tika. With "null" as the output
>> connection there seems to be no problem. I also tried Solr without the
>> Tika transformation connection. That also works fine.
>> But as soon as I switch to ManifoldCF's Tika transformation connection, I
>> get the same error. I have tried increasing the heap size as well as
>> decreasing the number of workers.
>> Also, I've not selected "use extract update handler".
>>
>> Approx. size of directory to crawl: 200GB
>> In the future this size will be: 10TB
>> Size of largest file in this directory: 2GB
>>
>> NOTE: I am using Tomcat 8.0 to run ManifoldCF, connected to PostgreSQL 9.3,
>> with Solr 6.6.
>>
>> On 18-Jan-2018 6:21 PM, "Karl Wright" <[email protected]> wrote:
>>
>>> Hi Shashank,
>>>
>>> ManifoldCF's memory consumption is bounded but scales by the number of
>>> worker threads you allow. If you have 100 worker threads and each doc can
>>> consume 50mb then you need to have at least 5gb right there for Solr
>>> output. Tika is also quite expensive memory-wise so I'd allocate at least
>>> 10gb for ManifoldCF to support the pipeline you have set up.
>>>
>>> The best way to control memory, therefore, is probably to reduce the
>>> number of worker threads.
>>>
>>> (I assume you are using the combined war here, otherwise Tomcat would
>>> not be involved.)
>>>
>>> Karl
>>>
>>>
>>> On Thu, Jan 18, 2018 at 6:44 AM, Shashank Raj <[email protected]> wrote:
>>>
>>>> Hello Karl,
>>>> The "GC overhead" heap error occurs each time and Tomcat shuts down. The
>>>> heap allocated is 7GB (Xmx). Is there any other reason this issue is
>>>> coming up? I am using ManifoldCF's Tika.
>>>> I have unchecked "Use Update Extract" and set the max doc size to 50mb.
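The sizing rule quoted above (worker threads times the maximum per-document size gives a floor for the Solr output stage) can be checked with a few lines of Python. This is only a rough sketch of that arithmetic: the function name `solr_output_floor_mb` is made up for illustration, and it deliberately does not model Tika's additional memory use, which the thread describes only as "quite expensive".

```python
def solr_output_floor_mb(worker_threads: int, max_doc_mb: int) -> int:
    """Rough lower bound (in MB) on heap needed for Solr output alone.

    Assumes, as in the thread, that every worker thread may hold one
    document of the maximum allowed size at the same time. Tika's own
    memory use is NOT included here.
    """
    return worker_threads * max_doc_mb


# The example from the thread: 100 worker threads, 50 MB max document size.
print(solr_output_floor_mb(100, 50))  # 5000 MB, i.e. roughly 5 GB

# The reduced setup described later in the thread: 6 worker threads.
print(solr_output_floor_mb(6, 50))    # 300 MB
```

With 6 worker threads the Solr output floor is small compared with a 7GB heap, which fits the observation in the thread that the crawl runs fine without the Tika transformation connection.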
