Hi Karl,

I changed the number of worker threads to 6, but the problem still persists when I use ManifoldCF's Tika. With "null" as the output connection there seems to be no problem. I also tried Solr without the Tika transformation connection, and that works fine too. But as soon as I switch to ManifoldCF's Tika transformation connection, I get the same error. I have tried increasing the heap size as well as decreasing the number of workers. Also, I have not selected "use extract update handler".
Approximate size of the directory to crawl: 200 GB
In the future this size will be: 10 TB
Size of the largest file in this directory: 2 GB

NOTE: I am using Tomcat 8.0 to run ManifoldCF, connected to PostgreSQL 9.3, with Solr 6.6.

On 18-Jan-2018 6:21 PM, "Karl Wright" <[email protected]> wrote:

> Hi Shashank,
>
> ManifoldCF's memory consumption is bounded but scales by the number of
> worker threads you allow. If you have 100 worker threads and each doc can
> consume 50mb then you need to have at least 5gb right there for Solr
> output. Tika is also quite expensive memory-wise so I'd allocate at least
> 10gb for ManifoldCF to support the pipeline you have set up.
>
> The best way to control memory, therefore, is probably to reduce the
> number of worker threads.
>
> (I assume you are using the combined war here, otherwise Tomcat would not
> be involved.)
>
> Karl
>
>
> On Thu, Jan 18, 2018 at 6:44 AM, Shashank Raj <[email protected]>
> wrote:
>
>> Hello Karl,
>> GC Overhead heap error occurs each time and tomcat closes. Heap allocated
>> is 7Gb(Xmx). Is there any other reason this issue is coming up? I am using
>> ManifoldCF's tika.
>> I have Unchecked "Use Update Extract" and max doc size as 50mb.
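
For reference, the rule of thumb in Karl's reply (worker threads times maximum document size) can be written out as a quick calculation. This is only a rough sketch: the function name `solr_output_budget_bytes` is made up for illustration, decimal units are assumed, and it covers only the Solr output side of the estimate, not whatever Tika needs on top of it.

```python
MB = 10 ** 6  # decimal megabyte (assumption, to match the "5gb" figure in the thread)
GB = 10 ** 9  # decimal gigabyte


def solr_output_budget_bytes(worker_threads: int, max_doc_mb: int) -> int:
    """Worst case: every worker thread holds one maximum-size document at once."""
    return worker_threads * max_doc_mb * MB


# 100 threads is the example from the quoted reply; 6 is the current setting.
for threads in (100, 6):
    budget = solr_output_budget_bytes(threads, 50)
    print(f"{threads} worker threads x 50 MB = {budget / GB:.2f} GB")
```

With 6 worker threads and a 50 MB document limit this bound comes to about 0.3 GB, far below the 7 GB heap, so by this estimate alone the Solr output side would not seem to explain the error.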
