Also check if all the files are successfully parsed by Tika.
*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896
[email protected]
http://remcam.net <http://www.remcam.net/>
Skype: svanschalkwyk
<http://linkedin.com/in/vanschalkwyk>

On Thu, Jan 18, 2018 at 9:55 AM, Karl Wright <[email protected]> wrote:

> Oh, also the maximum number of Tika connections should be limited to the
> number of threads to be sure you're not wasting memory on extra Tika
> instances (which might be expensive).
>
> Karl
>
>
> On Thu, Jan 18, 2018 at 10:52 AM, Karl Wright <[email protected]> wrote:
>
>> Hmm, it might be worth asking this question in the Tika user list. We've
>> not seen this kind of issue before with Tika transformation.
>>
>> Also, I think it's worth downloading MCF 2.9.1, which updates the Tika
>> version to 1.17 from 1.16. There were issues in 2.9 with incompatibilities
>> between our Tika version and the Apache POI version. This is now publicly
>> available but the web site has not yet been updated, so modify the download
>> URL to 2.9.1 from 2.9 to get the point release.
>>
>> Thanks,
>> Karl
>>
>>
>> On Thu, Jan 18, 2018 at 10:41 AM, Shashank Raj <
>> [email protected]> wrote:
>>
>>> Hi Karl,
>>> I changed the number of worker threads to 6 but still the problem
>>> persists when I use ManifoldCF's Tika. When going with "null" as output
>>> connection, there seems no problem. Also tried with Solr without tika
>>> transformation connection. That also works fine.
>>> But as soon as I switch to Manifold's transformation connection Tika, I
>>> get the same error. I have tried increasing heap size as well as decreasing
>>> workers.
>>> Also I've not selected "use extract update handler".
>>>
>>> Approx size of directory to crawl: 200GB
>>> In the future this size will be: 10TB
>>> Size of largest file in this directory: 2GB
>>>
>>> NOTE: I am using Tomcat 8.0 to run manifold, connected to Postgresql 9.3
>>> with Solr 6.6.
>>>
>>> On 18-Jan-2018 6:21 PM, "Karl Wright" <[email protected]> wrote:
>>>
>>>> Hi Shashank,
>>>>
>>>> ManifoldCF's memory consumption is bounded but scales by the number of
>>>> worker threads you allow. If you have 100 worker threads and each doc can
>>>> consume 50mb then you need to have at least 5gb right there for Solr
>>>> output. Tika is also quite expensive memory-wise so I'd allocate at least
>>>> 10gb for ManifoldCF to support the pipeline you have set up.
>>>>
>>>> The best way to control memory, therefore, is probably to reduce the
>>>> number of worker threads.
>>>>
>>>> (I assume you are using the combined war here, otherwise Tomcat would
>>>> not be involved.)
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Jan 18, 2018 at 6:44 AM, Shashank Raj <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello Karl,
>>>>> GC Overhead heap error occurs each time and tomcat closes. Heap
>>>>> allocated is 7Gb(Xmx). Is there any other reason this issue is coming up?
>>>>> I am using ManifoldCF's tika.
>>>>> I have Unchecked "Use Update Extract" and max doc size as 50mb.
>>>>>
>>>>
>>
>
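
For reference, the sizing rule Karl gives in the quoted message (worker threads times the maximum document size) can be written out as a small calculation. This is only a rough sketch of that arithmetic: the function name is made up for illustration, it is not part of ManifoldCF, and it deliberately ignores Tika's own overhead, which the thread says is significant but does not quantify.

```python
def estimate_output_buffer_mb(worker_threads: int, max_doc_mb: int) -> int:
    """Worst-case memory (in MB) if every worker thread holds one
    maximum-size document at the same time.

    Hypothetical helper for illustration only; Tika's extra memory
    use is NOT included in this figure.
    """
    return worker_threads * max_doc_mb


# Karl's example: 100 worker threads, 50mb per document.
print(estimate_output_buffer_mb(100, 50))  # 5000 MB, i.e. the "at least 5gb"

# Shashank's reduced setup: 6 worker threads, same 50mb cap.
print(estimate_output_buffer_mb(6, 50))    # 300 MB
```

By this rule alone, 6 workers at 50mb come to about 300 MB, far below the 7Gb heap, so the document-buffer estimate by itself does not account for the GC overhead error.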
