Also check if all the files are successfully parsed by Tika.


*Steph van Schalkwyk*
Principal, Remcam Search Engines
+1.314.452.2896    [email protected]    http://remcam.net
Skype: svanschalkwyk
http://linkedin.com/in/vanschalkwyk

On Thu, Jan 18, 2018 at 9:55 AM, Karl Wright <[email protected]> wrote:

> Oh, also the maximum number of Tika connections should be limited to the
> number of threads to be sure you're not wasting memory on extra Tika
> instances (which might be expensive).
>
> Karl
>
>
> On Thu, Jan 18, 2018 at 10:52 AM, Karl Wright <[email protected]> wrote:
>
>> Hmm, it might be worth asking this question in the Tika user list.  We've
>> not seen this kind of issue before with Tika transformation.
>>
>> Also, I think it's worth downloading MCF 2.9.1, which updates the Tika
>> version to 1.17 from 1.16.  There were issues in 2.9 with incompatibilities
>> between our Tika version and the Apache POI version.  This is now publicly
>> available but the web site has not yet been updated, so modify the download
>> URL to 2.9.1 from 2.9 to get the point release.
>>
>> Thanks,
>> Karl
>>
>>
>> On Thu, Jan 18, 2018 at 10:41 AM, Shashank Raj <
>> [email protected]> wrote:
>>
>>> Hi Karl,
>>> I changed the number of worker threads to 6, but the problem still
>>> persists when I use ManifoldCF's Tika. With "null" as the output
>>> connection, there seems to be no problem. I also tried Solr without the
>>> Tika transformation connection, and that works fine too.
>>> But as soon as I switch to ManifoldCF's Tika transformation connection, I
>>> get the same error. I have tried increasing the heap size as well as
>>> decreasing the number of workers.
>>> Also, I have not selected "use extract update handler".
>>>
>>> Approximate size of directory to crawl: 200GB
>>> Expected future size: 10TB
>>> Size of largest file in this directory: 2GB
>>>
>>> NOTE: I am using Tomcat 8.0 to run ManifoldCF, connected to PostgreSQL 9.3,
>>> with Solr 6.6.
>>>
>>> On 18-Jan-2018 6:21 PM, "Karl Wright" <[email protected]> wrote:
>>>
>>>> Hi Shashank,
>>>>
>>>> ManifoldCF's memory consumption is bounded but scales by the number of
>>>> worker threads you allow.  If you have 100 worker threads and each doc can
>>>> consume 50mb then you need to have at least 5gb right there for Solr
>>>> output.  Tika is also quite expensive memory-wise so I'd allocate at least
>>>> 10gb for ManifoldCF to support the pipeline you have set up.
>>>>
>>>> The best way to control memory, therefore, is probably to reduce the
>>>> number of worker threads.
>>>>
>>>> (I assume you are using the combined war here, otherwise Tomcat would
>>>> not be involved.)
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Thu, Jan 18, 2018 at 6:44 AM, Shashank Raj <
>>>> [email protected]> wrote:
>>>>
>>>>> Hello Karl,
>>>>> A GC overhead heap error occurs each time and Tomcat shuts down. The
>>>>> heap allocated is 7GB (Xmx). Is there any other reason this issue could
>>>>> be coming up? I am using ManifoldCF's Tika.
>>>>> I have unchecked "Use Update Extract" and set the max doc size to 50MB.
>>>>>
>>>>>
>>>>>
>>>>
>>
>
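
The sizing rule Karl describes in the quoted thread (memory scales with the number of worker threads times the largest document each thread may hold) can be sketched as a quick back-of-the-envelope calculation. This is only an illustration under stated assumptions: the function name `estimate_heap_mb` and the `tika_overhead_mb` parameter are hypothetical and are not part of ManifoldCF, and the real footprint of Tika depends on the documents being parsed.

```python
def estimate_heap_mb(worker_threads, max_doc_mb, tika_overhead_mb=0):
    """Rough lower bound on heap (in MB) for the output stage.

    worker_threads   -- number of ManifoldCF worker threads configured
    max_doc_mb       -- maximum document size each thread may hold
    tika_overhead_mb -- hypothetical extra allowance for Tika; the thread
                        gives no formula for this, so it defaults to 0
    """
    return worker_threads * max_doc_mb + tika_overhead_mb


# Karl's example: 100 worker threads, 50 MB per document
print(estimate_heap_mb(100, 50))  # 5000 MB, i.e. about 5 GB

# Shashank's reduced setting: 6 worker threads, 50 MB per document
print(estimate_heap_mb(6, 50))    # 300 MB
```

With 6 threads the document buffers alone come to far less than the 7GB heap, which is consistent with the suggestion in the thread to look at Tika itself (parsing behaviour, version, and number of Tika connections) rather than at thread count alone.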
