Hi Aeham,

Before you assume that stuffing is just not happening fast enough, you
will want to confirm that you have enough documents that are *eligible*
for processing. In a continuous job, documents may well be scheduled to
be crawled at some time in the future, and they are ineligible for
crawling until that time arrives. You can get a better sense of this by
using the document and queue status reports.
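If you would rather sanity-check this directly against the database than
go through the UI reports, a query along these lines gives a similar
breakdown. This is only a sketch against the ManifoldCF PostgreSQL
schema: it assumes the jobqueue table stores the earliest allowed crawl
time in a checktime column as epoch milliseconds, and the exact column
names and status codes should be verified against your version.

    -- Sketch: queued documents eligible now vs. scheduled for later.
    -- Assumes the ManifoldCF PostgreSQL schema (jobqueue table,
    -- checktime in epoch milliseconds); verify against your version.
    SELECT status,
           SUM(CASE WHEN checktime IS NULL
                      OR checktime <= (EXTRACT(EPOCH FROM NOW()) * 1000)::bigint
                    THEN 1 ELSE 0 END) AS eligible_now,
           SUM(CASE WHEN checktime > (EXTRACT(EPOCH FROM NOW()) * 1000)::bigint
                    THEN 1 ELSE 0 END) AS scheduled_later
      FROM jobqueue
     GROUP BY status;

Rows with a checktime in the future are exactly the ones the stuffer
will skip over until that time arrives.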
If you only have 30 worker threads on your machine, it's extremely
unlikely that you would find yourself unable to stuff documents fast
enough with the default parameters. The only way that would not be true
is if your stuffer queries were performing badly, and that would be
important to know too. Either way, the knobs involved are collected in
the properties.xml sketch below the quoted message.

Thanks,
Karl

On Fri, Dec 12, 2014 at 7:11 AM, Aeham Abushwashi
<[email protected]> wrote:

>
> Hi,
>
> Are there any gotchas one should be aware of when configuring property
> "org.apache.manifoldcf.crawler.stuffamountfactor"?
>
> At times, I see the manifold nodes in my cluster (and the postgresql box)
> not utilising all the resources they have. I have configured 30 worker
> threads which tend to sit idle waiting for documents (continuous crawl).
> This led me to tweak the batch size of the Stuffer thread indirectly using
> "org.apache.manifoldcf.crawler.stuffamountfactor" and setting it to 20 (I
> believe the default is 2).
>
> I understand that increasing the batch size results in a bigger result set
> coming back from the database. If the size is in the 1000s I doubt it would
> cause problems. My hope is a bigger stuffer batch would allow worker
> threads to operate more efficiently and handle more documents where
> possible.
>
> Please let me know if there are any particular concerns/guidelines over
> tweaking this config property or if there are better ways for increasing
> the width of the processing pipeline for each manifold instance.
>
> Thanks,
> Aeham
>
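For reference, here is how the knobs discussed above look in the agents
process's properties.xml, as a minimal sketch. The
org.apache.manifoldcf.crawler.threads name is from the ManifoldCF
configuration documentation; the idea that the stuffer batch works out
to roughly worker threads x stuffamountfactor matches my reading of
this thread but is worth confirming for your version.

    <?xml version="1.0" encoding="UTF-8" ?>
    <configuration>
      <!-- Number of worker threads; 30 is the count from this thread -->
      <property name="org.apache.manifoldcf.crawler.threads" value="30"/>
      <!-- Stuffer batch sizing factor; 2 is the reported default, and
           the thread above describes raising it to 20. Assumed batch
           relationship: worker threads x factor per stuffing pass. -->
      <property name="org.apache.manifoldcf.crawler.stuffamountfactor" value="2"/>
    </configuration>

If that relationship holds, 30 worker threads at the default factor of
2 already means batches of around 60 documents per stuffing pass, which
is consistent with the point above that the defaults are rarely the
bottleneck at this thread count.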
