Hi Aeham,

Before you assume that stuffing is just not happening fast enough, you
will want to confirm that you have enough documents that are *eligible*
for processing. In a continuous job, documents may well be scheduled to
be crawled at some time in the future, and they are ineligible for
crawling until that time arrives. You can get a better sense of this by
using the document and queue status reports.
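If you would rather sanity-check this directly against the database than
go through the UI reports, a query along these lines gives a similar
breakdown. This is only a sketch against the ManifoldCF PostgreSQL
schema: it assumes the jobqueue table stores the earliest allowed crawl
time in a checktime column as epoch milliseconds, and the exact column
names and status codes should be verified against your version.

    -- Sketch: queued documents eligible now vs. scheduled for later.
    -- Assumes the ManifoldCF PostgreSQL schema (jobqueue table,
    -- checktime in epoch milliseconds); verify against your version.
    SELECT status,
           SUM(CASE WHEN checktime IS NULL
                      OR checktime <= (EXTRACT(EPOCH FROM NOW()) * 1000)::bigint
                    THEN 1 ELSE 0 END) AS eligible_now,
           SUM(CASE WHEN checktime > (EXTRACT(EPOCH FROM NOW()) * 1000)::bigint
                    THEN 1 ELSE 0 END) AS scheduled_later
      FROM jobqueue
     GROUP BY status;

Rows with a checktime in the future are exactly the ones the stuffer
will skip over until that time arrives.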
If you only have 30 worker threads on your machine, it's extremely
unlikely that you would find yourself unable to stuff documents fast
enough with the default parameters. The only way that would not be true
is if your stuffer queries were performing badly, and that would be
important to know too. Either way, the knobs involved are collected in
the properties.xml sketch below the quoted message.

Thanks,
Karl

On Fri, Dec 12, 2014 at 7:11 AM, Aeham Abushwashi
<[email protected]> wrote:

>
> Hi,
>
> Are there any gotchas one should be aware of when configuring property
> "org.apache.manifoldcf.crawler.stuffamountfactor"?
>
> At times, I see the manifold nodes in my cluster (and the postgresql box)
> not utilising all the resources they have. I have configured 30 worker
> threads which tend to sit idle waiting for documents (continuous crawl).
> This led me to tweak the batch size of the Stuffer thread indirectly using
> "org.apache.manifoldcf.crawler.stuffamountfactor" and setting it to 20 (I
> believe the default is 2).
>
> I understand that increasing the batch size results in a bigger result set
> coming back from the database. If the size is in the 1000s I doubt it would
> cause problems. My hope is a bigger stuffer batch would allow worker
> threads to operate more efficiently and handle more documents where
> possible.
>
> Please let me know if there are any particular concerns/guidelines over
> tweaking this config property or if there are better ways for increasing
> the width of the processing pipeline for each manifold instance.
>
> Thanks,
> Aeham
>
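For reference, here is how the knobs discussed above look in the agents
process's properties.xml, as a minimal sketch. The
org.apache.manifoldcf.crawler.threads name is from the ManifoldCF
configuration documentation; the idea that the stuffer batch works out
to roughly worker threads x stuffamountfactor matches my reading of
this thread but is worth confirming for your version.

    <?xml version="1.0" encoding="UTF-8" ?>
    <configuration>
      <!-- Number of worker threads; 30 is the count from this thread -->
      <property name="org.apache.manifoldcf.crawler.threads" value="30"/>
      <!-- Stuffer batch sizing factor; 2 is the reported default, and
           the thread above describes raising it to 20. Assumed batch
           relationship: worker threads x factor per stuffing pass. -->
      <property name="org.apache.manifoldcf.crawler.stuffamountfactor" value="2"/>
    </configuration>

If that relationship holds, 30 worker threads at the default factor of
2 already means batches of around 60 documents per stuffing pass, which
is consistent with the point above that the defaults are rarely the
bottleneck at this thread count.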
