FWIW, you can diagnose a slow stuffer query by getting a thread dump. If there are tons of idle worker threads AND your stuffer thread is waiting on PostgreSQL, that's a good sign the stuffer is not keeping up for database reasons.
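As a rough illustration of that check, here is a minimal sketch that summarizes a thread dump saved as text (for example from `jstack <pid>`). The class and method names are made up for this example, and the thread-name prefixes ("Worker thread", "Stuffer thread") are assumptions about how the threads are named in your dump, so adjust them to match what you actually see.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of ManifoldCF: summarizes a jstack-style text dump.
final class ThreadDumpSummary {

    private static final String STATE_MARKER = "java.lang.Thread.State:";

    /** Counts thread states for all threads whose name starts with namePrefix. */
    static Map<String, Integer> countStates(String dump, String namePrefix) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean inMatch = false;
        for (String line : dump.split("\\R")) {
            if (line.startsWith("\"")) {
                // A line starting with a quote is the header of a new thread entry.
                inMatch = line.startsWith("\"" + namePrefix);
            } else if (inMatch && line.trim().startsWith(STATE_MARKER)) {
                String rest = line.trim().substring(STATE_MARKER.length()).trim();
                // Keep only the state keyword, e.g. WAITING from "WAITING (on object monitor)".
                counts.merge(rest.split("\\s+")[0], 1, Integer::sum);
                inMatch = false;
            }
        }
        return counts;
    }

    /** Returns true if any matching thread has a stack line containing needle. */
    static boolean anyStackContains(String dump, String namePrefix, String needle) {
        boolean inMatch = false;
        for (String line : dump.split("\\R")) {
            if (line.startsWith("\"")) {
                inMatch = line.startsWith("\"" + namePrefix);
            } else if (inMatch && line.contains(needle)) {
                return true;
            }
        }
        return false;
    }
}
```

If `countStates(dump, "Worker thread")` shows mostly WAITING threads while `anyStackContains(dump, "Stuffer thread", "org.postgresql")` is true, that matches the pattern described above: the stuffer is sitting inside the PostgreSQL JDBC driver while the workers have nothing to do.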
Karl

On Fri, Dec 12, 2014 at 7:23 AM, Karl Wright <[email protected]> wrote:
>
> Hi Aeham,
>
> Before you assume that stuffing is just not happening fast enough, you
> will want to confirm that you have enough documents that are *eligible* for
> processing. In a continuous job, documents may well be scheduled to be
> crawled at some time in the future, and are ineligible for crawling until
> that future time arrives. You can get a better sense of this by using the
> document and queue status reports.
>
> If you only have 30 worker threads on your machine, it's extremely
> unlikely that you would find yourself unable to stuff documents fast enough
> with the default parameters. The only way that would not be true is if
> your stuffer queries are performing badly, and that would be important to
> know too.
>
> Thanks,
> Karl
>
> On Fri, Dec 12, 2014 at 7:11 AM, Aeham Abushwashi <
> [email protected]> wrote:
>>
>> Hi,
>>
>> Are there any gotchas one should be aware of when configuring property
>> "org.apache.manifoldcf.crawler.stuffamountfactor"?
>>
>> At times, I see the manifold nodes in my cluster (and the postgresql box)
>> not utilising all the resources they have. I have configured 30 worker
>> threads which tend to sit idle waiting for documents (continuous crawl).
>> This led me to tweak the batch size of the Stuffer thread indirectly using
>> "org.apache.manifoldcf.crawler.stuffamountfactor" and setting it to 20 (I
>> believe the default is 2).
>>
>> I understand that increasing the batch size results in a bigger result
>> set coming back from the database. If the size is in the 1000s I doubt it
>> would cause problems. My hope is a bigger stuffer batch would allow worker
>> threads to operate more efficiently and handle more documents where
>> possible.
>>
>> Please let me know if there are any particular concerns/guidelines over
>> tweaking this config property or if there are better ways for increasing
>> the width of the processing pipeline for each manifold instance.
>>
>> Thanks,
>> Aeham
>>
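
For reference, the setting discussed in the original question would look roughly like this. This is only a sketch: it assumes the property is set in ManifoldCF's properties.xml using the usual name/value element form, and the value 20 is simply the one mentioned in the question, not a recommendation.

```xml
<!-- Assumed location: properties.xml. Value taken from the question above. -->
<property name="org.apache.manifoldcf.crawler.stuffamountfactor" value="20"/>
```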
