Hi Aeham,

Given that your stuffer thread has to wait for multiple other machines to finish stuffing before it runs, it may make sense to increase the amount stuffed at one time. Unfortunately, the stuffer lock has to remain, because otherwise the same document could be stuffed twice. Using a database transaction is unworkable in this context because of the tendency to deadlock.

Thanks,
Karl
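To illustrate the batching point, here is a minimal sketch; this is not ManifoldCF's actual stuffer code, and the lock, fetch function, and batch size are stand-ins. With a single shared lock and N nodes whose stuffing passes are dominated by work done inside the lock, each node can hold the lock only roughly 1/N of the time (consistent with the ~2/3 blocked figure reported below for a 3-node cluster), so fetching more documents per lock acquisition amortizes that fixed per-pass cost over more useful work.

import threading
import time

# Stand-in for the cross-cluster stuffer lock; the real lock in ManifoldCF is
# cluster-wide, not a simple in-process threading.Lock.
stuffer_lock = threading.Lock()

def stuff_once(fetch_eligible_docs, batch_size):
    # The query and connection handling happen inside the lock, so the
    # per-pass overhead (lock wait + query time) is roughly fixed; raising
    # batch_size spreads that overhead over more documents.
    with stuffer_lock:
        docs = fetch_eligible_docs(batch_size)  # the serialized part
    # Handing the fetched documents to worker threads can proceed outside
    # the lock.
    return docs

if __name__ == "__main__":
    # Hypothetical fetch function: pretend the eligibility query takes about
    # two seconds, roughly what was reported for the real stuffer query.
    def fake_fetch(n):
        time.sleep(2)
        return ["doc-%d" % i for i in range(n)]

    print(len(stuff_once(fake_fetch, 200)), "documents fetched in one pass")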
On Fri, Dec 12, 2014 at 12:46 PM, Aeham Abushwashi <[email protected]> wrote:
>
> Thanks Karl.
>
> The stuffer thread query isn't doing too badly. Judging by stats from the
> pg_stat_activity table in postgresql, the stuffer query usually takes < 2
> seconds to return.
>
>> In a continuous job, documents may well be scheduled to be crawled at
>> some time in the future, and are ineligible for crawling until that future
>> time arrives.
>
> Such documents would be excluded by the stuffer query, right?
>
> Thanks for the pointer to the queue status page. Using the root server
> name as an identifier class, I get the bulk of documents grouped under the
> "About to Process" and "Waiting for Processing" categories. For example, I
> have a job with 677,856 and 102,342 docs respectively. Another job has
> 320,804 and 443,596 docs respectively. All other status categories have 0
> docs.
>
>> If there are tons of idle worker threads AND your stuffer thread is
>> waiting on Postgresql, that's a good sign it is not keeping up due to
>> database reasons.
>
> Interestingly, the stuffer thread spends the majority of its time trying
> to acquire the stuffer lock. I have 3 nodes in the cluster and each node's
> stuffer thread spends ~2/3 of its time blocked waiting for the lock. Of
> course the SQL query itself and connection grabbing/releasing all happen
> within the scope of the lock. The effect is that the more nodes there are
> in the cluster, the less time each node has for stuffing documents.
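For anyone who wants to repeat the pg_stat_activity check mentioned above, a small sketch follows. The connection string is a placeholder, and it assumes PostgreSQL 9.2 or later, where the running statement is exposed in the query column and its start time in query_start.

import psycopg2

# Placeholder connection string -- adjust to your own database settings.
conn = psycopg2.connect("dbname=manifoldcf user=manifoldcf host=localhost")
try:
    with conn.cursor() as cur:
        # pg_stat_activity (PostgreSQL 9.2+) shows the running statement in
        # "query" and when it started in "query_start".
        cur.execute("""
            SELECT pid, state, now() - query_start AS running_for, query
              FROM pg_stat_activity
             WHERE state = 'active'
             ORDER BY running_for DESC
        """)
        for pid, state, running_for, query in cur.fetchall():
            print(pid, state, running_for, query.replace("\n", " ")[:100])
finally:
    conn.close()

The same information can of course be read with a one-off SELECT in psql; the script just makes it easy to poll repeatedly while the stuffer is running.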
