Hi Karl, just wanted to let you know: after a "full analyze" (for all tables), the DB performance seems to be OK, and I can see the crawler is moving again! :) I'll add the ANALYZE command to the daily "vacuum" script :)
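
For the record, this is roughly what the daily script will run against each instance database from now on (just a sketch; the database name in the REINDEX line is a placeholder for my setup):

VACUUM FULL;                      -- existing step: reclaim the bloated space
ANALYZE;                          -- new step: refresh the planner statistics for all tables
REINDEX DATABASE mcf_instance_db; -- existing step: rebuild the indexes (placeholder name)

That should keep the jobqueue statistics fresh enough for the stuffer query.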
Thank you very much for your time; hopefully the lack of analyzing was the root cause of that particular problem.

Konstantin

2016-04-27 15:16 GMT+02:00 Karl Wright <[email protected]>:
> Ok, here's the query:
>
> SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount
> FROM jobqueue t0 WHERE t0.docpriority<? AND (t0.status=? OR t0.status=?)
> AND t0.checkaction=? AND t0.checktime<=? ... ORDER BY t0.docpriority ASC
> LIMIT 400
>
> Here's the index:
>
> "public";"jobqueue";"i1459943614200";"";"CREATE INDEX i1459943614200
> ON jobqueue USING btree (docpriority, status, checkaction, checktime)"
>
> In earlier versions of Postgresql it uses this index for both access and
> ordering (it reports "ordered index scan" in the plan or something like
> that). We rely on the ability to use this index for fast stuffer queries.
> Maybe postgresql 9.3 broke this in some way?
>
> Here's the documentation; maybe you can figure it out. :-P You can play
> around using "EXPLAIN" to figure out what it's trying to do. My guess is,
> as I said, that it just thinks it has a better way, which isn't in fact
> better and is in fact worse...
>
> http://www.postgresql.org/docs/9.2/static/indexes-ordering.html
>
> Karl
>
>
> On Wed, Apr 27, 2016 at 9:00 AM, jetnet <[email protected]> wrote:
>>
>> yes, I use the recommended db settings:
>> shared_buffers = 1024MB # min 128kB
>>
>> The indexes:
>> select * from pg_indexes where tablename = 'jobqueue';
>>
>> "public";"jobqueue";"jobqueue_pkey";"";"CREATE UNIQUE INDEX
>> jobqueue_pkey ON jobqueue USING btree (id)"
>> "public";"jobqueue";"i1459943614205";"";"CREATE INDEX i1459943614205
>> ON jobqueue USING btree (jobid, status)"
>> "public";"jobqueue";"i1459943614204";"";"CREATE INDEX i1459943614204
>> ON jobqueue USING btree (needpriority)"
>> "public";"jobqueue";"i1459943614203";"";"CREATE INDEX i1459943614203
>> ON jobqueue USING btree (isseed, jobid)"
>> "public";"jobqueue";"i1459943614202";"";"CREATE INDEX i1459943614202
>> ON jobqueue USING btree (failtime, jobid)"
>> "public";"jobqueue";"i1459943614201";"";"CREATE INDEX i1459943614201
>> ON jobqueue USING btree (status, checkaction, checktime)"
>> "public";"jobqueue";"i1459943614200";"";"CREATE INDEX i1459943614200
>> ON jobqueue USING btree (docpriority, status, checkaction, checktime)"
>> "public";"jobqueue";"i1459943614199";"";"CREATE UNIQUE INDEX
>> i1459943614199 ON jobqueue USING btree (dochash, jobid)"
>>
>>
>> Thanks!
>> Konstantin
>>
>> 2016-04-27 14:57 GMT+02:00 Karl Wright <[email protected]>:
>> > Also, have you looked at increasing the amount of shared buffers in your
>> > postgresql configuration? See the "how-to-build-and-deploy" page for
>> > that;
>> > it can make a lot of difference for MCF. So even if the plan is not
>> > perfect
>> > the database simply performs better.
>> >
>> > Karl
>> >
>> > On Wed, Apr 27, 2016 at 8:53 AM, Karl Wright <[email protected]> wrote:
>> >>
>> >> Hi Konstantin,
>> >>
>> >> Can you list the indexes defined on the jobqueue table?
>> >>
>> >> The query we do for stuffing is precisely aligned with one of the
>> >> indexes
>> >> we create on the jobqueue table. It should be doing an ordered index
>> >> read
>> >> with complex filtering conditions; instead it is doing this:
>> >>
>> >> WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >> -> Bitmap Heap Scan on jobqueue t0 (cost=2223.73..3352.95
>> >> rows=286 width=235) (actual time=46.705..93.152 rows=20858 loops=1)
>> >>
>> >> ... with a hash join and a hash sort.
>> >>
>> >> Postgres is being way too smart for its own good here. Probably that's
>> >> at
>> >> least partly because its time estimates for its chosen plan are far
>> >> less
>> >> than reality. There may be a way to configure postgresql not to make
>> >> that
>> >> mistake but I don't know offhand how you do it.
>> >>
>> >> In order for us to be able to control the plan via hints, which would
>> >> make
>> >> this problem go away once and for all, you need the version of Postgres
>> >> that
>> >> you pay for. Sigh.
>> >>
>> >> Anyhow, let's verify that the right indexes are in place, at least...
>> >>
>> >> Karl
>> >>
>> >>
>> >> On Wed, Apr 27, 2016 at 8:38 AM, jetnet <[email protected]> wrote:
>> >>>
>> >>> yet another long running query:
>> >>>
>> >>> WARN 2016-04-27 14:23:31,562 (Agents thread) - Found a long-running
>> >>> query (215465 ms): [UPDATE jobqueue SET
>> >>> docpriority=?,needpriorityprocessid=NULL,needpriority=? WHERE
>> >>> docpriority<?]
>> >>> WARN 2016-04-27 14:23:31,562 (Agents thread) - Parameter 0:
>> >>> '1.000000001E9'
>> >>> WARN 2016-04-27 14:23:31,562 (Agents thread) - Parameter 1: 'T'
>> >>> WARN 2016-04-27 14:23:31,562 (Agents thread) - Parameter 2:
>> >>> '1.000000001E9'
>> >>> WARN 2016-04-27 14:23:31,562 (Agents thread) - Plan: Update on
>> >>> jobqueue (cost=0.56..37802.01 rows=19647 width=352)
>> >>> WARN 2016-04-27 14:23:31,562 (Agents thread) - Plan: -> Index
>> >>> Scan using i1459943614200 on jobqueue (cost=0.56..37802.01 rows=19647
>> >>> width=352)
>> >>> WARN 2016-04-27 14:23:31,562 (Agents thread) - Plan: Index
>> >>> Cond: (docpriority < 1000000001::double precision)
>> >>> WARN 2016-04-27 14:23:31,562 (Agents thread) -
>> >>> WARN 2016-04-27 14:23:31,578 (Agents thread) - Stats: n_distinct=4.0
>> >>> most_common_vals={Z,C,G,P}
>> >>> most_common_freqs={0.64193332,0.21796666,0.13086666,0.0092333332}
>> >>>
>> >>>
>> >>> 2016-04-27 14:32 GMT+02:00 jetnet <[email protected]>:
>> >>> > Thank you for the fast reply!
>> >>> >
>> >>> > I provided the query to show, how many docs have been processed by
>> >>> > the
>> >>> > instances so far.
>> >>> > I do not use UI to check the jobs statues - it takes 10-15 minutes
>> >>> > to
>> >>> > display the page :( (ui.maxstatuscount=100.000).
>> >>> > The full vacuum and re-index database are running as a scheduled job
>> >>> > on a daily basis.
>> >>> > So, just tried to analyze the "problem" table - jobqueue, but I
>> >>> > don't
>> >>> > think it's helped a lot, the same "status" query took 250 second
>> >>> > now,
>> >>> > improved, but it's still very long.
>> >>> >
>> >>> > The long-running queries reported in the log are all about jobqueue
>> >>> > table, e.g.:
>> >>> >
>> >>> > WARN 2016-04-27 13:49:48,596 (Stuffer thread) - Found a
>> >>> > long-running
>> >>> > query (75328 ms): [SELECT
>> >>> >
>> >>> > t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount
>> >>> > FROM jobqueue t0 WHERE t0.docpriority<? AND (t0.status=? OR
>> >>> > t0.status=?) AND t0.checkaction=? AND t0.checktime<=? AND
>> >>> > EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status=? OR t1.status=?)
>> >>> > AND
>> >>> > t1.id=t0.jobid AND t1.priority=?) AND NOT EXISTS(SELECT 'x' FROM
>> >>> > jobqueue t2 WHERE t2.dochash=t0.dochash AND (t2.status=? OR
>> >>> > t2.status=? OR t2.status=? OR t2.status=? OR t2.status=? OR
>> >>> > t2.status=?) AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM
>> >>> > prereqevents t3,events t4 WHERE t0.id=t3.owner AND
>> >>> > t3.eventname=t4.name) ORDER BY t0.docpriority ASC LIMIT 400]
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 0:
>> >>> > '1.000000001E9'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 1: 'P'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 2: 'G'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 3: 'R'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 4:
>> >>> > '1461757713112'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 5: 'A'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 6: 'a'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 7: '5'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 8: 'A'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 9: 'F'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 10: 'a'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 11: 'f'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 12: 'D'
>> >>> > WARN 2016-04-27 13:49:48,628 (Stuffer thread) - Parameter 13: 'd'
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan: Limit
>> >>> > (cost=4016.80..4016.98 rows=72 width=235) (actual
>> >>> > time=738.775..738.842 rows=400 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan: -> Sort
>> >>> > (cost=4016.80..4016.98 rows=72 width=235) (actual
>> >>> > time=738.774..738.792 rows=400 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan: Sort
>> >>> > Key: t0.docpriority
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan: Sort
>> >>> > Method: top-N heapsort Memory: 139kB
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan: ->
>> >>> > Nested Loop Anti Join (cost=2252.06..4014.58 rows=72 width=235)
>> >>> > (actual time=46.841..723.735 rows=20215 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Hash Anti Join (cost=2251.51..3387.34 rows=72 width=235)
>> >>> > (actual
>> >>> > time=46.770..108.460 rows=20215 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Hash Cond: (t0.id = t3.owner)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Nested Loop (cost=2223.73..3358.66 rows=72 width=235)
>> >>> > (actual time=46.747..102.890 rows=20215 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Join Filter: (t0.jobid = t1.id)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Rows Removed by Join Filter: 643
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Seq Scan on jobs t1 (cost=0.00..2.14 rows=1 width=8)
>> >>> > (actual time=0.015..0.024 rows=1 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Filter: ((priority = 5::bigint) AND ((status =
>> >>> > 'A'::bpchar) OR (status = 'a'::bpchar)))
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Rows Removed by Filter: 3
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Bitmap Heap Scan on jobqueue t0
>> >>> > (cost=2223.73..3352.95
>> >>> > rows=286 width=235) (actual time=46.705..93.152 rows=20858 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Recheck Cond: (((docpriority < 1000000001::double
>> >>> > precision) AND (status = 'P'::bpchar) AND (checkaction =
>> >>> > 'R'::bpchar)
>> >>> > AND (checktime <= 1461757713112::bigint)) OR ((docpriority <
>> >>> > 1000000001::double precision) AND (status = 'G'::bpchar) AND
>> >>> > (checkaction = 'R'::bpchar) AND (checktime <=
>> >>> > 1461757713112::bigint)))
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> BitmapOr (cost=2223.73..2223.73 rows=288
>> >>> > width=0) (actual time=45.349..45.349 rows=0 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Bitmap Index Scan on i1459943614200
>> >>> > (cost=0.00..1111.80 rows=20 width=0) (actual time=29.325..29.325
>> >>> > rows=333 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Index Cond: ((docpriority <
>> >>> > 1000000001::double precision) AND (status = 'P'::bpchar) AND
>> >>> > (checkaction = 'R'::bpchar) AND (checktime <=
>> >>> > 1461757713112::bigint))
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Bitmap Index Scan on i1459943614200
>> >>> > (cost=0.00..1111.80 rows=268 width=0) (actual time=16.020..16.020
>> >>> > rows=20555 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Index Cond: ((docpriority <
>> >>> > 1000000001::double precision) AND (status = 'G'::bpchar) AND
>> >>> > (checkaction = 'R'::bpchar) AND (checktime <=
>> >>> > 1461757713112::bigint))
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Hash (cost=26.15..26.15 rows=130 width=8) (actual
>> >>> > time=0.004..0.004 rows=0 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Buckets: 1024 Batches: 1 Memory Usage: 0kB
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Hash Join (cost=12.93..26.15 rows=130 width=8)
>> >>> > (actual
>> >>> > time=0.003..0.003 rows=0 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Hash Cond: ((t3.eventname)::text = (t4.name)::text)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Seq Scan on prereqevents t3 (cost=0.00..11.40
>> >>> > rows=140 width=524) (actual time=0.001..0.001 rows=0 loops=1)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Hash (cost=11.30..11.30 rows=130 width=516)
>> >>> > (never executed)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Seq Scan on events t4 (cost=0.00..11.30
>> >>> > rows=130 width=516) (never executed)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > -> Index Scan using i1459943614199 on jobqueue t2 (cost=0.56..8.59
>> >>> > rows=1 width=49) (actual time=0.030..0.030 rows=0 loops=20215)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Index Cond: ((dochash)::text = (t0.dochash)::text)
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Filter: ((jobid <> t0.jobid) AND ((status = 'A'::bpchar) OR
>> >>> > (status = 'F'::bpchar) OR (status = 'a'::bpchar) OR (status =
>> >>> > 'f'::bpchar) OR (status = 'D'::bpchar) OR (status = 'd'::bpchar)))
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan:
>> >>> > Rows Removed by Filter: 1
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) - Plan: Total
>> >>> > runtime:
>> >>> > 747.031 ms
>> >>> > WARN 2016-04-27 13:49:49,440 (Stuffer thread) -
>> >>> > WARN 2016-04-27 13:49:50,206 (Stuffer thread) - Stats:
>> >>> > n_distinct=5.0 most_common_vals={Z,C,G,P}
>> >>> > most_common_freqs={0.64013332,0.21616666,0.13356666,0.0101}
>> >>> >
>> >>> > 2016-04-27 13:43 GMT+02:00 Karl Wright <[email protected]>:
>> >>> >> Hi Konstantin,
>> >>> >>
>> >>> >> The query you are looking at is performed by the UI only, and there
>> >>> >> is
>> >>> >> a
>> >>> >> parameter you can set which applies a limit to the number of
>> >>> >> documents
>> >>> >> so
>> >>> >> that the count is reported as "<limit>+" in the UI. This is the
>> >>> >> parameter:
>> >>> >>
>> >>> >> org.apache.manifoldcf.ui.maxstatuscount
>> >>> >>
>> >>> >> As for why the database gets slow for crawling, unless you are
>> >>> >> seeing
>> >>> >> reports in the log of long-running queries, then it's a good chance
>> >>> >> you need
>> >>> >> to vacuum your database instance. I generally recommend that a
>> >>> >> vacuum
>> >>> >> full
>> >>> >> be done periodically for database instances. Autovacuuming has
>> >>> >> gotten
>> >>> >> a lot
>> >>> >> better in postgres than it used to be but at least in the past the
>> >>> >> autovacuuming process would get far behind ManifoldCF and so the
>> >>> >> database
>> >>> >> would get quite bloated anyway. So I'd give that a try.
>> >>> >>
>> >>> >> If you are seeing logging output mentioning slow queries, you may
>> >>> >> need
>> >>> >> to
>> >>> >> tune how often MCF analyzes certain tables. There are parameters
>> >>> >> that
>> >>> >> control that as well. In general, if there is a slow query with a
>> >>> >> bad
>> >>> >> plan,
>> >>> >> and analyzing the tables involved makes it come up with a much
>> >>> >> better
>> >>> >> plan,
>> >>> >> analysis is not happening often enough. But first, before you get
>> >>> >> to
>> >>> >> that
>> >>> >> point, have a look at the log and see whether this is likely to be
>> >>> >> the
>> >>> >> problem. (Usually it is the stuffer query that gets slow when
>> >>> >> there's
>> >>> >> an
>> >>> >> issue with table analysis, FWIW). Please feel free to post the
>> >>> >> plan
>> >>> >> of the
>> >>> >> queries being reported here.
>> >>> >>
>> >>> >> Thanks,
>> >>> >> Karl
>> >>> >>
>> >>> >>
>> >>> >> On Wed, Apr 27, 2016 at 7:33 AM, jetnet <[email protected]> wrote:
>> >>> >>>
>> >>> >>> Hi Karl,
>> >>> >>>
>> >>> >>> I set up two MCF instances (quick setup) on the same machine,
>> >>> >>> using
>> >>> >>> the same Postgres 9.3 instance (with different databases
>> >>> >>> "org.apache.manifoldcf.database.name" of course).
>> >>> >>> After a couple of days I've got a performance issue: one MCF
>> >>> >>> instance
>> >>> >>> has become very slow - it processes a few docs per hour only. I
>> >>> >>> guess, the bottleneck is the database:
>> >>> >>>
>> >>> >>> "normal" instance:
>> >>> >>> SELECT status, count(*) AS count FROM jobqueue GROUP BY status --
>> >>> >>> 738.311 rows in the table, took 1,2 sec
>> >>> >>> "G";50674
>> >>> >>> "F";68
>> >>> >>> "P";149179
>> >>> >>> "C";402367
>> >>> >>> "A";33
>> >>> >>> "Z";136676
>> >>> >>>
>> >>> >>> "slow" instance (currently with a single active job):
>> >>> >>> SELECT status, count(*) AS count FROM jobqueue GROUP BY status --
>> >>> >>> 2.745.329 rows in the table, took 350 sec
>> >>> >>> "G";337922 --STATUS_PENDINGPURGATORY
>> >>> >>> "F";449 --STATUS_ACTIVEPURGATORY
>> >>> >>> "P";25909 --STATUS_PENDING
>> >>> >>> "C";562772 --STATUS_COMPLETE
>> >>> >>> "A";9 --STATUS_ACTIVE
>> >>> >>> "Z";1644927 --STATUS_PURGATORY
>> >>> >>>
>> >>> >>> Since "count(*)" is terrible slow in Postgres, I used the
>> >>> >>> following
>> >>> >>> sql to count jobqueue's rows:
>> >>> >>> SELECT reltuples::bigint AS approximate_row_count FROM pg_class
>> >>> >>> WHERE
>> >>> >>> relname = 'jobqueue';
>> >>> >>>
>> >>> >>> Both MCF instances have the same number of working threads,
>> >>> >>> database
>> >>> >>> handles etc.
>> >>> >>> Is the database "full"? What could you recommend to improve the
>> >>> >>> performance?
>> >>> >>>
>> >>> >>> Thank you!
>> >>> >>> Konstantin
>> >>> >>
>> >>> >>
>> >>
>> >>
>> >
>
>
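
P.S. Regarding the EXPLAIN suggestion: once the nightly ANALYZE is in place I'll double-check the stuffer plan myself. A minimal sketch of the check I have in mind (this is just the stuffer query stripped down to its indexed conditions, with the literal values taken from the logged parameters above):

EXPLAIN ANALYZE
SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount
FROM jobqueue t0
WHERE t0.docpriority < 1000000001
  AND (t0.status = 'P' OR t0.status = 'G')
  AND t0.checkaction = 'R'
  AND t0.checktime <= 1461757713112
ORDER BY t0.docpriority ASC
LIMIT 400;

If the plan shows an index scan on i1459943614200 with no separate Sort node, I'll assume the missing statistics were indeed the problem; if the bitmap scan plus sort comes back, I'll follow up here.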
