I have a performance problem; I'd like any suggestions on where to continue investigation.
A set of insert-only processes seems to serialize itself. :-(

The processes appear to be blocked on disk IO, and probably the table drive, rather than the pg_xlog drive.

Each process is inserting a block of 10K rows into a table. I'm guessing they are "serialized" because one process by itself takes 15-20 secs; running ten processes in parallel averages 100-150 secs (each), with elapsed (wall) time of 150-200 secs.

Polling pg_locks shows each process has (been granted) only the locks you would expect. I RARELY see an Exclusive lock on an index, and then only on one index at a time. A sample from pg_locks:

    TABLE/INDEX   GRANTED  PID   MODE
    m_reason      t        7340  AccessShare
    message       t        7340  AccessShare
    message       t        7340  RowExclusive
    pk_message    t        7340  AccessShare
    tmp_message   t        7340  AccessShare

("m_reason" is a one-row lookup table; see INSERT cmd below.)

--------------------------
The query plan is quite reasonable (see below).

On a side note, this is the first app I've had to deal with that is sweet to pg_xlog, but hammers the drive bearing the base table (3x the traffic).

"log_executor_stats" for a sample insert looks reasonable (except the "elapsed"!):

! system usage stats:
!       308.591728 elapsed 3.480000 user 1.270000 system sec
!       [4.000000 user 1.390000 sys total]
!       0/0 [0/0] filesystem blocks in/out
!       18212/15 [19002/418] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
!       0/0 [0/0] voluntary/involuntary context switches
! buffer usage stats:
!       Shared blocks:  9675 read,  8781 written, buffer hit rate = 97.66%
!       Local blocks:    504 read,    64 written, buffer hit rate = 0.00%
!       Direct blocks:     0 read,     0 written

Summarized "ps" output for the above backend process, sampled every 5 secs, shows it is 94% in the 'D' state and 3% in the 'S' state.

================
== BACKGROUND ==
================

**SOFTWARE
PG 7.4.6, RedHat 8.

----------------------------------
**HARDWARE
Xeon 2x2 2.4GHz
2GB RAM
4 x 73GB SCSI; pg_xlog and base on separate drives.
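(For reference, the pg_locks sample above was gathered with roughly this query -- a sketch, not the exact script; the join to pg_class to resolve relation OIDs is mine:)

```sql
-- Sketch: resolve pg_locks relation OIDs to names (works on 7.4).
SELECT c.relname AS "TABLE/INDEX",
       l.granted AS "GRANTED",
       l.pid     AS "PID",
       l.mode    AS "MODE"
FROM   pg_locks l
JOIN   pg_class c ON c.oid = l.relation
WHERE  l.relation IS NOT NULL
ORDER BY l.pid, c.relname;
```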
----------------------------------
**APPLICATION

Six machines post batches of 10K messages to the PG db server.
Machine #nn generates its ID keys as "nn00000000001"::bigint etc.

Each process runs:

- "COPY tmp_message FROM STDIN" loads its own one-use TEMP table.
- " INSERT INTO message
    SELECT tmp.*
    FROM   tmp_message AS tmp
    JOIN   m_reason ON m_reason.name = tmp.reason
    LEFT JOIN message USING (ID)
    WHERE  message.ID IS NULL "
  (check required because crash recovery logic requires idempotent insert)
- "DROP TABLE tmp_message" --- call me paranoid, this is 7.4

The COPY step time is almost constant when #processes varies from 1 to 10.

----------------------------------
**POSTGRES

pg_autovacuum is running with default parameters.

Non-default GUC values:

    checkpoint_segments        = 512
    default_statistics_target  = 200
    effective_cache_size       = 500000
    log_min_duration_statement = 1000
    max_fsm_pages              = 1000000
    max_fsm_relations          = 1000
    random_page_cost           = 1
    shared_buffers             = 10000
    sort_mem                   = 16384
    stats_block_level          = true
    stats_command_string       = true
    stats_row_level            = true
    vacuum_mem                 = 65536
    wal_buffers                = 2000

wal_buffers and checkpoint_segments look outrageous, but were tuned for another process, which posts batches of 10000 6KB rows in a single insert.

----------------------------------
TABLE/INDEX STATISTICS

----------------------------------
MACHINE STATISTICS

ps gives the backend process as >98% in 'D' state, with <1% CPU. A "top" snapshot:

CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total    2.0%    0.0%    0.8%   0.0%     0.0%   96.9%    0.0%
           cpu00    2.5%    0.0%    1.9%   0.0%     0.0%   95.4%    0.0%
           cpu01    1.7%    0.0%    0.1%   0.0%     0.3%   97.6%    0.0%
           cpu02    0.5%    0.0%    0.7%   0.0%     0.0%   98.6%    0.0%
           cpu03    3.1%    0.0%    0.5%   0.0%     0.0%   96.2%    0.0%
Mem: 2061552k av, 2041752k used, 19800k free, 0k shrd, 21020k buff

iostat reports that the $PGDATA/base drive is being worked but not overworked.
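(To complement iostat's per-drive view: since stats_block_level is on, per-table block I/O is visible in the statistics views. Roughly -- a sketch, run between batches:)

```sql
-- Sketch: per-table heap/index block reads vs. buffer hits (7.4 view).
SELECT relname,
       heap_blks_read, heap_blks_hit,
       idx_blks_read,  idx_blks_hit
FROM   pg_statio_user_tables
WHERE  relname = 'message';
```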
The pg_xlog drive is underworked:

            KBPS   TPS    KBPS   TPS    KBPS   TPS    KBPS   TPS
12:30          1     2     763    16      31     8    3336   269
12:40          5     3    1151    22       5     5    2705   320
            ^pg_xlog^                  ^base^

The base drive has run as much as 10MBPS, 5K TPS.

----------------------------------
EXPLAIN ANALYZE output

The plan is eminently reasonable. But there's no visible relationship between the top-level "actual time" and the "total runtime":

 Nested Loop Left Join  (cost=0.00..31109.64 rows=9980 width=351) (actual time=0.289..2357.346 rows=9980 loops=1)
   Filter: ("inner".id IS NULL)
   ->  Nested Loop  (cost=0.00..735.56 rows=9980 width=351) (actual time=0.092..1917.677 rows=9980 loops=1)
         Join Filter: (("outer".name)::text = ("inner".reason)::text)
         ->  Seq Scan on m_reason r  (cost=0.00..1.01 rows=1 width=12) (actual time=0.008..0.050 rows=1 loops=1)
         ->  Seq Scan on tmp_message t  (cost=0.00..609.80 rows=9980 width=355) (actual time=0.067..1756.617 rows=9980 loops=1)
   ->  Index Scan using pk_message on message  (cost=0.00..3.02 rows=1 width=8) (actual time=0.014..0.014 rows=0 loops=9980)
         Index Cond: ("outer".id = message.id)
 Total runtime: 737401.687 ms

-- 
"Dreams come true, not free." -- S.Sondheim, ITW

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend