We have a PostgreSQL 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
that appears to have been stuck at COMMIT for two days now:
mydb=# SELECT procpid, waiting, current_query,
CURRENT_TIMESTAMP - query_start AS query_elapsed,
CURRENT_TIMESTAMP - xact_start AS xact_elapsed
FROM pg_stat_activity WHERE procpid != pg_backend_pid() AND
current_query != '';
-[ RECORD 1 ]-+---
procpid | 6061
waiting | f
current_query | COMMIT;
query_elapsed | 2 days 08:59:17.619142
xact_elapsed | 3 days 15:48:10.739912
The transaction behind that COMMIT has been the only thing running on
this Postgres instance for the past three days or so, since Postgres was
started on that machine. I spun up the EC2 instance for this database
solely to test a database-subsetting process, which is what the
transaction was doing before it got stuck at COMMIT: a bunch of
DELETEs and ALTER TABLE ... DROP|ADD CONSTRAINT statements that delete
90% or so of our data, so that we can pg_dump a slimmed-down
development copy.
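In case the shape of it matters, the transaction looked roughly like the
following sketch (table and constraint names here are invented for
illustration; the real script touches many more tables):

$ psql mydb <<'SQL'
BEGIN;
-- Drop FKs up front so the bulk DELETEs don't pay per-row RI checks
ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
-- Delete the ~90% of rows we don't want in the development copy
DELETE FROM orders WHERE customer_id NOT IN (SELECT id FROM customers_to_keep);
DELETE FROM customers WHERE id NOT IN (SELECT id FROM customers_to_keep);
-- Re-add the constraints before committing
ALTER TABLE orders ADD CONSTRAINT orders_customer_id_fkey
    FOREIGN KEY (customer_id) REFERENCES customers (id);
COMMIT;
SQL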
The EC2 instances we use have separate EBS-backed volumes for the
PostgreSQL data and WAL directories. The backend in question appears to
be stuck reading a huge amount of data from the data partition: the
monitoring for those EBS volumes shows a constant aggregate read rate
of about 90 MB/sec since the COMMIT started. Write bandwidth to the
postgresql-data partition has been almost nil since the COMMIT, and
there has been no read or write activity on the WAL volumes.
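For what it's worth, the host-side view agrees with the EBS monitoring;
watching the md devices with sysstat's iostat (something along these
lines) shows the same ~90 MB/sec of reads and essentially no writes:

$ iostat -d -m 5 md0 md1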
Here we can see that the backend has managed to read 22 TB, even though
the entire database is only 228 GB on disk:
$ sudo cat /proc/6061/io
rchar: 24505414843923
wchar: 23516159014
syscr: 2991395854
syscw: 2874613
read_bytes: 24791719338496
write_bytes: 22417580032
cancelled_write_bytes: 221208576
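As a rough sanity check on those numbers (assuming binary units
throughout):

$ echo $(( 24791719338496 / (1024 * 1024 * 1024 * 1024) ))  # TiB read so far
22
$ echo $(( 24791719338496 / (228 * 1024 * 1024 * 1024) ))   # full passes over a 228 GiB database
101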
$ df -h /dev/md0 /dev/md1
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        480G  228G  253G  48% /mnt/ebs/postgresql-data
/dev/md1         32G   20G   13G  61% /mnt/ebs/postgresql-wal
Running strace on the backend shows a steady stream of read() calls and
the occasional lseek(). I grabbed a backtrace of the backend with gdb,
attached below.
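For reference, the commands were along these lines:

$ sudo strace -p 6061                # stream of read()s, occasional lseek()s
$ sudo gdb -p 6061 -batch -ex bt     # produces the backtrace quoted below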
Also attached are the non-default pg_settings for this instance.
You'll notice that fsync, full_page_writes, and autovacuum are all
off: this is intentional, since this instance is transient and has
nothing important on it. There are no interesting errors in the
Postgres log files since it was spun up.
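The non-default settings were pulled with something like:

$ psql mydb -c "SELECT name, setting FROM pg_settings WHERE source <> 'default';"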
Any ideas on how to further diagnose or avoid this problem?
Josh
(gdb) bt
#0  0x7fa235a73fe0 in read () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x006249fd in read (__nbytes=8192, __buf=0x7fa21f8d7ee0, __fd=<value optimized out>)
    at /usr/include/x86_64-linux-gnu/bits/unistd.h:45
#2  FileRead (file=<value optimized out>, buffer=0x7fa21f8d7ee0 "RI", amount=8192)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/storage/file/fd.c:1191
#3  0x0063d856 in mdread (reln=<value optimized out>, forknum=<value optimized out>,
    blocknum=253211, buffer=0x7fa21f8d7ee0 "RI")
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/storage/smgr/md.c:646
#4  0x00620556 in ReadBuffer_common (smgr=<value optimized out>,
    relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=253211,
    mode=RBM_NORMAL, strategy=<value optimized out>, hit=0x7fffa31cb7df "")
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/storage/buffer/bufmgr.c:440
#5  0x00621014 in ReadBufferExtended (reln=0x18678a0, forkNum=MAIN_FORKNUM,
    blockNum=253211, mode=RBM_NORMAL, strategy=<value optimized out>)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/storage/buffer/bufmgr.c:246
#6  0x0046ff81 in heapgetpage (scan=0x2202340, page=253211)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/access/heap/heapam.c:230
#7  0x0047049a in heapgettup_pagemode (scan=0x2202340,
    dir=<value optimized out>, nkeys=0, key=0x0)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/access/heap/heapam.c:824
#8  0x00471506 in heap_getnext (scan=0x2202340, direction=<value optimized out>)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/access/heap/heapam.c:1349
#9  0x0058a50b in SeqNext (node=<value optimized out>)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/executor/nodeSeqscan.c:66
#10 0x00579fa3 in ExecScanFetch (recheckMtd=0x58a4d0 <SeqRecheck>,
    accessMtd=0x58a4e0 <SeqNext>, node=0x2201590)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/executor/execScan.c:82
#11 ExecScan (node=0x2201590, accessMtd=0x58a4e0 <SeqNext>,
    recheckMtd=0x58a4d0 <SeqRecheck>)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/executor/execScan.c:167
#12 0x00572a98 in ExecProcNode (node=0x2201590)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/executor/execProcnode.c:394
#13 0x0058587a in ExecLockRows (node=0x22013a0)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/backend/executor/nodeLockRows.c:54
#14 0x00572948 in ExecProcNode (node=0x22013a0)
    at /build/buildd/postgresql-9.1-9.1.11/build/../src/ba