[PERFORM] COMMIT stuck for days after bulk delete

2014-01-14 Thread Josh Kupershmidt
We have a 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
which seems to be stuck at COMMIT for 2 days now:

mydb=# SELECT procpid, waiting, current_query,
CURRENT_TIMESTAMP - query_start AS query_elapsed,
CURRENT_TIMESTAMP - xact_start AS xact_elapsed
FROM pg_stat_activity WHERE procpid != pg_backend_pid() AND
current_query != '<IDLE>';
-[ RECORD 1 ]-+---
procpid   | 6061
waiting   | f
current_query | COMMIT;
query_elapsed | 2 days 08:59:17.619142
xact_elapsed  | 3 days 15:48:10.739912


The transaction behind that COMMIT has been the only thing running on
this Postgres instance for the past 3 days or so, since Postgres was
started on that machine. I spun the EC2 instance for this database up
solely to test a database subsetting process, which is what the
transaction was doing before it got stuck at COMMIT -- using a bunch
of DELETEs and ALTER TABLE ... DROP|ADD CONSTRAINTs to delete 90% or
so of our data in order to be able to pg_dump a slimmed-down
development copy.
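
(The general shape of it, with made-up table and column names, was
something like:)

BEGIN;
ALTER TABLE child_table DROP CONSTRAINT child_table_parent_id_fkey;
DELETE FROM parent_table WHERE id NOT IN (SELECT id FROM sample_ids);
DELETE FROM child_table WHERE parent_id NOT IN (SELECT id FROM parent_table);
ALTER TABLE child_table ADD CONSTRAINT child_table_parent_id_fkey
  FOREIGN KEY (parent_id) REFERENCES parent_table (id)
  DEFERRABLE INITIALLY DEFERRED;
COMMIT;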

The EC2 instances we use have separate EBS-backed volumes for the
PostgreSQL data and WAL directories. The backend in question seems to
be stuck reading a ton of data from the data partition: the monitoring
for those EBS volumes shows they have been hammered with reads at a
constant aggregate of ~90 MB/sec since that COMMIT started. The write
bandwidth to the postgresql-data partition has been almost nil since
the COMMIT, and there has been no read/write activity on the WAL
volumes.

Here, we can see that the backend has managed to read roughly 22 TiB,
despite the fact that the entire database is only 228 GB on disk.

$ sudo cat /proc/6061/io
rchar: 24505414843923
wchar: 23516159014
syscr: 2991395854
syscw: 2874613
read_bytes: 24791719338496
write_bytes: 22417580032
cancelled_write_bytes: 221208576

$ df -h /dev/md0 /dev/md1
Filesystem  Size  Used Avail Use% Mounted on
/dev/md0480G  228G  253G  48% /mnt/ebs/postgresql-data
/dev/md1 32G   20G   13G  61% /mnt/ebs/postgresql-wal
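
(For scale: read_bytes of 24,791,719,338,496 is about 22.5 TiB, which
works out to re-reading the equivalent of the entire 228 GB database
roughly a hundred times.)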

Running an strace on the backend shows a whole ton of read() calls and
the occasional lseek(). I grabbed a backtrace of the backend with gdb,
attached.
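
(The strace was just something along the lines of "sudo strace -p 6061 -c"
for the per-syscall counts, plus "sudo strace -p 6061 -e trace=read,lseek"
to watch the calls go by.)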

Attached also are the non-default pg_settings for this instance.
You'll notice that fsync, full_page_writes, and autovacuum are all
off: this is intentional, since this instance is transient and has
nothing important on it. There are no interesting errors in the
Postgres log files since it was spun up.
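
(For reference, the non-default settings were pulled with a query along
these lines:)

SELECT name, setting
  FROM pg_settings
 WHERE source NOT IN ('default', 'override')
 ORDER BY name;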

Any ideas on how to further diagnose or avoid this problem?

Josh
(gdb) bt
#0  0x7fa235a73fe0 in read () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x006249fd in read (__nbytes=8192, __buf=0x7fa21f8d7ee0, 
__fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:45
#2  FileRead (file=<optimized out>, buffer=0x7fa21f8d7ee0 RI, amount=8192)
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/storage/file/fd.c:1191
#3  0x0063d856 in mdread (reln=<optimized out>, 
forknum=<optimized out>, blocknum=253211, buffer=0x7fa21f8d7ee0 RI)
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/storage/smgr/md.c:646
#4  0x00620556 in ReadBuffer_common (smgr=<optimized out>, 
relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=253211, 
mode=RBM_NORMAL, strategy=<optimized out>, hit=0x7fffa31cb7df )
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/storage/buffer/bufmgr.c:440
#5  0x00621014 in ReadBufferExtended (reln=0x18678a0, 
forkNum=MAIN_FORKNUM, blockNum=253211, mode=RBM_NORMAL, 
strategy=<optimized out>)
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/storage/buffer/bufmgr.c:246
#6  0x0046ff81 in heapgetpage (scan=0x2202340, page=253211)
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/access/heap/heapam.c:230
#7  0x0047049a in heapgettup_pagemode (scan=0x2202340, 
dir=<optimized out>, nkeys=0, key=0x0)
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/access/heap/heapam.c:824
#8  0x00471506 in heap_getnext (scan=0x2202340, 
direction=<optimized out>)
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/access/heap/heapam.c:1349
#9  0x0058a50b in SeqNext (node=<optimized out>)
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/executor/nodeSeqscan.c:66
#10 0x00579fa3 in ExecScanFetch (recheckMtd=0x58a4d0 <SeqRecheck>, 
accessMtd=0x58a4e0 <SeqNext>, node=0x2201590)
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/executor/execScan.c:82
#11 ExecScan (node=0x2201590, accessMtd=0x58a4e0 <SeqNext>, 
recheckMtd=0x58a4d0 <SeqRecheck>)
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/executor/execScan.c:167
#12 0x00572a98 in ExecProcNode (node=0x2201590)
at 
/build/buildd/postgresql-9.1-9.1.11/build/../src/backend/executor/execProcnode.c:394
#13 0x0058587a in ExecLockRows (node=0x22013a0)
at 

Re: [PERFORM] COMMIT stuck for days after bulk delete

2014-01-14 Thread Tom Lane
Josh Kupershmidt <schmi...@gmail.com> writes:
> We have a 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
> which seems to be stuck at COMMIT for 2 days now:
> ...
> The transaction behind that COMMIT has been the only thing running on
> this Postgres instance for the past 3 days or so, since Postgres was
> started on that machine. I spun the EC2 instance for this database up
> solely to test a database subsetting process, which is what the
> transaction was doing before it got stuck at COMMIT -- using a bunch
> of DELETEs and ALTER TABLE ... DROP|ADD CONSTRAINTs to delete 90% or
> so of our data in order to be able to pg_dump a slimmed-down
> development copy.

A plausible guess is that the backend is running around trying to verify
that some deferred foreign key constraints still hold.  But without
knowing what your schema is, that's only a guess.

If that is it, a likely solution is to drop *all* the FK constraints
before doing the bulk delete, then (in a new transaction, probably)
recreate the ones you still want.
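
For instance (just a sketch -- generate the statements from the catalogs,
and save the ADD CONSTRAINT output somewhere before dropping anything):

SELECT 'ALTER TABLE ' || conrelid::regclass ||
       ' DROP CONSTRAINT ' || quote_ident(conname) || ';'
  FROM pg_catalog.pg_constraint WHERE contype = 'f';

SELECT 'ALTER TABLE ' || conrelid::regclass ||
       ' ADD CONSTRAINT ' || quote_ident(conname) || ' ' ||
       pg_get_constraintdef(oid) || ';'
  FROM pg_catalog.pg_constraint WHERE contype = 'f';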

regards, tom lane




Re: [PERFORM] COMMIT stuck for days after bulk delete

2014-01-14 Thread Josh Kupershmidt
On Tue, Jan 14, 2014 at 12:36 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Josh Kupershmidt <schmi...@gmail.com> writes:
>> We have a 9.1.11 backend (Ubuntu 12.04 x86_64, m1.medium EC2 instance)
>> which seems to be stuck at COMMIT for 2 days now:
>> ...
>> The transaction behind that COMMIT has been the only thing running on
>> this Postgres instance for the past 3 days or so, since Postgres was
>> started on that machine. I spun the EC2 instance for this database up
>> solely to test a database subsetting process, which is what the
>> transaction was doing before it got stuck at COMMIT -- using a bunch
>> of DELETEs and ALTER TABLE ... DROP|ADD CONSTRAINTs to delete 90% or
>> so of our data in order to be able to pg_dump a slimmed-down
>> development copy.
>
> A plausible guess is that the backend is running around trying to verify
> that some deferred foreign key constraints still hold.  But without
> knowing what your schema is, that's only a guess.

Yeah, that's a good guess. A bunch of the FK constraints I am dropping
and re-adding are marked DEFERRABLE INITIALLY DEFERRED; there are 167
counted by:

SELECT COUNT(*)
  FROM pg_catalog.pg_constraint c
  WHERE contype = 'f' AND condeferrable AND condeferred AND
  connamespace =
(SELECT oid FROM pg_catalog.pg_namespace WHERE nspname = 'public') ;
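
(Which fits the symptom: with DEFERRABLE INITIALLY DEFERRED, the RI
checks don't run at DELETE time at all -- they are queued up and run at
COMMIT. A toy illustration, table names made up:)

CREATE TABLE parent (id int PRIMARY KEY);
CREATE TABLE child (parent_id int REFERENCES parent (id)
                    DEFERRABLE INITIALLY DEFERRED);
INSERT INTO parent VALUES (1);
INSERT INTO child VALUES (1);

BEGIN;
DELETE FROM parent;  -- no FK check fires here
COMMIT;              -- the deferred RI check runs here (and errors out)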

> If that is it, a likely solution is to drop *all* the FK constraints
> before doing the bulk delete, then (in a new transaction, probably)
> recreate the ones you still want.

Will try that, thanks for the suggestion.

Josh

