Re: [HACKERS] canceling autovacuum task woes

2012-08-30 Thread Robert Haas
On Mon, Aug 13, 2012 at 11:59 PM, Peter Eisentraut pete...@gmx.net wrote:
 On Tue, 2012-07-24 at 16:14 -0400, Robert Haas wrote:
 On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  Yeah, you're right.  So you do get the table name.  But you don't get
  the cause, which is what you really need to understand why it's
  happening.  Attached is a patch that adds some more detail.
 
  Uh, what's the added dependency on pgstat.h for?  Looks sane to the
  eyeball otherwise.

 Woops, that was leftovers from some earlier silliness that I (mostly)
 removed before posting.

 New version attached.


 The detail message should have a period at the end.

Oops.  Fixed, sorry it took me so long to get to this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-08-13 Thread Peter Eisentraut
On Tue, 2012-07-24 at 16:14 -0400, Robert Haas wrote:
 On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  Robert Haas robertmh...@gmail.com writes:
  Yeah, you're right.  So you do get the table name.  But you don't get
  the cause, which is what you really need to understand why it's
  happening.  Attached is a patch that adds some more detail.
 
  Uh, what's the added dependency on pgstat.h for?  Looks sane to the
  eyeball otherwise.
 
 Woops, that was leftovers from some earlier silliness that I (mostly)
 removed before posting.
 
 New version attached.
 

The detail message should have a period at the end.



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Andres Freund
Hi,

On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
 I am running into a lot of customer situations where the customer
 reports that canceling autovacuum task shows up in the logs, and
 it's unclear whether this is happening often enough to matter, and
 even more unclear what's causing it.
 
 Me: So, do you know what table it's getting cancelled on?
 Customer: Nope.
 Me: Are you running any DDL commands anywhere in the cluster?
 Customer: Nope, absolutely none.
 Me: Well you've got to be running something somewhere or it wouldn't
 be having a lock conflict.
 Customer: OK, well I don't know of any.  What should I do?
 
 It would be awfully nice if the process that does the cancelling would
 provide the same kind of reporting that we do for a deadlock: the
 relevant lock tag, the PID of the process sending the cancel, and the
 query string.
 
 Personally, I'm starting to have a sneaky suspicion that there is
 something actually broken here - that is, that there are lock
 conflicts involve here other than the obvious one (SHARE UPDATE
 EXCLUSIVE on the table) that are allowing autovac to get cancelled
 more often than we realize.  But whether that's true or not, the
 current logging is wholly inadequate.
Very, very, very quick guess: The relation extension lock?

 Thoughts?  Anybody else have this problem?
I have seen spuriously high occurances of that message before, but I never 
really investigated it.

Andres
-- 
Andres Freund   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Steve Singer

On 12-07-24 01:48 PM, Robert Haas wrote:

I am running into a lot of customer situations where the customer
reports that canceling autovacuum task shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.


Could autovacuum be compacting a lot of space at the end of the table.  
This is described
in the thread 
http://archives.postgresql.org/message-id/4d8df88e.7080...@yahoo.com




Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any.  What should I do?

It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.

Personally, I'm starting to have a sneaky suspicion that there is
something actually broken here - that is, that there are lock
conflicts involve here other than the obvious one (SHARE UPDATE
EXCLUSIVE on the table) that are allowing autovac to get cancelled
more often than we realize.  But whether that's true or not, the
current logging is wholly inadequate.

Thoughts?  Anybody else have this problem?




--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Robert Haas
On Tue, Jul 24, 2012 at 2:11 PM, Steve Singer ssin...@ca.afilias.info wrote:
 On 12-07-24 01:48 PM, Robert Haas wrote:
 I am running into a lot of customer situations where the customer
 reports that canceling autovacuum task shows up in the logs, and
 it's unclear whether this is happening often enough to matter, and
 even more unclear what's causing it.

 Could autovacuum be compacting a lot of space at the end of the table.  This
 is described
 in the thread
 http://archives.postgresql.org/message-id/4d8df88e.7080...@yahoo.com

You (and Andres) may well be right, but I think the way we find out is
to add some better logging.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Andrew Dunstan


On 07/24/2012 01:48 PM, Robert Haas wrote:

I am running into a lot of customer situations where the customer
reports that canceling autovacuum task shows up in the logs, and
it's unclear whether this is happening often enough to matter, and
even more unclear what's causing it.

Me: So, do you know what table it's getting cancelled on?
Customer: Nope.
Me: Are you running any DDL commands anywhere in the cluster?
Customer: Nope, absolutely none.
Me: Well you've got to be running something somewhere or it wouldn't
be having a lock conflict.
Customer: OK, well I don't know of any.  What should I do?

It would be awfully nice if the process that does the cancelling would
provide the same kind of reporting that we do for a deadlock: the
relevant lock tag, the PID of the process sending the cancel, and the
query string.



+1 for more information on this.

cheers

andrew



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Alvaro Herrera


 On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
  I am running into a lot of customer situations where the customer
  reports that canceling autovacuum task shows up in the logs, and
  it's unclear whether this is happening often enough to matter, and
  even more unclear what's causing it.
  
  Me: So, do you know what table it's getting cancelled on?
  Customer: Nope.
  Me: Are you running any DDL commands anywhere in the cluster?
  Customer: Nope, absolutely none.
  Me: Well you've got to be running something somewhere or it wouldn't
  be having a lock conflict.
  Customer: OK, well I don't know of any.  What should I do?
  
  It would be awfully nice if the process that does the cancelling would
  provide the same kind of reporting that we do for a deadlock: the
  relevant lock tag, the PID of the process sending the cancel, and the
  query string.

Hm, autovacuum is supposed to set an errcontext callback that would tell
you the table name it's working on at the time of the cancel.  So if
even that is missing, something strange is going on.

No objections to the general idea of adding more info about the process
blocked on autovacuum.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Alvaro Herrera

Excerpts from Alvaro Herrera's message of mar jul 24 15:30:49 -0400 2012:
 
  On Tuesday, July 24, 2012 07:48:27 PM Robert Haas wrote:
   I am running into a lot of customer situations where the customer
   reports that canceling autovacuum task shows up in the logs, and
   it's unclear whether this is happening often enough to matter, and
   even more unclear what's causing it.
   
   Me: So, do you know what table it's getting cancelled on?
   Customer: Nope.

 Hm, autovacuum is supposed to set an errcontext callback that would tell
 you the table name it's working on at the time of the cancel.  So if
 even that is missing, something strange is going on.

Yep, it says:

ERROR:  canceling autovacuum task
CONTEXT:  automatic vacuum of table alvherre.public.foo

So at least that part seems pilot error more than anything else.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Robert Haas
On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Yep, it says:

 ERROR:  canceling autovacuum task
 CONTEXT:  automatic vacuum of table alvherre.public.foo

 So at least that part seems pilot error more than anything else.

Yeah, you're right.  So you do get the table name.  But you don't get
the cause, which is what you really need to understand why it's
happening.  Attached is a patch that adds some more detail.  Here's an
example of what the output looks like:

LOG:  sending cancel to blocking autovacuum PID 21595
DETAIL:  Process 21618 waits for AccessExclusiveLock on relation 27863
of database 16384
STATEMENT:  drop table if exists pgbench_accounts
ERROR:  canceling autovacuum task
CONTEXT:  automatic vacuum of table rhaas.public.pgbench_accounts

I think that's a lot more useful than just getting those last two lines...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


more-autovac-cancel-logging.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Alvaro Herrera

Excerpts from Robert Haas's message of mar jul 24 15:52:23 -0400 2012:
 On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Yep, it says:
 
  ERROR:  canceling autovacuum task
  CONTEXT:  automatic vacuum of table alvherre.public.foo
 
  So at least that part seems pilot error more than anything else.
 
 Yeah, you're right.  So you do get the table name.  But you don't get
 the cause, which is what you really need to understand why it's
 happening.  Attached is a patch that adds some more detail.  Here's an
 example of what the output looks like:
 
 LOG:  sending cancel to blocking autovacuum PID 21595
 DETAIL:  Process 21618 waits for AccessExclusiveLock on relation 27863
 of database 16384
 STATEMENT:  drop table if exists pgbench_accounts
 ERROR:  canceling autovacuum task
 CONTEXT:  automatic vacuum of table rhaas.public.pgbench_accounts

Looks great.  Are you considering backpatching this?


-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Robert Haas
On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
alvhe...@commandprompt.com wrote:
 Excerpts from Robert Haas's message of mar jul 24 15:52:23 -0400 2012:
 On Tue, Jul 24, 2012 at 3:35 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
  Yep, it says:
 
  ERROR:  canceling autovacuum task
  CONTEXT:  automatic vacuum of table alvherre.public.foo
 
  So at least that part seems pilot error more than anything else.

 Yeah, you're right.  So you do get the table name.  But you don't get
 the cause, which is what you really need to understand why it's
 happening.  Attached is a patch that adds some more detail.  Here's an
 example of what the output looks like:

 LOG:  sending cancel to blocking autovacuum PID 21595
 DETAIL:  Process 21618 waits for AccessExclusiveLock on relation 27863
 of database 16384
 STATEMENT:  drop table if exists pgbench_accounts
 ERROR:  canceling autovacuum task
 CONTEXT:  automatic vacuum of table rhaas.public.pgbench_accounts

 Looks great.  Are you considering backpatching this?

Well, that would certainly make MY life easier.  I am not sure whether
it would be in line with project policy, however.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 Yeah, you're right.  So you do get the table name.  But you don't get
 the cause, which is what you really need to understand why it's
 happening.  Attached is a patch that adds some more detail.

Uh, what's the added dependency on pgstat.h for?  Looks sane to the
eyeball otherwise.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Robert Haas
On Tue, Jul 24, 2012 at 4:09 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 Yeah, you're right.  So you do get the table name.  But you don't get
 the cause, which is what you really need to understand why it's
 happening.  Attached is a patch that adds some more detail.

 Uh, what's the added dependency on pgstat.h for?  Looks sane to the
 eyeball otherwise.

Woops, that was leftovers from some earlier silliness that I (mostly)
removed before posting.

New version attached.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


more-autovac-cancel-logging.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Looks great.  Are you considering backpatching this?

 Well, that would certainly make MY life easier.  I am not sure whether
 it would be in line with project policy, however.

+1 for a backpatch.  Otherwise it'll be years before we gain any
information about the unexpected cancels that you think exist.

However, after looking some more at deadlock.c, I wonder whether
(a) this patch gives sufficient detail, and (b) whether there isn't a
problem that's obvious by inspection.  It appears to me that as the
blocking_autovacuum_proc stuff is coded, it will finger an AV proc as
needing to be killed even though it may be several graph edges out from
the current proc.  This means that with respect to (a), the connection
from the process doing the kill to the AV proc may be inadequately
documented by this patch, and with respect to (b), there might well be
cases where we found an AV proc somewhere in the graph traversal but
it's not actually guilty of blocking the current process ... especially
not after the queue reorderings that we may have done.  I think I'd be
happier with that code if it restricted its AV targets to procs that
*directly* block the current process, which not incidentally would make
this amount of log detail sufficient.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Robert Haas
On Jul 24, 2012, at 4:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 On Tue, Jul 24, 2012 at 4:03 PM, Alvaro Herrera
 alvhe...@commandprompt.com wrote:
 Looks great.  Are you considering backpatching this?
 
 Well, that would certainly make MY life easier.  I am not sure whether
 it would be in line with project policy, however.
 
 +1 for a backpatch.  Otherwise it'll be years before we gain any
 information about the unexpected cancels that you think exist

OK, great.

 However, after looking some more at deadlock.c, I wonder whether
 (a) this patch gives sufficient detail, and (b) whether there isn't a
 problem that's obvious by inspection.  It appears to me that as the
 blocking_autovacuum_proc stuff is coded, it will finger an AV proc as
 needing to be killed even though it may be several graph edges out from
 the current proc.  This means that with respect to (a), the connection
 from the process doing the kill to the AV proc may be inadequately
 documented by this patch, and with respect to (b), there might well be
 cases where we found an AV proc somewhere in the graph traversal but
 it's not actually guilty of blocking the current process ... especially
 not after the queue reorderings that we may have done.  I think I'd be
 happier with that code if it restricted its AV targets to procs that
 *directly* block the current process, which not incidentally would make
 this amount of log detail sufficient.

Uggh.  Well, that certainly sounds like something that could cause spurious 
cancels - or excessively fast ones, since presumably if we limit it to things 
that directly block the current process, you'll always allow the full 
deadlock_timeout before nuking the autovac worker.  So +1 for changing that.

Does an edge in this context mean any lock, or just an ungranted one?  I assume 
the latter, which still leaves the question of where the edges are coming from 
in the first place.

...Robert
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] canceling autovacuum task woes

2012-07-24 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 On Jul 24, 2012, at 4:31 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 ... This means that with respect to (a), the connection
 from the process doing the kill to the AV proc may be inadequately
 documented by this patch, and with respect to (b), there might well be
 cases where we found an AV proc somewhere in the graph traversal but
 it's not actually guilty of blocking the current process ... especially
 not after the queue reorderings that we may have done.  I think I'd be
 happier with that code if it restricted its AV targets to procs that
 *directly* block the current process, which not incidentally would make
 this amount of log detail sufficient.

 Uggh.  Well, that certainly sounds like something that could cause spurious 
 cancels - or excessively fast ones, since presumably if we limit it to things 
 that directly block the current process, you'll always allow the full 
 deadlock_timeout before nuking the autovac worker.  So +1 for changing that.

I think something as simple as the attached would do the trick.  I've
verified that this still allows the expected cancel cases, though of
course it's harder to prove that it gives any benefit.

 Does an edge in this context mean any lock, or just an ungranted one?  I 
 assume the latter, which still leaves the question of where the edges are 
 coming from in the first place.

The deadlock code follows hard edges, which mean A wants a lock that
B has already got, as well as soft edges, which mean A wants a lock
that B also wants, and B is ahead of A in the queue for it.  We don't
kill autovacs that are the end of soft edges, which I think is good,
but the fact that we follow them at all makes it a little unclear what
we might reach recursively.  Your point about determinacy of the timeout
is probably even a better argument for only allowing the direct blockee
to wield the knife.

regards, tom lane

diff --git a/src/backend/storage/lmgr/deadlock.c b/src/backend/storage/lmgr/deadlock.c
index 288186a6ceebf40d5a43638c7be5be35c12dd3b3..6620a3d571d769f75e6ae3fbbe4a0ed2cf271720 100644
*** a/src/backend/storage/lmgr/deadlock.c
--- b/src/backend/storage/lmgr/deadlock.c
*** FindLockCycleRecurse(PGPROC *checkProc,
*** 527,551 
  if ((proclock-holdMask  LOCKBIT_ON(lm)) 
  	(conflictMask  LOCKBIT_ON(lm)))
  {
- 	/*
- 	 * Look for a blocking autovacuum. There can be more than
- 	 * one in the deadlock cycle, in which case we just pick a
- 	 * random one.	We stash the autovacuum worker's PGPROC so
- 	 * that the caller can send a cancel signal to it, if
- 	 * appropriate.
- 	 *
- 	 * Note we read vacuumFlags without any locking.  This is
- 	 * OK only for checking the PROC_IS_AUTOVACUUM flag,
- 	 * because that flag is set at process start and never
- 	 * reset; there is logic elsewhere to avoid canceling an
- 	 * autovacuum that is working for preventing Xid
- 	 * wraparound problems (which needs to read a different
- 	 * vacuumFlag bit), but we don't do that here to avoid
- 	 * grabbing ProcArrayLock.
- 	 */
- 	if (pgxact-vacuumFlags  PROC_IS_AUTOVACUUM)
- 		blocking_autovacuum_proc = proc;
- 
  	/* This proc hard-blocks checkProc */
  	if (FindLockCycleRecurse(proc, depth + 1,
  			 softEdges, nSoftEdges))
--- 527,532 
*** FindLockCycleRecurse(PGPROC *checkProc,
*** 559,565 
  
  		return true;
  	}
! 	/* If no deadlock, we're done looking at this proclock */
  	break;
  }
  			}
--- 540,575 
  
  		return true;
  	}
! 
! 	/*
! 	 * No deadlock here, but see if this proc is an autovacuum
! 	 * that is directly hard-blocking our own proc.  If so,
! 	 * report it so that the caller can send a cancel signal
! 	 * to it, if appropriate.  If there's more than one such
! 	 * proc (probably not possible given that autovacuums all
! 	 * take similar lock types), it's indeterminate which one
! 	 * will be reported.
! 	 *
! 	 * We don't touch autovacuums that are indirectly blocking
! 	 * us; it's up to the direct blockee to take action.  This
! 	 * rule simplifies understanding the behavior and ensures
! 	 * that an autovacuum won't be canceled with less than
! 	 * deadlock_timeout grace period.
! 	 *
! 	 * Note we read vacuumFlags without any locking.  This is
! 	 * OK only for checking the PROC_IS_AUTOVACUUM flag,
! 	 * because that flag is set at process start and never
! 	 * reset.  There is logic elsewhere to avoid canceling an
! 	 * autovacuum that is working to prevent XID wraparound
! 	 * problems (which needs to read a different vacuumFlag
! 	 * bit), but we don't do that here to avoid grabbing
! 	 * ProcArrayLock.
! 	 */
! 	if (checkProc == MyProc 
! 		pgxact-vacuumFlags  PROC_IS_AUTOVACUUM)
! 		blocking_autovacuum_proc = proc;
!