Re: [HACKERS] ERROR during end-of-xact/FATAL
On Fri, Nov 15, 2013 at 09:51:32AM -0500, Robert Haas wrote:
> On Wed, Nov 13, 2013 at 11:04 AM, Noah Misch wrote:
> >> So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL. That's bizarre.
> >
> > Quite so.
> >
> >> Given that that's where we are, promoting an ERROR during FATAL processing to PANIC doesn't seem like it's losing much; we're essentially already doing that in the (probably more likely) case of a persistent ERROR during ERROR processing. But since PANIC sucks, I'd rather go the other direction: let's make an ERROR during ERROR processing promote to FATAL. And then let's do what you write above: make sure that there's a separate on-shmem-exit callback for each critical shared memory resource and that we call all of those during FATAL processing.
> >
> > Many of the factors that can cause AbortTransaction() to fail can also cause CommitTransaction() to fail, and those would still PANIC if the transaction had an xid. How practical might it be to also escape from an error during CommitTransaction() with a FATAL instead of PANIC? There's more to fix up in that case (sinval, NOTIFY), but it may be within reach. If such a technique can only reasonably fix abort, though, I have doubts it buys us enough.
>
> The critical stuff that's got to happen after RecordTransactionCommit() appears to be ProcArrayEndTransaction() and AtEOXact_Inval(). Unfortunately, the latter is well after the point when we're supposed to only be doing "non-critical resource cleanup", notwithstanding which it appears to be critical.
>
> So here's a sketch. Hoist the preparatory logic in RecordTransactionCommit() - smgrGetPendingDeletes, xactGetCommittedChildren, and xactGetCommittedInvalidationMessages - up into the caller and do it before setting TRANS_COMMIT. If any of that stuff fails, we'll land in AbortTransaction(), which must cope.
>
> As soon as we exit the commit critical section, set a flag somewhere (where?) indicating that we have written our commit record; when that flag is set, (a) promote any ERROR after that point through the end of commit cleanup to FATAL and (b) if we enter AbortTransaction(), don't try to RecordTransactionAbort().
>
> I can't see that the notification stuff requires fixup in this case; AFAICS, it is just adjusting backend-local state, and it's OK to disregard any problems there during a FATAL exit. Do you see something to the contrary?

The part not to miss is SignalBackends() in ProcessCompletedNotifies(). It happens outside CommitTransaction(), just before the backend goes idle. Without that, listeners could miss our notification until some other NOTIFY transaction signals them. That said, since ProcessCompletedNotifies() already accepts the possibility of failure, it would not be crazy to overlook this.

> But invalidation messages are a problem: if we commit and exit without sending our queued-up invalidation messages, Bad Things Will Happen. Perhaps we could arrange things so that in that case only, we just PANIC. That would allow most write transactions to get by with FATAL, promoting to PANIC only in the case of transactions that have modified system catalogs and only until the invalidations have actually been sent. Avoiding the PANIC in that case seems to require some additional wizardry which is not entirely clear to me at this time.

Agreed.

> I think we'll have to approach the various problems in this area stepwise, or we'll never make any progress.

That's one option. The larger point is that allowing ERROR-late-in-COMMIT and FATAL + ERROR scenarios to get away with FATAL requires this sort of analysis of every exit callback and all the later stages of CommitTransaction(). The current bug level shows such analysis hasn't been happening, and the dearth of bug reports suggests the scenarios are rare.
I don't think the project stands to gain enough from the changes contemplated here and, perhaps more notably, from performing such analysis during review of future patches that touch CommitTransaction() and the exit sequence. It remains my recommendation to run proc_exit_prepare() and the post-CLOG actions of CommitTransaction() and AbortTransaction() in critical sections, giving the scenarios blunt but reliable treatment. If reports of excess PANIC arise, the thing to fix is the code causing the late error. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] ERROR during end-of-xact/FATAL
Robert Haas escribió:
> On Thu, Nov 28, 2013 at 10:10 AM, Alvaro Herrera wrote:
> > Robert Haas escribió:
> >> On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera wrote:
> >> > Noah Misch wrote:
> >> >> Incomplete list:
> >> >>
> >> >> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee relpathbackend() call palloc(); this is true in all supported branches. In 9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc(). (In fact, it does so even when the pending list is empty -- this is the only palloc() during a trivial transaction commit.) palloc() failure there yields a PANIC during commit.
> >> >
> >> > I think we should fix this routine to avoid the palloc when not necessary. That initial palloc is pointless.
> >
> > Here's a trivial patch we could apply to 9.3 immediately. Anything else such as the ideas proposed below would require more effort than anyone can probably spend here soon.
>
> Yeah, this seems like a good thing to do for now.

Pushed, thanks.

--
Álvaro Herrera                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] ERROR during end-of-xact/FATAL
On Thu, Nov 28, 2013 at 10:10 AM, Alvaro Herrera wrote:
> Robert Haas escribió:
>> On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera wrote:
>> > Noah Misch wrote:
>> >> Incomplete list:
>> >>
>> >> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee relpathbackend() call palloc(); this is true in all supported branches. In 9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc(). (In fact, it does so even when the pending list is empty -- this is the only palloc() during a trivial transaction commit.) palloc() failure there yields a PANIC during commit.
>> >
>> > I think we should fix this routine to avoid the palloc when not necessary. That initial palloc is pointless.
>
> Here's a trivial patch we could apply to 9.3 immediately. Anything else such as the ideas proposed below would require more effort than anyone can probably spend here soon.

Yeah, this seems like a good thing to do for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] ERROR during end-of-xact/FATAL
Robert Haas escribió:
> On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera wrote:
> > Noah Misch wrote:
> >> Incomplete list:
> >>
> >> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee relpathbackend() call palloc(); this is true in all supported branches. In 9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc(). (In fact, it does so even when the pending list is empty -- this is the only palloc() during a trivial transaction commit.) palloc() failure there yields a PANIC during commit.
> >
> > I think we should fix this routine to avoid the palloc when not necessary. That initial palloc is pointless.

Here's a trivial patch we could apply to 9.3 immediately. Anything else such as the ideas proposed below would require more effort than anyone can probably spend here soon.

> > Also, there have been previous discussions about having relpathbackend not use palloc at all. That was only because we wanted to use it in pg_xlogdump which didn't have palloc support at the time, so it's no longer as pressing; but perhaps it's still worthy of consideration.
>
> +1, but I'd like it better if we could find a way of avoiding the palloc in all cases. Panicking because we run out of memory at the wrong time isn't really very nice. Maybe the information needs to be maintained in the format in which it ultimately needs to be returned, so that we needn't rejigger it in the middle of a critical section.

--
Álvaro Herrera                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 314,321 **** smgrDoPendingDeletes(bool isCommit)
  	PendingRelDelete *next;
  	int			nrels = 0,
  				i = 0,
! 				maxrels = 8;
! 	SMgrRelation *srels = palloc(maxrels * sizeof(SMgrRelation));
  
  	prev = NULL;
  	for (pending = pendingDeletes; pending != NULL; pending = next)
--- 314,321 ----
  	PendingRelDelete *next;
  	int			nrels = 0,
  				i = 0,
! 				maxrels = 0;
! 	SMgrRelation *srels = NULL;
  
  	prev = NULL;
  	for (pending = pendingDeletes; pending != NULL; pending = next)
***************
*** 340,347 **** smgrDoPendingDeletes(bool isCommit)
  
  			srel = smgropen(pending->relnode, pending->backend);
  
! 			/* extend the array if needed (double the size) */
! 			if (maxrels <= nrels)
  			{
  				maxrels *= 2;
  				srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
--- 340,352 ----
  
  			srel = smgropen(pending->relnode, pending->backend);
  
! 			/* allocate the initial array, or extend it, if needed */
! 			if (maxrels == 0)
! 			{
! 				maxrels = 8;
! 				srels = palloc(sizeof(SMgrRelation) * maxrels);
! 			}
! 			else if (maxrels <= nrels)
  			{
  				maxrels *= 2;
  				srels = repalloc(srels, sizeof(SMgrRelation) * maxrels);
***************
*** 361,370 **** smgrDoPendingDeletes(bool isCommit)
  		for (i = 0; i < nrels; i++)
  			smgrclose(srels[i]);
- 	}
- 
- 	pfree(srels);
  }
  
  /*
--- 366,374 ----
  		for (i = 0; i < nrels; i++)
  			smgrclose(srels[i]);
+ 		pfree(srels);
+ 	}
  }
  
  /*
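The pattern the patch uses can be distilled into a self-contained sketch (a toy illustration, not PostgreSQL code: malloc/realloc stand in for palloc/repalloc, and a generic pointer array stands in for the SMgrRelation array; all names are hypothetical). The point is that the common case, an empty pending-delete list during a trivial commit, performs no allocation at all:

```c
#include <stdlib.h>

/* Toy lazy-growing array: defer the first allocation until an element
 * actually needs to be stored. */
typedef struct
{
	void	  **items;
	int			nitems;
	int			maxitems;
} LazyArray;

void
lazy_array_init(LazyArray *a)
{
	a->items = NULL;			/* defer allocation until first use */
	a->nitems = 0;
	a->maxitems = 0;
}

int
lazy_array_add(LazyArray *a, void *item)
{
	if (a->maxitems == 0)
	{
		/* allocate the initial array only when something must be stored */
		a->maxitems = 8;
		a->items = malloc(sizeof(void *) * a->maxitems);
	}
	else if (a->nitems >= a->maxitems)
	{
		/* extend the array if needed (double the size) */
		a->maxitems *= 2;
		a->items = realloc(a->items, sizeof(void *) * a->maxitems);
	}
	if (a->items == NULL)
		return -1;				/* allocation failure surfaces to the caller */
	a->items[a->nitems++] = item;
	return 0;
}

void
lazy_array_release(LazyArray *a)
{
	if (a->items != NULL)		/* free only if anything was ever allocated */
		free(a->items);
	a->items = NULL;
}
```

In the real routine the payoff is the same: the empty-list commit path no longer contains the only palloc() of a trivial transaction commit, so it cannot fail there.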
Re: [HACKERS] ERROR during end-of-xact/FATAL
On Wed, Nov 13, 2013 at 11:04 AM, Noah Misch wrote:
>> So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL. That's bizarre.
>
> Quite so.
>
>> Given that that's where we are, promoting an ERROR during FATAL processing to PANIC doesn't seem like it's losing much; we're essentially already doing that in the (probably more likely) case of a persistent ERROR during ERROR processing. But since PANIC sucks, I'd rather go the other direction: let's make an ERROR during ERROR processing promote to FATAL. And then let's do what you write above: make sure that there's a separate on-shmem-exit callback for each critical shared memory resource and that we call all of those during FATAL processing.
>
> Many of the factors that can cause AbortTransaction() to fail can also cause CommitTransaction() to fail, and those would still PANIC if the transaction had an xid. How practical might it be to also escape from an error during CommitTransaction() with a FATAL instead of PANIC? There's more to fix up in that case (sinval, NOTIFY), but it may be within reach. If such a technique can only reasonably fix abort, though, I have doubts it buys us enough.

The critical stuff that's got to happen after RecordTransactionCommit() appears to be ProcArrayEndTransaction() and AtEOXact_Inval(). Unfortunately, the latter is well after the point when we're supposed to only be doing "non-critical resource cleanup", notwithstanding which it appears to be critical.

So here's a sketch. Hoist the preparatory logic in RecordTransactionCommit() - smgrGetPendingDeletes, xactGetCommittedChildren, and xactGetCommittedInvalidationMessages - up into the caller and do it before setting TRANS_COMMIT. If any of that stuff fails, we'll land in AbortTransaction(), which must cope. As soon as we exit the commit critical section, set a flag somewhere (where?) indicating that we have written our commit record; when that flag is set, (a) promote any ERROR after that point through the end of commit cleanup to FATAL and (b) if we enter AbortTransaction(), don't try to RecordTransactionAbort().

I can't see that the notification stuff requires fixup in this case; AFAICS, it is just adjusting backend-local state, and it's OK to disregard any problems there during a FATAL exit. Do you see something to the contrary?

But invalidation messages are a problem: if we commit and exit without sending our queued-up invalidation messages, Bad Things Will Happen. Perhaps we could arrange things so that in that case only, we just PANIC. That would allow most write transactions to get by with FATAL, promoting to PANIC only in the case of transactions that have modified system catalogs and only until the invalidations have actually been sent. Avoiding the PANIC in that case seems to require some additional wizardry which is not entirely clear to me at this time.

I think we'll have to approach the various problems in this area stepwise, or we'll never make any progress.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
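The flag-based rule sketched in that message might be pictured roughly like this (a hypothetical sketch: all names and level values here are invented for illustration, and the real severity levels live in elog.h):

```c
#include <stdbool.h>

/* Invented severity values, ordered like PostgreSQL's ERROR < FATAL < PANIC. */
enum { ELEVEL_ERROR = 20, ELEVEL_FATAL = 21, ELEVEL_PANIC = 22 };

/* Hypothetical backend-local flag: set once the commit record has been
 * written and the commit critical section has been exited. */
static bool commit_record_written = false;

/* (a) After the commit record exists, the transaction can no longer be
 * rolled back, so a late ERROR during commit cleanup is promoted to FATAL. */
int
promote_elevel(int elevel)
{
	if (commit_record_written && elevel == ELEVEL_ERROR)
		return ELEVEL_FATAL;
	return elevel;
}

/* (b) If AbortTransaction() is entered after the commit record was
 * written, writing an abort record would contradict it, so don't. */
bool
should_record_abort(void)
{
	return !commit_record_written;
}
```

This is only the control-flow skeleton; where the flag would live and who clears it at transaction end are exactly the open questions ("set a flag somewhere (where?)") in the message above.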
Re: [HACKERS] ERROR during end-of-xact/FATAL
On Tue, Nov 12, 2013 at 09:55:34AM -0500, Robert Haas wrote: > On Fri, Nov 8, 2013 at 4:13 PM, Noah Misch wrote: > > A PANIC will reinitialize everything relevant, largely resolving the > > problems > > around ERROR during FATAL. It's a heavy-handed solution, but it may well be > > the best solution. Efforts to harden CommitTransaction() and > > AbortTransaction() seem well-spent, but the additional effort to make FATAL > > exit cope where AbortTransaction() or another exit action could not cope > > seems > > to be slicing ever-smaller portions of additional robustness. > > > > I pondered a variant of that conclusion that distinguished critical cleanup > > needs from the rest. Each shared resource (heavyweight locks, buffer pins, > > LWLocks) would have an on_shmem_exit() callback that cleans up the resource > > under a critical section. (AtProcExit_Buffers() used to fill such a role, > > but > > resowner.c's work during AbortTransaction() has mostly supplanted it.) The > > ShutdownPostgres callback would not use a critical section, so lesser > > failures > > in AbortTransaction() would not upgrade to a PANIC. But I'm leaning against > > such a complication on the grounds that it would add seldom-tested code > > paths > > posing as much a chance of eroding robustness as bolstering it. > > The current situation is just plain weird: in the ERROR-then-ERROR > case, we emit a WARNING and bounce right back into AbortTransaction(), > and if it doesn't work any better the second time than the first time, > we recurse again, and eventually if it fails enough times in a row, we > just give up and PANIC. But in the ERROR-then-FATAL case, we *don't* > retry AbortTransaction(); instead, we just continue running the rest > of the on_shmem_exit callbacks and then exit. > > So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL. > That's bizarre. Quite so. 
> Given that that's where we are, promoting an ERROR during FATAL processing to PANIC doesn't seem like it's losing much; we're essentially already doing that in the (probably more likely) case of a persistent ERROR during ERROR processing. But since PANIC sucks, I'd rather go the other direction: let's make an ERROR during ERROR processing promote to FATAL. And then let's do what you write above: make sure that there's a separate on-shmem-exit callback for each critical shared memory resource and that we call all of those during FATAL processing.

Many of the factors that can cause AbortTransaction() to fail can also cause CommitTransaction() to fail, and those would still PANIC if the transaction had an xid. How practical might it be to also escape from an error during CommitTransaction() with a FATAL instead of PANIC? There's more to fix up in that case (sinval, NOTIFY), but it may be within reach. If such a technique can only reasonably fix abort, though, I have doubts it buys us enough.

> It seems to me that that's how things were originally designed to work, but that we've drifted away from it basically because the low-level callbacks to release heavyweight locks and buffer pins turned out to be kinda, uh, slow, and we thought those code paths couldn't be taken anyway (turns out they can). I think we could either make those routines very fast, or arrange only to run that code at all in the case where AbortTransaction() didn't complete successfully.

Agreed; the performance hazards look tractable.

> It's true that such code will be rarely run, but the logic is simple enough that I think we can verify it by hand, and it's sure nice to avoid PANICs.

True.

--
Noah Misch
EnterpriseDB                     http://www.enterprisedb.com
Re: [HACKERS] ERROR during end-of-xact/FATAL
On Fri, Nov 8, 2013 at 4:13 PM, Noah Misch wrote: > A PANIC will reinitialize everything relevant, largely resolving the problems > around ERROR during FATAL. It's a heavy-handed solution, but it may well be > the best solution. Efforts to harden CommitTransaction() and > AbortTransaction() seem well-spent, but the additional effort to make FATAL > exit cope where AbortTransaction() or another exit action could not cope seems > to be slicing ever-smaller portions of additional robustness. > > I pondered a variant of that conclusion that distinguished critical cleanup > needs from the rest. Each shared resource (heavyweight locks, buffer pins, > LWLocks) would have an on_shmem_exit() callback that cleans up the resource > under a critical section. (AtProcExit_Buffers() used to fill such a role, but > resowner.c's work during AbortTransaction() has mostly supplanted it.) The > ShutdownPostgres callback would not use a critical section, so lesser failures > in AbortTransaction() would not upgrade to a PANIC. But I'm leaning against > such a complication on the grounds that it would add seldom-tested code paths > posing as much a chance of eroding robustness as bolstering it. The current situation is just plain weird: in the ERROR-then-ERROR case, we emit a WARNING and bounce right back into AbortTransaction(), and if it doesn't work any better the second time than the first time, we recurse again, and eventually if it fails enough times in a row, we just give up and PANIC. But in the ERROR-then-FATAL case, we *don't* retry AbortTransaction(); instead, we just continue running the rest of the on_shmem_exit callbacks and then exit. So, in short, ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL. That's bizarre. Given that that's where we are, promoting an ERROR during FATAL processing to PANIC doesn't seem like it's losing much; we're essentially already doing that in the (probably more likely) case of a persistent ERROR during ERROR processing. 
But since PANIC sucks, I'd rather go the other direction: let's make an ERROR during ERROR processing promote to FATAL. And then let's do what you write above: make sure that there's a separate on-shmem-exit callback for each critical shared memory resource and that we call all of those during FATAL processing.

It seems to me that that's how things were originally designed to work, but that we've drifted away from it basically because the low-level callbacks to release heavyweight locks and buffer pins turned out to be kinda, uh, slow, and we thought those code paths couldn't be taken anyway (turns out they can). I think we could either make those routines very fast, or arrange only to run that code at all in the case where AbortTransaction() didn't complete successfully. It's true that such code will be rarely run, but the logic is simple enough that I think we can verify it by hand, and it's sure nice to avoid PANICs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
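The "ERROR + ERROR*10 = PANIC, but FATAL + ERROR*10 = FATAL" asymmetry can be captured in a toy model (the function name and retry limit are illustrative only; the real accounting is elog.c's recursion tracking, not a loop like this):

```c
#include <stdbool.h>

/* Toy model of the current behavior: a persistent ERROR while recovering
 * from an ERROR re-enters AbortTransaction() until a depth limit gives up
 * and PANICs, while an ERROR during FATAL exit never re-enters
 * AbortTransaction() at all -- the backend just exits FATAL with whatever
 * cleanup happened to complete. */
enum Outcome { OUT_CLEAN, OUT_FATAL, OUT_PANIC };

enum Outcome
simulate_error_recovery(bool during_fatal_exit, int failing_attempts)
{
	const int	max_retries = 10;	/* illustrative recursion limit */

	if (during_fatal_exit)
		return OUT_FATAL;			/* no retry: run remaining exit callbacks, exit */

	for (int attempt = 1; attempt <= max_retries; attempt++)
	{
		/* each iteration models one (re)entry into AbortTransaction() */
		if (attempt > failing_attempts)
			return OUT_CLEAN;		/* cleanup finally succeeded */
	}
	return OUT_PANIC;				/* gave up after repeated failures */
}
```

The proposal above amounts to changing the two terminal outcomes: ERROR-during-ERROR would end in FATAL rather than PANIC, with dedicated exit callbacks guaranteeing that the critical shared-memory cleanup still happens.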
Re: [HACKERS] ERROR during end-of-xact/FATAL
On Sat, Nov 9, 2013 at 2:43 AM, Noah Misch wrote:
> On Wed, Nov 06, 2013 at 10:14:53AM +0530, Amit Kapila wrote:
>> On Thu, Oct 31, 2013 at 8:22 PM, Noah Misch wrote:
>> About unclean FATAL-then-ERROR scenario, one way to deal at high level could be to treat such a case as backend crash in which case postmaster reinitialises shared memory and other stuff.
>>
>> > If we can't manage to free a shared memory resource like a lock or buffer pin, we really must PANIC.
>>
>> Can't we try to initialise the shared memory and other resources, wouldn't that resolve the problems that can occur due to the scenario explained by you?
>
> A PANIC will reinitialize everything relevant, largely resolving the problems around ERROR during FATAL. It's a heavy-handed solution, but it may well be the best solution. Efforts to harden CommitTransaction() and AbortTransaction() seem well-spent, but the additional effort to make FATAL exit cope where AbortTransaction() or another exit action could not cope seems to be slicing ever-smaller portions of additional robustness.
>
> I pondered a variant of that conclusion that distinguished critical cleanup needs from the rest. Each shared resource (heavyweight locks, buffer pins, LWLocks) would have an on_shmem_exit() callback that cleans up the resource under a critical section. (AtProcExit_Buffers() used to fill such a role, but resowner.c's work during AbortTransaction() has mostly supplanted it.) The ShutdownPostgres callback would not use a critical section, so lesser failures in AbortTransaction() would not upgrade to a PANIC. But I'm leaning against such a complication on the grounds that it would add seldom-tested code paths posing as much a chance of eroding robustness as bolstering it.

I think here PANIC is a safe and less complicated solution for this situation. Apart from this, we can try to avoid palloc or other such errors in the AbortTransaction/CommitTransaction path.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] ERROR during end-of-xact/FATAL
On Wed, Nov 06, 2013 at 10:14:53AM +0530, Amit Kapila wrote: > On Thu, Oct 31, 2013 at 8:22 PM, Noah Misch wrote: > > If the original AbortTransaction() pertained to a FATAL, the situation is > > worse. errfinish() promotes the ERROR thrown from AbortTransaction() to > > another FATAL, > > isn't errstart promotes ERROR to FATAL? Right. > When I tried above scenario, I hit Assert at different place ... > This means that in the situation when an ERROR occurs in > AbortTransaction which is called as a result of FATAL error, there are > many more possibilities of Assert. Agreed. > About unclean FATAL-then-ERROR scenario, one way to deal at high level > could be to treat such a case as backend crash in which case > postmaster reinitialises shared memory and other stuff. > > > If we can't manage to > > free a shared memory resource like a lock or buffer pin, we really must > > PANIC. > > Can't we try to initialise the shared memory and other resources, > wouldn't that resolve the problem's that can occur due to scenario > explained by you? A PANIC will reinitialize everything relevant, largely resolving the problems around ERROR during FATAL. It's a heavy-handed solution, but it may well be the best solution. Efforts to harden CommitTransaction() and AbortTransaction() seem well-spent, but the additional effort to make FATAL exit cope where AbortTransaction() or another exit action could not cope seems to be slicing ever-smaller portions of additional robustness. I pondered a variant of that conclusion that distinguished critical cleanup needs from the rest. Each shared resource (heavyweight locks, buffer pins, LWLocks) would have an on_shmem_exit() callback that cleans up the resource under a critical section. (AtProcExit_Buffers() used to fill such a role, but resowner.c's work during AbortTransaction() has mostly supplanted it.) The ShutdownPostgres callback would not use a critical section, so lesser failures in AbortTransaction() would not upgrade to a PANIC. 
But I'm leaning against such a complication on the grounds that it would add seldom-tested code paths posing as much a chance of eroding robustness as bolstering it.

Thanks,
nm

--
Noah Misch
EnterpriseDB                     http://www.enterprisedb.com
Re: [HACKERS] ERROR during end-of-xact/FATAL
On Wed, Nov 6, 2013 at 9:40 AM, Alvaro Herrera wrote:
> Noah Misch wrote:
>> Incomplete list:
>>
>> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee relpathbackend() call palloc(); this is true in all supported branches. In 9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc(). (In fact, it does so even when the pending list is empty -- this is the only palloc() during a trivial transaction commit.) palloc() failure there yields a PANIC during commit.
>
> I think we should fix this routine to avoid the palloc when not necessary. That initial palloc is pointless.
>
> Also, there have been previous discussions about having relpathbackend not use palloc at all. That was only because we wanted to use it in pg_xlogdump which didn't have palloc support at the time, so it's no longer as pressing; but perhaps it's still worthy of consideration.

+1, but I'd like it better if we could find a way of avoiding the palloc in all cases. Panicking because we run out of memory at the wrong time isn't really very nice. Maybe the information needs to be maintained in the format in which it ultimately needs to be returned, so that we needn't rejigger it in the middle of a critical section.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] ERROR during end-of-xact/FATAL
Noah Misch wrote:
> Incomplete list:
>
> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee relpathbackend() call palloc(); this is true in all supported branches. In 9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc(). (In fact, it does so even when the pending list is empty -- this is the only palloc() during a trivial transaction commit.) palloc() failure there yields a PANIC during commit.

I think we should fix this routine to avoid the palloc when not necessary. That initial palloc is pointless.

Also, there have been previous discussions about having relpathbackend not use palloc at all. That was only because we wanted to use it in pg_xlogdump which didn't have palloc support at the time, so it's no longer as pressing; but perhaps it's still worthy of consideration.

--
Álvaro Herrera                 http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] ERROR during end-of-xact/FATAL
On Thu, Oct 31, 2013 at 8:22 PM, Noah Misch wrote:
> CommitTransaction() and AbortTransaction() both do much work, and large portions of that work either should not or must not throw errors. An error during either function will, as usual, siglongjmp() out. Ordinarily, PostgresMain() will regain control and fire off a fresh AbortTransaction(). The consequences thereof depend on the original function's progress:
>
> - Before the function updates CurrentTransactionState->state, an ERROR is fully acceptable. CommitTransaction() specifically places failure-prone tasks accordingly; AbortTransaction() has no analogous tasks.
>
> - After the function updates CurrentTransactionState->state, an ERROR yields a user-unfriendly e.g. "WARNING: AbortTransaction while in COMMIT state". This is not itself harmful, but we've largely kept the things that can fail for pedestrian reasons ahead of that point.
>
> - After CommitTransaction() calls RecordTransactionCommit() for an xid-bearing transaction, an ERROR upgrades to e.g. "PANIC: cannot abort transaction 805, it was already committed".
>
> - After AbortTransaction() calls ProcArrayEndTransaction() for an xid-bearing transaction, an ERROR will lead to this assertion failure:
>
>   TRAP: FailedAssertion("!(((allPgXact[proc->pgprocno].xid) != ((TransactionId) 0)))", File: "procarray.c", Line: 396)
>
> If the original AbortTransaction() pertained to a FATAL, the situation is worse. errfinish() promotes the ERROR thrown from AbortTransaction() to another FATAL,

isn't it errstart that promotes ERROR to FATAL?

> so we reenter proc_exit(). Thanks to the following logic in shmem_exit(),

the following comment is in proc_exit_prepare(), but the effect will be the same as described by you, due to the logic in shmem_exit().

> we will never return to AbortTransaction():
>
> /*
>  * call all the registered callbacks.
>  *
>  * Note that since we decrement on_proc_exit_index each time, if a
>  * callback calls ereport(ERROR) or ereport(FATAL) then it won't be
>  * invoked again when control comes back here (nor will the
>  * previously-completed callbacks). So, an infinite loop should not be
>  * possible.
>  */
>
> As a result, we miss any cleanups that had not yet happened in the original AbortTransaction(). In particular, this can leak heavyweight locks. An asserts build subsequently fails this way:
>
>   TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File: "proc.c", Line: 788)
>
> In a production build, the affected PGPROC slot just continues to hold the lock until the next backend claiming that slot calls LockReleaseAll(). Oops.
>
> Bottom line: most bets are off given an ERROR after RecordTransactionCommit() in CommitTransaction() or anywhere in AbortTransaction().

When I tried above scenario, I hit Assert at different place:

shmem_exit()->pgstat_beshutdown_hook()->pgstat_report_stat()

pgstat_report_stat(bool force)
{
..
Assert(entry->trans == NULL);
}

The reason is that in AbortTransaction, an ERROR is thrown before call to AtEOXact_PgStat(false). This means that in the situation when an ERROR occurs in AbortTransaction which is called as a result of FATAL error, there are many more possibilities of Assert.

> Now, while those assertion failures are worth preventing on general principle, the actual field importance depends on whether things actually do fail in the vulnerable end-of-xact work. We've prevented the errors that would otherwise be relatively common, but there are several rare ways to get a late ERROR. Incomplete list:
>
> - If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee relpathbackend() call palloc(); this is true in all supported branches. In 9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
>   (In fact, it does so even when the pending list is empty -- this is the only
>   palloc() during a trivial transaction commit.) palloc() failure there
>   yields a PANIC during commit.
>
> - ResourceOwnerRelease() calls FileClose() during abort, and FileClose()
>   raises an ERROR when close() returns EIO.
>
> - AtEOXact_Inval() can lead to calls like RelationReloadIndexInfo(), which has
>   many ways to throw errors. This precedes releasing heavyweight locks, so an
>   error here during an abort pertaining to a FATAL exit orphans locks as
>   described above. This relates to another recent thread:
>   http://www.postgresql.org/message-id/20130805170931.ga369...@tornado.leadboat.com
>
> What should we do to mitigate these problems? Certainly we can harden
> individual end-of-xact tasks to not throw errors, as we have in the past.
> What higher-level strategies should we consider? What about for the unclean
> result of the FATAL-then-ERROR scenario in particular?

Regarding the unclean FATAL-then-ERROR scenario, one way to deal with it at a
high level could be to treat s
[HACKERS] ERROR during end-of-xact/FATAL
CommitTransaction() and AbortTransaction() both do much work, and large
portions of that work either should not or must not throw errors. An error
during either function will, as usual, siglongjmp() out. Ordinarily,
PostgresMain() will regain control and fire off a fresh AbortTransaction().
The consequences thereof depend on the original function's progress:

- Before the function updates CurrentTransactionState->state, an ERROR is
  fully acceptable. CommitTransaction() specifically places failure-prone
  tasks accordingly; AbortTransaction() has no analogous tasks.

- After the function updates CurrentTransactionState->state, an ERROR yields a
  user-unfriendly e.g. "WARNING: AbortTransaction while in COMMIT state".
  This is not itself harmful, but we've largely kept the things that can fail
  for pedestrian reasons ahead of that point.

- After CommitTransaction() calls RecordTransactionCommit() for an xid-bearing
  transaction, an ERROR upgrades to e.g. "PANIC: cannot abort transaction
  805, it was already committed".

- After AbortTransaction() calls ProcArrayEndTransaction() for an xid-bearing
  transaction, an ERROR will lead to this assertion failure:

  TRAP: FailedAssertion("!(((allPgXact[proc->pgprocno].xid) !=
  ((TransactionId) 0)))", File: "procarray.c", Line: 396)

If the original AbortTransaction() pertained to a FATAL, the situation is
worse. errfinish() promotes the ERROR thrown from AbortTransaction() to
another FATAL, so we reenter proc_exit(). Thanks to the following logic in
shmem_exit(), we will never return to AbortTransaction():

    /*
     * call all the registered callbacks.
     *
     * Note that since we decrement on_proc_exit_index each time, if a
     * callback calls ereport(ERROR) or ereport(FATAL) then it won't be
     * invoked again when control comes back here (nor will the
     * previously-completed callbacks). So, an infinite loop should not be
     * possible.
     */

As a result, we miss any cleanups that had not yet happened in the original
AbortTransaction().
In particular, this can leak heavyweight locks. An asserts build subsequently
fails this way:

  TRAP: FailedAssertion("!(SHMQueueEmpty(&(MyProc->myProcLocks[i])))", File:
  "proc.c", Line: 788)

In a production build, the affected PGPROC slot just continues to hold the
lock until the next backend claiming that slot calls LockReleaseAll(). Oops.

Bottom line: most bets are off given an ERROR after RecordTransactionCommit()
in CommitTransaction() or anywhere in AbortTransaction().

Now, while those assertion failures are worth preventing on general principle,
the actual field importance depends on whether things actually do fail in the
vulnerable end-of-xact work. We've prevented the errors that would otherwise
be relatively common, but there are several rare ways to get a late ERROR.
Incomplete list:

- If smgrDoPendingDeletes() finds files to delete, mdunlink() and its callee
  relpathbackend() call palloc(); this is true in all supported branches. In
  9.3, due to commit 279628a0, smgrDoPendingDeletes() itself calls palloc().
  (In fact, it does so even when the pending list is empty -- this is the only
  palloc() during a trivial transaction commit.) palloc() failure there
  yields a PANIC during commit.

- ResourceOwnerRelease() calls FileClose() during abort, and FileClose()
  raises an ERROR when close() returns EIO.

- AtEOXact_Inval() can lead to calls like RelationReloadIndexInfo(), which has
  many ways to throw errors. This precedes releasing heavyweight locks, so an
  error here during an abort pertaining to a FATAL exit orphans locks as
  described above. This relates to another recent thread:
  http://www.postgresql.org/message-id/20130805170931.ga369...@tornado.leadboat.com

What should we do to mitigate these problems? Certainly we can harden
individual end-of-xact tasks to not throw errors, as we have in the past.
What higher-level strategies should we consider? What about for the unclean
result of the FATAL-then-ERROR scenario in particular?
If we can't manage to free a shared memory resource like a lock or buffer pin,
we really must PANIC. Releasing those things is quite reliable, though. The
tasks that have the highest chance of capsizing AbortTransaction() are of
backend-local interest, or they're tasks for which we tolerate failure as a
rule (e.g. unlinking files).

Robert Haas provided a large slice of the research for this report.

Thanks,
nm

--
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers