Re: error_severity of brin work item
This thread has stalled, and from the discussions there isn't consensus on how to proceed (if at all), so I'm marking this Returned with Feedback. If the discussion and work pick up again then it can be resubmitted. -- Daniel Gustafsson https://vmware.com/
Re: error_severity of brin work item
On Fri, Sep 24, 2021 at 10:42:36AM -0300, Alvaro Herrera wrote:
> I think the most reasonable action is to push the patch in
> https://postgr.es/m/20201123193957.GA21810@alvherre.pgsql to all
> branches, closing the immediate hole, and we can see about the xact-hook
> stuff (when we have it) to master only, which I view as a better
> mechanism to protect work-items going forward.  That's what I understand
> Justin was suggesting in his last reply.

I suggested using the 20-line solution rather than inventing new infrastructure just to avoid corner-case errors.

But I also don't understand why changing to use an INFO message doesn't work. vacuum_open_relation() outputs a WARNING for manual vacuum and an INFO for autovacuum when log_min_duration >= 0. Why couldn't this do the same thing?

I found a few instances of this in my logs, indicating that my understanding of the logic is not wrong.

log_time           | 2021-01-13 15:23:40.687-05
session_line       | 2
session_start_time | 2021-01-13 15:23:19-05
error_severity     | LOG
sql_state_code     | 42P01
message            | skipping vacuum of "pg_toast_984781820" --- relation no longer exists
backend_type       | autovacuum worker

-- Justin
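As a side note for readers following the severity discussion: the behavior Justin references (a skipped relation reported at a non-ERROR level, with the level depending on who is doing the vacuuming) can be sketched as a tiny, self-contained model. This is illustrative only, not the actual vacuum_open_relation() code; the function and enum names here are invented, and Justin's log excerpt above shows the autovacuum case landing at LOG severity.

```c
/*
 * Toy model of the severity choice discussed above: a skipped relation is
 * reported at WARNING for a manual VACUUM, but for autovacuum at LOG, and
 * only when autovacuum logging is enabled.  Illustrative names only.
 */
#include <assert.h>

enum msg_level { LEVEL_NONE, LEVEL_LOG, LEVEL_WARNING };

static enum msg_level
skip_message_level(int in_autovacuum, int log_min_duration)
{
	if (!in_autovacuum)
		return LEVEL_WARNING;	/* manual VACUUM: warn the client */
	if (log_min_duration >= 0)
		return LEVEL_LOG;		/* autovacuum with logging enabled */
	return LEVEL_NONE;			/* autovacuum, logging disabled */
}
```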
Re: error_severity of brin work item
On 2021-Sep-24, Jaime Casanova wrote:
> Do you plan to work on this for this CF?

I think the most reasonable action is to push the patch in https://postgr.es/m/20201123193957.GA21810@alvherre.pgsql to all branches, closing the immediate hole, and we can see about the xact-hook stuff (when we have it) to master only, which I view as a better mechanism to protect work-items going forward. That's what I understand Justin was suggesting in his last reply.

-- Álvaro Herrera 39°49'30"S 73°17'W — https://www.EnterpriseDB.com/
"La fuerza no está en los medios físicos sino que reside en una voluntad indomable" (Gandhi)
Re: error_severity of brin work item
On Wed, Mar 10, 2021 at 08:24:50AM -0500, David Steele wrote:
> On 12/1/20 5:25 PM, Justin Pryzby wrote:
> > On Tue, Dec 01, 2020 at 03:57:24PM -0300, Alvaro Herrera wrote:
> > > > Another idea is if perform_work_item() were responsible for discarding
> > > > relations which disappear.  Currently it does this, which is racy since it
> > > > holds no lock.
> > >
> > > That has the property that it remains contained in autovacuum.c, but no
> > > other advantages I think.
> >
> > It has the advantage that it moves all the try_open stuff out of brin.
> >
> > I started implementing this, and then realized that the try_open stuff *has* to
> > be in the brin_summarize function, to handle the case that someone passes a
> > non-index, since it's SQL exposed.
> >
> > So maybe we should use your LockOid patch now, and refactor in the future if we
> > add additional work-item types.
>
> Thoughts on this, Álvaro? I can see that the first version of this patch was
> not ideal but the rework seems to have stalled. Since it is a bug perhaps it
> would be better to get something in as Justin suggests?

Hi Álvaro,

Do you plan to work on this for this CF?

-- Jaime Casanova
Director de Servicios Profesionales
SystemGuards - Consultores de PostgreSQL
Re: error_severity of brin work item
On 12/1/20 5:25 PM, Justin Pryzby wrote:
> On Tue, Dec 01, 2020 at 03:57:24PM -0300, Alvaro Herrera wrote:
> > > Another idea is if perform_work_item() were responsible for discarding
> > > relations which disappear.  Currently it does this, which is racy since it
> > > holds no lock.
> >
> > That has the property that it remains contained in autovacuum.c, but no
> > other advantages I think.
>
> It has the advantage that it moves all the try_open stuff out of brin.
>
> I started implementing this, and then realized that the try_open stuff *has* to
> be in the brin_summarize function, to handle the case that someone passes a
> non-index, since it's SQL exposed.
>
> So maybe we should use your LockOid patch now, and refactor in the future if we
> add additional work-item types.

Thoughts on this, Álvaro? I can see that the first version of this patch was not ideal but the rework seems to have stalled. Since it is a bug perhaps it would be better to get something in as Justin suggests?

Regards,
--
-David
da...@pgmasters.net
Re: error_severity of brin work item
On Tue, Dec 01, 2020 at 03:57:24PM -0300, Alvaro Herrera wrote:
> On 2020-Dec-01, Justin Pryzby wrote:
> > This was an idea I made up - I don't know any of the details of this, but if
> > you give a hint I could look at it more.  There'd (still) be a race window, but
> > I think that's ok.
>
> See CommitTransaction() and friends, where AtEOXact_on_commit_actions()

Thanks. I was going to pursue the other idea first since this is new to me.

> Not sure what's a good place for OIDs to be added to the list.  We don't
> have AM-specific entry points for relation drop.  I think this is the
> weakest point of this.

I assume it would just add OIDs of *all* dropped rels, and the autovacuum half would silently ignore any OID for which there's no work item. (As it would do in any case.)

> > Another idea is if perform_work_item() were responsible for discarding
> > relations which disappear.  Currently it does this, which is racy since it
> > holds no lock.
>
> That has the property that it remains contained in autovacuum.c, but no
> other advantages I think.

It has the advantage that it moves all the try_open stuff out of brin.

I started implementing this, and then realized that the try_open stuff *has* to be in the brin_summarize function, to handle the case that someone passes a non-index, since it's SQL exposed.

So maybe we should use your LockOid patch now, and refactor in the future if we add additional work-item types.

-- Justin
Re: error_severity of brin work item
On 2020-Dec-01, Justin Pryzby wrote:
> This was an idea I made up - I don't know any of the details of this, but if
> you give a hint I could look at it more.  There'd (still) be a race window, but
> I think that's ok.

See CommitTransaction() and friends, where AtEOXact_on_commit_actions() and others are called. You'd have to create a new routine (say AtEOXact_Autovacuum, or more specific, AtEOXact_AutovacuumWorkItems), to be called at the right places in xact.c. Keep a global variable, say a list of OIDs. On subxact commit, the list is reassigned to its parent transaction; on subxact abort, the list is discarded. On top xact commit, the list of OIDs is passed to some new routine in autovacuum.c that scans the workitem array and deletes items as appropriate.

Not sure what's a good place for OIDs to be added to the list. We don't have AM-specific entry points for relation drop. I think this is the weakest point of this.

> Another idea is if perform_work_item() were responsible for discarding
> relations which disappear.  Currently it does this, which is racy since it
> holds no lock.

That has the property that it remains contained in autovacuum.c, but no other advantages I think.
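The transaction-hook bookkeeping Alvaro sketches above can be modeled in a few dozen lines of plain C. This is NOT PostgreSQL code and was never part of the thread's patches; all names are illustrative. Each subtransaction level keeps its own list of dropped-index OIDs; a subxact commit merges the list into its parent, a subxact abort discards it, and only OIDs surviving to top-level commit cancel matching work items.

```c
/*
 * Simplified, self-contained model of the proposed
 * AtEOXact_AutovacuumWorkItems bookkeeping (illustrative names only).
 */
#include <assert.h>

#define MAX_OIDS   8
#define MAX_LEVELS 4

typedef unsigned int Oid;

static Oid dropped[MAX_LEVELS][MAX_OIDS];	/* per-nesting-level OID lists */
static int ndropped[MAX_LEVELS];
static int level;				/* current (sub)transaction nesting level */

/* Called wherever an index drop is recorded. */
static void record_index_drop(Oid oid)
{
	dropped[level][ndropped[level]++] = oid;
}

/* Subxact commit: reassign the list to the parent transaction. */
static void subxact_commit(void)
{
	for (int i = 0; i < ndropped[level]; i++)
		dropped[level - 1][ndropped[level - 1]++] = dropped[level][i];
	ndropped[level] = 0;
	level--;
}

/* Subxact abort: the list is discarded. */
static void subxact_abort(void)
{
	ndropped[level] = 0;
	level--;
}

/* A stand-in for the shared work-item array. */
static Oid workitem_oid[MAX_OIDS] = {100, 200, 300};
static int workitem_used[MAX_OIDS] = {1, 1, 1};

/* Top xact commit: cancel work items whose relation was dropped. */
static void top_xact_commit(void)
{
	for (int i = 0; i < ndropped[0]; i++)
		for (int j = 0; j < MAX_OIDS; j++)
			if (workitem_used[j] && workitem_oid[j] == dropped[0][i])
				workitem_used[j] = 0;
	ndropped[0] = 0;
}
```

A drop recorded in a committed subtransaction cancels its work item at top-level commit, while a drop in an aborted subtransaction leaves the work item alone, which matches the semantics described above.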
Re: error_severity of brin work item
On Tue, Dec 01, 2020 at 11:07:30AM -0300, Alvaro Herrera wrote:
> > Should it be done in an AtCommit hook or something like that ?
>
> I didn't like this idea much on first read, on extensibility grounds,
> but perhaps it's not so bad because we can generalize it whenever
> there's pressure to add a second type of work-item (*if* that ever
> happens).
>
> I suppose the idea is that index_drop saves the index OID when a BRIN
> index with autosummarization=on is dropped, and then the
> AtCommit_WorkItems() call scans the items list and drops those that
> match any OIDs in the list.  (It does require to be mindful of subxact
> aborts, of course.)

This was an idea I made up - I don't know any of the details of this, but if you give a hint I could look at it more. There'd (still) be a race window, but I think that's ok.

Another idea is if perform_work_item() were responsible for discarding relations which disappear. Currently it does this, which is racy since it holds no lock.

	cur_relname = get_rel_name(workitem->avw_relation);
	cur_nspname = get_namespace_name(get_rel_namespace(workitem->avw_relation));
	cur_datname = get_database_name(MyDatabaseId);
	if (!cur_relname || !cur_nspname || !cur_datname)
		goto deleted2;

-- Justin
Re: error_severity of brin work item
On 2020-Nov-30, Justin Pryzby wrote:
> On Mon, Nov 30, 2020 at 08:47:32PM -0300, Alvaro Herrera wrote:
> > The more I look at this, the less I like it.  This would set a precedent
> > that any action that can be initiated from an autovac work-item has a
> > requirement of silently being discarded when it referenced a
> > non-existent relation.
>
> My original request was to change to INFO, which is what vacuum proper does at
> vacuum_open_relation().  I realize that still means that new work item handlers
> would have to do the LockOid, try_open dance - maybe it could be factored out.

As I understand it, INFO is not well suited to messages that are not going to the client. Anyway, my point is about the contortions that are needed to support the case, rather than what exactly we do when it happens.

Retrospectively, it's strange that this problem (what happens when indexes with pending work-items are dropped) hadn't manifested. It seems a pretty obvious one.

> > I'd rather have the code that drops the index go through the list of
> > work-items and delete those that reference that relation.
> >
> > I'm not sure if this is something that ought to be done in index_drop();
> > One objection might be that if the drop is rolled back, the work-items
> > are lost.
>
> Should it be done in an AtCommit hook or something like that ?

I didn't like this idea much on first read, on extensibility grounds, but perhaps it's not so bad because we can generalize it whenever there's pressure to add a second type of work-item (*if* that ever happens).

I suppose the idea is that index_drop saves the index OID when a BRIN index with autosummarization=on is dropped, and then the AtCommit_WorkItems() call scans the items list and drops those that match any OIDs in the list. (It does require being mindful of subxact aborts, of course.)
Re: error_severity of brin work item
On Mon, Nov 30, 2020 at 08:47:32PM -0300, Alvaro Herrera wrote:
> The more I look at this, the less I like it.  This would set a precedent
> that any action that can be initiated from an autovac work-item has a
> requirement of silently being discarded when it referenced a
> non-existent relation.

My original request was to change to INFO, which is what vacuum proper does at vacuum_open_relation(). I realize that still means that new work item handlers would have to do the LockOid, try_open dance - maybe it could be factored out.

I noticed this on multiple servers immediately after changing a nagios script to look for all ERRORs (rather than a specific list of log_messages) and for a longer time period.

> I'd rather have the code that drops the index go through the list of
> work-items and delete those that reference that relation.
>
> I'm not sure if this is something that ought to be done in index_drop();
> One objection might be that if the drop is rolled back, the work-items
> are lost.

Should it be done in an AtCommit hook or something like that?

-- Justin
Re: error_severity of brin work item
The more I look at this, the less I like it. This would set a precedent that any action that can be initiated from an autovac work-item has a requirement of silently being discarded when it references a non-existent relation.

I'd rather have the code that drops the index go through the list of work-items and delete those that reference that relation.

I'm not sure if this is something that ought to be done in index_drop(); one objection might be that if the drop is rolled back, the work-items are lost. It's the easiest approach, though; and work-items are supposed to be lossy anyway, and vacuum would fix the lack of summarization eventually. So, not pretty, but not all that bad. (Hopefully rolled-back drops are not all that common.)
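The objection in the message above (work-items lost if the drop is rolled back) can be illustrated with a toy model. This is not PostgreSQL code; the array and function here are invented stand-ins for the shared work-item list and an index_drop()-time hook.

```c
/*
 * Toy model of deleting work items at index-drop time (illustrative
 * names only).  The point: the deletion happens immediately, so a later
 * ROLLBACK of the DROP INDEX has nothing that restores the items --
 * acceptable only because work-items are lossy by design.
 */
#include <assert.h>

typedef unsigned int Oid;

#define NUM_WORKITEMS 4
static Oid work_oid[NUM_WORKITEMS] = {100, 200, 0, 0};
static int work_used[NUM_WORKITEMS] = {1, 1, 0, 0};

/* Called from (a model of) index_drop(): forget items for this index. */
static void forget_workitems(Oid indexoid)
{
	for (int i = 0; i < NUM_WORKITEMS; i++)
		if (work_used[i] && work_oid[i] == indexoid)
			work_used[i] = 0;
}
```

If the dropping transaction rolls back, the index is still there but its pending summarization requests are gone; as the message notes, a later vacuum would eventually re-summarize, so the loss is tolerable.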
Re: error_severity of brin work item
On Mon, Nov 23, 2020 at 04:39:57PM -0300, Alvaro Herrera wrote:
> I think this formulation (attached v3) has fewer moving parts.
>
> However, now that I did that, I wonder if this is really the best
> approach to solve this problem.  Maybe instead of doing this at the BRIN
> level, it should be handled at the autovac level, by having the worker
> copy the work-item to local memory and remove it from the shared list as
> soon as it is in progress.  That way, if *any* error occurs while trying
> to execute it, it will go away instead of being retried for all
> eternity.
>
> Preliminary patch for that attached as autovacuum-workitem.patch.
>
> I would propose to clean that up to apply instead of your proposed fix.

I don't know why you said it would be retried for eternity? I think the perform_work_item() TRY/CATCH avoids that.

Also, I think your idea doesn't solve my issue, that REINDEX CONCURRENTLY is causing vacuum to leave errors in my logs. I checked that the first patch avoids the issue and the 2nd one does not.

-- Justin
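Justin's point, that a TRY/CATCH around the work-item execution already prevents eternal retries, can be modeled standalone with setjmp/longjmp standing in for PostgreSQL's PG_TRY/PG_CATCH error machinery. This is illustrative only, not the actual perform_work_item() code.

```c
/*
 * Self-contained model of perform_work_item()'s error handling:
 * an error raised while executing a work item is caught, and the item
 * is released either way, so a failing item is attempted exactly once
 * rather than retried forever.  setjmp/longjmp stand in for
 * PG_TRY/PG_CATCH and ereport(ERROR) here; names are illustrative.
 */
#include <assert.h>
#include <setjmp.h>

static jmp_buf catch_buf;
static int item_used = 1;	/* the shared work item's "used" flag */
static int attempts = 0;

static void do_work_item(void)	/* always fails, like a dropped index */
{
	attempts++;
	longjmp(catch_buf, 1);		/* stand-in for ereport(ERROR, ...) */
}

static void perform_work_item(void)
{
	if (setjmp(catch_buf) == 0)
		do_work_item();			/* the "PG_TRY" block */
	/* the "PG_CATCH" path falls through here: error swallowed */
	item_used = 0;				/* item is released, not requeued */
}
```

Running the scheduler loop until no used items remain shows the failing item is executed once and then forgotten, which is why the error appears once per queued item in the server log rather than repeating forever.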
Re: error_severity of brin work item
I think this formulation (attached v3) has fewer moving parts.

However, now that I did that, I wonder if this is really the best approach to solve this problem. Maybe instead of doing this at the BRIN level, it should be handled at the autovac level, by having the worker copy the work-item to local memory and remove it from the shared list as soon as it is in progress. That way, if *any* error occurs while trying to execute it, it will go away instead of being retried for all eternity.

Preliminary patch for that attached as autovacuum-workitem.patch.

I would propose to clean that up to apply instead of your proposed fix.

>From e0b457866e503a1b70b94c540147026b652fb019 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Fri, 13 Nov 2020 13:39:31 -0300
Subject: [PATCH v3] Avoid errors in brin summarization..

..which can happen if an index is reindexed concurrently
---
 src/backend/access/brin/brin.c | 37 ++++++++++++++++++++-----------
 1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c60..4d012ebd76 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -33,6 +33,7 @@
 #include "postmaster/autovacuum.h"
 #include "storage/bufmgr.h"
 #include "storage/freespace.h"
+#include "storage/lmgr.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
@@ -886,21 +887,35 @@ brin_summarize_range(PG_FUNCTION_ARGS)

 	/*
 	 * We must lock table before index to avoid deadlocks. However, if the
-	 * passed indexoid isn't an index then IndexGetRelation() will fail.
-	 * Rather than emitting a not-very-helpful error message, postpone
-	 * complaining, expecting that the is-it-an-index test below will fail.
+	 * passed indexoid isn't an index, then IndexGetRelation(false) would
+	 * emit a not-very-helpful error message.  Instead, postpone complaining
+	 * until the is-it-an-index test, below.
 	 */
 	heapoid = IndexGetRelation(indexoid, true);
-	if (OidIsValid(heapoid))
-		heapRel = table_open(heapoid, ShareUpdateExclusiveLock);
-	else
-		heapRel = NULL;
+	if (heapoid != InvalidOid)
+		LockRelationOid(heapoid, ShareUpdateExclusiveLock);

-	indexRel = index_open(indexoid, ShareUpdateExclusiveLock);
+	/* Now we can open the index; silently skip if it's gone. */
+	indexRel = try_relation_open(indexoid, ShareUpdateExclusiveLock);
+	if (!indexRel)
+	{
+		if (heapoid != InvalidOid)
+			UnlockRelationOid(heapoid, ShareUpdateExclusiveLock);
+		PG_RETURN_INT32(0);
+	}
+
+	/* If passed a non-index, fail noisily */
+	if (indexRel->rd_rel->relkind != RELKIND_INDEX)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" is not an index",
+						RelationGetRelationName(indexRel))));
+
+	/* Now we can open the table */
+	heapRel = table_open(heapoid, NoLock);

 	/* Must be a BRIN index */
-	if (indexRel->rd_rel->relkind != RELKIND_INDEX ||
-		indexRel->rd_rel->relam != BRIN_AM_OID)
+	if (indexRel->rd_rel->relam != BRIN_AM_OID)
 		ereport(ERROR,
 				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 				 errmsg("\"%s\" is not a BRIN index",
@@ -916,7 +931,7 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 	 * barely possible that a race against an index drop/recreation could have
 	 * netted us the wrong table.  Recheck.
 	 */
-	if (heapRel == NULL || heapoid != IndexGetRelation(indexoid, false))
+	if (heapoid != IndexGetRelation(indexoid, false))
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_TABLE),
 				 errmsg("could not open parent table of index %s",
-- 
2.20.1

commit 2326df869ee8516b8b1f0e01930946e9a63800b5
Author: Alvaro Herrera
AuthorDate: Mon Nov 23 15:56:23 2020 -0300
CommitDate: Mon Nov 23 15:57:28 2020 -0300

    autovacuum-centered workitem fix

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index aa5b97fbac..77c1401bc9 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -2561,6 +2561,7 @@ deleted:
 	for (i = 0; i < NUM_WORKITEMS; i++)
 	{
 		AutoVacuumWorkItem *workitem = &AutoVacuumShmem->av_workItems[i];
+		AutoVacuumWorkItem my_workitem;

 		if (!workitem->avw_used)
 			continue;
@@ -2569,11 +2570,14 @@ deleted:
 		if (workitem->avw_database != MyDatabaseId)
 			continue;

+		my_workitem = *workitem;
+		workitem->avw_used = false;
+
 		/* claim this one, and release lock while performing it */
 		workitem->avw_active = true;
 		LWLockRelease(AutovacuumLock);

-		perform_work_item(workitem);
+		perform_work_item(&my_workitem);

 		/*
 		 * Check for config changes before acquiring lock for further jobs.
Re: error_severity of brin work item
On Thu, Nov 19, 2020 at 03:15:21PM -0300, Alvaro Herrera wrote:
> On 2020-Nov-19, Justin Pryzby wrote:
> > On Fri, Nov 13, 2020 at 12:11:21PM -0600, Justin Pryzby wrote:
> > > Your patch didn't actually say "try_relation_open", so didn't work.
> > > But it does work if I do that, and close the table.
>
> Thanks for fixing and testing.
>
> > That patch broke the case that a non-index is passed, which I addressed here.
>
> Hmm, I think the reaction to that should be the same as before, so
> rather than return 0, the patch should raise the same error that
> index_open() would.

The resulting logic is not very clear and requires a lot of commentary.

BTW I saw that in tablecmds.c, RangeVarCallbackForAttachIndex() does this:

	if (classform->relkind != RELKIND_PARTITIONED_INDEX &&
		classform->relkind != RELKIND_INDEX)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
				 errmsg("\"%s\" is not an index", rv->relname)));

Is it wrong to use ERRCODE_INVALID_OBJECT_DEFINITION? Most other places say ERRCODE_WRONG_OBJECT_TYPE.

Likewise, transformPartitionCmd() in parse_utilcmd.c:

				(errcode(ERRCODE_INVALID_OBJECT_DEFINITION),
				 errmsg("\"%s\" is not a partitioned table",
						RelationGetRelationName(parentRel))));

-- Justin

>From 3c6deeb57ba007b142f35312a436defb9d1b3783 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Fri, 13 Nov 2020 13:39:31 -0300
Subject: [PATCH] Avoid errors in brin summarization..

..which can happen if an index is reindexed concurrently
---
 src/backend/access/brin/brin.c | 34 ++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c60..a1089d5f9e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -886,9 +886,9 @@ brin_summarize_range(PG_FUNCTION_ARGS)

 	/*
 	 * We must lock table before index to avoid deadlocks. However, if the
-	 * passed indexoid isn't an index then IndexGetRelation() will fail.
-	 * Rather than emitting a not-very-helpful error message, postpone
-	 * complaining, expecting that the is-it-an-index test below will fail.
+	 * passed indexoid isn't an index, then IndexGetRelation(true) would
+	 * emit a not-very-helpful error message.  Instead, postpone complaining
+	 * until the is-it-an-index test, below.
 	 */
 	heapoid = IndexGetRelation(indexoid, true);
 	if (OidIsValid(heapoid))
@@ -896,11 +896,33 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 	else
 		heapRel = NULL;

-	indexRel = index_open(indexoid, ShareUpdateExclusiveLock);
+	indexRel = try_relation_open(indexoid, ShareUpdateExclusiveLock);
+
+	/* Silently skip autovacuum work-items if an index has disappeared. */
+	if (!indexRel)
+	{
+		if (heapRel)
+			table_close(heapRel, ShareUpdateExclusiveLock);
+
+		PG_RETURN_INT32(0);
+	}
+
+	/* If passed a non-index, fail noisily */
+	if (indexRel->rd_rel->relkind != RELKIND_INDEX)
+		ereport(ERROR,
+				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
+				 errmsg("\"%s\" is not an index",
+						RelationGetRelationName(indexRel))));
+
+	/* If the table wasn't opened for some other reason, silently skip it */
+	if (!heapRel)
+	{
+		relation_close(indexRel, ShareUpdateExclusiveLock);
+		PG_RETURN_INT32(0);
+	}

 	/* Must be a BRIN index */
-	if (indexRel->rd_rel->relkind != RELKIND_INDEX ||
-		indexRel->rd_rel->relam != BRIN_AM_OID)
+	if (indexRel->rd_rel->relam != BRIN_AM_OID)
 		ereport(ERROR,
 				(errcode(ERRCODE_WRONG_OBJECT_TYPE),
 				 errmsg("\"%s\" is not a BRIN index",
-- 
2.17.0
Re: error_severity of brin work item
On 2020-Nov-19, Justin Pryzby wrote:
> On Fri, Nov 13, 2020 at 12:11:21PM -0600, Justin Pryzby wrote:
> > Your patch didn't actually say "try_relation_open", so didn't work.
> > But it does work if I do that, and close the table.

Thanks for fixing and testing.

> That patch broke the case that a non-index is passed, which I addressed here.

Hmm, I think the reaction to that should be the same as before, so rather than return 0, the patch should raise the same error that index_open() would.
Re: error_severity of brin work item
On Fri, Nov 13, 2020 at 12:11:21PM -0600, Justin Pryzby wrote:
> On Fri, Nov 13, 2020 at 01:39:31PM -0300, Alvaro Herrera wrote:
> > On 2020-Nov-13, Justin Pryzby wrote:
> > > I saw a bunch of these in my logs:
> > >
> > > log_time | 2020-10-25 22:59:45.619-07
> > > database |
> > > left | could not open relation with OID 292103095
> > > left | processing work entry for relation "ts.child.alarms_202010_alarm_clear_time_idx"
> > >
> > > Those happen following a REINDEX job on that index.
> > >
> > > I think that should be more like an INFO message, since that's what vacuum does
> > > (vacuum_open_relation), and a queued work item is even more likely to hit a
> > > dropped relation.
> >
> > Ah, interesting.  Yeah, I agree this is a bug.  I think it can be fixed
> > by using try_relation_open() on the index; if that returns NULL, discard
> > the work item.
> >
> > Does this patch solve the problem?
>
> Your patch didn't actually say "try_relation_open", so didn't work.
> But it does work if I do that, and close the table.

That patch broke the case that a non-index is passed, which I addressed here. I wondered if the function should return NULL in those cases, but it seems to be "impossible".

>From ef5685253a04cf720574f21b711e177e13299ba3 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Fri, 13 Nov 2020 13:39:31 -0300
Subject: [PATCH v3] Avoid errors in brin summarization..

..which can happen if an index is reindexed concurrently
---
 src/backend/access/brin/brin.c     | 22 +++++++++++++++-------
 src/test/regress/expected/brin.out |  6 +++++-
 2 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c60..46e5269260 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -887,16 +887,24 @@ brin_summarize_range(PG_FUNCTION_ARGS)

 	/*
 	 * We must lock table before index to avoid deadlocks.  However, if the
 	 * passed indexoid isn't an index then IndexGetRelation() will fail.
-	 * Rather than emitting a not-very-helpful error message, postpone
-	 * complaining, expecting that the is-it-an-index test below will fail.
+	 * Rather than emitting a not-very-helpful error message, prepare to
+	 * return without doing anything.  This allows autovacuum work-items to be
+	 * silently discarded rather than uselessly accumulating error messages in
+	 * the server log.
 	 */
 	heapoid = IndexGetRelation(indexoid, true);
-	if (OidIsValid(heapoid))
-		heapRel = table_open(heapoid, ShareUpdateExclusiveLock);
-	else
-		heapRel = NULL;
+	heapRel = OidIsValid(heapoid) ?
+		table_open(heapoid, ShareUpdateExclusiveLock) : NULL;

-	indexRel = index_open(indexoid, ShareUpdateExclusiveLock);
+	if (heapRel == NULL)
+		PG_RETURN_INT32(0);
+
+	indexRel = try_relation_open(indexoid, ShareUpdateExclusiveLock);
+	if (indexRel == NULL)
+	{
+		table_close(heapRel, ShareUpdateExclusiveLock);
+		PG_RETURN_INT32(0);
+	}

 	/* Must be a BRIN index */
 	if (indexRel->rd_rel->relkind != RELKIND_INDEX ||
diff --git a/src/test/regress/expected/brin.out b/src/test/regress/expected/brin.out
index e53d6e4885..51dff7295d 100644
--- a/src/test/regress/expected/brin.out
+++ b/src/test/regress/expected/brin.out
@@ -405,7 +405,11 @@ UPDATE brintest SET int8col = int8col * int4col;
 UPDATE brintest SET textcol = '' WHERE textcol IS NOT NULL;
 -- Tests for brin_summarize_new_values
 SELECT brin_summarize_new_values('brintest'); -- error, not an index
-ERROR:  "brintest" is not an index
+ brin_summarize_new_values
+---------------------------
+                         0
+(1 row)
+
 SELECT brin_summarize_new_values('tenk1_unique1'); -- error, not a BRIN index
 ERROR:  "tenk1_unique1" is not a BRIN index
 SELECT brin_summarize_new_values('brinidx'); -- ok, no change expected
-- 
2.17.0
Re: error_severity of brin work item
On Fri, Nov 13, 2020 at 01:39:31PM -0300, Alvaro Herrera wrote:
> On 2020-Nov-13, Justin Pryzby wrote:
> > I saw a bunch of these in my logs:
> >
> > log_time | 2020-10-25 22:59:45.619-07
> > database |
> > left | could not open relation with OID 292103095
> > left | processing work entry for relation "ts.child.alarms_202010_alarm_clear_time_idx"
> >
> > Those happen following a REINDEX job on that index.
> >
> > I think that should be more like an INFO message, since that's what vacuum does
> > (vacuum_open_relation), and a queued work item is even more likely to hit a
> > dropped relation.
>
> Ah, interesting.  Yeah, I agree this is a bug.  I think it can be fixed
> by using try_relation_open() on the index; if that returns NULL, discard
> the work item.
>
> Does this patch solve the problem?

Your patch didn't actually say "try_relation_open", so didn't work. But it does work if I do that, and close the table.

I tested like:

pryzbyj=# ALTER SYSTEM SET backtrace_functions='try_relation_open,relation_open';
pryzbyj=# ALTER SYSTEM SET autovacuum_naptime=3; SELECT pg_reload_conf();
pryzbyj=# CREATE TABLE tt AS SELECT generate_series(1,)i;
pryzbyj=# CREATE INDEX ON tt USING brin(i) WITH(autosummarize,pages_per_range=1);
pryzbyj=# \! while :; do psql -h /tmp -qc 'SET client_min_messages=info' -c 'REINDEX INDEX CONCURRENTLY tt_i_idx'; done&
-- run this 5-10 times and hit the "...was not recorded" message, which for
-- whatever reason causes the race condition involving work queue
pryzbyj=# UPDATE tt SET i=1+i;

2020-11-13 11:50:46.093 CST [30687] ERROR:  could not open relation with OID 1110882
2020-11-13 11:50:46.093 CST [30687] CONTEXT:  processing work entry for relation "pryzbyj.public.tt_i_idx"
2020-11-13 11:50:46.093 CST [30687] BACKTRACE:
	postgres: autovacuum worker pryzbyj(+0xb9ce8) [0x55acf2af0ce8]
	postgres: autovacuum worker pryzbyj(index_open+0xb) [0x55acf2bab59b]
	postgres: autovacuum worker pryzbyj(brin_summarize_range+0x8f) [0x55acf2b5b5bf]
	postgres: autovacuum worker pryzbyj(DirectFunctionCall2Coll+0x62) [0x55acf2f40372]
	...

-- Justin

>From e08c6d3e2b10964633904ff247e70330077d31b4 Mon Sep 17 00:00:00 2001
From: Alvaro Herrera
Date: Fri, 13 Nov 2020 13:39:31 -0300
Subject: [PATCH v2] error_severity of brin work item
---
 src/backend/access/brin/brin.c | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c60..8278a5209c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -887,8 +887,10 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 	/*
 	 * We must lock table before index to avoid deadlocks. However, if the
 	 * passed indexoid isn't an index then IndexGetRelation() will fail.
-	 * Rather than emitting a not-very-helpful error message, postpone
-	 * complaining, expecting that the is-it-an-index test below will fail.
+	 * Rather than emitting a not-very-helpful error message, prepare to
+	 * return without doing anything.  This allows autovacuum work-items to be
+	 * silently discarded rather than uselessly accumulating error messages in
+	 * the server log.
 	 */
 	heapoid = IndexGetRelation(indexoid, true);
 	if (OidIsValid(heapoid))
@@ -896,7 +898,14 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 	else
 		heapRel = NULL;

-	indexRel = index_open(indexoid, ShareUpdateExclusiveLock);
+	indexRel = try_relation_open(indexoid, ShareUpdateExclusiveLock);
+	if (heapRel == NULL || indexRel == NULL)
+	{
+		if (heapRel != NULL)
+			table_close(heapRel, ShareUpdateExclusiveLock);
+
+		PG_RETURN_INT32(0);
+	}

 	/* Must be a BRIN index */
 	if (indexRel->rd_rel->relkind != RELKIND_INDEX ||
-- 
2.17.0
Re: error_severity of brin work item
On 2020-Nov-13, Justin Pryzby wrote:
> I saw a bunch of these in my logs:
>
> log_time | 2020-10-25 22:59:45.619-07
> database |
> left | could not open relation with OID 292103095
> left | processing work entry for relation "ts.child.alarms_202010_alarm_clear_time_idx"
>
> Those happen following a REINDEX job on that index.
>
> I think that should be more like an INFO message, since that's what vacuum does
> (vacuum_open_relation), and a queued work item is even more likely to hit a
> dropped relation.

Ah, interesting. Yeah, I agree this is a bug. I think it can be fixed by using try_relation_open() on the index; if that returns NULL, discard the work item.

Does this patch solve the problem?

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1f72562c60..6021b16a01 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -887,8 +887,10 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 	/*
 	 * We must lock table before index to avoid deadlocks. However, if the
 	 * passed indexoid isn't an index then IndexGetRelation() will fail.
-	 * Rather than emitting a not-very-helpful error message, postpone
-	 * complaining, expecting that the is-it-an-index test below will fail.
+	 * Rather than emitting a not-very-helpful error message, prepare to
+	 * return without doing anything.  This allows autovacuum work-items to be
+	 * silently discarded rather than uselessly accumulating error messages in
+	 * the server log.
 	 */
 	heapoid = IndexGetRelation(indexoid, true);
 	if (OidIsValid(heapoid))
@@ -897,6 +899,11 @@ brin_summarize_range(PG_FUNCTION_ARGS)
 		heapRel = NULL;

 	indexRel = index_open(indexoid, ShareUpdateExclusiveLock);
+	if (heapRel == NULL || indexRel == NULL)
+	{
+		/* Table or index don't exist anymore; ignore this request */
+		PG_RETURN_INT32(0);
+	}

 	/* Must be a BRIN index */
 	if (indexRel->rd_rel->relkind != RELKIND_INDEX ||
error_severity of brin work item
I saw a bunch of these in my logs:

log_time | 2020-10-25 22:59:45.619-07
database |
left     | could not open relation with OID 292103095
left     | processing work entry for relation "ts.child.alarms_202010_alarm_clear_time_idx"

Those happen following a REINDEX job on that index.

I think that should be more like an INFO message, since that's what vacuum does (vacuum_open_relation), and a queued work item is even more likely to hit a dropped relation. It's easy to hit by setting autovacuum_naptime=1 and looping around REINDEX CONCURRENTLY while updating a table.

Autovacuum is queueing work items for later:

src/backend/postmaster/autovacuum.c-		switch (workitem->avw_type)
src/backend/postmaster/autovacuum.c-		{
src/backend/postmaster/autovacuum.c-			case AVW_BRINSummarizeRange:
src/backend/postmaster/autovacuum.c:				DirectFunctionCall2(brin_summarize_range,
src/backend/postmaster/autovacuum.c-									ObjectIdGetDatum(workitem->avw_relation),
src/backend/postmaster/autovacuum.c-									Int64GetDatum((int64) workitem->avw_blockNumber));
src/backend/postmaster/autovacuum.c-				break;

And if the index is missing, in brin_summarize_range(PG_FUNCTION_ARGS):

|	/*
|	 * We must lock table before index to avoid deadlocks. However, if the
|	 * passed indexoid isn't an index then IndexGetRelation() will fail.
|	 * Rather than emitting a not-very-helpful error message, postpone
|	 * complaining, expecting that the is-it-an-index test below will fail.
|	 */
|	heapoid = IndexGetRelation(indexoid, true);
|	if (OidIsValid(heapoid))
|		heapRel = table_open(heapoid, ShareUpdateExclusiveLock);
|	else
|		heapRel = NULL;
|
|	indexRel = index_open(indexoid, ShareUpdateExclusiveLock);

table_open is successful and then index_open fails. (I thought locks would have prevented that?)

Justin