Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
> I've pushed the executor part of this, but mostly not the planner part,
> because I didn't think the latter was anywhere near ready for prime time:
> the regression test changes it was causing were entirely bogus.

Hi Tom,

Thanks for the commit and the explanation. I'll try to address your comments if I continue working on the planner part.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
Alexander Kuzmenkov writes:
> On 04.09.2017 15:17, Alexey Chernyshov wrote:
>> make check-world fails on contrib/postgres_fdw because of Subquery Scan on
>> ... Probably, query plan was changed.
> Thanks for testing! This is the same problem as the one in
> 'select_distinct' I mentioned before. I changed the test, the updated
> patch is attached.

I've pushed the executor part of this, but mostly not the planner part, because I didn't think the latter was anywhere near ready for prime time: the regression test changes it was causing were entirely bogus.

You had basically two categories of plan changes showing up in the regression tests. One was the insertion of Subquery Scan nodes where they hadn't been before; that was because the use_physical_tlist change broke the optimization that removed no-op Subquery Scans. I fixed that by narrowing the use_physical_tlist change to apply only to BitmapHeapPath nodes, which is the only case where there would be any benefit anyway.

The remaining plan diffs after making that change all amounted to replacing regular index-only scan plans with bitmap scans, which seems to me to be silly on its face: if we can use an IOS then it is unlikely that a bitmap scan will be better. Those changes indicate that the cost adjustment you'd inserted in cost_bitmap_heap_scan was way too optimistic, which is hardly surprising. I think a realistic adjustment would have to account for all or most of these factors:

* Whether the index AM is ever going to return recheck = false. The planner has no way to know that at present, but since there are lots of cases where it simply won't ever happen, I think assuming that it always will is not acceptable. Perhaps we could extend the AM API so that we could find out whether recheck would happen always, never, or sometimes. (Doing better than "sometimes" might be too hard, but I think most opclasses are going to be "always" or "never" anyway.)
* Whether the bitmap will become lossy, causing us to have to make rechecks anyway. We could probably estimate that pretty well based on comparing the initial number-of-pages estimate to work_mem.

* Whether the plan will need to fetch heap tuples to make filter-qual checks. In principle the planner ought to know that. In practice, right now it doesn't bother to figure out whether the qual will be empty until createplan.c time, because that's rather a significant amount of work and it's undesirable to expend it for paths we might not end up using. We might be able to approximate it better than we do now, though. It's a live problem for regular indexscan costing as well as bitmap scans, IIRC.

* What fraction of the table is actually all-visible. You'd effectively hard-wired that at 50%, but it's easy to do better --- the regular IOS code does

    if (indexonly)
        pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));

and it would be appropriate to do the same here if we conclude that the other gating conditions apply.

Without a patch that deals more realistically with these concerns, I think we're better off not changing the cost estimate at all.

As far as the executor side goes, I made several cosmetic changes and one not so cosmetic one: I changed the decision about whether to prefetch so that it looks at whether the potential prefetch page is all-visible. This still relies on the same assumption you had that the recheck flag will stay the same from one page to the next, but at least it's not assuming that the all-visible state will stay the same.

I'm going to mark the CF entry closed, but if you feel like working on the cost estimate business, feel free to submit another patch for that.

regards, tom lane
Re: [HACKERS] Index only scan for cube and seg
Hi!

> On 29 Oct 2017, at 2:24, Alexander Korotkov wrote:
>
> As I can see, cube GiST code always uses DatumGetNDBOX() macro to transform
> Datum to (NDBOX *).
>
> #define DatumGetNDBOX(x) ((NDBOX *) PG_DETOAST_DATUM(x))
>
> Thus, it should be safe to just remove both compress/decompress methods from
> existing opclass.

Alexander, Tom, you are absolutely right. I was sure there was toasting code in cube's compress, but it was never there. Here is a patch for cube that drops these functions.

Best regards, Andrey Borodin.

0001-Enable-Index-Only-Scan-in-cube.patch
Description: Binary data

0001-Enable-Index-Only-Scan-in-seg.patch
Description: Binary data
Re: [HACKERS] Index only scan for cube and seg
On Fri, Oct 27, 2017 at 9:54 PM, Tom Lane wrote:
> Robert Haas writes:
>> On Thu, Oct 26, 2017 at 12:22 PM, Andrey Borodin wrote:
>>> For cube there is new default opclass.
>
>> I seem to recall that changing the default opclass causes unsolvable
>> problems with upgrades. You might want to check the archives for
>> previous discussions of this issue; unfortunately, I don't recall the
>> details off-hand.
>
> Quite aside from that, replacing the opclass with a new one creates
> user-visible headaches that I don't think are justified, i.e. having to
> reconstruct indexes in order to get the benefit.
>
> Maybe I'm missing something, but ISTM you could just drop the compress
> function and call it good. This would mean that an IOS scan would
> sometimes hand back a toast-compressed value, but what's the problem
> with that?

+1, I think in this case replacing the default opclass or even duplicating the opclass isn't justified.

> (The only reason for making a decompress function that just detoasts
> is that your other support functions then do not have to consider
> the possibility that they're handed a toast-compressed value. I have
> not checked whether that really matters for cube.)

As I can see, cube GiST code always uses the DatumGetNDBOX() macro to transform a Datum to (NDBOX *):

    #define DatumGetNDBOX(x) ((NDBOX *) PG_DETOAST_DATUM(x))

Thus, it should be safe to just remove both compress/decompress methods from the existing opclass.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Index only scan for cube and seg
Robert Haas writes:
> On Thu, Oct 26, 2017 at 12:22 PM, Andrey Borodin wrote:
>> For cube there is new default opclass.

> I seem to recall that changing the default opclass causes unsolvable
> problems with upgrades. You might want to check the archives for
> previous discussions of this issue; unfortunately, I don't recall the
> details off-hand.

Quite aside from that, replacing the opclass with a new one creates user-visible headaches that I don't think are justified, i.e. having to reconstruct indexes in order to get the benefit.

Maybe I'm missing something, but ISTM you could just drop the compress function and call it good. This would mean that an IOS scan would sometimes hand back a toast-compressed value, but what's the problem with that?

(The only reason for making a decompress function that just detoasts is that your other support functions then do not have to consider the possibility that they're handed a toast-compressed value. I have not checked whether that really matters for cube.)

regards, tom lane
Re: [HACKERS] Index only scan for cube and seg
On Thu, Oct 26, 2017 at 12:22 PM, Andrey Borodin wrote:
> For cube there is new default opclass.

I seem to recall that changing the default opclass causes unsolvable problems with upgrades. You might want to check the archives for previous discussions of this issue; unfortunately, I don't recall the details off-hand.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[HACKERS] Index only scan for cube and seg
Hi hackers!

Here are patches enabling Index Only Scan for the cube and seg extensions. These patches follow this discussion [0].

For cube there is a new default opclass. We cannot drop the old opclass, because it could TOAST some cube values on rare occasions. Index Only Scan is enabled only for newly created indexes. Btw, I can add a fetch function to the old opclass so that IOS would be enabled for it too.

For seg, the compress and decompress functions are dropped from the opclass and the extension, and Index Only Scan is enabled.

There are two more functions which could be deleted:
ghstore_decompress
g_intbig_decompress
But deleting them will not lead to any observable consequences.

[0] https://www.postgresql.org/message-id/flat/CAJEAwVELVx9gYscpE%3DBe6iJxvdW5unZ_LkcAaVNSeOwvdwtD%3DA%40mail.gmail.com#CAJEAwVELVx9gYscpE=Be6iJxvdW5unZ_LkcAaVNSeOwvdwtD=a...@mail.gmail.com

0001-Create-cube-opclass-without-toasting.patch
Description: Binary data

0001-Enable-Index-Only-Scan-in-seg.patch
Description: Binary data
Re: [HACKERS] Index expression syntax
On 29.09.2017 11:03, Marko Tiikkaja wrote:
> On Fri, Sep 29, 2017 at 9:31 AM, Konstantin Knizhnik wrote:
>> I wonder why syntax error is produced in this case:
>>
>> postgres=# create index metaindex on foo using gin(to_tsvector('english', x)||to_tsvector('english',y));
>> ERROR:  syntax error at or near "||"
>> LINE 1: ...taindex on foo using gin(to_tsvector('english', x)||to_tsvec...
>>
>> [ .. ]
>>
>> expression: An expression based on one or more columns of the table. The
>> expression usually must be written with surrounding parentheses, as shown
>> in the syntax. However, the parentheses can be omitted if the expression
>> has the form of a function call.
>>
>> So the documentation states that sometimes it is possible to avoid
>> parentheses, but it is unclear why I have to use double parentheses...
>> I think that either the grammar should be fixed or the documentation
>> should be updated.
>
> Your expression is clearly not a function call, it's a concatenation of
> two of them. The documentation seems perfectly accurate to me?

Oh, sorry! You are right. I just didn't notice the extra parentheses in the CREATE INDEX syntax for the case of using expressions.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Index expression syntax
On Fri, Sep 29, 2017 at 9:31 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
> I wonder why syntax error is produced in this case:
>
> postgres=# create index metaindex on foo using gin(to_tsvector('english', x)||to_tsvector('english',y));
> ERROR:  syntax error at or near "||"
> LINE 1: ...taindex on foo using gin(to_tsvector('english', x)||to_tsvec...
>
> [ .. ]
>
> *expression:* An expression based on one or more columns of the table.
> The expression usually must be written with surrounding parentheses, as
> shown in the syntax. However, the parentheses can be omitted if the
> expression has the form of a function call.
>
> So documentations states that sometimes it is possible to avoid
> parentheses, but it is unclear why I have to use double parentheses...
> I think that either grammar should be fixed, either documentation should
> be updated.

Your expression is clearly not a function call, it's a concatenation of two of them. The documentation seems perfectly accurate to me?

.m
[HACKERS] Index expression syntax
I wonder why a syntax error is produced in this case:

postgres=# create index metaindex on foo using gin(to_tsvector('english', x)||to_tsvector('english',y));
ERROR:  syntax error at or near "||"
LINE 1: ...taindex on foo using gin(to_tsvector('english', x)||to_tsvec...

The error can be eliminated if extra surrounding parentheses are added:

postgres=# create index metaindex on foo using gin((to_tsvector('english', x)||to_tsvector('english',y)));
CREATE INDEX

The PostgreSQL documentation says:

CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON table_name [ USING method ]
    ( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
    [ WITH ( storage_parameter = value [, ... ] ) ]
    [ TABLESPACE tablespace_name ]
    [ WHERE predicate ]

expression: An expression based on one or more columns of the table. The expression usually must be written with surrounding parentheses, as shown in the syntax. However, the parentheses can be omitted if the expression has the form of a function call.

So the documentation states that sometimes it is possible to avoid parentheses, but it is unclear why I have to use double parentheses... I think that either the grammar should be fixed or the documentation should be updated.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Index Only Scan support for cube
On Wed, Sep 20, 2017 at 8:26 AM, Andrey Borodin wrote:
> Hi hackers!
>
>> On 23 May 2017, at 14:53, Andrew Borodin wrote:
>>
>> Here's a small patch that implements fetch function necessary for
>> Index Only Scans that use cube data type.
>
> Tom Lane have just commited d3a4f89 (Allow no-op GiST support functions to
> be omitted)

Thanks, Tom! : )

> "Index Only Scan support for cube" patch now is obsolete. I'm working on
> another similar patch for contribs to support GiST IOS and remove no-op
> support functions.

Good.

BTW, some strangeness of g_cube_decompress() caught my eye. It compares the results of two evaluations of the same expression, DatumGetNDBOXP(entry->key):

    NDBOX *key = DatumGetNDBOXP(entry->key);
    if (key != DatumGetNDBOXP(entry->key))

In fact it's correct, because it compares the results of two detoastings. If the datum isn't toasted, the results are the same pointer. And if the datum is toasted, the results are two different allocations of the detoasted datum. However, we do an extra detoasting here. For comparison, see gbt_var_decompress(): there is no extra detoasting there.

    GBT_VARKEY *key = (GBT_VARKEY *) PG_DETOAST_DATUM(entry->key);
    if (key != (GBT_VARKEY *) DatumGetPointer(entry->key))

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Index Only Scan support for cube
Hi hackers!

> On 23 May 2017, at 14:53, Andrew Borodin wrote:
>
> Here's a small patch that implements fetch function necessary for
> Index Only Scans that use cube data type.

Tom Lane has just committed d3a4f89 (Allow no-op GiST support functions to be omitted). Thanks, Tom! : )

The "Index Only Scan support for cube" patch is now obsolete. I'm working on another similar patch for contribs to support GiST IOS and remove no-op support functions.

Best regards, Andrey Borodin.
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature:      tested, passed
Spec compliant:          tested, passed
Documentation:           tested, passed

One thing I have noticed is a trailing whitespace after "bogus one":

123 +            * If we don't have to fetch the tuple, just return a
124 +            * bogus one
125 +            */

The new status of this patch is: Ready for Committer
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
On 04.09.2017 15:17, Alexey Chernyshov wrote: make check-world fails on contrib/postgres_fdw because of Subquery Scan on ... Probably, query plan was changed. Hi Alexey, Thanks for testing! This is the same problem as the one in 'select_distinct' I mentioned before. I changed the test, the updated patch is attached. -- Alexander Kuzmenkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out index c19b3318c7..5f5f65d60c 100644 --- a/contrib/postgres_fdw/expected/postgres_fdw.out +++ b/contrib/postgres_fdw/expected/postgres_fdw.out @@ -2203,22 +2203,23 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2 -- join with lateral reference EXPLAIN (VERBOSE, COSTS OFF) SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM ft1 t2, ft2 t3 WHERE t2.c1 = t3.c1 AND t2.c2 = t1.c2) q ORDER BY t1."C 1" OFFSET 10 LIMIT 10; - QUERY PLAN - +QUERY PLAN +-- Limit Output: t1."C 1" -> Nested Loop Output: t1."C 1" -> Index Scan using t1_pkey on "S 1"."T 1" t1 Output: t1."C 1", t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8 - -> HashAggregate - Output: t2.c1, t3.c1 - Group Key: t2.c1, t3.c1 - -> Foreign Scan + -> Subquery Scan on q + -> HashAggregate Output: t2.c1, t3.c1 - Relations: (public.ft1 t2) INNER JOIN (public.ft2 t3) - Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1")) AND ((r1.c2 = $1::integer -(13 rows) + Group Key: t2.c1, t3.c1 + -> Foreign Scan + Output: t2.c1, t3.c1 + Relations: (public.ft1 t2) INNER JOIN (public.ft2 t3) + Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1")) AND ((r1.c2 = $1::integer +(14 rows) SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM ft1 t2, ft2 t3 WHERE t2.c1 = t3.c1 AND t2.c2 = t1.c2) q ORDER BY t1."C 1" 
OFFSET 10 LIMIT 10; C 1 @@ -2672,16 +2673,17 @@ select c2, sum(c1) from ft2 group by c2 having avg(c1) < 500 and sum(c1) < 49800 -- Unshippable HAVING clause will be evaluated locally, and other qual in HAVING clause is pushed down explain (verbose, costs off) select count(*) from (select c5, count(c1) from ft1 group by c5, sqrt(c2) having (avg(c1) / avg(c1)) * random() <= 1 and avg(c1) < 500) x; - QUERY PLAN -- + QUERY PLAN +--- Aggregate Output: count(*) - -> Foreign Scan - Output: ft1.c5, NULL::bigint, (sqrt((ft1.c2)::double precision)) - Filter: (avg(ft1.c1)) / (avg(ft1.c1::double precision * random()) <= '1'::double precision) - Relations: Aggregate on (public.ft1) - Remote SQL: SELECT c5, NULL::bigint, sqrt(c2), avg("C 1") FROM "S 1"."T 1" GROUP BY c5, (sqrt(c2)) HAVING ((avg("C 1") < 500::numeric)) -(7 rows) + -> Subquery Scan on x + -> Foreign Scan + Output: ft1.c5, NULL::bigint, (sqrt((ft1.c2)::double precision)) + Filter: (avg(ft1.c1)) / (avg(ft1.c1::double precision * random()) <= '1'::double precision) + Relations: Aggregate on (public.ft1) + Remote SQL: SELECT c5, NULL::bigint, sqrt(c2), avg("C 1") FROM "S 1"."T 1" GROUP BY c5, (sqrt(c2)) HAVING ((avg("C 1") < 500::numeric)) +(8 rows) select count(*) from (select c5, count(c1) from ft1 group by c5, sqrt(c2) having (avg(c1) / avg(c1)) * random() <
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
The following review has been posted through the commitfest application:

make installcheck-world: tested, failed
Implements feature:      not tested
Spec compliant:          not tested
Documentation:           not tested

Hi Alexander,

make check-world fails on contrib/postgres_fdw because of "Subquery Scan on ...". Probably, the query plan was changed.

The new status of this patch is: Waiting on Author
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
Hello everyone,

I made a new patch according to the previous comments. It is simpler now, only adding a few checks to the bitmap heap scan node. When the target list for the bitmap heap scan is empty, and there is no filter, and the bitmap page generated by the index scan is exact, and the corresponding heap page is visible to all transactions, we don't fetch it. The performance is better than with the previous patch, because now it can leverage the parallel heap scan logic. A simple benchmark is attached: this patch is more than ten times faster on a frequent search term, and two times faster on an infrequent one.

Still, there is one thing that is bothering me. I use an empty targetlist as the marker that I should not fetch tuples. Because of that, I have to make sure that use_physical_tlist() doesn't touch empty tlists. Consequently, if count(*) sits on top of a subquery, this subquery has to project and cannot be deleted (see trivial_subqueryscan). There is such a query in the regression test select_distinct: "select count(*) from (select distinct two, four, two from tenk1);". For that particular query it shouldn't matter much, so I changed the test, but the broader implications of this escape me at the moment.

The cost estimation is very simplified now: I just halve the number of pages to be fetched. The most important missing part is checking whether we have any quals that are not checked by the index: if we do, we'll have to fetch all the tuples. Finding nonindex qualifications is somewhat convoluted for a bitmap index scan tree involving multiple indexes, so I didn't implement it for now. We could also consider estimating the number of lossy pages in the tid bitmap given the current work_mem size.
I'll be glad to hear your thoughts on this.| diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c index 79f534e4e9..d7ea6f6929 100644 --- a/src/backend/executor/nodeBitmapHeapscan.c +++ b/src/backend/executor/nodeBitmapHeapscan.c @@ -39,6 +39,7 @@ #include "access/relscan.h" #include "access/transam.h" +#include "access/visibilitymap.h" #include "executor/execdebug.h" #include "executor/nodeBitmapHeapscan.h" #include "miscadmin.h" @@ -225,9 +226,25 @@ BitmapHeapNext(BitmapHeapScanState *node) } /* - * Fetch the current heap page and identify candidate tuples. + * If we don't need the tuple contents and are only counting them, + * we can skip fetching the page if the bitmap doesn't need rechecking + * and all tuples on the page are visible to our transaction */ - bitgetpage(scan, tbmres); + node->bhs_nofetch = !tbmres->recheck +&& node->ss.ps.qual == NULL +&& node->ss.ps.plan->targetlist == NIL +&& VM_ALL_VISIBLE(node->ss.ss_currentRelation, tbmres->blockno, + &node->bhs_vmbuffer); + + if (node->bhs_nofetch) +scan->rs_ntuples = tbmres->ntuples; + else + { +/* + * Fetch the current heap page and identify candidate tuples. 
+ */ +bitgetpage(scan, tbmres); + } if (tbmres->ntuples >= 0) node->exact_pages++; @@ -289,45 +306,58 @@ BitmapHeapNext(BitmapHeapScanState *node) */ BitmapPrefetch(node, scan); - /* - * Okay to fetch the tuple - */ - targoffset = scan->rs_vistuples[scan->rs_cindex]; - dp = (Page) BufferGetPage(scan->rs_cbuf); - lp = PageGetItemId(dp, targoffset); - Assert(ItemIdIsNormal(lp)); - scan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lp); - scan->rs_ctup.t_len = ItemIdGetLength(lp); - scan->rs_ctup.t_tableOid = scan->rs_rd->rd_id; - ItemPointerSet(&scan->rs_ctup.t_self, tbmres->blockno, targoffset); + if (node->bhs_nofetch) + { + /* + * If we don't have to fetch the tuple, just return a + * bogus one + */ + slot->tts_isempty = false; + slot->tts_nvalid = 0; + } + else + { + /* + * Okay to fetch the tuple + */ + targoffset = scan->rs_vistuples[scan->rs_cindex]; + dp = (Page) BufferGetPage(scan->rs_cbuf); + lp = PageGetItemId(dp, targoffset); + Assert(ItemIdIsNormal(lp)); - pgstat_count_heap_fetch(scan->rs_rd); + scan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lp); + scan->rs_ctup.t_len = ItemIdGetLength(lp); + scan->rs_ctup.t_tableOid = scan->rs_rd->rd_id; + ItemPointerSet(&scan->rs_ctup.t_self, tbmres->blockno, targoffset); - /* - * Set up the result slot to point to this tuple. Note that the slot - * acquires a pin on the buffer. - */ - ExecStoreTuple(&scan->rs_ctup, - slot, - scan->rs_cbuf, - false); + pgstat_count_heap_fetch(scan->rs_rd); - /* - * If we are using lossy info, we have to recheck the qual conditions - * at every tuple. - */ - if (tbmres->recheck) - { - econtext->ecxt_scantuple = slot; - ResetExprContext(econtext); + /* + * Set up the result slot to point to this tuple. Note that the slot + * acquires a pin on the buffer. + */ + ExecStoreTuple(&scan-
Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT
Tom Lane wrote:
>> Andres Freund writes:
>>> Hm, strategically sprinkled CheckTableNotInUse() might do the trick?
>
> Attached is a proposed patch that closes off this problem. I've tested
> it to the extent that it blocks Albe's example and passes check-world.

I tested it, and it works fine. Thanks!

> I'm unsure whether to back-patch or not; the main argument for not doing
> so is that if any extensions are calling DefineIndex() directly, this
> would be an API break for them. Given what a weird case this is, I'm not
> sure it's worth that.
>
> A possible end-run around the API objection would be to not add an extra
> argument to DefineIndex() in the back branches, but to use !is_alter_table
> as the control condition. That would work for the core callers, although
> we might need a special case for bootstrap mode. On the other hand,
> thinking again about hypothetical third-party callers, it's possible that
> that's not the right answer for them, in which case they'd be really in
> trouble. So I'm not that much in love with that answer.

It causes a slight bellyache to leave an unfixed data corruption bug in the back branches (if only index corruption), but I agree that it is such a weird case to create an index in a BEFORE trigger that we probably can live without a back-patch.

Yours,
Laurenz Albe
Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT
Michael Paquier writes:
> The patch looks good to me, could you add a regression test?

Done, thanks for the review. I stuck the test into triggers.sql, which is not completely on point since there are other ways to get to this error. But if we're thinking of it as a framework for testing other CheckTableNotInUse cases, then adding it to e.g. create_index.sql doesn't seem right either.

regards, tom lane
Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT
On 2017-06-03 18:23:33 -0400, Tom Lane wrote:
> Attached is a proposed patch that closes off this problem. I've tested
> it to the extent that it blocks Albe's example and passes check-world.

Cool.

> I'm unsure whether to back-patch or not; the main argument for not doing
> so is that if any extensions are calling DefineIndex() directly, this
> would be an API break for them. Given what a weird case this is, I'm not
> sure it's worth that.

I slightly lean against backpatching; it has taken long to be reported, and it's pretty easy to work around...

- Andres
Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT
On Sun, Jun 4, 2017 at 7:23 AM, Tom Lane wrote:
> I'm unsure whether to back-patch or not; the main argument for not doing
> so is that if any extensions are calling DefineIndex() directly, this
> would be an API break for them. Given what a weird case this is, I'm not
> sure it's worth that.

Given that it has taken 20 years to get a report of this problem, it seems to me that one report does not win against the risk of destabilizing extensions that would surprisingly break after a minor update.

> A possible end-run around the API objection would be to not add an extra
> argument to DefineIndex() in the back branches, but to use !is_alter_table
> as the control condition. That would work for the core callers, although
> we might need a special case for bootstrap mode. On the other hand,
> thinking again about hypothetical third-party callers, it's possible that
> that's not the right answer for them, in which case they'd be really in
> trouble. So I'm not that much in love with that answer.

Yes, that's not worth the trouble I think for back-branches. The patch looks good to me; could you add a regression test? This could be used later on as a basis for other DDL commands if somebody looks at this problem for the other ones.

--
Michael
Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT
I wrote:
> Andres Freund writes:
>> Hm, strategically sprinkled CheckTableNotInUse() might do the trick?

> +1. We can't reasonably make it work: the outer query already has its
> list of indexes that need to be inserted into. Also, if you try to
> make the index via ALTER TABLE ADD CONSTRAINT in the trigger, that will
> give you "cannot ALTER TABLE "mytable" because it is being used by active
> queries in this session" because of the check in AlterTable().

Attached is a proposed patch that closes off this problem. I've tested it to the extent that it blocks Albe's example and passes check-world.

I'm unsure whether to back-patch or not; the main argument for not doing so is that if any extensions are calling DefineIndex() directly, this would be an API break for them. Given what a weird case this is, I'm not sure it's worth that.

A possible end-run around the API objection would be to not add an extra argument to DefineIndex() in the back branches, but to use !is_alter_table as the control condition. That would work for the core callers, although we might need a special case for bootstrap mode. On the other hand, thinking again about hypothetical third-party callers, it's possible that that's not the right answer for them, in which case they'd be really in trouble. So I'm not that much in love with that answer.

> It doesn't seem terribly hard to fix the CREATE INDEX case to behave
> likewise, but I wonder how many other cases we've missed?
That remains an open question :-( regards, tom lane diff --git a/src/backend/bootstrap/bootparse.y b/src/backend/bootstrap/bootparse.y index de3695c..2e1fef0 100644 *** a/src/backend/bootstrap/bootparse.y --- b/src/backend/bootstrap/bootparse.y *** Boot_DeclareIndexStmt: *** 323,328 --- 323,329 $4, false, false, + false, true, /* skip_build */ false); do_end(); *** Boot_DeclareUniqueIndexStmt: *** 366,371 --- 367,373 $5, false, false, + false, true, /* skip_build */ false); do_end(); diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index 4861799..c090279 100644 *** a/src/backend/commands/indexcmds.c --- b/src/backend/commands/indexcmds.c *** CheckIndexCompatible(Oid oldId, *** 295,300 --- 295,303 * 'is_alter_table': this is due to an ALTER rather than a CREATE operation. * 'check_rights': check for CREATE rights in namespace and tablespace. (This * should be true except when ALTER is deleting/recreating an index.) + * 'check_not_in_use': check for table not already in use in current session. + * This should be true unless caller is holding the table open, in which + * case the caller had better have checked it earlier. * 'skip_build': make the catalog entries but leave the index file empty; * it will be filled later. * 'quiet': suppress the NOTICE chatter ordinarily provided for constraints. *** DefineIndex(Oid relationId, *** 307,312 --- 310,316 Oid indexRelationId, bool is_alter_table, bool check_rights, + bool check_not_in_use, bool skip_build, bool quiet) { *** DefineIndex(Oid relationId, *** 405,410 --- 409,423 errmsg("cannot create indexes on temporary tables of other sessions"))); /* + * Unless our caller vouches for having checked this already, insist that + * the table not be in use by our own session, either. Otherwise we might + * fail to make entries in the new index (for instance, if an INSERT or + * UPDATE is in progress and has already made its list of target indexes). 
+ */ + if (check_not_in_use) + CheckTableNotInUse(rel, "CREATE INDEX"); + + /* * Verify we (still) have CREATE rights in the rel's namespace. * (Presumably we did when the rel was created, but maybe not anymore.) * Skip check if caller doesn't want it. Also skip check if diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 7959120..b61fda9 100644 *** a/src/backend/commands/tablecmds.c --- b/src/backend/commands/tablecmds.c *** ATExecAddIndex(AlteredTableInfo *tab, Re *** 6679,6684 --- 6679,6685 InvalidOid, /* no predefined OID */ true, /* is_alter_table */ check_rights, + false, /* check_not_in_use - we did it already */ skip_build, quiet); diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c index 1e941fb..a22fd53 100644 *** a/src/backend/tcop/utility.c --- b/src/backend/tcop/utility.c *** ProcessUtilitySlow(ParseState *pstate, *** 1329,1334 --- 1329,1335 InvalidOid, /* no predefined OID */ false, /* is_alter_table */ true, /* check_rights */ + true, /* check_not_in_use */ false, /
Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT
Andres Freund writes:
> On 2017-05-24 08:26:24 -0400, Robert Haas wrote:
>> I'm willing to bet that nobody ever thought about that case very hard.
>> It seems like we should either make it work or prohibit it, but I
>> can't actually see quite how to do either off-hand.

> Hm, strategically sprinkled CheckTableNotInUse() might do the trick?

+1.  We can't reasonably make it work: the outer query already has its
list of indexes that need to be inserted into.  Also, if you try to
make the index via ALTER TABLE ADD CONSTRAINT in the trigger, that will
give you "cannot ALTER TABLE "mytable" because it is being used by active
queries in this session" because of the check in AlterTable().

It doesn't seem terribly hard to fix the CREATE INDEX case to behave
likewise, but I wonder how many other cases we've missed?

One small issue is that naive use of CheckTableNotInUse would produce an
error reading "cannot CREATE INDEX "mytable" because ...", which is not
exactly on point.  It'd be better for it to say "cannot create an index
on "mytable" because ...".  However, given that it took twenty years for
anybody to notice this, maybe it's close enough.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
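[Editorial note] The failure mode Tom describes — "the outer query already has its list of indexes that need to be inserted into" — can be sketched in a toy model. This is illustrative Python only, not PostgreSQL code; the class and function names are invented, and the comments point at the real mechanism (the executor opening target indexes before BEFORE ROW triggers fire) only by analogy:

```python
class Table:
    def __init__(self):
        self.rows = []
        self.indexes = {}  # index name -> (indexed column, set of values)

    def create_index(self, name, col):
        # Indexes only the rows already in the table.
        self.indexes[name] = (col, {r[col] for r in self.rows})


def insert(table, row, before_trigger=None):
    # The executor snapshots the target-index list up front (compare
    # ExecOpenIndices() in PostgreSQL), *before* any BEFORE ROW trigger runs.
    targets = list(table.indexes)
    if before_trigger:
        before_trigger(table)      # trigger does CREATE INDEX -- too late
    table.rows.append(row)
    for name in targets:           # the new index is not in the snapshot
        col, vals = table.indexes[name]
        vals.add(row[col])


t = Table()
insert(t, {"id": 1, "id2": 42},
       before_trigger=lambda tbl: tbl.create_index("mytable_id2_idx", "id2"))

print(t.rows)                           # row is present in the table
print(t.indexes["mytable_id2_idx"])     # ('id2', set()) -- 42 was never indexed
```

The row lands in the "heap" but never in the index created mid-statement, which is exactly the corruption Albe reported; sprinkling a "table not in use" check at index creation forbids the situation instead of trying to repair the snapshot.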
Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT
On 2017-05-24 08:26:24 -0400, Robert Haas wrote:
> On Mon, May 22, 2017 at 7:05 AM, Albe Laurenz wrote:
> > Not that it is a useful use case, but I believe that this is
> > a bug that causes index corruption:
> >
> > CREATE TABLE mytable(
> >    id integer PRIMARY KEY,
> >    id2 integer NOT NULL
> > );
> >
> > CREATE FUNCTION makeindex() RETURNS trigger
> >    LANGUAGE plpgsql AS
> > $$BEGIN
> >    CREATE INDEX ON mytable(id2);
> >    RETURN NEW;
> > END;$$;
>
> I'm willing to bet that nobody ever thought about that case very hard.
> It seems like we should either make it work or prohibit it, but I
> can't actually see quite how to do either off-hand.

Hm, strategically sprinkled CheckTableNotInUse() might do the trick?

I've neither tried nor thought this through fully, but I can't think of
a case where pre-existing relcache references to tables are ok...

- Andres
Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT
On Mon, May 22, 2017 at 7:05 AM, Albe Laurenz wrote:
> Not that it is a useful use case, but I believe that this is
> a bug that causes index corruption:
>
> CREATE TABLE mytable(
>    id integer PRIMARY KEY,
>    id2 integer NOT NULL
> );
>
> CREATE FUNCTION makeindex() RETURNS trigger
>    LANGUAGE plpgsql AS
> $$BEGIN
>    CREATE INDEX ON mytable(id2);
>    RETURN NEW;
> END;$$;

I'm willing to bet that nobody ever thought about that case very hard.
It seems like we should either make it work or prohibit it, but I
can't actually see quite how to do either off-hand.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[HACKERS] Index Only Scan support for cube
Hi, hackers!

Here's a small patch that implements the fetch function necessary for
Index Only Scans over the cube data type. I reuse the function
g_cube_decompress() instead of creating a new function g_cube_fetch();
essentially, they both just have to detoast the data. What do you think:
would it be better to create a shallow copy of g_cube_decompress
instead? Any other suggestions on the functionality?

With the patch, this:

CREATE TABLE SOMECUBES AS SELECT CUBE(X,X+1) C FROM GENERATE_SERIES(1,100) X;
CREATE INDEX SOMECUBES_IDX ON SOMECUBES USING GIST(C);
SET ENABLE_SEQSCAN = FALSE;
EXPLAIN (COSTS OFF) SELECT C FROM SOMECUBES WHERE C<@CUBE(30,40);

now produces

 Index Only Scan using somecubes_idx on somecubes
   Index Cond: (c <@ '(30),(40)'::cube)

instead of

 Index Scan using somecubes_idx on somecubes
   Index Cond: (c <@ '(30),(40)'::cube)

Best regards, Andrey Borodin, Octonica.

Attachment: cubefetch.patch (binary data)
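[Editorial note] The reuse argument above — fetch and decompress "both have to detoast data" — can be illustrated with a toy model. This is a hedged Python sketch, not the C patch: zlib stands in for TOAST compression, and the two names are bound to one function the way the patch registers one support function for both roles:

```python
import zlib

def toast(value: bytes) -> bytes:
    # Stand-in for TOAST compression of a stored datum.
    return zlib.compress(value)

def detoast(stored: bytes) -> bytes:
    # Both GiST "decompress" and "fetch" for cube reduce to this.
    return zlib.decompress(stored)

# One implementation serves both support-function slots, mirroring the
# patch's reuse of g_cube_decompress for index-only-scan fetch:
g_cube_decompress = detoast
g_cube_fetch = g_cube_decompress

leaf = toast(b"(30),(40)")          # cube value as stored in a leaf entry
assert g_cube_fetch(leaf) == b"(30),(40)"
```

The design question in the mail is then just whether to keep one shared function or a thin wrapper with its own name; behavior is identical either way.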
[HACKERS] Index created in BEFORE trigger not updated during INSERT
Not that it is a useful use case, but I believe that this is
a bug that causes index corruption:

CREATE TABLE mytable(
   id integer PRIMARY KEY,
   id2 integer NOT NULL
);

CREATE FUNCTION makeindex() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   CREATE INDEX ON mytable(id2);
   RETURN NEW;
END;$$;

CREATE TRIGGER makeindex BEFORE INSERT ON mytable FOR EACH ROW
   EXECUTE PROCEDURE makeindex();

INSERT INTO mytable VALUES (1, 42);

SELECT * FROM mytable;

┌────┬─────┐
│ id │ id2 │
├────┼─────┤
│  1 │  42 │
└────┴─────┘
(1 row)

But 42 is not present in the index:

SET enable_seqscan = off;

SELECT * FROM mytable WHERE id2 = 42;

┌────┬─────┐
│ id │ id2 │
├────┼─────┤
└────┴─────┘
(0 rows)
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
On 12.04.2017 12:29, Alexander Korotkov wrote:
> That's a cool feature for FTS users! Please, register this patch to
> the next commitfest.

I've added this to the 2017-07 commitfest: https://commitfest.postgresql.org/14/1117/

> Also, what is planning overhead of this patch? That's worth testing too.

The planning overhead is about 0.1 - 0.07 ms on the samples from my first letter.

> Another thing catch my eye. The estimated number of rows in Bitmap
> Count node is the same as in Bitmap Index Scan node. Should it be 1
> because it always returns single row?

You're right, I'll fix this in the next version of the patch.

> Does this limitation cause a performance drawback? When bitmap index
> scan returns all rechecks, alternative to Bitmap Count is still
> Aggregate + Bitmap Heap Scan. Thus, I think Bitmap Count would have
> the same performance or even slightly faster. That's worth testing.

Bitmap heap scan can indeed be faster, because it prefetches heap pages,
and can be run in parallel. When the table data is not cached, the
difference is not big on my machine. It could probably be significant if
I used a storage that supported parallel reads. When the data is cached
in memory, the parallel bitmap heap scan can be significantly faster. We
could use the standard bitmap heap scan node with some tweaks, as
discussed in the other subthread, to avoid this regression.
Here are some test queries:

--- not cached ---

test1=# explain analyze select count(*) from pglist where fts @@ to_tsquery( 'tom & lane' );
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Bitmap Count on pglist  (cost=542.65..1087.68 rows=54503 width=8) (actual time=30264.174..30264.177 rows=1 loops=1)
   Recheck Cond: (fts @@ to_tsquery('tom & lane'::text))
   Rows Removed by Index Recheck: 270853
   Heap Fetches: 66138
   Heap Blocks: exact=39854 lossy=66138
   ->  Bitmap Index Scan on idx_pglist_fts  (cost=0.00..529.02 rows=54503 width=0) (actual time=525.341..525.341 rows=222813 loops=1)
         Index Cond: (fts @@ to_tsquery('tom & lane'::text))
 Planning time: 128.238 ms
 Execution time: 30264.299 ms
(9 rows)

test1=# set enable_bitmapcount to off;
SET
test1=# explain analyze select count(*) from pglist where fts @@ to_tsquery( 'tom & lane' );
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=119989.73..119989.74 rows=1 width=8) (actual time=31699.829..31699.830 rows=1 loops=1)
   ->  Gather  (cost=119989.52..119989.73 rows=2 width=8) (actual time=31698.699..31699.819 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=118989.52..118989.53 rows=1 width=8) (actual time=31689.289..31689.290 rows=1 loops=3)
               ->  Parallel Bitmap Heap Scan on pglist  (cost=542.65..118932.74 rows=22710 width=0) (actual time=608.532..31634.692 rows=74271 loops=3)
                     Recheck Cond: (fts @@ to_tsquery('tom & lane'::text))
                     Rows Removed by Index Recheck: 90284
                     Heap Blocks: exact=13242 lossy=21960
                     ->  Bitmap Index Scan on idx_pglist_fts  (cost=0.00..529.02 rows=54503 width=0) (actual time=552.136..552.136 rows=222813 loops=1)
                           Index Cond: (fts @@ to_tsquery('tom & lane'::text))
 Planning time: 160.055 ms
 Execution time: 31724.468 ms
(13 rows)

--- cached ---

test1=# explain analyze select count(*) from pglist where fts @@ to_tsquery( 'tom & lane' );
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=119989.73..119989.74 rows=1 width=8) (actual time=1250.973..1250.973 rows=1 loops=1)
   ->  Gather  (cost=119989.52..119989.73 rows=2 width=8) (actual time=1250.588..1250.964 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=118989.52..118989.53 rows=1 width=8) (actual time=1246.144..1246.144 rows=1 loops=3)
               ->  Parallel Bitmap Heap Scan on pglist  (cost=542.65..118932.74 rows=22710 width=0) (actual time=82.781..1237.585 rows=74271 loops=3)
                     Recheck Cond: (fts @@ to_tsquery('tom & lane'::text))
                     Rows Removed by Index Recheck: 90284
                     Heap Blocks: exact=13221 lossy=22217
                     ->  Bitmap Index Scan on idx_pglist_fts
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
On 12.04.2017 17:24, Tom Lane wrote:
> TBH, I'm not sure you need to do any of that work.  Have you got
> evidence that the planner will fail to choose the right plan
> regardless?  I'm particularly unconvinced that choose_bitmap_and is a
> critical problem, because once you're into having to AND multiple
> indexes, you're talking about an expensive query anyhow.

The most expensive part would probably be accessing the heap in the
bitmap heap scan. It may be worth trying to choose an index path that
checks all the restrictions and therefore allows us to skip this heap
access. This path might not be the cheapest one, though, so the standard
AND selection procedure would have discarded it based on cost.

I've seen this happen on the regression database. Somehow I can't seem
to reproduce it on my earlier full-text search example. An example:

regression=# explain select count(*) from tenk1 where hundred < 90 and thousand < 31;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Bitmap Count on tenk1  (cost=182.69..185.56 rows=1 width=8)
   Recheck Cond: ((thousand < 31) AND (hundred < 90))
   ->  BitmapAnd  (cost=182.69..182.69 rows=287 width=0)
         ->  Bitmap Index Scan on tenk1_thous_tenthous  (cost=0.00..6.68 rows=319 width=0)
               Index Cond: (thousand < 31)
         ->  Bitmap Index Scan on tenk1_hundred  (cost=0.00..175.62 rows=8978 width=0)
               Index Cond: (hundred < 90)
(7 rows)

regression=# set enable_bitmapcount to off;
SET
regression=# explain select count(*) from tenk1 where hundred < 90 and thousand < 31;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Aggregate  (cost=375.34..375.35 rows=1 width=8)
   ->  Bitmap Heap Scan on tenk1  (cost=6.75..374.62 rows=287 width=0)
         Recheck Cond: (thousand < 31)
         Filter: (hundred < 90)
         ->  Bitmap Index Scan on tenk1_thous_tenthous  (cost=0.00..6.68 rows=319 width=0)
               Index Cond: (thousand < 31)
(6 rows)

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
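[Editorial note] The selection difference shown by the two plans above can be sketched as a toy model. This is hedged, illustrative Python, not the real choose_bitmap_and(); the candidate sets and costs are taken from the tenk1 example, and the "required clauses" rule is the behavior Kuzmenkov argues a bitmap-count path needs:

```python
# Each candidate is (set of clauses the index combination checks, index cost).
# The standard rule minimizes cost and rechecks leftover clauses against the
# heap; a count(*)-oriented rule only considers combinations that check ALL
# clauses, since any leftover clause forces heap access anyway.

clauses = {"thousand < 31", "hundred < 90"}

candidates = [
    ({"thousand < 31"}, 6.75),                     # cheap, partial coverage
    ({"thousand < 31", "hundred < 90"}, 182.69),   # BitmapAnd, full coverage
]

def pick_standard(cands):
    # Cheapest index combination wins; remaining clauses become a Filter.
    return min(cands, key=lambda c: c[1])

def pick_bitmap_count(cands, required):
    # Only combinations covering every restriction clause qualify.
    full = [c for c in cands if c[0] >= required]
    return min(full, key=lambda c: c[1]) if full else None

print(pick_standard(candidates))              # the single cheap index
print(pick_bitmap_count(candidates, clauses)) # the BitmapAnd combination
```

Under the standard rule the 6.75-cost path wins and "hundred < 90" is filtered at the heap; under the full-coverage rule the 182.69-cost BitmapAnd is chosen even though it is not the cheapest, which is exactly why the plans in the example diverge.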
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
Alexander Kuzmenkov writes:
> With planner, the changes are more complex. Two things must be done
> there. First, when the tlist is empty, we must use a different cost
> function for bitmap heap scan, because the heap access pattern is
> different. Second, choose_bitmap_and() must use a different algorithm
> for choosing the right combination of paths. A standard algorithm
> chooses the combination based on cost. For count(*) purposes, the
> decisive factor is that the path has to check all the restrictions, or
> else we'll need heap access to recheck some of them, which defeats the
> purpose of having this optimization. The planner code that builds and
> costs the index path is fairly complex, and integrating this additional
> behavior into it didn't look good to me.

TBH, I'm not sure you need to do any of that work.  Have you got
evidence that the planner will fail to choose the right plan regardless?
I'm particularly unconvinced that choose_bitmap_and is a critical
problem, because once you're into having to AND multiple indexes,
you're talking about an expensive query anyhow.

			regards, tom lane
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
On 12.04.2017 15:04, Tom Lane wrote:
> Andrew Gierth writes:
>> "Alexander" == Alexander Kuzmenkov writes:
>> Alexander> Structurally, the patch consists of two major parts: a
>> Alexander> specialized executor node
>> Why?
>> It strikes me that the significant fact here is not that we're doing
>> count(*), but that we don't need any columns from the bitmap heap scan
>> result.  Rather than creating a whole new node, can't the existing
>> bitmap heapscan be taught to skip fetching the actual table page in
>> cases where it's all-visible, not lossy, and no columns are needed?
>
> +1 ... while I hadn't actually looked at the code, it seemed to me that
> anything like the optimization-as-described would be impossibly klugy
> from the planner's standpoint.  Your formulation sounds lots nicer.
>
> Detecting that no columns are needed in the executor might be a bit
> tricky because of the planner's habit of inserting a "physical tlist"
> to avoid a projection step.  (See also nearby discussion about custom
> scan planning.)  But we could fix that.  I think one rule that would
> make sense is to just disable the physical-tlist substitution if the
> relation's targetlist is empty --- it wouldn't be buying much in such
> a case anyhow.  Then the runtime tlist for the scan node would also be
> empty, and away you go.

When making an early prototype of this optimization, I did what you are
describing with the bitmap heap scan executor node. In an internal
review, it was said that the bitmap heap scan node is already
complicated enough and shouldn't have more logic added to it, so I
rewrote it as a separate node. To me, your approach looks good too, so
if the community is generally in favor of this, I could rewrite the
executor as such.

With the planner, the changes are more complex. Two things must be done
there. First, when the tlist is empty, we must use a different cost
function for bitmap heap scan, because the heap access pattern is
different. Second, choose_bitmap_and() must use a different algorithm
for choosing the right combination of paths. A standard algorithm
chooses the combination based on cost. For count(*) purposes, the
decisive factor is that the path has to check all the restrictions, or
else we'll need heap access to recheck some of them, which defeats the
purpose of having this optimization. The planner code that builds and
costs the index path is fairly complex, and integrating this additional
behavior into it didn't look good to me. Instead, I created a
specialized path node and isolated the logic that handles it. The
resulting implementation adds several functions, but it is mostly
self-contained and has a single entry point in grouping_planner(). It
handles the specific case of bitmap count plans and doesn't complicate
the existing code any further.

The planner part is to some extent independent of whether we use a
separate executor node or not. If we choose not to, the same count(*)
optimization code I proposed could create plain bitmap heap scan paths.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
Andrew Gierth writes:
> "Alexander" == Alexander Kuzmenkov writes:
> Alexander> Structurally, the patch consists of two major parts: a
> Alexander> specialized executor node

> Why?

> It strikes me that the significant fact here is not that we're doing
> count(*), but that we don't need any columns from the bitmap heap scan
> result.  Rather than creating a whole new node, can't the existing
> bitmap heapscan be taught to skip fetching the actual table page in
> cases where it's all-visible, not lossy, and no columns are needed?

+1 ... while I hadn't actually looked at the code, it seemed to me that
anything like the optimization-as-described would be impossibly klugy
from the planner's standpoint.  Your formulation sounds lots nicer.

Detecting that no columns are needed in the executor might be a bit
tricky because of the planner's habit of inserting a "physical tlist"
to avoid a projection step.  (See also nearby discussion about custom
scan planning.)  But we could fix that.  I think one rule that would
make sense is to just disable the physical-tlist substitution if the
relation's targetlist is empty --- it wouldn't be buying much in such a
case anyhow.  Then the runtime tlist for the scan node would also be
empty, and away you go.

			regards, tom lane
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
> "Alexander" == Alexander Kuzmenkov writes:

Alexander> Structurally, the patch consists of two major parts: a
Alexander> specialized executor node

Why?

It strikes me that the significant fact here is not that we're doing
count(*), but that we don't need any columns from the bitmap heap scan
result.  Rather than creating a whole new node, can't the existing
bitmap heapscan be taught to skip fetching the actual table page in
cases where it's all-visible, not lossy, and no columns are needed?

(this would also have the advantage of getting parallelism for free)

--
Andrew (irc:RhodiumToad)
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
On Wed, Apr 12, 2017 at 12:40 AM, Alexander Korotkov < a.korot...@postgrespro.ru> wrote: > On Tue, Apr 11, 2017 at 8:24 PM, Alexander Kuzmenkov < > a.kuzmen...@postgrespro.ru> wrote: > >> I would like to propose a patch that speeds up the queries of the form >> 'select >> count(*) ... where ...', where the restriction clauses can be satisfied >> by some >> indexes. At the moment, such queries use index-only scans followed by >> aggregation. Index-only scans are only possible for indexes that are >> capable of >> returning indexed tuples, that is, support the 'amgettuple' access >> method. They >> are not applicable to indexes such as GIN and RUM. However, it is >> possible to >> improve count(*) performance for indexes that support bitmap scans. Having >> performed a bitmap index scan or a combination of such, the bits in >> bitmap can >> be counted to obtain the final result. Of course, the bitmap pages that >> are >> lossy or not visible to all existing transactions will still require heap >> access. >> > > That's a cool feature for FTS users! Please, register this patch to the > next commitfest. > > This patch has some important limitations: >> * It only supports targetlist consisting of a single expression that can >> be >> projected from count(*). >> * count(expr) is not supported. We could support it for cases where the >> "expr is not null" restriction can be satisfied with an index. >> * The current implementation does not support parallel execution. It >> could be >> implemented during the PostgreSQL 11 release cycle. >> * For some indexes, the bitmap index scan will always require rechecking >> all >> the tuples. Bitmap count plans should not be used in such cases. For now, >> this >> check is not implemented. >> > > Does this limitation cause a performance drawback? When bitmap index scan > returns all rechecks, alternative to Bitmap Count is still Aggregate + > Bitmap Heap Scan. 
Thus, I think Bitmap Count would have the same > performance or even slightly faster. That's worth testing. > > Also, what is planning overhead of this patch? That's worth testing too. > Another thing catch my eye. The estimated number of rows in Bitmap Count node is the same as in Bitmap Index Scan node. Should it be 1 because it always returns single row? test1=# explain analyze select count(*) from pglist where fts @@ > to_tsquery( 'tom & lane' ); > QUERY > PLAN > > > Bitmap Count on pglist (cost=550.65..1095.68 rows=54503 width=8) (actual > time=1120.281..1120.281 rows=1 loops=1) >Recheck Cond: (fts @@ to_tsquery('tom & lane'::text)) >Heap Fetches: 0 >Heap Blocks: exact=105992 >-> Bitmap Index Scan on idx_pglist_rum_fts (cost=0.00..537.02 > rows=54503 width=0) (actual time=1056.060..1056.060 rows=222813 loops=1) > Index Cond: (fts @@ to_tsquery('tom & lane'::text)) > Planning time: 119.568 ms > Execution time: 1121.409 ms > (8 rows) -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans
On Tue, Apr 11, 2017 at 8:24 PM, Alexander Kuzmenkov < a.kuzmen...@postgrespro.ru> wrote: > I would like to propose a patch that speeds up the queries of the form > 'select > count(*) ... where ...', where the restriction clauses can be satisfied > by some > indexes. At the moment, such queries use index-only scans followed by > aggregation. Index-only scans are only possible for indexes that are > capable of > returning indexed tuples, that is, support the 'amgettuple' access method. > They > are not applicable to indexes such as GIN and RUM. However, it is possible > to > improve count(*) performance for indexes that support bitmap scans. Having > performed a bitmap index scan or a combination of such, the bits in bitmap > can > be counted to obtain the final result. Of course, the bitmap pages that are > lossy or not visible to all existing transactions will still require heap > access. > That's a cool feature for FTS users! Please, register this patch to the next commitfest. This patch has some important limitations: > * It only supports targetlist consisting of a single expression that can be > projected from count(*). > * count(expr) is not supported. We could support it for cases where the > "expr is not null" restriction can be satisfied with an index. > * The current implementation does not support parallel execution. It could > be > implemented during the PostgreSQL 11 release cycle. > * For some indexes, the bitmap index scan will always require rechecking > all > the tuples. Bitmap count plans should not be used in such cases. For now, > this > check is not implemented. > Does this limitation cause a performance drawback? When bitmap index scan returns all rechecks, alternative to Bitmap Count is still Aggregate + Bitmap Heap Scan. Thus, I think Bitmap Count would have the same performance or even slightly faster. That's worth testing. Also, what is planning overhead of this patch? That's worth testing too. 
-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] index-only count(*) for indexes supporting bitmap scans
Hi, I would like to propose a patch that speeds up the queries of the form 'select count(*) ... where ...', where the restriction clauses can be satisfied by some indexes. At the moment, such queries use index-only scans followed by aggregation. Index-only scans are only possible for indexes that are capable of returning indexed tuples, that is, support the 'amgettuple' access method. They are not applicable to indexes such as GIN and RUM. However, it is possible to improve count(*) performance for indexes that support bitmap scans. Having performed a bitmap index scan or a combination of such, the bits in bitmap can be counted to obtain the final result. Of course, the bitmap pages that are lossy or not visible to all existing transactions will still require heap access. One kind of applications that can benefit from this change is the full-text search with pagination. To show a search results page, the application has to know the results that go to current page, and the total number of the results. Getting one page is fast, when the desired sorting order can be provided by an index. For example, results can be sorted by date with a separate btree index, or by relevance with RUM index. However, getting the total number of results is more difficult. With text search indexes, it requires a bitmap heap scan, which can be rather slow due to obligatory heap access. A well-known hack for this is using the approximate data from 'explain' results. The proposed change allows the user to obtain the precise number of the results in an efficient and idiomatic manner. The performance of this approach was tested on an archive of pgsql-hackers mailing list. The detailed results for two sample queries can be found in the attached file 'benchmark.txt'. The first test demonstrates full-text search with RUM index, ordering the results by rank. For a query with low selectivity, getting the top results is much faster than counting them all with a bitmap heap scan. 
With bitmap count execution plan, the results can be counted much faster. A similar test is done with a GIN index, where the results are restricted and ordered by date using another btree index. Again, it shows a significant speedup for count(*) query for bitmap count scan compared to bitmap heap scan. These results demonstrate that the bitmap count plan can indeed be useful for full- text search scenarios. Structurally, the patch consists of two major parts: a specialized executor node and the generation of corresponding paths and plans. The optimizer behavior here is similar to that of the min/max aggregate optimization. The main entry point is preprocess_count_aggs(); it is called by grouping_planner(). It searches for count(*) expression in the target list, and, if possible, generates a special path for it (struct BitmapCountPath). This path is then added to UPPERREL_GROUP_AGG upperrel, and competes with other paths at the further stages of planning. The executor node (nodeBitmapCount.c) is similar to the bitmap heap scan node, with the main difference being that it does not access heap for tuples that are known to satisfy the restriction and to be visible to all transactions. This patch has some important limitations: * It only supports targetlist consisting of a single expression that can be projected from count(*). * count(expr) is not supported. We could support it for cases where the "expr is not null" restriction can be satisfied with an index. * The current implementation does not support parallel execution. It could be implemented during the PostgreSQL 11 release cycle. * For some indexes, the bitmap index scan will always require rechecking all the tuples. Bitmap count plans should not be used in such cases. For now, this check is not implemented. I would be glad to hear your comments on this patch. 
Regards,
Alexander Kuzmenkov

=== The data ===

test1=# \d pglist
                              Table "public.pglist"
   Column   |            Type             | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+---------
 id         | integer                     |           |          |
 sent       | timestamp without time zone |           |          |
 subject    | text                        |           |          |
 author     | text                        |           |          |
 body_plain | text                        |           |          |
 fts        | tsvector                    |           |          |
Indexes:
    "idx_pglist_rum_fts" rum (fts)
    "idx_pglist_fts" gin (fts)
    "idx_pglist_sent" btree (sent)

test1=# select min(sent), max(sent), count(*) from pglist;
         min         |         max         |  count
---------------------+---------------------+---------
 1997-06-24 11:31:09 | 2016-04-27 23:46:29 | 1013770
(1 row)
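[Editorial note] The core executor idea of the proposal — count directly from the bitmap where possible, touch the heap only where necessary — can be sketched as a toy model. This is illustrative Python, not the patch's nodeBitmapCount.c; page dictionaries and the visit_heap callback are invented stand-ins for TID bitmap pages and heap rechecking:

```python
# A page contributes its bit count directly only when the bitmap entry is
# exact (individual tuple offsets known) AND the page is all-visible per the
# visibility map; lossy pages and pages with possibly-invisible tuples still
# require a heap visit for recheck/visibility testing.

def bitmap_count(pages, visit_heap):
    total = heap_fetches = 0
    for page in pages:
        if page["exact"] and page["all_visible"]:
            total += len(page["offsets"])   # count bits, skip the heap
        else:
            total += visit_heap(page)       # recheck / visibility test
            heap_fetches += 1
    return total, heap_fetches

pages = [
    {"exact": True,  "all_visible": True,  "offsets": [1, 2, 3]},
    {"exact": True,  "all_visible": False, "offsets": [7]},   # heap visit
    {"exact": False, "all_visible": True,  "offsets": None},  # lossy page
]

# Hypothetical heap reader: returns how many tuples on the page qualify.
def visit_heap(page):
    return 1

print(bitmap_count(pages, visit_heap))   # (5, 2)
```

This also shows why the benchmark's "Heap Blocks: exact=... lossy=..." lines matter: the more lossy or not-all-visible pages the bitmap contains, the closer the plan degenerates to an ordinary bitmap heap scan.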
Re: [HACKERS] Index usage for elem-contained-by-const-range clauses
On 3/24/17 10:50 AM, David Steele wrote:
> Hi Pritam,
>
> On 3/17/17 5:41 PM, Pritam Baral wrote:
>> So sorry. I'm attaching the correct version of the original with this,
>> in case you want to test the limited implementation, because I still
>> have to go through Tom's list of suggestions.
>>
>> BTW, the patch is for applying on top of REL9_6_2, and while I
>> suspect it may work on master too, I haven't tested it since the
>> original submission (Feb 23).
>>
>>> Also, I noticed that patch haven't regression tests. Some mention of
>>> this optimization in docs is also nice to have.
>>>
>>> --
>>> Alexander Korotkov
>>> Postgres Professional: http://www.postgrespro.com
>>> The Russian Postgres Company
>
> This thread has been idle for a week.  Please respond and/or post a
> new patch by 2017-03-28 00:00 AoE (UTC-12) or this submission will be
> marked "Returned with Feedback".

This submission has been marked "Returned with Feedback".  Please feel
free to resubmit to a future commitfest.

Regards,
--
-David
da...@pgmasters.net
Re: [HACKERS] Index usage for elem-contained-by-const-range clauses
On Sat, Mar 18, 2017 at 12:41 AM, Pritam Baral wrote:
> On Friday 10 March 2017 07:59 PM, Alexander Korotkov wrote:
>> Hi, Pritam!  I've assigned to review this patch.
>>
>> On Thu, Feb 23, 2017 at 2:17 AM, Pritam Baral wrote:
>>> The topic has been previously discussed[0] on the -performance
>>> mailing list, about four years ago.  In that thread, Tom suggested[0]
>>> the planner could be made to "expand "intcol <@ 'x,y'::int4range"
>>> into "intcol between x and y", using something similar to the index
>>> LIKE optimization (ie, the "special operator" stuff in indxpath.c)".
>>
>> That's cool idea.  But I would say more.  Sometimes it's useful to
>> transform "intcol between x and y" into "intcol <@ 'x,y'::int4range".
>> btree_gin supports "intcol between x and y" as overlap of "intcol >= x"
>> and "intcol <= y".  That is very inefficient.  But if this clause would
>> be transformed into "intcol <@ 'x,y'::int4range", btree_gin could
>> handle this very efficiently.
>>
>>> This patch tries to do exactly that.  It's not tied to any specific
>>> datatype, and has been tested with both builtin types and custom
>>> range types.  Most of the checking for proper datatypes, operators,
>>> and btree index happens before this code, so I haven't run into any
>>> issues yet in my testing.  But I'm not familiar enough with the
>>> internals to be able to confidently say it can handle all cases just
>>> yet.
>>
>> I've tried this patch.  It applies cleanly, but doesn't compile.
>>
>> indxpath.c:4252:1: error: conflicting types for 'range_elem_contained_quals'
>> range_elem_contained_quals(Node *leftop, Datum rightop)
>> ^
>> indxpath.c:192:14: note: previous declaration is here
>> static List *range_elem_contained_quals(Node *leftop, Oid expr_op, Oid opfamily,
>> ^
>>
>> Could you please recheck that you published right version of patch?
>
> So sorry.
I'm attaching the correct version of the original with this, > in case you want to test the limited implementation, because I still > have to go through Tom's list of suggestions. > > BTW, the patch is for applying on top of REL9_6_2, and while I > suspect it may work on master too, I haven't tested it since the > original submission (Feb 23). What is idea behind basing patch on the REL9_6_2? This patch implements new functionality and it's definitely not going to be considered to be committed to stable release branches. If you are interesting in committing this patch to master, please rebase it on master branch. If not, please clarify the purpose of this submission. Also, please include some numbering to the patch name, so that we could distinguish one version of patch from another. -- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Index usage for elem-contained-by-const-range clauses
Hi Pritam,

On 3/17/17 5:41 PM, Pritam Baral wrote:

So sorry. I'm attaching the correct version of the original with this, in case you want to test the limited implementation, because I still have to go through Tom's list of suggestions.

BTW, the patch is for applying on top of REL9_6_2, and while I suspect it may work on master too, I haven't tested it since the original submission (Feb 23).

Also, I noticed that patch haven't regression tests. Some mention of this optimization in docs is also nice to have.

> --
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company

This thread has been idle for a week. Please respond and/or post a new patch by 2017-03-28 00:00 AoE (UTC-12) or this submission will be marked "Returned with Feedback".

Thanks,
--
-David
da...@pgmasters.net
Re: [HACKERS] Index usage for elem-contained-by-const-range clauses
On Friday 10 March 2017 07:59 PM, Alexander Korotkov wrote: Hi, Pritam! > > I've assigned to review this patch. > > On Thu, Feb 23, 2017 at 2:17 AM, Pritam Baral wrote: > > The topic has been previously discussed[0] on the -performance mailing list, > about four years ago. > > In that thread, Tom suggested[0] the planner could be made to "expand > "intcol <@ > 'x,y'::int4range" into "intcol between x and y", using something similar > to the > index LIKE optimization (ie, the "special operator" stuff in indxpath.c)". > > > That's cool idea. But I would say more. Sometimes it's useful to transform "intcol between x and y" into "intcol <@ 'x,y'::int4range". btree_gin supports "intcol between x and y" as overlap of "intcol >= x" and "intcol <= y". That is very inefficient. But it this clause would be transformed into "intcol <@ 'x,y'::int4range", btree_gin could handle this very efficient. > > > > This patch tries to do exactly that. It's not tied to any specific datatype, > and has been tested with both builtin types and custom range types. Most > of the > checking for proper datatypes, operators, and btree index happens before > this > code, so I haven't run into any issues yet in my testing. But I'm not > familiar > enough with the internals to be able to confidently say it can handle > all cases > just yet. > > > I've tried this patch. It applies cleanly, but doesn't compile. > > indxpath.c:4252:1: error: conflicting types for 'range_elem_contained_quals' > range_elem_contained_quals(Node *leftop, Datum rightop) > ^ > indxpath.c:192:14: note: previous declaration is here > static List *range_elem_contained_quals(Node *leftop, Oid expr_op, Oid opfamily, > ^ > Could you please recheck that you published right version of patch? So sorry. I'm attaching the correct version of the original with this, in case you want to test the limited implementation, because I still have to go through Tom's list of suggestions. 
BTW, the patch is for applying on top of REL9_6_2, and while I suspect it may work on master too, I haven't tested it since the original submission (Feb 23). Also, I noticed that patch haven't regression tests. Some mention of this optimization in docs is also nice to have. > > -- > Alexander Korotkov > Postgres Professional: http://www.postgrespro.com > The Russian Postgres Company -- #!/usr/bin/env regards Chhatoi Pritam Baral diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 2952bfb7c2..40a3c2c9f4 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -30,21 +30,23 @@ #include "optimizer/pathnode.h" #include "optimizer/paths.h" #include "optimizer/predtest.h" #include "optimizer/prep.h" #include "optimizer/restrictinfo.h" #include "optimizer/var.h" #include "utils/builtins.h" #include "utils/bytea.h" #include "utils/lsyscache.h" #include "utils/pg_locale.h" +#include "utils/rangetypes.h" #include "utils/selfuncs.h" +#include "utils/typcache.h" #define IsBooleanOpfamily(opfamily) \ ((opfamily) == BOOL_BTREE_FAM_OID || (opfamily) == BOOL_HASH_FAM_OID) #define IndexCollMatchesExprColl(idxcollation, exprcollation) \ ((idxcollation) == InvalidOid || (idxcollation) == (exprcollation)) /* Whether we are looking for plain indexscan, bitmap scan, or either */ typedef enum @@ -180,20 +182,21 @@ static Expr *expand_boolean_index_clause(Node *clause, int indexcol, IndexOptInfo *index); static List *expand_indexqual_opclause(RestrictInfo *rinfo, Oid opfamily, Oid idxcollation); static RestrictInfo *expand_indexqual_rowcompare(RestrictInfo *rinfo, IndexOptInfo *index, int indexcol); static List *prefix_quals(Node *leftop, Oid opfamily, Oid collation, Const *prefix, Pattern_Prefix_Status pstatus); static List *network_prefix_quals(Node *leftop, Oid expr_op, Oid opfamily, Datum rightop); +static List *range_elem_contained_quals(Node *leftop, Datum rightop); static Datum 
string_to_datum(const char *str, Oid datatype); static Const *string_to_const(const char *str, Oid datatype); /* * create_index_paths() * Generate all interesting index paths for the given relation. * Candidate paths are added to the rel's pathlist (using add_path). * * To be considered for an index scan, an index must match one or more @@ -3286,20 +3289,23 @@ match_special_index_operator(Expr *clause, Oid opfamily, Oid idxcollation, /* the right-hand const is type text for all of these */ pstatus = pattern_fixed_prefix(patt, Pattern_Type_Regex_IC, expr_coll, &prefix, NULL); isIndexable = (pstatus != Pattern_Prefix_None); break; case OID_INET_SUB_OP: case OID_INET_SUBEQ_OP: isIndexable = true; break; + case OID_RANGE_ELEM_CONTAINED_OP: + isIndexable = true; + break; } if (prefix) { pfree(DatumGe
Re: [HACKERS] Index usage for elem-contained-by-const-range clauses
On Sunday 12 March 2017 01:58 AM, Jim Nasby wrote:
> On 3/10/17 8:29 AM, Alexander Korotkov wrote:
>> That's cool idea. But I would say more. Sometimes it's useful to transform "intcol between x and y" into "intcol <@ 'x,y'::int4range". btree_gin supports "intcol between x and y" as overlap of "intcol >= x" and "intcol <= y". That is very inefficient. But it this clause would be transformed into "intcol <@ 'x,y'::int4range", btree_gin could handle this very efficient.
>
> That's certainly be nice as well, but IMHO it's outside the scope of this patch to accomplish that.

Also, I think btree indexes are more common than btree_gin. The motivation for this originally came from trying to use the primary key of a large table in a range search, and the primary key index was the default btree.

Also, this is my first deep dive into Postgres's source code, so I took a few easy ways out, just to get started. If it's not too complex to get btree_gin to handle between queries as contained-in-range, I can give it a try.

> BTW, while we're wishing for things... Something else that would be nice is if there was a way to do these kind of transforms without hacking the backend...

Indeed. And this was one of the things Tom said back when a similar discussion had happened (on the -performance mailing list). But seeing as how it's been almost four years since then, I decided to go ahead with the backend hacking anyway.

>> Also, I noticed that patch haven't regression tests.
>
> BTW, those tests need to pay special attention to inclusive vs exclusive bounds.

I will add regression tests, though I do have to get through all of Tom's suggestions elsewhere in this thread first.

--
#!/usr/bin/env regards
Chhatoi Pritam Baral
Re: [HACKERS] Index usage for elem-contained-by-const-range clauses
I wrote:
> * You're not bothering to insert any inputcollid into the generated
> comparison operator nodes. I'm not sure why that fails to fall over
> for text comparisons (if indeed it does fail ...) but it's wrong.
> Use the range type's collation there.

Oh ... looking at this again, I realize that there's an additional validity check missing: if the range type's collation doesn't match the index column's collation, we can't do this optimization at all. That check probably belongs in match_special_index_operator.

regards, tom lane
Re: [HACKERS] Index usage for elem-contained-by-const-range clauses
Pritam Baral writes:
> The topic has been previously discussed[0] on the -performance mailing list, about four years ago.
>
> In that thread, Tom suggested[0] the planner could be made to "expand "intcol <@ 'x,y'::int4range" into "intcol between x and y", using something similar to the index LIKE optimization (ie, the "special operator" stuff in indxpath.c)".
>
> This patch tries to do exactly that.

I took a quick look through this, and have some thoughts ---

* In match_special_index_operator, there are two switch statements and you've added a clause to only one of them. In the second one, you need to add a check that you're working with a btree index. I'd expect the patch as-is to fall over if an "indexkey <@ range" clause were matched to a hash index.

* You're failing to account for the case of "range @> indexkey", which really this ought to work for. That would take a bit of fooling around with the structure of match_special_index_operator to allow indexkey on the right, but we've foreseen since the beginning that that would someday be necessary. Looks like today is that day.

* The first bit in range_elem_contained_quals will fall over for an indexkey that is an expression rather than a simple Var. Probably you should just be using exprType() instead. (Not sure about whether that's sufficient in domain cases, though.) Or actually, why look at that at all? Seems like what you want is to look at the RHS input, ie the range value, and get the relevant element datatype from it. That would correspond to what will happen if the <@ operator executes normally: elem_contained_by_range does not consult the type of its LHS.

* The "return NIL" for an empty range looks pretty dubious. Even if it fails to fail altogether, it's not doing what we really want, which is to signal that the condition cannot succeed so we needn't search the index. Maybe what we should do in that case is generate an "indexkey = NULL" qual.
* Likewise, if the range is infinite, you're just returning NIL and that's leaving something on the table. Probably worth generating "indexkey IS NOT NULL" in that case; it's not going to help much in typical usage, but it would prevent scanning nulls if there are a lot of them in the index.

* elog(ERROR) doesn't return, so stuff like this is not idiomatic:

+        if (opr2oid == InvalidOid)
+        {
+            elog(ERROR, "no <= operator for opfamily %u", opfamily);
+            return NIL;
+        }

It'd be sufficient to do

+        if (opr2oid == InvalidOid)
+            elog(ERROR, "no <= operator for opfamily %u", opfamily);

* You're not bothering to insert any inputcollid into the generated comparison operator nodes. I'm not sure why that fails to fall over for text comparisons (if indeed it does fail ...) but it's wrong. Use the range type's collation there.

* It's sort of annoying that the whole thing only works for a Const range value. A different approach you might think about is to make this work more like ScalarArrayOp, ie we pass the qual through to btree as-is and teach relevant functions in access/nbtree/ how to extract index bound conditions from the range datum at runtime. That would likely end up being a significantly larger patch, though, so you might reasonably conclude it's not worth the effort.

regards, tom lane
Re: [HACKERS] Index usage for elem-contained-by-const-range clauses
On 3/10/17 8:29 AM, Alexander Korotkov wrote:
> That's cool idea. But I would say more. Sometimes it's useful to transform "intcol between x and y" into "intcol <@ 'x,y'::int4range". btree_gin supports "intcol between x and y" as overlap of "intcol >= x" and "intcol <= y". That is very inefficient. But it this clause would be transformed into "intcol <@ 'x,y'::int4range", btree_gin could handle this very efficient.

That's certainly be nice as well, but IMHO it's outside the scope of this patch to accomplish that.

BTW, while we're wishing for things... Something else that would be nice is if there was a way to do these kind of transforms without hacking the backend...

> Also, I noticed that patch haven't regression tests.

BTW, those tests need to pay special attention to inclusive vs exclusive bounds.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
Re: [HACKERS] Index usage for elem-contained-by-const-range clauses
Hi, Pritam!

I've assigned to review this patch.

On Thu, Feb 23, 2017 at 2:17 AM, Pritam Baral wrote:
> The topic has been previously discussed[0] on the -performance mailing list, about four years ago.
>
> In that thread, Tom suggested[0] the planner could be made to "expand "intcol <@ 'x,y'::int4range" into "intcol between x and y", using something similar to the index LIKE optimization (ie, the "special operator" stuff in indxpath.c)".

That's cool idea. But I would say more. Sometimes it's useful to transform "intcol between x and y" into "intcol <@ 'x,y'::int4range". btree_gin supports "intcol between x and y" as overlap of "intcol >= x" and "intcol <= y". That is very inefficient. But it this clause would be transformed into "intcol <@ 'x,y'::int4range", btree_gin could handle this very efficient.

> This patch tries to do exactly that. It's not tied to any specific datatype, and has been tested with both builtin types and custom range types. Most of the checking for proper datatypes, operators, and btree index happens before this code, so I haven't run into any issues yet in my testing. But I'm not familiar enough with the internals to be able to confidently say it can handle all cases just yet.

I've tried this patch. It applies cleanly, but doesn't compile.

indxpath.c:4252:1: error: conflicting types for 'range_elem_contained_quals'
range_elem_contained_quals(Node *leftop, Datum rightop)
^
indxpath.c:192:14: note: previous declaration is here
static List *range_elem_contained_quals(Node *leftop, Oid expr_op, Oid opfamily,
^

Could you please recheck that you published right version of patch?

Also, I noticed that patch haven't regression tests. Some mention of this optimization in docs is also nice to have.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Index usage for elem-contained-by-const-range clauses
On Thu, Feb 23, 2017 at 4:47 AM, Pritam Baral wrote:
> The topic has been previously discussed[0] on the -performance mailing list, about four years ago.
>
> In that thread, Tom suggested[0] the planner could be made to "expand "intcol <@ 'x,y'::int4range" into "intcol between x and y", using something similar to the index LIKE optimization (ie, the "special operator" stuff in indxpath.c)".
>
> This patch tries to do exactly that.

Please add your patch to https://commitfest.postgresql.org/ so it doesn't get overlooked.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2/19/17 5:27 AM, Robert Haas wrote:
> (1) a multi-batch hash join, (2) a nested loop, and (3) a merge join. (2) is easy to implement but will generate a ton of random I/O if the table is not resident in RAM. (3) is most suitable for very large tables but takes more work to code, and is also likely to be a lot slower for small tables than a hash or nestloop-based approach.

As I understand it, #3 is already in place for validate_index(). I think you'd just need a different callback that checks the heap key.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)
[HACKERS] Index usage for elem-contained-by-const-range clauses
The topic has been previously discussed[0] on the -performance mailing list, about four years ago. In that thread, Tom suggested[0] the planner could be made to "expand "intcol <@ 'x,y'::int4range" into "intcol between x and y", using something similar to the index LIKE optimization (ie, the "special operator" stuff in indxpath.c)". This patch tries to do exactly that. It's not tied to any specific datatype, and has been tested with both builtin types and custom range types. Most of the checking for proper datatypes, operators, and btree index happens before this code, so I haven't run into any issues yet in my testing. But I'm not familiar enough with the internals to be able to confidently say it can handle all cases just yet. [0]: https://www.postgresql.org/message-id/flat/9860.1364013108%40sss.pgh.pa.us#9860.1364013...@sss.pgh.pa.us -- #!/usr/bin/env regards Chhatoi Pritam Baral diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c index 2952bfb7c2..84dfd8362a 100644 --- a/src/backend/optimizer/path/indxpath.c +++ b/src/backend/optimizer/path/indxpath.c @@ -30,21 +30,23 @@ #include "optimizer/pathnode.h" #include "optimizer/paths.h" #include "optimizer/predtest.h" #include "optimizer/prep.h" #include "optimizer/restrictinfo.h" #include "optimizer/var.h" #include "utils/builtins.h" #include "utils/bytea.h" #include "utils/lsyscache.h" #include "utils/pg_locale.h" +#include "utils/rangetypes.h" #include "utils/selfuncs.h" +#include "utils/typcache.h" #define IsBooleanOpfamily(opfamily) \ ((opfamily) == BOOL_BTREE_FAM_OID || (opfamily) == BOOL_HASH_FAM_OID) #define IndexCollMatchesExprColl(idxcollation, exprcollation) \ ((idxcollation) == InvalidOid || (idxcollation) == (exprcollation)) /* Whether we are looking for plain indexscan, bitmap scan, or either */ typedef enum @@ -180,20 +182,22 @@ static Expr *expand_boolean_index_clause(Node *clause, int indexcol, IndexOptInfo *index); static List 
*expand_indexqual_opclause(RestrictInfo *rinfo, Oid opfamily, Oid idxcollation); static RestrictInfo *expand_indexqual_rowcompare(RestrictInfo *rinfo, IndexOptInfo *index, int indexcol); static List *prefix_quals(Node *leftop, Oid opfamily, Oid collation, Const *prefix, Pattern_Prefix_Status pstatus); static List *network_prefix_quals(Node *leftop, Oid expr_op, Oid opfamily, Datum rightop); +static List *range_elem_contained_quals(Node *leftop, Oid expr_op, Oid opfamily, + Datum rightop); static Datum string_to_datum(const char *str, Oid datatype); static Const *string_to_const(const char *str, Oid datatype); /* * create_index_paths() * Generate all interesting index paths for the given relation. * Candidate paths are added to the rel's pathlist (using add_path). * * To be considered for an index scan, an index must match one or more @@ -3286,20 +3290,23 @@ match_special_index_operator(Expr *clause, Oid opfamily, Oid idxcollation, /* the right-hand const is type text for all of these */ pstatus = pattern_fixed_prefix(patt, Pattern_Type_Regex_IC, expr_coll, &prefix, NULL); isIndexable = (pstatus != Pattern_Prefix_None); break; case OID_INET_SUB_OP: case OID_INET_SUBEQ_OP: isIndexable = true; break; + case OID_RANGE_ELEM_CONTAINED_OP: + isIndexable = true; + break; } if (prefix) { pfree(DatumGetPointer(prefix->constvalue)); pfree(prefix); } /* done if the expression doesn't look indexable */ if (!isIndexable) @@ -3614,20 +3621,27 @@ expand_indexqual_opclause(RestrictInfo *rinfo, Oid opfamily, Oid idxcollation) break; case OID_INET_SUB_OP: case OID_INET_SUBEQ_OP: if (!op_in_opfamily(expr_op, opfamily)) { return network_prefix_quals(leftop, expr_op, opfamily, patt->constvalue); } break; + case OID_RANGE_ELEM_CONTAINED_OP: + if (!op_in_opfamily(expr_op, opfamily)) + { +return range_elem_contained_quals(leftop, expr_op, opfamily, + patt->constvalue); + } + break; } /* Default case: just make a list of the unmodified indexqual */ return list_make1(rinfo); } /* * 
expand_indexqual_rowcompare --- expand a single indexqual condition * that is a RowCompareExpr * @@ -4096,20 +4110,124 @@ network_prefix_quals(Node *leftop, Oid expr_op, Oid opfamily, Datum rightop) InvalidOid, /* not collatable */ -1, opr2right, false, false), InvalidOid, InvalidOid); result = lappend(result, make_simple_restrictinfo(expr)); return result; } /* + * Given an element leftop and a range rightop, and an elem contained-by range + * operator, generate suitable indexqual condition(s). + */ +static List * +range_elem_contained_quals(Node *leftop, Datum rightop) +{ + Oiddatatype; + Oidopfamily; + Oidopr1oid; + Oidopr2oid; + List *result = NIL; + Expr *expr; + RangeType *range; + TypeC
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 19, 2017 at 3:52 PM, Pavan Deolasee wrote:
> This particular case of corruption results in a heap tuple getting indexed by a wrong key (or to be precise, indexed by its old value). So the only way to detect the corruption is to look at each index key and check if it matches with the corresponding heap tuple. We could write some kind of self join that can use a sequential scan and an index-only scan (David Rowley had suggested something of that sort internally here), but we can't guarantee index-only scan on a table which is being concurrently updated. So not sure even that will catch every possible case.

Oh, so the problem isn't index entries that are altogether missing? I guess I was confused.

You can certainly guarantee an index-only scan if you write the validation code in C rather than using SQL. I think the issue is that if the table is large enough that keeping a TID -> index value mapping in a hash table is impractical, there's not going to be a real efficient strategy for this. Ignoring the question of whether you use the main executor for this or just roll your own code, your options for a large table are (1) a multi-batch hash join, (2) a nested loop, and (3) a merge join. (2) is easy to implement but will generate a ton of random I/O if the table is not resident in RAM. (3) is most suitable for very large tables but takes more work to code, and is also likely to be a lot slower for small tables than a hash or nestloop-based approach.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 19, 2017 at 3:43 PM, Robert Haas wrote: > On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane wrote: > > > Ah, nah, scratch that. If any post-index-build pruning had occurred on a > > page, the evidence would be gone --- the non-matching older tuples would > > be removed and what remained would look consistent. But it wouldn't > > match the index. You might be able to find problems if you were willing > > to do the expensive check on *every* HOT chain, but that seems none too > > practical. > > If the goal is just to detect tuples that aren't indexed and should > be, what about scanning each index and building a TIDBitmap of all of > the index entries, and then scanning the heap for non-HOT tuples and > probing the TIDBitmap for each one? If you find a heap entry for > which you didn't find an index entry, go and recheck the index and see > if one got added concurrently. If not, whine. > > This particular case of corruption results in a heap tuple getting indexed by a wrong key (or to be precise, indexed by its old value). So the only way to detect the corruption is to look at each index key and check if it matches with the corresponding heap tuple. We could write some kind of self join that can use a sequential scan and an index-only scan (David Rowley had suggested something of that sort internally here), but we can't guarantee index-only scan on a table which is being concurrently updated. So not sure even that will catch every possible case. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane wrote:
> I wrote:
>> However, you might be able to find it without so much random I/O. I'm envisioning a seqscan over the table, in which you simply look for HOT chains in which the indexed columns aren't all the same. When you find one, you'd have to do a pretty expensive index lookup to confirm whether things are OK or not, but hopefully that'd be rare enough to not be a big performance sink.
>
> Ah, nah, scratch that. If any post-index-build pruning had occurred on a page, the evidence would be gone --- the non-matching older tuples would be removed and what remained would look consistent. But it wouldn't match the index. You might be able to find problems if you were willing to do the expensive check on *every* HOT chain, but that seems none too practical.

If the goal is just to detect tuples that aren't indexed and should be, what about scanning each index and building a TIDBitmap of all of the index entries, and then scanning the heap for non-HOT tuples and probing the TIDBitmap for each one? If you find a heap entry for which you didn't find an index entry, go and recheck the index and see if one got added concurrently. If not, whine.

One problem is that you'd need to make sure that the TIDBitmap didn't lossify, but that could be prevented either by specifying a very large maxbytes setting or by using something that serves the same function instead of a TIDBitmap per se.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Fri, Feb 17, 2017 at 9:31 AM, Tom Lane wrote:
> This seems like it'd be quite a different tool than amcheck, though. Also, it would only find broken-HOT-chain corruption, which might be a rare enough issue to not deserve a single-purpose tool.

FWIW, my ambition for amcheck is that it will have checks for a large variety of invariants that involve the heap, and related SLRU structures such as MultiXacts. Though, that would probably necessitate code written by other people that are subject matter experts in areas that I am not.

--
Peter Geoghegan
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
I wrote:
> However, you might be able to find it without so much random I/O. I'm envisioning a seqscan over the table, in which you simply look for HOT chains in which the indexed columns aren't all the same. When you find one, you'd have to do a pretty expensive index lookup to confirm whether things are OK or not, but hopefully that'd be rare enough to not be a big performance sink.

Ah, nah, scratch that. If any post-index-build pruning had occurred on a page, the evidence would be gone --- the non-matching older tuples would be removed and what remained would look consistent. But it wouldn't match the index. You might be able to find problems if you were willing to do the expensive check on *every* HOT chain, but that seems none too practical.

regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Peter Geoghegan writes:
> The difference with a test that could detect this variety of corruption is that that would need to visit the heap, which tends to be much larger than any one index, or even all indexes. That would probably need to be random I/O, too. It might be possible to mostly not visit the heap, though -- I'm not sure offhand. I'd have to study the problem in detail, which I have no time for at the moment.

Unless I misunderstand this problem, the issue is that there might be some broken HOT chains, that is chains in which not all the heap tuples have the same values for the index columns, and in particular the live entry at the end of the chain doesn't agree with the index. It seems pretty impossible to detect that by examining the index alone.

However, you might be able to find it without so much random I/O. I'm envisioning a seqscan over the table, in which you simply look for HOT chains in which the indexed columns aren't all the same. When you find one, you'd have to do a pretty expensive index lookup to confirm whether things are OK or not, but hopefully that'd be rare enough to not be a big performance sink.

This seems like it'd be quite a different tool than amcheck, though. Also, it would only find broken-HOT-chain corruption, which might be a rare enough issue to not deserve a single-purpose tool.

regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Fri, Feb 17, 2017 at 8:23 AM, Keith Fiske wrote:
> It's not the load I'm worried about, it's the locks that are required at some point during the rebuild. Doing an index rebuild here and there isn't a big deal, but trying to do it for an entire heavily loaded, multi-terabyte database is hardly trivial. And I'd say doing a scan is far less invasive than actually rebuilding the index since little to no writing is actually being done.

Certainly, the checks that amcheck performs that don't require a heavier lock (just a SELECT-style AccessShareLock) were vastly less expensive than reindexing, and of course had only negligible potential to block other operations. And, REINDEX certainly is a foot-gun, lock-wise, which is why I try to avoid it.

The difference with a test that could detect this variety of corruption is that that would need to visit the heap, which tends to be much larger than any one index, or even all indexes. That would probably need to be random I/O, too. It might be possible to mostly not visit the heap, though -- I'm not sure offhand. I'd have to study the problem in detail, which I have no time for at the moment.

--
Peter Geoghegan
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Keith Fiske wrote: > I can understand if it's simply not possible, but if it is, I think in > any case of data corruption, having some means to check for it to be sure > you're in the clear would be useful. Maybe it is possible. I just didn't try, since it didn't seem very useful. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Fri, Feb 17, 2017 at 11:12 AM, Alvaro Herrera wrote: > Keith Fiske wrote: > > > Was just curious if anyone was able to come up with any sort of method to > > test whether an index was corrupted by this bug, other than just waiting > > for bad query results? We've used concurrent index rebuilding quite > > extensively over the years to remove bloat from busy systems, but > > reindexing the entire database "just in case" is unrealistic in many of > our > > cases. > > As stated, if the CREATE INDEX operates on columns that are previously > already indexed (which is normally the case when you rebuild because of > bloat) then there is no chance of index corruption. > > Scanning indexes+tables is just as load-intensive as rebuilding the > indexes anyway. You don't save any work. I suppose it can be a problem > if you have an index big enough that it doesn't fit on your remaining > free space (but in that case you have a pre-existing problem which you > should solve anyway). > > -- > Álvaro Herrera https://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > It's not the load I'm worried about, it's the locks that are required at some point during the rebuild. Doing an index rebuild here and there isn't a big deal, but trying to do it for an entire heavily loaded, multi-terabyte database is hardly trivial. And I'd say doing a scan is far less invasive than actually rebuilding the index since little to no writing is actually being done. I can understand if it's simply not possible, but if it is, I think in any case of data corruption, having some means to check for it to be sure you're in the clear would be useful. Keith
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Keith Fiske wrote: > Was just curious if anyone was able to come up with any sort of method to > test whether an index was corrupted by this bug, other than just waiting > for bad query results? We've used concurrent index rebuilding quite > extensively over the years to remove bloat from busy systems, but > reindexing the entire database "just in case" is unrealistic in many of our > cases. As stated, if the CREATE INDEX operates on columns that are previously already indexed (which is normally the case when you rebuild because of bloat) then there is no chance of index corruption. Scanning indexes+tables is just as load-intensive as rebuilding the indexes anyway. You don't save any work. I suppose it can be a problem if you have an index big enough that it doesn't fit on your remaining free space (but in that case you have a pre-existing problem which you should solve anyway). -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 10:17 PM, Amit Kapila wrote: > On Mon, Feb 6, 2017 at 10:28 PM, Tom Lane wrote: > > Amit Kapila writes: > >> Hmm. Consider that the first time relcache invalidation occurs while > >> computing id_attrs, so now the retry logic will compute the correct > >> set of attrs (considering two indexes, if we take the example given by > >> Alvaro above.). However, the attrs computed for hot_* and key_* will > >> be using only one index, so this will lead to a situation where two of > >> the attrs (hot_attrs and key_attrs) are computed with one index and > >> id_attrs is computed with two indexes. I think this can lead to > >> an assertion failure in HeapSatisfiesHOTandKeyUpdate(). > > > > That seems like something we'd be best off to fix by changing the > > assertion. > > > > The above-mentioned problem doesn't exist in your version of the patch > (which is now committed) because the index attrs are cached after > invalidation and I have mentioned the same in my yesterday's followup > mail. The guarantee of safety is that we don't handle invalidation > between two consecutive calls to RelationGetIndexAttrBitmap() in > heap_update, but if we do handle due to any reason which is not > apparent to me, then I think there is some chance of hitting the > assertion. > > > I doubt it's really going to be practical to guarantee that > > those bitmapsets are always 100% consistent throughout a transaction. > > > > Agreed. As the code stands, I think it is only expected to be > guaranteed across three consecutive calls to > RelationGetIndexAttrBitmap() in heap_update. Now, if the assertion in > HeapSatisfiesHOTandKeyUpdate() is useless and we can remove it, then > probably we don't need a guarantee to be consistent in three > consecutive calls as well. > > >> Okay, please find attached patch which is an extension of Tom's and > >> Andres's patch and I think it fixes the above problem noted by me. > > > > I don't like this patch at all.
It only fixes the above issue if the > > relcache inval arrives while RelationGetIndexAttrBitmap is executing. > > If the inval arrives between two such calls, you still have the problem > > of the second one delivering a bitmapset that might be inconsistent > > with the first one. > > > > I think it won't happen between two consecutive calls to > RelationGetIndexAttrBitmap in heap_update. It can happen during > multi-row update, but that should be fine. I think this version of > the patch only defers the creation of fresh bitmapsets wherever > possible. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > Was just curious if anyone was able to come up with any sort of method to test whether an index was corrupted by this bug, other than just waiting for bad query results? We've used concurrent index rebuilding quite extensively over the years to remove bloat from busy systems, but reindexing the entire database "just in case" is unrealistic in many of our cases.
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 10:28 PM, Tom Lane wrote: > Amit Kapila writes: >> Hmm. Consider that the first time relcache invalidation occurs while >> computing id_attrs, so now the retry logic will compute the correct >> set of attrs (considering two indexes, if we take the example given by >> Alvaro above.). However, the attrs computed for hot_* and key_* will >> be using only one index, so this will lead to a situation where two of >> the attrs (hot_attrs and key_attrs) are computed with one index and >> id_attrs is computed with two indexes. I think this can lead to >> an assertion failure in HeapSatisfiesHOTandKeyUpdate(). > > That seems like something we'd be best off to fix by changing the > assertion. > The above-mentioned problem doesn't exist in your version of the patch (which is now committed) because the index attrs are cached after invalidation and I have mentioned the same in my yesterday's followup mail. The guarantee of safety is that we don't handle invalidation between two consecutive calls to RelationGetIndexAttrBitmap() in heap_update, but if we do handle due to any reason which is not apparent to me, then I think there is some chance of hitting the assertion. > I doubt it's really going to be practical to guarantee that > those bitmapsets are always 100% consistent throughout a transaction. > Agreed. As the code stands, I think it is only expected to be guaranteed across three consecutive calls to RelationGetIndexAttrBitmap() in heap_update. Now, if the assertion in HeapSatisfiesHOTandKeyUpdate() is useless and we can remove it, then probably we don't need a guarantee to be consistent in three consecutive calls as well. >> Okay, please find attached patch which is an extension of Tom's and >> Andres's patch and I think it fixes the above problem noted by me. > > I don't like this patch at all. It only fixes the above issue if the > relcache inval arrives while RelationGetIndexAttrBitmap is executing.
> If the inval arrives between two such calls, you still have the problem > of the second one delivering a bitmapset that might be inconsistent > with the first one. > I think it won't happen between two consecutive calls to RelationGetIndexAttrBitmap in heap_update. It can happen during multi-row update, but that should be fine. I think this version of the patch only defers the creation of fresh bitmapsets wherever possible. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 11:54 PM, Tom Lane wrote: > After some discussion among the release team, we've concluded that the > best thing to do is to push Pavan's/my patch into today's releases. > This does not close the matter by any means: we should continue to > study whether there are related bugs or whether there's a more principled > way of fixing this bug. But that patch clearly makes things better, > and we shouldn't let worries about whether there are more bugs stop > us from providing some kind of fix to users. > > I've made the push, and barring negative reports from the buildfarm, > it will be in today's releases. > Thank you for taking care of it. The buildfarm is looking green so far. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
After some discussion among the release team, we've concluded that the best thing to do is to push Pavan's/my patch into today's releases. This does not close the matter by any means: we should continue to study whether there are related bugs or whether there's a more principled way of fixing this bug. But that patch clearly makes things better, and we shouldn't let worries about whether there are more bugs stop us from providing some kind of fix to users. I've made the push, and barring negative reports from the buildfarm, it will be in today's releases. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 9:42 PM, Pavan Deolasee wrote: > On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan wrote: >> On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote: >> > I don't think this kind of black-and-white thinking is very helpful. >> > Obviously, data corruption is bad. However, this bug has (from what >> > one can tell from this thread) been with us for over a decade; it must >> > necessarily be either low-probability or low-severity, or somebody >> > would've found it and fixed it before now. Indeed, the discovery of >> > this bug was driven by new feature development, not a user report. It >> > seems pretty clear that if we try to patch this and get it wrong, the >> > effects of our mistake could easily be a lot more serious than the >> > original bug. >> >> +1. The fact that it wasn't driven by a user report convinces me that >> this is the way to go. >> > > I'm not sure that just because the bug wasn't reported by a user, it is > any less critical. As Tomas pointed out down thread, the nature of the bug is > such that the users may not discover it very easily, but that doesn't mean > it couldn't be affecting them all the time. We can now correlate many past > reports of index corruption to this bug, but we don't have evidence to prove > that. Lack of any good tool or built-in checks probably makes it even > harder. > > The reason why I discovered this with WARM is that it now has built-in > recheck logic, which was discarding some tuples returned by the index scan. > It took me lots of code review and inspection to finally conclude that this > must be an existing bug. Even then, for lack of any utility, I could not > detect this easily with master. That doesn't mean I wasn't able to > reproduce it, but detection was a problem. > > If you look at the reproduction setup, one in every 10, if not 5, CREATE > INDEX CONCURRENTLY ends up creating a corrupt index. That's a good 10% > probability.
And this is on a low end laptop, with just 5-10 concurrent > clients running. Probability of hitting the problem will be much higher on a > bigger machine, with many users (on a decent AWS machine, I would find every > third CIC to be corrupt). Moreover the setup is not doing anything > extraordinary. Just concurrent updates which flip between HOT and non-HOT, > and a CIC. > Not that I am advocating that we should do a release just for this; having a fix we believe in is certainly as important a factor. But the idea that the bug has been around a long time and is therefore less of an issue does seem wrong. We've certainly seen plenty of cases over the years where bugs have existed in seldom-used code paths, only to be exposed by new features or other code changes over time. In general, we should be less worried about the age of a bug than about our expectation that users are likely to hit that bug now, which does seem high based on the above numbers. In the meantime, it's certainly worth warning users, providing help on how to determine if this is a likely problem for them, and possibly rolling a patch ahead of upstream in cases where that's feasible. Robert Treat play: xzilla.net work: omniti.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Alvaro Herrera writes: > Tom Lane wrote: >> Better to fix the callers so that they don't have the assumption you >> refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap >> so that it returns all the sets needed by a given calling module at >> once, which would allow us to guarantee they're consistent. > Note that my "interesting attrs" patch does away with these independent > bitmaps (which was last posted by Pavan as part of his WARM set). I > think we should fix just this bug now, and for the future look at that > other approach. BTW, if there is a risk of the assertion failure that Amit posits, it seems like it should have happened in the tests that Pavan was doing originally. I'd sort of like to see a demonstration that it can actually happen before we spend any great amount of time fixing it. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Tom Lane wrote: > Better to fix the callers so that they don't have the assumption you > refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap > so that it returns all the sets needed by a given calling module at > once, which would allow us to guarantee they're consistent. Note that my "interesting attrs" patch does away with these independent bitmaps (which was last posted by Pavan as part of his WARM set). I think we should fix just this bug now, and for the future look at that other approach. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Amit Kapila writes: > Hmm. Consider that the first time relcache invalidation occurs while > computing id_attrs, so now the retry logic will compute the correct > set of attrs (considering two indexes, if we take the example given by > Alvaro above.). However, the attrs computed for hot_* and key_* will > be using only one index, so this will lead to a situation where two of > the attrs (hot_attrs and key_attrs) are computed with one index and > id_attrs is computed with two indexes. I think this can lead to > an assertion failure in HeapSatisfiesHOTandKeyUpdate(). That seems like something we'd be best off to fix by changing the assertion. I doubt it's really going to be practical to guarantee that those bitmapsets are always 100% consistent throughout a transaction. > Okay, please find attached patch which is an extension of Tom's and > Andres's patch and I think it fixes the above problem noted by me. I don't like this patch at all. It only fixes the above issue if the relcache inval arrives while RelationGetIndexAttrBitmap is executing. If the inval arrives between two such calls, you still have the problem of the second one delivering a bitmapset that might be inconsistent with the first one. To go in this direction, I think we would have to hot-wire RelationGetIndexAttrBitmap so that once any bitmapset has been requested in a transaction, we intentionally ignore all index set updates for the remainder of the transaction. And that would very likely create more problems than it solves (consider locally initiated DDL within the transaction, for example). Better to fix the callers so that they don't have the assumption you refer to. Or maybe we could adjust the API of RelationGetIndexAttrBitmap so that it returns all the sets needed by a given calling module at once, which would allow us to guarantee they're consistent.
regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Andres Freund wrote: > To show what I mean here's an *unpolished* and *barely tested* patch > implementing the first of my suggestions. > > Alvaro, Pavan, I think this should address the issue as well? Hmm, interesting idea. Maybe a patch along these lines is a good way to make the index-list cache less brittle going forward. However, I'm against putting it in the stable branches -- and we should definitely not stall an immediate fix in order to get this patch ready. IMO we should get Tom's patch in tree for all branches very soon, and then after the release you can finalize this one and put it in master. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Tom Lane wrote: > Pavan Deolasee writes: > > 2. In the second patch, we tried to recompute attribute lists if a relcache > > flush happens in between and the index list is invalidated. We've seen problems > > with that, especially it getting into an infinite loop with > > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache > > flushes keep happening. > > Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush > happen whenever it possibly could. The way to deal with that without > looping is to test whether the index set *actually* changed, not whether > it just might have changed. Ah, that's a nice and simple way to fix that patch! I see that Pavan confirmed that it fixes the tests we saw failing too. It seems to me that we should go with this one, and it looks unlikely that this causes other problems. BTW, while neither Pavan nor I sent this patch to -hackers, this implementation is pretty much the same thing we did. Pavan deserves credit as co-author in this commit. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
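[Editor's aside: the shape of the fix described here, retrying only when the index set actually changed and caching only a result known to match the current index list, can be modelled in a few lines. This is a toy Python sketch with invented names, not PostgreSQL's real C API.]

```python
# Toy model of the retry logic: build the attribute bitmap from a snapshot
# of the index list, then re-read the list.  If it *actually* changed in
# the meantime, recompute; otherwise it is provably safe to cache.

class ToyRelcache:
    def __init__(self, index_defs):
        self.index_defs = list(index_defs)  # simulated catalog state
        self.cached_attrs = None            # analogue of rd_idattr etc.

    def get_index_attr_bitmap(self):
        if self.cached_attrs is not None:
            return self.cached_attrs
        while True:
            snapshot = list(self.index_defs)
            # Stand-in for the expensive scan over each index's columns;
            # a concurrent CREATE INDEX could fire an invalidation here.
            attrs = frozenset(col for idx in snapshot for col in idx)
            if self.index_defs == snapshot:
                # Index set did not *actually* change: safe to cache.
                self.cached_attrs = attrs
                return attrs
            # Index set really changed: loop and rebuild from scratch.

rel = ToyRelcache([("a",), ("a", "b")])
bitmap = rel.get_index_attr_bitmap()
```

Under CLOBBER_CACHE_ALWAYS the naive version (retry whenever a flush might have happened) loops forever; comparing the actual index set terminates because the set itself is stable.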
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 05/02/17 at 21:57, Tomas Vondra wrote: > > +1 to not rushing fixes into releases. While I think we now finally > understand the mechanics of this bug, the fact that we came up with > three different fixes in this thread, only to discover issues with each > of them, warrants some caution. I also agree with Robert on not rushing the patch. My point was about whether we had to rush the release. > OTOH I disagree with the notion that bugs that are not driven by user > reports are somehow less severe. Some data corruption bugs cause quite > visible breakage - segfaults, immediate crashes, etc. Those are pretty > clear bugs, and are reported by users. > > Other data corruption bugs are much more subtle - for example this bug > may lead to incorrect results to some queries, failing to detect values > violating UNIQUE constraints, issues with foreign keys, etc. Just yesterday, after sending the mail, I was recalling a logical replication setup we did on a customer's 9.3 server which brought up data inconsistencies on the primary key of one of the tables. The table had duplicate values. As Tomas says, it's subtle and hard to find unless you constantly run index checks (query a sample of the data from the heap and from the index and check they match). In our case, the customer was not aware of the dups until we found them. Regards, -- Martín Marqués http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
> On 6 Feb 2017, at 4:57, Peter Geoghegan wrote: > > I meant that I find the fact that there were no user reports in all > these years to be a good reason to not proceed for now in this > instance. Well, we had some strange situations with indexes (see below for an example), but I couldn't even think that CIC was the problem. And it is really difficult to diagnose problems of this kind, because (1) you may see the problem several days after the last major change in the database, and (2) you don't even know how to start investigating it.

xdb314g/maildb M # show enable_indexscan ;
 enable_indexscan
------------------
 off
(1 row)

Time: 0.120 ms
xdb314g/maildb M # select * from mail.folders where uid=448300241 and fid=1;
-[ RECORD 1 ]---+--
uid | 448300241
fid | 1
<...>

Time: 30398.637 ms
xdb314g/maildb M # set enable_indexscan to true;
SET
Time: 0.111 ms
xdb314g/maildb M # select * from mail.folders where uid=448300241 and fid=1;
(0 rows)

Time: 0.312 ms
xdb314g/maildb M #

The row of course hasn't been deleted. -- May the force be with you… https://simply.name
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 9:47 AM, Pavan Deolasee wrote: > > > On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila wrote: >> >> >> >> Hmm. Consider that the first time relcache invalidation occurs while >> computing id_attrs, so now the retry logic will compute the correct >> set of attrs (considering two indexes, if we take the example given by >> Alvaro above.). > > > I don't quite get that. Since rd_idattr must be already cached at that point > and we don't expect to process a relcache flush between successive calls to > RelationGetIndexAttrBitmap(), we should return a consistent copy of > rd_idattr. > Right, I was mistaken. However, I am not sure if there is a need to perform retry logic in RelationGetIndexAttrBitmap(). It is safe to do that at transaction end. I feel that even though Tom's fix looks reliable with respect to the problem reported in this thread, we should take some time to evaluate different proposals and see which one is the best way to move forward. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee wrote: > >> > I like this approach. I'll run the patch on a build with > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. > While it looks certain that the fix will miss this point release, FWIW I ran this patch with CLOBBER_CACHE_ALWAYS and it seems to be working as expected. Hard to run all the tests, but a small subset of regression and test_decoding seems ok. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila wrote: > > > Hmm. Consider that the first time relcache invalidation occurs while > computing id_attrs, so now the retry logic will compute the correct > set of attrs (considering two indexes, if we take the example given by > Alvaro above.). I don't quite get that. Since rd_idattr must be already cached at that point and we don't expect to process a relcache flush between successive calls to RelationGetIndexAttrBitmap(), we should return a consistent copy of rd_idattr. But maybe I am missing something. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 8:35 AM, Pavan Deolasee wrote: > > > On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila wrote: >> >> On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee >> wrote: >> > >> > >> > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote: >> >> >> >> >> >> >> >> > 2. In the second patch, we tried to recompute attribute lists if a >> >> > relcache >> >> > flush happens in between and index list is invalidated. We've seen >> >> > problems >> >> > with that, especially it getting into an infinite loop with >> >> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently >> >> > relcache >> >> > flushes keep happening. >> >> >> >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache >> >> flush >> >> happen whenever it possibly could. >> > >> > >> > Ah, ok. That explains why the retry logic as originally proposed was >> > causing >> > an infinite loop. >> > >> >> >> >> The way to deal with that without >> >> looping is to test whether the index set *actually* changed, not >> >> whether >> >> it just might have changed. >> > >> > >> > I like this approach. I'll run the patch on a build with >> > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the >> > fact that the retry logic is now limited to only when the index set changes, >> > which >> > fits the situation we're dealing with. >> > >> >> Don't you think it also has the same problem as mentioned by me in >> Alvaro's patch and you also agreed on it? > > > No, I don't think so. There can't be any more relcache flushes occurring > between the first RelationGetIndexAttrBitmap() and the subsequent > RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed > approach was that we were not caching, yet returning stale information. That > implied the next call will again recompute a possibly different value. The > retry logic is trying to close a small yet important race condition > where index set invalidation happens, but we cache stale information. > Hmm.
Consider that the first time relcache invalidation occurs while computing id_attrs, so now the retry logic will compute the correct set of attrs (considering two indexes, if we take the example given by Alvaro above.). However, the attrs computed for hot_* and key_* will be using only one index, so this will lead to a situation where two of the attrs (hot_attrs and key_attrs) are computed with one index and id_attrs is computed with two indexes. I think this can lead to an assertion failure in HeapSatisfiesHOTandKeyUpdate(). >> >> Do you see any need to fix >> it locally in >> RelationGetIndexAttrBitmap(), why can't we clear at transaction end? >> > > We're trying to fix it in RelationGetIndexAttrBitmap() because that's where > the race exists. I'm not saying there aren't other and better ways of > handling it, but we don't know (I haven't studied Andres's proposal yet). > Okay, please find attached patch which is an extension of Tom's and Andres's patch and I think it fixes the above problem noted by me. Basically, invalidate the bitmaps at transaction end rather than in RelationGetIndexAttrBitmap(). I think it is okay for RelationGetIndexAttrBitmap() to use stale information until transaction end because all the updates in the current running transaction will be consumed by the second phase of CIC. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com invalidate_indexattr_v6.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
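[Editor's aside: the deferral Amit proposes, serving a stable copy for the whole transaction and applying invalidations only at transaction end, can be modelled as a toy cache. Python sketch with invented names, not the real relcache machinery.]

```python
# Toy model: invalidations received mid-transaction are queued rather than
# applied, so repeated lookups (e.g. the consecutive bitmap fetches in
# heap_update) see one consistent value; the queue drains at txn end.

class TxnDeferredCache:
    def __init__(self):
        self._cache = {}
        self._pending = set()    # invalidations queued for txn end

    def lookup(self, key, compute):
        if key not in self._cache:
            self._cache[key] = compute()
        return self._cache[key]  # stable for the rest of the txn

    def invalidate(self, key):
        self._pending.add(key)   # deferred, not applied immediately

    def end_of_transaction(self):
        for key in self._pending:
            self._cache.pop(key, None)
        self._pending.clear()

cache = TxnDeferredCache()
first = cache.lookup("rel", lambda: frozenset({"a"}))
cache.invalidate("rel")          # e.g. a concurrent CIC adds an index
second = cache.lookup("rel", lambda: frozenset({"a", "b"}))
# second is the same (stale but internally consistent) copy as first
cache.end_of_transaction()
third = cache.lookup("rel", lambda: frozenset({"a", "b"}))
# now recomputed with the new index set
```

The trade-off mirrors the thread: within the transaction every caller sees the same (possibly stale) answer, which keeps hot_attrs/key_attrs/id_attrs mutually consistent, at the cost of reacting to the invalidation only at transaction end.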
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2017-02-05 22:34:34 -0500, Tom Lane wrote: > Pavan Deolasee writes: > The point is that there's a nontrivial chance of a hasty fix introducing > worse problems than we fix. > > Given the lack of consensus about exactly how to fix this, I'm feeling > like it's a good idea if whatever we come up with gets some time to age > awhile in git before we ship it. Right. And I'm not even convinced that we really know the extent of the bug; it seems fairly plausible that there are further incidences of this. There's also the issue that the mechanics in the older back branches are different again, because of SnapshotNow. >> I'm a bit surprised by this position. What do we tell our users now that >> we know this bug exists? That we're scheduling a bugfix for the next point release. I don't think we can truthfully claim that there are no known corruption bugs in any of the releases in the last few years. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Pavan Deolasee writes: > On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund wrote: >> +1. I don't think we serve our users by putting out a nontrivial bugfix >> hastily. Nor do I think we serve them in this instance by delaying the >> release to cover this fix; there are enough other fixes in the release >> imo. > I'm a bit surprised by this position. The point is that there's a nontrivial chance of a hasty fix introducing worse problems than we fix. Given the lack of consensus about exactly how to fix this, I'm feeling like it's a good idea if whatever we come up with gets some time to age awhile in git before we ship it. Obviously, 2ndQ or EDB or any other distributor can choose to ship a patch in their own builds if they're sufficiently comfortable with the particular patch. That doesn't translate to there having to be a fix in the community's wraps this week. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2017-02-05 21:49:57 -0500, Tom Lane wrote: > Andres Freund writes: > > I've not yet read the full thread, but I'm a bit confused so far. We > > obviously can get changing information about indexes here, but isn't > > that something we have to deal with anyway? If we guarantee that we > > don't lose knowledge that there's a pending invalidation, why do we > > have to retry? > > We don't *have to* retry; the core of the fix is to not enter stale data > into the cache after we've already received an invalidation signal. Right, the bug is that we do so without remembering that fact. > The current caller can survive with stale data; or if that's not true, > C.I.C. has got more fundamental problems than posited here. Indeed. > But it seems like a generally bad idea for relcache to return data > that it knows (or at least has reason to believe) is stale. I'm not convinced by this - this kind of staleness can only occur if changes happen during the execution of the cache building. And the information returned can be outdated by pretty much the moment you return anyway. It's also how a number of the current caches essentially already work. > Also, even if you're okay with return-stale-data-but-don't-cache-it, > I have little faith that invalidate_indexattr_v3.patch successfully > implements that. I sent a prototype clarifying what I mean. It's not really like invalidate_indexattr_v3.patch. Btw, it seems RelationGetIndexList() avoids a similar issue only due to either luck or a lot of undocumented assumptions. The only reason that setting relation->rd_indexlist / rd_indexvalid at the end doesn't cause a similar issue seems to be that no invalidations appear to be processed after the systable_beginscan(). And that in turn seems to be because no relcache invals are processed after RegisterSnapshot(GetCatalogSnapshot(relid)). Phew, that seems a bit fragile.
Greetings, Andres Freund
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund wrote: > On 2017-02-05 12:51:09 -0500, Tom Lane wrote: > > Michael Paquier writes: > > > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule > wrote: > > >> I agree with Pavan - a release with known important bug is not good > idea. > > > > > This bug has been around for some time, so I would recommend taking > > > the time necessary to make the best fix possible, even if it means > > > waiting for the next round of minor releases. > > > > I think the way to think about this sort of thing is, if we had found > > this bug when a release wasn't imminent, would we consider it bad enough > > to justify an unscheduled release cycle? I have to think the answer for > > this one is "probably not". > > +1. I don't think we serve our users by putting out a nontrivial bugfix > hastily. Nor do I think we serve them in this instance by delaying the > release to cover this fix; there's enough other fixes in the release > imo. > > I'm a bit surprised with this position. What do we tell our users now that we know this bug exists? They can't fully trust CIC and they don't have any mechanism to confirm whether the newly built index is corruption-free or not. I'm not suggesting that we should hastily refactor the code, but at the very least some sort of band-aid fix which helps the situation, yet has minimal side-effects, is warranted. I believe the proposed retry patch does exactly that. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila wrote: > On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee > wrote: > > > > > > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote: > >> > >> > >> > >> > 2. In the second patch, we tried to recompute attribute lists if a > >> > relcache > >> > flush happens in between and index list is invalidated. We've seen > >> > problems > >> > with that, especially it getting into an infinite loop with > >> > CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently > relcache > >> > flushes keep happening. > >> > >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache > flush > >> happen whenever it possibly could. > > > > > > Ah, ok. That explains why the retry logic as originally proposed was > causing > > infinite loop. > > > >> > >> The way to deal with that without > >> looping is to test whether the index set *actually* changed, not whether > >> it just might have changed. > > > > > > I like this approach. I'll run the patch on a build with > > CACHE_CLOBBER_ALWAYS, but I'm pretty sure it will be ok. I also like the > > fact that retry logic is now limited to only when index set changes, > which > > fits in the situation we're dealing with. > > > > Don't you think it also has the same problem as mentioned by me in > Alvaro's patch and you also agreed on it? No, I don't think so. There can't be any more relcache flushes occurring between the first RelationGetIndexAttrBitmap() and the subsequent RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed approach was that we were not caching, yet returning stale information. That implied the next call will again recompute a possibly different value. The retry logic is trying to close a small yet important race condition where index set invalidation happens, but we cache stale information. > Do you see any need to fix > it locally in > RelationGetIndexAttrBitmap()? Why can't we clear it at transaction end?
> > We're trying to fix it in RelationGetIndexAttrBitmap() because that's where the race exists. I'm not saying there aren't other and better ways of handling it, but we don't know yet (I haven't studied Andres's proposal). Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Andres Freund writes: > I've not yet read the full thread, but I'm a bit confused so far. We > obviously can get changing information about indexes here, but isn't > that something we have to deal with anyway? If we guarantee that we > don't lose knowledge that there's a pending invalidation, why do we > have to retry? We don't *have to* retry; the core of the fix is to not enter stale data into the cache after we've already received an invalidation signal. The current caller can survive with stale data; or if that's not true, C.I.C. has got more fundamental problems than posited here. But it seems like a generally bad idea for relcache to return data that it knows (or at least has reason to believe) is stale. Also, even if you're okay with return-stale-data-but-don't-cache-it, I have little faith that invalidate_indexattr_v3.patch successfully implements that. Consider the sequence: inval received during RelationGetIndexAttrBitmap, we clear rd_indexvalid, something asks for the relation OID list, we recalculate that and set rd_indexvalid, then we reach the end of RelationGetIndexAttrBitmap and see that rd_indexvalid is set. If that happened, we'd cache the bitmaps whether or not they were really up to date. Now admittedly, it's a bit hard to think of a reason why anything would ask for the index list of anything but a system catalog in that sequence, so as long as you assume that we don't allow C.I.C. (or more realistically, REINDEX CONCURRENTLY) on system catalogs, this failure mode is unreachable. But I much prefer having a positive verification that the index list is still what it was when we started. regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2017-02-06 08:08:01 +0530, Amit Kapila wrote: > I don't see in your patch that you are setting rd_bitmapsvalid to 0. IIRC a plain relcache rebuild should do that (note there's also no place that directly resets rd_indexattr). > Also, I think this suffers from the same problem as the patch proposed > by Alvaro which is that different attrs (hot_attrs, key_attrs and > id_attrs) will get computed based on different index lists which can > cause a problem as mentioned in my mail up thread. I'm not convinced that that's a legitimate concern. If that's an issue with CIC, then we have a lot bigger problems, because relcache entries and such will be inconsistent anyway if you have non-exclusive locks while changing catalog data. In the situations where that happens it better be harmless (otherwise CIC is broken), and the next access will recompute them. > I am not sure but I think your solution can be > extended to make it somewhat similar to what we do with rd_indexvalid > (basically, we should set rd_bitmapsvalid to 2 (bitmap is temporarily > forced) at rel cache invalidation and then clean it up transaction > end). I can try something on those lines. Not sure I understand what you mean with "clean it up at transaction end"? Greetings, Andres Freund
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 6:42 PM, Pavan Deolasee wrote: > I'm not sure that just because the bug wasn't reported by a user, makes it > any less critical. As Tomas pointed down thread, the nature of the bug is > such that the users may not discover it very easily, but that doesn't mean > it couldn't be affecting them all the time. We can now correlate many past > reports of index corruption to this bug, but we don't have evidence to prove > that. Lack of any good tool or built-in checks probably makes it even > harder. I think that we need to make an automated checker tool a requirement for very complicated development projects in the future. We're behind here. -- Peter Geoghegan
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee wrote: > > > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote: >> >> >> >> > 2. In the second patch, we tried to recompute attribute lists if a >> > relcache >> > flush happens in between and index list is invalidated. We've seen >> > problems >> > with that, especially it getting into an infinite loop with >> > CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently relcache >> > flushes keep happening. >> >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush >> happen whenever it possibly could. > > > Ah, ok. That explains why the retry logic as originally proposed was causing > infinite loop. > >> >> The way to deal with that without >> looping is to test whether the index set *actually* changed, not whether >> it just might have changed. > > > I like this approach. I'll run the patch on a build with > CACHE_CLOBBER_ALWAYS, but I'm pretty sure it will be ok. I also like the > fact that retry logic is now limited to only when index set changes, which > fits in the situation we're dealing with. > Don't you think it also has the same problem as mentioned by me in Alvaro's patch and you also agreed on it? Do you see any need to fix it locally in RelationGetIndexAttrBitmap()? Why can't we clear it at transaction end? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan wrote: > On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote: > > I don't think this kind of black-and-white thinking is very helpful. > > Obviously, data corruption is bad. However, this bug has (from what > > one can tell from this thread) been with us for over a decade; it must > > necessarily be either low-probability or low-severity, or somebody > > would've found it and fixed it before now. Indeed, the discovery of > > this bug was driven by new feature development, not a user report. It > > seems pretty clear that if we try to patch this and get it wrong, the > > effects of our mistake could easily be a lot more serious than the > > original bug. > > +1. The fact that it wasn't driven by a user report convinces me that > this is the way to go. > > I'm not sure that just because the bug wasn't reported by a user, makes it any less critical. As Tomas pointed down thread, the nature of the bug is such that the users may not discover it very easily, but that doesn't mean it couldn't be affecting them all the time. We can now correlate many past reports of index corruption to this bug, but we don't have evidence to prove that. Lack of any good tool or built-in checks probably makes it even harder. The reason why I discovered this with WARM is because it now has a built-in recheck logic, which was discarding some tuples returned by the index scan. It took me lots of code review and inspection to finally conclude that this must be an existing bug. Even then for lack of any utility, I could not detect this easily with master. That doesn't mean I wasn't able to reproduce, but detection was a problem. If you look at the reproduction setup, one in every 10, if not 5, CREATE INDEX CONCURRENTLY ends up creating a corrupt index. That's a good 10% probability. And this is on a low end laptop, with just 5-10 concurrent clients running. 
Probability of hitting the problem will be much higher on a bigger machine, with many users (on a decent AWS machine, I would find every third CIC to be corrupt). Moreover the setup is not doing anything extraordinary. Just concurrent updates which change between HOT to non-HOT and a CIC. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 6:27 AM, Andres Freund wrote: > Hi, > > On 2017-02-05 16:37:33 -0800, Andres Freund wrote: >> > RelationGetIndexList(Relation relation) >> > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat >> > * we can include system attributes (e.g., OID) in the bitmap >> > representation. >> > * >> > * Caller had better hold at least RowExclusiveLock on the target relation >> > - * to ensure that it has a stable set of indexes. This also makes it safe >> > - * (deadlock-free) for us to take locks on the relation's indexes. >> > + * to ensure it is safe (deadlock-free) for us to take locks on the >> > relation's >> > + * indexes. Note that since the introduction of CREATE INDEX >> > CONCURRENTLY, >> > + * that lock level doesn't guarantee a stable set of indexes, so we have >> > to >> > + * be prepared to retry here in case of a change in the set of indexes. >> >> I've not yet read the full thread, but I'm a bit confused so far. We >> obviously can get changing information about indexes here, but isn't >> that something we have to deal with anyway? If we guarantee that we >> don't lose knowledge that there's a pending invalidation, why do we >> have to retry? Pretty much by definition the knowledge in a relcache >> entry can be outdated as soon as returned unless locking prevents that >> from being possible - which is not the case here. >> >> ISTM it'd be better not to have retry logic here, but to follow the more >> general pattern of making sure that we know whether the information >> needs to be recomputed at the next access. We could either do that by >> having an additional bit of information about the validity of the >> bitmaps (so we have invalid, building, valid - and only set valid at the >> end of computing the bitmaps when still building and not invalid again), >> or we simply move the bitmap computation into the normal relcache build.
> > To show what I mean here's an *unpolished* and *barely tested* patch > implementing the first of my suggestions. + * Signal that the attribute bitmaps are being built. If there's any + * relcache invalidations while building them, rd_bitmapsvalid will be + * reset to 0. In that case return the bitmaps, but don't mark them as + * valid. + */ I don't see in your patch that you are setting rd_bitmapsvalid to 0. Also, I think this suffers from the same problem as the patch proposed by Alvaro which is that different attrs (hot_attrs, key_attrs and id_attrs) will get computed based on different index lists which can cause a problem as mentioned in my mail up thread. However, I think your general approach and idea that we have to set the things up for next time is spot on. I am not sure but I think your solution can be extended to make it somewhat similar to what we do with rd_indexvalid (basically, we should set rd_bitmapsvalid to 2 (bitmap is temporarily forced) at rel cache invalidation and then clean it up at transaction end). I can try something on those lines. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane wrote: > > > > 2. In the second patch, we tried to recompute attribute lists if a > relcache > flush happens in between and index list is invalidated. We've seen > problems > with that, especially it getting into an infinite loop with > CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently relcache > flushes keep happening. > > Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush > happen whenever it possibly could. Ah, ok. That explains why the retry logic as originally proposed was causing an infinite loop. > The way to deal with that without > looping is to test whether the index set *actually* changed, not whether > it just might have changed. > I like this approach. I'll run the patch on a build with CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the fact that the retry logic is now limited to only when the index set changes, which fits the situation we're dealing with. Thanks, Pavan -- Pavan Deolasee http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 4:57 PM, Tomas Vondra wrote: > OTOH I disagree with the notion that bugs that are not driven by user > reports are somehow less severe. Some data corruption bugs cause quite > visible breakage - segfaults, immediate crashes, etc. Those are pretty clear > bugs, and are reported by users. I meant that I find the fact that there were no user reports in all these years to be a good reason to not proceed for now in this instance. I wrote amcheck to detect the latter variety of bug, so clearly I think that they are very serious. -- Peter Geoghegan
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Hi, On 2017-02-05 16:37:33 -0800, Andres Freund wrote: > > RelationGetIndexList(Relation relation) > > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat > > * we can include system attributes (e.g., OID) in the bitmap > > representation. > > * > > * Caller had better hold at least RowExclusiveLock on the target relation > > - * to ensure that it has a stable set of indexes. This also makes it safe > > - * (deadlock-free) for us to take locks on the relation's indexes. > > + * to ensure it is safe (deadlock-free) for us to take locks on the > > relation's > > + * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY, > > + * that lock level doesn't guarantee a stable set of indexes, so we have to > > + * be prepared to retry here in case of a change in the set of indexes. > > I've not yet read the full thread, but I'm a bit confused so far. We > obviously can get changing information about indexes here, but isn't > that something we have to deal with anyway? If we guarantee that we > don't lose knowledge that there's a pending invalidation, why do we > have to retry? Pretty much by definition the knowledge in a relcache > entry can be outdated as soon as returned unless locking prevents that > from being possible - which is not the case here. > > ISTM it'd be better not to have retry logic here, but to follow the more > general pattern of making sure that we know whether the information > needs to be recomputed at the next access. We could either do that by > having an additional bit of information about the validity of the > bitmaps (so we have invalid, building, valid - and only set valid at the > end of computing the bitmaps when still building and not invalid again), > or we simply move the bitmap computation into the normal relcache build. To show what I mean here's an *unpolished* and *barely tested* patch implementing the first of my suggestions. Alvaro, Pavan, I think this should address the issue as well?
- Andres

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560e46..9e94495e75 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4745,9 +4745,12 @@ RelationGetIndexPredicate(Relation relation)
  * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
  * we can include system attributes (e.g., OID) in the bitmap representation.
  *
- * Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes. This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * Caller had better hold at least RowExclusiveLock on the target relation to
+ * ensure that it has a stable set of indexes. This also makes it safe
+ * (deadlock-free) for us to take locks on the relation's indexes. Note that
+ * a concurrent CREATE/DROP INDEX CONCURRENTLY can lead to an outdated list
+ * being returned (will be recomputed at the next access), the CONCURRENTLY
+ * code deals with that.
  *
  * The returned result is palloc'd in the caller's memory context and should
  * be bms_free'd when not needed anymore.
@@ -4766,7 +4769,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	MemoryContext oldcxt;

 	/* Quick exit if we already computed the result. */
-	if (relation->rd_indexattr != NULL)
+	if (relation->rd_bitmapsvalid == 2)
 	{
 		switch (attrKind)
 		{
@@ -4788,6 +4791,14 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 		return NULL;

 	/*
+	 * Signal that the attribute bitmaps are being built. If there's any
+	 * relcache invalidations while building them, rd_bitmapsvalid will be
+	 * reset to 0. In that case return the bitmaps, but don't mark them as
+	 * valid.
+	 */
+	relation->rd_bitmapsvalid = 1;
+
+	/*
 	 * Get cached list of index OIDs
 	 */
 	indexoidlist = RelationGetIndexList(relation);
@@ -4892,13 +4903,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	bms_free(relation->rd_idattr);
 	relation->rd_idattr = NULL;

-	/*
-	 * Now save copies of the bitmaps in the relcache entry. We intentionally
-	 * set rd_indexattr last, because that's the one that signals validity of
-	 * the values; if we run out of memory before making that copy, we won't
-	 * leave the relcache entry looking like the other ones are valid but
-	 * empty.
-	 */
+	/* now save copies of the bitmaps in the relcache entry */
 	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 	relation->rd_keyattr = bms_copy(uindexattrs);
 	relation->rd_pkattr = bms_copy(pkindexattrs);
@@ -4906,6 +4911,18 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	relation->rd_idattr = bms_copy(idindexattrs);
 	relation->rd_indexattr = bms_copy(indexattrs);
 	MemoryContextSwitchTo(oldcxt);
+	/*
+	 * If there've been no invalidations while building the entry, mark the
+	 * stored bitmaps as being valid. Need to do so after the copies above,
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 02/06/2017 01:11 AM, Peter Geoghegan wrote: On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote: I don't think this kind of black-and-white thinking is very helpful. Obviously, data corruption is bad. However, this bug has (from what one can tell from this thread) been with us for over a decade; it must necessarily be either low-probability or low-severity, or somebody would've found it and fixed it before now. Indeed, the discovery of this bug was driven by new feature development, not a user report. It seems pretty clear that if we try to patch this and get it wrong, the effects of our mistake could easily be a lot more serious than the original bug. +1. The fact that it wasn't driven by a user report convinces me that this is the way to go. +1 to not rushing fixes into releases. While I think we now finally understand the mechanics of this bug, the fact that we came up with three different fixes in this thread, only to discover issues with each of them, warrants some caution. OTOH I disagree with the notion that bugs that are not driven by user reports are somehow less severe. Some data corruption bugs cause quite visible breakage - segfaults, immediate crashes, etc. Those are pretty clear bugs, and are reported by users. Other data corruption bugs are much more subtle - for example this bug may lead to incorrect results to some queries, failing to detect values violating UNIQUE constraints, issues with foreign keys, etc. It's damn impossible to notice incorrect query results that only affect tiny subset of the rows (e.g. rows updated when the CIC was running), especially when the issue may go away after a while due to additional non-HOT updates. Regarding the other symptoms - I wonder how many strange 'duplicate value' errors were misdiagnosed, wrongly attributed to a recent power outage, etc. 
regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2017-02-05 15:14:59 -0500, Tom Lane wrote: > I do not like any of the other patches proposed in this thread, because > they fail to guarantee delivering an up-to-date attribute bitmap to the > caller. I think we need a retry loop, and I think that it needs to look > like the attached. That looks like a reasonable approach, although I'm not sure it's the best one. > (Note that the tests whether rd_pkindex and rd_replidindex changed might > be redundant; but I'm not totally sure of that, and they're cheap.) I don't think they can, but I'm all for re-checking. > RelationGetIndexList(Relation relation) > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat > * we can include system attributes (e.g., OID) in the bitmap representation. > * > * Caller had better hold at least RowExclusiveLock on the target relation > - * to ensure that it has a stable set of indexes. This also makes it safe > - * (deadlock-free) for us to take locks on the relation's indexes. > + * to ensure it is safe (deadlock-free) for us to take locks on the > relation's > + * indexes. Note that since the introduction of CREATE INDEX CONCURRENTLY, > + * that lock level doesn't guarantee a stable set of indexes, so we have to > + * be prepared to retry here in case of a change in the set of indexes. I've not yet read the full thread, but I'm a bit confused so far. We obviously can get changing information about indexes here, but isn't that something we have to deal with anyway? If we guarantee that we don't lose knowledge that there's a pending invalidation, why do we have to retry? Pretty much by definition the knowledge in a relcache entry can be outdated as soon as returned unless locking prevents that from being possible - which is not the case here. ISTM it'd be better not to have retry logic here, but to follow the more general pattern of making sure that we know whether the information needs to be recomputed at the next access.
We could either do that by having an additional bit of information about the validity of the bitmaps (so we have invalid, building, valid - and only set valid at the end of computing the bitmaps when still building and not invalid again), or we simply move the bitmap computation into the normal relcache build. BTW, the interactions of RelationSetIndexList with RelationGetIndexAttrBitmap() look more than a bit fragile to me, even if documented: * * We deliberately do not change rd_indexattr here: even when operating * with a temporary partial index list, HOT-update decisions must be made * correctly with respect to the full index set. It is up to the caller * to ensure that a correct rd_indexattr set has been cached before first * calling RelationSetIndexList; else a subsequent inquiry might cause a * wrong rd_indexattr set to get computed and cached. Likewise, we do not * touch rd_keyattr, rd_pkattr or rd_idattr. Greetings, Andres Freund
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 2017-02-05 12:51:09 -0500, Tom Lane wrote: > Michael Paquier writes: > > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule > > wrote: > >> I agree with Pavan - a release with known important bug is not good idea. > > > This bug has been around for some time, so I would recommend taking > > the time necessary to make the best fix possible, even if it means > > waiting for the next round of minor releases. > > I think the way to think about this sort of thing is, if we had found > this bug when a release wasn't imminent, would we consider it bad enough > to justify an unscheduled release cycle? I have to think the answer for > this one is "probably not". +1. I don't think we serve our users by putting out a nontrivial bugfix hastily. Nor do I think we serve them in this instance by delaying the release to cover this fix; there's enough other fixes in the release imo. - Andres
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas wrote: > I don't think this kind of black-and-white thinking is very helpful. > Obviously, data corruption is bad. However, this bug has (from what > one can tell from this thread) been with us for over a decade; it must > necessarily be either low-probability or low-severity, or somebody > would've found it and fixed it before now. Indeed, the discovery of > this bug was driven by new feature development, not a user report. It > seems pretty clear that if we try to patch this and get it wrong, the > effects of our mistake could easily be a lot more serious than the > original bug. +1. The fact that it wasn't driven by a user report convinces me that this is the way to go. -- Peter Geoghegan
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 1:34 PM, Martín Marqués wrote: > El 05/02/17 a las 10:03, Michael Paquier escribió: >> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule >> wrote: >>> I agree with Pavan - a release with known important bug is not good idea. >> >> This bug has been around for some time, so I would recommend taking >> the time necessary to make the best fix possible, even if it means >> waiting for the next round of minor releases. > > The fact that the bug has been around for a long time doesn't mean we > shouldn't take it seriously. > > IMO any kind of corruption (heap or index) should be prioritized. > There's also been comments about maybe this being the cause of old > reports about index corruption. > > I ask myself if it's a good idea to make a point release with a know > corruption bug in it. I don't think this kind of black-and-white thinking is very helpful. Obviously, data corruption is bad. However, this bug has (from what one can tell from this thread) been with us for over a decade; it must necessarily be either low-probability or low-severity, or somebody would've found it and fixed it before now. Indeed, the discovery of this bug was driven by new feature development, not a user report. It seems pretty clear that if we try to patch this and get it wrong, the effects of our mistake could easily be a lot more serious than the original bug. Now, I do not object to patching this tomorrow before the wrap if the various senior hackers who have studied this (Tom, Alvaro, Pavan, Amit) are all in agreement on the way forward. But if there is any significant chance that we don't fully understand this issue and that our fix might cause bigger problems, it would be more prudent to wait. I'm all in favor of releasing important bug fixes quickly, but releasing a bug fix before we're sure we've fixed the bug correctly is taking a good principle to an irrational extreme. 
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
[ Having now read the whole thread, I'm prepared to weigh in ... ]

Pavan Deolasee writes:
> This seems like a real problem to me. I don't know what the consequences
> are, but definitely having various attribute lists to have different view
> of the set of indexes doesn't seem right.

For sure.

> 2. In the second patch, we tried to recompute attribute lists if a relcache
> flush happens in between and index list is invalidated. We've seen problems
> with that, especially it getting into an infinite loop with
> CACHE_CLOBBER_ALWAYS. Not clear why it happens, but apparently relcache
> flushes keep happening.

Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush happen whenever it possibly could. The way to deal with that without looping is to test whether the index set *actually* changed, not whether it just might have changed.

I do not like any of the other patches proposed in this thread, because they fail to guarantee delivering an up-to-date attribute bitmap to the caller. I think we need a retry loop, and I think that it needs to look like the attached. (Note that the tests whether rd_pkindex and rd_replidindex changed might be redundant; but I'm not totally sure of that, and they're cheap.)

regards, tom lane

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560..b659994 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*** RelationGetFKeyList(Relation relation)
*** 4327,4335 ****
   * it is the pg_class OID of a unique index on OID when the relation has one,
   * and InvalidOid if there is no such index.
   *
!  * In exactly the same way, we update rd_replidindex, which is the pg_class
!  * OID of an index to be used as the relation's replication identity index,
!  * or InvalidOid if there is no such index.
   */
  List *
  RelationGetIndexList(Relation relation)
--- 4327,4336 ----
   * it is the pg_class OID of a unique index on OID when the relation has one,
   * and InvalidOid if there is no such index.
   *
!  * In exactly the same way, we update rd_pkindex, which is the OID of the
!  * relation's primary key index if any, else InvalidOid; and rd_replidindex,
!  * which is the pg_class OID of an index to be used as the relation's
!  * replication identity index, or InvalidOid if there is no such index.
   */
  List *
  RelationGetIndexList(Relation relation)
*** RelationGetIndexPredicate(Relation relat
*** 4746,4753 ****
   * we can include system attributes (e.g., OID) in the bitmap representation.
   *
   * Caller had better hold at least RowExclusiveLock on the target relation
!  * to ensure that it has a stable set of indexes.  This also makes it safe
!  * (deadlock-free) for us to take locks on the relation's indexes.
   *
   * The returned result is palloc'd in the caller's memory context and should
   * be bms_free'd when not needed anymore.
--- 4747,4756 ----
   * we can include system attributes (e.g., OID) in the bitmap representation.
   *
   * Caller had better hold at least RowExclusiveLock on the target relation
!  * to ensure it is safe (deadlock-free) for us to take locks on the relation's
!  * indexes.  Note that since the introduction of CREATE INDEX CONCURRENTLY,
!  * that lock level doesn't guarantee a stable set of indexes, so we have to
!  * be prepared to retry here in case of a change in the set of indexes.
   *
   * The returned result is palloc'd in the caller's memory context and should
   * be bms_free'd when not needed anymore.
*** RelationGetIndexAttrBitmap(Relation rela
*** 4760,4765 ****
--- 4763,4769 ----
  	Bitmapset  *pkindexattrs;	/* columns in the primary index */
  	Bitmapset  *idindexattrs;	/* columns in the replica identity */
  	List	   *indexoidlist;
+ 	List	   *newindexoidlist;
  	Oid			relpkindex;
  	Oid			relreplindex;
  	ListCell   *l;
*** RelationGetIndexAttrBitmap(Relation rela
*** 4788,4795 ****
  		return NULL;

  	/*
! 	 * Get cached list of index OIDs
  	 */
  	indexoidlist = RelationGetIndexList(relation);

  	/* Fall out if no indexes (but relhasindex was set) */
--- 4792,4800 ----
  		return NULL;

  	/*
! 	 * Get cached list of index OIDs.  If we have to start over, we do so here.
  	 */
+ restart:
  	indexoidlist = RelationGetIndexList(relation);

  	/* Fall out if no indexes (but relhasindex was set) */
*** RelationGetIndexAttrBitmap(Relation rela
*** 4800,4808 ****
  	 * Copy the rd_pkindex and rd_replidindex value computed by
  	 * RelationGetIndexList before proceeding.  This is needed because a
  	 * relcache flush could occur inside index_open below, resetting the
! 	 * fields managed by RelationGetIndexList.  (The values we're computing
! 	 * will still be valid, assuming that caller has a sufficient lock on
! 	 * the relation.)
  	 */
  	relpkindex = relation->rd_pkindex;
  	relreplindex = relation->rd_replidindex;
--- 4805,4812
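Stripped of the relcache machinery, the core idea of the patch (compute the bitmaps, then re-fetch the index list and start over only if the set of indexes actually changed) can be sketched as a self-contained simulation. All names below (FakeRelation, get_index_list, compute_attr_bitmap) are hypothetical stand-ins for illustration, not PostgreSQL APIs:

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_INDEXES 8

/* Hypothetical stand-in for the cached index list kept in the relcache
 * (rd_indexlist plus rd_pkindex/rd_replidindex in the real code). */
typedef struct
{
    int index_oids[MAX_INDEXES];
    int nindexes;
} FakeRelation;

/* Simulated concurrent CREATE INDEX CONCURRENTLY: the first fetch sees a
 * stale one-index list; every later fetch (after a "relcache flush") sees
 * the newly added index as well. */
static int fetch_calls = 0;

static void
get_index_list(FakeRelation *rel)
{
    fetch_calls++;
    rel->nindexes = (fetch_calls >= 2) ? 2 : 1;
    rel->index_oids[0] = 16384;
    if (rel->nindexes == 2)
        rel->index_oids[1] = 16385;
}

static bool
index_lists_equal(const FakeRelation *a, const FakeRelation *b)
{
    if (a->nindexes != b->nindexes)
        return false;
    for (int i = 0; i < a->nindexes; i++)
        if (a->index_oids[i] != b->index_oids[i])
            return false;
    return true;
}

/* Returns the number of indexes the bitmaps were computed over, retrying
 * until the index set is stable across the computation. */
static int
compute_attr_bitmap(void)
{
    FakeRelation before, after;
    int retries = 0;

    for (;;)
    {
        get_index_list(&before);

        /* ... open each index and accumulate attribute bitmaps here; a
         * relcache flush may invalidate the cached list meanwhile ... */

        get_index_list(&after);
        if (index_lists_equal(&before, &after))
            return after.nindexes;  /* set didn't *actually* change: done */

        retries++;
        assert(retries < 100);      /* terminates even under CLOBBER_CACHE_ALWAYS,
                                     * because we compare actual contents */
    }
}
```

The point of comparing list contents rather than merely noticing an invalidation is what keeps this loop from spinning forever under CLOBBER_CACHE_ALWAYS: a flush that reinstalls the same index set counts as "no change".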
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On 05/02/17 at 10:03, Michael Paquier wrote:
> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote:
>> I agree with Pavan - a release with a known important bug is not a good idea.
>
> This bug has been around for some time, so I would recommend taking
> the time necessary to make the best fix possible, even if it means
> waiting for the next round of minor releases.

The fact that the bug has been around for a long time doesn't mean we shouldn't take it seriously.

IMO any kind of corruption (heap or index) should be prioritized. There have also been comments about maybe this being the cause of old reports about index corruption.

I ask myself if it's a good idea to make a point release with a known corruption bug in it.

Regards,

--
Martín Marqués http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
2017-02-05 18:51 GMT+01:00 Tom Lane:
> Michael Paquier writes:
> > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote:
> >> I agree with Pavan - a release with a known important bug is not a good idea.
>
> > This bug has been around for some time, so I would recommend taking
> > the time necessary to make the best fix possible, even if it means
> > waiting for the next round of minor releases.
>
> I think the way to think about this sort of thing is, if we had found
> this bug when a release wasn't imminent, would we consider it bad enough
> to justify an unscheduled release cycle?  I have to think the answer for
> this one is "probably not".

There are a few questions:

1. Is this a critical bug?
2. If it is, is it reason enough for a special minor release?
3. How much time is necessary to fix this bug?

These questions may be unimportant if there are security fixes in the next release, which I cannot know.

Regards

Pavel

> regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Michael Paquier writes:
> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote:
>> I agree with Pavan - a release with a known important bug is not a good idea.
> This bug has been around for some time, so I would recommend taking
> the time necessary to make the best fix possible, even if it means
> waiting for the next round of minor releases.

I think the way to think about this sort of thing is, if we had found this bug when a release wasn't imminent, would we consider it bad enough to justify an unscheduled release cycle? I have to think the answer for this one is "probably not".

regards, tom lane
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule wrote:
> I agree with Pavan - a release with a known important bug is not a good idea.

This bug has been around for some time, so I would recommend taking the time necessary to make the best fix possible, even if it means waiting for the next round of minor releases.

--
Michael
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
2017-02-05 7:54 GMT+01:00 Pavan Deolasee:
>
> On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane wrote:
>
>> Based on Pavan's comments, I think trying to force this into next week's
>> releases would be extremely unwise.  If the bug went undetected this long,
>> it can wait for a fix for another three months.
>
> Yes, I think the bug has existed ever since and went unnoticed. One reason
> could be that the race happens only when the new index turns HOT updates
> into non-HOT updates. Another reason could be that we don't have checks in
> place to catch these kinds of corruption. Having said that, since we have
> discovered the bug, many 2ndQuadrant customers have expressed worry and
> want to know if the fix will be available in 9.6.2 and other minor
> releases. Since the bug can lead to data corruption, the worry is
> justified. Until we fix the bug, there will be a constant worry about
> using CIC.
>
> If we can have some kind of band-aid fix to plug the hole, that might be
> enough as well. I tested my first patch (which will need some polishing)
> and it works well AFAIK. I was worried about prepared queries and all,
> but that seems OK too. RelationGetIndexList() always gets called within
> ExecInitModifyTable. The fix seems quite unlikely to cause any other side
> effects.
>
> Another possible band-aid is to throw another relcache invalidation in
> CIC, like adding a dummy index_set_state_flags() within yet another
> transaction. Seems ugly, but it should fix the problem for now and not
> cause any impact on the relcache mechanism whatsoever.
>
>> That seems better than
>> risking new breakage when it's barely 48 hours to the release wrap
>> deadline.  We do not have time to recover from any mistakes.
>
> I'm not sure what the project policies are, but can we consider delaying
> the release by a week for issues like these? Or do you think it will be
> hard to come up with a proper fix for the issue and it will need some
> serious refactoring?
I agree with Pavan - a release with a known important bug is not a good idea.

Pavel

> Thanks,
> Pavan
>
> --
> Pavan Deolasee
> http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane wrote:
>
> Based on Pavan's comments, I think trying to force this into next week's
> releases would be extremely unwise.  If the bug went undetected this long,
> it can wait for a fix for another three months.

Yes, I think the bug has existed ever since and went unnoticed. One reason could be that the race happens only when the new index turns HOT updates into non-HOT updates. Another reason could be that we don't have checks in place to catch these kinds of corruption. Having said that, since we discovered the bug, many 2ndQuadrant customers have expressed worry and want to know if the fix will be available in 9.6.2 and other minor releases. Since the bug can lead to data corruption, the worry is justified. Until we fix the bug, there will be a constant worry about using CIC.

If we can have some kind of band-aid fix to plug the hole, that might be enough as well. I tested my first patch (which will need some polishing) and it works well AFAIK. I was worried about prepared queries and all, but that seems OK too. RelationGetIndexList() always gets called within ExecInitModifyTable. The fix seems quite unlikely to cause any other side effects.

Another possible band-aid is to throw another relcache invalidation in CIC, like adding a dummy index_set_state_flags() within yet another transaction. Seems ugly, but it should fix the problem for now and not cause any impact on the relcache mechanism whatsoever.

> That seems better than
> risking new breakage when it's barely 48 hours to the release wrap
> deadline.  We do not have time to recover from any mistakes.

I'm not sure what the project policies are, but can we consider delaying the release by a week for issues like these? Or do you think it will be hard to come up with a proper fix for the issue and it will need some serious refactoring?

Thanks,
Pavan

--
Pavan Deolasee
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY
Alvaro Herrera writes:
> I intend to commit this soon to all branches, to ensure it gets into the
> next set of minors.

Based on Pavan's comments, I think trying to force this into next week's releases would be extremely unwise. If the bug went undetected this long, it can wait for a fix for another three months. That seems better than risking new breakage when it's barely 48 hours to the release wrap deadline. We do not have time to recover from any mistakes.

regards, tom lane