Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-11-13 Thread Alexander Kuzmenkov

I've pushed the executor part of this, but mostly not the planner part,
because I didn't think the latter was anywhere near ready for prime
time: the regression test changes it was causing were entirely bogus.


Hi Tom,

Thanks for the commit and the explanation. I'll try to address your 
comments if I continue working on the planner part.


--

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-11-01 Thread Tom Lane
Alexander Kuzmenkov  writes:
> On 04.09.2017 15:17, Alexey Chernyshov wrote:
>> make check-world fails on contrib/postgres_fdw because of Subquery Scan on
>> ... Probably, the query plan has changed.

> Thanks for testing! This is the same problem as the one in 
> 'select_distinct' I mentioned before. I changed the test, the updated 
> patch is attached.

I've pushed the executor part of this, but mostly not the planner part,
because I didn't think the latter was anywhere near ready for prime
time: the regression test changes it was causing were entirely bogus.

You had basically two categories of plan changes showing up in the
regression tests.  One was insertion of Subquery Scan nodes where
they hadn't been before; that was because the use_physical_tlist
change broke the optimization that removed no-op Subquery Scans.
I fixed that by narrowing the use_physical_tlist change to apply
only to BitmapHeapPath nodes, which is the only case where there
would be any benefit anyway.  The remaining plan diffs after making
that change all amounted to replacing regular index-only scan plans
with bitmap scans, which seems to me to be silly on its face: if we
can use an IOS then it is unlikely that a bitmap scan will be better.
Those changes indicate that the cost adjustment you'd inserted in
cost_bitmap_heap_scan was way too optimistic, which is hardly
surprising.  I think a realistic adjustment would have to account
for all or most of these factors:

* Whether the index AM is ever going to return recheck = false.
The planner has no way to know that at present, but since there are
lots of cases where it simply won't ever happen, I think assuming
that it always will is not acceptable.  Perhaps we could extend the
AM API so that we could find out whether recheck would happen always,
never, or sometimes.  (Doing better than "sometimes" might be too hard,
but I think most opclasses are going to be "always" or "never" anyway.)

* Whether the bitmap will become lossy, causing us to have to make
rechecks anyway.  We could probably estimate that pretty well based
on comparing the initial number-of-pages estimate to work_mem.

* Whether the plan will need to fetch heap tuples to make filter-qual
checks.  In principle the planner ought to know that.  In practice,
right now it doesn't bother to figure out whether the qual will be
empty until createplan.c time, because that's rather a significant
amount of work and it's undesirable to expend it for paths we might
not end up using.  We might be able to approximate it better than
we do now, though.  It's a live problem for regular indexscan costing
as well as bitmap scans, IIRC.

* What fraction of the table is actually all-visible.  You'd effectively
hard-wired that at 50%, but it's easy to do better --- the regular
IOS code does

if (indexonly)
    pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));

and it would be appropriate to do the same here if we conclude that
the other gating conditions apply.

Without a patch that deals more realistically with these concerns,
I think we're better off not changing the cost estimate at all.
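
For concreteness, a rough, hypothetical sketch (not committed code) of how
these four factors might gate the adjustment in cost_bitmap_heap_scan();
amrecheck_never, est_lossy_fraction, and qual_is_empty are assumed
placeholders, since only baserel->allvisfrac exists in the planner today:

    if (amrecheck_never &&           /* AM guarantees recheck = false */
        est_lossy_fraction == 0.0 && /* bitmap expected to stay exact */
        qual_is_empty)               /* no filter quals to evaluate */
    {
        /* same discount the regular index-only-scan code applies */
        pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
    }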

As far as the executor side goes, I made several cosmetic changes
and one not so cosmetic one: I changed the decision about whether
to prefetch so that it looks at whether the potential prefetch
page is all-visible.  This still relies on the same assumption you
had that the recheck flag will stay the same from one page to the
next, but at least it's not assuming that the all-visible state
will stay the same.
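
In rough terms, the prefetch decision now looks like this (a sketch using the
field names from the patch in this thread; tbmpre is assumed to be the
iterate result for the prefetch candidate page, and the current page's
recheck flag stands in for the prefetch page's, per the assumption above):

    bool        skip_fetch;

    skip_fetch = !tbmres->recheck &&
        node->ss.ps.qual == NULL &&
        node->ss.ps.plan->targetlist == NIL &&
        VM_ALL_VISIBLE(node->ss.ss_currentRelation,
                       tbmpre->blockno, &node->bhs_vmbuffer);

    if (!skip_fetch)
        PrefetchBuffer(scan->rs_rd, MAIN_FORKNUM, tbmpre->blockno);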

I'm going to mark the CF entry closed, but if you feel like working
on the cost estimate business, feel free to submit another patch
for that.

regards, tom lane




Re: [HACKERS] Index only scan for cube and seg

2017-10-29 Thread Andrey Borodin
Hi!
> On 29 Oct 2017, at 2:24, Alexander Korotkov wrote:
> 
> As I can see, cube GiST code always uses DatumGetNDBOX() macro to transform 
> Datum to (NDBOX *).
> 
> #define DatumGetNDBOX(x)  ((NDBOX *) PG_DETOAST_DATUM(x))
> 
> Thus, it should be safe to just remove both compress/decompress methods from 
> existing opclass.

Alexander, Tom, you are absolutely right. I was sure there was toasting code in
cube's compress function, but it was never there.

Here is a patch for cube that drops these functions.

Best regards, Andrey Borodin.


0001-Enable-Index-Only-Scan-in-cube.patch
Description: Binary data


0001-Enable-Index-Only-Scan-in-seg.patch
Description: Binary data





Re: [HACKERS] Index only scan for cube and seg

2017-10-28 Thread Alexander Korotkov
On Fri, Oct 27, 2017 at 9:54 PM, Tom Lane  wrote:

> Robert Haas  writes:
> > On Thu, Oct 26, 2017 at 12:22 PM, Andrey Borodin 
> wrote:
> >> For cube there is new default opclass.
>
> > I seem to recall that changing the default opclass causes unsolvable
> > problems with upgrades.  You might want to check the archives for
> > previous discussions of this issue; unfortunately, I don't recall the
> > details off-hand.
>
> Quite aside from that, replacing the opclass with a new one creates
> user-visible headaches that I don't think are justified, i.e. having to
> reconstruct indexes in order to get the benefit.
>
> Maybe I'm missing something, but ISTM you could just drop the compress
> function and call it good.  This would mean that an IOS scan would
> sometimes hand back a toast-compressed value, but what's the problem
> with that?
>

+1,
I think in this case replacing default opclass or even duplicating opclass
isn't justified.

> (The only reason for making a decompress function that just detoasts
> is that your other support functions then do not have to consider
> the possibility that they're handed a toast-compressed value.  I have
> not checked whether that really matters for cube.)
>

As I can see, cube GiST code always uses DatumGetNDBOX() macro to transform
Datum to (NDBOX *).

#define DatumGetNDBOX(x) ((NDBOX *) PG_DETOAST_DATUM(x))

Thus, it should be safe to just remove both compress/decompress methods
from existing opclass.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Index only scan for cube and seg

2017-10-27 Thread Tom Lane
Robert Haas  writes:
> On Thu, Oct 26, 2017 at 12:22 PM, Andrey Borodin  wrote:
>> For cube there is new default opclass.

> I seem to recall that changing the default opclass causes unsolvable
> problems with upgrades.  You might want to check the archives for
> previous discussions of this issue; unfortunately, I don't recall the
> details off-hand.

Quite aside from that, replacing the opclass with a new one creates
user-visible headaches that I don't think are justified, i.e. having to
reconstruct indexes in order to get the benefit.

Maybe I'm missing something, but ISTM you could just drop the compress
function and call it good.  This would mean that an IOS scan would
sometimes hand back a toast-compressed value, but what's the problem
with that?

(The only reason for making a decompress function that just detoasts
is that your other support functions then do not have to consider
the possibility that they're handed a toast-compressed value.  I have
not checked whether that really matters for cube.)

regards, tom lane




Re: [HACKERS] Index only scan for cube and seg

2017-10-27 Thread Robert Haas
On Thu, Oct 26, 2017 at 12:22 PM, Andrey Borodin  wrote:
> For cube there is new default opclass.

I seem to recall that changing the default opclass causes unsolvable
problems with upgrades.  You might want to check the archives for
previous discussions of this issue; unfortunately, I don't recall the
details off-hand.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Index only scan for cube and seg

2017-10-26 Thread Andrey Borodin
Hi hackers!

Here are patches enabling Index Only Scan for cube and seg extensions.

These patches follow this discussion [0].

For cube there is a new default opclass. We cannot drop the old opclass, because it
could TOAST some cube values on rare occasions. Index Only Scan is enabled only
for newly created indexes. By the way, I can add a fetch function to the old
opclass so that IOS would be enabled for it as well.
For seg, the compress and decompress functions are dropped from the opclass and
the extension. Index Only Scan is enabled.

There are two more functions which could be deleted:
ghstore_decompress
g_intbig_decompress
but deleting them would not have any noticeable consequences.


[0] 
https://www.postgresql.org/message-id/flat/CAJEAwVELVx9gYscpE%3DBe6iJxvdW5unZ_LkcAaVNSeOwvdwtD%3DA%40mail.gmail.com



 

0001-Create-cube-opclass-without-toasting.patch
Description: Binary data


0001-Enable-Index-Only-Scan-in-seg.patch
Description: Binary data



Re: [HACKERS] Index expression syntax

2017-09-29 Thread Konstantin Knizhnik



On 29.09.2017 11:03, Marko Tiikkaja wrote:
On Fri, Sep 29, 2017 at 9:31 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:


I wonder why syntax error is produced in this case:

postgres=# create index metaindex on foo using
gin(to_tsvector('english', x)||to_tsvector('english',y));
ERROR:  syntax error at or near "||"
LINE 1: ...taindex on foo using gin(to_tsvector('english',
x)||to_tsvec...
^
[ .. ]

expression:
    An expression based on one or more columns of the table. The
    expression usually must be written with surrounding
    parentheses, as shown in the syntax. However, the parentheses
    can be omitted if the expression has the form of a function call.

So the documentation states that sometimes it is possible to avoid
parentheses, but it is unclear why I have to use double
parentheses...
I think that either the grammar should be fixed, or the documentation
should be updated.


Your expression is clearly not a function call, it's a concatenation 
of two of them.  The documentation seems perfectly accurate to me?



Oh, sorry!
You are right. I just didn't notice the extra parentheses required by the
CREATE INDEX syntax when using expressions.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Index expression syntax

2017-09-29 Thread Marko Tiikkaja
On Fri, Sep 29, 2017 at 9:31 AM, Konstantin Knizhnik <
k.knizh...@postgrespro.ru> wrote:

> I wonder why syntax error is produced in this case:
>
> postgres=# create index metaindex on foo using gin(to_tsvector('english',
> x)||to_tsvector('english',y));
> ERROR:  syntax error at or near "||"
> LINE 1: ...taindex on foo using gin(to_tsvector('english', x)||to_tsvec...
>  ^
> [ .. ]
>
> *expression:* An expression based on one or more columns of the table.
> The expression usually must be written with surrounding parentheses, as
> shown in the syntax. However, the parentheses can be omitted if the
> expression has the form of a function call.
>
> So the documentation states that sometimes it is possible to avoid
> parentheses, but it is unclear why I have to use double parentheses...
> I think that either the grammar should be fixed, or the documentation
> should be updated.
>

Your expression is clearly not a function call, it's a concatenation of two
of them.  The documentation seems perfectly accurate to me?


.m


[HACKERS] Index expression syntax

2017-09-29 Thread Konstantin Knizhnik

I wonder why syntax error is produced in this case:

postgres=# create index metaindex on foo using 
gin(to_tsvector('english', x)||to_tsvector('english',y));

ERROR:  syntax error at or near "||"
LINE 1: ...taindex on foo using gin(to_tsvector('english', x)||to_tsvec...
 ^
The error can be eliminated if extra surrounding parentheses are added:

postgres=# create index metaindex on foo using 
gin((to_tsvector('english', x)||to_tsvector('english',y)));

CREATE INDEX

The PostgreSQL documentation says:

CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ]
    ON table_name [ USING method ]
    ( { column_name | ( expression ) } [ COLLATE collation ]
    [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
    [ WITH ( storage_parameter = value [, ... ] ) ]
    [ TABLESPACE tablespace_name ]
    [ WHERE predicate ]

expression:
    An expression based on one or more columns of the table. The
    expression usually must be written with surrounding parentheses, as
    shown in the syntax. However, the parentheses can be omitted if the
    expression has the form of a function call.


So the documentation states that sometimes it is possible to avoid
parentheses, but it is unclear why I have to use double parentheses...
I think that either the grammar should be fixed, or the documentation
should be updated.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Index Only Scan support for cube

2017-09-20 Thread Alexander Korotkov
On Wed, Sep 20, 2017 at 8:26 AM, Andrey Borodin 
wrote:

> Hi hackers!
> > On 23 May 2017, at 14:53, Andrew Borodin wrote:
> >
> > Here's a small patch that implements fetch function necessary for
> > Index Only Scans that use cube data type.
>
> Tom Lane has just committed d3a4f89 (Allow no-op GiST support functions to
> be omitted). Thanks, Tom! :)
> The "Index Only Scan support for cube" patch is now obsolete. I'm working on
> another similar patch for the contribs to support GiST IOS and remove no-op
> support functions.
>

Good.
BTW, some strangeness in g_cube_decompress() caught my eye.  It compares the
results of two evaluations of the same expression, DatumGetNDBOXP(entry->key).

> NDBOX *key = DatumGetNDBOXP(entry->key);
> if (key != DatumGetNDBOXP(entry->key))


In fact it's correct, because it compares the results of two detoastings.  If
the datum isn't toasted, then the results would be the same.  And if the datum
is toasted, then the results would be two different allocations of the
detoasted datum.  However, we do an extra detoasting here.

For example, see gbt_var_decompress().  There is no extra detoasting there.

> GBT_VARKEY *key = (GBT_VARKEY *) PG_DETOAST_DATUM(entry->key);
> if (key != (GBT_VARKEY *) DatumGetPointer(entry->key))
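
Following that pattern, a cube decompress function without the extra
detoasting could look roughly like this (a minimal sketch, assuming only the
standard GiST decompress calling convention):

    Datum
    g_cube_decompress(PG_FUNCTION_ARGS)
    {
        GISTENTRY  *entry = (GISTENTRY *) PG_GETARG_POINTER(0);
        NDBOX      *key = (NDBOX *) PG_DETOAST_DATUM(entry->key);

        /* Build a new entry only if detoasting gave us a new pointer. */
        if (key != (NDBOX *) DatumGetPointer(entry->key))
        {
            GISTENTRY  *retval = (GISTENTRY *) palloc(sizeof(GISTENTRY));

            gistentryinit(*retval, PointerGetDatum(key),
                          entry->rel, entry->page, entry->offset, false);
            PG_RETURN_POINTER(retval);
        }
        PG_RETURN_POINTER(entry);
    }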


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Index Only Scan support for cube

2017-09-19 Thread Andrey Borodin
Hi hackers!
> On 23 May 2017, at 14:53, Andrew Borodin wrote:
> 
> Here's a small patch that implements fetch function necessary for
> Index Only Scans that use cube data type.

Tom Lane has just committed d3a4f89 (Allow no-op GiST support functions to be
omitted). Thanks, Tom! :)
The "Index Only Scan support for cube" patch is now obsolete. I'm working on
another similar patch for the contribs to support GiST IOS and remove no-op
support functions.

Best regards, Andrey Borodin.



Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-09-07 Thread Alexey Chernyshov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

One thing I have noticed is trailing whitespace after "bogus one":

123 +* If we don't have to fetch the tuple, just return a
124 +* bogus one 
125 +*/

The new status of this patch is: Ready for Committer



Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-09-04 Thread Alexander Kuzmenkov

On 04.09.2017 15:17, Alexey Chernyshov wrote:

make check-world fails on contrib/postgres_fdw because of Subquery Scan on ...
Probably, the query plan has changed.


Hi Alexey,

Thanks for testing! This is the same problem as the one in 
'select_distinct' I mentioned before. I changed the test, the updated 
patch is attached.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c19b3318c7..5f5f65d60c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -2203,22 +2203,23 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2
 -- join with lateral reference
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM ft1 t2, ft2 t3 WHERE t2.c1 = t3.c1 AND t2.c2 = t1.c2) q ORDER BY t1."C 1" OFFSET 10 LIMIT 10;
- QUERY PLAN 
-
+QUERY PLAN
+--
  Limit
Output: t1."C 1"
->  Nested Loop
  Output: t1."C 1"
  ->  Index Scan using t1_pkey on "S 1"."T 1" t1
Output: t1."C 1", t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
- ->  HashAggregate
-   Output: t2.c1, t3.c1
-   Group Key: t2.c1, t3.c1
-   ->  Foreign Scan
+ ->  Subquery Scan on q
+   ->  HashAggregate
  Output: t2.c1, t3.c1
- Relations: (public.ft1 t2) INNER JOIN (public.ft2 t3)
- Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1")) AND ((r1.c2 = $1::integer
-(13 rows)
+ Group Key: t2.c1, t3.c1
+ ->  Foreign Scan
+   Output: t2.c1, t3.c1
+   Relations: (public.ft1 t2) INNER JOIN (public.ft2 t3)
+   Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1")) AND ((r1.c2 = $1::integer
+(14 rows)
 
 SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM ft1 t2, ft2 t3 WHERE t2.c1 = t3.c1 AND t2.c2 = t1.c2) q ORDER BY t1."C 1" OFFSET 10 LIMIT 10;
  C 1 
@@ -2672,16 +2673,17 @@ select c2, sum(c1) from ft2 group by c2 having avg(c1) < 500 and sum(c1) < 49800
 -- Unshippable HAVING clause will be evaluated locally, and other qual in HAVING clause is pushed down
 explain (verbose, costs off)
 select count(*) from (select c5, count(c1) from ft1 group by c5, sqrt(c2) having (avg(c1) / avg(c1)) * random() <= 1 and avg(c1) < 500) x;
-   QUERY PLAN
--
+  QUERY PLAN   
+---
  Aggregate
Output: count(*)
-   ->  Foreign Scan
- Output: ft1.c5, NULL::bigint, (sqrt((ft1.c2)::double precision))
- Filter: (avg(ft1.c1)) / (avg(ft1.c1::double precision * random()) <= '1'::double precision)
- Relations: Aggregate on (public.ft1)
- Remote SQL: SELECT c5, NULL::bigint, sqrt(c2), avg("C 1") FROM "S 1"."T 1" GROUP BY c5, (sqrt(c2)) HAVING ((avg("C 1") < 500::numeric))
-(7 rows)
+   ->  Subquery Scan on x
+ ->  Foreign Scan
+   Output: ft1.c5, NULL::bigint, (sqrt((ft1.c2)::double precision))
+   Filter: (avg(ft1.c1)) / (avg(ft1.c1::double precision * random()) <= '1'::double precision)
+   Relations: Aggregate on (public.ft1)
+   Remote SQL: SELECT c5, NULL::bigint, sqrt(c2), avg("C 1") FROM "S 1"."T 1" GROUP BY c5, (sqrt(c2)) HAVING ((avg("C 1") < 500::numeric))
+(8 rows)
 
 select count(*) from (select c5, count(c1) from ft1 group by c5, sqrt(c2) having (avg(c1) / avg(c1)) * random() <

Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-09-04 Thread Alexey Chernyshov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, failed
Implements feature:   not tested
Spec compliant:   not tested
Documentation:not tested

Hi Alexander,

make check-world fails on contrib/postgres_fdw because of Subquery Scan on ...
Probably, the query plan has changed.

The new status of this patch is: Waiting on Author



Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-08-21 Thread Alexander Kuzmenkov

Hello everyone,

I made a new patch according to the previous comments. It is simpler 
now, only adding a few checks to the bitmap heap scan node. When the 
target list for the bitmap heap scan is empty, and there is no filter, 
and the bitmap page generated by the index scan is exact, and the 
corresponding heap page is visible to all transactions, we don't fetch it.


The performance is better than with the previous patch, because now it 
can leverage the parallel heap scan logic. A simple benchmark is 
attached: this patch is more than ten times faster on a frequent search 
term, and two times faster on an infrequent one.


Still, there is one thing that is bothering me. I use an empty targetlist 
as the marker that I should not fetch tuples. Because of that, I have 
to make sure that use_physical_tlist() doesn't touch empty tlists. 
Consequently, if count(*) sits on top of a subquery, this subquery has 
to project and cannot be deleted (see trivial_subqueryscan). There is 
such a query in the regression test select_distinct: "select count(*) 
from (select distinct two, four, two from tenk1);". For that particular 
query it shouldn't matter much, so I changed the test, but the broader 
implications of this escape me at the moment.
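
The guard itself is small; roughly (a sketch of the use_physical_tlist()
change described above, not the exact patch text):

    /*
     * In use_physical_tlist(): keep an empty tlist empty, since it is
     * the marker that no tuple contents need to be fetched.
     */
    if (rel->reltarget->exprs == NIL)
        return false;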


The cost estimation is very simplified now: I just halve the number of 
pages to be fetched. The most important missing part is checking whether 
we have any quals that are not checked by the index: if we do, we'll 
have to fetch all the tuples. Finding non-index qualifications is 
somewhat convoluted for a bitmap index scan tree involving multiple 
indexes, so I didn't implement it for now. We could also consider 
estimating the number of lossy pages in the tid bitmap given the current 
work_mem size.
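
In code, the current simplification amounts to something like this (names
assumed, shown only to illustrate the halving):

    /* No tuple contents needed: assume half the heap fetches are skipped. */
    if (tlist == NIL && qual == NIL)
        pages_fetched = ceil(pages_fetched / 2.0);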


I'll be glad to hear your thoughts on this.
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 79f534e4e9..d7ea6f6929 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -39,6 +39,7 @@
 
 #include "access/relscan.h"
 #include "access/transam.h"
+#include "access/visibilitymap.h"
 #include "executor/execdebug.h"
 #include "executor/nodeBitmapHeapscan.h"
 #include "miscadmin.h"
@@ -225,9 +226,25 @@ BitmapHeapNext(BitmapHeapScanState *node)
 			}
 
 			/*
-			 * Fetch the current heap page and identify candidate tuples.
+			 * If we don't need the tuple contents and are only counting them,
+			 * we can skip fetching the page if the bitmap doesn't need rechecking
+			 * and all tuples on the page are visible to our transaction
 			 */
-			bitgetpage(scan, tbmres);
+			node->bhs_nofetch = !tbmres->recheck
+&& node->ss.ps.qual == NULL
+&& node->ss.ps.plan->targetlist == NIL
+&& VM_ALL_VISIBLE(node->ss.ss_currentRelation, tbmres->blockno,
+  &node->bhs_vmbuffer);
+
+			if (node->bhs_nofetch)
+scan->rs_ntuples = tbmres->ntuples;
+			else
+			{
+/*
+ * Fetch the current heap page and identify candidate tuples.
+ */
+bitgetpage(scan, tbmres);
+			}
 
 			if (tbmres->ntuples >= 0)
 node->exact_pages++;
@@ -289,45 +306,58 @@ BitmapHeapNext(BitmapHeapScanState *node)
 		 */
 		BitmapPrefetch(node, scan);
 
-		/*
-		 * Okay to fetch the tuple
-		 */
-		targoffset = scan->rs_vistuples[scan->rs_cindex];
-		dp = (Page) BufferGetPage(scan->rs_cbuf);
-		lp = PageGetItemId(dp, targoffset);
-		Assert(ItemIdIsNormal(lp));
 
-		scan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lp);
-		scan->rs_ctup.t_len = ItemIdGetLength(lp);
-		scan->rs_ctup.t_tableOid = scan->rs_rd->rd_id;
-		ItemPointerSet(&scan->rs_ctup.t_self, tbmres->blockno, targoffset);
+		if (node->bhs_nofetch)
+		{
+			/*
+			 * If we don't have to fetch the tuple, just return a
+			 * bogus one 
+			 */
+			slot->tts_isempty = false;
+			slot->tts_nvalid = 0;
+		}
+		else
+		{
+			/*
+			 * Okay to fetch the tuple
+			 */
+			targoffset = scan->rs_vistuples[scan->rs_cindex];
+			dp = (Page) BufferGetPage(scan->rs_cbuf);
+			lp = PageGetItemId(dp, targoffset);
+			Assert(ItemIdIsNormal(lp));
 
-		pgstat_count_heap_fetch(scan->rs_rd);
+			scan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lp);
+			scan->rs_ctup.t_len = ItemIdGetLength(lp);
+			scan->rs_ctup.t_tableOid = scan->rs_rd->rd_id;
+			ItemPointerSet(&scan->rs_ctup.t_self, tbmres->blockno, targoffset);
 
-		/*
-		 * Set up the result slot to point to this tuple. Note that the slot
-		 * acquires a pin on the buffer.
-		 */
-		ExecStoreTuple(&scan->rs_ctup,
-	   slot,
-	   scan->rs_cbuf,
-	   false);
+			pgstat_count_heap_fetch(scan->rs_rd);
 
-		/*
-		 * If we are using lossy info, we have to recheck the qual conditions
-		 * at every tuple.
-		 */
-		if (tbmres->recheck)
-		{
-			econtext->ecxt_scantuple = slot;
-			ResetExprContext(econtext);
+			/*
+			 * Set up the result slot to point to this tuple. Note that the slot
+			 * acquires a pin on the buffer.
+			 */
+			ExecStoreTuple(&scan-

Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT

2017-06-06 Thread Albe Laurenz
Tom Lane wrote:
>> Andres Freund  writes:
>>> Hm, strategically sprinkled CheckTableNotInUse() might do the trick?
> 
> Attached is a proposed patch that closes off this problem.  I've tested
> it to the extent that it blocks Albe's example and passes check-world.

I tested it, and it works fine.  Thanks!

> I'm unsure whether to back-patch or not; the main argument for not doing
> so is that if any extensions are calling DefineIndex() directly, this
> would be an API break for them.  Given what a weird case this is, I'm not
> sure it's worth that.
> 
> A possible end-run around the API objection would be to not add an extra
> argument to DefineIndex() in the back branches, but to use !is_alter_table
> as the control condition.  That would work for the core callers, although
> we might need a special case for bootstrap mode.  On the other hand,
> thinking again about hypothetical third-party callers, it's possible that
> that's not the right answer for them, in which case they'd be really in
> trouble.  So I'm not that much in love with that answer.

It causes a slight bellyache to leave an unfixed data corruption bug
in the back branches (if only index corruption), but I agree that it is
such a weird case to create an index in a BEFORE trigger that we probably
can live without a back-patch.

Yours,
Laurenz Albe



Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT

2017-06-04 Thread Tom Lane
Michael Paquier  writes:
> The patch looks good to me, could you add a regression test?

Done, thanks for the review.  I stuck the test into triggers.sql,
which is not completely on point since there are other ways to get
to this error.  But if we're thinking of it as a framework for testing
other CheckTableNotInUse cases then adding it to e.g. create_index.sql
doesn't seem right either.

regards, tom lane




Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT

2017-06-03 Thread Andres Freund
On 2017-06-03 18:23:33 -0400, Tom Lane wrote:
> Attached is a proposed patch that closes off this problem.  I've tested
> it to the extent that it blocks Albe's example and passes check-world.

Cool.

> I'm unsure whether to back-patch or not; the main argument for not doing
> so is that if any extensions are calling DefineIndex() directly, this
> would be an API break for them.  Given what a weird case this is, I'm not
> sure it's worth that.

I slightly lean against backpatching, it has taken long to be reported,
and it's pretty easy to work around...

- Andres




Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT

2017-06-03 Thread Michael Paquier
On Sun, Jun 4, 2017 at 7:23 AM, Tom Lane  wrote:
> I'm unsure whether to back-patch or not; the main argument for not doing
> so is that if any extensions are calling DefineIndex() directly, this
> would be an API break for them.  Given what a weird case this is, I'm not
> sure it's worth that.

Knowing that it has taken 20 years to get a report of this problem,
it seems to me that one report does not win against the risk of
destabilizing extensions that would surprisingly break after a minor
update.

> A possible end-run around the API objection would be to not add an extra
> argument to DefineIndex() in the back branches, but to use !is_alter_table
> as the control condition.  That would work for the core callers, although
> we might need a special case for bootstrap mode.  On the other hand,
> thinking again about hypothetical third-party callers, it's possible that
> that's not the right answer for them, in which case they'd be really in
> trouble.  So I'm not that much in love with that answer.

Yes, that's not worth the trouble I think for back-branches.

The patch looks good to me, could you add a regression test? This
could be used later on as a basis for other DDL commands if somebody
looks at this problem for the other ones.
-- 
Michael




Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT

2017-06-03 Thread Tom Lane
I wrote:
> Andres Freund  writes:
>> Hm, strategically sprinkled CheckTableNotInUse() might do the trick?

> +1.  We can't reasonably make it work: the outer query already has its
> list of indexes that need to be inserted into.  Also, if you try to
> make the index via ALTER TABLE ADD CONSTRAINT in the trigger, that will
> give you "cannot ALTER TABLE "mytable" because it is being used by active
> queries in this session" because of the check in AlterTable().

Attached is a proposed patch that closes off this problem.  I've tested
it to the extent that it blocks Albe's example and passes check-world.

I'm unsure whether to back-patch or not; the main argument for not doing
so is that if any extensions are calling DefineIndex() directly, this
would be an API break for them.  Given what a weird case this is, I'm not
sure it's worth that.

A possible end-run around the API objection would be to not add an extra
argument to DefineIndex() in the back branches, but to use !is_alter_table
as the control condition.  That would work for the core callers, although
we might need a special case for bootstrap mode.  On the other hand,
thinking again about hypothetical third-party callers, it's possible that
that's not the right answer for them, in which case they'd be really in
trouble.  So I'm not that much in love with that answer.
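
For reference, that back-branch variant would amount to roughly this (a
sketch only, with bootstrap mode carved out as mentioned):

    /* In DefineIndex(), without adding a new parameter: */
    if (!is_alter_table && !IsBootstrapProcessingMode())
        CheckTableNotInUse(rel, "CREATE INDEX");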

> It doesn't seem terribly hard to fix the CREATE INDEX case to behave
> likewise, but I wonder how many other cases we've missed?

That remains an open question :-(

regards, tom lane

diff --git a/src/backend/bootstrap/bootparse.y b/src/backend/bootstrap/bootparse.y
index de3695c..2e1fef0 100644
*** a/src/backend/bootstrap/bootparse.y
--- b/src/backend/bootstrap/bootparse.y
*** Boot_DeclareIndexStmt:
*** 323,328 
--- 323,329 
  $4,
  false,
  false,
+ false,
  true, /* skip_build */
  false);
  	do_end();
*** Boot_DeclareUniqueIndexStmt:
*** 366,371 
--- 367,373 
  $5,
  false,
  false,
+ false,
  true, /* skip_build */
  false);
  	do_end();
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 4861799..c090279 100644
*** a/src/backend/commands/indexcmds.c
--- b/src/backend/commands/indexcmds.c
*** CheckIndexCompatible(Oid oldId,
*** 295,300 
--- 295,303 
   * 'is_alter_table': this is due to an ALTER rather than a CREATE operation.
   * 'check_rights': check for CREATE rights in namespace and tablespace.  (This
   *		should be true except when ALTER is deleting/recreating an index.)
+  * 'check_not_in_use': check for table not already in use in current session.
+  *		This should be true unless caller is holding the table open, in which
+  *		case the caller had better have checked it earlier.
   * 'skip_build': make the catalog entries but leave the index file empty;
   *		it will be filled later.
   * 'quiet': suppress the NOTICE chatter ordinarily provided for constraints.
*** DefineIndex(Oid relationId,
*** 307,312 
--- 310,316 
  			Oid indexRelationId,
  			bool is_alter_table,
  			bool check_rights,
+ 			bool check_not_in_use,
  			bool skip_build,
  			bool quiet)
  {
*** DefineIndex(Oid relationId,
*** 405,410 
--- 409,423 
   errmsg("cannot create indexes on temporary tables of other sessions")));
  
  	/*
+ 	 * Unless our caller vouches for having checked this already, insist that
+ 	 * the table not be in use by our own session, either.  Otherwise we might
+ 	 * fail to make entries in the new index (for instance, if an INSERT or
+ 	 * UPDATE is in progress and has already made its list of target indexes).
+ 	 */
+ 	if (check_not_in_use)
+ 		CheckTableNotInUse(rel, "CREATE INDEX");
+ 
+ 	/*
  	 * Verify we (still) have CREATE rights in the rel's namespace.
  	 * (Presumably we did when the rel was created, but maybe not anymore.)
  	 * Skip check if caller doesn't want it.  Also skip check if
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 7959120..b61fda9 100644
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*** ATExecAddIndex(AlteredTableInfo *tab, Re
*** 6679,6684 
--- 6679,6685 
  		  InvalidOid,	/* no predefined OID */
  		  true, /* is_alter_table */
  		  check_rights,
+ 		  false,	/* check_not_in_use - we did it already */
  		  skip_build,
  		  quiet);
  
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 1e941fb..a22fd53 100644
*** a/src/backend/tcop/utility.c
--- b/src/backend/tcop/utility.c
*** ProcessUtilitySlow(ParseState *pstate,
*** 1329,1334 
--- 1329,1335 
  	InvalidOid, /* no predefined OID */
  	false,		/* is_alter_table */
  	true,		/* check_rights */
+ 	true,		/* check_not_in_use */
  	false,		/

Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT

2017-05-28 Thread Tom Lane
Andres Freund  writes:
> On 2017-05-24 08:26:24 -0400, Robert Haas wrote:
>> I'm willing to bet that nobody ever thought about that case very hard.
>> It seems like we should either make it work or prohibit it, but I
>> can't actually see quite how to do either off-hand.

> Hm, strategically sprinkled CheckTableNotInUse() might do the trick?

+1.  We can't reasonably make it work: the outer query already has its
list of indexes that need to be inserted into.  Also, if you try to
make the index via ALTER TABLE ADD CONSTRAINT in the trigger, that will
give you "cannot ALTER TABLE "mytable" because it is being used by active
queries in this session" because of the check in AlterTable().

It doesn't seem terribly hard to fix the CREATE INDEX case to behave
likewise, but I wonder how many other cases we've missed?

One small issue is that naive use of CheckTableNotInUse would produce an
error reading "cannot CREATE INDEX "mytable" because ...", which is not
exactly on point.  It'd be better for it to say "cannot create an index on
"mytable" because ...".  However, given that it took twenty years for
anybody to notice this, maybe it's close enough.

regards, tom lane




Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT

2017-05-24 Thread Andres Freund
On 2017-05-24 08:26:24 -0400, Robert Haas wrote:
> On Mon, May 22, 2017 at 7:05 AM, Albe Laurenz  wrote:
> > Not that it is a useful use case, but I believe that this is
> > a bug that causes index corruption:
> >
> > CREATE TABLE mytable(
> >id integer PRIMARY KEY,
> >id2 integer NOT NULL
> > );
> >
> > CREATE FUNCTION makeindex() RETURNS trigger
> >LANGUAGE plpgsql AS
> > $$BEGIN
> >CREATE INDEX ON mytable(id2);
> >RETURN NEW;
> > END;$$;
> 
> I'm willing to bet that nobody ever thought about that case very hard.
> It seems like we should either make it work or prohibit it, but I
> can't actually see quite how to do either off-hand.

Hm, strategically sprinkled CheckTableNotInUse() might do the trick?
I've neither tried nor thought this through fully, but I can't think of
a case where pre-existing relcache references to tables are ok...

- Andres




Re: [HACKERS] Index created in BEFORE trigger not updated during INSERT

2017-05-24 Thread Robert Haas
On Mon, May 22, 2017 at 7:05 AM, Albe Laurenz  wrote:
> Not that it is a useful use case, but I believe that this is
> a bug that causes index corruption:
>
> CREATE TABLE mytable(
>id integer PRIMARY KEY,
>id2 integer NOT NULL
> );
>
> CREATE FUNCTION makeindex() RETURNS trigger
>LANGUAGE plpgsql AS
> $$BEGIN
>CREATE INDEX ON mytable(id2);
>RETURN NEW;
> END;$$;

I'm willing to bet that nobody ever thought about that case very hard.
It seems like we should either make it work or prohibit it, but I
can't actually see quite how to do either off-hand.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




[HACKERS] Index Only Scan support for cube

2017-05-23 Thread Andrew Borodin
Hi, hackers!

Here's a small patch that implements fetch function necessary for
Index Only Scans that use cube data type.
I reuse the existing function g_cube_decompress() instead of creating a new
function g_cube_fetch().
Essentially, they both have to detoast data.

What do you think, is it better to create a shallow copy of
g_cube_decompress instead?
Any other suggestions on the functionality?

This:

CREATE TABLE SOMECUBES AS SELECT CUBE(X,X+1) C FROM GENERATE_SERIES(1,100) X;
CREATE INDEX SOMECUBES_IDX ON SOMECUBES USING GIST(C);
SET ENABLE_SEQSCAN = FALSE;
EXPLAIN (COSTS OFF ) SELECT C FROM SOMECUBES WHERE C<@CUBE(30,40);

now produces

Index Only Scan using somecubes_idx on somecubes
  Index Cond: (c <@ '(30),(40)'::cube)

instead of

Index Scan using somecubes_idx on somecubes
  Index Cond: (c <@ '(30),(40)'::cube)



Best regards, Andrey Borodin, Octonica.


cubefetch.patch
Description: Binary data



[HACKERS] Index created in BEFORE trigger not updated during INSERT

2017-05-22 Thread Albe Laurenz
Not that it is a useful use case, but I believe that this is
a bug that causes index corruption:

CREATE TABLE mytable(
   id integer PRIMARY KEY,
   id2 integer NOT NULL
);

CREATE FUNCTION makeindex() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   CREATE INDEX ON mytable(id2);
   RETURN NEW;
END;$$;

CREATE TRIGGER makeindex BEFORE INSERT ON mytable FOR EACH ROW
   EXECUTE PROCEDURE makeindex();

INSERT INTO mytable VALUES (1, 42);

SELECT * FROM mytable;
┌────┬─────┐
│ id │ id2 │
├────┼─────┤
│  1 │  42 │
└────┴─────┘
(1 row)

But 42 is not present in the index:

SET enable_seqscan = off;

SELECT * FROM mytable WHERE id2 = 42;
┌────┬─────┐
│ id │ id2 │
├────┼─────┤
└────┴─────┘
(0 rows)



Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-12 Thread Alexander Kuzmenkov


On 12.04.2017 12:29, Alexander Korotkov wrote:

That's a cool feature for FTS users! Please, register this patch to 
the next commitfest.
I've added this to the 2017-07 commitfest: 
https://commitfest.postgresql.org/14/1117/



Also, what is the planning overhead of this patch? That's worth
testing too.

The planning overhead is about 0.1 - 0.07 ms on the samples from my 
first letter.


Another thing caught my eye.  The estimated number of rows in the Bitmap 
Count node is the same as in the Bitmap Index Scan node.  Should it be 1, 
because it always returns a single row?

You're right, I'll fix this in the next version of the patch.


Does this limitation cause a performance drawback?  When bitmap
index scan returns all rechecks, alternative to Bitmap Count is
still Aggregate + Bitmap Heap Scan.  Thus, I think Bitmap Count
would have the same performance or even slightly faster.  That's
worth testing.

Bitmap heap scan can indeed be faster, because it prefetches heap pages, 
and can be run in parallel. When the table data is not cached, the 
difference is not big on my machine. It could probably be significant if 
I used storage that supported parallel reads. When the data is cached 
in memory, the parallel bitmap heap scan can be significantly faster.
We could use the standard bitmap heap scan node with some tweaks, as 
discussed in the other subthread, to avoid this regression.


Here are some test queries:

--- not cached ---
test1=# explain analyze select count(*) from pglist where fts @@ 
to_tsquery( 'tom & lane' );

  QUERY PLAN
--
 Bitmap Count on pglist  (cost=542.65..1087.68 rows=54503 width=8) 
(actual time=30264.174..30264.177 rows=1 loops=1)

   Recheck Cond: (fts @@ to_tsquery('tom & lane'::text))
   Rows Removed by Index Recheck: 270853
   Heap Fetches: 66138
   Heap Blocks: exact=39854 lossy=66138
   ->  Bitmap Index Scan on idx_pglist_fts  (cost=0.00..529.02 
rows=54503 width=0) (actual time=525.341..525.341 rows=222813 loops=1)

 Index Cond: (fts @@ to_tsquery('tom & lane'::text))
 Planning time: 128.238 ms
 Execution time: 30264.299 ms
(9 rows)

test1=# set enable_bitmapcount to off;
SET
test1=# explain analyze select count(*) from pglist where fts @@ 
to_tsquery( 'tom & lane' );

QUERY PLAN

 Finalize Aggregate  (cost=119989.73..119989.74 rows=1 width=8) (actual 
time=31699.829..31699.830 rows=1 loops=1)
   ->  Gather  (cost=119989.52..119989.73 rows=2 width=8) (actual 
time=31698.699..31699.819 rows=3 loops=1)

 Workers Planned: 2
 Workers Launched: 2
 ->  Partial Aggregate  (cost=118989.52..118989.53 rows=1 
width=8) (actual time=31689.289..31689.290 rows=1 loops=3)
   ->  Parallel Bitmap Heap Scan on pglist 
(cost=542.65..118932.74 rows=22710 width=0) (actual 
time=608.532..31634.692 rows=74271 loops=3)

 Recheck Cond: (fts @@ to_tsquery('tom & lane'::text))
 Rows Removed by Index Recheck: 90284
 Heap Blocks: exact=13242 lossy=21960
 ->  Bitmap Index Scan on idx_pglist_fts 
(cost=0.00..529.02 rows=54503 width=0) (actual time=552.136..552.136 
rows=222813 loops=1)
   Index Cond: (fts @@ to_tsquery('tom & 
lane'::text))

 Planning time: 160.055 ms
 Execution time: 31724.468 ms
(13 rows)


--- cached ---
test1=# explain analyze select count(*) from pglist where fts @@ 
to_tsquery( 'tom & lane' );

QUERY PLAN
--
 Finalize Aggregate  (cost=119989.73..119989.74 rows=1 width=8) (actual 
time=1250.973..1250.973 rows=1 loops=1)
   ->  Gather  (cost=119989.52..119989.73 rows=2 width=8) (actual 
time=1250.588..1250.964 rows=3 loops=1)

 Workers Planned: 2
 Workers Launched: 2
 ->  Partial Aggregate  (cost=118989.52..118989.53 rows=1 
width=8) (actual time=1246.144..1246.144 rows=1 loops=3)
   ->  Parallel Bitmap Heap Scan on pglist 
(cost=542.65..118932.74 rows=22710 width=0) (actual 
time=82.781..1237.585 rows=74271 loops=3)

 Recheck Cond: (fts @@ to_tsquery('tom & lane'::text))
 Rows Removed by Index Recheck: 90284
 Heap Blocks: exact=13221 lossy=22217
 ->  Bitmap Index Scan on idx_pglist_fts

Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-12 Thread Alexander Kuzmenkov

On 12.04.2017 17:24, Tom Lane wrote:

TBH, I'm not sure you need to do any of that work.  Have you got evidence
that the planner will fail to choose the right plan regardless? I'm
particularly unconvinced that choose_bitmap_and is a critical problem,
because once you're into having to AND multiple indexes, you're talking
about an expensive query anyhow.
The most expensive part would probably be accessing the heap in the 
bitmap heap scan. It may be worth trying to choose an index path that 
checks all the restrictions and therefore allows us to skip this heap 
access. This path might not be the cheapest one, though. The standard 
AND selection procedure would have discarded it based on cost.
I've seen this happen on the regression database. Somehow I can't seem 
to reproduce it on my earlier full-text search example.


An example:

regression=# explain select count(*) from tenk1 where hundred < 90 and 
thousand < 31;

QUERY PLAN
---
 Bitmap Count on tenk1  (cost=182.69..185.56 rows=1 width=8)
   Recheck Cond: ((thousand < 31) AND (hundred < 90))
   ->  BitmapAnd  (cost=182.69..182.69 rows=287 width=0)
 ->  Bitmap Index Scan on tenk1_thous_tenthous (cost=0.00..6.68 
rows=319 width=0)

   Index Cond: (thousand < 31)
 ->  Bitmap Index Scan on tenk1_hundred (cost=0.00..175.62 
rows=8978 width=0)

   Index Cond: (hundred < 90)
(7 rows)

regression=# set enable_bitmapcount to off;
SET
regression=# explain select count(*) from tenk1 where hundred < 90 and 
thousand < 31;

QUERY PLAN
---
 Aggregate  (cost=375.34..375.35 rows=1 width=8)
   ->  Bitmap Heap Scan on tenk1  (cost=6.75..374.62 rows=287 width=0)
 Recheck Cond: (thousand < 31)
 Filter: (hundred < 90)
 ->  Bitmap Index Scan on tenk1_thous_tenthous (cost=0.00..6.68 
rows=319 width=0)

   Index Cond: (thousand < 31)
(6 rows)

--

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-12 Thread Tom Lane
Alexander Kuzmenkov  writes:
> With planner, the changes are more complex. Two things must be done 
> there. First, when the tlist is empty, we must use a different cost 
> function for bitmap heap scan, because the heap access pattern is 
> different. Second, choose_bitmap_and() must use a different algorithm 
> for choosing the right combination of paths. A standard algorithm 
> chooses the combination based on cost. For count(*) purposes, the 
> decisive factor is that the path has to check all the restrictions, or 
> else we'll need heap access to recheck some of them, which defeats the 
> purpose of having this optimization. The planner code that builds and 
> costs the index path is fairly complex, and integrating this additional 
> behavior into it didn't look good to me.

TBH, I'm not sure you need to do any of that work.  Have you got evidence
that the planner will fail to choose the right plan regardless?  I'm
particularly unconvinced that choose_bitmap_and is a critical problem,
because once you're into having to AND multiple indexes, you're talking
about an expensive query anyhow.

regards, tom lane




Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-12 Thread Alexander Kuzmenkov

On 12.04.2017 15:04, Tom Lane wrote:

Andrew Gierth  writes:

"Alexander" == Alexander Kuzmenkov  writes:
  Alexander> Structurally, the patch consists of two major parts: a
  Alexander> specialized executor node
Why?
It strikes me that the significant fact here is not that we're doing
count(*), but that we don't need any columns from the bitmap heap scan
result.  Rather than creating a whole new node, can't the existing
bitmap heapscan be taught to skip fetching the actual table page in
cases where it's all-visible, not lossy, and no columns are needed?

+1 ... while I hadn't actually looked at the code, it seemed to me that
anything like the optimization-as-described would be impossibly klugy
from the planner's standpoint.  Your formulation sounds lots nicer.

Detecting that no columns are needed in the executor might be a bit tricky
because of the planner's habit of inserting a "physical tlist" to avoid a
projection step.  (See also nearby discussion about custom scan planning.)
But we could fix that.  I think one rule that would make sense is to
just disable the physical-tlist substitution if the relation's targetlist
is empty --- it wouldn't be buying much in such a case anyhow.  Then the
runtime tlist for the scan node would also be empty, and away you go.
When making an early prototype of this optimization, I did what you are 
describing with the bitmap heap scan executor node. In an internal 
review, it was said that the bitmap heap scan node is already 
complicated enough and shouldn't have more logic added to it, so I 
rewrote it as a separate node. To me, your approach looks good too, so if 
the community is generally in favor of this, I could rewrite the 
executor as such.


With planner, the changes are more complex. Two things must be done 
there. First, when the tlist is empty, we must use a different cost 
function for bitmap heap scan, because the heap access pattern is 
different. Second, choose_bitmap_and() must use a different algorithm 
for choosing the right combination of paths. A standard algorithm 
chooses the combination based on cost. For count(*) purposes, the 
decisive factor is that the path has to check all the restrictions, or 
else we'll need heap access to recheck some of them, which defeats the 
purpose of having this optimization. The planner code that builds and 
costs the index path is fairly complex, and integrating this additional 
behavior into it didn't look good to me. Instead, I created a 
specialized path node and isolated the logic that handles it. The 
resulting implementation adds several functions, but it is mostly 
self-contained and has a single entry point in grouping_planner(). It 
handles the specific case of bitmap count plans and doesn't complicate 
the existing code any further.


The planner part is to some extent independent of whether we use a 
separate executor node or not. If we choose not to, the same count(*) 
optimization code I proposed could create plain bitmap heap scan paths.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-12 Thread Tom Lane
Andrew Gierth  writes:
> "Alexander" == Alexander Kuzmenkov  writes:
>  Alexander> Structurally, the patch consists of two major parts: a
>  Alexander> specialized executor node

> Why?

> It strikes me that the significant fact here is not that we're doing
> count(*), but that we don't need any columns from the bitmap heap scan
> result.  Rather than creating a whole new node, can't the existing
> bitmap heapscan be taught to skip fetching the actual table page in
> cases where it's all-visible, not lossy, and no columns are needed?

+1 ... while I hadn't actually looked at the code, it seemed to me that
anything like the optimization-as-described would be impossibly klugy
from the planner's standpoint.  Your formulation sounds lots nicer.

Detecting that no columns are needed in the executor might be a bit tricky
because of the planner's habit of inserting a "physical tlist" to avoid a
projection step.  (See also nearby discussion about custom scan planning.)
But we could fix that.  I think one rule that would make sense is to
just disable the physical-tlist substitution if the relation's targetlist
is empty --- it wouldn't be buying much in such a case anyhow.  Then the
runtime tlist for the scan node would also be empty, and away you go.

regards, tom lane




Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-12 Thread Andrew Gierth
> "Alexander" == Alexander Kuzmenkov  writes:

 Alexander> Structurally, the patch consists of two major parts: a
 Alexander> specialized executor node

Why?

It strikes me that the significant fact here is not that we're doing
count(*), but that we don't need any columns from the bitmap heap scan
result.  Rather than creating a whole new node, can't the existing
bitmap heapscan be taught to skip fetching the actual table page in
cases where it's all-visible, not lossy, and no columns are needed?

(this would also have the advantage of getting parallelism for free)

-- 
Andrew (irc:RhodiumToad)




Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-12 Thread Alexander Korotkov
On Wed, Apr 12, 2017 at 12:40 AM, Alexander Korotkov <
a.korot...@postgrespro.ru> wrote:

> On Tue, Apr 11, 2017 at 8:24 PM, Alexander Kuzmenkov <
> a.kuzmen...@postgrespro.ru> wrote:
>
>> I would like to propose a patch that speeds up the queries of the form
>> 'select
>> count(*) ... where ...',  where the restriction clauses can be satisfied
>> by some
>> indexes. At the moment, such queries use index-only scans followed by
>> aggregation. Index-only scans are only possible for indexes that are
>> capable of
>> returning indexed tuples, that is, support the 'amgettuple' access
>> method. They
>> are not applicable to indexes such as GIN and RUM. However, it is
>> possible to
>> improve count(*) performance for indexes that support bitmap scans. Having
>> performed a bitmap index scan or a combination of such, the bits in
>> bitmap can
>> be counted to obtain the final result. Of course, the bitmap pages that
>> are
>> lossy or not visible to all existing transactions will still require heap
>> access.
>>
>
> That's a cool feature for FTS users! Please, register this patch to the
> next commitfest.
>
> This patch has some important limitations:
>> * It only supports targetlist consisting of a single expression that can
>> be
>> projected from count(*).
>> * count(expr) is not supported. We could support it for cases where the
>> "expr is not null" restriction can be satisfied with an index.
>> * The current implementation does not support parallel execution. It
>> could be
>> implemented during the PostgreSQL 11 release cycle.
>> * For some indexes, the bitmap index scan will always require rechecking
>> all
>> the tuples. Bitmap count plans should not be used in such cases. For now,
>> this
>> check is not implemented.
>>
>
> Does this limitation cause a performance drawback?  When a bitmap index
> scan requires rechecking all tuples, the alternative to Bitmap Count is
> still Aggregate + Bitmap Heap Scan.  Thus, I think Bitmap Count would have
> the same performance or even be slightly faster.  That's worth testing.
>
> Also, what is the planning overhead of this patch?  That's worth testing
> too.
>

Another thing catches my eye.  The estimated number of rows in the Bitmap
Count node is the same as in the Bitmap Index Scan node.  Should it be 1,
since it always returns a single row?

> test1=# explain analyze select count(*) from pglist where fts @@
> to_tsquery( 'tom & lane' );
>                                QUERY PLAN
> --------------------------------------------------------------------------
>  Bitmap Count on pglist  (cost=550.65..1095.68 rows=54503 width=8) (actual
> time=1120.281..1120.281 rows=1 loops=1)
>    Recheck Cond: (fts @@ to_tsquery('tom & lane'::text))
>    Heap Fetches: 0
>    Heap Blocks: exact=105992
>    ->  Bitmap Index Scan on idx_pglist_rum_fts  (cost=0.00..537.02
> rows=54503 width=0) (actual time=1056.060..1056.060 rows=222813 loops=1)
>          Index Cond: (fts @@ to_tsquery('tom & lane'::text))
>  Planning time: 119.568 ms
>  Execution time: 1121.409 ms
> (8 rows)


--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-11 Thread Alexander Korotkov
On Tue, Apr 11, 2017 at 8:24 PM, Alexander Kuzmenkov <
a.kuzmen...@postgrespro.ru> wrote:

> I would like to propose a patch that speeds up the queries of the form
> 'select
> count(*) ... where ...',  where the restriction clauses can be satisfied
> by some
> indexes. At the moment, such queries use index-only scans followed by
> aggregation. Index-only scans are only possible for indexes that are
> capable of
> returning indexed tuples, that is, support the 'amgettuple' access method.
> They
> are not applicable to indexes such as GIN and RUM. However, it is possible
> to
> improve count(*) performance for indexes that support bitmap scans. Having
> performed a bitmap index scan or a combination of such, the bits in bitmap
> can
> be counted to obtain the final result. Of course, the bitmap pages that are
> lossy or not visible to all existing transactions will still require heap
> access.
>

That's a cool feature for FTS users! Please, register this patch to the
next commitfest.

This patch has some important limitations:
> * It only supports targetlist consisting of a single expression that can be
> projected from count(*).
> * count(expr) is not supported. We could support it for cases where the
> "expr is not null" restriction can be satisfied with an index.
> * The current implementation does not support parallel execution. It could
> be
> implemented during the PostgreSQL 11 release cycle.
> * For some indexes, the bitmap index scan will always require rechecking
> all
> the tuples. Bitmap count plans should not be used in such cases. For now,
> this
> check is not implemented.
>

Does this limitation cause a performance drawback?  When a bitmap index scan
requires rechecking all tuples, the alternative to Bitmap Count is still
Aggregate + Bitmap Heap Scan.  Thus, I think Bitmap Count would have the
same performance or even be slightly faster.  That's worth testing.

Also, what is the planning overhead of this patch?  That's worth testing too.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


[HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-11 Thread Alexander Kuzmenkov

Hi,

I would like to propose a patch that speeds up queries of the form
'select count(*) ... where ...', where the restriction clauses can be
satisfied by some indexes. At the moment, such queries use index-only
scans followed by aggregation. Index-only scans are only possible for
indexes that are capable of returning indexed tuples, that is, that
support the 'amgettuple' access method. They are not applicable to
indexes such as GIN and RUM. However, it is possible to improve count(*)
performance for indexes that support bitmap scans. Having performed a
bitmap index scan or a combination of such scans, the bits in the bitmap
can be counted to obtain the final result. Of course, the bitmap pages
that are lossy or not visible to all existing transactions will still
require heap access.

One kind of application that can benefit from this change is full-text
search with pagination. To show a search results page, the application
has to know which results go on the current page, and the total number
of results. Getting one page is fast when the desired sort order can be
provided by an index. For example, results can be sorted by date with a
separate btree index, or by relevance with a RUM index. However, getting
the total number of results is more difficult. With text search indexes,
it requires a bitmap heap scan, which can be rather slow due to the
obligatory heap access. A well-known hack for this is using the
approximate row count from 'explain' results. The proposed change allows
the user to obtain the precise number of results in an efficient and
idiomatic manner.
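
For concreteness, such an application issues a pair of queries along
these lines (a sketch using the pglist table described below; the '<=>'
ordering operator is provided by the rum extension):

-- one page of results, ordered by relevance; fast when a RUM index
-- can supply the ordering:
select id, subject, sent
from pglist
where fts @@ to_tsquery('tom & lane')
order by fts <=> to_tsquery('tom & lane')
limit 20;

-- the total hit count for the pager; this is the query that the
-- proposed bitmap count plan speeds up:
select count(*)
from pglist
where fts @@ to_tsquery('tom & lane');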

The performance of this approach was tested on an archive of the
pgsql-hackers mailing list. The detailed results for two sample queries
can be found in the attached file 'benchmark.txt'. The first test
demonstrates full-text search with a RUM index, ordering the results by
rank. For a query with low selectivity, getting the top results is much
faster than counting them all with a bitmap heap scan; with the bitmap
count execution plan, the results can be counted much faster. A similar
test is done with a GIN index, where the results are restricted and
ordered by date using another btree index. Again, it shows a significant
speedup for the count(*) query with a bitmap count scan compared to a
bitmap heap scan. These results demonstrate that the bitmap count plan
can indeed be useful for full-text search scenarios.

Structurally, the patch consists of two major parts: a specialized
executor node, and the generation of the corresponding paths and plans.
The optimizer behavior here is similar to that of the min/max aggregate
optimization. The main entry point is preprocess_count_aggs(); it is
called by grouping_planner(). It searches for a count(*) expression in
the target list and, if possible, generates a special path for it
(struct BitmapCountPath). This path is then added to the
UPPERREL_GROUP_AGG upperrel, and competes with other paths at the
further stages of planning.

The executor node (nodeBitmapCount.c) is similar to the bitmap heap scan
node, with the main difference being that it does not access the heap
for tuples that are known to satisfy the restriction and to be visible
to all transactions.

This patch has some important limitations:
* It only supports a targetlist consisting of a single expression that
can be projected from count(*).
* count(expr) is not supported. We could support it for cases where the
"expr is not null" restriction can be satisfied with an index (see the
sketch after this list).
* The current implementation does not support parallel execution. It
could be implemented during the PostgreSQL 11 release cycle.
* For some indexes, the bitmap index scan will always require rechecking
all the tuples. Bitmap count plans should not be used in such cases. For
now, this check is not implemented.
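
To illustrate the count(expr) point, the rewrite that an index would
have to satisfy looks like this (a sketch; 'author' stands in for an
arbitrary countable expression):

-- count(author) counts only rows where author is not null, so it is
-- equivalent to:
select count(*)
from pglist
where fts @@ to_tsquery('tom & lane')
  and author is not null;

-- a bitmap count plan could handle this form if the "author is not
-- null" restriction were satisfiable by some index.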

I would be glad to hear your comments on this patch.

Regards,
Alexander Kuzmenkov

=== The data ===

test1=# \d pglist
                            Table "public.pglist"
   Column   |            Type             | Collation | Nullable | Default
------------+-----------------------------+-----------+----------+---------
 id         | integer                     |           |          |
 sent       | timestamp without time zone |           |          |
 subject    | text                        |           |          |
 author     | text                        |           |          |
 body_plain | text                        |           |          |
 fts        | tsvector                    |           |          |
Indexes:
    "idx_pglist_rum_fts" rum (fts)
    "idx_pglist_fts" gin (fts)
    "idx_pglist_sent" btree (sent)


test1=# select min(sent), max(sent), count(*) from pglist;
         min         |         max         |  count
---------------------+---------------------+---------
 1997-06-24 11:31:09 | 2016-04-27 23:46:29 | 1013770
(1 row)



Re: [HACKERS] Index usage for elem-contained-by-const-range clauses

2017-03-28 Thread David Steele

On 3/24/17 10:50 AM, David Steele wrote:
> Hi Pritam,
>
> On 3/17/17 5:41 PM, Pritam Baral wrote:
>> So sorry. I'm attaching the correct version of the original with this,
>> in case you want to test the limited implementation, because I still
>> have to go through Tom's list of suggestions.
>>
>> BTW, the patch is for applying on top of REL9_6_2, and while I
>> suspect it may work on master too, I haven't tested it since the
>> original submission (Feb 23).
>>
>>> Also, I noticed that the patch doesn't have regression tests.  Some
>>> mention of this optimization in the docs would also be nice to have.
>
> This thread has been idle for a week.  Please respond and/or post a new
> patch by 2017-03-28 00:00 AoE (UTC-12) or this submission will be marked
> "Returned with Feedback".


This submission has been marked "Returned with Feedback".  Please feel 
free to resubmit to a future commitfest.


Regards,
--
-David
da...@pgmasters.net




Re: [HACKERS] Index usage for elem-contained-by-const-range clauses

2017-03-27 Thread Alexander Korotkov
On Sat, Mar 18, 2017 at 12:41 AM, Pritam Baral 
wrote:

> On Friday 10 March 2017 07:59 PM, Alexander Korotkov wrote:
>> Hi, Pritam!
>>
>> I've assigned to review this patch.
>>
>> On Thu, Feb 23, 2017 at 2:17 AM, Pritam Baral  wrote:
>>> The topic has been previously discussed[0] on the -performance mailing
>>> list, about four years ago.
>>>
>>> In that thread, Tom suggested[0] the planner could be made to "expand
>>> "intcol <@ 'x,y'::int4range" into "intcol between x and y", using
>>> something similar to the index LIKE optimization (ie, the "special
>>> operator" stuff in indxpath.c)".
>>
>> That's a cool idea.  But I would say more.  Sometimes it's useful to
>> transform "intcol between x and y" into "intcol <@ 'x,y'::int4range".
>> btree_gin supports "intcol between x and y" as an overlap of "intcol >= x"
>> and "intcol <= y".  That is very inefficient.  But if this clause were
>> transformed into "intcol <@ 'x,y'::int4range", btree_gin could handle it
>> very efficiently.
>>
>>> This patch tries to do exactly that. It's not tied to any specific
>>> datatype, and has been tested with both builtin types and custom range
>>> types. Most of the checking for proper datatypes, operators, and btree
>>> index happens before this code, so I haven't run into any issues yet in
>>> my testing. But I'm not familiar enough with the internals to be able to
>>> confidently say it can handle all cases just yet.
>>
>> I've tried this patch.  It applies cleanly, but doesn't compile.
>>
>> indxpath.c:4252:1: error: conflicting types for 'range_elem_contained_quals'
>> range_elem_contained_quals(Node *leftop, Datum rightop)
>> ^
>> indxpath.c:192:14: note: previous declaration is here
>> static List *range_elem_contained_quals(Node *leftop, Oid expr_op, Oid opfamily,
>>              ^
>> Could you please recheck that you published the right version of the patch?
>
> So sorry. I'm attaching the correct version of the original with this,
> in case you want to test the limited implementation, because I still
> have to go through Tom's list of suggestions.
>
> BTW, the patch is for applying on top of REL9_6_2, and while I
> suspect it may work on master too, I haven't tested it since the
> original submission (Feb 23).


What is the idea behind basing the patch on REL9_6_2?
This patch implements new functionality, and it's definitely not going to be
considered for committing to stable release branches.
If you are interested in committing this patch to master, please rebase it
on the master branch.  If not, please clarify the purpose of this submission.

Also, please include some version numbering in the patch name, so that we
could distinguish one version of the patch from another.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Index usage for elem-contained-by-const-range clauses

2017-03-24 Thread David Steele

Hi Pritam,

On 3/17/17 5:41 PM, Pritam Baral wrote:
> So sorry. I'm attaching the correct version of the original with this,
> in case you want to test the limited implementation, because I still
> have to go through Tom's list of suggestions.
>
> BTW, the patch is for applying on top of REL9_6_2, and while I
> suspect it may work on master too, I haven't tested it since the
> original submission (Feb 23).
>
>> Also, I noticed that the patch doesn't have regression tests.  Some
>> mention of this optimization in the docs would also be nice to have.


This thread has been idle for a week.  Please respond and/or post a new 
patch by 2017-03-28 00:00 AoE (UTC-12) or this submission will be marked 
"Returned with Feedback".


Thanks,
--
-David
da...@pgmasters.net




Re: [HACKERS] Index usage for elem-contained-by-const-range clauses

2017-03-17 Thread Pritam Baral

On Friday 10 March 2017 07:59 PM, Alexander Korotkov wrote:
> Hi, Pritam!
>
> I've assigned to review this patch.
>
> On Thu, Feb 23, 2017 at 2:17 AM, Pritam Baral  wrote:
>> The topic has been previously discussed[0] on the -performance mailing
>> list, about four years ago.
>>
>> In that thread, Tom suggested[0] the planner could be made to "expand
>> "intcol <@ 'x,y'::int4range" into "intcol between x and y", using
>> something similar to the index LIKE optimization (ie, the "special
>> operator" stuff in indxpath.c)".
>
> That's a cool idea.  But I would say more.  Sometimes it's useful to
> transform "intcol between x and y" into "intcol <@ 'x,y'::int4range".
> btree_gin supports "intcol between x and y" as an overlap of "intcol >= x"
> and "intcol <= y".  That is very inefficient.  But if this clause were
> transformed into "intcol <@ 'x,y'::int4range", btree_gin could handle it
> very efficiently.
>
>> This patch tries to do exactly that. It's not tied to any specific
>> datatype, and has been tested with both builtin types and custom range
>> types. Most of the checking for proper datatypes, operators, and btree
>> index happens before this code, so I haven't run into any issues yet in
>> my testing. But I'm not familiar enough with the internals to be able to
>> confidently say it can handle all cases just yet.
>
> I've tried this patch.  It applies cleanly, but doesn't compile.
>
> indxpath.c:4252:1: error: conflicting types for 'range_elem_contained_quals'
> range_elem_contained_quals(Node *leftop, Datum rightop)
> ^
> indxpath.c:192:14: note: previous declaration is here
> static List *range_elem_contained_quals(Node *leftop, Oid expr_op, Oid opfamily,
>              ^
> Could you please recheck that you published the right version of the patch?

So sorry. I'm attaching the correct version of the original with this,
in case you want to test the limited implementation, because I still
have to go through Tom's list of suggestions.

BTW, the patch is for applying on top of REL9_6_2, and while I
suspect it may work on master too, I haven't tested it since the
original submission (Feb 23).


> Also, I noticed that the patch doesn't have regression tests.  Some
> mention of this optimization in the docs would also be nice to have.


--
#!/usr/bin/env regards
Chhatoi Pritam Baral

diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 2952bfb7c2..40a3c2c9f4 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -30,21 +30,23 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/predtest.h"
 #include "optimizer/prep.h"
 #include "optimizer/restrictinfo.h"
 #include "optimizer/var.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/rangetypes.h"
 #include "utils/selfuncs.h"
+#include "utils/typcache.h"
 
 
 #define IsBooleanOpfamily(opfamily) \
 	((opfamily) == BOOL_BTREE_FAM_OID || (opfamily) == BOOL_HASH_FAM_OID)
 
 #define IndexCollMatchesExprColl(idxcollation, exprcollation) \
 	((idxcollation) == InvalidOid || (idxcollation) == (exprcollation))
 
 /* Whether we are looking for plain indexscan, bitmap scan, or either */
 typedef enum
@@ -180,20 +182,21 @@ static Expr *expand_boolean_index_clause(Node *clause, int indexcol,
 			IndexOptInfo *index);
 static List *expand_indexqual_opclause(RestrictInfo *rinfo,
 		  Oid opfamily, Oid idxcollation);
 static RestrictInfo *expand_indexqual_rowcompare(RestrictInfo *rinfo,
 			IndexOptInfo *index,
 			int indexcol);
 static List *prefix_quals(Node *leftop, Oid opfamily, Oid collation,
 			 Const *prefix, Pattern_Prefix_Status pstatus);
 static List *network_prefix_quals(Node *leftop, Oid expr_op, Oid opfamily,
 	 Datum rightop);
+static List *range_elem_contained_quals(Node *leftop, Datum rightop);
 static Datum string_to_datum(const char *str, Oid datatype);
 static Const *string_to_const(const char *str, Oid datatype);
 
 
 /*
  * create_index_paths()
  *	  Generate all interesting index paths for the given relation.
  *	  Candidate paths are added to the rel's pathlist (using add_path).
  *
  * To be considered for an index scan, an index must match one or more
@@ -3286,20 +3289,23 @@ match_special_index_operator(Expr *clause, Oid opfamily, Oid idxcollation,
 			/* the right-hand const is type text for all of these */
 			pstatus = pattern_fixed_prefix(patt, Pattern_Type_Regex_IC, expr_coll,
 		   &prefix, NULL);
 			isIndexable = (pstatus != Pattern_Prefix_None);
 			break;
 
 		case OID_INET_SUB_OP:
 		case OID_INET_SUBEQ_OP:
 			isIndexable = true;
 			break;
+		case OID_RANGE_ELEM_CONTAINED_OP:
+			isIndexable = true;
+			break;
 	}
 
 	if (prefix)
 	{
 		pfree(DatumGetPointer(prefix->constvalue));

Re: [HACKERS] Index usage for elem-contained-by-const-range clauses

2017-03-17 Thread Pritam Baral

On Sunday 12 March 2017 01:58 AM, Jim Nasby wrote:

> On 3/10/17 8:29 AM, Alexander Korotkov wrote:
>> That's a cool idea.  But I would say more.  Sometimes it's useful to
>> transform "intcol between x and y" into "intcol <@ 'x,y'::int4range".
>> btree_gin supports "intcol between x and y" as an overlap of "intcol >= x"
>> and "intcol <= y".  That is very inefficient.  But if this clause were
>> transformed into "intcol <@ 'x,y'::int4range", btree_gin could handle it
>> very efficiently.
>
> That would certainly be nice as well, but IMHO it's outside the scope of
> this patch to accomplish that.


Also, I think btree indexes are more common than btree_gin. The motivation for
this originally came from trying to use the primary key of a large table in a
range search, and the primary key index was the default btree.

Also, this is my first deep dive into Postgres's source code, so I took a few
easy ways out, just to get started. If it's not too complex to get btree_gin to
handle between queries as contained-in-range, I can give it a try.


> BTW, while we're wishing for things... Something else that would be nice
> is if there was a way to do these kinds of transforms without hacking the
> backend...


Indeed. And this was one of the things Tom said back when a similar discussion
had happened (on the -performance mailing list). But seeing as how it's been
almost four years since then, I decided to go ahead with the backend hacking
anyway.


>> Also, I noticed that the patch doesn't have regression tests.
>
> BTW, those tests need to pay special attention to inclusive vs exclusive
> bounds.


I will add regression tests, though I do have to get through all of Tom's
suggestions elsewhere in this thread first.

--
#!/usr/bin/env regards
Chhatoi Pritam Baral





Re: [HACKERS] Index usage for elem-contained-by-const-range clauses

2017-03-17 Thread Tom Lane
I wrote:
> * You're not bothering to insert any inputcollid into the generated
> comparison operator nodes.  I'm not sure why that fails to fall over
> for text comparisons (if indeed it does fail ...) but it's wrong.
> Use the range type's collation there.

Oh ... looking at this again, I realize that there's an additional
validity check missing: if the range type's collation doesn't match
the index column's collation, we can't do this optimization at all.
That check probably belongs in match_special_index_operator.

regards, tom lane




Re: [HACKERS] Index usage for elem-contained-by-const-range clauses

2017-03-14 Thread Tom Lane
Pritam Baral  writes:
> The topic has been previously discussed[0] on the -performance mailing list,
> about four years ago.
> In that thread, Tom suggested[0] the planner could be made to "expand
> "intcol <@
> 'x,y'::int4range" into "intcol between x and y", using something similar
> to the
> index LIKE optimization (ie, the "special operator" stuff in indxpath.c)".

> This patch tries to do exactly that.

I took a quick look through this, and have some thoughts ---

* In match_special_index_operator, there are two switch statements and
you've added a clause to only one of them.  In the second one, you need to
add a check that you're working with a btree index.  I'd expect the patch
as-is to fall over if an "indexkey <@ range" clause were matched to a hash
index.

* You're failing to account for the case of "range @> indexkey", which
really this ought to work for.  That would take a bit of fooling around
with the structure of match_special_index_operator to allow indexkey on
the right, but we've foreseen since the beginning that that would someday
be necessary.  Looks like today is that day.

* The first bit in range_elem_contained_quals will fall over for an
indexkey that is an expression rather than a simple Var.  Probably
you should just be using exprType() instead.  (Not sure about whether
that's sufficient in domain cases, though.)  Or actually, why look at
that at all?  Seems like what you want is to look at the RHS input,
ie the range value, and get the relevant element datatype from it.
That would correspond to what will happen if the <@ operator executes
normally: elem_contained_by_range does not consult the type of its LHS.

* The "return NIL" for an empty range looks pretty dubious.  Even if it
fails to fail altogether, it's not doing what we really want, which is to
signal that the condition cannot succeed so we needn't search the index.
Maybe what we should do in that case is generate an "indexkey = NULL"
qual.

* Likewise, if the range is infinite, you're just returning NIL and that's
leaving something on the table.  Probably worth generating "indexkey IS
NOT NULL" in that case; it's not going to help much in typical usage, but
it would prevent scanning nulls if there are a lot of them in the index.

* elog(ERROR) doesn't return, so stuff like this is not idiomatic:

+		if (opr2oid == InvalidOid)
+		{
+			elog(ERROR, "no <= operator for opfamily %u", opfamily);
+			return NIL;
+		}

It'd be sufficient to do

+		if (opr2oid == InvalidOid)
+			elog(ERROR, "no <= operator for opfamily %u", opfamily);

* You're not bothering to insert any inputcollid into the generated
comparison operator nodes.  I'm not sure why that fails to fall over
for text comparisons (if indeed it does fail ...) but it's wrong.
Use the range type's collation there.

* It's sort of annoying that the whole thing only works for a Const range
value.  A different approach you might think about is to make this work
more like ScalarArrayOp, ie we pass the qual through to btree as-is and
teach relevant functions in access/nbtree/ how to extract index bound
conditions from the range datum at runtime.  That would likely end up
being a significantly larger patch, though, so you might reasonably
conclude it's not worth the effort.
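
To make the empty-range and unbounded-range points concrete, the
intended expansions would behave like this (a sketch of the desired
behavior, not of the patch as posted; 't' and 'intcol' are hypothetical):

-- empty range: nothing can match, so emit a contradictory qual rather
-- than returning NIL:
select * from t where intcol <@ 'empty'::int4range;
--   expands to: intcol = NULL       (always false; no index search needed)

-- range unbounded on both sides: everything except nulls matches:
select * from t where intcol <@ '(,)'::int4range;
--   expands to: intcol IS NOT NULL  (lets the scan skip nulls in the index)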

regards, tom lane




Re: [HACKERS] Index usage for elem-contained-by-const-range clauses

2017-03-11 Thread Jim Nasby

On 3/10/17 8:29 AM, Alexander Korotkov wrote:

> That's a cool idea.  But I would say more.  Sometimes it's useful to
> transform "intcol between x and y" into "intcol <@ 'x,y'::int4range".
> btree_gin supports "intcol between x and y" as an overlap of "intcol >= x"
> and "intcol <= y".  That is very inefficient.  But if this clause were
> transformed into "intcol <@ 'x,y'::int4range", btree_gin could handle it
> very efficiently.


That would certainly be nice as well, but IMHO it's outside the scope of
this patch to accomplish that.


BTW, while we're wishing for things... Something else that would be nice
is if there was a way to do these kinds of transforms without hacking the
backend...



> Also, I noticed that the patch doesn't have regression tests.


BTW, those tests need to pay special attention to inclusive vs exclusive 
bounds.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




Re: [HACKERS] Index usage for elem-contained-by-const-range clauses

2017-03-10 Thread Alexander Korotkov
Hi, Pritam!

I've assigned to review this patch.

On Thu, Feb 23, 2017 at 2:17 AM, Pritam Baral 
wrote:

> The topic has been previously discussed[0] on the -performance mailing
> list,
> about four years ago.
>
> In that thread, Tom suggested[0] the planner could be made to "expand
> "intcol <@
> 'x,y'::int4range" into "intcol between x and y", using something similar
> to the
> index LIKE optimization (ie, the "special operator" stuff in indxpath.c)".
>

That's a cool idea.  But I would say more.  Sometimes it's useful to
transform "intcol between x and y" into "intcol <@ 'x,y'::int4range".
btree_gin supports "intcol between x and y" as an overlap of "intcol >= x"
and "intcol <= y".  That is very inefficient.  But if this clause were
transformed into "intcol <@ 'x,y'::int4range", btree_gin could handle it
very efficiently.
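
A sketch of that reverse transformation (the table 't' and its btree_gin
index on intcol are hypothetical):

-- btree_gin today executes this as two overlapping scans, which is
-- inefficient:
select count(*) from t where intcol between 10 and 20;

-- rewritten as a single containment qual, it could be answered with one
-- scan, given btree_gin support for <@:
select count(*) from t where intcol <@ '[10,20]'::int4range;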


>
> This patch tries to do exactly that. It's not tied to any specific
> datatype,
> and has been tested with both builtin types and custom range types. Most
> of the
> checking for proper datatypes, operators, and btree index happens before
> this
> code, so I haven't run into any issues yet in my testing. But I'm not
> familiar
> enough with the internals to be able to confidently say it can handle
> all cases
> just yet.
>

I've tried this patch.  It applies cleanly, but doesn't compile.

indxpath.c:4252:1: error: conflicting types for 'range_elem_contained_quals'
range_elem_contained_quals(Node *leftop, Datum rightop)
^
indxpath.c:192:14: note: previous declaration is here
static List *range_elem_contained_quals(Node *leftop, Oid expr_op, Oid opfamily,
             ^
Could you please recheck that you published the right version of the patch?
Also, I noticed that the patch doesn't have regression tests.  Some mention
of this optimization in the docs would also be nice to have.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] Index usage for elem-contained-by-const-range clauses

2017-02-26 Thread Robert Haas
On Thu, Feb 23, 2017 at 4:47 AM, Pritam Baral  wrote:
> The topic has been previously discussed[0] on the -performance mailing list,
> about four years ago.
>
> In that thread, Tom suggested[0] the planner could be made to "expand
> "intcol <@
> 'x,y'::int4range" into "intcol between x and y", using something similar
> to the
> index LIKE optimization (ie, the "special operator" stuff in indxpath.c)".
>
> This patch tries to do exactly that.

Please add your patch to https://commitfest.postgresql.org/ so it
doesn't get overlooked.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-23 Thread Jim Nasby

On 2/19/17 5:27 AM, Robert Haas wrote:

> (1) a multi-batch hash join, (2) a nested loop,
> and (3) a merge join.  (2) is easy to implement but will generate a
> ton of random I/O if the table is not resident in RAM.  (3) is most
> suitable for very large tables but takes more work to code, and is
> also likely to be a lot slower for small tables than a hash or
> nestloop-based approach.


As I understand it, #3 is already in place for validate_index(). I think 
you'd just need a different callback that checks the heap key.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)




[HACKERS] Index usage for elem-contained-by-const-range clauses

2017-02-22 Thread Pritam Baral
The topic has been previously discussed[0] on the -performance mailing
list, about four years ago.

In that thread, Tom suggested[0] the planner could be made to "expand
"intcol <@ 'x,y'::int4range" into "intcol between x and y", using
something similar to the index LIKE optimization (ie, the "special
operator" stuff in indxpath.c)".

This patch tries to do exactly that. It's not tied to any specific
datatype, and has been tested with both builtin types and custom range
types. Most of the checking for proper datatypes, operators, and btree
index happens before this code, so I haven't run into any issues yet in
my testing. But I'm not familiar enough with the internals to be able to
confidently say it can handle all cases just yet.

[0]:
https://www.postgresql.org/message-id/flat/9860.1364013108%40sss.pgh.pa.us#9860.1364013...@sss.pgh.pa.us
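
Concretely, the expansion being implemented is the following (a sketch;
the table and index are hypothetical):

create table events (id integer primary key, payload text);

-- without the patch, the containment qual cannot use the btree index
-- on id:
select * from events where id <@ '[1000,2000)'::int4range;

-- with it, the planner derives btree-friendly bounds, as if the query
-- had been written:
select * from events where id >= 1000 and id < 2000;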

-- 
#!/usr/bin/env regards
Chhatoi Pritam Baral
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 2952bfb7c2..84dfd8362a 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -30,21 +30,23 @@
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/predtest.h"
 #include "optimizer/prep.h"
 #include "optimizer/restrictinfo.h"
 #include "optimizer/var.h"
 #include "utils/builtins.h"
 #include "utils/bytea.h"
 #include "utils/lsyscache.h"
 #include "utils/pg_locale.h"
+#include "utils/rangetypes.h"
 #include "utils/selfuncs.h"
+#include "utils/typcache.h"
 
 
 #define IsBooleanOpfamily(opfamily) \
 	((opfamily) == BOOL_BTREE_FAM_OID || (opfamily) == BOOL_HASH_FAM_OID)
 
 #define IndexCollMatchesExprColl(idxcollation, exprcollation) \
 	((idxcollation) == InvalidOid || (idxcollation) == (exprcollation))
 
 /* Whether we are looking for plain indexscan, bitmap scan, or either */
 typedef enum
@@ -180,20 +182,22 @@ static Expr *expand_boolean_index_clause(Node *clause, int indexcol,
 			IndexOptInfo *index);
 static List *expand_indexqual_opclause(RestrictInfo *rinfo,
 		  Oid opfamily, Oid idxcollation);
 static RestrictInfo *expand_indexqual_rowcompare(RestrictInfo *rinfo,
 			IndexOptInfo *index,
 			int indexcol);
 static List *prefix_quals(Node *leftop, Oid opfamily, Oid collation,
 			 Const *prefix, Pattern_Prefix_Status pstatus);
 static List *network_prefix_quals(Node *leftop, Oid expr_op, Oid opfamily,
 	 Datum rightop);
+static List *range_elem_contained_quals(Node *leftop, Oid expr_op, Oid opfamily,
+	 Datum rightop);
 static Datum string_to_datum(const char *str, Oid datatype);
 static Const *string_to_const(const char *str, Oid datatype);
 
 
 /*
  * create_index_paths()
  *	  Generate all interesting index paths for the given relation.
  *	  Candidate paths are added to the rel's pathlist (using add_path).
  *
  * To be considered for an index scan, an index must match one or more
@@ -3286,20 +3290,23 @@ match_special_index_operator(Expr *clause, Oid opfamily, Oid idxcollation,
 			/* the right-hand const is type text for all of these */
 			pstatus = pattern_fixed_prefix(patt, Pattern_Type_Regex_IC, expr_coll,
 		   &prefix, NULL);
 			isIndexable = (pstatus != Pattern_Prefix_None);
 			break;
 
 		case OID_INET_SUB_OP:
 		case OID_INET_SUBEQ_OP:
 			isIndexable = true;
 			break;
+		case OID_RANGE_ELEM_CONTAINED_OP:
+			isIndexable = true;
+			break;
 	}
 
 	if (prefix)
 	{
 		pfree(DatumGetPointer(prefix->constvalue));
 		pfree(prefix);
 	}
 
 	/* done if the expression doesn't look indexable */
 	if (!isIndexable)
@@ -3614,20 +3621,27 @@ expand_indexqual_opclause(RestrictInfo *rinfo, Oid opfamily, Oid idxcollation)
 			break;
 
 		case OID_INET_SUB_OP:
 		case OID_INET_SUBEQ_OP:
 			if (!op_in_opfamily(expr_op, opfamily))
 			{
 return network_prefix_quals(leftop, expr_op, opfamily,
 			patt->constvalue);
 			}
 			break;
+		case OID_RANGE_ELEM_CONTAINED_OP:
+			if (!op_in_opfamily(expr_op, opfamily))
+			{
+return range_elem_contained_quals(leftop, expr_op, opfamily,
+			patt->constvalue);
+			}
+			break;
 	}
 
 	/* Default case: just make a list of the unmodified indexqual */
 	return list_make1(rinfo);
 }
 
 /*
  * expand_indexqual_rowcompare --- expand a single indexqual condition
  *		that is a RowCompareExpr
  *
@@ -4096,20 +4110,124 @@ network_prefix_quals(Node *leftop, Oid expr_op, Oid opfamily, Datum rightop)
 			InvalidOid, /* not collatable */
 			-1, opr2right,
 			false, false),
 		 InvalidOid, InvalidOid);
 	result = lappend(result, make_simple_restrictinfo(expr));
 
 	return result;
 }
 
 /*
+ * Given an element leftop and a range rightop, and an elem contained-by range
+ * operator, generate suitable indexqual condition(s).
+ */
+static List *
+range_elem_contained_quals(Node *leftop, Datum rightop)
+{
+	Oid			datatype;
+	Oid			opfamily;
+	Oid			opr1oid;
+	Oid			opr2oid;
+	List	   *result = NIL;
+	Expr	   *expr;
+	RangeType  *range;
+	TypeC

Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-19 Thread Robert Haas
On Sun, Feb 19, 2017 at 3:52 PM, Pavan Deolasee
 wrote:
> This particular case of corruption results in a heap tuple getting indexed
> by a wrong key (or to be precise, indexed by its old value). So the only way
> to detect the corruption is to look at each index key and check if it
> matches with the corresponding heap tuple. We could write some kind of self
> join that can use a sequential scan and an index-only scan (David Rowley had
> suggested something of that sort internally here), but we can't guarantee
> index-only scan on a table which is being concurrently updated. So not sure
> even that will catch every possible case.

Oh, so the problem isn't index entries that are altogether missing?  I
guess I was confused.

You can certainly guarantee an index-only scan if you write the
validation code in C rather than using SQL.  I think the issue is that
if the table is large enough that keeping a TID -> index value mapping
in a hash table is impractical, there's not going to be a real
efficient strategy for this.  Ignoring the question of whether you use
the main executor for this or just roll your own code, your options
for a large table are (1) a multi-batch hash join, (2) a nested loop,
and (3) a merge join.  (2) is easy to implement but will generate a
ton of random I/O if the table is not resident in RAM.  (3) is most
suitable for very large tables but takes more work to code, and is
also likely to be a lot slower for small tables than a hash or
nestloop-based approach.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-19 Thread Pavan Deolasee
On Sun, Feb 19, 2017 at 3:43 PM, Robert Haas  wrote:

> On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane  wrote:
>
> > Ah, nah, scratch that.  If any post-index-build pruning had occurred on a
> > page, the evidence would be gone --- the non-matching older tuples would
> > be removed and what remained would look consistent.  But it wouldn't
> > match the index.  You might be able to find problems if you were willing
> > to do the expensive check on *every* HOT chain, but that seems none too
> > practical.
>
> If the goal is just to detect tuples that aren't indexed and should
> be, what about scanning each index and building a TIDBitmap of all of
> the index entries, and then scanning the heap for non-HOT tuples and
> probing the TIDBitmap for each one?  If you find a heap entry for
> which you didn't find an index entry, go and recheck the index and see
> if one got added concurrently.  If not, whine.
>
>
This particular case of corruption results in a heap tuple getting indexed
by a wrong key (or to be precise, indexed by its old value). So the only
way to detect the corruption is to look at each index key and check if it
matches with the corresponding heap tuple. We could write some kind of self
join that can use a sequential scan and an index-only scan (David Rowley
had suggested something of that sort internally here), but we can't
guarantee index-only scan on a table which is being concurrently updated.
So not sure even that will catch every possible case.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-19 Thread Robert Haas
On Fri, Feb 17, 2017 at 11:15 PM, Tom Lane  wrote:
> I wrote:
>> However, you might be able to find it without so much random I/O.
>> I'm envisioning a seqscan over the table, in which you simply look for
>> HOT chains in which the indexed columns aren't all the same.  When you
>> find one, you'd have to do a pretty expensive index lookup to confirm
>> whether things are OK or not, but hopefully that'd be rare enough to not
>> be a big performance sink.
>
> Ah, nah, scratch that.  If any post-index-build pruning had occurred on a
> page, the evidence would be gone --- the non-matching older tuples would
> be removed and what remained would look consistent.  But it wouldn't
> match the index.  You might be able to find problems if you were willing
> to do the expensive check on *every* HOT chain, but that seems none too
> practical.

If the goal is just to detect tuples that aren't indexed and should
be, what about scanning each index and building a TIDBitmap of all of
the index entries, and then scanning the heap for non-HOT tuples and
probing the TIDBitmap for each one?  If you find a heap entry for
which you didn't find an index entry, go and recheck the index and see
if one got added concurrently.  If not, whine.

One problem is that you'd need to make sure that the TIDBitmap didn't
lossify, but that could be prevented either by specifying a very large
maxbytes setting or by using something that serves the same function
instead of a TIDBitmap per se.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-17 Thread Peter Geoghegan
On Fri, Feb 17, 2017 at 9:31 AM, Tom Lane  wrote:
> This seems like it'd be quite a different tool than amcheck, though.
> Also, it would only find broken-HOT-chain corruption, which might be
> a rare enough issue to not deserve a single-purpose tool.

FWIW, my ambition for amcheck is that it will have checks for a large
variety of invariants that involve the heap, and related SLRU
structures such as MultiXacts. Though, that would probably necessitate
code written by other people that are subject matter experts in areas
that I am not.
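
For reference, the btree checks that exist today are invoked like this
(a sketch based on amcheck's documented functions; the index name is
borrowed from the bitmap-scan thread in this digest):

create extension amcheck;

-- verify btree invariants within the index, taking only an
-- AccessShareLock:
select bt_index_check('idx_pglist_sent'::regclass);

-- stronger check that also verifies parent/child page relationships,
-- taking a ShareLock:
select bt_index_parent_check('idx_pglist_sent'::regclass);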

-- 
Peter Geoghegan




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-17 Thread Tom Lane
I wrote:
> However, you might be able to find it without so much random I/O.
> I'm envisioning a seqscan over the table, in which you simply look for
> HOT chains in which the indexed columns aren't all the same.  When you
> find one, you'd have to do a pretty expensive index lookup to confirm
> whether things are OK or not, but hopefully that'd be rare enough to not
> be a big performance sink.

Ah, nah, scratch that.  If any post-index-build pruning had occurred on a
page, the evidence would be gone --- the non-matching older tuples would
be removed and what remained would look consistent.  But it wouldn't
match the index.  You might be able to find problems if you were willing
to do the expensive check on *every* HOT chain, but that seems none too
practical.

regards, tom lane




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-17 Thread Tom Lane
Peter Geoghegan  writes:
> The difference with a test that could detect this variety of
> corruption is that that would need to visit the heap, which tends to
> be much larger than any one index, or even all indexes. That would
> probably need to be random I/O, too. It might be possible to mostly
> not visit the heap, though -- I'm not sure offhand. I'd have to study
> the problem in detail, which I have no time for at the moment.

Unless I misunderstand this problem, the issue is that there might be some
broken HOT chains, that is chains in which not all the heap tuples have
the same values for the index columns, and in particular the live entry at
the end of the chain doesn't agree with the index.  It seems pretty
impossible to detect that by examining the index alone.

However, you might be able to find it without so much random I/O.
I'm envisioning a seqscan over the table, in which you simply look for
HOT chains in which the indexed columns aren't all the same.  When you
find one, you'd have to do a pretty expensive index lookup to confirm
whether things are OK or not, but hopefully that'd be rare enough to not
be a big performance sink.

This seems like it'd be quite a different tool than amcheck, though.
Also, it would only find broken-HOT-chain corruption, which might be
a rare enough issue to not deserve a single-purpose tool.

regards, tom lane




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-17 Thread Peter Geoghegan
On Fri, Feb 17, 2017 at 8:23 AM, Keith Fiske  wrote:
> It's not the load I'm worried about, it's the locks that are required at
> some point during the rebuild. Doing an index rebuild here and there isn't a
> big deal, but trying to do it for an entire heavily loaded, multi-terabyte
> database is hardly trivial. And I'd say doing a scan is far less invasive
> than actually rebuilding the index since little to no writing is actually
> being done.

Certainly, the checks that amcheck performs that don't require a
heavier lock (just a SELECT-style AccessShareLock) were vastly less
expensive than reindexing, and of course had only negligible potential
to block other operations. And, REINDEX certainly is a foot-gun,
lock-wise, which is why I try to avoid it.

The difference with a test that could detect this variety of
corruption is that that would need to visit the heap, which tends to
be much larger than any one index, or even all indexes. That would
probably need to be random I/O, too. It might be possible to mostly
not visit the heap, though -- I'm not sure offhand. I'd have to study
the problem in detail, which I have no time for at the moment.

-- 
Peter Geoghegan




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-17 Thread Alvaro Herrera
Keith Fiske wrote:

> I can understand if it's simply not possible, but if it is, I think in
> any case of data corruption, having some means to check for it to be sure
> you're in the clear would be useful.

Maybe it is possible.  I just didn't try, since it didn't seem very
useful.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-17 Thread Keith Fiske
On Fri, Feb 17, 2017 at 11:12 AM, Alvaro Herrera 
wrote:

> Keith Fiske wrote:
>
> > Was just curious if anyone was able to come up with any sort of method to
> > test whether an index was corrupted by this bug, other than just waiting
> > for bad query results? We've used concurrent index rebuilding quite
> > extensively over the years to remove bloat from busy systems, but
> > reindexing the entire database "just in case" is unrealistic in many of
> our
> > cases.
>
> As stated, if the CREATE INDEX operates on columns that are previously
> already indexed (which is normally the case when you rebuild because of
> bloat) then there is no chance of index corruption.
>
> Scanning indexes+tables is just as load-intensive as rebuilding the
> indexes anyway.  You don't save any work.  I suppose it can be a problem
> if you have an index big enough that it doesn't fit on your remaining
> free space (but in that case you have a pre-existing problem which you
> should solve anyway).
>
> --
> Álvaro Herrerahttps://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

It's not the load I'm worried about, it's the locks that are required at
some point during the rebuild. Doing an index rebuild here and there isn't
a big deal, but trying to do it for an entire heavily loaded,
multi-terabyte database is hardly trivial. And I'd say doing a scan is far
less invasive than actually rebuilding the index since little to no writing
is actually being done.

I can understand if it's simply not possible, but if it is, I think in
any case of data corruption, having some means to check for it to be sure
you're in the clear would be useful.

Keith


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-17 Thread Alvaro Herrera
Keith Fiske wrote:

> Was just curious if anyone was able to come up with any sort of method to
> test whether an index was corrupted by this bug, other than just waiting
> for bad query results? We've used concurrent index rebuilding quite
> extensively over the years to remove bloat from busy systems, but
> reindexing the entire database "just in case" is unrealistic in many of our
> cases.

As stated, if the CREATE INDEX operates on columns that are previously
already indexed (which is normally the case when you rebuild because of
bloat) then there is no chance of index corruption.

Scanning indexes+tables is just as load-intensive as rebuilding the
indexes anyway.  You don't save any work.  I suppose it can be a problem
if you have an index big enough that it doesn't fit on your remaining
free space (but in that case you have a pre-existing problem which you
should solve anyway).

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-17 Thread Keith Fiske
On Mon, Feb 6, 2017 at 10:17 PM, Amit Kapila 
wrote:

> On Mon, Feb 6, 2017 at 10:28 PM, Tom Lane  wrote:
> > Amit Kapila  writes:
> >> Hmm.  Consider that the first time relcache invalidation occurs while
> >> computing id_attrs, so now the retry logic will compute the correct
> >> set of attrs (considering two indexes, if we take the example given by
> >> Alvaro above.).  However, the attrs computed for hot_* and key_* will
> >> be using only one index, so this will lead to a situation where two of
> >> the attrs (hot_attrs and key_attrs) are computed with one index and
> >> id_attrs is computed with two indexes. I think this can lead to
> >> Assertion in HeapSatisfiesHOTandKeyUpdate().
> >
> > That seems like something we'd be best off to fix by changing the
> > assertion.
> >
>
> The above-mentioned problem doesn't exist in your version of the patch
> (which is now committed) because the index attrs are cached after
> invalidation, and I mentioned the same in my follow-up mail yesterday.
> The guarantee of safety is that we don't handle invalidation
> between two consecutive calls to RelationGetIndexAttrBitmap() in
> heap_update, but if we do handle due to any reason which is not
> apparent to me, then I think there is some chance of hitting the
> assertion.
>
> >  I doubt it's really going to be practical to guarantee that
> > those bitmapsets are always 100% consistent throughout a transaction.
> >
>
> Agreed.  As the code stands, I think it is only expected to be
> guaranteed across three consecutive calls to
> RelationGetIndexAttrBitmap() in heap_update. Now, if the assertion in
> HeapSatisfiesHOTandKeyUpdate() is useless and we can remove it, then
> probably we don't need a guarantee to be consistent in three
> consecutive calls as well.
>
> >> Okay, please find attached patch which is an extension of Tom's and
> >> Andres's patch and I think it fixes the above problem noted by me.
> >
> > I don't like this patch at all.  It only fixes the above issue if the
> > relcache inval arrives while RelationGetIndexAttrBitmap is executing.
> > If the inval arrives between two such calls, you still have the problem
> > of the second one delivering a bitmapset that might be inconsistent
> > with the first one.
> >
>
> I think it won't happen between two consecutive calls to
> RelationGetIndexAttrBitmap in heap_update.  It can happen during
> multi-row update, but that should be fine.  I think this version of
> patch only defers the creation of fresh bitmapsets where ever
> possible.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>

Was just curious if anyone was able to come up with any sort of method to
test whether an index was corrupted by this bug, other than just waiting
for bad query results? We've used concurrent index rebuilding quite
extensively over the years to remove bloat from busy systems, but
reindexing the entire database "just in case" is unrealistic in many of our
cases.


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Amit Kapila
On Mon, Feb 6, 2017 at 10:28 PM, Tom Lane  wrote:
> Amit Kapila  writes:
>> Hmm.  Consider that the first time relcache invalidation occurs while
>> computing id_attrs, so now the retry logic will compute the correct
>> set of attrs (considering two indexes, if we take the example given by
>> Alvaro above.).  However, the attrs computed for hot_* and key_* will
>> be using only one index, so this will lead to a situation where two of
>> the attrs (hot_attrs and key_attrs) are computed with one index and
>> id_attrs is computed with two indexes. I think this can lead to
>> Assertion in HeapSatisfiesHOTandKeyUpdate().
>
> That seems like something we'd be best off to fix by changing the
> assertion.
>

The above-mentioned problem doesn't exist in your version of the patch
(which is now committed) because the index attrs are cached after
invalidation, and I mentioned the same in my follow-up mail yesterday.
The guarantee of safety is that we don't handle invalidation
between two consecutive calls to RelationGetIndexAttrBitmap() in
heap_update, but if we do handle due to any reason which is not
apparent to me, then I think there is some chance of hitting the
assertion.

>  I doubt it's really going to be practical to guarantee that
> those bitmapsets are always 100% consistent throughout a transaction.
>

Agreed.  As the code stands, I think it is only expected to be
guaranteed across three consecutive calls to
RelationGetIndexAttrBitmap() in heap_update. Now, if the assertion in
HeapSatisfiesHOTandKeyUpdate() is useless and we can remove it, then
probably we don't need a guarantee to be consistent in three
consecutive calls as well.

>> Okay, please find attached patch which is an extension of Tom's and
>> Andres's patch and I think it fixes the above problem noted by me.
>
> I don't like this patch at all.  It only fixes the above issue if the
> relcache inval arrives while RelationGetIndexAttrBitmap is executing.
> If the inval arrives between two such calls, you still have the problem
> of the second one delivering a bitmapset that might be inconsistent
> with the first one.
>

I think it won't happen between two consecutive calls to
RelationGetIndexAttrBitmap in heap_update.  It can happen during
multi-row update, but that should be fine.  I think this version of
patch only defers the creation of fresh bitmapsets where ever
possible.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 11:54 PM, Tom Lane  wrote:

> After some discussion among the release team, we've concluded that the
> best thing to do is to push Pavan's/my patch into today's releases.
> This does not close the matter by any means: we should continue to
> study whether there are related bugs or whether there's a more principled
> way of fixing this bug.  But that patch clearly makes things better,
> and we shouldn't let worries about whether there are more bugs stop
> us from providing some kind of fix to users.
>
> I've made the push, and barring negative reports from the buildfarm,
> it will be in today's releases.
>

Thank you for taking care of it. The buildfarm is looking green so far.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Tom Lane
After some discussion among the release team, we've concluded that the
best thing to do is to push Pavan's/my patch into today's releases.
This does not close the matter by any means: we should continue to
study whether there are related bugs or whether there's a more principled
way of fixing this bug.  But that patch clearly makes things better,
and we shouldn't let worries about whether there are more bugs stop
us from providing some kind of fix to users.

I've made the push, and barring negative reports from the buildfarm,
it will be in today's releases.

regards, tom lane




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Robert Treat
On Sun, Feb 5, 2017 at 9:42 PM, Pavan Deolasee  wrote:
> On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan  wrote:
>> On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas  wrote:
>> > I don't think this kind of black-and-white thinking is very helpful.
>> > Obviously, data corruption is bad.  However, this bug has (from what
>> > one can tell from this thread) been with us for over a decade; it must
>> > necessarily be either low-probability or low-severity, or somebody
>> > would've found it and fixed it before now.  Indeed, the discovery of
>> > this bug was driven by new feature development, not a user report.  It
>> > seems pretty clear that if we try to patch this and get it wrong, the
>> > effects of our mistake could easily be a lot more serious than the
>> > original bug.
>>
>> +1. The fact that it wasn't driven by a user report convinces me that
>> this is the way to go.
>>
>
> I'm not sure that the bug not being reported by a user makes it any
> less critical. As Tomas pointed out downthread, the nature of the bug is
> such that users may not discover it very easily, but that doesn't mean
> it couldn't be affecting them all the time. We can now correlate many past
> reports of index corruption to this bug, but we don't have evidence to prove
> that. Lack of any good tool or built-in checks probably makes it even
> harder.
>
> The reason I discovered this with WARM is that WARM has built-in
> recheck logic, which was discarding some tuples returned by the index scan.
> It took me lots of code review and inspection to finally conclude that this
> must be an existing bug. Even then, for lack of any utility, I could not
> detect this easily on master. That doesn't mean I wasn't able to
> reproduce it, but detection was a problem.
>
> If you look at the reproduction setup, one in every 10, if not 5, CREATE
> INDEX CONCURRENTLY ends up creating a corrupt index. That's a good 10%
> probability. And this is on a low end laptop, with just 5-10 concurrent
> clients running. Probability of hitting the problem will be much higher on a
> bigger machine, with many users (on a decent AWS machine, I would find every
> third CIC to be corrupt). Moreover, the setup is not doing anything
> extraordinary: just concurrent updates which flip between HOT and
> non-HOT, and a CIC.
>

Not that I am advocating that we should do a release just for this;
having a fix we believe in is certainly as important a factor. But the
idea that a bug being around a long time makes it less of an issue does
seem wrong. We've certainly seen plenty of cases over the years where
bugs have existed in seldom-used code paths, only to be exposed by new
features or other code changes over time. In general, we should be less
worried about the age of a bug than about our expectation that users are
likely to hit it now, which does seem high based on the above numbers.

In the meantime, it's certainly worth warning users, providing help on
how to determine if this is a likely problem for them, and possibly
rolling a patch ahead of upstream in cases where that's feasible.

Robert Treat
play: xzilla.net
work: omniti.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Tom Lane
Alvaro Herrera  writes:
> Tom Lane wrote:
>> Better to fix the callers so that they don't have the assumption you
>> refer to.  Or maybe we could adjust the API of RelationGetIndexAttrBitmap
>> so that it returns all the sets needed by a given calling module at
>> once, which would allow us to guarantee they're consistent.

> Note that my "interesting attrs" patch does away with these independent
> bitmaps (which was last posted by Pavan as part of his WARM set).  I
> think we should fix just this bug now, and for the future look at that
> other approach.

BTW, if there is a risk of the assertion failure that Amit posits,
it seems like it should have happened in the tests that Pavan was doing
originally.  I'd sort of like to see a demonstration that it can actually
happen before we spend any great amount of time fixing it.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Alvaro Herrera
Tom Lane wrote:

> Better to fix the callers so that they don't have the assumption you
> refer to.  Or maybe we could adjust the API of RelationGetIndexAttrBitmap
> so that it returns all the sets needed by a given calling module at
> once, which would allow us to guarantee they're consistent.

Note that my "interesting attrs" patch does away with these independent
bitmaps (which was last posted by Pavan as part of his WARM set).  I
think we should fix just this bug now, and for the future look at that
other approach.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Tom Lane
Amit Kapila  writes:
> Hmm.  Consider that a relcache invalidation occurs for the first time
> while computing id_attrs: now the retry logic will compute the correct
> set of attrs (considering two indexes, if we take the example given by
> Alvaro above).  However, the attrs computed for hot_* and key_* will
> be using only one index, so this will lead to a situation where two of
> the attrs (hot_attrs and key_attrs) are computed with one index and
> id_attrs is computed with two indexes. I think this can lead to an
> assertion failure in HeapSatisfiesHOTandKeyUpdate().

That seems like something we'd be best off to fix by changing the
assertion.  I doubt it's really going to be practical to guarantee that
those bitmapsets are always 100% consistent throughout a transaction.
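
For reference, the consistency assumption at issue looks roughly like
this (a sketch of the entry checks in HeapSatisfiesHOTandKeyUpdate();
the exact form in the tree may differ):

/*
 * These subset relationships only hold if hot_attrs, key_attrs and
 * id_attrs were all computed against the same index list.
 */
Assert(bms_is_subset(id_attrs, key_attrs));
Assert(bms_is_subset(key_attrs, hot_attrs));

Changing the assertion would mean dropping or weakening these checks
rather than trying to keep the three bitmapsets mutually consistent.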

> Okay, please find attached a patch which is an extension of Tom's and
> Andres's patches, and I think it fixes the above problem noted by me.

I don't like this patch at all.  It only fixes the above issue if the
relcache inval arrives while RelationGetIndexAttrBitmap is executing.
If the inval arrives between two such calls, you still have the problem
of the second one delivering a bitmapset that might be inconsistent
with the first one.

To go in this direction, I think we would have to hot-wire
RelationGetIndexAttrBitmap so that once any bitmapset has been requested
in a transaction, we intentionally ignore all index set updates for the
remainder of the transaction.  And that would very likely create more
problems than it solves (consider locally initiated DDL within the
transaction, for example).

Better to fix the callers so that they don't have the assumption you
refer to.  Or maybe we could adjust the API of RelationGetIndexAttrBitmap
so that it returns all the sets needed by a given calling module at
once, which would allow us to guarantee they're consistent.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Alvaro Herrera
Andres Freund wrote:

> To show what I mean here's an *unpolished* and *barely tested* patch
> implementing the first of my suggestions.
> 
> Alvaro, Pavan, I think this should address the issue as well?

Hmm, interesting idea.  Maybe a patch along these lines is a good way to
make the index-list cache less brittle going forward.  However, I'm against
putting it in the stable branches -- and we should definitely not stall
an immediate fix in order to get this patch ready.  IMO we should get
Tom's patch in the tree for all branches very soon, and then after the
release you can finalize this one and put it in master.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Alvaro Herrera
Tom Lane wrote:

> Pavan Deolasee  writes:

> > 2. In the second patch, we tried to recompute attribute lists if a relcache
> > flush happens in between and index list is invalidated. We've seen problems
> > with that, especially it getting into an infinite loop with
> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
> > flushes keep happening.
> 
> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
> happen whenever it possibly could.  The way to deal with that without
> looping is to test whether the index set *actually* changed, not whether
> it just might have changed.

Ah, that's a nice and simple way to fix that patch!  I see that Pavan
confirmed that it fixes the tests we saw failing too.  It seems to me
that we should go with this one, and it looks unlikely that this causes
other problems.
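
To make that concrete, the "did it actually change" test amounts to
something like the following at the end of RelationGetIndexAttrBitmap()
(an outline in the spirit of Tom's patch, not the final code):

/* Check whether the index set really changed while we worked. */
newindexoidlist = RelationGetIndexList(relation);
if (equal(indexoidlist, newindexoidlist) &&
	relpkindex == relation->rd_pkindex &&
	relreplindex == relation->rd_replidindex)
{
	/* Still the same index set, so we can cache and return. */
	list_free(newindexoidlist);
	list_free(indexoidlist);
}
else
{
	/* Gotta do it over: free the partial results and start again. */
	list_free(newindexoidlist);
	list_free(indexoidlist);
	goto restart;
}

A CLOBBER_CACHE_ALWAYS build still flushes on every access, but the
recomputed list compares equal, so the loop terminates.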

BTW, while neither Pavan nor I sent this patch to -hackers, this
implementation is pretty much the same thing we did.  Pavan deserves
credit as co-author in this commit.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Martín Marqués
On 05/02/17 at 21:57, Tomas Vondra wrote:
> 
> +1 to not rushing fixes into releases. While I think we now finally
> understand the mechanics of this bug, the fact that we came up with
> three different fixes in this thread, only to discover issues with each
> of them, warrants some caution.

I agree also with Robert on not rushing the patch. My point was if we
had to rush the release.

> OTOH I disagree with the notion that bugs that are not driven by user
> reports are somehow less severe. Some data corruption bugs cause quite
> visible breakage - segfaults, immediate crashes, etc. Those are pretty
> clear bugs, and are reported by users.
> 
> Other data corruption bugs are much more subtle - for example this bug
> may lead to incorrect results for some queries, failing to detect values
> violating UNIQUE constraints, issues with foreign keys, etc.

Just yesterday, after sending that mail, I was recalling a logical
replication setup we did on a customer's 9.3 server, which brought up
data inconsistencies on the primary key of one of the tables. The table
had duplicate values.

As Tomas says, it's subtle and hard to find unless you constantly run
index checks (query a sample of the data from the heap and from the
index and check that they match). In our case, the customer was not
aware of the duplicates until we found them.

Regards,

-- 
Martín Marqués                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-06 Thread Vladimir Borodin

> On Feb 6, 2017, at 4:57, Peter Geoghegan  wrote:
> 
> I meant that I find the fact that there were no user reports in all
> these years to be a good reason to not proceed for now in this
> instance.

Well, we had some strange situations with indexes (see below for an
example), but it never occurred to me that CIC could be the problem.

And it is really difficult to diagnose problems of this kind, because
1. you may see the problem several days after the last major change in
the database, and 2. you don't even know where to start investigating.

xdb314g/maildb M # show enable_indexscan ;
 enable_indexscan
------------------
 off
(1 row)

Time: 0.120 ms
xdb314g/maildb M #  select * from mail.folders where uid=448300241 and fid=1;
-[ RECORD 1 ]---+--
uid | 448300241
fid | 1
<...>

Time: 30398.637 ms
xdb314g/maildb M # set enable_indexscan to true;
SET
Time: 0.111 ms
xdb314g/maildb M #  select * from mail.folders where uid=448300241 and fid=1;
(0 rows)

Time: 0.312 ms

xdb314g/maildb M #

The row of course hasn’t been deleted.

--
May the force be with you…
https://simply.name



Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Amit Kapila
On Mon, Feb 6, 2017 at 9:47 AM, Pavan Deolasee  wrote:
>
>
> On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila  wrote:
>>
>>
>>
>> Hmm.  Consider that a relcache invalidation occurs for the first time
>> while computing id_attrs: now the retry logic will compute the correct
>> set of attrs (considering two indexes, if we take the example given by
>> Alvaro above).
>
>
> I don't quite get that. Since rd_idattr must be already cached at that point
> and we don't expect to process a relcache flush between successive calls to
> RelationGetIndexAttrBitmap(), we should return a consistent copy of
> rd_idattr.
>

Right, I was mistaken.  However, I am not sure there is a need to
perform the retry logic in RelationGetIndexAttrBitmap(); it is safe to do
the invalidation at transaction end.  I feel that even though Tom's fix
looks reliable with respect to the problem reported in this thread, we
should take some time to evaluate the different proposals and see which
one is the best way forward.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee 
wrote:

>
>>
> I like this approach. I'll run the patch on a build with
> CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok.
>

While it looks certain that the fix will miss this point release, FWIW I
ran this patch with CLOBBER_CACHE_ALWAYS and it seems to be working as
expected.  It's hard to run all the tests, but a small subset of the
regression tests and test_decoding seems ok.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 9:41 AM, Amit Kapila  wrote:

>
>
> Hmm.  Consider that a relcache invalidation occurs for the first time
> while computing id_attrs: now the retry logic will compute the correct
> set of attrs (considering two indexes, if we take the example given by
> Alvaro above).


I don't quite get that. Since rd_idattr must already be cached at that
point and we don't expect to process a relcache flush between successive
calls to RelationGetIndexAttrBitmap(), we should return a consistent copy
of rd_idattr. But maybe I am missing something.

Thanks,
Pavan


-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Amit Kapila
On Mon, Feb 6, 2017 at 8:35 AM, Pavan Deolasee  wrote:
>
>
> On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila  wrote:
>>
>> On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee 
>> wrote:
>> >
>> >
>> > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane  wrote:
>> >>
>> >>
>> >>
>> >> > 2. In the second patch, we tried to recompute attribute lists if a
>> >> > relcache
>> >> > flush happens in between and index list is invalidated. We've seen
>> >> > problems
>> >> > with that, especially it getting into an infinite loop with
>> >> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently
>> >> > relcache
>> >> > flushes keep happening.
>> >>
>> >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache
>> >> flush
>> >> happen whenever it possibly could.
>> >
>> >
>> > Ah, ok. That explains why the retry logic as originally proposed was
>> > causing
>> > infinite loop.
>> >
>> >>
>> >> The way to deal with that without
>> >> looping is to test whether the index set *actually* changed, not
>> >> whether
>> >> it just might have changed.
>> >
>> >
>> > I like this approach. I'll run the patch on a build with
>> > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the
>> > fact that retry logic is now limited to only when index set changes,
>> > which
>> > fits in the situation we're dealing with.
>> >
>>
>> Don't you think it also has the same problem as the one I mentioned in
>> Alvaro's patch, which you also agreed on?
>
>
> No, I don't think so. There can't be any more relcache flushes occurring
> between the first RelationGetIndexAttrBitmap() and the subsequent
> RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed
> approach was that we were not caching, yet returning stale information. That
> implied the next call will again recompute a possibly different value. The
> retry logic is trying to close a small yet important race condition
> where index set invalidation happens, but we cache stale information.
>

Hmm.  Consider that a relcache invalidation occurs for the first time
while computing id_attrs: now the retry logic will compute the correct
set of attrs (considering two indexes, if we take the example given by
Alvaro above).  However, the attrs computed for hot_* and key_* will
be using only one index, so this will lead to a situation where two of
the attrs (hot_attrs and key_attrs) are computed with one index and
id_attrs is computed with two indexes. I think this can lead to an
assertion failure in HeapSatisfiesHOTandKeyUpdate().

>>
>> Do you see any need to fix it locally in
>> RelationGetIndexAttrBitmap()?  Why can't we clear it at transaction end?
>>
>
> We're trying to fix it in RelationGetIndexAttrBitmap() because that's where
> the race exists. I'm not saying there aren't other and better ways of
> handling it, but we don't know them yet (I haven't studied Andres's proposal).
>

Okay, please find attached a patch which is an extension of Tom's and
Andres's patches, and I think it fixes the above problem noted by me.
Basically, invalidate the bitmaps at transaction end rather than in
RelationGetIndexAttrBitmap().  I think it is okay for
RelationGetIndexAttrBitmap() to use stale information until
transaction end because all the updates in the current running
transaction will be consumed by the second phase of CIC.
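
In outline, the deferred invalidation could look like the following
(AtEOXact_RelationBitmaps and rd_bitmapsforced are names I am using for
this sketch only, not symbols from the attached patch):

/*
 * At transaction end: discard bitmapsets that a relcache inval
 * marked as forced during the transaction, instead of rebuilding
 * them immediately at invalidation time.
 */
static void
AtEOXact_RelationBitmaps(Relation relation)
{
	if (!relation->rd_bitmapsforced)
		return;

	bms_free(relation->rd_indexattr);
	relation->rd_indexattr = NULL;
	bms_free(relation->rd_keyattr);
	relation->rd_keyattr = NULL;
	bms_free(relation->rd_pkattr);
	relation->rd_pkattr = NULL;
	bms_free(relation->rd_idattr);
	relation->rd_idattr = NULL;
	relation->rd_bitmapsforced = false;
}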

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


invalidate_indexattr_v6.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
On 2017-02-05 22:34:34 -0500, Tom Lane wrote:
> Pavan Deolasee  writes:
> The point is that there's a nontrivial chance of a hasty fix introducing
> worse problems than we fix.
> 
> Given the lack of consensus about exactly how to fix this, I'm feeling
> like it's a good idea if whatever we come up with gets some time to age
> awhile in git before we ship it.

Right. And I'm not even convinced that we really know the extent of the
bug; it seems fairly plausible that there are further incidences of this.
There's also the issue that the mechanics in the older back branches are
different again, because of SnapshotNow.


>> I'm a bit surprised by this position. What do we tell our users now that
>> we know this bug exists?

That we're scheduling a bugfix for the next point release.  I don't
think we can truthfully claim that there are no known corruption bugs in
any of the releases of the last few years.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Tom Lane
Pavan Deolasee  writes:
> On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund  wrote:
>> +1.  I don't think we serve our users by putting out a nontrivial bugfix
>> hastily. Nor do I think we serve them in this instance by delaying the
>> release to cover this fix; there's enough other fixes in the release
>> imo.

> I'm a bit surprised by this position.

The point is that there's a nontrivial chance of a hasty fix introducing
worse problems than we fix.

Given the lack of consensus about exactly how to fix this, I'm feeling
like it's a good idea if whatever we come up with gets some time to age
awhile in git before we ship it.

Obviously, 2ndQ or EDB or any other distributor can choose to ship a patch
in their own builds if they're sufficiently comfortable with the particular
patch.  That doesn't translate to there having to be a fix in the
community's wraps this week.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
On 2017-02-05 21:49:57 -0500, Tom Lane wrote:
> Andres Freund  writes:
> > I've not yet read the full thread, but I'm a bit confused so far. We
> > obviously can get changing information about indexes here, but isn't
> > that something we have to deal with anyway?  If we guarantee that we
> > don't lose knowledge that there's a pending invalidation, why do we
> > have to retry?
> 
> We don't *have to* retry; the core of the fix is to not enter stale data
> into the cache after we've already received an invalidation signal.

Right, the bug is that we do so without remembering that fact.


> The current caller can survive with stale data; or if that's not true,
> C.I.C.  has got more fundamental problems than posited here.

Indeed.


> But it seems like a generally bad idea for relcache to return data
> that it knows (or at least has reason to believe) is stale.

I'm not convinced by this - this kind of staleness can only occur if
changes happen during the execution of the cache building. And the
information returned can be outdated by pretty much the moment you
return anyway.  It's also how a number of the current caches
essentially already work.


> Also, even if you're okay with return-stale-data-but-don't-cache-it,
> I have little faith that invalidate_indexattr_v3.patch successfully
> implements that.

I sent a prototype clarifying what I mean. It's not really like
invalidate_indexattr_v3.patch


Btw, it seems RelationGetIndexList() avoids a similar issue only due to
either luck or a lot of undocumented assumptions.  The only reason that
setting relation->rd_indexlist / rd_indexvalid at the end doesn't cause
a similar issue seems to be that no invalidations appear to be processed
after the systable_beginscan().  And that in turn seems to not process
relcache invals after RegisterSnapshot(GetCatalogSnapshot(relid)).
Phew, that seems a bit fragile.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 5:44 AM, Andres Freund  wrote:

> On 2017-02-05 12:51:09 -0500, Tom Lane wrote:
> > Michael Paquier  writes:
> > > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule 
> wrote:
> > >> I agree with Pavan - a release with known important bug is not good
> idea.
> >
> > > This bug has been around for some time, so I would recommend taking
> > > the time necessary to make the best fix possible, even if it means
> > > waiting for the next round of minor releases.
> >
> > I think the way to think about this sort of thing is, if we had found
> > this bug when a release wasn't imminent, would we consider it bad enough
> > to justify an unscheduled release cycle?  I have to think the answer for
> > this one is "probably not".
>
> +1.  I don't think we serve our users by putting out a nontrivial bugfix
> hastily. Nor do I think we serve them in this instance by delaying the
> release to cover this fix; there's enough other fixes in the release
> imo.
>
>
I'm a bit surprised by this position. What do we tell our users now that
we know this bug exists? They can't fully trust CIC, and they don't have
any mechanism to confirm whether a newly built index is corruption-free.
I'm not suggesting that we should hastily refactor the code, but at the
very least some sort of band-aid fix which helps the situation, yet has
minimal side effects, is warranted. I believe the proposed retry patch
does exactly that.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 8:15 AM, Amit Kapila  wrote:

> On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee 
> wrote:
> >
> >
> > On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane  wrote:
> >>
> >>
> >>
> >> > 2. In the second patch, we tried to recompute attribute lists if a
> >> > relcache
> >> > flush happens in between and index list is invalidated. We've seen
> >> > problems
> >> > with that, especially it getting into an infinite loop with
> >> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently
> relcache
> >> > flushes keep happening.
> >>
> >> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache
> flush
> >> happen whenever it possibly could.
> >
> >
> > Ah, ok. That explains why the retry logic as originally proposed was
> causing
> > infinite loop.
> >
> >>
> >> The way to deal with that without
> >> looping is to test whether the index set *actually* changed, not whether
> >> it just might have changed.
> >
> >
> > I like this approach. I'll run the patch on a build with
> > CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the
> > fact that retry logic is now limited to only when index set changes,
> which
> > fits in the situation we're dealing with.
> >
>
> Don't you think it also has the same problem as the one I mentioned in
> Alvaro's patch, which you also agreed on?


No, I don't think so. There can't be any more relcache flushes occurring
between the first RelationGetIndexAttrBitmap() and the subsequent
RelationGetIndexAttrBitmap() calls. The problem with the earlier proposed
approach was that we were not caching, yet returning stale information.
That implied the next call would again recompute a possibly different
value. The retry logic is trying to close a small yet important race
condition where an index set invalidation happens, but we cache stale
information.


> Do you see any need to fix it locally in
> RelationGetIndexAttrBitmap()?  Why can't we clear it at transaction end?
>
>
We're trying to fix it in RelationGetIndexAttrBitmap() because that's where
the race exists. I'm not saying there aren't other and better ways of
handling it, but we don't know them yet (I haven't studied Andres's proposal).

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Tom Lane
Andres Freund  writes:
> I've not yet read the full thread, but I'm a bit confused so far. We
> obviously can get changing information about indexes here, but isn't
> that something we have to deal with anyway?  If we guarantee that we
> don't lose knowledge that there's a pending invalidation, why do we
> have to retry?

We don't *have to* retry; the core of the fix is to not enter stale data
into the cache after we've already received an invalidation signal.  The
current caller can survive with stale data; or if that's not true, C.I.C.
has got more fundamental problems than posited here.  But it seems like a
generally bad idea for relcache to return data that it knows (or at least
has reason to believe) is stale.

Also, even if you're okay with return-stale-data-but-don't-cache-it,
I have little faith that invalidate_indexattr_v3.patch successfully
implements that.  Consider the sequence: inval received during
RelationGetIndexAttrBitmap, we clear rd_indexvalid, something asks for
the relation OID list, we recalculate that and set rd_indexvalid, then
we reach the end of RelationGetIndexAttrBitmap and see that rd_indexvalid
is set.  If that happened, we'd cache the bitmaps whether or not they were
really up to date.  Now admittedly, it's a bit hard to think of a reason
why anything would ask for the index list of anything but a system catalog
in that sequence, so as long as you assume that we don't allow C.I.C. (or
more realistically, REINDEX CONCURRENTLY) on system catalogs, this failure
mode is unreachable.  But I much prefer having a positive verification
that the index list is still what it was when we started.
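
Spelled out step by step, the hazardous sequence with a flag-only check
is (a schematic, not actual tree code):

/*
 * 1. RelationGetIndexAttrBitmap() reads the index list and starts
 *    computing the bitmapsets.
 * 2. An invalidation arrives and clears rd_indexvalid.
 * 3. Another caller runs RelationGetIndexList(), which recomputes
 *    the (now different) list and sets rd_indexvalid again.
 * 4. RelationGetIndexAttrBitmap() finishes, sees rd_indexvalid set,
 *    and caches bitmapsets built from the superseded list.
 */

Re-fetching the index list and comparing it with the one we started
from avoids relying on the flag at all.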

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
On 2017-02-06 08:08:01 +0530, Amit Kapila wrote:
> I don't see in your patch that you are setting rd_bitmapsvalid to 0.

IIRC a plain relcache rebuild should do that (note there's also no place
that directly resets rd_indexattr).


> Also, I think this suffers from the same problem as the patch proposed
> by Alvaro, which is that the different attrs (hot_attrs, key_attrs and
> id_attrs) will get computed based on different index lists, which can
> cause a problem as mentioned in my mail upthread.

I'm not convinced that that's a legitimate concern.  If that's an issue
with CIC, then we have a lot bigger problems, because relcache entries
and such will be inconsistent anyway if you have non-exclusive locks
while changing catalog data.  In the situations where that happens it
better be harmless (otherwise CIC is broken), and the next access will
recompute them.


> I am not sure, but I think your solution can be
> extended to make it somewhat similar to what we do with rd_indexvalid
> (basically, we should set rd_bitmapsvalid to 2, meaning the bitmaps are
> temporarily forced, at relcache invalidation and then clean it up at
> transaction end).  I can try something along those lines.

Not sure I understand what you mean by "clean it up at transaction end"?

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Peter Geoghegan
On Sun, Feb 5, 2017 at 6:42 PM, Pavan Deolasee  wrote:
> I'm not sure that the bug not being reported by a user makes it any
> less critical. As Tomas pointed out downthread, the nature of the bug is
> such that users may not discover it very easily, but that doesn't mean
> it couldn't be affecting them all the time. We can now correlate many past
> reports of index corruption to this bug, but we don't have evidence to prove
> that. Lack of any good tool or built-in checks probably makes it even
> harder.

I think that we need to make an automated checker tool a requirement
for very complicated development projects in the future. We're behind
here.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Amit Kapila
On Mon, Feb 6, 2017 at 8:01 AM, Pavan Deolasee  wrote:
>
>
> On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane  wrote:
>>
>>
>>
>> > 2. In the second patch, we tried to recompute attribute lists if a
>> > relcache
>> > flush happens in between and index list is invalidated. We've seen
>> > problems
>> > with that, especially it getting into an infinite loop with
>> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
>> > flushes keep happening.
>>
>> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
>> happen whenever it possibly could.
>
>
> Ah, ok. That explains why the retry logic as originally proposed was causing
> an infinite loop.
>
>>
>> The way to deal with that without
>> looping is to test whether the index set *actually* changed, not whether
>> it just might have changed.
>
>
> I like this approach. I'll run the patch on a build with
> CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the
> fact that retry logic is now limited to only when index set changes, which
> fits in the situation we're dealing with.
>

Don't you think it also has the same problem as the one I mentioned in
Alvaro's patch, which you also agreed on?  Do you see any need to fix it
locally in RelationGetIndexAttrBitmap()?  Why can't we clear it at
transaction end?



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 5:41 AM, Peter Geoghegan  wrote:

> On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas  wrote:
> > I don't think this kind of black-and-white thinking is very helpful.
> > Obviously, data corruption is bad.  However, this bug has (from what
> > one can tell from this thread) been with us for over a decade; it must
> > necessarily be either low-probability or low-severity, or somebody
> > would've found it and fixed it before now.  Indeed, the discovery of
> > this bug was driven by new feature development, not a user report.  It
> > seems pretty clear that if we try to patch this and get it wrong, the
> > effects of our mistake could easily be a lot more serious than the
> > original bug.
>
> +1. The fact that it wasn't driven by a user report convinces me that
> this is the way to go.
>
>
I'm not sure that the bug not being reported by a user makes it any
less critical. As Tomas pointed out downthread, the nature of the bug is
such that users may not discover it very easily, but that doesn't mean
it couldn't be affecting them all the time. We can now correlate many past
reports of index corruption to this bug, but we don't have evidence to
prove that. Lack of any good tool or built-in checks probably makes it even
harder.

The reason I discovered this with WARM is that WARM has built-in
recheck logic, which was discarding some tuples returned by the index scan.
It took me lots of code review and inspection to finally conclude that this
must be an existing bug. Even then, for lack of any utility, I could not
detect this easily on master. That doesn't mean I wasn't able to
reproduce it, but detection was a problem.

If you look at the reproduction setup, one in every 10, if not 5, CREATE
INDEX CONCURRENTLY ends up creating a corrupt index. That's a good 10%
probability. And this is on a low end laptop, with just 5-10 concurrent
clients running. Probability of hitting the problem will be much higher on
a bigger machine, with many users (on a decent AWS machine, I would find
every third CIC to be corrupt). Moreover, the setup is not doing anything
extraordinary: just concurrent updates which flip between HOT and
non-HOT, and a CIC.

Thanks,
Pavan
-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Amit Kapila
On Mon, Feb 6, 2017 at 6:27 AM, Andres Freund  wrote:
> Hi,
>
> On 2017-02-05 16:37:33 -0800, Andres Freund wrote:
>> >  RelationGetIndexList(Relation relation)
>> > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat
>> >   * we can include system attributes (e.g., OID) in the bitmap 
>> > representation.
>> >   *
>> >   * Caller had better hold at least RowExclusiveLock on the target relation
>> > - * to ensure that it has a stable set of indexes.  This also makes it safe
>> > - * (deadlock-free) for us to take locks on the relation's indexes.
>> > + * to ensure it is safe (deadlock-free) for us to take locks on the 
>> > relation's
>> > + * indexes.  Note that since the introduction of CREATE INDEX 
>> > CONCURRENTLY,
>> > + * that lock level doesn't guarantee a stable set of indexes, so we have 
>> > to
>> > + * be prepared to retry here in case of a change in the set of indexes.
>>
>> I've not yet read the full thread, but I'm a bit confused so far. We
>> obviously can get changing information about indexes here, but isn't
>> that something we have to deal with anyway?  If we guarantee that we
>> don't lose knowledge that there's a pending invalidation, why do we
>> have to retry?  Pretty much by definition the knowledge in a relcache
>> entry can be outdated as soon as returned unless locking prevents that
>> from being possible - which is not the case here.
>>
>> ISTM it'd be better not to have retry logic here, but to follow the more
>> general pattern of making sure that we know whether the information
>> needs to be recomputed at the next access.  We could either do that by
>> having an additional bit of information about the validity of the
>> bitmaps (so we have invalid, building, valid - and only set valid at the
>> end of computing the bitmaps when still building and not invalid again),
>> or we simply move the bitmap computation into the normal relcache build.
>
> To show what I mean here's an *unpolished* and *barely tested* patch
> implementing the first of my suggestions.
>

+ * Signal that the attribute bitmaps are being built. If there's any
+ * relcache invalidations while building them, rd_bitmapsvalid will be
+ * reset to 0.  In that case return the bitmaps, but don't mark them as
+ * valid.
+ */

I don't see in your patch that you are setting rd_bitmapsvalid to 0.
Also, I think this suffers from the same problem as the patch proposed
by Alvaro, which is that the different attrs (hot_attrs, key_attrs and
id_attrs) will get computed based on different index lists, which can
cause a problem as mentioned in my mail upthread.  However, I think
your general approach, and the idea that we have to set things up for
next time, is spot on.  I am not sure, but I think your solution can be
extended to make it somewhat similar to what we do with rd_indexvalid
(basically, we should set rd_bitmapsvalid to 2, meaning the bitmaps are
temporarily forced, at relcache invalidation and then clean it up at
transaction end).  I can try something along those lines.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavan Deolasee
On Mon, Feb 6, 2017 at 1:44 AM, Tom Lane  wrote:

>
>
> > 2. In the second patch, we tried to recompute attribute lists if a
> relcache
> > flush happens in between and index list is invalidated. We've seen
> problems
> > with that, especially it getting into an infinite loop with
> > CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
> > flushes keep happening.
>
> Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
> happen whenever it possibly could.


Ah, ok. That explains why the retry logic as originally proposed was
causing an infinite loop.


> The way to deal with that without
> looping is to test whether the index set *actually* changed, not whether
> it just might have changed.
>

I like this approach. I'll run the patch on a build with
CLOBBER_CACHE_ALWAYS, but I'm pretty sure it will be ok. I also like the
fact that retry logic is now limited to only when index set changes, which
fits in the situation we're dealing with.

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Peter Geoghegan
On Sun, Feb 5, 2017 at 4:57 PM, Tomas Vondra
 wrote:
> OTOH I disagree with the notion that bugs that are not driven by user
> reports are somehow less severe. Some data corruption bugs cause quite
> visible breakage - segfaults, immediate crashes, etc. Those are pretty clear
> bugs, and are reported by users.

I meant that I find the fact that there were no user reports in all
these years to be a good reason to not proceed for now in this
instance.

I wrote amcheck to detect the latter variety of bug, so clearly I
think that they are very serious.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
Hi,

On 2017-02-05 16:37:33 -0800, Andres Freund wrote:
> >  RelationGetIndexList(Relation relation)
> > @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat
> >   * we can include system attributes (e.g., OID) in the bitmap 
> > representation.
> >   *
> >   * Caller had better hold at least RowExclusiveLock on the target relation
> > - * to ensure that it has a stable set of indexes.  This also makes it safe
> > - * (deadlock-free) for us to take locks on the relation's indexes.
> > + * to ensure it is safe (deadlock-free) for us to take locks on the 
> > relation's
> > + * indexes.  Note that since the introduction of CREATE INDEX CONCURRENTLY,
> > + * that lock level doesn't guarantee a stable set of indexes, so we have to
> > + * be prepared to retry here in case of a change in the set of indexes.
> 
> I've not yet read the full thread, but I'm a bit confused so far. We
> obviously can get changing information about indexes here, but isn't
> that something we have to deal with anyway?  If we guarantee that we
> don't lose knowledge that there's a pending invalidation, why do we
> have to retry?  Pretty much by definition the knowledge in a relcache
> entry can be outdated as soon as returned unless locking prevents that
> from being possible - which is not the case here.
> 
> ISTM it'd be better not to have retry logic here, but to follow the more
> general pattern of making sure that we know whether the information
> needs to be recomputed at the next access.  We could either do that by
> having an additional bit of information about the validity of the
> bitmaps (so we have invalid, building, valid - and only set valid at the
> end of computing the bitmaps when still building and not invalid again),
> or we simply move the bitmap computation into the normal relcache build.

To show what I mean here's an *unpolished* and *barely tested* patch
implementing the first of my suggestions.

Alvaro, Pavan, I think this should address the issue as well?

- Andres
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560e46..9e94495e75 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -4745,9 +4745,12 @@ RelationGetIndexPredicate(Relation relation)
  * Attribute numbers are offset by FirstLowInvalidHeapAttributeNumber so that
  * we can include system attributes (e.g., OID) in the bitmap representation.
  *
- * Caller had better hold at least RowExclusiveLock on the target relation
- * to ensure that it has a stable set of indexes.  This also makes it safe
- * (deadlock-free) for us to take locks on the relation's indexes.
+ * Caller had better hold at least RowExclusiveLock on the target relation to
+ * ensure that it has a stable set of indexes.  This also makes it safe
+ * (deadlock-free) for us to take locks on the relation's indexes.  Note that
+ * a concurrent CREATE/DROP INDEX CONCURRENTLY can lead to an outdated list
+ * being returned (will be recomputed at the next access), the CONCURRENTLY
+ * code deals with that.
  *
  * The returned result is palloc'd in the caller's memory context and should
  * be bms_free'd when not needed anymore.
@@ -4766,7 +4769,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	MemoryContext oldcxt;
 
 	/* Quick exit if we already computed the result. */
-	if (relation->rd_indexattr != NULL)
+	if (relation->rd_bitmapsvalid == 2)
 	{
 		switch (attrKind)
 		{
@@ -4788,6 +4791,14 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 		return NULL;
 
 	/*
+	 * Signal that the attribute bitmaps are being built. If there's any
+	 * relcache invalidations while building them, rd_bitmapsvalid will be
+	 * reset to 0.  In that case return the bitmaps, but don't mark them as
+	 * valid.
+	 */
+	relation->rd_bitmapsvalid = 1;
+
+	/*
 	 * Get cached list of index OIDs
 	 */
 	indexoidlist = RelationGetIndexList(relation);
@@ -4892,13 +4903,7 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	bms_free(relation->rd_idattr);
 	relation->rd_idattr = NULL;
 
-	/*
-	 * Now save copies of the bitmaps in the relcache entry.  We intentionally
-	 * set rd_indexattr last, because that's the one that signals validity of
-	 * the values; if we run out of memory before making that copy, we won't
-	 * leave the relcache entry looking like the other ones are valid but
-	 * empty.
-	 */
+	/* now save copies of the bitmaps in the relcache entry */
 	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 	relation->rd_keyattr = bms_copy(uindexattrs);
 	relation->rd_pkattr = bms_copy(pkindexattrs);
@@ -4906,6 +4911,18 @@ RelationGetIndexAttrBitmap(Relation relation, IndexAttrBitmapKind attrKind)
 	relation->rd_indexattr = bms_copy(indexattrs);
 	MemoryContextSwitchTo(oldcxt);
 
+	/*
+	 * If there've been no invalidations while building the entry, mark the
+	 * stored bitmaps as being valid.  Need to do so after the copies above,

Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Tomas Vondra

On 02/06/2017 01:11 AM, Peter Geoghegan wrote:
> On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas  wrote:
>> I don't think this kind of black-and-white thinking is very
>> helpful. Obviously, data corruption is bad. However, this bug has
>> (from what one can tell from this thread) been with us for over a
>> decade; it must necessarily be either low-probability or
>> low-severity, or somebody would've found it and fixed it before
>> now. Indeed, the discovery of this bug was driven by new feature
>> development, not a user report. It seems pretty clear that if we
>> try to patch this and get it wrong, the effects of our mistake
>> could easily be a lot more serious than the original bug.
>
> +1. The fact that it wasn't driven by a user report convinces me
> that this is the way to go.

+1 to not rushing fixes into releases. While I think we now finally
understand the mechanics of this bug, the fact that we came up with
three different fixes in this thread, only to discover issues with each
of them, warrants some caution.

OTOH I disagree with the notion that bugs that are not driven by user
reports are somehow less severe. Some data corruption bugs cause quite
visible breakage - segfaults, immediate crashes, etc. Those are pretty
clear bugs, and are reported by users.

Other data corruption bugs are much more subtle - for example this bug
may lead to incorrect results for some queries, failing to detect values
violating UNIQUE constraints, issues with foreign keys, etc.

It's damn impossible to notice incorrect query results that only affect
a tiny subset of the rows (e.g. rows updated when the CIC was running),
especially when the issue may go away after a while due to additional
non-HOT updates.

Regarding the other symptoms - I wonder how many strange 'duplicate
value' errors were misdiagnosed, wrongly attributed to a recent power
outage, etc.



regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
On 2017-02-05 15:14:59 -0500, Tom Lane wrote:
> I do not like any of the other patches proposed in this thread, because
> they fail to guarantee delivering an up-to-date attribute bitmap to the
> caller.  I think we need a retry loop, and I think that it needs to look
> like the attached.

That looks like a reasonable approach, although I'm not sure it's the
best one.


> (Note that the tests whether rd_pkindex and rd_replidindex changed might
> be redundant; but I'm not totally sure of that, and they're cheap.)

I don't think they can, but I'm all for re-checking.


>  RelationGetIndexList(Relation relation)
> @@ -4746,8 +4747,10 @@ RelationGetIndexPredicate(Relation relat
>   * we can include system attributes (e.g., OID) in the bitmap representation.
>   *
>   * Caller had better hold at least RowExclusiveLock on the target relation
> - * to ensure that it has a stable set of indexes.  This also makes it safe
> - * (deadlock-free) for us to take locks on the relation's indexes.
> + * to ensure it is safe (deadlock-free) for us to take locks on the 
> relation's
> + * indexes.  Note that since the introduction of CREATE INDEX CONCURRENTLY,
> + * that lock level doesn't guarantee a stable set of indexes, so we have to
> + * be prepared to retry here in case of a change in the set of indexes.

I've not yet read the full thread, but I'm a bit confused so far. We
obviously can get changing information about indexes here, but isn't
that something we have to deal with anyway?  If we guarantee that we
don't lose knowledge that there's a pending invalidation, why do we
have to retry?  Pretty much by definition the knowledge in a relcache
entry can be outdated as soon as returned unless locking prevents that
from being possible - which is not the case here.

ISTM it'd be better not to have retry logic here, but to follow the more
general pattern of making sure that we know whether the information
needs to be recomputed at the next access.  We could either do that by
having an additional bit of information about the validity of the
bitmaps (so we have invalid, building, valid - and only set valid at the
end of computing the bitmaps when still building and not invalid again),
or we simply move the bitmap computation into the normal relcache build.


BTW, the interactions of RelationSetIndexList with
RelationGetIndexAttrBitmap() look more than a bit fragile to me, even if
documented:
 *
 * We deliberately do not change rd_indexattr here: even when operating
 * with a temporary partial index list, HOT-update decisions must be made
 * correctly with respect to the full index set.  It is up to the caller
 * to ensure that a correct rd_indexattr set has been cached before first
 * calling RelationSetIndexList; else a subsequent inquiry might cause a
 * wrong rd_indexattr set to get computed and cached.  Likewise, we do not
 * touch rd_keyattr, rd_pkattr or rd_idattr.


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Andres Freund
On 2017-02-05 12:51:09 -0500, Tom Lane wrote:
> Michael Paquier  writes:
> > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule  
> > wrote:
> >> I agree with Pavan - a release with known important bug is not good idea.
> 
> > This bug has been around for some time, so I would recommend taking
> > the time necessary to make the best fix possible, even if it means
> > waiting for the next round of minor releases.
> 
> I think the way to think about this sort of thing is, if we had found
> this bug when a release wasn't imminent, would we consider it bad enough
> to justify an unscheduled release cycle?  I have to think the answer for
> this one is "probably not".

+1.  I don't think we serve our users by putting out a nontrivial bugfix
hastily. Nor do I think we serve them in this instance by delaying the
release to cover this fix; there's enough other fixes in the release
imo.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Peter Geoghegan
On Sun, Feb 5, 2017 at 4:09 PM, Robert Haas  wrote:
> I don't think this kind of black-and-white thinking is very helpful.
> Obviously, data corruption is bad.  However, this bug has (from what
> one can tell from this thread) been with us for over a decade; it must
> necessarily be either low-probability or low-severity, or somebody
> would've found it and fixed it before now.  Indeed, the discovery of
> this bug was driven by new feature development, not a user report.  It
> seems pretty clear that if we try to patch this and get it wrong, the
> effects of our mistake could easily be a lot more serious than the
> original bug.

+1. The fact that it wasn't driven by a user report convinces me that
this is the way to go.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Robert Haas
On Sun, Feb 5, 2017 at 1:34 PM, Martín Marqués  wrote:
> On 05/02/17 at 10:03, Michael Paquier wrote:
>> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule  
>> wrote:
>>> I agree with Pavan - a release with known important bug is not good idea.
>>
>> This bug has been around for some time, so I would recommend taking
>> the time necessary to make the best fix possible, even if it means
>> waiting for the next round of minor releases.
>
> The fact that the bug has been around for a long time doesn't mean we
> shouldn't take it seriously.
>
> IMO any kind of corruption (heap or index) should be prioritized.
> There have also been comments suggesting this may be the cause of old
> reports about index corruption.
>
> I ask myself if it's a good idea to make a point release with a known
> corruption bug in it.

I don't think this kind of black-and-white thinking is very helpful.
Obviously, data corruption is bad.  However, this bug has (from what
one can tell from this thread) been with us for over a decade; it must
necessarily be either low-probability or low-severity, or somebody
would've found it and fixed it before now.  Indeed, the discovery of
this bug was driven by new feature development, not a user report.  It
seems pretty clear that if we try to patch this and get it wrong, the
effects of our mistake could easily be a lot more serious than the
original bug.

Now, I do not object to patching this tomorrow before the wrap if the
various senior hackers who have studied this (Tom, Alvaro, Pavan,
Amit) are all in agreement on the way forward.  But if there is any
significant chance that we don't fully understand this issue and that
our fix might cause bigger problems, it would be more prudent to wait.
I'm all in favor of releasing important bug fixes quickly, but
releasing a bug fix before we're sure we've fixed the bug correctly is
taking a good principle to an irrational extreme.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Tom Lane
[ Having now read the whole thread, I'm prepared to weigh in ... ]

Pavan Deolasee  writes:
> This seems like a real problem to me. I don't know what the consequences
> are, but having the various attribute lists take different views
> of the set of indexes definitely doesn't seem right.

For sure.

> 2. In the second patch, we tried to recompute attribute lists if a relcache
> flush happens in between and index list is invalidated. We've seen problems
> with that, especially it getting into an infinite loop with
> CLOBBER_CACHE_ALWAYS. Not clear why it happens, but apparently relcache
> flushes keep happening.

Well, yeah: the point of CLOBBER_CACHE_ALWAYS is to make a relcache flush
happen whenever it possibly could.  The way to deal with that without
looping is to test whether the index set *actually* changed, not whether
it just might have changed.

I do not like any of the other patches proposed in this thread, because
they fail to guarantee delivering an up-to-date attribute bitmap to the
caller.  I think we need a retry loop, and I think that it needs to look
like the attached.

(Note that the tests whether rd_pkindex and rd_replidindex changed might
be redundant; but I'm not totally sure of that, and they're cheap.)

regards, tom lane

diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8a7c560..b659994 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*** RelationGetFKeyList(Relation relation)
*** 4327,4335 
   * it is the pg_class OID of a unique index on OID when the relation has one,
   * and InvalidOid if there is no such index.
   *
!  * In exactly the same way, we update rd_replidindex, which is the pg_class
!  * OID of an index to be used as the relation's replication identity index,
!  * or InvalidOid if there is no such index.
   */
  List *
  RelationGetIndexList(Relation relation)
--- 4327,4336 
   * it is the pg_class OID of a unique index on OID when the relation has one,
   * and InvalidOid if there is no such index.
   *
!  * In exactly the same way, we update rd_pkindex, which is the OID of the
!  * relation's primary key index if any, else InvalidOid; and rd_replidindex,
!  * which is the pg_class OID of an index to be used as the relation's
!  * replication identity index, or InvalidOid if there is no such index.
   */
  List *
  RelationGetIndexList(Relation relation)
*** RelationGetIndexPredicate(Relation relat
*** 4746,4753 
   * we can include system attributes (e.g., OID) in the bitmap representation.
   *
   * Caller had better hold at least RowExclusiveLock on the target relation
!  * to ensure that it has a stable set of indexes.  This also makes it safe
!  * (deadlock-free) for us to take locks on the relation's indexes.
   *
   * The returned result is palloc'd in the caller's memory context and should
   * be bms_free'd when not needed anymore.
--- 4747,4756 
   * we can include system attributes (e.g., OID) in the bitmap representation.
   *
   * Caller had better hold at least RowExclusiveLock on the target relation
!  * to ensure it is safe (deadlock-free) for us to take locks on the relation's
!  * indexes.  Note that since the introduction of CREATE INDEX CONCURRENTLY,
!  * that lock level doesn't guarantee a stable set of indexes, so we have to
!  * be prepared to retry here in case of a change in the set of indexes.
   *
   * The returned result is palloc'd in the caller's memory context and should
   * be bms_free'd when not needed anymore.
*** RelationGetIndexAttrBitmap(Relation rela
*** 4760,4765 
--- 4763,4769 
  	Bitmapset  *pkindexattrs;	/* columns in the primary index */
  	Bitmapset  *idindexattrs;	/* columns in the replica identity */
  	List	   *indexoidlist;
+ 	List	   *newindexoidlist;
  	Oid			relpkindex;
  	Oid			relreplindex;
  	ListCell   *l;
*** RelationGetIndexAttrBitmap(Relation rela
*** 4788,4795 
  		return NULL;
  
  	/*
! 	 * Get cached list of index OIDs
  	 */
  	indexoidlist = RelationGetIndexList(relation);
  
  	/* Fall out if no indexes (but relhasindex was set) */
--- 4792,4800 
  		return NULL;
  
  	/*
! 	 * Get cached list of index OIDs. If we have to start over, we do so here.
  	 */
+ restart:
  	indexoidlist = RelationGetIndexList(relation);
  
  	/* Fall out if no indexes (but relhasindex was set) */
*** RelationGetIndexAttrBitmap(Relation rela
*** 4800,4808 
  	 * Copy the rd_pkindex and rd_replidindex value computed by
  	 * RelationGetIndexList before proceeding.  This is needed because a
  	 * relcache flush could occur inside index_open below, resetting the
! 	 * fields managed by RelationGetIndexList. (The values we're computing
! 	 * will still be valid, assuming that caller has a sufficient lock on
! 	 * the relation.)
  	 */
  	relpkindex = relation->rd_pkindex;
  	relreplindex = relation->rd_replidindex;
--- 4805,4812 
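
[ The diff is cut off above; what follows is a minimal sketch of the retry
test that the new restart: label and newindexoidlist variable lead up to,
reconstructed from the visible hunks -- treat names and placement as
assumptions rather than a verbatim quote of the attachment. ]

	/*
	 * Sketch: after the attribute bitmaps have been computed, re-fetch
	 * the index list and check whether a relcache flush changed the
	 * set of indexes while we were working.
	 */
	newindexoidlist = RelationGetIndexList(relation);
	if (equal(indexoidlist, newindexoidlist) &&
		relpkindex == relation->rd_pkindex &&
		relreplindex == relation->rd_replidindex)
	{
		/* Still the same index set, so the results are trustworthy. */
		list_free(newindexoidlist);
		list_free(indexoidlist);
	}
	else
	{
		/*
		 * The set of indexes changed underneath us; release the lists
		 * (a full version would also free the partially built
		 * bitmapsets) and recompute from scratch.
		 */
		list_free(newindexoidlist);
		list_free(indexoidlist);
		goto restart;
	}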

Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Martín Marqués
On 05/02/17 at 10:03, Michael Paquier wrote:
> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule  wrote:
>> I agree with Pavan - a release with a known important bug is not a good idea.
> 
> This bug has been around for some time, so I would recommend taking
> the time necessary to make the best fix possible, even if it means
> waiting for the next round of minor releases.

The fact that the bug has been around for a long time doesn't mean we
shouldn't take it seriously.

IMO any kind of corruption (heap or index) should be prioritized.
There have also been comments suggesting this may be the cause of old
reports about index corruption.

I ask myself if it's a good idea to make a point release with a known
corruption bug in it.

Regards,

-- 
Martín Marqués    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavel Stehule
2017-02-05 18:51 GMT+01:00 Tom Lane :

> Michael Paquier  writes:
> > On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule 
> wrote:
> >> I agree with Pavan - a release with known important bug is not good
> idea.
>
> > This bug has been around for some time, so I would recommend taking
> > the time necessary to make the best fix possible, even if it means
> > waiting for the next round of minor releases.
>
> I think the way to think about this sort of thing is, if we had found
> this bug when a release wasn't imminent, would we consider it bad enough
> to justify an unscheduled release cycle?  I have to think the answer for
> this one is "probably not".
>

There are a few questions:

1. Is this a critical bug?
2. If it is, is it reason for a special minor release?
3. How much time is necessary to fix this bug?

These questions may be unimportant if there are security fixes in the next
release, which I cannot know.

Regards

Pavel


>
> regards, tom lane
>


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Tom Lane
Michael Paquier  writes:
> On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule  wrote:
>> I agree with Pavan - a release with a known important bug is not a good idea.

> This bug has been around for some time, so I would recommend taking
> the time necessary to make the best fix possible, even if it means
> waiting for the next round of minor releases.

I think the way to think about this sort of thing is, if we had found
this bug when a release wasn't imminent, would we consider it bad enough
to justify an unscheduled release cycle?  I have to think the answer for
this one is "probably not".

regards, tom lane




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Michael Paquier
On Sun, Feb 5, 2017 at 6:53 PM, Pavel Stehule  wrote:
> I agree with Pavan - a release with a known important bug is not a good idea.

This bug has been around for some time, so I would recommend taking
the time necessary to make the best fix possible, even if it means
waiting for the next round of minor releases.
-- 
Michael




Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-05 Thread Pavel Stehule
2017-02-05 7:54 GMT+01:00 Pavan Deolasee :

>
> On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane  wrote:
>
>>
>> Based on Pavan's comments, I think trying to force this into next week's
>> releases would be extremely unwise.  If the bug went undetected this long,
>> it can wait for a fix for another three months.
>
>
> Yes, I think the bug has existed all along and went unnoticed. One reason could
> be that the race happens only when the new index turns HOT updates into
> non-HOT updates. Another reason could be that we don't have checks in place
> to catch these kinds of corruption. Having said that, since we have
> discovered the bug, at least many 2ndQuadrant customers have expressed
> worry and want to know if the fix will be available in 9.6.2 and other
> minor releases.  Since the bug can lead to data corruption, the worry is
> justified. Until we fix the bug, there will be a constant worry about using
> CIC.
>
> If we can have some kind of band-aid fix to plug the hole, that might
> be enough as well. I tested my first patch (which will need some polishing)
> and that works well AFAIK. I was worried about prepared queries and all,
> but that seems ok too. RelationGetIndexList() always gets called within
> ExecInitModifyTable. The fix seems quite unlikely to cause any other side
> effects.
>
> Another possible band-aid is to throw another relcache invalidation in
> CIC. Like adding a dummy index_set_state_flags() within yet another
> transaction. Seems ugly, but should fix the problem for now and not cause
> any impact on the relcache mechanism whatsoever.
>
>> That seems better than
>> risking new breakage when it's barely 48 hours to the release wrap
>> deadline.  We do not have time to recover from any mistakes.
>>
>
> I'm not sure what the project policies are, but can we consider delaying
> the release by a week for issues like these? Or do you think it will be
> hard to come up with a proper fix for the issue and it will need some
> serious refactoring?
>

I agree with Pavan - a release with a known important bug is not a good idea.

Pavel


> Thanks,
> Pavan
>
> --
>  Pavan Deolasee   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-04 Thread Pavan Deolasee
On Sat, Feb 4, 2017 at 11:54 PM, Tom Lane  wrote:

>
> Based on Pavan's comments, I think trying to force this into next week's
> releases would be extremely unwise.  If the bug went undetected this long,
> it can wait for a fix for another three months.


Yes, I think the bug has existed all along and went unnoticed. One reason could be
that the race happens only when the new index turns HOT updates into
non-HOT updates. Another reason could be that we don't have checks in place
to catch these kinds of corruption. Having said that, since we have
discovered the bug, at least many 2ndQuadrant customers have expressed
worry and want to know if the fix will be available in 9.6.2 and other
minor releases.  Since the bug can lead to data corruption, the worry is
justified. Until we fix the bug, there will be a constant worry about using
CIC.

If we can have some kind of band-aid fix to plug the hole, that might be
enough as well. I tested my first patch (which will need some polishing)
and that works well AFAIK. I was worried about prepared queries and all,
but that seems ok too. RelationGetIndexList() always gets called within
ExecInitModifyTable. The fix seems quite unlikely to cause any other side
effects.

Another possible band-aid is to throw another relcache invalidation in CIC.
Like adding a dummy index_set_state_flags() within yet another transaction.
Seems ugly, but should fix the problem for now and not cause any impact on
the relcache mechanism whatsoever.
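
For concreteness, a minimal sketch of that second band-aid, assuming it
sits at the tail of the CREATE INDEX CONCURRENTLY sequence in
DefineIndex() -- an illustration of the idea, not a tested patch (a
dummy index_set_state_flags() call would queue the same invalidation
through its pg_index update; the direct invalidation call keeps the
sketch simple):

	/*
	 * Band-aid sketch: run one more short transaction whose only
	 * effect is to queue a relcache invalidation on the parent table,
	 * so that backends caching a stale index set rebuild their
	 * rd_indexattr bitmaps.  Assumes the previous transaction in the
	 * CIC sequence has just committed; heaprelid is the table's OID.
	 */
	StartTransactionCommand();
	CacheInvalidateRelcacheByRelid(heaprelid);
	CommitTransactionCommand();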

> That seems better than
> risking new breakage when it's barely 48 hours to the release wrap
> deadline.  We do not have time to recover from any mistakes.
>

I'm not sure what the project policies are, but can we consider delaying
the release by a week for issues like these? Or do you think it will be
hard to come up with a proper fix for the issue and it will need some
serious refactoring?

Thanks,
Pavan

-- 
 Pavan Deolasee   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Index corruption with CREATE INDEX CONCURRENTLY

2017-02-04 Thread Tom Lane
Alvaro Herrera  writes:
> I intend to commit this soon to all branches, to ensure it gets into the
> next set of minors.

Based on Pavan's comments, I think trying to force this into next week's
releases would be extremely unwise.  If the bug went undetected this long,
it can wait for a fix for another three months.  That seems better than
risking new breakage when it's barely 48 hours to the release wrap
deadline.  We do not have time to recover from any mistakes.

regards, tom lane



