Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-11-13 Thread Alexander Kuzmenkov

I've pushed the executor part of this, but mostly not the planner part,
because I didn't think the latter was anywhere near ready for prime
time: the regression test changes it was causing were entirely bogus.


Hi Tom,

Thanks for the commit and the explanation. I'll try to address your 
comments if I continue working on the planner part.


--

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] PoC: full merge join on comparison clause

2017-10-30 Thread Alexander Kuzmenkov

Hi,

I am attaching the updated patch, rebased to 820c03.

On 09.10.2017 13:47, Ashutosh Bapat wrote:

Hi Alexander,
Commit c7a9fa399 has added another test on mergeopfamilies. I think
your patch will need to take care of that test.

On Wed, Oct 4, 2017 at 6:38 PM, Alexander Kuzmenkov
<a.kuzmen...@postgrespro.ru> wrote:

As discussed earlier, I changed the way we work with mergeopfamilies. I use
the "is_equality" flag to indicate whether the clause is an equality one,
and fill mergeopfamilies for both equality and inequality operators.
The updated patch is attached (rebased to 20b6552242).


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company






--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 925b4cf553..73e6a4ca74 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -172,31 +172,32 @@ typedef enum
  * to the opfamily and collation, with nulls at the indicated end of the range.
  * This allows us to obtain the needed comparison function from the opfamily.
  */
-static MergeJoinClause
+static void
 MJExamineQuals(List *mergeclauses,
 			   Oid *mergefamilies,
 			   Oid *mergecollations,
 			   int *mergestrategies,
 			   bool *mergenullsfirst,
-			   PlanState *parent)
+			   MergeJoinState *parent)
 {
-	MergeJoinClause clauses;
 	int			nClauses = list_length(mergeclauses);
 	int			iClause;
 	ListCell   *cl;
 
-	clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData));
+	parent->mj_Clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData));
+	parent->mj_UseEqual = (bool *) palloc0(nClauses * sizeof(bool));
+	parent->mj_UseLesser = (bool *) palloc0(nClauses * sizeof(bool));
 
 	iClause = 0;
 	foreach(cl, mergeclauses)
 	{
 		OpExpr	   *qual = (OpExpr *) lfirst(cl);
-		MergeJoinClause clause = &clauses[iClause];
+		MergeJoinClause clause = &parent->mj_Clauses[iClause];
 		Oid			opfamily = mergefamilies[iClause];
 		Oid			collation = mergecollations[iClause];
-		StrategyNumber opstrategy = mergestrategies[iClause];
+		StrategyNumber sort_op_strategy = mergestrategies[iClause];
 		bool		nulls_first = mergenullsfirst[iClause];
-		int			op_strategy;
+		int			join_op_strategy;
 		Oid			op_lefttype;
 		Oid			op_righttype;
 		Oid			sortfunc;
@@ -207,28 +208,55 @@ MJExamineQuals(List *mergeclauses,
 		/*
 		 * Prepare the input expressions for execution.
 		 */
-		clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), parent);
-		clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), parent);
+		clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), (PlanState *) parent);
+		clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), (PlanState *) parent);
 
 		/* Set up sort support data */
 		clause->ssup.ssup_cxt = CurrentMemoryContext;
 		clause->ssup.ssup_collation = collation;
-		if (opstrategy == BTLessStrategyNumber)
+		if (sort_op_strategy == BTLessStrategyNumber)
 			clause->ssup.ssup_reverse = false;
-		else if (opstrategy == BTGreaterStrategyNumber)
+		else if (sort_op_strategy == BTGreaterStrategyNumber)
 			clause->ssup.ssup_reverse = true;
 		else	/* planner screwed up */
-			elog(ERROR, "unsupported mergejoin strategy %d", opstrategy);
+			elog(ERROR, "unsupported mergejoin strategy %d", sort_op_strategy);
 		clause->ssup.ssup_nulls_first = nulls_first;
 
 		/* Extract the operator's declared left/right datatypes */
 		get_op_opfamily_properties(qual->opno, opfamily, false,
-								   &op_strategy,
+								   &join_op_strategy,
 								   &op_lefttype,
 								   &op_righttype);
-		if (op_strategy != BTEqualStrategyNumber)	/* should not happen */
-			elog(ERROR, "cannot merge using non-equality operator %u",
- qual->opno);
+
+		/*
+		 * Determine whether we accept lesser and/or equal tuples of the inner
+		 * relation.
+		 */
+		switch (join_op_strategy)
+		{
+			case BTEqualStrategyNumber:
+parent->mj_UseEqual[iClause] = true;
+break;
+
+			case BTLessEqualStrategyNumber:
+parent->mj_UseEqual[iClause] = true;
+/* fall through */
+
+			case BTLessStrategyNumber:
+parent->mj_UseLesser[iClause] = true;
+break;
+
+			case BTGreaterEqualStrategyNumber:
+parent->mj_UseEqual[iClause] = true;
+/* fall through */
+
+			case BTGreaterStrategyNumber:
+parent->mj_UseLesser[iClause] = true;
+break;
+
+			default:
+elog(ERROR, "unsupported join strategy %d", join_op_strategy);
+		}
 
 		/*
 		 * sortsupport routine must know if abbreviation optimization is
@@ -265,8 +293,6 @@ MJExamineQuals(List *mergeclauses,
 
 		iClause++;
 	}
-
-	return clauses;
 }
 
 /*
@@ -378,6 +404,14 @@ MJEvalInnerValues(MergeJoinState *mergestate, TupleTableSlot *innerslot)
 	

Re: [HACKERS] Proposal: Improve bitmap costing for lossy pages

2017-10-06 Thread Alexander Kuzmenkov



Analysis: The estimated value of the lossy_pages is way higher than
its actual value and reason is that the total_pages calculated by the
"Mackert and Lohman formula" is not correct.


I think the problem might be that the total_pages includes cache effects 
and rescans. For bitmap entries we should use something like relation 
pages * selectivity.
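
To illustrate what I mean, here is a minimal standalone sketch of that
estimate (the function name and the plain-double interface are mine; this
is not the actual costing code):

#include <math.h>

/*
 * Estimate the heap pages a bitmap scan will touch as selectivity times
 * the relation size, clamped to [1, rel_pages]. Unlike the Mackert-Lohman
 * figure, this does not account for caching across rescans.
 */
static double
bitmap_pages_estimate(double rel_pages, double selectivity)
{
	double		pages = ceil(selectivity * rel_pages);

	if (pages > rel_pages)
		pages = rel_pages;
	if (pages < 1.0)
		pages = 1.0;
	return pages;
}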


Meanwhile, I ran TPC-H briefly with the v3 patch: scale factor 25, 2 
workers, SSD storage.

It shows significant improvement on 4MB work_mem and no change on 2GB.

Here are the results (execution time in seconds, average of 5 runs):
work_mem          4MB              2GB
Query         master  patch    master  patch
4                179    174       168    167
5                190    168       155    156
6                280     87       227    229
8                197    114       179    172
10               269    227       190    192
14               110    108       106    105

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] PoC: full merge join on comparison clause

2017-10-04 Thread Alexander Kuzmenkov
As discussed earlier, I changed the way we work with mergeopfamilies. I 
use the "is_equality" flag to indicate whether the clause is an equality 
one, and fill mergeopfamilies for both equality and inequality operators.

The updated patch is attached (rebased to 20b6552242).

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 925b4cf553..73e6a4ca74 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -172,31 +172,32 @@ typedef enum
  * to the opfamily and collation, with nulls at the indicated end of the range.
  * This allows us to obtain the needed comparison function from the opfamily.
  */
-static MergeJoinClause
+static void
 MJExamineQuals(List *mergeclauses,
 			   Oid *mergefamilies,
 			   Oid *mergecollations,
 			   int *mergestrategies,
 			   bool *mergenullsfirst,
-			   PlanState *parent)
+			   MergeJoinState *parent)
 {
-	MergeJoinClause clauses;
 	int			nClauses = list_length(mergeclauses);
 	int			iClause;
 	ListCell   *cl;
 
-	clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData));
+	parent->mj_Clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData));
+	parent->mj_UseEqual = (bool *) palloc0(nClauses * sizeof(bool));
+	parent->mj_UseLesser = (bool *) palloc0(nClauses * sizeof(bool));
 
 	iClause = 0;
 	foreach(cl, mergeclauses)
 	{
 		OpExpr	   *qual = (OpExpr *) lfirst(cl);
-		MergeJoinClause clause = &clauses[iClause];
+		MergeJoinClause clause = &parent->mj_Clauses[iClause];
 		Oid			opfamily = mergefamilies[iClause];
 		Oid			collation = mergecollations[iClause];
-		StrategyNumber opstrategy = mergestrategies[iClause];
+		StrategyNumber sort_op_strategy = mergestrategies[iClause];
 		bool		nulls_first = mergenullsfirst[iClause];
-		int			op_strategy;
+		int			join_op_strategy;
 		Oid			op_lefttype;
 		Oid			op_righttype;
 		Oid			sortfunc;
@@ -207,28 +208,55 @@ MJExamineQuals(List *mergeclauses,
 		/*
 		 * Prepare the input expressions for execution.
 		 */
-		clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), parent);
-		clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), parent);
+		clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), (PlanState *) parent);
+		clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), (PlanState *) parent);
 
 		/* Set up sort support data */
 		clause->ssup.ssup_cxt = CurrentMemoryContext;
 		clause->ssup.ssup_collation = collation;
-		if (opstrategy == BTLessStrategyNumber)
+		if (sort_op_strategy == BTLessStrategyNumber)
 			clause->ssup.ssup_reverse = false;
-		else if (opstrategy == BTGreaterStrategyNumber)
+		else if (sort_op_strategy == BTGreaterStrategyNumber)
 			clause->ssup.ssup_reverse = true;
 		else	/* planner screwed up */
-			elog(ERROR, "unsupported mergejoin strategy %d", opstrategy);
+			elog(ERROR, "unsupported mergejoin strategy %d", sort_op_strategy);
 		clause->ssup.ssup_nulls_first = nulls_first;
 
 		/* Extract the operator's declared left/right datatypes */
 		get_op_opfamily_properties(qual->opno, opfamily, false,
-								   &op_strategy,
+								   &join_op_strategy,
 								   &op_lefttype,
 								   &op_righttype);
-		if (op_strategy != BTEqualStrategyNumber)	/* should not happen */
-			elog(ERROR, "cannot merge using non-equality operator %u",
- qual->opno);
+
+		/*
+		 * Determine whether we accept lesser and/or equal tuples of the inner
+		 * relation.
+		 */
+		switch (join_op_strategy)
+		{
+			case BTEqualStrategyNumber:
+parent->mj_UseEqual[iClause] = true;
+break;
+
+			case BTLessEqualStrategyNumber:
+parent->mj_UseEqual[iClause] = true;
+/* fall through */
+
+			case BTLessStrategyNumber:
+parent->mj_UseLesser[iClause] = true;
+break;
+
+			case BTGreaterEqualStrategyNumber:
+parent->mj_UseEqual[iClause] = true;
+/* fall through */
+
+			case BTGreaterStrategyNumber:
+parent->mj_UseLesser[iClause] = true;
+break;
+
+			default:
+elog(ERROR, "unsupported join strategy %d", join_op_strategy);
+		}
 
 		/*
 		 * sortsupport routine must know if abbreviation optimization is
@@ -265,8 +293,6 @@ MJExamineQuals(List *mergeclauses,
 
 		iClause++;
 	}
-
-	return clauses;
 }
 
 /*
@@ -378,6 +404,14 @@ MJEvalInnerValues(MergeJoinState *mergestate, TupleTableSlot *innerslot)
 	return result;
 }
 
+/* Tuple comparison result */
+typedef enum
+{
+	MJCR_NextInner = 1,
+	MJCR_NextOuter = -1,
+	MJCR_Join = 0
+} MJCompareResult;
+
 /*
  * MJCompare
  *
@@ -388,10 +422,10 @@ MJEvalInnerValues(MergeJoinState *mergestate, TupleTableSlot *innerslot)
  * MJEvalOuterValues and MJEvalInnerValues must already have been called
  * for the current outer and inner tuples, respectively.
  */
-static int
+static MJCompareResult
 MJC

Re: [HACKERS] PoC: full merge join on comparison clause

2017-09-28 Thread Alexander Kuzmenkov

Hi Ashutosh,

Thanks for the review.

*Jeff*, I'm copying you because this is relevant to our discussion about 
what to do with mergeopfamilies when adding new merge join types.



You have renamed RestrictInfo member mergeopfamilies as
equivopfamilies. I don't think that's a good name; it doesn't convey
that these are opfamilies containing merge operators. The changes in
check_mergejoinable() suggest that an operator may act as equality
operator in few operator families and comparison operator in others.
That looks odd. Actually an operator family contains operators other
than equality operators, so you may want to retain this member and add
a new member to specify whether the clause is an equality clause or
not.


For mergeopfamilies, I'm not sure what the best thing to do is. I'll try 
to explain my understanding of the situation; please correct me if I'm 
wrong.


Before the patch, mergeopfamilies was used for two things: creating 
equivalence classes and performing merge joins.


For equivalence classes: we look at the restriction clauses, and if they 
have mergeopfamilies set, it means that these clauses are based on an 
equality operator, and the left and right variables must be equal. To 
record this fact, we create an equivalence class. The variables might be 
equal for one equality operator and not equal for another, so we record 
the particular operator families to which our equality operator belongs.


For merge join: we look at the join clauses, and if they have 
mergeopfamilies set, it means that these clauses are based on an 
equality operator, and we can try performing this particular join as 
a merge join. These opfamilies are also used beforehand to create the 
equivalence classes for left and right variables. The equivalence 
classes are used to match the join clauses to pathkeys describing the 
ordering of join inputs.


So, if we want to start doing merge joins for operators other than 
equality, we still need to record their opfamilies, but only use them 
for the second case and not the first. I chose to put these opfamilies 
into different variables, and
name the one used for equivalence classes 'equivopfamilies' and the one 
used for merge joins 'mergeopfamilies'. The equality operators are used 
for both cases, so we put their opfamilies into both of these variables.


I agree this might look confusing. Indeed, we could keep a single 
variable for opfamilies, and add separate flags that show how they can 
be used, be that for equivalence classes, merge joins, range joins or 
some combination of them. This is similar to what Jeff did in his range 
merge join patch [1]. I will think more about this and try to produce an 
updated patch.
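
As a rough sketch of that alternative, the relevant fields could look
like this (hypothetical names, not the actual RestrictInfo definition
from the patch; List is the usual PostgreSQL list type):

#include "postgres.h"
#include "nodes/pg_list.h"

typedef struct MergeClauseInfo
{
	/* btree opfamilies containing the clause operator */
	List	   *mergeopfamilies;

	/*
	 * True for '=' clauses, which can drive both equivalence classes and
	 * merge joins; false for '<', '<=', '>', '>=', which are only usable
	 * for merge joins.
	 */
	bool		is_equality;
} MergeClauseInfo;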




In mergejoinscansel() you have just removed Assert(op_strategy ==
BTEqualStrategyNumber); Probably this function is written considering
on equality operators. But now that we are using this for all other
operators, we will need more changes to this function. That may be the
reason why INNER join in your earlier example doesn't choose right
costing.


I changed mergejoinscansel() slightly to reflect the fact that the inner 
relation is scanned from the beginning if we have an inequality merge 
clause.




The comment change in final_cost_mergejoin() needs more work. n1, n2,
n3 are number of rows on inner side with values 1, 2, 3 resp. So n1 +
n2 + n3 + ... = size of inner relation is correct. In that context I
am not able to understand your change
+* If the merge clauses contain inequality, (n1 + n2 + ...) ~=
+* (size of inner relation)^2.


I extended the comment in final_cost_mergejoin(). Not sure if that 
approximation makes any sense, but this is the best I could think of.


Style problems are fixed.

Attached please find the new version of the patch that addresses all the 
review comments except mergeopfamilies.


The current commitfest is ending, but I'd like to continue working on 
this patch, so I am moving it to the next one.



[1] 
https://www.postgresql.org/message-id/flat/CAMp0ubfwAFFW3O_NgKqpRPmm56M4weTEXjprb2gP_NrDaEC4Eg%40mail.gmail.com#camp0ubfwaffw3o_ngkqprpmm56m4wetexjprb2gp_nrdaec...@mail.gmail.com 



--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 32dc4e6301..2958a9e53d 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -722,19 +722,19 @@ get_useful_ecs_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	{
 		RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(lc);
 
-		/* Consider only mergejoinable clauses */
-		if (restrictinfo->mergeopfamilies == NIL)
+		/* Consider only mergejoinable equality clauses */
+		if (restrictinfo->equivopfamilies == NIL)
 			continue;
 
 		/* Make sure we've got canonical ECs. */
-		update_mergeclause_eclasses(root, restrictinfo);
+		update_equivclause_eclasses(root, restri

Re: [HACKERS] type cache for concat functions

2017-09-19 Thread Alexander Kuzmenkov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:   tested, passed
Spec compliant:   tested, passed
Documentation:tested, passed

Looks good to me.

The new status of this patch is: Ready for Committer



Re: [HACKERS] [POC] Faster processing at Gather node

2017-09-11 Thread Alexander Kuzmenkov
Thanks Rafia, Amit, now I understand the ideas behind the patch better. 
I'll see if I can look at the new one.


--

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [POC] Faster processing at Gather node

2017-09-08 Thread Alexander Kuzmenkov

Hi Rafia,

I like the idea of reducing locking overhead by sending tuples in bulk. 
The implementation could probably be simpler: you could extend the API 
of shm_mq to decouple notifying the receiver from actually putting data 
into the queue (i.e., make shm_mq_notify_receiver public and make a 
variant of shm_mq_sendv that doesn't send the notification). From Amit's 
email I understand that you have already tried something along these 
lines and the performance wasn't good. What was the bottleneck then? If 
it's the locking around mq_bytes_read/written, it can be rewritten with 
atomics. I think it would be great to try this approach because it 
doesn't add much code, doesn't add any additional copying and improves 
shm_mq performance in general.
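
A very rough sketch of what I have in mind, using two hypothetical
functions that do not exist in core today (shm_mq_sendv_nonotify() and an
exported shm_mq_notify_receiver()):

#include "postgres.h"
#include "storage/shm_mq.h"

/*
 * Enqueue a batch of tuples without waking the receiver, then send a
 * single notification for the whole batch. Both shm_mq_sendv_nonotify()
 * and the exported shm_mq_notify_receiver() are only placeholders for the
 * proposed API additions.
 */
static void
send_tuple_batch(shm_mq_handle *mqh, shm_mq_iovec *iov, int ntuples)
{
	int			i;

	for (i = 0; i < ntuples; i++)
		(void) shm_mq_sendv_nonotify(mqh, &iov[i], 1, false);

	shm_mq_notify_receiver(mqh);	/* one wakeup per batch, not per tuple */
}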


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-09-04 Thread Alexander Kuzmenkov

On 04.09.2017 15:17, Alexey Chernyshov wrote:

make check-world fails on contrib/postgres_fdw because of Subquery Scan on ... 
Probably, query plan was changed.


Hi Alexey,

Thanks for testing! This is the same problem as the one in 
'select_distinct' I mentioned before. I changed the test; the updated 
patch is attached.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c19b3318c7..5f5f65d60c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -2203,22 +2203,23 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2
 -- join with lateral reference
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM ft1 t2, ft2 t3 WHERE t2.c1 = t3.c1 AND t2.c2 = t1.c2) q ORDER BY t1."C 1" OFFSET 10 LIMIT 10;
- QUERY PLAN 
-
+QUERY PLAN
+--
  Limit
Output: t1."C 1"
->  Nested Loop
  Output: t1."C 1"
  ->  Index Scan using t1_pkey on "S 1"."T 1" t1
Output: t1."C 1", t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
- ->  HashAggregate
-   Output: t2.c1, t3.c1
-   Group Key: t2.c1, t3.c1
-   ->  Foreign Scan
+ ->  Subquery Scan on q
+   ->  HashAggregate
  Output: t2.c1, t3.c1
- Relations: (public.ft1 t2) INNER JOIN (public.ft2 t3)
- Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1")) AND ((r1.c2 = $1::integer
-(13 rows)
+ Group Key: t2.c1, t3.c1
+ ->  Foreign Scan
+   Output: t2.c1, t3.c1
+   Relations: (public.ft1 t2) INNER JOIN (public.ft2 t3)
+   Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1")) AND ((r1.c2 = $1::integer
+(14 rows)
 
 SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM ft1 t2, ft2 t3 WHERE t2.c1 = t3.c1 AND t2.c2 = t1.c2) q ORDER BY t1."C 1" OFFSET 10 LIMIT 10;
  C 1 
@@ -2672,16 +2673,17 @@ select c2, sum(c1) from ft2 group by c2 having avg(c1) < 500 and sum(c1) < 49800
 -- Unshippable HAVING clause will be evaluated locally, and other qual in HAVING clause is pushed down
 explain (verbose, costs off)
 select count(*) from (select c5, count(c1) from ft1 group by c5, sqrt(c2) having (avg(c1) / avg(c1)) * random() <= 1 and avg(c1) < 500) x;
-   QUERY PLAN
--
+  QUERY PLAN   
+---
  Aggregate
Output: count(*)
-   ->  Foreign Scan
- Output: ft1.c5, NULL::bigint, (sqrt((ft1.c2)::double precision))
- Filter: (avg(ft1.c1)) / (avg(ft1.c1::double precision * random()) <= '1'::double precision)
- Relations: Aggregate on (public.ft1)
- Remote SQL: SELECT c5, NULL::bigint, sqrt(c2), avg("C 1") FROM "S 1"."T 1" GROUP BY c5, (sqrt(c2)) HAVING ((avg("C 1") < 500::numeric))
-(7 rows)
+   ->  Subquery Scan on x
+ ->  Foreign Scan
+   Output: ft1.c5, NULL::bigint, (sqrt((ft1.c2)::double precision))
+   Filter: (avg(ft1.c1)) / (avg(ft1.c1::doub

Re: [HACKERS] Range Merge Join v1

2017-08-25 Thread Alexander Kuzmenkov

Hi Jeff,

Fast range joins are very nice to have, thank you for working on this.

I've been doing a somewhat related thing recently, trying to teach the 
merge join executor to perform full joins on comparison clauses [1]. I 
have some comments after reading your patch:


* At the moment, "mergejoinable clause" and "equality clause" mean the 
same thing to the planner, and those clauses are used both to create 
equivalence classes and to perform merge joins. If we start performing 
merge joins for different kinds of clauses, such as comparison or range 
intersection, it makes sense to separate these two meanings. I tried 
this in my patch and it didn't require many changes. I use 
RestrictInfo.equivopfamilies to build equivalence classes, and 
RestrictInfo.mergeopfamilies for mergejoinable clauses.


* Semantically, MJCompare() is a function that determines whether you 
should emit a join result or advance one of the pointers. This meaning 
is not explicit in the code and is not well encapsulated: the function 
returns an int and a 'compareNoMatch' flag, and the calling code combines 
them in various ways to derive the final result. This meaning can be 
made explicit by making MJCompare return enum values {Join, NextInner, 
NextOuter}, and putting inside it all the logic that decides whether we 
join or advance. ExecMergeJoin looks intimidating already, and I think 
this change would help readability. Again, you can find an illustration 
in my patch.
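
To make this concrete, here is a small self-contained sketch of that
convention (the enum values are taken from my patch; compare_keys() is
just a stand-in for the real key comparison done in MJCompare):

/* Tuple comparison result */
typedef enum
{
	MJCR_NextInner = 1,
	MJCR_NextOuter = -1,
	MJCR_Join = 0
} MJCompareResult;

/* Decide what to do with the current pair of join keys. */
static MJCompareResult
compare_keys(int outer_key, int inner_key)
{
	if (outer_key == inner_key)
		return MJCR_Join;		/* keys match: emit a joined tuple */
	if (outer_key > inner_key)
		return MJCR_NextInner;	/* inner side is behind, advance it */
	return MJCR_NextOuter;		/* outer side is behind, advance it */
}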


I hope you find these points helpful.

[1] 
https://www.postgresql.org/message-id/flat/1d23ad41-a9d9-1d1e-1d79-83b002d6f776%40postgrespro.ru#1d23ad41-a9d9-1d1e-1d79-83b002d6f...@postgrespro.ru


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] PoC: full merge join on comparison clause

2017-08-25 Thread Alexander Kuzmenkov
Here is a new version of the patch, rebased to 749c7c41 and with some 
cosmetic changes.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index d77c2a70e4..19bc90aa32 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -722,19 +722,19 @@ get_useful_ecs_for_relation(PlannerInfo *root, RelOptInfo *rel)
 	{
 		RestrictInfo *restrictinfo = (RestrictInfo *) lfirst(lc);
 
-		/* Consider only mergejoinable clauses */
-		if (restrictinfo->mergeopfamilies == NIL)
+		/* Consider only mergejoinable equality clauses */
+		if (restrictinfo->equivopfamilies == NIL)
 			continue;
 
 		/* Make sure we've got canonical ECs. */
-		update_mergeclause_eclasses(root, restrictinfo);
+		update_equivclause_eclasses(root, restrictinfo);
 
 		/*
-		 * restrictinfo->mergeopfamilies != NIL is sufficient to guarantee
+		 * restrictinfo->equivopfamilies != NIL is sufficient to guarantee
 		 * that left_ec and right_ec will be initialized, per comments in
 		 * distribute_qual_to_rels.
 		 *
-		 * We want to identify which side of this merge-joinable clause
+		 * We want to identify which side of this merge-joinable equality clause
 		 * contains columns from the relation produced by this RelOptInfo. We
 		 * test for overlap, not containment, because there could be extra
 		 * relations on either side.  For example, suppose we've got something
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 925b4cf553..8eb5c8fd1d 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -172,31 +172,32 @@ typedef enum
  * to the opfamily and collation, with nulls at the indicated end of the range.
  * This allows us to obtain the needed comparison function from the opfamily.
  */
-static MergeJoinClause
+static void
 MJExamineQuals(List *mergeclauses,
 			   Oid *mergefamilies,
 			   Oid *mergecollations,
 			   int *mergestrategies,
 			   bool *mergenullsfirst,
-			   PlanState *parent)
+			   MergeJoinState *parent)
 {
-	MergeJoinClause clauses;
 	int			nClauses = list_length(mergeclauses);
 	int			iClause;
 	ListCell   *cl;
 
-	clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData));
+	parent->mj_Clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData));
+	parent->mj_UseEqual = (bool *) palloc0(nClauses * sizeof(bool));
+	parent->mj_UseLesser = (bool *) palloc0(nClauses * sizeof(bool));
 
 	iClause = 0;
 	foreach(cl, mergeclauses)
 	{
 		OpExpr	   *qual = (OpExpr *) lfirst(cl);
-		MergeJoinClause clause = &clauses[iClause];
+		MergeJoinClause clause = &parent->mj_Clauses[iClause];
 		Oid			opfamily = mergefamilies[iClause];
 		Oid			collation = mergecollations[iClause];
-		StrategyNumber opstrategy = mergestrategies[iClause];
+		StrategyNumber sort_op_strategy = mergestrategies[iClause];
 		bool		nulls_first = mergenullsfirst[iClause];
-		int			op_strategy;
+		int			join_op_strategy;
 		Oid			op_lefttype;
 		Oid			op_righttype;
 		Oid			sortfunc;
@@ -207,28 +208,50 @@ MJExamineQuals(List *mergeclauses,
 		/*
 		 * Prepare the input expressions for execution.
 		 */
-		clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), parent);
-		clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), parent);
+		clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), (PlanState *) parent);
+		clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), (PlanState *) parent);
 
 		/* Set up sort support data */
 		clause->ssup.ssup_cxt = CurrentMemoryContext;
 		clause->ssup.ssup_collation = collation;
-		if (opstrategy == BTLessStrategyNumber)
+		if (sort_op_strategy == BTLessStrategyNumber)
 			clause->ssup.ssup_reverse = false;
-		else if (opstrategy == BTGreaterStrategyNumber)
+		else if (sort_op_strategy == BTGreaterStrategyNumber)
 			clause->ssup.ssup_reverse = true;
 		else	/* planner screwed up */
-			elog(ERROR, "unsupported mergejoin strategy %d", opstrategy);
+			elog(ERROR, "unsupported mergejoin strategy %d", sort_op_strategy);
 		clause->ssup.ssup_nulls_first = nulls_first;
 
 		/* Extract the operator's declared left/right datatypes */
 		get_op_opfamily_properties(qual->opno, opfamily, false,
-								   &op_strategy,
+								   &join_op_strategy,
 								   &op_lefttype,
 								   &op_righttype);
-		if (op_strategy != BTEqualStrategyNumber)	/* should not happen */
-			elog(ERROR, "cannot merge using non-equality operator %u",
- qual->opno);
+
+		/* 
+		 * Determine whether we accept lesser and/or equal tuples 
+		 * of the inner relation.
+		 */
+		switch (join_op_strategy)
+		{
+			case BTEqualStrategyNumber:
+parent->mj_UseEqual[iClause] = true;
+break;
+			case BTLessEqualStrategyNumber:
+parent->mj_UseEqual[iClause] = true;
+		

Re: [HACKERS] type cache for concat functions

2017-08-24 Thread Alexander Kuzmenkov

Hi Pavel,

I tried applying your patch; it applies and compiles fine, and check and 
check-world pass.


I ran a simple performance test: select 
concat(generate_series(1,10), ... [x5 total]) vs select 
generate_series(1,10)::text || ... .
The || operator runs in 60 ms, while unpatched concat takes 90 ms and 
patched concat takes 55 ms.


About the code:
* There seems to be an extra tab here:
FmgrInfo*typcache;
It's not aligned with the previous declaration with tab size 4.

* You could allocate the cache and store it into the context inside 
rebuildConcatCache. It would return the cache pointer, and the 
calling code would look cleaner -- just one line. This is a matter of 
taste though (see the sketch after this list).


* The nearby functions use snake_case, so it should be 
rebuild_concat_cache too.
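
For illustration, the call site could then look something like this (the
function and variable names are hypothetical, following the comment
above; this is only a sketch of the suggestion, not the patch itself):

#include "postgres.h"
#include "fmgr.h"

/* Hypothetical: builds the cache in the call context, saves it in
 * fcinfo->flinfo->fn_extra and returns the pointer. */
static FmgrInfo *rebuild_concat_cache(FunctionCallInfo fcinfo);

static FmgrInfo *
get_concat_cache(FunctionCallInfo fcinfo)
{
	FmgrInfo   *typcache = (FmgrInfo *) fcinfo->flinfo->fn_extra;

	if (typcache == NULL)
		typcache = rebuild_concat_cache(fcinfo);
	return typcache;
}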


Overall the patch looks good to me.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Proposal for CSN based snapshots

2017-08-02 Thread Alexander Kuzmenkov

What problem exactly you are seeing in the clog, is it the contention
around CLOGControlLock or generally accessing CLOG is slower.  If
former, then we already have a patch [1] to address it.
It's the contention around CLogControlLock. Thank you for the pointer, 
next time I'll try it with the group clog update patch.



--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Proposal for CSN based snapshots

2017-07-19 Thread Alexander Kuzmenkov

That's not actually a problem as I am reusing an older v4 from Heikki
now for the future, but I wanted to let you know about that first.

Thank you, I'll look into that.

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Proposal for CSN based snapshots

2017-06-21 Thread Alexander Kuzmenkov

On 21.06.2017 04:48, Michael Paquier wrote:

There has not been much activity on this thread for some time, and I
mentioned my intentions to some developers at the last PGCon. But I am
planning to study more the work that has been done here, with as
envisaged goal to present a patch for the first CF of PG11. Lots of
fun ahead.


Hi Michael,

Glad to see you working on this! I've been studying this topic too. 
Attached you can find a recently rebased version of Heikki's v4 patch.
I also fixed a bug that appeared in the report-receipts isolation test: 
XidIsConcurrent should report that a transaction is concurrent to ours 
when its csn equals our snapshotcsn.


There is another bug where multiple tuples with the same primary key 
appear after a pgbench read-write test. Querying pgbench_branches with 
index scans disabled, I see several tuples with the same 'bid' value. 
Usually they do not touch the index, so there are no errors. But 
sometimes a HOT update is not possible and it errors out with a unique 
constraint violation, which is how I noticed it. I can't reproduce it on 
my machine and have to use a 72-core one.


For now I can conclude that the oldestActiveXid is not always updated 
correctly. In TransactionIdAsyncCommitTree, just before the transaction 
sets clog status, its xid can be less than oldestActiveXid. Other 
transactions see it as aborted for some time before it writes to 
clog (see TransactionIdGetCommitSeqNo). They can update a tuple it 
deleted, and that leads to duplication. Unfortunately, I didn't have 
time yet to investigate this further.

--

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



csn-5.patch.gz
Description: application/gzip



Re: [HACKERS] PoC: full merge join on comparison clause

2017-05-16 Thread Alexander Kuzmenkov

On 16.05.2017 18:57, Robert Haas wrote:

Interesting. I suggest adding this to the next CommitFest.

Thank you, added: https://commitfest.postgresql.org/14/1141/

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] PoC: full merge join on comparison clause

2017-05-12 Thread Alexander Kuzmenkov
 ms
(16 rows)

test1=# explain analyze select * from t4 full join t1 on t4.a < t1.a;
  QUERY PLAN
---
 Merge Full Join  (cost=809.66..34230.49 rows=333 width=8) (actual 
time=8.351..914.590 rows=5007546 loops=1)

   Merge Cond: (t1.a > t4.a)
   ->  Index Only Scan using idx_t1_a on t1  (cost=0.28..35.27 
rows=1000 width=4) (actual time=0.035..0.368 rows=1001 loops=1)

 Heap Fetches: 97
   ->  Sort  (cost=809.39..834.39 rows=1 width=4) (actual 
time=8.309..347.851 rows=5007546 loops=1)

 Sort Key: t4.a
 Sort Method: quicksort  Memory: 931kB
 ->  Seq Scan on t4  (cost=0.00..145.00 rows=1 width=4) 
(actual time=0.020..2.563 rows=1 loops=1)

 Planning time: 1.083 ms
 Execution time: 1044.869 ms
(10 rows)


=== Merge vs nested loop ===

test1=# explain analyze select * from t5 join t1 on t5.a <= t1.a;
   QUERY PLAN
-
 Nested Loop  (cost=0.28..944713.00 rows= width=8) (actual 
time=0.055..8718.840 rows=50014145 loops=1)
   ->  Seq Scan on t5  (cost=0.00..1443.00 rows=10 width=4) (actual 
time=0.019..6.428 rows=10 loops=1)
   ->  Index Only Scan using idx_t1_a on t1  (cost=0.28..6.10 rows=333 
width=4) (actual time=0.003..0.050 rows=500 loops=10)

 Index Cond: (a >= t5.a)
 Heap Fetches: 9147995
 Planning time: 2.209 ms
 Execution time: 9942.176 ms
(7 rows)

test1=# set enable_mergejoin TO on;
SET
test1=# explain analyze select * from t5 join t1 on t5.a <= t1.a;
  QUERY PLAN
---
 Merge Join  (cost=9748.54..343618.88 rows= width=8) (actual 
time=35.491..9281.482 rows=50014145 loops=1)

   Merge Cond: (t1.a >= t5.a)
   ->  Index Only Scan using idx_t1_a on t1  (cost=0.28..35.27 
rows=1000 width=4) (actual time=0.027..0.769 rows=1001 loops=1)

 Heap Fetches: 97
   ->  Sort  (cost=9747.82..9997.82 rows=10 width=4) (actual 
time=35.458..3906.652 rows=50014145 loops=1)

 Sort Key: t5.a
 Sort Method: quicksort  Memory: 8541kB
 ->  Seq Scan on t5  (cost=0.00..1443.00 rows=10 width=4) 
(actual time=0.013..8.570 rows=10 loops=1)

 Planning time: 2.368 ms
 Execution time: 10530.356 ms
(10 rows)

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 62784af..db9de5b 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -171,31 +171,32 @@ typedef enum
  * to the opfamily and collation, with nulls at the indicated end of the range.
  * This allows us to obtain the needed comparison function from the opfamily.
  */
-static MergeJoinClause
+static void
 MJExamineQuals(List *mergeclauses,
 			   Oid *mergefamilies,
 			   Oid *mergecollations,
 			   int *mergestrategies,
 			   bool *mergenullsfirst,
-			   PlanState *parent)
+			   MergeJoinState *parent)
 {
-	MergeJoinClause clauses;
 	int			nClauses = list_length(mergeclauses);
 	int			iClause;
 	ListCell   *cl;
 
-	clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData));
+	parent->mj_Clauses = (MergeJoinClause) palloc0(nClauses * sizeof(MergeJoinClauseData));
+	parent->mj_UseEqual = (bool *) palloc0(nClauses * sizeof(bool));
+	parent->mj_UseLesser = (bool *) palloc0(nClauses * sizeof(bool));
 
 	iClause = 0;
 	foreach(cl, mergeclauses)
 	{
 		OpExpr	   *qual = (OpExpr *) lfirst(cl);
-		MergeJoinClause clause = &clauses[iClause];
+		MergeJoinClause clause = &parent->mj_Clauses[iClause];
 		Oid			opfamily = mergefamilies[iClause];
 		Oid			collation = mergecollations[iClause];
-		StrategyNumber opstrategy = mergestrategies[iClause];
+		StrategyNumber sort_op_strategy = mergestrategies[iClause];
 		bool		nulls_first = mergenullsfirst[iClause];
-		int			op_strategy;
+		int			join_op_strategy;
 		Oid			op_lefttype;
 		Oid			op_righttype;
 		Oid			sortfunc;
@@ -206,28 +207,46 @@ MJExamineQuals(List *mergeclauses,
 		/*
 		 * Prepare the input expressions for execution.
 		 */
-		clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), parent);
-		clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), parent);
+		clause->lexpr = ExecInitExpr((Expr *) linitial(qual->args), (PlanState *) parent);
+		clause->rexpr = ExecInitExpr((Expr *) lsecond(qual->args), (PlanState *) parent);
 
 		/* Set up s

Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-12 Thread Alexander Kuzmenkov
1 lossy=22217
 ->  Bitmap Index Scan on idx_pglist_fts 
(cost=0.00..529.02 rows=54503 width=0) (actual time=78.366..78.366 
rows=222813 loops=1)
   Index Cond: (fts @@ to_tsquery('tom & 
lane'::text))

 Planning time: 0.722 ms
 Execution time: 1256.028 ms
(13 rows)

test1=# set enable_bitmapcount to on;
SET
test1=# explain analyze select count(*) from pglist where fts @@ 
to_tsquery( 'tom & lane' );

 QUERY PLAN

 Bitmap Count on pglist  (cost=542.65..1087.68 rows=54503 width=8) 
(actual time=2745.740..2745.742 rows=1 loops=1)

   Recheck Cond: (fts @@ to_tsquery('tom & lane'::text))
   Rows Removed by Index Recheck: 270853
   Heap Fetches: 66138
   Heap Blocks: exact=39854 lossy=66138
   ->  Bitmap Index Scan on idx_pglist_fts  (cost=0.00..529.02 
rows=54503 width=0) (actual time=85.572..85.572 rows=222813 loops=1)

 Index Cond: (fts @@ to_tsquery('tom & lane'::text))
 Planning time: 0.701 ms
 Execution time: 2745.800 ms
(9 rows)

--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-12 Thread Alexander Kuzmenkov

On 12.04.2017 17:24, Tom Lane wrote:

TBH, I'm not sure you need to do any of that work.  Have you got evidence
that the planner will fail to choose the right plan regardless? I'm
particularly unconvinced that choose_bitmap_and is a critical problem,
because once you're into having to AND multiple indexes, you're talking
about an expensive query anyhow.
The most expensive part would probably be accessing the heap in the 
bitmap heap scan. It may be worth trying to choose an index path that 
checks all the restrictions and therefore allows us to skip this heap 
access. This path might not be the cheapest one, though. The standard 
AND selection procedure would have discarded it based on cost.
I've seen this happen on the regression database. Somehow I can't seem 
to reproduce it on my earlier full-text search example.


An example:

regression=# explain select count(*) from tenk1 where hundred < 90 and 
thousand < 31;

QUERY PLAN
---
 Bitmap Count on tenk1  (cost=182.69..185.56 rows=1 width=8)
   Recheck Cond: ((thousand < 31) AND (hundred < 90))
   ->  BitmapAnd  (cost=182.69..182.69 rows=287 width=0)
 ->  Bitmap Index Scan on tenk1_thous_tenthous (cost=0.00..6.68 
rows=319 width=0)

   Index Cond: (thousand < 31)
 ->  Bitmap Index Scan on tenk1_hundred (cost=0.00..175.62 
rows=8978 width=0)

   Index Cond: (hundred < 90)
(7 rows)

regression=# set enable_bitmapcount to off;
SET
regression=# explain select count(*) from tenk1 where hundred < 90 and 
thousand < 31;

QUERY PLAN
---
 Aggregate  (cost=375.34..375.35 rows=1 width=8)
   ->  Bitmap Heap Scan on tenk1  (cost=6.75..374.62 rows=287 width=0)
 Recheck Cond: (thousand < 31)
 Filter: (hundred < 90)
 ->  Bitmap Index Scan on tenk1_thous_tenthous (cost=0.00..6.68 
rows=319 width=0)

   Index Cond: (thousand < 31)
(6 rows)

--

Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-12 Thread Alexander Kuzmenkov

On 12.04.2017 15:04, Tom Lane wrote:

Andrew Gierth <and...@tao11.riddles.org.uk> writes:

"Alexander" == Alexander Kuzmenkov <a.kuzmen...@postgrespro.ru> writes:
  Alexander> Structurally, the patch consists of two major parts: a
  Alexander> specialized executor node
Why?
It strikes me that the significant fact here is not that we're doing
count(*), but that we don't need any columns from the bitmap heap scan
result.  Rather than creating a whole new node, can't the existing
bitmap heapscan be taught to skip fetching the actual table page in
cases where it's all-visible, not lossy, and no columns are needed?

+1 ... while I hadn't actually looked at the code, it seemed to me that
anything like the optimization-as-described would be impossibly klugy
from the planner's standpoint.  Your formulation sounds lots nicer.

Detecting that no columns are needed in the executor might be a bit tricky
because of the planner's habit of inserting a "physical tlist" to avoid a
projection step.  (See also nearby discussion about custom scan planning.)
But we could fix that.  I think one rule that would make sense is to
just disable the physical-tlist substitution if the relation's targetlist
is empty --- it wouldn't be buying much in such a case anyhow.  Then the
runtime tlist for the scan node would also be empty, and away you go.
When making an early prototype of this optimization, I did what you are 
describing with the bitmap heap scan executor node. In an internal 
review, it was said that the bitmap heap scan node is already 
complicated enough and shouldn't have more logic added to it, so I 
rewrote it as a separate node. To me, your approach looks good too, so if 
the community is generally in favor of this, I could rewrite the 
executor as such.


With the planner, the changes are more complex. Two things must be done 
there. First, when the tlist is empty, we must use a different cost 
function for bitmap heap scan, because the heap access pattern is 
different. Second, choose_bitmap_and() must use a different algorithm 
for choosing the right combination of paths. A standard algorithm 
chooses the combination based on cost. For count(*) purposes, the 
decisive factor is that the path has to check all the restrictions, or 
else we'll need heap access to recheck some of them, which defeats the 
purpose of having this optimization. The planner code that builds and 
costs the index path is fairly complex, and integrating this additional 
behavior into it didn't look good to me. Instead, I created a 
specialized path node and isolated the logic that handles it. The 
resulting implementation adds several functions, but it is mostly 
self-contained and has a single entry point in grouping_planner(). It 
handles the specific case of bitmap count plans and doesn't complicate 
the existing code any further.


The planner part is to some extent independent of whether we use a 
separate executor node or not. If we choose not to, the same count(*) 
optimization code I proposed could create plain bitmap heap scan paths.


--
Alexander Kuzmenkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] index-only count(*) for indexes supporting bitmap scans

2017-04-11 Thread Alexander Kuzmenkov

Hi,

I would like to propose a patch that speeds up queries of the form 
'select count(*) ... where ...', where the restriction clauses can be 
satisfied by some indexes. At the moment, such queries use index-only 
scans followed by aggregation. Index-only scans are only possible for 
indexes that are capable of returning indexed tuples, that is, support 
the 'amgettuple' access method. They are not applicable to indexes such 
as GIN and RUM. However, it is possible to improve count(*) performance 
for indexes that support bitmap scans. Having performed a bitmap index 
scan or a combination of such, the bits in the bitmap can be counted to 
obtain the final result. Of course, the bitmap pages that are lossy or 
not visible to all existing transactions will still require heap access.

One kind of application that can benefit from this change is full-text 
search with pagination. To show a search results page, the application 
has to know the results that go to the current page, and the total 
number of results. Getting one page is fast when the desired sorting 
order can be provided by an index. For example, results can be sorted by 
date with a separate btree index, or by relevance with a RUM index. 
However, getting the total number of results is more difficult. With 
text search indexes, it requires a bitmap heap scan, which can be rather 
slow due to obligatory heap access. A well-known hack for this is using 
the approximate data from 'explain' results. The proposed change allows 
the user to obtain the precise number of results in an efficient and 
idiomatic manner.

The performance of this approach was tested on an archive of the 
pgsql-hackers mailing list. The detailed results for two sample queries 
can be found in the attached file 'benchmark.txt'. The first test 
demonstrates full-text search with a RUM index, ordering the results by 
rank. For a query with low selectivity, getting the top results is much 
faster than counting them all with a bitmap heap scan. With the bitmap 
count execution plan, the results can be counted much faster. A similar 
test is done with a GIN index, where the results are restricted and 
ordered by date using another btree index. Again, it shows a significant 
speedup for the count(*) query with a bitmap count scan compared to a 
bitmap heap scan. These results demonstrate that the bitmap count plan 
can indeed be useful for full-text search scenarios.

Structurally, the patch consists of two major parts: a specialized 
executor node and the generation of corresponding paths and plans. The 
optimizer behavior here is similar to that of the min/max aggregate 
optimization. The main entry point is preprocess_count_aggs(); it is 
called by grouping_planner(). It searches for a count(*) expression in 
the target list and, if possible, generates a special path for it 
(struct BitmapCountPath). This path is then added to the 
UPPERREL_GROUP_AGG upperrel, and competes with other paths at the 
further stages of planning.

The executor node (nodeBitmapCount.c) is similar to the bitmap heap scan 
node, with the main difference being that it does not access the heap 
for tuples that are known to satisfy the restriction and to be visible 
to all transactions.
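
In simplified terms, the per-page decision in the new node is roughly the
following (a sketch, not the actual nodeBitmapCount.c code; variable
setup is omitted and count_heap_tuples() is a hypothetical helper that
fetches the heap page and rechecks the qual):

	while ((tbmres = tbm_iterate(tbmiterator)) != NULL)
	{
		if (!tbmres->recheck &&
			tbmres->ntuples >= 0 &&		/* exact page, not lossy */
			VM_ALL_VISIBLE(rel, tbmres->blockno, &vmbuffer))
		{
			/* every TID on this page matches and is visible: count from the bitmap */
			count += tbmres->ntuples;
		}
		else
		{
			/* lossy or not all-visible: fall back to heap access */
			count += count_heap_tuples(rel, tbmres);
		}
	}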

This patch has some important limitations:
* It only supports a targetlist consisting of a single expression that 
can be projected from count(*).
* count(expr) is not supported. We could support it for cases where the 
"expr is not null" restriction can be satisfied with an index.
* The current implementation does not support parallel execution. It 
could be implemented during the PostgreSQL 11 release cycle.
* For some indexes, the bitmap index scan will always require rechecking 
all the tuples. Bitmap count plans should not be used in such cases. For 
now, this check is not implemented.

I would be glad to hear your comments on this patch.

Regards,
Alexander Kuzmenkov

===
= The data

test1=# \d pglist
   Table "public.pglist"
   Column   |Type | Collation | Nullable | Default 
+-+---+--+-
 id | integer |   |  | 
 sent   | timestamp without time zone |   |  | 
 subject| text|   |  | 
 author | text|   |  | 
 body_plain | text|   |  | 
 fts| tsvector|   |  | 
Indexes:
"idx_pglist_rum_fts" rum (fts)
"idx_pglist_fts" gin (fts)
"idx_pglist_sent" btree (sent)


test1=# select min(sent), max(sent), count(*) from pglist;
 min | max |  count  
-+-+-
 1997-06-