Re: [HACKERS] WIP: Aggregation push-down

2020-02-07 Thread Antonin Houska
Richard Guo  wrote:

> Hi,
> 
> I've been looking at the 'make_join_rel()' part of the patch, and I see
> that, if we are joining A and B, a 'grouped join rel (AB)' is created
> alongside the 'plain join rel (AB)', and paths are added by 1) applying
> partial aggregation to the paths of the 'plain join rel (AB)', or 2)
> joining grouped A to plain B, or plain A to grouped B.
> 
> This is a smart idea. One issue I can see is that some logic would have
> to be repeated several times. For example, the validity check for the
> same proposed join would be performed at most three times by
> join_is_legal().
> 
> I'm thinking of another idea: instead of using a separate
> RelOptInfo for the grouped rel, we add to RelOptInfo a
> 'grouped_pathlist' for the grouped paths, just like how parallel query
> is implemented via 'partial_pathlist'.
> 
> For a base rel, if we decide it can produce grouped paths, we create the
> grouped paths by applying partial aggregation to the paths in 'pathlist'
> and add them to 'grouped_pathlist'.
> 
> For a join rel (AB), we can create the grouped paths for it by:
> 1) joining a grouped path from 'A->grouped_pathlist' to a plain path
> from 'B->pathlist', or vice versa;
> 2) applying partial aggregation to the paths in '(AB)->pathlist'.
> 
> This is basically the same idea; it just moves the grouped paths from the
> grouped rel to a 'grouped_pathlist'. With it we should not need to make
> any code changes to 'make_join_rel()'. Most code changes would happen in
> 'add_paths_to_joinrel()'.
> 
> Will this idea work? Is it better or worse?

I tried such an approach in an earlier version of the patch [1], and I think
the reason also was to avoid repeated calls of functions like
join_is_legal(). However, there were objections to that approach,
e.g. [2], and I admit that it was more invasive than what the current patch
version does.

Perhaps we can cache the result of join_is_legal() that we get for the plain
relation, and use it for the grouped relation. I'll consider that. Thanks.

[1] https://www.postgresql.org/message-id/18007.1513957437%40localhost
[2] 
https://www.postgresql.org/message-id/CA%2BTgmob8og%2B9HzMg1vM%2B3LwDm2f_bHUi9%2Bg1bqLDTjqpw5s%2BnQ%40mail.gmail.com
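
A minimal sketch of what such caching might look like (illustrative only; all
names here are hypothetical, and join_is_legal() with its out-parameters is a
static function in joinrels.c, where such a cache would have to live):

typedef struct JoinLegalityCache
{
	Relids		joinrelids;		/* bms_union() of the two input relids */
	bool		legal;			/* cached join_is_legal() verdict */
	SpecialJoinInfo *sjinfo;	/* cached out-parameters */
	bool		reversed;
} JoinLegalityCache;

static bool
join_is_legal_cached(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
					 Relids joinrelids, JoinLegalityCache *cache)
{
	/* Reuse the verdict computed earlier for the same set of relids. */
	if (cache->joinrelids != NULL && bms_equal(cache->joinrelids, joinrelids))
		return cache->legal;

	cache->joinrelids = joinrelids;
	cache->legal = join_is_legal(root, rel1, rel2, joinrelids,
								 &cache->sjinfo, &cache->reversed);
	return cache->legal;
}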

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: [HACKERS] WIP: Aggregation push-down

2020-02-06 Thread Richard Guo
Hi,

I've been looking at the 'make_join_rel()' part of the patch, and I see
that, if we are joining A and B, a 'grouped join rel (AB)' is created
alongside the 'plain join rel (AB)', and paths are added by 1) applying
partial aggregation to the paths of the 'plain join rel (AB)', or 2)
joining grouped A to plain B, or plain A to grouped B.

This is a smart idea. One issue I can see is that some logic would have
to be repeated several times. For example, the validity check for the
same proposed join would be performed at most three times by
join_is_legal().

I'm thinking of another idea: instead of using a separate
RelOptInfo for the grouped rel, we add to RelOptInfo a
'grouped_pathlist' for the grouped paths, just like how parallel query
is implemented via 'partial_pathlist'.

For a base rel, if we decide it can produce grouped paths, we create the
grouped paths by applying partial aggregation to the paths in 'pathlist'
and add them to 'grouped_pathlist'.

For a join rel (AB), we can create the grouped paths for it by:
1) joining a grouped path from 'A->grouped_pathlist' to a plain path
from 'B->pathlist', or vice versa;
2) applying partial aggregation to the paths in '(AB)->pathlist'.

This is basically the same idea; it just moves the grouped paths from the
grouped rel to a 'grouped_pathlist'. With it we should not need to make
any code changes to 'make_join_rel()'. Most code changes would happen in
'add_paths_to_joinrel()'.
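
A minimal sketch of the proposed change (abridged; only grouped_pathlist is
new, the neighboring fields already exist in pathnodes.h):

struct RelOptInfo
{
	/* ... many existing fields elided ... */
	List	   *pathlist;			/* plain Path structures */
	List	   *partial_pathlist;	/* partial Paths, for parallel query */
	List	   *grouped_pathlist;	/* proposed: partially aggregated Paths */
	/* ... */
};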

Will this idea work? Is it better or worse?

Thanks
Richard


Re: [HACKERS] WIP: Aggregation push-down

2020-01-16 Thread Antonin Houska
Tomas Vondra  wrote:

> Hi,
> 
> I've been looking at the last version (v14) of this patch series,
> submitted way back in July and unfortunately quiet since then. Antonin
> is probably right that one of the reasons for the lack of reviews is that
> it requires interest in/experience with the planner.
> 
> Initially it was also a bit hard to understand what the benefits are
> (and the patch shifted a bit), which is now mostly addressed by the
> README in the last patch. The trouble is that's hidden in the patch and
> so not immediately obvious to people considering reviewing it :-( Tom
> posted a nice summary in November 2018, but it was perhaps focused on
> the internals.
> 
> So my first suggestion is to write a short summary with examples of how it
> affects practical queries (plan changes, speedups) etc.

I think the README plus the regression test output should be enough for
someone who is about to review a patch as complex as this.

> My second suggestion is to have meaningful commit messages for each part
> of the patch series, explaining why we need that particular part. It
> might have been explained somewhere in the thread, but as a reviewer I
> really don't want to reverse-engineer the whole thread.

ok, done.

> Now, regarding individual parts of the patch:
> 
> 
> 1) v14-0001-Introduce-RelInfoList-structure.patch
> -
> 
> - I'm not entirely sure why we need this change. We had the list+hash
>   before, so I assume we do this because we need the output functions?

I believe that this is what Tom proposed in [1], see "Maybe an appropriate
preliminary patch is to refactor the support code ..." near the end of that
message. The point is that now we need two lists: one for "plain" relations
and one for grouped ones.
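
For reference, the refactored structure as the 0001 patch defines it
(reconstructed from the diffs quoted later in this digest; comments are
mine):

/*
 * A List of relation-specific planner nodes plus an optional hash index
 * over it; the hash table is built lazily once the list grows long.
 */
typedef struct RelInfoList
{
	NodeTag		type;
	List	   *items;			/* the nodes themselves */
	struct HTAB *hash;			/* optional hash index, or NULL */
} RelInfoList;

typedef struct RelInfoEntry
{
	Relids		relids;			/* hash key --- MUST BE FIRST */
	void	   *data;			/* the same node as stored in 'items' */
} RelInfoEntry;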

> - The RelInfoList naming is pretty confusing, because it's actually not
>   a list but a combination of list+hash. And the embedded list is called
>   items, to make it yet a bit more confusing. I suggest we call this
>   just RelInfo or RelInfoLookup or something else not including List.

I think it can be considered a list by the caller of add_join_rel() etc. The
hash table is hidden from the user and is empty until the list becomes long
enough. The word "list" in the structure name may just indicate that new
elements can be added to the end, which shouldn't be expected if the structure
were an array.

I searched a bit in tools/pgindent/typedefs.list and found at least a few
structures that also have "list" in the name but are not lists internally:
CatCList, FuncCandidateList, MCVList.

Actually I'm not opposed to renaming the structure, but I don't have a better
idea right now. As for *Lookup: the following is the only use of such a
structure name in the PG code, and it's not something to store data in:

typedef enum
{
	IDENTIFIER_LOOKUP_NORMAL,	/* normal processing of var names */
	IDENTIFIER_LOOKUP_DECLARE,	/* In DECLARE --- don't look up names */
	IDENTIFIER_LOOKUP_EXPR		/* In SQL expression --- special case */
} IdentifierLookup;

> - RelInfoEntry seems severely under-documented, particularly the data
>   part (which is void*, making it even harder to understand what it's for
>   without having to read the functions). AFAICS it's simply a link to the
>   embedded list, but maybe I'm wrong.

This is just JoinHashEntry (which was also undocumented) renamed. I've added a
short comment now.

> - I suggest we move find_join_rel/add_rel_info/add_join_rel in relnode.c
>   right before build_join_rel. This makes diff clearer/smaller and
>   visual diffs nicer.

Hm, it might improve readability of the diff, but this API is exactly where I
still need feedback from Tom. I'm not eager to optimize the diff as long as
there's a risk that these functions will be removed or renamed.

> 2) v14-0002-Introduce-make_join_rel_common-function.patch
> -
> 
> I see no point in keeping this change in a separate patch, as it pretty
> much just renames make_join_rel to make_join_rel_common and then adds
> 
>   make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
>   {
>   return make_join_rel_common(root, rel1, rel2);
>   }
> 
> which seems rather trivial and useless on its own. I'd just merge it
> into 0003 where we use the _common function more extensively.

ok. I thought this improved readability of the diffs, but it doesn't look
that bad if it's included in 0003.

> 
> 3) v14-0003-Aggregate-push-down-basic-functionality.patch
> -
> 
> I haven't done much review on this yet, but I've done some testing and
> benchmarking so let me share at least that. The queries I used were of
> the type I mentioned earlier in this thread - essentially a star schema,
> i.e. fact table referencing dimensions, with aggregation per columns in
> the dimension. So something like
> 
>   SELECT d.c, sum(f) FROM fact JOIN dimension d ON (d.id = f.d_id)
>   GROUP BY d.c;

Re: [HACKERS] WIP: Aggregation push-down

2020-01-10 Thread Tomas Vondra

Hi,

I've been looking at the last version (v14) of this patch series,
submitted way back in July and unfortunately quiet since then. Antonin
is probably right that one of the reasons for the lack of reviews is that
it requires interest in/experience with the planner.

Initially it was also a bit hard to understand what the benefits are
(and the patch shifted a bit), which is now mostly addressed by the
README in the last patch. The trouble is that's hidden in the patch and
so not immediately obvious to people considering reviewing it :-( Tom
posted a nice summary in November 2018, but it was perhaps focused on
the internals.

So my first suggestion is to write a short summary with examples of how it
affects practical queries (plan changes, speedups) etc.

My second suggestion is to have meaningful commit messages for each part
of the patch series, explaining why we need that particular part. It
might have been explained somewhere in the thread, but as a reviewer I
really don't want to reverse-engineer the whole thread.


Now, regarding individual parts of the patch:


1) v14-0001-Introduce-RelInfoList-structure.patch
-

- I'm not entirely sure why we need this change. We had the list+hash
  before, so I assume we do this because we need the output functions?

- The RelInfoList naming is pretty confusing, because it's actually not
  a list but a combination of list+hash. And the embedded list is called
  items, to make it yet a bit more confusing. I suggest we call this
  just RelInfo or RelInfoLookup or something else not including List.

- RelInfoEntry seems severely under-documented, particularly the data
  part (which is void*, making it even harder to understand what it's for
  without having to read the functions). AFAICS it's simply a link to the
  embedded list, but maybe I'm wrong.

- I suggest we move find_join_rel/add_rel_info/add_join_rel in relnode.c
  right before build_join_rel. This makes diff clearer/smaller and
  visual diffs nicer.


2) v14-0002-Introduce-make_join_rel_common-function.patch
-

I see no point in keeping this change in a separate patch, as it pretty
much just renames make_join_rel to make_join_rel_common and then adds


  make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
  {
  return make_join_rel_common(root, rel1, rel2);
  }

which seems rather trivial and useless on its own. I'd just merge it
into 0003 where we use the _common function more extensively.


3) v14-0003-Aggregate-push-down-basic-functionality.patch
-

I haven't done much review on this yet, but I've done some testing and
benchmarking so let me share at least that. The queries I used were of
the type I mentioned earlier in this thread - essentially a star schema,
i.e. fact table referencing dimensions, with aggregation per columns in
the dimension. So something like

  SELECT d.c, sum(f) FROM fact JOIN dimension d ON (d.id = f.d_id)
  GROUP BY d.c;

where "fact" table is much much larger than the dimension.

Attached is a script I used for testing with a bunch of queries and a
number of parameters (parallelism, size of dimension, size of fact, ...)
and a spreadsheet summarizing the results.

Overall, the results seem pretty good - in many cases the queries get
about 2x faster and I haven't seen any regressions. That's pretty nice.

One annoying thing is that this only works for non-parallel queries.
That is, I saw no improvements with parallel queries. It was still
faster than the non-parallel query with aggregate pushdown, but the
parallel query did not improve.

An example of timing looks like this:

  non-parallel (pushdown = off): 918 ms
  non-parallel (pushdown = on):  561 ms
  parallel     (pushdown = off): 313 ms
  parallel     (pushdown = on):  313 ms

The non-parallel query gets faster because after enabling push-down the
plan changes (and we get partial aggregate below the join). With
parallel query that does not happen, the plans stay the same.

I'm not claiming this is a bug - we end up picking the fastest execution
plan (i.e. we don't make a mistake by e.g. switching to the non-parallel
one with pushdown).

My understanding is that the aggregate pushdown can't (currently?) be
used with queries that use partial aggregate because of parallelism.
That is, we can either end up with a plan like this:

  Finalize Aggregate
    -> Join
         -> Partial Aggregate
         -> ...

or a parallel plan like this:

  Finalize Aggregate
    -> Gather
         -> Partial Aggregate
              -> Join
                   -> ...
                   -> ...

but we currently don't support a combination of both:

  Finalize Aggregate
    -> Gather
         -> Join
              -> Partial Aggregate
              -> ...

I wonder if that's actually possible and what it would take to make it
work.


regards

--
Tomas Vondra  

Re: [HACKERS] WIP: Aggregation push-down

2019-09-05 Thread Antonin Houska
Alvaro Herrera  wrote:

> This stuff seems very useful.  How come it sits unreviewed for so long?

I think the review is hard for people who are not much interested in the
planner. And as for further development, there are a few design decisions
that can hardly be resolved without Tom Lane's comments. Right now I recall
two problems: 1) is the way I currently store RelOptInfo for the grouped
relations correct? 2) how should we handle types for which logical equality
does not imply physical (byte-wise) equality?

Fortunately it seems now that I'm not the only one who cares about 2), so this
problem might be resolved soon:

https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com

But 1) still remains.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com




Re: [HACKERS] WIP: Aggregation push-down

2019-09-04 Thread Alvaro Herrera
This stuff seems very useful.  How come it sits unreviewed for so long?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] WIP: Aggregation push-down

2019-07-17 Thread Antonin Houska
Richard Guo  wrote:

> Another core dump for query below:
> 
> select sum(t1.s1) from t1, t2, t3, t4 where t1.j1 = t2.j2 group by t1.g1, 
> t2.g2;
> 
> This is due to a small mistake:
> 
> diff --git a/src/backend/optimizer/util/relnode.c 
> b/src/backend/optimizer/util/relnode.c
> index 10becc0..9c2deed 100644
> --- a/src/backend/optimizer/util/relnode.c
> +++ b/src/backend/optimizer/util/relnode.c
> @@ -648,7 +648,7 @@ void
>  add_grouped_rel(PlannerInfo *root, RelOptInfo *rel, RelAggInfo *agg_info)
>  {
> add_rel_info(root->grouped_rel_list, rel);
> -   add_rel_info(root->agg_info_list, rel);
> +   add_rel_info(root->agg_info_list, agg_info);
>  }
> 

Thanks! Yes, this is what I had to fix.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

From b2672f2b06f521fdc11f7e7293ab669ca582061b Mon Sep 17 00:00:00 2001
From: Antonin Houska 
Date: Wed, 17 Jul 2019 16:31:32 +0200
Subject: [PATCH 1/3] Introduce RelInfoList structure.

---
 contrib/postgres_fdw/postgres_fdw.c|   3 +-
 src/backend/nodes/outfuncs.c   |  11 +++
 src/backend/optimizer/geqo/geqo_eval.c |  12 +--
 src/backend/optimizer/plan/planmain.c  |   3 +-
 src/backend/optimizer/util/relnode.c   | 157 -
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/pathnodes.h  |  28 --
 7 files changed, 136 insertions(+), 79 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 033aeb2556..90414f1168 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -5205,7 +5205,8 @@ foreign_join_ok(PlannerInfo *root, RelOptInfo *joinrel, JoinType jointype,
 	 */
 	Assert(fpinfo->relation_index == 0);	/* shouldn't be set yet */
 	fpinfo->relation_index =
-		list_length(root->parse->rtable) + list_length(root->join_rel_list);
+		list_length(root->parse->rtable) +
+		list_length(root->join_rel_list->items);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8e31fae47f..01745ff879 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2279,6 +2279,14 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 }
 
 static void
+_outRelInfoList(StringInfo str, const RelInfoList *node)
+{
+	WRITE_NODE_TYPE("RELOPTINFOLIST");
+
+	WRITE_NODE_FIELD(items);
+}
+
+static void
 _outIndexOptInfo(StringInfo str, const IndexOptInfo *node)
 {
 	WRITE_NODE_TYPE("INDEXOPTINFO");
@@ -4052,6 +4060,9 @@ outNode(StringInfo str, const void *obj)
 			case T_RelOptInfo:
 _outRelOptInfo(str, obj);
 break;
+			case T_RelInfoList:
+_outRelInfoList(str, obj);
+break;
 			case T_IndexOptInfo:
 _outIndexOptInfo(str, obj);
 break;
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 7b67a29c88..5fca9814a2 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -92,11 +92,11 @@ geqo_eval(PlannerInfo *root, Gene *tour, int num_gene)
 	 *
 	 * join_rel_level[] shouldn't be in use, so just Assert it isn't.
 	 */
-	savelength = list_length(root->join_rel_list);
-	savehash = root->join_rel_hash;
+	savelength = list_length(root->join_rel_list->items);
+	savehash = root->join_rel_list->hash;
 	Assert(root->join_rel_level == NULL);
 
-	root->join_rel_hash = NULL;
+	root->join_rel_list->hash = NULL;
 
 	/* construct the best path for the given combination of relations */
 	joinrel = gimme_tree(root, tour, num_gene);
@@ -121,9 +121,9 @@ geqo_eval(PlannerInfo *root, Gene *tour, int num_gene)
 	 * Restore join_rel_list to its former state, and put back original
 	 * hashtable if any.
 	 */
-	root->join_rel_list = list_truncate(root->join_rel_list,
-		savelength);
-	root->join_rel_hash = savehash;
+	root->join_rel_list->items = list_truncate(root->join_rel_list->items,
+			   savelength);
+	root->join_rel_list->hash = savehash;
 
 	/* release all the memory acquired within gimme_tree */
 	MemoryContextSwitchTo(oldcxt);
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index 2dbf1db844..0bc8a6 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -65,8 +65,7 @@ query_planner(PlannerInfo *root,
 	 * NOTE: append_rel_list was set up by subquery_planner, so do not touch
 	 * here.
 	 */
-	root->join_rel_list = NIL;
-	root->join_rel_hash = NULL;
+	root->join_rel_list = makeNode(RelInfoList);
 	root->join_rel_level = NULL;
 	root->join_cur_level = 0;
 	root->canon_pathkeys = NIL;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 6054bd2b53..c238dd6538 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -31,11 +31,11 @@
 #include "utils/hsearch.h"
 
 
-typedef struct JoinHashEntry
+typedef struct RelInfoEntry
 {
-	Relids		join_relids;	/* hash key --- MUST BE 

Re: [HACKERS] WIP: Aggregation push-down

2019-07-15 Thread Richard Guo
On Fri, Jul 12, 2019 at 4:42 PM Antonin Houska  wrote:

> Richard Guo  wrote:
>
> > I didn't fully follow the whole thread and mainly looked into the latest
> > patch set. So what are the considerations for abandoning the aggmultifn
> > concept?
>
> Originally the function was there to support a join where both relations are
> grouped. My doubts about the usefulness of such a join started at [1]. (See
> the thread referenced from [2].)
>
> > In my opinion, aggmultifn would enable us to do a lot more
> > types of transformation. For example, consider the query below:
> >
> > select sum(foo.c) from foo join bar on foo.b = bar.b group by foo.a,
> > bar.a;
> >
> > With the latest patch, the plan looks like:
> >
> > Finalize HashAggregate                      <-- sum(psum)
> >   Group Key: foo.a, bar.a
> >   ->  Hash Join
> >         Hash Cond: (bar.b = foo.b)
> >         ->  Seq Scan on bar
> >         ->  Hash
> >               ->  Partial HashAggregate     <-- sum(foo.c) as psum
> >                     Group Key: foo.a, foo.b
> >                     ->  Seq Scan on foo
> >
> >
> > If we have aggmultifn, we can perform the query this way:
> >
> > Finalize HashAggregate                      <-- sum(foo.c)*cnt
> >   Group Key: foo.a, bar.a
> >   ->  Hash Join
> >         Hash Cond: (foo.b = bar.b)
> >         ->  Seq Scan on foo
> >         ->  Hash
> >               ->  Partial HashAggregate     <-- count(*) as cnt
> >                     Group Key: bar.a, bar.b
> >                     ->  Seq Scan on bar
>
> Perhaps, but that would require changes to nodeAgg.c, which I want to
> avoid for now.
>
> > And this way:
> >
> > Finalize HashAggregate                      <-- sum(psum)*cnt
> >   Group Key: foo.a, bar.a
> >   ->  Hash Join
> >         Hash Cond: (foo.b = bar.b)
> >         ->  Partial HashAggregate           <-- sum(foo.c) as psum
> >               Group Key: foo.a, foo.b
> >               ->  Seq Scan on foo
> >         ->  Hash
> >               ->  Partial HashAggregate     <-- count(*) as cnt
> >                     Group Key: bar.a, bar.b
> >                     ->  Seq Scan on bar
>
> This looks like my idea presented in [1], for which there seems to be no
> common use case.
>
> > Another question I have is about the function add_grouped_path(), when
> > creating a sorted aggregation path on top of a subpath. If the subpath is
> > not sorted, then the sorted aggregation path is not generated. Why not,
> > in this case, create a sort path on top of the subpath first and then
> > create a group aggregation path on top of the sort path?
>
> I see no reason not to do it. (If you want to try, just go ahead.) However,
> the current patch version brings only the basic functionality and I'm not
> going to add new functionality (such as parallel aggregation, partitioned
> tables or postgres_fdw) until the open design questions are resolved.
> Otherwise there's a risk that the additional work will be wasted due to a
> major rework of the core functionality.
>
> > Core dump when running one query in agg_pushdown.sql
>
> Thanks for the report! Fixed in the new version.
>

Another core dump for the query below:

select sum(t1.s1) from t1, t2, t3, t4 where t1.j1 = t2.j2 group by t1.g1,
t2.g2;

This is due to a small mistake:

diff --git a/src/backend/optimizer/util/relnode.c
b/src/backend/optimizer/util/relnode.c
index 10becc0..9c2deed 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -648,7 +648,7 @@ void
 add_grouped_rel(PlannerInfo *root, RelOptInfo *rel, RelAggInfo *agg_info)
 {
add_rel_info(root->grouped_rel_list, rel);
-   add_rel_info(root->agg_info_list, rel);
+   add_rel_info(root->agg_info_list, agg_info);
 }

Thanks
Richard


Re: [HACKERS] WIP: Aggregation push-down

2019-07-11 Thread Richard Guo
On Tue, Jul 9, 2019 at 9:47 PM Antonin Houska  wrote:

> Richard Guo  wrote:
>
> > Another rebase is needed for the patches.
>
> Done.
>


I didn't fully follow the whole thread and mainly looked into the latest
patch set. So what are the considerations for abandoning the aggmultifn
concept? In my opinion, aggmultifn would enable us to do a lot more
types of transformation. For example, consider the query below:

select sum(foo.c) from foo join bar on foo.b = bar.b group by foo.a, bar.a;

With the latest patch, the plan looks like:

Finalize HashAggregate                      <-- sum(psum)
  Group Key: foo.a, bar.a
  ->  Hash Join
        Hash Cond: (bar.b = foo.b)
        ->  Seq Scan on bar
        ->  Hash
              ->  Partial HashAggregate     <-- sum(foo.c) as psum
                    Group Key: foo.a, foo.b
                    ->  Seq Scan on foo


If we have aggmultifn, we can perform the query this way:

Finalize HashAggregate                      <-- sum(foo.c)*cnt
  Group Key: foo.a, bar.a
  ->  Hash Join
        Hash Cond: (foo.b = bar.b)
        ->  Seq Scan on foo
        ->  Hash
              ->  Partial HashAggregate     <-- count(*) as cnt
                    Group Key: bar.a, bar.b
                    ->  Seq Scan on bar


And this way:

Finalize HashAggregate                      <-- sum(psum)*cnt
  Group Key: foo.a, bar.a
  ->  Hash Join
        Hash Cond: (foo.b = bar.b)
        ->  Partial HashAggregate           <-- sum(foo.c) as psum
              Group Key: foo.a, foo.b
              ->  Seq Scan on foo
        ->  Hash
              ->  Partial HashAggregate     <-- count(*) as cnt
                    Group Key: bar.a, bar.b
                    ->  Seq Scan on bar


Another question I have is about the function add_grouped_path(), when
creating a sorted aggregation path on top of a subpath. If the subpath is not
sorted, then the sorted aggregation path is not generated. Why not, in this
case, create a sort path on top of the subpath first and then create a group
aggregation path on top of the sort path?
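
A minimal sketch of what I mean, using existing planner helpers (the variable
names and the surrounding add_grouped_path() context are assumptions about
the patch):

/* If the subpath isn't sorted on the grouping pathkeys, sort it first,
 * then build the sorted partial aggregation on top of the sort. */
if (!pathkeys_contained_in(group_pathkeys, subpath->pathkeys))
	subpath = (Path *) create_sort_path(root, rel, subpath,
										group_pathkeys, -1.0);

add_path(rel, (Path *) create_agg_path(root, rel, subpath, target,
									   AGG_SORTED, AGGSPLIT_INITIAL_SERIAL,
									   group_clauses, NIL, agg_costs,
									   num_groups));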


Core dump when running one query in agg_pushdown.sql

EXPLAIN ANALYZE
SELECT p.x, avg(c1.v) FROM agg_pushdown_parent AS p JOIN agg_pushdown_child1
AS c1 ON c1.parent = p.i GROUP BY p.i;


#0  0x006def98 in CheckVarSlotCompatibility (slot=0x0, attnum=1,
vartype=23) at execExprInterp.c:1850
#1  0x006def5d in CheckExprStillValid (state=0x2b63a28,
econtext=0x2ba4958) at execExprInterp.c:1814
#2  0x006dee38 in ExecInterpExprStillValid (state=0x2b63a28,
econtext=0x2ba4958, isNull=0x7fff7cd16a37) at execExprInterp.c:1763
#3  0x007144dd in ExecEvalExpr (state=0x2b63a28,
econtext=0x2ba4958, isNull=0x7fff7cd16a37)
at ../../../src/include/executor/executor.h:288
#4  0x00715475 in ExecIndexEvalRuntimeKeys (econtext=0x2ba4958,
runtimeKeys=0x2b63910, numRuntimeKeys=1) at nodeIndexscan.c:630
#5  0x0071533b in ExecReScanIndexScan (node=0x2b62bf8) at
nodeIndexscan.c:568
#6  0x006d4ce6 in ExecReScan (node=0x2b62bf8) at execAmi.c:182
#7  0x007152a0 in ExecIndexScan (pstate=0x2b62bf8) at
nodeIndexscan.c:530


This is really a cool feature. Thank you for working on this.

Thanks
Richard


Re: [HACKERS] WIP: Aggregation push-down

2019-07-09 Thread Antonin Houska
Richard Guo  wrote:

> Another rebase is needed for the patches.

Done.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

From f656bd8d46afb9cb0a331cf3d34b9eed39f5e360 Mon Sep 17 00:00:00 2001
From: Antonin Houska 
Date: Tue, 9 Jul 2019 15:30:13 +0200
Subject: [PATCH 1/3] Introduce RelInfoList structure.

---
 contrib/postgres_fdw/postgres_fdw.c|   3 +-
 src/backend/nodes/outfuncs.c   |  11 +++
 src/backend/optimizer/geqo/geqo_eval.c |  12 +--
 src/backend/optimizer/plan/planmain.c  |   3 +-
 src/backend/optimizer/util/relnode.c   | 157 -
 src/include/nodes/nodes.h  |   1 +
 src/include/nodes/pathnodes.h  |  28 --
 7 files changed, 136 insertions(+), 79 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 033aeb2556..90414f1168 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -5205,7 +5205,8 @@ foreign_join_ok(PlannerInfo *root, RelOptInfo *joinrel, JoinType jointype,
 	 */
 	Assert(fpinfo->relation_index == 0);	/* shouldn't be set yet */
 	fpinfo->relation_index =
-		list_length(root->parse->rtable) + list_length(root->join_rel_list);
+		list_length(root->parse->rtable) +
+		list_length(root->join_rel_list->items);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 8400dd319e..4529b5c63b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2278,6 +2278,14 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_NODE_FIELD(partitioned_child_rels);
 }
 
+static void
+_outRelInfoList(StringInfo str, const RelInfoList *node)
+{
+	WRITE_NODE_TYPE("RELOPTINFOLIST");
+
+	WRITE_NODE_FIELD(items);
+}
+
 static void
 _outIndexOptInfo(StringInfo str, const IndexOptInfo *node)
 {
@@ -4052,6 +4060,9 @@ outNode(StringInfo str, const void *obj)
 			case T_RelOptInfo:
 _outRelOptInfo(str, obj);
 break;
+			case T_RelInfoList:
+_outRelInfoList(str, obj);
+break;
 			case T_IndexOptInfo:
 _outIndexOptInfo(str, obj);
 break;
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index 6c69c1c147..c69f3469ba 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -92,11 +92,11 @@ geqo_eval(PlannerInfo *root, Gene *tour, int num_gene)
 	 *
 	 * join_rel_level[] shouldn't be in use, so just Assert it isn't.
 	 */
-	savelength = list_length(root->join_rel_list);
-	savehash = root->join_rel_hash;
+	savelength = list_length(root->join_rel_list->items);
+	savehash = root->join_rel_list->hash;
 	Assert(root->join_rel_level == NULL);
 
-	root->join_rel_hash = NULL;
+	root->join_rel_list->hash = NULL;
 
 	/* construct the best path for the given combination of relations */
 	joinrel = gimme_tree(root, tour, num_gene);
@@ -121,9 +121,9 @@ geqo_eval(PlannerInfo *root, Gene *tour, int num_gene)
 	 * Restore join_rel_list to its former state, and put back original
 	 * hashtable if any.
 	 */
-	root->join_rel_list = list_truncate(root->join_rel_list,
-		savelength);
-	root->join_rel_hash = savehash;
+	root->join_rel_list->items = list_truncate(root->join_rel_list->items,
+			   savelength);
+	root->join_rel_list->hash = savehash;
 
 	/* release all the memory acquired within gimme_tree */
 	MemoryContextSwitchTo(oldcxt);
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index 2dbf1db844..0bc8a6 100644
--- a/src/backend/optimizer/plan/planmain.c
+++ b/src/backend/optimizer/plan/planmain.c
@@ -65,8 +65,7 @@ query_planner(PlannerInfo *root,
 	 * NOTE: append_rel_list was set up by subquery_planner, so do not touch
 	 * here.
 	 */
-	root->join_rel_list = NIL;
-	root->join_rel_hash = NULL;
+	root->join_rel_list = makeNode(RelInfoList);
 	root->join_rel_level = NULL;
 	root->join_cur_level = 0;
 	root->canon_pathkeys = NIL;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 6054bd2b53..c238dd6538 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -31,11 +31,11 @@
 #include "utils/hsearch.h"
 
 
-typedef struct JoinHashEntry
+typedef struct RelInfoEntry
 {
-	Relids		join_relids;	/* hash key --- MUST BE FIRST */
-	RelOptInfo *join_rel;
-} JoinHashEntry;
+	Relids		relids;			/* hash key --- MUST BE FIRST */
+	void	   *data;
+} RelInfoEntry;
 
 static void build_joinrel_tlist(PlannerInfo *root, RelOptInfo *joinrel,
 RelOptInfo *input_rel);
@@ -375,11 +375,11 @@ find_base_rel(PlannerInfo *root, int relid)
 }
 
 /*
- * build_join_rel_hash
- *	  Construct the auxiliary hash table for join relations.
+ * build_rel_hash
+ *	  Construct the auxiliary hash table for relation specific data.
  */
 static void
-build_join_rel_hash(PlannerInfo *root)
+build_rel_hash(RelInfoList *list)
 {
 	HTAB	   *hashtab;
 	HASHCTL		hash_ctl;
@@ -388,47 

Re: [HACKERS] WIP: Aggregation push-down

2019-07-03 Thread Richard Guo
On Fri, Mar 1, 2019 at 12:01 AM Antonin Houska  wrote:

> Tom Lane  wrote:
>
> > Antonin Houska  writes:
> > > Michael Paquier  wrote:
> > >> Latest patch set fails to apply, so moved to next CF, waiting on
> > >> author.
> >
> > > Rebased.
> >
> > This is in need of rebasing again :-(.  I went ahead and pushed the 001
> > part, since that seemed fairly uncontroversial.
>
> ok, thanks.
>


Another rebase is needed for the patches.

Thanks
Richard


Re: [HACKERS] WIP: Aggregation push-down

2019-02-28 Thread Antonin Houska
Tom Lane  wrote:

> Antonin Houska  writes:
> > Michael Paquier  wrote:
> >> Latest patch set fails to apply, so moved to next CF, waiting on
> >> author.
> 
> > Rebased.
> 
> This is in need of rebasing again :-(.  I went ahead and pushed the 001
> part, since that seemed fairly uncontroversial.

ok, thanks.

> I did not spend a whole lot of time looking at the patch today, but
> I'm still pretty distressed at the data structures you've chosen.
> I remain of the opinion that a grouped relation and a base relation
> are, er, unrelated, even if they happen to share the same relid set.
> So I do not see the value of the RelOptInfoSet struct you propose here,

ok. As you suggested upthread, I'm now trying to reuse the join_rel_list /
join_rel_hash structures, see v11-001-Introduce_RelInfoList.patch.

> and I definitely don't think there's any value in having, eg,
> create_seqscan_path or create_index_path dealing with this stuff.

Originally I tried to aggregate any path that ever gets passed to add_path(),
but that's probably not worth the code complexity. Now the partial aggregation
is only applied to paths that survived add_path() on the plain relation.

> I also don't like changing create_nestloop_path et al to take a PathTarget
> rather than using the RelOptInfo's pathtarget; IMO, it's flat out wrong
> for a path to generate a tlist different from what its parent RelOptInfo
> says that the relation produces.

Likewise, the current patch version is less invasive, so create_nestloop_path
et al are not touched at all.

> I think basically the way this ought to work is that we generate baserel
> paths pretty much the same as today, and then we run through the baserels
> and see which ones are potentially worth doing partial aggregation on,
> and for each one that is, we create a separate "upper relation" RelOptInfo
> describing that.  The paths for this RelOptInfo would be
> partial-aggregation paths using paths from the corresponding baserel as
> input.  Then we'd run a join search that only considers joining grouped
> rels with plain rels (I concur that joining two grouped rels is not worth
> coping with, at least for now).

make_join_rel() is the core of my implementation: besides joining two plain
relations, it tries to join a plain relation to a grouped one, and also to
aggregate the join of the two plain relations. I consider this approach less
invasive and more efficient than running the whole standard_join_search again
for the grouped rels.

The problem of numeric-like data types (i.e. types for which equality of two
values of the grouping key does not justify putting them into the same group,
because information like scale would be discarded this way) remains open. My
last idea was to add a boolean flag to the operator class which tells that
equality implies "bitwise equality", and to disallow aggregate push-down if
SortGroupClause.eqop is in an opclass which has this flag set to FALSE. I'd
like to hear your opinion before I do any coding.

-- 
Antonin Houska
https://www.cybertec-postgresql.com

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 6b96e7de0a..4484fb4fbc 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -4906,7 +4906,8 @@ foreign_join_ok(PlannerInfo *root, RelOptInfo *joinrel, JoinType jointype,
 	 */
 	Assert(fpinfo->relation_index == 0);	/* shouldn't be set yet */
 	fpinfo->relation_index =
-		list_length(root->parse->rtable) + list_length(root->join_rel_list);
+		list_length(root->parse->rtable) +
+		list_length(root->join_rel_list->items);
 
 	return true;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 65302fe65b..8b333f069a 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2272,6 +2272,14 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 }
 
 static void
+_outRelInfoList(StringInfo str, const RelInfoList *node)
+{
+	WRITE_NODE_TYPE("RELOPTINFOLIST");
+
+	WRITE_NODE_FIELD(items);
+}
+
+static void
 _outIndexOptInfo(StringInfo str, const IndexOptInfo *node)
 {
 	WRITE_NODE_TYPE("INDEXOPTINFO");
@@ -4031,6 +4039,9 @@ outNode(StringInfo str, const void *obj)
 			case T_RelOptInfo:
 _outRelOptInfo(str, obj);
 break;
+			case T_RelInfoList:
+_outRelInfoList(str, obj);
+break;
 			case T_IndexOptInfo:
 _outIndexOptInfo(str, obj);
 break;
diff --git a/src/backend/optimizer/geqo/geqo_eval.c b/src/backend/optimizer/geqo/geqo_eval.c
index e07bab831e..82b4f5c56a 100644
--- a/src/backend/optimizer/geqo/geqo_eval.c
+++ b/src/backend/optimizer/geqo/geqo_eval.c
@@ -92,11 +92,11 @@ geqo_eval(PlannerInfo *root, Gene *tour, int num_gene)
 	 *
 	 * join_rel_level[] shouldn't be in use, so just Assert it isn't.
 	 */
-	savelength = list_length(root->join_rel_list);
-	savehash = root->join_rel_hash;
+	savelength = list_length(root->join_rel_list->items);
+	savehash = root->join_rel_list->hash;
 	Assert(root->join_rel_level == NULL);
 
-	

Re: [HACKERS] WIP: Aggregation push-down

2019-02-21 Thread Tom Lane
Antonin Houska  writes:
> Michael Paquier  wrote:
>> Latest patch set fails to apply, so moved to next CF, waiting on
>> author.

> Rebased.

This is in need of rebasing again :-(.  I went ahead and pushed the 001
part, since that seemed fairly uncontroversial.  (Note that I changed
estimate_hashagg_tablesize's result type to double on the way; you
probably want to make corresponding adjustments in your patch.)

I did not spend a whole lot of time looking at the patch today, but
I'm still pretty distressed at the data structures you've chosen.
I remain of the opinion that a grouped relation and a base relation
are, er, unrelated, even if they happen to share the same relid set.
So I do not see the value of the RelOptInfoSet struct you propose here,
and I definitely don't think there's any value in having, eg,
create_seqscan_path or create_index_path dealing with this stuff.

I also don't like changing create_nestloop_path et al to take a PathTarget
rather than using the RelOptInfo's pathtarget; IMO, it's flat out wrong
for a path to generate a tlist different from what its parent RelOptInfo
says that the relation produces.

I think basically the way this ought to work is that we generate baserel
paths pretty much the same as today, and then we run through the baserels
and see which ones are potentially worth doing partial aggregation on,
and for each one that is, we create a separate "upper relation" RelOptInfo
describing that.  The paths for this RelOptInfo would be
partial-aggregation paths using paths from the corresponding baserel as
input.  Then we'd run a join search that only considers joining grouped
rels with plain rels (I concur that joining two grouped rels is not worth
coping with, at least for now).

regards, tom lane



Re: [HACKERS] WIP: Aggregation push-down

2019-02-03 Thread Michael Paquier
On Tue, Jan 15, 2019 at 03:06:18PM +0100, Antonin Houska wrote:
> This is the next version. A few more comments below.

Latest patch set fails to apply, so moved to next CF, waiting on
author.
--
Michael




Re: [HACKERS] WIP: Aggregation push-down

2018-12-17 Thread Antonin Houska
Tom Lane  wrote:

> Antonin Houska  writes:
> > [ agg_pushdown_v8.tgz ]
> 
> I spent a few hours looking at this today.

Thanks!

> It seems to me that at no point has there been a clear explanation of what
> the patch is trying to accomplish, so let me see whether I've got it
> straight or not.  (I suggest that something like this ought to be included
> in optimizer/README; the patch's lack of internal documentation is a serious
> deficiency.)

An earlier version of the patch did update optimizer/README, but I forgot to
merge it into the current one. I'll fix that in the next version. (Some more
documentation will also be added, especially header comments of some
functions.)

> Conceptually, we'd like to be able to push aggregation down below joins
> when that yields a win, which it could do by reducing the number of rows
> that have to be processed at the join stage.  Thus consider
> 
> SELECT agg(a.x) FROM a, b WHERE a.y = b.z;
> 
> We can't simply aggregate during the scan of "a", because some values of
> "x" might not appear at all in the input of the naively-computed aggregate
> (if their rows have no join partner in "b"), or might appear multiple
> times (if their rows have multiple join partners).  So at first glance,
> aggregating before the join is impossible.  The key insight of the patch
> is that we can make some progress by considering only grouped aggregation:

Yes, the patch does not handle queries with no GROUP BY clause, primarily
because I don't know how the grouped "a" table in this example could emit the
"a.y" column without using it as a grouping key.

> if we can group "a" into sets of rows that must all have the same join
> partners, then we can do preliminary aggregation within each such group,
> and take care of the number-of-repetitions problem when we join.  If the
> groups are large then this reduces the number of rows processed by the
> join, at the cost that we might spend time computing the aggregate for
> some row sets that will just be discarded by the join.  So now we consider
> 
> SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?;
> 
> and ask what properties the grouping column(s) "?" must have to make this
> work.  I believe we can say that it's OK to push down through a join if
> its join clauses are all of the form "a.y = b.z", where either a.y or b.z
> is listed as a GROUP BY column *and* the join operator is equality for
> the btree opfamily specified by the SortGroupClause.  (Note: actually,
> SortGroupClause per se mentions an equality operator, although I think the
> planner quickly reduces that to an opfamily specification.)

I suppose you mean make_pathkey_from_sortop().

> The concerns Robert had about equality semantics are certainly vital, but I
> believe that the logic goes through correctly as long as the grouping
> equality operator and the join equality operator belong to the same
> opfamily.

ok, a generic approach like this is always better. I just could not devise
it, nor can I prove that your theory is wrong.

> Robert postulated cases like "citext = text", but that doesn't apply
> here because no cross-type citext = text operator could be part of a
> well-behaved opfamily.

The reason I can see now is that the GROUP BY operator (opfamily) essentially
has the grouping column on both sides, so it does not match the cross-type
operator.

> What we'd actually be looking at is either "citextvar::text texteq textvar"
> or "citextvar citexteq textvar::citext", and the casts prevent these
> expressions from matching GROUP BY entries that have the wrong semantics.

Can you please explain this? If the expression is "citextvar::text texteq
textvar", then both operands of the operator should be of "text" type (because
the operator receives the output of the cast), so I'd expect a match to
SortGroupClause.eqop of a clause like "GROUP BY ".

> However: what we have proven here is only that we can aggregate across
> a set of rows that must share the same join partners.  We still have
> to be able to handle the case that the rowset has more than one join
> partner, which AFAICS means that we must use partial aggregation and
> then apply an "aggmultifn" (or else apply the aggcombinefn N times).
> We can avoid that and use plain aggregation when we can prove the "b"
> side of the join is unique, so that no sets of rows will have to be merged
> post-join; but ISTM that that reduces the set of use cases to be too small
> to be worth such a complex patch.  So I'm really doubtful that we should
> proceed forward with only that case available.

I tried to follow Heikki's proposal in [1] and separated the functionality
that does not need the partial aggregation. I was not able to foresee that the
patch would not get much smaller this way, partly because I had to introduce
functions that check whether a particular join duplicates aggregated values
or not. (Even if we use partial aggregation, we can use this functionality to
detect that we only need to call aggfinalfn() in some cases, as opposed 

Re: [HACKERS] WIP: Aggregation push-down

2018-11-18 Thread Tom Lane
Antonin Houska  writes:
> [ agg_pushdown_v8.tgz ]

I spent a few hours looking at this today.  It seems to me that at no
point has there been a clear explanation of what the patch is trying to
accomplish, so let me see whether I've got it straight or not.  (I suggest
that something like this ought to be included in optimizer/README; the
patch's lack of internal documentation is a serious deficiency.)

--

Conceptually, we'd like to be able to push aggregation down below joins
when that yields a win, which it could do by reducing the number of rows
that have to be processed at the join stage.  Thus consider

SELECT agg(a.x) FROM a, b WHERE a.y = b.z;

We can't simply aggregate during the scan of "a", because some values of
"x" might not appear at all in the input of the naively-computed aggregate
(if their rows have no join partner in "b"), or might appear multiple
times (if their rows have multiple join partners).  So at first glance,
aggregating before the join is impossible.  The key insight of the patch
is that we can make some progress by considering only grouped aggregation:
if we can group "a" into sets of rows that must all have the same join
partners, then we can do preliminary aggregation within each such group,
and take care of the number-of-repetitions problem when we join.  If the
groups are large then this reduces the number of rows processed by the
join, at the cost that we might spend time computing the aggregate for
some row sets that will just be discarded by the join.  So now we consider

SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?;

and ask what properties the grouping column(s) "?" must have to make this
work.  I believe we can say that it's OK to push down through a join if
its join clauses are all of the form "a.y = b.z", where either a.y or b.z
is listed as a GROUP BY column *and* the join operator is equality for
the btree opfamily specified by the SortGroupClause.  (Note: actually,
SortGroupClause per se mentions an equality operator, although I think the
planner quickly reduces that to an opfamily specification.)  The concerns
Robert had about equality semantics are certainly vital, but I believe
that the logic goes through correctly as long as the grouping equality
operator and the join equality operator belong to the same opfamily.

Case 1: the GROUP BY column is a.y.  Two rows of "a" whose y values are
equal per the grouping operator must join to exactly the same set of "b"
rows, else transitivity is failing.

Case 2: the GROUP BY column is b.z.  It still works, because the set of
"a" rows that are equal to a given z value is well-defined, and those
rows must join to exactly the "b" rows whose z entries are equal to
the given value, else transitivity is failing.

Robert postulated cases like "citext = text", but that doesn't apply
here because no cross-type citext = text operator could be part of a
well-behaved opfamily.  What we'd actually be looking at is either
"citextvar::text texteq textvar" or "citextvar citexteq textvar::citext",
and the casts prevent these expressions from matching GROUP BY entries
that have the wrong semantics.
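
A minimal sketch of that compatibility test, using existing lsyscache
helpers (the function itself is illustrative, not part of any patch):

#include "utils/lsyscache.h"

/*
 * True if the grouping equality operator and the join equality operator
 * are equality members of at least one common btree opfamily.
 */
static bool
grouping_compatible_with_join(Oid grouping_eqop, Oid join_eqop)
{
	List	   *gfams = get_mergejoin_opfamilies(grouping_eqop);
	List	   *jfams = get_mergejoin_opfamilies(join_eqop);
	ListCell   *lc;

	foreach(lc, gfams)
	{
		if (list_member_oid(jfams, lfirst_oid(lc)))
			return true;
	}
	return false;
}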

However: what we have proven here is only that we can aggregate across
a set of rows that must share the same join partners.  We still have
to be able to handle the case that the rowset has more than one join
partner, which AFAICS means that we must use partial aggregation and
then apply an "aggmultifn" (or else apply the aggcombinefn N times).
We can avoid that and use plain aggregation when we can prove the "b"
side of the join is unique, so that no sets of rows will have to be merged
post-join; but ISTM that that reduces the set of use cases to be too small
to be worth such a complex patch.  So I'm really doubtful that we should
proceed forward with only that case available.
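
To make the repetition handling concrete for sum(): if a partially aggregated
group turns out to have N join partners, the finalize side must account for N
repetitions. An illustrative sketch (aggmultifn is the proposed, non-existent
support function; repeating aggcombinefn is the fallback):

	int64		partial_sum = 42;	/* assumed partial state for one group */
	int			N = 3;				/* join partners for that group */
	int64		merged = 0;

	for (int i = 0; i < N; i++)
		merged += partial_sum;		/* aggcombinefn applied N times */

	/* a hypothetical aggmultifn would compute the same in one step: */
	merged = partial_sum * N;		/* merged == 126 either way */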

Also, Tomas complained in the earlier thread that he didn't think
grouping on the join column was a very common use-case in the first
place.  I share that concern, but I think we could extend the logic
to the case that Tomas posited as being useful:

SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z;

where the join column b.id is unique.  If we group on a.y (using semantics
compatible with the join operator and the uniqueness constraint), then all
"a" rows in a given group will join to exactly one "b" row that
necessarily has exactly one grouping value, so this group can safely be
aggregated together.  We might need to combine it post-join with other "b"
rows that have equal "z" values, but we can do that as long as we're okay
with partial aggregation.  I think this example shows why the idea is far
more powerful with partial aggregation than without.

--

In short, then, I don't have much use for the patch as presented in this
thread, without "aggmultifn".  That might be OK as the first phase in a
multi-step patch, but not as a production feature.  I also have no use
for the stuff depending on bitwise equality rather than the user-visible
operators; 

Re: [HACKERS] WIP: Aggregation push-down

2018-08-31 Thread Antonin Houska
Antonin Houska  wrote:

> I've reworked the patch so that separate RelOptInfo is used for grouped
> relation. The attached patch is only the core part. Neither postgres_fdw
> changes nor the part that tries to avoid the final aggregation is included
> here. I'll add these when the patch does not seem to need another major 
> rework.

This is the next version. After having posted a few notes to the [1] thread,
I'm returning to the original one so it can be referenced from my entry in the
CF application. As proposed in the other thread, it uses only "simple
aggregation". The 2-stage aggregation, which gives more power to the feature
and which also enables parallel processing for it, will be coded in a separate
patch.

I eventually decided to abandon the Var substitution proposed by Heikki in
[2]. It's not necessary if we accept the restriction that the feature only
accepts a simple column reference as a grouping expression so far.

[1] 
https://www.postgresql.org/message-id/c165b72e-8dbb-2a24-291f-113aeb67b76a%40iki.fi

[2] 
https://www.postgresql.org/message-id/113e9594-7c08-0f1f-ad15-41773b56a86b%40iki.fi

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



agg_pushdown_v8.tgz
Description: GNU Zip compressed data


Re: [HACKERS] WIP: Aggregation push-down

2018-07-06 Thread Antonin Houska
Robert Haas  wrote:

> On Fri, Feb 23, 2018 at 11:08 AM, Antonin Houska  wrote:
> > I spent some more time thinking about this. What about adding a new strategy
> > number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? 
> > For
> > most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber 
> > strategy
> > would point to the same operator, but types like numeric would naturally 
> > have
> > them different.
> >
> > Thus the pushed-down partial aggregation can only use the
> > HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In 
> > the
> > initial version (until we have useful statistics for the binary values) we 
> > can
> > avoid the aggregation push-down if the grouping expression output type has 
> > the
> > two strategies implemented using different functions because, as you noted
> > upthread, grouping based on binary equality can result in excessive number 
> > of
> > groups.
> >
> > One open question is whether the binary equality operator needs a separate
> > operator class or not. If an opclass cares only about the binary equality, 
> > its
> > hash function(s) can be a lot simpler.
> 
> Hmm.  How about instead adding another regproc field to pg_type which
> stores the OID of a function that tests binary equality for that
> datatype?  If that happens to be equal to the OID you got from the
> opclass, then you're all set.

I suppose you mean pg_operator, not pg_type. What I don't like about this is
that the new field would only be useful for a very small fraction of
operators.

On the other hand, the drawback of additional operator classes is that we'd
have to modify almost all the existing operator classes for the hash AM. (The
absence of the new strategy number in an operator class cannot mean that the
existing equality operator can be used to compare binary values too, because
then we couldn't guarantee correct behavior of the already existing
user-defined operator classes.)


-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: [HACKERS] WIP: Aggregation push-down

2018-02-24 Thread Robert Haas
On Fri, Feb 23, 2018 at 11:08 AM, Antonin Houska  wrote:
> I spent some more time thinking about this. What about adding a new strategy
> number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For
> most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy
> would point to the same operator, but types like numeric would naturally have
> them different.
>
> Thus the pushed-down partial aggregation can only use the
> HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the
> initial version (until we have useful statistics for the binary values) we can
> avoid the aggregation push-down if the grouping expression output type has the
> two strategies implemented using different functions because, as you noted
> upthread, grouping based on binary equality can result in excessive number of
> groups.
>
> One open question is whether the binary equality operator needs a separate
> operator class or not. If an opclass cares only about the binary equality, its
> hash function(s) can be a lot simpler.

Hmm.  How about instead adding another regproc field to pg_type which
stores the OID of a function that tests binary equality for that
datatype?  If that happens to be equal to the OID you got from the
opclass, then you're all set.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: Aggregation push-down

2018-02-23 Thread Antonin Houska
Robert Haas  wrote:

> On Mon, Jan 29, 2018 at 3:32 AM, Antonin Houska  wrote:
> > I think of a variant of this: implement a universal function that tests the
> > binary values for equality (besides the actual arguments, caller would have 
> > to
> > pass info like typlen, typalign, typbyval for each argument, and cache these
> > for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is
> > sufficient. Thus the problematic cases like numeric, citext, etc. would be
> > detected by their non-zero oprcode.
> 
> I don't think that's a good option, because we don't want int4eq to
> have to go through a code path that has branches to support varlenas
> and cstrings.  Andres is busy trying to speed up expression evaluation
> by removing unnecessary code branches; adding new ones would be
> undoing that valuable work.

I spent some more time thinking about this. What about adding a new strategy
number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For
most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy
would point to the same operator, but types like numeric would naturally have
them different.

Thus the pushed-down partial aggregation can only use the
HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the
initial version (until we have useful statistics for the binary values) we can
avoid the aggregation push-down if the grouping expression output type has the
two strategies implemented using different functions because, as you noted
upthread, grouping based on binary equality can result in excessive number of
groups.

One open question is whether the binary equality operator needs a separate
operator class or not. If an opclass cares only about the binary equality, its
hash function(s) can be a lot simpler.
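
A minimal sketch of the planner-side check this would enable
(HTBinaryEqualStrategyNumber is the proposed, non-existent strategy;
get_opfamily_member() is an existing lsyscache helper):

#include "access/stratnum.h"	/* HTEqualStrategyNumber */
#include "utils/lsyscache.h"

#define HTBinaryEqualStrategyNumber 2	/* proposed; not in PostgreSQL */

/*
 * Illustrative only: allow push-down when the hash opfamily implements
 * equality and binary equality with the same operator for this type.
 */
static bool
pushdown_safe_for_grouping(Oid hash_opfamily, Oid typid)
{
	Oid			eqop = get_opfamily_member(hash_opfamily, typid, typid,
										   HTEqualStrategyNumber);
	Oid			bineqop = get_opfamily_member(hash_opfamily, typid, typid,
											  HTBinaryEqualStrategyNumber);

	return OidIsValid(eqop) && eqop == bineqop;
}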

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: [HACKERS] WIP: Aggregation push-down

2018-01-29 Thread Robert Haas
On Mon, Jan 29, 2018 at 3:32 AM, Antonin Houska  wrote:
> I think of a variant of this: implement a universal function that tests the
> binary values for equality (besides the actual arguments, caller would have to
> pass info like typlen, typalign, typbyval for each argument, and cache these
> for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is
> sufficient. Thus the problematic cases like numeric, citext, etc. would be
> detected by their non-zero oprcode.

I don't think that's a good option, because we don't want int4eq to
have to go through a code path that has branches to support varlenas
and cstrings.  Andres is busy trying to speed up expression evaluation
by removing unnecessary code branches; adding new ones would be
undoing that valuable work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: Aggregation push-down

2018-01-29 Thread Chapman Flack
On 01/29/18 03:32, Antonin Houska wrote:
> Robert Haas  wrote:

>>> only take place if we had a special equality operator which distinguishes
>>> the *binary* values (I don't know yet how to store this operator in the
>>> catalog ---
...
>> We don't have an operator that tests for binary equality, but it's
>> certainly testable from C; see datumIsEqual.

Disclaimer: I haven't been following the whole thread closely. But could
there be some way to exploit the composite-type *= operator?

https://www.postgresql.org/docs/current/static/functions-comparisons.html#COMPOSITE-TYPE-COMPARISON
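
At least at the SQL level it does distinguish the byte images that plain
numeric equality merges (*= is backed by record_image_eq; a quick sketch on a
stock installation):

SELECT ROW('0'::numeric) =  ROW('0.00'::numeric) AS plain_eq,
       ROW('0'::numeric) *= ROW('0.00'::numeric) AS image_eq;

 plain_eq | image_eq
----------+----------
 t        | f
(1 row)

Of course, nothing currently lets the planner pick that operator for
grouping; it would still need a home in the catalogs as discussed above.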

-Chap



Re: [HACKERS] WIP: Aggregation push-down

2018-01-29 Thread Antonin Houska
Robert Haas  wrote:

> On Fri, Jan 26, 2018 at 8:04 AM, Antonin Houska  wrote:
> > So one problem is that the grouping expression can be inappropriate for
> > partial aggregation even if there's no type change during the
> > translation. What I consider typical for this case is that the equality
> > operator used to identify groups (SortGroupClause.eqop) can return true
> > even if binary (stored) values of the inputs differ. The partial
> > aggregation could only take place if we had a special equality operator
> > which distinguishes the *binary* values (I don't know yet how to store
> > this operator in the catalog --- in pg_opclass records for the hash
> > opclasses?).
> 
> We don't have an operator that tests for binary equality, but it's
> certainly testable from C; see datumIsEqual.  I'm not sure how much
> this really helps, though.  I think it would be safe to build an
> initial set of groups based on datumIsEqual comparisons and then
> arrange to later merge some of the groups.  But that's not something
> we ever do in the executor today, so it might require quite a lot of
> hacking.  Also, it seems like it might really suck in some cases.  For
> instance, consider something like SELECT scale(one.a), sum(two.a) FROM
> one, two WHERE one.a = two.a GROUP BY 1.  Doing a Partial Aggregate on
> two.a using datumIsEqual semantics, joining to a, and then doing a
> Finalize Aggregate looks legal, but the Partial Aggregate may produce
> a tremendous number of groups compared to the Finalize Aggregate.  In
> other words, this technique wouldn't merge any groups that shouldn't
> be merged, but it might fail to merge groups that really need to be
> merged to get good performance.

I don't insist on doing Partial Aggregate in every case. If we wanted to group
by the binary value, we'd probably have to enhance statistics accordingly. The
important thing is to recognize special cases like this. Rejecting the Partial
Aggregate would be the default response.

> > Another idea is to allow only such changes that the
> > destination type is in the same operator class as the source, and explicitly
> > enumerate the "safe opclasses". But that would mean that a user cannot define
> > new opclasses within which the translation is possible --- not sure this is
> > a serious issue.
> 
> Enumerating specific opclasses in the source code is a non-starter --
> Tom Lane would roll over in his grave if he weren't still alive.  What
> we could perhaps consider doing is adding some mechanism for an
> opclass or opfamily to say whether its equality semantics happen to be
> exactly datumIsEqual() semantics.  That's a little grotty because it
> leaves data types for which that isn't true out in the cold, but on
> the other hand it seems like it would be useful as a way of optimizing
> a bunch of things other than this.  Maybe it could also include a way
> to specify that the comparator happens to have the same semantics as C's
> built-in < and > operators, which seems like a case that might also
> lend itself to some optimizations.

I think of a variant of this: implement a universal function that tests the
binary values for equality (besides the actual arguments, caller would have to
pass info like typlen, typalign, typbyval for each argument, and cache these
for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is
sufficient. Thus the problematic cases like numeric, citext, etc. would be
detected by their non-zero oprcode.

-- 
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com



Re: [HACKERS] WIP: Aggregation push-down

2018-01-26 Thread Robert Haas
On Fri, Jan 26, 2018 at 8:04 AM, Antonin Houska  wrote:
> So one problem is that the grouping expression can be inappropriate for
> partial aggregation even if there's no type change during the
> translation. What I consider typical for this case is that the equality
> operator used to identify groups (SortGroupClause.eqop) can return true even
> if binary (stored) values of the inputs differ. The partial aggregation could
> only take place if we had a special equality operator which distinguishes the
> *binary* values (I don't know yet how to store this operator in the catalog ---
> in pg_opclass records for the hash opclasses?).

We don't have an operator that tests for binary equality, but it's
certainly testable from C; see datumIsEqual.  I'm not sure how much
this really helps, though.  I think it would be safe to build an
initial set of groups based on datumIsEqual comparisons and then
arrange to later merge some of the groups.  But that's not something
we ever do in the executor today, so it might require quite a lot of
hacking.  Also, it seems like it might really suck in some cases.  For
instance, consider something like SELECT scale(one.a), sum(two.a) FROM
one, two WHERE one.a = two.a GROUP BY 1.  Doing a Partial Aggregate on
two.a using datumIsEqual semantics, joining to a, and then doing a
Finalize Aggregate looks legal, but the Partial Aggregate may produce
a tremendous number of groups compared to the Finalize Aggregate.  In
other words, this technique wouldn't merge any groups that shouldn't
be merged, but it might fail to merge groups that really need to be
merged to get good performance.

> One of my ideas is to check whether the source and destination types are
> binary coercible (i.e. pg_cast(castfunc) = 0) but this might be a misuse of
> the binary coercibility.

Yeah, binary coercibility has nothing to do with this; that tells you
whether the two types are the same on disk, not whether they have the
same equality semantics.  For instance, I think text and citext are
binary coercible, but their equality semantics are not the same.
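
Both halves of that are easy to check (assuming the citext extension is
installed; castfunc = 0 in pg_cast is what "binary coercible" means at the
catalog level):

CREATE EXTENSION citext;

SELECT 'Foo'::citext = 'foo'::citext AS citext_eq,
       'Foo'::text   = 'foo'::text   AS text_eq;

 citext_eq | text_eq
-----------+---------
 t         | f
(1 row)

SELECT castsource::regtype, casttarget::regtype, castfunc
FROM   pg_cast
WHERE  castsource = 'citext'::regtype AND casttarget = 'text'::regtype;

 castsource | casttarget | castfunc
------------+------------+----------
 citext     | text       |        0
(1 row)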

> Another idea is to allow only such changes that the
> destination type is in the same operator class as the source, and explicitly
> enumerate the "safe opclasses". But that would mean that a user cannot define
> new opclasses within which the translation is possible --- not sure this is a
> serious issue.

Enumerating specific opclasses in the source code is a non-starter --
Tom Lane would roll over in his grave if he weren't still alive.  What
we could perhaps consider doing is adding some mechanism for an
opclass or opfamily to say whether its equality semantics happen to be
exactly datumIsEqual() semantics.  That's a little grotty because it
leaves data types for which that isn't true out in the cold, but on
the other hand it seems like it would be useful as a way of optimizing
a bunch of things other than this.  Maybe it could also include a way
to specify that the comparator happens to have the same semantics as C's
built-in < and > operators, which seems like a case that might also
lend itself to some optimizations.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: Aggregation push-down

2018-01-26 Thread Antonin Houska
Robert Haas  wrote:

> On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska  wrote:
> > Michael Paquier  wrote:
> >> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska  wrote:
> >> > I'm not about to add any other features now. Implementation of the 
> >> > missing
> >> > parts (see the TODO comments in the code) is the next step. But what I'd
> >> > appreciate most is feedback on the design. Thanks.
> >>
> >> I am getting a conflict after applying patch 5 but this did not get
> >> any reviews so moved to next CF with waiting on author as status.
> >
> > Attached is the next version.
> 
> I've been a bit confused for a while about what this patch is trying
> to do, so I spent some time today looking at it to try to figure it
> out.

Thanks.

> There's a lot I don't understand yet, but it seems like the general idea is
> to build, potentially for each relation in the join tree, not only the
> regular list of paths but also a list of "grouped" paths.  If the
> pre-grouped path wins, then we can get a final path that looks like Finalize
> Aggregate -> Some Joins -> Partial Aggregate -> Maybe Some More Joins ->
> Base Table Scan.

Yes. The regression test output shows some more plans.
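
For readers without the regression output at hand, the shape looks roughly
like this for the test query discussed below (an illustrative sketch, not
verbatim patch output):

 Finalize HashAggregate
   Group Key: p.i
   ->  Hash Join
         Hash Cond: (p.i = c1.parent)
         ->  Seq Scan on agg_pushdown_parent p
         ->  Hash
               ->  Partial HashAggregate
                     Group Key: c1.parent
                     ->  Seq Scan on agg_pushdown_child1 c1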

> In some cases the patch seems to replace that uppermost Finalize Aggregate
> with a Result node.

This is a feature implemented in 09_avoid_agg_finalization.diff. You can
ignore it for now, it's of lower priority than the preceding parts and needs
more work to be complete.

> translate_expression_to_rels() looks unsafe.  Equivalence members are
> known to be equal under the semantics of the relevant operator class,
> but that doesn't mean that one can be freely substituted for another.
> For example:
> 
> rhaas=# create table one (a numeric);
> CREATE TABLE
> rhaas=# create table two (a numeric);
> CREATE TABLE
> rhaas=# insert into one values ('0');
> INSERT 0 1
> rhaas=# insert into two values ('0.00');
> INSERT 0 1
> rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1;
>  a | count
> ---+---
>  0 | 1
> (1 row)
> 
> rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1;
>   a   | count
> --+---
>  0.00 | 1
> (1 row)
> 
> There are, admittedly, a large number of data types for which such a
> substitution would work just fine.  numeric will not, citext will not,
> but many others are fine. Regrettably, we have no framework in the
> system for identifying equality operators which actually test identity
> versus some looser notion of equality.  Cross-type operators are a
> problem, too; if we have foo.x = bar.y in the query, one might be int4
> and the other int8, for example, but they can still belong to the same
> equivalence class.
> 
> Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
> agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent =
> p.i GROUP BY p.i" you assume that it's OK to do a Partial
> HashAggregate over c1.parent rather than p.i.  This will be false if,
> say, c1.parent is of type citext and p.i is of type text; this will
> get things grouped together that shouldn't be.

I had really missed this problem, thanks for bringing it up! Your
considerations made me think about the specifics of types like numeric or
citext. Although your example does not demonstrate what I consider the core
issue, I believe we are ultimately thinking of the same thing.

Consider this example query (using the tables you defined upthread):

SELECT   one.a
FROM     one, two
WHERE    one.a = two.a AND scale(two.a) > 1
GROUP BY 1

If we first try to aggregate "two", then evaluate the WHERE clause and then
finalize the aggregation, the partial aggregation of "two" can put various
binary values of the "a" column (0, 0.0, etc.) into the same group, so the
scale of the output (of the partial aggregation) will be undefined. Thus the
aggregation push-down will change the input of the WHERE clause.
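
The hazard is easy to reproduce on a stock installation (scale() is the
ordinary numeric function used in the query above):

SELECT '0'::numeric = '0.00'::numeric AS same_group,
       scale('0'::numeric)            AS scale_a,
       scale('0.00'::numeric)         AS scale_b;

 same_group | scale_a | scale_b
------------+---------+---------
 t          |       0 |       2
(1 row)

Whichever representative the partial aggregate keeps for the group decides
whether scale(two.a) > 1 passes, so the WHERE clause no longer filters the
rows it was written against.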

So one problem is that the grouping expression can be inappropriate for
partial aggregation even if there's no type change during the
translation. What I consider typical for this case is that the equality
operator used to identify groups (SortGroupClause.eqop) can return true even
if binary (stored) values of the inputs differ. The partial aggregation could
only take place if we had a special equality operator which distinguishes the
*binary* values (I don't know yet how to store this operator in the catalog ---
in pg_opclass records for the hash opclasses?).

On the other hand, if the grouping expression is not a plain Var, there's no
type change during the translation, and the expression output type is not an
"identity-unsafe" type (numeric, citext, etc.), then input vars of an
"identity-unsafe" type should not prevent us from using the expression for
partial aggregation: in such a case the grouping keys are computed in the same
way they would be w/o the partial aggregation.

The other problem is 

Re: [HACKERS] WIP: Aggregation push-down

2018-01-15 Thread Robert Haas
On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska  wrote:
> Michael Paquier  wrote:
>> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska  wrote:
>> > I'm not about to add any other features now. Implementation of the missing
>> > parts (see the TODO comments in the code) is the next step. But what I'd
>> > appreciate most is feedback on the design. Thanks.
>>
>> I am getting a conflict after applying patch 5 but this did not get
>> any reviews so moved to next CF with waiting on author as status.
>
> Attached is the next version.

I've been a bit confused for a while about what this patch is trying
to do, so I spent some time today looking at it to try to figure it
out.  There's a lot I don't understand yet, but it seems like the
general idea is to build, potentially for each relation in the join
tree, not only the regular list of paths but also a list of "grouped"
paths.  If the pre-grouped path wins, then we can get a final path
that looks like Finalize Aggregate -> Some Joins -> Partial Aggregate
-> Maybe Some More Joins -> Base Table Scan.  In some cases the patch
seems to replace that uppermost Finalize Aggregate with a Result node.

translate_expression_to_rels() looks unsafe.  Equivalence members are
known to be equal under the semantics of the relevant operator class,
but that doesn't mean that one can be freely substituted for another.
For example:

rhaas=# create table one (a numeric);
CREATE TABLE
rhaas=# create table two (a numeric);
CREATE TABLE
rhaas=# insert into one values ('0');
INSERT 0 1
rhaas=# insert into two values ('0.00');
INSERT 0 1
rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1;
 a | count
---+---
 0 | 1
(1 row)

rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1;
  a   | count
--+---
 0.00 | 1
(1 row)

There are, admittedly, a large number of data types for which such a
substitution would work just fine.  numeric will not, citext will not,
but many others are fine. Regrettably, we have no framework in the
system for identifying equality operators which actually test identity
versus some looser notion of equality.  Cross-type operators are a
problem, too; if we have foo.x = bar.y in the query, one might be int4
and the other int8, for example, but they can still belong to the same
equivalence class.

Concretely, in your test query "SELECT p.i, avg(c1.v) FROM
agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent =
p.i GROUP BY p.i" you assume that it's OK to do a Partial
HashAggregate over c1.parent rather than p.i.  This will be false if,
say, c1.parent is of type citext and p.i is of type text; this will
get things grouped together that shouldn't be.  It will also be false if the
grouping expression is something like GROUP BY length(p.i::text),
because one value could be '0'::numeric and the other '0.00'::numeric.
I can't think of a reason why it would be false if the grouping
expressions are both simple Vars of the same underlying data type, but
I'm a little nervous that I might be wrong even about that case.
Maybe you've handled all of this somehow, but it's not obvious to me
that it has been considered.
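
The length() case can be verified directly (plain SQL, nothing
patch-specific):

SELECT length('0'::numeric::text)     AS len_a,
       length('0.00'::numeric::text)  AS len_b,
       '0'::numeric = '0.00'::numeric AS same_group;

 len_a | len_b | same_group
-------+-------+------------
     1 |     4 | t
(1 row)

Once a partial aggregate merged those two values into one group, a later
GROUP BY length(...) could not tell them apart again.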

Another thing I noticed is that the GroupedPathInfo looks a bit like a
stripped-down RelOptInfo, in that for example it has both a pathlist
and a partial_pathlist. I'm inclined to think that we should build new
RelOptInfos instead.  As you have it, there are an awful lot of things
that seem to know about the grouped/ungrouped distinction, many of
which are quite low-level functions like add_path(),
add_paths_to_append_rel(), try_nestloop_path(), etc.  I think that
some of this would go away if you had separate RelOptInfos instead of
GroupedPathInfo.

Some compiler noise:

allpaths.c:2794:11: error: variable 'partial_target' is used uninitialized
      whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~
allpaths.c:2798:56: note: uninitialized use occurs here
        create_gather_path(root, rel, cheapest_partial_path, partial_target,
                                                              ^~
allpaths.c:2794:7: note: remove the 'if' if its condition is always true
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
             ^
allpaths.c:2794:11: error: variable 'partial_target' is used uninitialized
      whenever '&&' condition is false [-Werror,-Wsometimes-uninitialized]
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~
allpaths.c:2798:56: note: uninitialized use occurs here
        create_gather_path(root, rel, cheapest_partial_path, partial_target,
                                                              ^~
allpaths.c:2794:11: note: remove the '&&' if its condition is always true
        else if (rel->gpi != NULL && rel->gpi->target != NULL)