Re: [HACKERS] WIP: Aggregation push-down
Richard Guo wrote: > Hi, > > I've been looking at the 'make_join_rel()' part of the patch, and I'm > aware that, if we are joining A and B, a 'grouped join rel (AB)' would > be created besides the 'plain join rel (AB)', and paths are added by 1) > applying partial aggregate to the paths of the 'plain join rel (AB)', or > 2) joining grouped A to plain B or joining plain A to grouped B. > > This is a smart idea. One issue I can see is that some logic would have > to be repeated several times. For example, the validity check for the > same proposed join would be performed at most three times by > join_is_legal(). > > I'm thinking of another idea that, instead of using a separate > RelOptInfo for the grouped rel, we add in RelOptInfo a > 'grouped_pathlist' for the grouped paths, just like how we implement > parallel query via adding 'partial_pathlist'. > > For base rel, if we decide it can produce grouped paths, we create the > grouped paths by applying partial aggregation to the paths in 'pathlist' > and add them into 'grouped_pathlist'. > > For join rel (AB), we can create the grouped paths for it by: > 1) joining a grouped path from 'A->grouped_pathlist' to a plain path > from 'B->pathlist', or vice versa; > 2) applying partial aggregation to the paths in '(AB)->pathlist'. > > This is basically the same idea, just moves the grouped paths from the > grouped rel to a 'grouped_pathlist'. With it we should not need to make > any code changes to 'make_join_rel()'. Most code changes would happen in > 'add_paths_to_joinrel()'. > > Will this idea work? Is it better or worse? I tried such approach in an earlier version of the patch [1], and I think the reason also was to avoid repeated calls of functions like join_is_legal(). However there were objections against such approach, e.g. [2], and I admit that it was more invasive than what the current patch version does. Perhaps we can cache the result of join_is_legal() that we get for the plain relation, and use it for the group relation. I'll consider that. Thanks. [1] https://www.postgresql.org/message-id/18007.1513957437%40localhost [2] https://www.postgresql.org/message-id/CA%2BTgmob8og%2B9HzMg1vM%2B3LwDm2f_bHUi9%2Bg1bqLDTjqpw5s%2BnQ%40mail.gmail.com -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [HACKERS] WIP: Aggregation push-down
Hi, I've been looking at the 'make_join_rel()' part of the patch, and I'm aware that, if we are joining A and B, a 'grouped join rel (AB)' would be created besides the 'plain join rel (AB)', and paths are added by 1) applying partial aggregate to the paths of the 'plain join rel (AB)', or 2) joining grouped A to plain B or joining plain A to grouped B. This is a smart idea. One issue I can see is that some logic would have to be repeated several times. For example, the validity check for the same proposed join would be performed at most three times by join_is_legal(). I'm thinking of another idea that, instead of using a separate RelOptInfo for the grouped rel, we add in RelOptInfo a 'grouped_pathlist' for the grouped paths, just like how we implement parallel query via adding 'partial_pathlist'. For base rel, if we decide it can produce grouped paths, we create the grouped paths by applying partial aggregation to the paths in 'pathlist' and add them into 'grouped_pathlist'. For join rel (AB), we can create the grouped paths for it by: 1) joining a grouped path from 'A->grouped_pathlist' to a plain path from 'B->pathlist', or vice versa; 2) applying partial aggregation to the paths in '(AB)->pathlist'. This is basically the same idea, just moves the grouped paths from the grouped rel to a 'grouped_pathlist'. With it we should not need to make any code changes to 'make_join_rel()'. Most code changes would happen in 'add_paths_to_joinrel()'. Will this idea work? Is it better or worse? Thanks Richard
Re: [HACKERS] WIP: Aggregation push-down
Tomas Vondra wrote: > Hi, > > I've been looking at the last version (v14) of this patch series, > submitted way back in July and unfortunately quiet since then. Antonin > is probably right one of the reasons for the lack of reviews is that it > requires interest/experience with planner. > > Initially it was also a bit hard to understand what are the benefits > (and the patch shifted a bit), which is now mostly addressed by the > README in the last patch. The trouble is that's hidden in the patch and > so not immediately obvious to people considering reviewing it :-( Tom > posted a nice summary in November 2018, but it was perhaps focused on > the internals. > > So my first suggestion it to write a short summary with example how it > affects practical queries (plan change, speedup) etc. I think README plus regression test output should be enough for someone who is about to review patch as complex as this. > My second suggestion is to have meaningful commit messages for each part > of the patch series, explaining why we need that particular part. It > might have been explained somewhere in the thread, but as a reviewer I > really don't want to reverse-engineer the whole thread. ok, done. > Now, regarding individual parts of the patch: > > > 1) v14-0001-Introduce-RelInfoList-structure.patch > - > > - I'm not entirely sure why we need this change. We had the list+hash > before, so I assume we do this because we need the output functions? I believe that this is what Tom proposed in [1], see "Maybe an appropriate preliminary patch is to refactor the support code ..." near the end of that message. The point is that now we need two lists: one for "plain" relations and one for grouped ones. > - The RelInfoList naming is pretty confusing, because it's actually not > a list but a combination of list+hash. And the embedded list is called > items, to make it yet a bit more confusing. I suggest we call this > just RelInfo or RelInfoLookup or something else not including List. I think it can be considered a list by the caller of add_join_rel() etc. The hash table is hidden from user and is empty until the list becomes long enough. The word "list" in the structure name may just indicate that new elements can be added to the end, which shouldn't be expected if the structure was an array. I searched a bit in tools/pgindent/typedefs.list and found at least a few structures that also do have "list" in the name but are not lists internally: CatCList, FuncCandidateList, MCVList. Actually I'm not opposed to renaming the structure, but don't have better idea right now. As for *Lookup: following is the only use of such a structure name in PG code and it's not something to store data in: typedef enum { IDENTIFIER_LOOKUP_NORMAL, /* normal processing of var names */ IDENTIFIER_LOOKUP_DECLARE, /* In DECLARE --- don't look up names */ IDENTIFIER_LOOKUP_EXPR /* In SQL expression --- special case */ } IdentifierLookup; > - RelInfoEntry seems severely under-documented, particularly the data > part (which is void* making it even harder to understand what's it for > without having to read the functions). AFAICS it's just simply a link > to the embedded list, but maybe I'm wrong. This is just JoinHashEntry (which was also undocumented) renamed. I've added a short comment now. > - I suggest we move find_join_rel/add_rel_info/add_join_rel in relnode.c > right before build_join_rel. This makes diff clearer/smaller and > visual diffs nicer. 
Hm, it might improve readability of the diff, but this API is exactly where I still need feedback from Tom. I'm not eager to optimize the diff as long as there's a risk that these functions will be removed or renamed. > 2) v14-0002-Introduce-make_join_rel_common-function.patch > - > > I see no point in keeping this change in a separate patch, as it prety > much just renames make_join_rel to make_join_rel_common and then adds > > make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2) > { > return make_join_rel_common(root, rel1, rel2); > } > > which seems rather trivial and useless on it's own. I'd just merge it > into 0003 where we use the _common function more extensively. ok. I thought that this improves readability of the diffs, but it doesn't look that bad if this is included in 0003. > > 3) v14-0003-Aggregate-push-down-basic-functionality.patch > - > > I haven't done much review on this yet, but I've done some testing and > benchmarking so let me share at least that. The queries I used were of > the type I mentioned earlier in this thread - essentially a star schema, > i.e. fact table referencing dimensions, with aggregation per columns in > the dimension. So something like > > SELECT d.c, sum(f) FROM fact JOIN dimension d ON
Re: [HACKERS] WIP: Aggregation push-down
Hi,

I've been looking at the last version (v14) of this patch series, submitted way back in July and unfortunately quiet since then. Antonin is probably right that one of the reasons for the lack of reviews is that it requires interest in and experience with the planner.

Initially it was also a bit hard to understand what the benefits are (and the patch shifted a bit), which is now mostly addressed by the README in the last patch. The trouble is that's hidden in the patch and so not immediately obvious to people considering reviewing it :-( Tom posted a nice summary in November 2018, but it was perhaps focused on the internals.

So my first suggestion is to write a short summary with examples of how it affects practical queries (plan change, speedup) etc.

My second suggestion is to have meaningful commit messages for each part of the patch series, explaining why we need that particular part. It might have been explained somewhere in the thread, but as a reviewer I really don't want to reverse-engineer the whole thread.

Now, regarding individual parts of the patch:

1) v14-0001-Introduce-RelInfoList-structure.patch

- I'm not entirely sure why we need this change. We had the list+hash before, so I assume we do this because we need the output functions?

- The RelInfoList naming is pretty confusing, because it's actually not a list but a combination of list+hash. And the embedded list is called items, to make it yet a bit more confusing. I suggest we call this just RelInfo or RelInfoLookup or something else not including List.

- RelInfoEntry seems severely under-documented, particularly the data part (which is void*, making it even harder to understand what it's for without having to read the functions). AFAICS it's simply a link to the embedded list, but maybe I'm wrong.

- I suggest we move find_join_rel/add_rel_info/add_join_rel in relnode.c right before build_join_rel. This makes the diff clearer/smaller and visual diffs nicer.

2) v14-0002-Introduce-make_join_rel_common-function.patch

I see no point in keeping this change in a separate patch, as it pretty much just renames make_join_rel to make_join_rel_common and then adds

  make_join_rel(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2)
  {
      return make_join_rel_common(root, rel1, rel2);
  }

which seems rather trivial and useless on its own. I'd just merge it into 0003 where we use the _common function more extensively.

3) v14-0003-Aggregate-push-down-basic-functionality.patch

I haven't done much review on this yet, but I've done some testing and benchmarking, so let me share at least that.

The queries I used were of the type I mentioned earlier in this thread - essentially a star schema, i.e. a fact table referencing dimensions, with aggregation per column in the dimension. So something like

  SELECT d.c, sum(f) FROM fact JOIN dimension d ON (d.id = f.d_id) GROUP BY d.c;

where the "fact" table is much much larger than the dimension.

Attached is a script I used for testing with a bunch of queries and a number of parameters (parallelism, size of dimension, size of fact, ...) and a spreadsheet summarizing the results.

Overall, the results seem pretty good - in many cases the queries get about 2x faster and I haven't seen any regressions. That's pretty nice.

One annoying thing is that this only works for non-parallel queries. That is, I saw no improvements with parallel queries. It was still faster than the non-parallel query with aggregate pushdown, but the parallel query did not improve.
An example of timing looks like this:

  non-parallel (pushdown = off): 918 ms
  non-parallel (pushdown = on):  561 ms
  parallel (pushdown = off):     313 ms
  parallel (pushdown = on):      313 ms

The non-parallel query gets faster because after enabling push-down the plan changes (and we get partial aggregate below the join). With parallel query that does not happen, the plans stay the same.

I'm not claiming this is a bug - we end up picking the fastest execution plan (i.e. we don't make a mistake by e.g. switching to the non-parallel one with pushdown). My understanding is that the aggregate pushdown can't (currently?) be used with queries that use partial aggregate because of parallelism. That is, we can either end up with a plan like this:

  Finalize Aggregate
    -> Join
         -> Partial Aggregate
         -> ...

or a parallel plan like this:

  Finalize Aggregate
    -> Gather
         -> Partial Aggregate
              -> Join
                   -> ...
                   -> ...

but we currently don't support a combination of both:

  Finalize Aggregate
    -> Gather
         -> Join
              -> Partial Aggregate
              -> ...

I wonder if that's actually even possible and what would it take to make it work?

regards

--
Tomas Vondra
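For readers who want to reproduce the shape of this test, here is a minimal self-contained sketch of the star schema described above. The table names come from the quoted query; the column types, sizes and data are only illustrative, since the real script and spreadsheet were attached to the original mail.

  CREATE TABLE dimension (id int PRIMARY KEY, c text);
  CREATE TABLE fact (d_id int REFERENCES dimension (id), f numeric);

  INSERT INTO dimension SELECT i, 'val_' || i FROM generate_series(1, 100) AS i;
  INSERT INTO fact SELECT (i % 100) + 1, i FROM generate_series(1, 1000000) AS i;
  ANALYZE;

  -- with aggregate push-down enabled, the expected plan shape is the first one
  -- shown above (Partial Aggregate below the join, Finalize Aggregate on top)
  EXPLAIN (COSTS OFF)
  SELECT d.c, sum(f) FROM fact JOIN dimension d ON (d.id = f.d_id) GROUP BY d.c;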
Re: [HACKERS] WIP: Aggregation push-down
Alvaro Herrera wrote: > This stuff seems very useful. How come it sits unreviewed for so long? I think the review is hard for people who are not interested in the planner very much. And as for further development, there are a few design decisions that can hardly be resolved without Tom Lane's comments. Right now I recall two problems: 1) is the way I currently store RelOptInfo for the grouped relations correct?, 2) how should we handle types for which logical equality does not imply physical (byte-wise) equality? Fortunately it seems now that I'm not the only one who cares about 2), so this problem might be resolved soon: https://www.postgresql.org/message-id/CAH2-Wzn3Ee49Gmxb7V1VJ3-AC8fWn-Fr8pfWQebHe8rYRxt5OQ%40mail.gmail.com But 1) still remains. -- Antonin Houska Web: https://www.cybertec-postgresql.com
Re: [HACKERS] WIP: Aggregation push-down
This stuff seems very useful. How come it sits unreviewed for so long?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WIP: Aggregation push-down
Richard Guo wrote:

> Another core dump for query below:
>
> select sum(t1.s1) from t1, t2, t3, t4 where t1.j1 = t2.j2 group by t1.g1,
> t2.g2;
>
> This is due to a small mistake:
>
> diff --git a/src/backend/optimizer/util/relnode.c
> b/src/backend/optimizer/util/relnode.c
> index 10becc0..9c2deed 100644
> --- a/src/backend/optimizer/util/relnode.c
> +++ b/src/backend/optimizer/util/relnode.c
> @@ -648,7 +648,7 @@ void
>  add_grouped_rel(PlannerInfo *root, RelOptInfo *rel, RelAggInfo *agg_info)
>  {
>      add_rel_info(root->grouped_rel_list, rel);
> -    add_rel_info(root->agg_info_list, rel);
> +    add_rel_info(root->agg_info_list, agg_info);
>  }
>

Thanks! Yes, this is what I had to fix.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

[Attachment: rebased patch series, beginning with "[PATCH 1/3] Introduce RelInfoList structure."; the inline diffs are truncated in this archive.]
Re: [HACKERS] WIP: Aggregation push-down
On Fri, Jul 12, 2019 at 4:42 PM Antonin Houska wrote: > Richard Guo wrote: > > > I didn't fully follow the whole thread and mainly looked into the > > latest > > patch set. So what are the considerations for abandoning the > > aggmultifn > > concept? > > Originally the function was there to support join where both relations are > grouped. My doubts about usefulness of such join started at [1]. (See the > thread referenced from [2].) > > > In my opinion, aggmultifn would enable us to do a lot more > > types of transformation. For example, consider the query below: > > > > select sum(foo.c) from foo join bar on foo.b = bar.b group by foo.a, > > bar.a; > > > > With the latest patch, the plan looks like: > > > > Finalize HashAggregate<-- sum(psum) > >Group Key: foo.a, bar.a > >-> Hash Join > > Hash Cond: (bar.b = foo.b) > > -> Seq Scan on bar > > -> Hash > >-> Partial HashAggregate<-- sum(foo.c) as > > psum > > Group Key: foo.a, foo.b > > -> Seq Scan on foo > > > > > > If we have aggmultifn, we can perform the query this way: > > > > Finalize HashAggregate<-- sum(foo.c)*cnt > >Group Key: foo.a, bar.a > >-> Hash Join > > Hash Cond: (foo.b = bar.b) > > -> Seq Scan on foo > > -> Hash > >-> Partial HashAggregate<-- count(*) as cnt > > Group Key: bar.a, bar.b > > -> Seq Scan on bar > > Perhaps, but it that would require changes to nodeAgg.c, which I want to > avoid > for now. > > > And this way: > > > > Finalize HashAggregate<-- sum(psum)*cnt > >Group Key: foo.a, bar.a > >-> Hash Join > > Hash Cond: (foo.b = bar.b) > >-> Partial HashAggregate<-- sum(foo.c) as > > psum > > Group Key: foo.a, foo.b > > -> Seq Scan on foo > > -> Hash > >-> Partial HashAggregate<-- count(*) as cnt > > Group Key: bar.a, bar.b > > -> Seq Scan on bar > > This looks like my idea presented in [1], for which there seems to be no > common use case. > > > My another question is in function add_grouped_path(), when creating > > sorted aggregation path on top of subpath. If the subpath is not > > sorted, > > then the sorted aggregation path would not be generated. Why not in > > this > > case we create a sort path on top of subpath first and then create > > group > > aggregation path on top of the sort path? > > I see no reason not to do it. (If you want to try, just go ahead.) However > the > current patch version brings only the basic functionality and I'm not > going to > add new functionality (such as parallel aggregation, partitioned tables or > postgres_fdw) until the open design questions are resolved. Otherwise > there's > a risk that the additional work will be wasted due to major rework of the > core > functionality. > > > Core dump when running one query in agg_pushdown.sql > > Thanks for the report! Fixed in the new version. > Another core dump for query below: select sum(t1.s1) from t1, t2, t3, t4 where t1.j1 = t2.j2 group by t1.g1, t2.g2; This is due to a small mistake: diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c index 10becc0..9c2deed 100644 --- a/src/backend/optimizer/util/relnode.c +++ b/src/backend/optimizer/util/relnode.c @@ -648,7 +648,7 @@ void add_grouped_rel(PlannerInfo *root, RelOptInfo *rel, RelAggInfo *agg_info) { add_rel_info(root->grouped_rel_list, rel); - add_rel_info(root->agg_info_list, rel); + add_rel_info(root->agg_info_list, agg_info); } Thanks Richard
Re: [HACKERS] WIP: Aggregation push-down
On Tue, Jul 9, 2019 at 9:47 PM Antonin Houska wrote:

> Richard Guo wrote:
>
> > Another rebase is needed for the patches.
>
> Done.

I didn't fully follow the whole thread and mainly looked into the latest patch set. So what are the considerations for abandoning the aggmultifn concept?

In my opinion, aggmultifn would enable us to do a lot more types of transformation. For example, consider the query below:

  select sum(foo.c) from foo join bar on foo.b = bar.b group by foo.a, bar.a;

With the latest patch, the plan looks like:

  Finalize HashAggregate                       <-- sum(psum)
    Group Key: foo.a, bar.a
    ->  Hash Join
          Hash Cond: (bar.b = foo.b)
          ->  Seq Scan on bar
          ->  Hash
                ->  Partial HashAggregate      <-- sum(foo.c) as psum
                      Group Key: foo.a, foo.b
                      ->  Seq Scan on foo

If we have aggmultifn, we can perform the query this way:

  Finalize HashAggregate                       <-- sum(foo.c)*cnt
    Group Key: foo.a, bar.a
    ->  Hash Join
          Hash Cond: (foo.b = bar.b)
          ->  Seq Scan on foo
          ->  Hash
                ->  Partial HashAggregate      <-- count(*) as cnt
                      Group Key: bar.a, bar.b
                      ->  Seq Scan on bar

And this way:

  Finalize HashAggregate                       <-- sum(psum)*cnt
    Group Key: foo.a, bar.a
    ->  Hash Join
          Hash Cond: (foo.b = bar.b)
          ->  Partial HashAggregate            <-- sum(foo.c) as psum
                Group Key: foo.a, foo.b
                ->  Seq Scan on foo
          ->  Hash
                ->  Partial HashAggregate      <-- count(*) as cnt
                      Group Key: bar.a, bar.b
                      ->  Seq Scan on bar

My another question is in function add_grouped_path(), when creating sorted aggregation path on top of subpath. If the subpath is not sorted, then the sorted aggregation path would not be generated. Why not in this case we create a sort path on top of subpath first and then create group aggregation path on top of the sort path?

Core dump when running one query in agg_pushdown.sql

  EXPLAIN ANALYZE SELECT p.x, avg(c1.v) FROM agg_pushdown_parent AS p JOIN
  agg_pushdown_child1 AS c1 ON c1.parent = p.i GROUP BY p.i;

  #0  0x006def98 in CheckVarSlotCompatibility (slot=0x0, attnum=1, vartype=23) at execExprInterp.c:1850
  #1  0x006def5d in CheckExprStillValid (state=0x2b63a28, econtext=0x2ba4958) at execExprInterp.c:1814
  #2  0x006dee38 in ExecInterpExprStillValid (state=0x2b63a28, econtext=0x2ba4958, isNull=0x7fff7cd16a37) at execExprInterp.c:1763
  #3  0x007144dd in ExecEvalExpr (state=0x2b63a28, econtext=0x2ba4958, isNull=0x7fff7cd16a37) at ../../../src/include/executor/executor.h:288
  #4  0x00715475 in ExecIndexEvalRuntimeKeys (econtext=0x2ba4958, runtimeKeys=0x2b63910, numRuntimeKeys=1) at nodeIndexscan.c:630
  #5  0x0071533b in ExecReScanIndexScan (node=0x2b62bf8) at nodeIndexscan.c:568
  #6  0x006d4ce6 in ExecReScan (node=0x2b62bf8) at execAmi.c:182
  #7  0x007152a0 in ExecIndexScan (pstate=0x2b62bf8) at nodeIndexscan.c:530

This is really a cool feature. Thank you for working on this.

Thanks
Richard
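For illustration, the last plan above can also be written out as ordinary SQL: partial sums on the foo side, per-group row counts on the bar side, combined after the join. The table definitions are only illustrative, and the multiplication performed in the outer aggregate is the step that an aggmultifn would encapsulate.

  CREATE TABLE foo (a int, b int, c int);
  CREATE TABLE bar (a int, b int);

  -- original query
  SELECT sum(foo.c) FROM foo JOIN bar ON foo.b = bar.b GROUP BY foo.a, bar.a;

  -- hand-written equivalent of the doubly pushed-down plan
  SELECT sum(p.psum * q.cnt)
  FROM (SELECT a, b, sum(c)   AS psum FROM foo GROUP BY a, b) AS p
  JOIN (SELECT a, b, count(*) AS cnt  FROM bar GROUP BY a, b) AS q ON p.b = q.b
  GROUP BY p.a, q.a;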
Re: [HACKERS] WIP: Aggregation push-down
Richard Guo wrote:

> Another rebase is needed for the patches.

Done.

--
Antonin Houska
Web: https://www.cybertec-postgresql.com

[Attachment: rebased patch series, beginning with "[PATCH 1/3] Introduce RelInfoList structure."; the inline diffs are truncated in this archive.]
Re: [HACKERS] WIP: Aggregation push-down
On Fri, Mar 1, 2019 at 12:01 AM Antonin Houska wrote: > Tom Lane wrote: > > > Antonin Houska writes: > > > Michael Paquier wrote: > > >> Latest patch set fails to apply, so moved to next CF, waiting on > > >> author. > > > > > Rebased. > > > > This is in need of rebasing again :-(. I went ahead and pushed the 001 > > part, since that seemed fairly uncontroversial. > > ok, thanks. > Another rebase is needed for the patches. Thanks Richard
Re: [HACKERS] WIP: Aggregation push-down
Tom Lane wrote: > Antonin Houska writes: > > Michael Paquier wrote: > >> Latest patch set fails to apply, so moved to next CF, waiting on > >> author. > > > Rebased. > > This is in need of rebasing again :-(. I went ahead and pushed the 001 > part, since that seemed fairly uncontroversial. ok, thanks. > I did not spend a whole lot of time looking at the patch today, but > I'm still pretty distressed at the data structures you've chosen. > I remain of the opinion that a grouped relation and a base relation > are, er, unrelated, even if they happen to share the same relid set. > So I do not see the value of the RelOptInfoSet struct you propose here, ok. As you suggested upthread, I try now to reuse the join_rel_list / join_rel_hash structures, see v11-001-Introduce_RelInfoList.patch. > and I definitely don't think there's any value in having, eg, > create_seqscan_path or create_index_path dealing with this stuff. Originally I tried to aggregate any path that ever gets passed to agg_path(), but that's probably not worth the code complexity. Now the partial aggregation is only applied to paths that survived agg_path() on the plain relation. > I also don't like changing create_nestloop_path et al to take a PathTarget > rather than using the RelOptInfo's pathtarget; IMO, it's flat out wrong > for a path to generate a tlist different from what its parent RelOptInfo > says that the relation produces. Likewise, the current patch version is less invasive, so create_nestloop_path et al are not touched at all. > I think basically the way this ought to work is that we generate baserel > paths pretty much the same as today, and then we run through the baserels > and see which ones are potentially worth doing partial aggregation on, > and for each one that is, we create a separate "upper relation" RelOptInfo > describing that. The paths for this RelOptInfo would be > partial-aggregation paths using paths from the corresponding baserel as > input. Then we'd run a join search that only considers joining grouped > rels with plain rels (I concur that joining two grouped rels is not worth > coping with, at least for now). make_join_rel() is the core of my implementation: besides joining two plain relations it tries to join plain relation to grouped one, and also to aggregate the join of the two plain relations. I consider this approach less invasive and more efficient than running the whole standard_join_search again for the grouped rels. The problem of numeric-like data types (i.e. types for wich equality of two values of the grouping key does not justify putting them into the same group because information like scale would be discarded this way) remains open. My last idea was to add a boolean flag to operator class which tells that equality implies "bitwise equality", and to disallow aggregate push-down if SortGroupClause.eqop is in an opclass which has this field FALSE. I'd like to hear your opinion before I do any coding. 
--
Antonin Houska
https://www.cybertec-postgresql.com

[Attachment: patch series referred to above (v11-001-Introduce_RelInfoList.patch etc.); the inline diffs are truncated in this archive.]
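As a concrete illustration of the numeric problem mentioned above: under the ordinary equality operator both values below fall into a single group, although their stored representations (and thus the scale) differ, so which of the two values the group emits is not determined by the query semantics. (Illustrative example only, not from the patch's regression tests.)

  SELECT a, count(*)
  FROM (VALUES ('0'::numeric), ('0.00')) AS t(a)
  GROUP BY a;
  -- one output row; whether its "a" is displayed as 0 or as 0.00 is not specified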
Re: [HACKERS] WIP: Aggregation push-down
Antonin Houska writes: > Michael Paquier wrote: >> Latest patch set fails to apply, so moved to next CF, waiting on >> author. > Rebased. This is in need of rebasing again :-(. I went ahead and pushed the 001 part, since that seemed fairly uncontroversial. (Note that I changed estimate_hashagg_tablesize's result type to double on the way; you probably want to make corresponding adjustments in your patch.) I did not spend a whole lot of time looking at the patch today, but I'm still pretty distressed at the data structures you've chosen. I remain of the opinion that a grouped relation and a base relation are, er, unrelated, even if they happen to share the same relid set. So I do not see the value of the RelOptInfoSet struct you propose here, and I definitely don't think there's any value in having, eg, create_seqscan_path or create_index_path dealing with this stuff. I also don't like changing create_nestloop_path et al to take a PathTarget rather than using the RelOptInfo's pathtarget; IMO, it's flat out wrong for a path to generate a tlist different from what its parent RelOptInfo says that the relation produces. I think basically the way this ought to work is that we generate baserel paths pretty much the same as today, and then we run through the baserels and see which ones are potentially worth doing partial aggregation on, and for each one that is, we create a separate "upper relation" RelOptInfo describing that. The paths for this RelOptInfo would be partial-aggregation paths using paths from the corresponding baserel as input. Then we'd run a join search that only considers joining grouped rels with plain rels (I concur that joining two grouped rels is not worth coping with, at least for now). regards, tom lane
Re: [HACKERS] WIP: Aggregation push-down
On Tue, Jan 15, 2019 at 03:06:18PM +0100, Antonin Houska wrote:
> This is the next version. A few more comments below.

Latest patch set fails to apply, so moved to next CF, waiting on author.

--
Michael
Re: [HACKERS] WIP: Aggregation push-down
Tom Lane wrote: > Antonin Houska writes: > > [ agg_pushdown_v8.tgz ] > > I spent a few hours looking at this today. Thanks! > It seems to me that at no point has there been a clear explanation of what > the patch is trying to accomplish, so let me see whether I've got it > straight or not. (I suggest that something like this ought to be included > in optimizer/README; the patch's lack of internal documentation is a serious > deficiency.) Earlier version of the patch did update optimizer/README but I forgot to merge it into the current one. I'll fix that in the next version. (Also some more will be added, especially header comments of some functions.) > Conceptually, we'd like to be able to push aggregation down below joins > when that yields a win, which it could do by reducing the number of rows > that have to be processed at the join stage. Thus consider > > SELECT agg(a.x) FROM a, b WHERE a.y = b.z; > > We can't simply aggregate during the scan of "a", because some values of > "x" might not appear at all in the input of the naively-computed aggregate > (if their rows have no join partner in "b"), or might appear multiple > times (if their rows have multiple join partners). So at first glance, > aggregating before the join is impossible. The key insight of the patch > is that we can make some progress by considering only grouped aggregation: Yes, the patch does not handle queries with no GROUP BY clause. Primarily because I don't know how the grouped "a" table in this example could emit the "a.y" column w/o using it as a grouping key. > if we can group "a" into sets of rows that must all have the same join > partners, then we can do preliminary aggregation within each such group, > and take care of the number-of-repetitions problem when we join. If the > groups are large then this reduces the number of rows processed by the > join, at the cost that we might spend time computing the aggregate for > some row sets that will just be discarded by the join. So now we consider > > SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?; > > and ask what properties the grouping column(s) "?" must have to make this > work. I believe we can say that it's OK to push down through a join if > its join clauses are all of the form "a.y = b.z", where either a.y or b.z > is listed as a GROUP BY column *and* the join operator is equality for > the btree opfamily specified by the SortGroupClause. (Note: actually, > SortGroupClause per se mentions an equality operator, although I think the > planner quickly reduces that to an opfamily specification.) I suppose you mean make_pathkey_from_sortop(). > The concerns Robert had about equality semantics are certainly vital, but I > believe that the logic goes through correctly as long as the grouping > equality operator and the join equality operator belong to the same > opfamily. ok, generic approach like this is always better. I just could not devise it, nor can I prove that your theory is wrong. > Robert postulated cases like "citext = text", but that doesn't apply > here because no cross-type citext = text operator could be part of a > well-behaved opfamily. The reason I can see now is that the GROUP BY operator (opfamily) essentially has the grouping column on both sides, so it does not match the cross-type operator. > What we'd actually be looking at is either "citextvar::text texteq textvar" > or "citextvar citexteq textvar::citext", and the casts prevent these > expressions from matching GROUP BY entries that have the wrong semantics. 
Can you please explain this? If the expression is "citextvar::text texteq textvar", then both operands of the operator should be of "text" type (because the operator receives the output of the cast), so I'd expect a match to SortGroupClause.eqop of clause like "GROUP BY ". > However: what we have proven here is only that we can aggregate across > a set of rows that must share the same join partners. We still have > to be able to handle the case that the rowset has more than one join > partner, which AFAICS means that we must use partial aggregation and > then apply an "aggmultifn" (or else apply the aggcombinefn N times). > We can avoid that and use plain aggregation when we can prove the "b" > side of the join is unique, so that no sets of rows will have to be merged > post-join; but ISTM that that reduces the set of use cases to be too small > to be worth such a complex patch. So I'm really doubtful that we should > proceed forward with only that case available. I tried to follow Heikki's proposal in [1] and separated the functionality that does not need the partial aggregation. I was not able to foresee that the patch does not get much smaller this way. Also because I had to introduce functions that check whether particular join does duplicate aggregated values or not. (Even if we use partial aggregation, we can use this functionality to detect that we only need to call aggfinalfn() in some cases, as opposed
Re: [HACKERS] WIP: Aggregation push-down
Antonin Houska writes: > [ agg_pushdown_v8.tgz ] I spent a few hours looking at this today. It seems to me that at no point has there been a clear explanation of what the patch is trying to accomplish, so let me see whether I've got it straight or not. (I suggest that something like this ought to be included in optimizer/README; the patch's lack of internal documentation is a serious deficiency.) -- Conceptually, we'd like to be able to push aggregation down below joins when that yields a win, which it could do by reducing the number of rows that have to be processed at the join stage. Thus consider SELECT agg(a.x) FROM a, b WHERE a.y = b.z; We can't simply aggregate during the scan of "a", because some values of "x" might not appear at all in the input of the naively-computed aggregate (if their rows have no join partner in "b"), or might appear multiple times (if their rows have multiple join partners). So at first glance, aggregating before the join is impossible. The key insight of the patch is that we can make some progress by considering only grouped aggregation: if we can group "a" into sets of rows that must all have the same join partners, then we can do preliminary aggregation within each such group, and take care of the number-of-repetitions problem when we join. If the groups are large then this reduces the number of rows processed by the join, at the cost that we might spend time computing the aggregate for some row sets that will just be discarded by the join. So now we consider SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?; and ask what properties the grouping column(s) "?" must have to make this work. I believe we can say that it's OK to push down through a join if its join clauses are all of the form "a.y = b.z", where either a.y or b.z is listed as a GROUP BY column *and* the join operator is equality for the btree opfamily specified by the SortGroupClause. (Note: actually, SortGroupClause per se mentions an equality operator, although I think the planner quickly reduces that to an opfamily specification.) The concerns Robert had about equality semantics are certainly vital, but I believe that the logic goes through correctly as long as the grouping equality operator and the join equality operator belong to the same opfamily. Case 1: the GROUP BY column is a.y. Two rows of "a" whose y values are equal per the grouping operator must join to exactly the same set of "b" rows, else transitivity is failing. Case 2: the GROUP BY column is b.z. It still works, because the set of "a" rows that are equal to a given z value is well-defined, and those rows must join to exactly the "b" rows whose z entries are equal to the given value, else transitivity is failing. Robert postulated cases like "citext = text", but that doesn't apply here because no cross-type citext = text operator could be part of a well-behaved opfamily. What we'd actually be looking at is either "citextvar::text texteq textvar" or "citextvar citexteq textvar::citext", and the casts prevent these expressions from matching GROUP BY entries that have the wrong semantics. However: what we have proven here is only that we can aggregate across a set of rows that must share the same join partners. We still have to be able to handle the case that the rowset has more than one join partner, which AFAICS means that we must use partial aggregation and then apply an "aggmultifn" (or else apply the aggcombinefn N times). 
We can avoid that and use plain aggregation when we can prove the "b" side of the join is unique, so that no sets of rows will have to be merged post-join; but ISTM that that reduces the set of use cases to be too small to be worth such a complex patch. So I'm really doubtful that we should proceed forward with only that case available. Also, Tomas complained in the earlier thread that he didn't think grouping on the join column was a very common use-case in the first place. I share that concern, but I think we could extend the logic to the case that Tomas posited as being useful: SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z; where the join column b.id is unique. If we group on a.y (using semantics compatible with the join operator and the uniqueness constraint), then all "a" rows in a given group will join to exactly one "b" row that necessarily has exactly one grouping value, so this group can safely be aggregated together. We might need to combine it post-join with other "b" rows that have equal "z" values, but we can do that as long as we're okay with partial aggregation. I think this example shows why the idea is far more powerful with partial aggregation than without. -- In short, then, I don't have much use for the patch as presented in this thread, without "aggmultifn". That might be OK as the first phase in a multi-step patch, but not as a production feature. I also have no use for the stuff depending on bitwise equality rather than the user-visible operators;
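To make the unique-join-column case concrete, here is a hand-written SQL rendition, with sum() standing in for agg() and purely illustrative table definitions (this is not output of the patch):

  CREATE TABLE a (x int, y int);
  CREATE TABLE b (id int PRIMARY KEY, z int);

  -- the query form discussed above
  SELECT sum(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z;

  -- pushed-down form: because b.id is unique, each pre-aggregated group of "a"
  -- joins to at most one "b" row, so the partial sums can simply be added up
  -- per b.z after the join
  SELECT sum(p.psum)
  FROM (SELECT y, sum(x) AS psum FROM a GROUP BY y) AS p
  JOIN b ON p.y = b.id
  GROUP BY b.z;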
Re: [HACKERS] WIP: Aggregation push-down
Antonin Houska wrote:

> I've reworked the patch so that separate RelOptInfo is used for grouped
> relation. The attached patch is only the core part. Neither postgres_fdw
> changes nor the part that tries to avoid the final aggregation is included
> here. I'll add these when the patch does not seem to need another major
> rework.

This is the next version. After having posted a few notes to the [1] thread, I'm returning to the original one so it can be referenced from my entry in the CF application.

As proposed in the other thread, it uses only "simple aggregation". The 2-stage aggregation, which gives more power to the feature and which also enables parallel processing for it, will be coded in a separate patch.

I eventually decided to abandon the Var substitution proposed by Heikki in [2]. It's not necessary if we accept the restriction that, so far, the feature accepts only a simple column reference as the grouping expression.

[1] https://www.postgresql.org/message-id/c165b72e-8dbb-2a24-291f-113aeb67b76a%40iki.fi
[2] https://www.postgresql.org/message-id/113e9594-7c08-0f1f-ad15-41773b56a86b%40iki.fi

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Web: https://www.cybertec-postgresql.com

[Attachment: agg_pushdown_v8.tgz]
Re: [HACKERS] WIP: Aggregation push-down
Robert Haas wrote: > On Fri, Feb 23, 2018 at 11:08 AM, Antonin Houska wrote: > > I spent some more time thinking about this. What about adding a new strategy > > number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? > > For > > most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber > > strategy > > would point to the same operator, but types like numeric would naturally > > have > > them different. > > > > Thus the pushed-down partial aggregation can only use the > > HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In > > the > > initial version (until we have useful statistics for the binary values) we > > can > > avoid the aggregation push-down if the grouping expression output type has > > the > > two strategies implemented using different functions because, as you noted > > upthread, grouping based on binary equality can result in excessive number > > of > > groups. > > > > One open question is whether the binary equality operator needs a separate > > operator class or not. If an opclass cares only about the binary equality, > > its > > hash function(s) can be a lot simpler. > > Hmm. How about instead adding another regproc field to pg_type which > stores the OID of a function that tests binary equality for that > datatype? If that happens to be equal to the OID you got from the > opclass, then you're all set. I suppose you mean pg_operator, not pg_type. What I don't like about this is that the new field would only be useful for very little fraction of operators. On the other hand, the drawback of an additional operator classes is that we'd have to modify almost all the existing operator classes for the hash AM. (The absence of the new strategy number in an operator class cannot mean that the existing equality operator can be used to compare binary values too, because thus we can't guarantee correct behavior of the already existing user-defined operator classes.) -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
Re: [HACKERS] WIP: Aggregation push-down
On Fri, Feb 23, 2018 at 11:08 AM, Antonin Houskawrote: > I spent some more time thinking about this. What about adding a new strategy > number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For > most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy > would point to the same operator, but types like numeric would naturally have > them different. > > Thus the pushed-down partial aggregation can only use the > HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the > initial version (until we have useful statistics for the binary values) we can > avoid the aggregation push-down if the grouping expression output type has the > two strategies implemented using different functions because, as you noted > upthread, grouping based on binary equality can result in excessive number of > groups. > > One open question is whether the binary equality operator needs a separate > operator class or not. If an opclass cares only about the binary equality, its > hash function(s) can be a lot simpler. Hmm. How about instead adding another regproc field to pg_type which stores the OID of a function that tests binary equality for that datatype? If that happens to be equal to the OID you got from the opclass, then you're all set. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: Aggregation push-down
Robert Haaswrote: > On Mon, Jan 29, 2018 at 3:32 AM, Antonin Houska wrote: > > I think of a variant of this: implement an universal function that tests the > > binary values for equality (besides the actual arguments, caller would have > > to > > pass info like typlen, typalign, typbyval for each argument, and cache these > > for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is > > sufficient. Thus the problematic cases like numeric, citext, etc. would be > > detected by their non-zero oprcode. > > I don't think that's a good option, because we don't want int4eq to > have to go through a code path that has branches to support varlenas > and cstrings. Andres is busy trying to speed up expression evaluation > by removing unnecessary code branches; adding new ones would be > undoing that valuable work. I spent some more time thinking about this. What about adding a new strategy number for hash index operator classes, e.g. HTBinaryEqualStrategyNumber? For most types both HTEqualStrategyNumber and HTBinaryEqualStrategyNumber strategy would point to the same operator, but types like numeric would naturally have them different. Thus the pushed-down partial aggregation can only use the HTBinaryEqualStrategyNumber's operator to compare grouping expressions. In the initial version (until we have useful statistics for the binary values) we can avoid the aggregation push-down if the grouping expression output type has the two strategies implemented using different functions because, as you noted upthread, grouping based on binary equality can result in excessive number of groups. One open question is whether the binary equality operator needs a separate operator class or not. If an opclass cares only about the binary equality, its hash function(s) can be a lot simpler. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
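For orientation, the hash opclass entries that such a new strategy number would sit next to can be listed from the catalogs, e.g. for numeric (illustrative query only; HTBinaryEqualStrategyNumber itself is merely proposed above and does not exist today):

  SELECT ao.amopstrategy, ao.amopopr::regoperator
  FROM pg_amop ao
  JOIN pg_opfamily opf ON ao.amopfamily = opf.oid
  WHERE opf.opfname = 'numeric_ops'
    AND opf.opfmethod = (SELECT oid FROM pg_am WHERE amname = 'hash');
  -- today this shows the strategy-1 (HTEqualStrategyNumber) equality operator;
  -- the proposal would register a second, binary-equality strategy here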
Re: [HACKERS] WIP: Aggregation push-down
On Mon, Jan 29, 2018 at 3:32 AM, Antonin Houska wrote:
> I think of a variant of this: implement an universal function that tests the
> binary values for equality (besides the actual arguments, caller would have to
> pass info like typlen, typalign, typbyval for each argument, and cache these
> for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is
> sufficient. Thus the problematic cases like numeric, citext, etc. would be
> detected by their non-zero oprcode.

I don't think that's a good option, because we don't want int4eq to have to go through a code path that has branches to support varlenas and cstrings. Andres is busy trying to speed up expression evaluation by removing unnecessary code branches; adding new ones would be undoing that valuable work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] WIP: Aggregation push-down
On 01/29/18 03:32, Antonin Houska wrote:
> Robert Haas wrote:
>>> only take place if we had a special equality operator which distinguishes the
>>> *binary* values (I don't know yet how to store this operator the catalog ---
...
>> We don't have an operator that tests for binary equality, but it's
>> certainly testable from C; see datumIsEqual.

Disclaimer: I haven't been following the whole thread closely. But could there be some way to exploit the composite-type *= operator?

https://www.postgresql.org/docs/current/static/functions-comparisons.html#COMPOSITE-TYPE-COMPARISON

-Chap
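For what it's worth, *= (record image comparison) does distinguish values that the ordinary equality operator treats as equal, e.g. numerics that differ only in scale, so a quick check would be expected to show (illustrative only):

  SELECT ROW('0'::numeric) =  ROW('0.00'::numeric);   -- true: per-column "=" uses numeric equality
  SELECT ROW('0'::numeric) *= ROW('0.00'::numeric);   -- false: the stored datums differ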
Re: [HACKERS] WIP: Aggregation push-down
Robert Haaswrote: > On Fri, Jan 26, 2018 at 8:04 AM, Antonin Houska wrote: > > So one problem is that the grouping expression can be inappropriate for > > partial aggregation even if there's no type change during the > > translation. What I consider typical for this case is that the equality > > operator used to identify groups (SortGroupClause.eqop) can return true even > > if binary (stored) values of the inputs differ. The partial aggregation > > could > > only take place if we had a special equality operator which distinguishes > > the > > *binary* values (I don't know yet how to store this operator the catalog --- > > in pg_opclass recors for the hash opclasses?).. > > We don't have an operator that tests for binary equality, but it's > certainly testable from C; see datumIsEqual. I'm not sure how much > this really helps, though. I think it would be safe to build an > initial set of groups based on datumIsEqual comparisons and then > arrange to later merge some of the groups. But that's not something > we ever do in the executor today, so it might require quite a lot of > hacking. Also, it seems like it might really suck in some cases. For > instance, consider something like SELECT scale(one.a), sum(two.a) FROM > one, two WHERE one.a = two.a GROUP BY 1. Doing a Partial Aggregate on > two.a using datumIsEqual semantics, joining to a, and then doing a > Finalize Aggregate looks legal, but the Partial Aggregate may produce > a tremendous number of groups compared to the Finalize Aggregate. In > other words, this technique wouldn't merge any groups that shouldn't > be merged, but it might fail to merge groups that really need to be > merged to get good performance. I don't insist on doing Partial Aggregate in any case. If we wanted to group by the binary value, we'd probably have to enhance statistics accordingly. The important thing is to recognize the special case like this. Rejection of the Partial Aggregate would be the default response. > > Another idea is to allow only such changes that the > > destination type is in the same operator class as the source, and explicitly > > enumerate the "safe opclasses". But that would mean that user cannot define > > new opclasses within which the translation is possible --- not sure this is > > a > > serious issue. > > Enumerating specific opclasses in the source code is a non-starter -- > Tom Lane would roll over in his grave if he weren't still alive. What > we could perhaps consider doing is adding some mechanism for an > opclass or opfamily to say whether its equality semantics happen to be > exactly datumIsEqual() semantics. That's a little grotty because it > leaves data types for which that isn't true out in the cold, but on > the other hand it seems like it would be useful as a way of optimizing > a bunch of things other than this. Maybe it could also include a way > to specify that the comparator happens to have the semantics as C's > built-in < and > operators, which seems like a case that might also > lend itself to some optimizations. I think of a variant of this: implement an universal function that tests the binary values for equality (besides the actual arguments, caller would have to pass info like typlen, typalign, typbyval for each argument, and cache these for repeated calls) and set pg_proc(oprcode) to 0 wherever this function is sufficient. Thus the problematic cases like numeric, citext, etc. would be detected by their non-zero oprcode. 
-- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26, A-2700 Wiener Neustadt Web: https://www.cybertec-postgresql.com
Re: [HACKERS] WIP: Aggregation push-down
On Fri, Jan 26, 2018 at 8:04 AM, Antonin Houska wrote:
> So one problem is that the grouping expression can be inappropriate for partial aggregation even if there's no type change during the translation. What I consider typical for this case is that the equality operator used to identify groups (SortGroupClause.eqop) can return true even if binary (stored) values of the inputs differ. The partial aggregation could only take place if we had a special equality operator which distinguishes the *binary* values (I don't know yet how to store this operator in the catalog --- in pg_opclass records for the hash opclasses?).

We don't have an operator that tests for binary equality, but it's certainly testable from C; see datumIsEqual. I'm not sure how much this really helps, though. I think it would be safe to build an initial set of groups based on datumIsEqual comparisons and then arrange to later merge some of the groups. But that's not something we ever do in the executor today, so it might require quite a lot of hacking. Also, it seems like it might really suck in some cases. For instance, consider something like SELECT scale(one.a), sum(two.a) FROM one, two WHERE one.a = two.a GROUP BY 1. Doing a Partial Aggregate on two.a using datumIsEqual semantics, joining to a, and then doing a Finalize Aggregate looks legal, but the Partial Aggregate may produce a tremendous number of groups compared to the Finalize Aggregate. In other words, this technique wouldn't merge any groups that shouldn't be merged, but it might fail to merge groups that really need to be merged to get good performance.

> One of my ideas is to check whether the source and destination types are binary coercible (i.e. pg_cast(castfunc) = 0) but this might be a misuse of the binary coercibility.

Yeah, binary coercibility has nothing to do with this; that tells you whether the two types are the same on disk, not whether they have the same equality semantics. For instance, I think text and citext are binary coercible, but their equality semantics are not the same.

> Another idea is to allow only such changes that the destination type is in the same operator class as the source, and explicitly enumerate the "safe opclasses". But that would mean that a user cannot define new opclasses within which the translation is possible --- not sure this is a serious issue.

Enumerating specific opclasses in the source code is a non-starter -- Tom Lane would roll over in his grave if he weren't still alive. What we could perhaps consider doing is adding some mechanism for an opclass or opfamily to say whether its equality semantics happen to be exactly datumIsEqual() semantics. That's a little grotty because it leaves data types for which that isn't true out in the cold, but on the other hand it seems like it would be useful as a way of optimizing a bunch of things other than this. Maybe it could also include a way to specify that the comparator happens to have the same semantics as C's built-in < and > operators, which seems like a case that might also lend itself to some optimizations.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
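For a quick sense of the "tremendous number of groups" concern above (a sketch only, not from the patch): numeric equality collapses values that differ in their stored form, which datumIsEqual-style grouping would keep apart.

    SELECT a, count(*)
    FROM (VALUES ('1'::numeric), ('1.0'), ('1.00')) AS two(a)
    GROUP BY a;
    -- one group under numeric equality; grouping on the binary image would keep three

    SELECT scale(a), count(*)
    FROM (VALUES ('1'::numeric), ('1.0'), ('1.00')) AS two(a)
    GROUP BY 1;
    -- three groups, since scale() exposes the stored form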
Re: [HACKERS] WIP: Aggregation push-down
Robert Haas wrote:
> On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska wrote:
> > Michael Paquier wrote:
> >> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska wrote:
> >> > I'm not about to add any other features now. Implementation of the missing parts (see the TODO comments in the code) is the next step. But what I'd appreciate most is a feedback on the design. Thanks.
> >>
> >> I am getting a conflict after applying patch 5 but this did not get any reviews so moved to next CF with waiting on author as status.
> >
> > Attached is the next version.
>
> I've been a bit confused for a while about what this patch is trying to do, so I spent some time today looking at it to try to figure it out.

Thanks.

> There's a lot I don't understand yet, but it seems like the general idea is to build, potentially for each relation in the join tree, not only the regular list of paths but also a list of "grouped" paths. If the pre-grouped path wins, then we can get a final path that looks like Finalize Aggregate -> Some Joins -> Partial Aggregate -> Maybe Some More Joins -> Base Table Scan.

Yes. The regression test output shows some more plans.

> In some cases the patch seems to replace that uppermost Finalize Aggregate with a Result node.

This is a feature implemented in 09_avoid_agg_finalization.diff. You can ignore it for now; it's of lower priority than the preceding parts and needs more work to be complete.

> translate_expression_to_rels() looks unsafe. Equivalence members are known to be equal under the semantics of the relevant operator class, but that doesn't mean that one can be freely substituted for another. For example:
>
> rhaas=# create table one (a numeric);
> CREATE TABLE
> rhaas=# create table two (a numeric);
> CREATE TABLE
> rhaas=# insert into one values ('0');
> INSERT 0 1
> rhaas=# insert into two values ('0.00');
> INSERT 0 1
> rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1;
>  a | count
> ---+-------
>  0 |     1
> (1 row)
>
> rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1;
>   a   | count
> ------+-------
>  0.00 |     1
> (1 row)
>
> There are, admittedly, a large number of data types for which such a substitution would work just fine. numeric will not, citext will not, but many others are fine. Regrettably, we have no framework in the system for identifying equality operators which actually test identity versus some looser notion of equality. Cross-type operators are a problem, too; if we have foo.x = bar.y in the query, one might be int4 and the other int8, for example, but they can still belong to the same equivalence class.
>
> Concretely, in your test query "SELECT p.i, avg(c1.v) FROM agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent = p.i GROUP BY p.i" you assume that it's OK to do a Partial HashAggregate over c1.parent rather than p.i. This will be false if, say, c1.parent is of type citext and p.i is of type text; values will get grouped together that shouldn't be.

I had really missed this problem, thanks for bringing it up! Your considerations made me think about the specifics of types like numeric or citext. Although your example does not demonstrate what I consider the core issue, I believe we ultimately have the same thing in mind.
Consider this example query (using the tables you defined upthread):

SELECT one.a
FROM one, two
WHERE one.a = two.a AND scale(two.a) > 1
GROUP BY 1

If we first try to aggregate "two", then evaluate the WHERE clause and then finalize the aggregation, the partial aggregation of "two" can put various binary values of the "a" column (0, 0.0, etc.) into the same group, so the scale of the output (of the partial aggregation) will be undefined. Thus the aggregation push-down will change the input of the WHERE clause.

So one problem is that the grouping expression can be inappropriate for partial aggregation even if there's no type change during the translation. What I consider typical for this case is that the equality operator used to identify groups (SortGroupClause.eqop) can return true even if binary (stored) values of the inputs differ. The partial aggregation could only take place if we had a special equality operator which distinguishes the *binary* values (I don't know yet how to store this operator in the catalog --- in pg_opclass records for the hash opclasses?).

On the other hand, if the grouping expression is not a plain Var, there's no type change during the translation, and the expression output type is not an "identity-unsafe" type (numeric, citext, etc.), then input Vars of an "identity-unsafe" type should not prevent us from using the expression for partial aggregation: in such a case the grouping keys are computed in the same way they would be without the partial aggregation.

The other problem is
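The "undefined scale" effect in the example query above is easy to see in psql (a sketch only, not from the patch): the value that ends up representing a numeric group is simply one of its inputs.

    SELECT a, scale(a)
    FROM (VALUES ('0'::numeric), ('0.00')) AS two(a)
    GROUP BY a;
    -- one group; whether scale(a) returns 0 or 2 depends on which input value
    -- happens to represent the group, so a qual like scale(two.a) > 1 evaluated
    -- after partial aggregation may or may not pass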
Re: [HACKERS] WIP: Aggregation push-down
On Fri, Dec 22, 2017 at 10:43 AM, Antonin Houska wrote:
> Michael Paquier wrote:
>> On Sat, Nov 4, 2017 at 12:33 AM, Antonin Houska wrote:
>> > I'm not about to add any other features now. Implementation of the missing parts (see the TODO comments in the code) is the next step. But what I'd appreciate most is a feedback on the design. Thanks.
>>
>> I am getting a conflict after applying patch 5 but this did not get any reviews so moved to next CF with waiting on author as status.
>
> Attached is the next version.

I've been a bit confused for a while about what this patch is trying to do, so I spent some time today looking at it to try to figure it out. There's a lot I don't understand yet, but it seems like the general idea is to build, potentially for each relation in the join tree, not only the regular list of paths but also a list of "grouped" paths. If the pre-grouped path wins, then we can get a final path that looks like Finalize Aggregate -> Some Joins -> Partial Aggregate -> Maybe Some More Joins -> Base Table Scan. In some cases the patch seems to replace that uppermost Finalize Aggregate with a Result node.

translate_expression_to_rels() looks unsafe. Equivalence members are known to be equal under the semantics of the relevant operator class, but that doesn't mean that one can be freely substituted for another. For example:

rhaas=# create table one (a numeric);
CREATE TABLE
rhaas=# create table two (a numeric);
CREATE TABLE
rhaas=# insert into one values ('0');
INSERT 0 1
rhaas=# insert into two values ('0.00');
INSERT 0 1
rhaas=# select one.a, count(*) from one, two where one.a = two.a group by 1;
 a | count
---+-------
 0 |     1
(1 row)

rhaas=# select two.a, count(*) from one, two where one.a = two.a group by 1;
  a   | count
------+-------
 0.00 |     1
(1 row)

There are, admittedly, a large number of data types for which such a substitution would work just fine. numeric will not, citext will not, but many others are fine. Regrettably, we have no framework in the system for identifying equality operators which actually test identity versus some looser notion of equality. Cross-type operators are a problem, too; if we have foo.x = bar.y in the query, one might be int4 and the other int8, for example, but they can still belong to the same equivalence class.

Concretely, in your test query "SELECT p.i, avg(c1.v) FROM agg_pushdown_parent AS p JOIN agg_pushdown_child1 AS c1 ON c1.parent = p.i GROUP BY p.i" you assume that it's OK to do a Partial HashAggregate over c1.parent rather than p.i. This will be false if, say, c1.parent is of type citext and p.i is of type text; values will get grouped together that shouldn't be. It will also be false if the grouping expression is something like GROUP BY length(p.i::text), because one value could be '0'::numeric and the other '0.00'::numeric. I can't think of a reason why it would be false if the grouping expressions are both simple Vars of the same underlying data type, but I'm a little nervous that I might be wrong even about that case. Maybe you've handled all of this somehow, but it's not obvious to me that it has been considered.

Another thing I noticed is that the GroupedPathInfo looks a bit like a stripped-down RelOptInfo, in that for example it has both a pathlist and a partial_pathlist. I'm inclined to think that we should build new RelOptInfos instead.
As you have it, there are an awful lot of things that seem to know about the grouped/ungrouped distinction, many of which are quite low-level functions like add_path(), add_paths_to_append_rel(), try_nestloop_path(), etc. I think that some of this would go away if you had separate RelOptInfos instead of GroupedPathInfo.

Some compiler noise:

allpaths.c:2794:11: error: variable 'partial_target' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~
allpaths.c:2798:56: note: uninitialized use occurs here
                create_gather_path(root, rel, cheapest_partial_path, partial_target,
                                                                     ^~
allpaths.c:2794:7: note: remove the 'if' if its condition is always true
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
             ^
allpaths.c:2794:11: error: variable 'partial_target' is used uninitialized whenever '&&' condition is false [-Werror,-Wsometimes-uninitialized]
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
                 ^~~~
allpaths.c:2798:56: note: uninitialized use occurs here
                create_gather_path(root, rel, cheapest_partial_path, partial_target,
                                                                     ^~
allpaths.c:2794:11: note: remove the '&&' if its condition is always true
        else if (rel->gpi != NULL && rel->gpi->target != NULL)
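The citext/text mismatch mentioned earlier in this review is also easy to reproduce directly (a sketch only, assuming the citext extension is available; the table aliases are just illustrative):

    CREATE EXTENSION IF NOT EXISTS citext;

    -- grouping on the citext side merges values ...
    SELECT v, count(*)
    FROM (VALUES ('Foo'::citext), ('foo')) AS c1(v)
    GROUP BY v;
    -- 1 row

    -- ... that grouping on the text side keeps apart
    SELECT v::text, count(*)
    FROM (VALUES ('Foo'::citext), ('foo')) AS c1(v)
    GROUP BY 1;
    -- 2 rows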