Re: [HACKERS] Partition-wise aggregation/grouping
On 11.11.2017 23:29, Konstantin Knizhnik wrote:
> [...]
I attach a small patch for postgres_fdw.c which allows concurrent execution of aggregates by all remote servers (when they are accessed through postgres_fdw). I have added a "postgres_fdw.use_prefetch" GUC to enable/disable prefetching of data in postgres_fdw.
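For illustration only: the "start everywhere, then fetch" idea maps naturally onto libpq's asynchronous API. Here is a minimal sketch, not the attached patch; conn[] and agg_sql[] are hypothetical (one open connection and one pushed-down aggregate statement per shard):

    #include <libpq-fe.h>

    /*
     * Sketch: issue the aggregate query on every shard without waiting,
     * then drain the results.  The remote servers compute concurrently
     * even though we collect sequentially.
     */
    static void
    run_shard_aggregates(PGconn **conn, const char **agg_sql, int nshards)
    {
        int     i;

        for (i = 0; i < nshards; i++)
            PQsendQuery(conn[i], agg_sql[i]);   /* fire off all remote queries */

        for (i = 0; i < nshards; i++)
        {
            PGresult   *res;

            while ((res = PQgetResult(conn[i])) != NULL)
            {
                /* ... combine the partial aggregate values here ... */
                PQclear(res);
            }
        }
    }

This is essentially what an "asynchronous append" would need from the executor's perspective.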
Re: [HACKERS] Partition-wise aggregation/grouping
On 10/27/2017 02:01 PM, Jeevan Chalke wrote:
> Hi,
> Attached new patch-set here. Changes include:
> 1. Added separate patch for costing Append node as discussed up-front in the patch-set.
> 2. Since we now cost Append node, we don't need partition_wise_agg_cost_factor GUC. So removed that. The remaining patch hence merged into main implementation patch.
> 3. Updated rows in test-cases so that we will get partition-wise plans.

Thanks. I applied the partition-wise-agg-v6.tar.gz patch to master and used the shard.sh example from https://www.postgresql.org/message-id/14577.1509723225%40localhost

The plan for count(*) is the following:

shard=# explain select count(*) from orders;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
   ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_0  (cost=101.00..50195.13 rows=5000 width=0)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_1  (cost=101.00..50195.13 rows=5000 width=0)

We really do calculate a partial aggregate for each partition, but to do so we still have to fetch all the data from the remote host. So for foreign partitions such a plan is absolutely inefficient. Maybe it should be combined with some other patch? For example, with the agg_pushdown_v4.tgz patch https://www.postgresql.org/message-id/14577.1509723225%40localhost ? But that one does not apply on top of the partition-wise-agg-v6.tar.gz patch. Also, postgres_fdw in 11dev is able to push down aggregates even without the agg_pushdown_v4.tgz patch.

In 0009-Teach-postgres_fdw-to-push-aggregates-for-child-rela.patch there is the following check:

    /* Partial aggregates are not supported. */
    + if (extra->isPartial)
    +     return;

If we just comment out this check, the produced plan will be the following:

shard=# explain select sum(product_id) from orders;
                          QUERY PLAN
--------------------------------------------------------------
 Finalize Aggregate  (cost=308.41..308.42 rows=1 width=8)
   ->  Append  (cost=144.18..308.41 rows=2 width=8)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_0 orders)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_1 orders)
(6 rows)

And that is actually the desired plan! Obviously such an approach will not always work: the FDW really doesn't support partial aggregates now. But for the most frequently used aggregates (sum, min, max, count) aggtype == aggtranstype, so there is no difference between partial and normal aggregate calculation. So instead of the (extra->isPartial) condition we could add a more complex check which traverses the pathtarget expressions and verifies that they can be evaluated this way (a rough sketch of such a check appears below). Or... extend the FDW API to support partial aggregation.

But even that last plan is not ideal: it will evaluate the aggregates at each remote node sequentially. There is the parallel append patch: https://www.postgresql.org/message-id/CAJ3gD9ctEcrVUmpY6fq_JUB6WDKGXAGd70EY68jVFA4kxMbKeQ%40mail.gmail.com but... the FDW doesn't support parallel scan, so parallel append cannot be applied in this case. And we actually do not need parallel append with all its dynamic workers here. We just need to start the commands at all remote servers and only after that fetch the results (which can be done sequentially).

I am investigating the problem of efficient execution of OLAP queries on sharded tables (tables with remote partitions). After reading all these threads and the corresponding patches, it seems to me that we already have most of the pieces of the puzzle; what we need is to put them in the right places and maybe add the missing ones.
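To make the suggested pathtarget check concrete, here is a rough sketch of the kind of test that could replace the (extra->isPartial) bail-out. It is not from any posted patch: the helper name and its integration point in postgres_fdw are hypothetical, and it deliberately handles only top-level Aggrefs:

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "catalog/pg_aggregate.h"
    #include "utils/syscache.h"

    /*
     * Hypothetical helper: true if every aggregate in the grouping target
     * computes a partial result identical to its final result, i.e. its
     * transition type equals its result type and it has no final function.
     * sum(), min(), max() and count() qualify; avg() does not.
     */
    static bool
    partial_agg_ok_for_pushdown(PathTarget *target)
    {
        ListCell   *lc;

        foreach(lc, target->exprs)
        {
            Node       *node = (Node *) lfirst(lc);
            Aggref     *agg;
            HeapTuple   tup;
            Form_pg_aggregate aggform;
            bool        ok;

            if (!IsA(node, Aggref))
                continue;       /* plain grouping columns are fine */
            agg = (Aggref *) node;

            tup = SearchSysCache1(AGGFNOID, ObjectIdGetDatum(agg->aggfnoid));
            if (!HeapTupleIsValid(tup))
                return false;
            aggform = (Form_pg_aggregate) GETSTRUCT(tup);
            ok = (aggform->aggtranstype == agg->aggtype &&
                  !OidIsValid(aggform->aggfinalfn));
            ReleaseSysCache(tup);

            if (!ok)
                return false;
        }
        return true;
    }

Note this glosses over combine functions; the trick works for these aggregates only because their combine step happens to accept their own final output.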
I wonder whether somebody is already busy with this, and whether I can somehow help here?

Also, I am not quite sure about the best approach to parallel execution of a distributed query at all nodes. Should we make postgres_fdw parallel-safe and use parallel append? How difficult would that be? Or, in addition to parallel append, should we also have an "asynchronous append" which would be able to initiate execution at all nodes? It seems close to merge append, because it would have to traverse all cursors simultaneously. The second approach looks easier to implement. But in the case of a sharded table, a distributed query may need to traverse both remote and local shards, and this approach doesn't allow several local shards to be processed in parallel.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Partition-wise aggregation/grouping
On Wed, Nov 1, 2017 at 6:20 PM, Jeevan Chalke wrote:
> Yep.
> But as David reported earlier, if we remove the first part, i.e. adding
> cpu_operator_cost per tuple, Merge Append will be preferred over an Append
> node, unlike before. And thus, I thought it better to have both, but am not so
> sure. Should we remove that part altogether, or add both in a single
> statement with updated comments?

I was only suggesting that you update the comments.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise aggregation/grouping
On Sat, Oct 28, 2017 at 3:07 PM, Robert Haas wrote:
> On Fri, Oct 27, 2017 at 1:01 PM, Jeevan Chalke wrote:
> > 1. Added separate patch for costing Append node as discussed up-front in the
> > patch-set.
> > 2. Since we now cost Append node, we don't need partition_wise_agg_cost_factor
> > GUC. So removed that. The remaining patch hence merged into main
> > implementation patch.
> > 3. Updated rows in test-cases so that we will get partition-wise plans.
>
> With 0006 applied, cost_merge_append() is now a little bit confused:
>
>     /*
>      * Also charge a small amount (arbitrarily set equal to operator cost) per
>      * extracted tuple.  We don't charge cpu_tuple_cost because a MergeAppend
>      * node doesn't do qual-checking or projection, so it has less overhead
>      * than most plan nodes.
>      */
>     run_cost += cpu_operator_cost * tuples;
>
>     /* Add MergeAppend node overhead like we do it for the Append node */
>     run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
>
> The first comment says that we don't add cpu_tuple_cost, and the
> second one then adds half of it anyway.

Yep.
But as David reported earlier, if we remove the first part, i.e. adding cpu_operator_cost per tuple, Merge Append will be preferred over an Append node, unlike before. And thus, I thought it better to have both, but am not so sure. Should we remove that part altogether, or add both in a single statement with updated comments?

> I think it's fine to have a #define for DEFAULT_APPEND_COST_FACTOR,
> because as you say it's used twice, but I don't think that should be
> exposed in cost.h; I'd make it private to costsize.c and rename it to
> something like APPEND_CPU_COST_MULTIPLIER. The word DEFAULT, in
> particular, seems useless to me, since there's no provision for it to
> be overridden by a different value.

Agree. Will make that change.

> What testing, if any, can we think about doing with this plan to make
> sure it doesn't regress things? For example, if we do a TPC-H run
> with partitioned tables and partition-wise join enabled, will any
> plans change with this patch?

I have tried doing this on my local developer machine. For a 1GB database size (TPC-H scale factor 1), I see no plan change with and without this patch. I have tried with scale factor 10, but the queries are not executing well due to space and memory constraints. Can someone try that out?

> Do they get faster or not? Anyone have
> other ideas for what to test?

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise aggregation/grouping
On Fri, Oct 27, 2017 at 1:01 PM, Jeevan Chalke wrote:
> 1. Added separate patch for costing Append node as discussed up-front in the
> patch-set.
> 2. Since we now cost Append node, we don't need partition_wise_agg_cost_factor
> GUC. So removed that. The remaining patch hence merged into main
> implementation patch.
> 3. Updated rows in test-cases so that we will get partition-wise plans.

With 0006 applied, cost_merge_append() is now a little bit confused:

    /*
     * Also charge a small amount (arbitrarily set equal to operator cost) per
     * extracted tuple.  We don't charge cpu_tuple_cost because a MergeAppend
     * node doesn't do qual-checking or projection, so it has less overhead
     * than most plan nodes.
     */
    run_cost += cpu_operator_cost * tuples;

    /* Add MergeAppend node overhead like we do it for the Append node */
    run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;

The first comment says that we don't add cpu_tuple_cost, and the second one then adds half of it anyway.

I think it's fine to have a #define for DEFAULT_APPEND_COST_FACTOR, because as you say it's used twice, but I don't think that should be exposed in cost.h; I'd make it private to costsize.c and rename it to something like APPEND_CPU_COST_MULTIPLIER. The word DEFAULT, in particular, seems useless to me, since there's no provision for it to be overridden by a different value.

What testing, if any, can we think about doing with this plan to make sure it doesn't regress things? For example, if we do a TPC-H run with partitioned tables and partition-wise join enabled, will any plans change with this patch? Do they get faster or not? Anyone have other ideas for what to test?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
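If it helps to visualize the suggestion, the define would presumably end up in costsize.c looking something like this (a sketch of the proposal, not committed code):

    /*
     * Private to costsize.c: per-tuple processing overhead charged for
     * Append and MergeAppend, as a multiple of cpu_tuple_cost.  Appending
     * is cheaper than most per-tuple work, hence a fraction.
     */
    #define APPEND_CPU_COST_MULTIPLIER 0.5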
Re: [HACKERS] Partition-wise aggregation/grouping
Hi,

Attached new patch-set here. Changes include:

1. Added separate patch for costing Append node as discussed up-front in the patch-set.
2. Since we now cost Append node, we don't need the partition_wise_agg_cost_factor GUC. So removed that. The remaining patch is hence merged into the main implementation patch.
3. Updated rows in test-cases so that we will get partition-wise plans.

Thanks

On Wed, Oct 18, 2017 at 9:53 AM, Dilip Kumar wrote:
> On Tue, Oct 17, 2017 at 10:44 PM, Jeevan Chalke wrote:
> >
> > I didn't get what you mean by regression here. Can you please explain?
> >
> > I see that the PWA plan is selected over the regular plan, when enabled, on the
> > basis of costing.
> > Regular planning needs a Result node, due to which its cost increases, whereas
> > PWA doesn't need that and thus wins.
>
> Sorry for not clearly explaining. I meant that with the normal plan the
> execution time is 263.678 ms, whereas with PWA it is 339.929 ms.
>
> I only set enable_partition_wise_agg=on, and it switched to PWA and
> execution time increased by 30%.
> I understand that this is the worst case for PWA, where
> FinalizeAggregate is getting all the tuples.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

[Attachment: partition-wise-agg-v6.tar.gz (GNU Zip compressed data)]
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Oct 17, 2017 at 10:44 PM, Jeevan Chalke wrote:
> I didn't get what you mean by regression here. Can you please explain?
>
> I see that the PWA plan is selected over the regular plan, when enabled, on the
> basis of costing.
> Regular planning needs a Result node, due to which its cost increases, whereas
> PWA doesn't need that and thus wins.

Sorry for not clearly explaining. I meant that with the normal plan the execution time is 263.678 ms, whereas with PWA it is 339.929 ms.

I only set enable_partition_wise_agg=on, and it switched to PWA and execution time increased by 30%.
I understand that this is the worst case for PWA, where FinalizeAggregate is getting all the tuples.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Oct 17, 2017 at 7:13 PM, Dilip Kumar wrote:
> On Fri, Oct 13, 2017 at 12:06 PM, Jeevan Chalke wrote:
> > [...]
>
> While playing around with the patch I have noticed one regression with
> the partial partition-wise aggregate.
>
> I am consistently able to reproduce this on my local machine.
>
> Scenario: Group by on a non-key column and only one tuple per group.

I didn't get what you mean by regression here. Can you please explain?

I see that the PWA plan is selected over the regular plan, when enabled, on the basis of costing.
Regular planning needs a Result node, due to which its cost increases, whereas PWA doesn't need that and thus wins.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise aggregation/grouping
On Fri, Oct 13, 2017 at 12:06 PM, Jeevan Chalke wrote:
> [...]

While playing around with the patch I have noticed one regression with the partial partition-wise aggregate.

I am consistently able to reproduce this on my local machine.

Scenario: Group by on a non-key column and only one tuple per group.

Complete test:

create table t(a int,b int) partition by range(a);
create table t1 partition of t for values from (1) to (10);
create table t2 partition of t for values from (10) to (20);
insert into t values (generate_series(1,19), generate_series(1,19));

postgres=# explain analyze select sum(a) from t group by b;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=20379.55..28379.51 rows=19 width=12) (actual time=102.311..322.969 rows=19 loops=1)
   Group Key: t1.b
   ->  Merge Append  (cost=20379.55..25379.53 rows=19 width=12) (actual time=102.303..232.310 rows=19 loops=1)
         Sort Key: t1.b
         ->  Partial GroupAggregate  (cost=10189.72..11939.70 rows=9 width=12) (actual time=52.164..108.967 rows=9 loops=1)
               Group Key: t1.b
               ->  Sort  (cost=10189.72..10439.72 rows=9 width=8) (actual time=52.158..66.236 rows=9 loops=1)
                     Sort Key: t1.b
                     Sort Method: external merge  Disk: 1768kB
                     ->  Seq Scan on t1  (cost=0.00..1884.99 rows=9 width=8) (actual time=0.860..20.388 rows=9 loops=1)
         ->  Partial GroupAggregate  (cost=10189.82..11939.82 rows=10 width=12) (actual time=50.134..102.976 rows=10 loops=1)
               Group Key: t2.b
               ->  Sort  (cost=10189.82..10439.82 rows=10 width=8) (actual time=50.128..63.362 rows=10 loops=1)
                     Sort Key: t2.b
                     Sort Method: external merge  Disk: 1768kB
                     ->  Seq Scan on t2  (cost=0.00..1885.00 rows=10 width=8) (actual time=0.498..20.977 rows=10 loops=1)
 Planning time: 0.190 ms
 Execution time: 339.929 ms
(18 rows)

postgres=# set enable_partition_wise_agg=off;
SET
postgres=# explain analyze select sum(a) from t group by b;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=26116.53..29616.51 rows=19 width=12) (actual time=139.413..250.751 rows=19 loops=1)
   Group Key: t1.b
   ->  Sort  (cost=26116.53..26616.52 rows=19 width=8) (actual time=139.406..168.775 rows=19 loops=1)
         Sort Key: t1.b
         Sort Method: external merge  Disk: 3544kB
         ->  Result  (cost=0.00..5769.98 rows=19 width=8) (actual time=0.674..76.392 rows=19 loops=1)
               ->  Append  (cost=0.00..3769.99 rows=19 width=8) (actual time=0.672..40.291 rows=19 loops=1)
                     ->  Seq Scan on t1  (cost=0.00..1884.99 rows=9 width=8) (actual time=0.672..12.408 rows=9 loops=1)
                     ->  Seq Scan on t2  (cost=0.00..1885.00 rows=10 width=8) (actual time=1.407..11.689 rows=10 loops=1)
 Planning time: 0.146 ms
 Execution time: 263.678 ms
(11 rows)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Partition-wise aggregation/grouping
On Fri, Oct 13, 2017 at 1:13 PM, David Rowley wrote:
>
> I looked over the patch and saw this:
>
> @@ -1800,6 +1827,9 @@ cost_merge_append(Path *path, PlannerInfo *root,
>     */
>    run_cost += cpu_operator_cost * tuples;
>
> +  /* Add MergeAppend node overhead like we do it for the Append node */
> +  run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
> +
>    path->startup_cost = startup_cost + input_startup_cost;
>    path->total_cost = startup_cost + run_cost + input_total_cost;
> }
>
> You're doing that right after a comment that says we don't do that. It
> also does look like the "run_cost += cpu_operator_cost * tuples" is
> trying to do the same thing, so perhaps it's worth just replacing
> that, which by default will double that additional cost, although
> doing so would have the planner slightly prefer a MergeAppend to an
> Append than previously.

I think we can remove that code block entirely, as I have already added the relevant comments around DEFAULT_APPEND_COST_FACTOR. However, I am not sure about doing this since, as you correctly said, it may prefer a MergeAppend to an Append. Will it be fine if we remove that code block?

> +#define DEFAULT_APPEND_COST_FACTOR 0.5
>
> I don't really think the DEFAULT_APPEND_COST_FACTOR adds much. It
> means very little by itself. It also seems that most of the other cost
> functions just use the magic number.

Agree, but those magic numbers are each used only once at their place, whereas here there are two places. So if someone wants to update it, (s)he needs to make sure to update it at both places. To minimize that risk, having a #define seems better.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise aggregation/grouping
On 13 October 2017 at 19:36, Jeevan Chalke wrote:
> I have tried exactly the same tests to get to this factor on my local developer
> machine. With parallelism enabled I got this number as 7.9. However, if
> I disable the parallelism (and I believe David too disabled that), I get
> this number as 1.8. Whereas for 1 rows, I get this number as 1.7.
>
> -- With Gather
> # select current_Setting('cpu_tuple_cost')::float8 /
>   ((10633.56 * (81.035 / 72.450) - 10633.56) / 100);
> 7.9
>
> -- Without Gather
> # select current_Setting('cpu_tuple_cost')::float8 /
>   ((16925.01 * (172.838 / 131.400) - 16925.01) / 100);
> 1.8
>
> -- With 1 rows (so no Gather too)
> # select current_Setting('cpu_tuple_cost')::float8 /
>   ((170.01 * (1.919 / 1.424) - 170.01) / 1);
> 1.7
>
> So it is not so straightforward to come up with the correct heuristic here.
> Thus using 50% of cpu_tuple_cost looks good to me here.
>
> As suggested by Ashutosh and Robert, attached separate small WIP patch for it.

Good to see it stays fairly consistent at different tuple counts, and is not too far away from what I got on this machine.

I looked over the patch and saw this:

@@ -1800,6 +1827,9 @@ cost_merge_append(Path *path, PlannerInfo *root,
    */
   run_cost += cpu_operator_cost * tuples;

+  /* Add MergeAppend node overhead like we do it for the Append node */
+  run_cost += cpu_tuple_cost * DEFAULT_APPEND_COST_FACTOR * tuples;
+
   path->startup_cost = startup_cost + input_startup_cost;
   path->total_cost = startup_cost + run_cost + input_total_cost;
}

You're doing that right after a comment that says we don't do that. It also does look like the "run_cost += cpu_operator_cost * tuples" is trying to do the same thing, so perhaps it's worth just replacing that, which by default will double that additional cost, although doing so would have the planner slightly prefer a MergeAppend to an Append than previously.

+#define DEFAULT_APPEND_COST_FACTOR 0.5

I don't really think the DEFAULT_APPEND_COST_FACTOR adds much. It means very little by itself. It also seems that most of the other cost functions just use the magic number.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Oct 10, 2017 at 1:31 PM, David Rowley wrote:
> On 10 October 2017 at 17:57, Ashutosh Bapat wrote:
> > Append node just returns the result of ExecProcNode(). Charging
> > cpu_tuple_cost may make it too expensive. In other places where we
> > charge cpu_tuple_cost there's some processing done to the tuple like
> > ExecStoreTuple() in SeqNext(). May be we need some other measure for
> > Append's processing of the tuple.
>
> I don't think there's any need to invent any new GUC. You could just
> divide cpu_tuple_cost by something.
>
> I did a quick benchmark on my laptop to see how much Append really
> costs, and with the standard costs the actual cost seems to be about
> cpu_tuple_cost / 2.4. So probably cpu_tuple_cost / 2 might be
> realistic. create_set_projection_path() does something similar and
> brincostestimate() does some similar magic and applies 0.1 *
> cpu_operator_cost to the total cost.
>
> [...]
>
> Maybe it's worth trying with different row counts to see if the
> additional cost is consistent, but it's probably not worth being too
> critical here.

I have tried exactly the same tests to get to this factor on my local developer machine. With parallelism enabled I got this number as 7.9. However, if I disable the parallelism (and I believe David too disabled that), I get this number as 1.8. Whereas for 1 rows, I get this number as 1.7.

-- With Gather
# select current_Setting('cpu_tuple_cost')::float8 /
  ((10633.56 * (81.035 / 72.450) - 10633.56) / 100);
7.9

-- Without Gather
# select current_Setting('cpu_tuple_cost')::float8 /
  ((16925.01 * (172.838 / 131.400) - 16925.01) / 100);
1.8

-- With 1 rows (so no Gather too)
# select current_Setting('cpu_tuple_cost')::float8 /
  ((170.01 * (1.919 / 1.424) - 170.01) / 1);
1.7

So it is not so straightforward to come up with the correct heuristic here. Thus using 50% of cpu_tuple_cost looks good to me here.

As suggested by Ashutosh and Robert, I have attached a separate small WIP patch for it.
I think it will be better if we take this topic to another mail-thread. Do you agree?

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ce32b8a4..c0bf602 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1742,6 +1742,33 @@ cost_sort(Path *path, PlannerInfo *root,
 }

 /*
+ * cost_append
+ *	  Determines and returns the cost of an Append node.
+ *
+ * Though Append doesn't do any selection or projection, it's not free. So we
+ * try to add cost per input tuple which is arbitrarily calculated as
+ * DEFAULT_APPEND_COST_FACTOR * cpu_tuple_cost.
+ *
+ * 'input_startup_cost' is the sum of the input streams' startup costs
+ * 'input_total_cost' is the sum of the input streams' total costs
[...]
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Oct 10, 2017 at 6:00 AM, Ashutosh Bapat wrote:
> This looks good to me. I think it should be a separate, yet very small patch.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Oct 10, 2017 at 1:31 PM, David Rowley wrote:
>
> I don't think there's any need to invent any new GUC. You could just
> divide cpu_tuple_cost by something.
>
> I did a quick benchmark on my laptop to see how much Append really
> costs, and with the standard costs the actual cost seems to be about
> cpu_tuple_cost / 2.4. So probably cpu_tuple_cost / 2 might be
> realistic. create_set_projection_path() does something similar and
> brincostestimate() does some similar magic and applies 0.1 *
> cpu_operator_cost to the total cost.
>
> # -- How does that compare to the cpu_tuple_cost?
> # select current_Setting('cpu_tuple_cost')::float8 / 0.00416630302337493743;
>     ?column?
> ----------------
>  2.400209476818
> (1 row)
>
> Maybe it's worth trying with different row counts to see if the
> additional cost is consistent, but it's probably not worth being too
> critical here.

This looks good to me. I think it should be a separate, yet very small patch.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] Partition-wise aggregation/grouping
On 10 October 2017 at 17:57, Ashutosh Bapat wrote:
> Append node just returns the result of ExecProcNode(). Charging
> cpu_tuple_cost may make it too expensive. In other places where we
> charge cpu_tuple_cost there's some processing done to the tuple like
> ExecStoreTuple() in SeqNext(). May be we need some other measure for
> Append's processing of the tuple.

I don't think there's any need to invent any new GUC. You could just divide cpu_tuple_cost by something.

I did a quick benchmark on my laptop to see how much Append really costs, and with the standard costs the actual cost seems to be about cpu_tuple_cost / 2.4. So probably cpu_tuple_cost / 2 might be realistic. create_set_projection_path() does something similar and brincostestimate() does some similar magic and applies 0.1 * cpu_operator_cost to the total cost.

# create table p (a int, b int);
# create table p1 () inherits (p);
# insert into p1 select generate_series(1,100);
# vacuum analyze p1;
# \q
$ echo "select count(*) from p1;" > p1.sql
$ echo "select count(*) from p;" > p.sql
$ pgbench -T 60 -f p1.sql -n
latency average = 58.567 ms
$ pgbench -T 60 -f p.sql -n
latency average = 72.984 ms

$ psql
psql (11devel)
Type "help" for help.

# -- check the cost of the plan.
# explain select count(*) from p1;
                          QUERY PLAN
--------------------------------------------------------------
 Aggregate  (cost=16925.00..16925.01 rows=1 width=8)
   ->  Seq Scan on p1  (cost=0.00..14425.00 rows=100 width=0)
(2 rows)

# -- selecting from the parent is the same due to zero Append cost.
# explain select count(*) from p;
                           QUERY PLAN
-----------------------------------------------------------------
 Aggregate  (cost=16925.00..16925.01 rows=1 width=8)
   ->  Append  (cost=0.00..14425.00 rows=101 width=0)
         ->  Seq Scan on p  (cost=0.00..0.00 rows=1 width=0)
         ->  Seq Scan on p1  (cost=0.00..14425.00 rows=100 width=0)
(4 rows)

# -- extrapolate the additional time taken for the Append scan and
# -- work out what the planner should add to the plan's cost, then
# -- divide by the number of rows in p1 to work out the tuple cost
# -- of pulling a row through the append.
# select (16925.01 * (72.984 / 58.567) - 16925.01) / 100;
        ?column?
------------------------
 0.00416630302337493743
(1 row)

# show cpu_tuple_cost;
 cpu_tuple_cost
----------------
 0.01
(1 row)

# -- How does that compare to the cpu_tuple_cost?
# select current_Setting('cpu_tuple_cost')::float8 / 0.00416630302337493743;
    ?column?
----------------
 2.400209476818
(1 row)

Maybe it's worth trying with different row counts to see if the additional cost is consistent, but it's probably not worth being too critical here.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Oct 10, 2017 at 10:27 AM, Ashutosh Bapat wrote:
> On Tue, Oct 10, 2017 at 3:15 AM, David Rowley wrote:
> > On 10 October 2017 at 01:10, Jeevan Chalke wrote:
> >> Attached new patch set having HEAD at 84ad4b0 with all these review points
> >> fixed. Let me know if I missed any, thanks.
> >
> > I've only really skimmed over this thread and only opened the code
> > enough to extract the following:
> >
> > + /* Multiply the costs by partition_wise_agg_cost_factor. */
> > + apath->startup_cost *= partition_wise_agg_cost_factor;
> > + apath->total_cost *= partition_wise_agg_cost_factor;
> >
> > I've not studied how all the path plumbing is done, but I think
> > instead of doing this costing magic we should really stop pretending
> > that Append/MergeAppend nodes are cost-free. I think something like
> > charging cpu_tuple_cost per row expected through Append/MergeAppend
> > would be a better approach to this.
> >
> > [...]
>
> +1.

Yes. Ashutosh and I had discussed this off-list: we will need to cost the Append node too, as having an extra GUC to control this is not a good idea per se. Thanks for your vote too.

I will try doing this and will see how the plans go with it.

> Partition-wise join will also benefit from costing Append
> processing. Number of rows * width of join result compared with the
> sum of that measure for joining relations decides whether Append node
> processes more data in Append->Join case than Join->Append case.
>
> Append node just returns the result of ExecProcNode(). Charging
> cpu_tuple_cost may make it too expensive. In other places where we
> charge cpu_tuple_cost there's some processing done to the tuple like
> ExecStoreTuple() in SeqNext(). May be we need some other measure for
> Append's processing of the tuple.
>
> May be we should try to measure the actual time spent in Append node
> as a fraction of say time spent in child seq scans. That might give us
> a clue as to how Append processing can be charged in terms of costing.

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Oct 10, 2017 at 3:15 AM, David Rowley wrote:
> On 10 October 2017 at 01:10, Jeevan Chalke wrote:
>> Attached new patch set having HEAD at 84ad4b0 with all these review points
>> fixed. Let me know if I missed any, thanks.
>
> I've only really skimmed over this thread and only opened the code
> enough to extract the following:
>
> + /* Multiply the costs by partition_wise_agg_cost_factor. */
> + apath->startup_cost *= partition_wise_agg_cost_factor;
> + apath->total_cost *= partition_wise_agg_cost_factor;
>
> I've not studied how all the path plumbing is done, but I think
> instead of doing this costing magic we should really stop pretending
> that Append/MergeAppend nodes are cost-free. I think something like
> charging cpu_tuple_cost per row expected through Append/MergeAppend
> would be a better approach to this.
>
> If you perform grouping or partial grouping before the Append, then in
> most cases the Append will receive fewer rows, so it comes out cheaper
> than if you perform the grouping after it. I've not learned the
> partition-wise join code enough to know if this is going to affect
> that too, but for everything else there should be no plan change,
> since there are normally no alternative paths. I see there's even a
> comment in create_append_path() which claims the zero cost is a bit
> optimistic.

+1. Partition-wise join will also benefit from costing Append processing. The number of rows * width of the join result, compared with the sum of that measure for the joining relations, decides whether the Append node processes more data in the Append->Join case than in the Join->Append case.

Append node just returns the result of ExecProcNode(). Charging cpu_tuple_cost may make it too expensive. In other places where we charge cpu_tuple_cost there's some processing done to the tuple, like ExecStoreTuple() in SeqNext(). Maybe we need some other measure for Append's processing of the tuple.

Maybe we should try to measure the actual time spent in the Append node as a fraction of, say, the time spent in child seq scans. That might give us a clue as to how Append processing can be charged in terms of costing.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] Partition-wise aggregation/grouping
On 10 October 2017 at 01:10, Jeevan Chalke wrote:
> Attached new patch set having HEAD at 84ad4b0 with all these review points
> fixed. Let me know if I missed any, thanks.

I've only really skimmed over this thread and only opened the code enough to extract the following:

+ /* Multiply the costs by partition_wise_agg_cost_factor. */
+ apath->startup_cost *= partition_wise_agg_cost_factor;
+ apath->total_cost *= partition_wise_agg_cost_factor;

I've not studied how all the path plumbing is done, but I think instead of doing this costing magic we should really stop pretending that Append/MergeAppend nodes are cost-free. I think something like charging cpu_tuple_cost per row expected through Append/MergeAppend would be a better approach to this.

If you perform grouping or partial grouping before the Append, then in most cases the Append will receive fewer rows, so it comes out cheaper than if you perform the grouping after it. I've not learned the partition-wise join code enough to know if this is going to affect that too, but for everything else there should be no plan change, since there are normally no alternative paths. I see there's even a comment in create_append_path() which claims the zero cost is a bit optimistic.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Partition-wise aggregation/grouping
On Thu, Sep 28, 2017 at 3:12 PM, Ashutosh Bapat wrote:
>
> Here are comments on 0004 from the last patch set. But most of the comments
> still apply.

Thank you, Ashutosh, for reviewing.

> Patch 0001 adds functions create_hash_agg_path() and create_sort_agg_path().
> Patch 0004 adds a new argument to those functions for conditions in the HAVING
> clause. We should move those changes to 0001 and pass parse->havingQual to
> these functions in 0001 itself. That will keep all changes to those functions
> together and also make 0003 small.

Done.

> The prologue of try_partition_wise_grouping() mentions a restriction of
> partition keys being leading group by clauses. This restriction is not
> specified in the prologue of have_grouping_by_partkey(), which actually checks
> for this restriction. The requirement per the prologue of that function is just
> to have partition keys in the group clause. I think have_grouping_by_partkey()
> is correct, and we should change the prologue of try_partition_wise_grouping()
> to be in sync with have_grouping_by_partkey().

Done.

> The prologue explains why
> partition-wise aggregation/grouping would be efficient with this restriction,
> but it doesn't explain why partial aggregation/grouping per partition would be
> efficient. May be it should do that as well. Similar is the case with partial
> aggregation/grouping discussion in README.

I have tried updating it. Please check.

> +/* Do not handle grouping sets for now. */
> Is this a real restriction or just a restriction for the first cut of this
> feature? Can you please add more explanation? May be update README as well?

A grouping-sets plan does not work with an inheritance subtree (see the notes in create_groupingsets_plan). Thus grouping sets are not handled here.

> +grouped_rel->part_scheme = input_rel->part_scheme;
> Is this true even for partial aggregates? I think not. Since the group by
> clause does not contain partition keys, the rows from multiple partitions
> participate in one group and thus the partition keys of the input relation do
> not apply to the grouped relation. In this case, it seems that the grouped rel
> will have part_rels but will not be partitioned.

I have removed this, as your analysis is correct: the grouped_rel is not partitioned.

> +/*
> + * If there is no path for the child relation, then we cannot add
> + * aggregate path too.
> + */
> +if (input_child_rel->pathlist == NIL)
> +return;
> When can this happen? I think, similar to partition-wise join, it happens when
> there is a dummy parent relation. See [1]. If that's the case, you may want to
> do things similar to what partition-wise join is doing. If there's some other
> reason for doing this, returning from here half-way is actually a waste of
> memory and planning time. Instead, we may want to loop over the part_rels to
> find if any of those have an empty pathlist and return from there before doing
> any actual work.

This is a kind of can't-happen scenario, so I have converted it to an Assert(). And yes, I am marking the grouped_rel as a dummy rel when the input rel is dummy.

> +extra.pathTarget = child_target;
> +extra.inputRows = input_child_rel->cheapest_startup_path->rows;
> +extra.havingQual = (Node *) adjust_appendrel_attrs(root,
> +                                                   (Node *) query->havingQual,
> +                                                   nappinfos,
> +                                                   appinfos);
> These lines are updating some fields of the "extra" structure in every loop
> iteration. The structure is passed to create_child_grouping_paths() in the
> loop and to add_paths_to_append_rel() outside the loop. Thus
> add_paths_to_append_rel() only gets some member values for the last child.
> Is that right?

No. The patch does update those fields before calling add_paths_to_append_rel().

> Should we split
> extra into two structures, one to be used within the loop and one outside? Or
> may be send the members being updated within the loop separately?

I don't see any point in splitting it. We need almost all the fields at child path creation as well as at the finalization step. The patch is basically just re-using the struct variable.

> +/*
> + * Forcibly apply scan/join target to all the Paths for the scan/join
> + * rel.
> + *
> [ lines clipped ]
> +if (subpath == input_child_rel->cheapest_total_path)
> +input_child_rel->cheapest_total_path = path;
> +}
> +}
> This code seems to be copied from grouping_planner() almost verbatim. Is there
> a way we can refactor it into a function and use it in both the places?

Done. Moved this into 0003-Refactor-code-applying-scanjoin-target-to-paths-into.patch.

> have_grouping_by_partkey() may use match_expr_to_partition_keys() to find
> [...]
Re: [HACKERS] Partition-wise aggregation/grouping
On Wed, Sep 27, 2017 at 3:42 PM, Jeevan Chalke wrote:
> Thanks Ashutosh for reviewing.
>
> Attached new patch-set with following changes:
>
> 1. Removed earlier 0007 and 0008 patches which were a PoC for supporting
> partial aggregation over FDWs. I removed them as it will be a different
> issue altogether and hence I will tackle it separately once this is done.
>
> This patch-set now includes support for parallel plans within partitions.
>
> Notes:
> HEAD: 59597e6
> Partition-wise Join Version: 34
>
> (The first six patches, 0001 - 0006, remain the same functionality-wise.)
> 0007 - Refactors partial grouping paths creation into a separate function.
> 0008 - Enables parallelism within the partition-wise aggregation.
>
> This patch also includes a fix for the crash reported by Rajkumar.
> While forcibly applying the scan/join target to all the Paths for the
> scan/join rel, earlier I was using apply_projection_to_path(), which modifies
> the path in place, causing this crash, as the path finally chosen had been
> updated by partition-wise agg path creation. Now I have used
> create_projection_path() like we do in partial aggregation paths.
>
> Also, fixed issues reported by Ashutosh.

Thanks. Here are comments on 0004 from the last patch set. But most of the comments still apply.

Patch 0001 adds functions create_hash_agg_path() and create_sort_agg_path(). Patch 0004 adds a new argument to those functions for conditions in the HAVING clause. We should move those changes to 0001 and pass parse->havingQual to these functions in 0001 itself. That will keep all changes to those functions together and also make 0003 small.

The prologue of try_partition_wise_grouping() mentions a restriction of partition keys being leading group by clauses. This restriction is not specified in the prologue of have_grouping_by_partkey(), which actually checks for this restriction. The requirement per the prologue of that function is just to have the partition keys in the group clause. I think have_grouping_by_partkey() is correct, and we should change the prologue of try_partition_wise_grouping() to be in sync with have_grouping_by_partkey(). The prologue explains why partition-wise aggregation/grouping would be efficient with this restriction, but it doesn't explain why partial aggregation/grouping per partition would be efficient. Maybe it should do that as well. Similar is the case with the partial aggregation/grouping discussion in the README.

+/* Do not handle grouping sets for now. */
Is this a real restriction or just a restriction for the first cut of this feature? Can you please add more explanation? Maybe update the README as well?

+grouped_rel->part_scheme = input_rel->part_scheme;
Is this true even for partial aggregates? I think not. Since the group by clause does not contain partition keys, the rows from multiple partitions participate in one group and thus the partition keys of the input relation do not apply to the grouped relation. In this case, it seems that the grouped rel will have part_rels but will not be partitioned.

+/*
+ * If there is no path for the child relation, then we cannot add
+ * aggregate path too.
+ */
+if (input_child_rel->pathlist == NIL)
+return;
When can this happen? I think, similar to partition-wise join, it happens when there is a dummy parent relation. See [1]. If that's the case, you may want to do things similar to what partition-wise join is doing. If there's some other reason for doing this, returning from here half-way is actually a waste of memory and planning time. Instead, we may want to loop over the part_rels to find if any of those have an empty pathlist and return from there before doing any actual work.

+extra.pathTarget = child_target;
+extra.inputRows = input_child_rel->cheapest_startup_path->rows;
+extra.havingQual = (Node *) adjust_appendrel_attrs(root,
+                                                   (Node *) query->havingQual,
+                                                   nappinfos,
+                                                   appinfos);
These lines are updating some fields of the "extra" structure in every loop iteration. The structure is passed to create_child_grouping_paths() in the loop and to add_paths_to_append_rel() outside the loop. Thus add_paths_to_append_rel() only gets some member values for the last child. Is that right? Should we split extra into two structures, one to be used within the loop and one outside? Or maybe send the members being updated within the loop separately?

+/*
+ * Forcibly apply scan/join target to all the Paths for the scan/join
+ * rel.
+ *
[ lines clipped ]
+if (subpath == input_child_rel->cheapest_total_path)
+input_child_rel->cheapest_total_path = path;
+}
+}
This code seems to be copied from grouping_planner() almost verbatim. Is there a way we can refactor it into a function and use it in both places?
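To make the have_grouping_by_partkey() requirement discussed above concrete, here is a simplified sketch of the order-insensitive check later settled on in this thread: full partition-wise aggregation needs every partition key expression to appear somewhere in the GROUP BY expressions, in any order. The function name is mine and details such as expression collations are ignored; the actual patch may differ:

    /*
     * Sketch: true if each partition key expression of input_rel appears
     * among the GROUP BY expressions, making full per-partition
     * aggregation valid; otherwise only partial aggregation is possible.
     */
    static bool
    group_by_covers_partkeys(RelOptInfo *input_rel, List *groupexprs)
    {
        int     cnt;

        for (cnt = 0; cnt < input_rel->part_scheme->partnatts; cnt++)
        {
            ListCell   *lc;
            bool        found = false;

            foreach(lc, input_rel->partexprs[cnt])
            {
                if (list_member(groupexprs, lfirst(lc)))
                {
                    found = true;
                    break;
                }
            }

            if (!found)
                return false;   /* fall back to partial aggregation */
        }
        return true;
    }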
Re: [HACKERS] Partition-wise aggregation/grouping
Thanks Ashutosh for reviewing.

Attached new patch-set with following changes:

1. Removed earlier 0007 and 0008 patches which were a PoC for supporting partial aggregation over FDWs. I removed them as it will be a different issue altogether and hence I will tackle it separately once this is done.

This patch-set now includes support for parallel plans within partitions.

Notes:
HEAD: 59597e6
Partition-wise Join Version: 34

(The first six patches, 0001 - 0006, remain the same functionality-wise.)
0007 - Refactors partial grouping paths creation into a separate function.
0008 - Enables parallelism within the partition-wise aggregation.

This patch also includes a fix for the crash reported by Rajkumar. While forcibly applying the scan/join target to all the Paths for the scan/join rel, earlier I was using apply_projection_to_path(), which modifies the path in place; that was causing this crash, since the path finally chosen had been updated by partition-wise agg path creation. Now I have used create_projection_path() like we do in partial aggregation paths.

Also, fixed issues reported by Ashutosh.

On Tue, Sep 26, 2017 at 6:16 PM, Ashutosh Bapat wrote:
> Hi Jeevan,
> I have started reviewing these patches.
>
> 0001 looks fine. There might be some changes that will be needed, but
> those will be clear when I review the patch that uses this
> refactoring.
>
> 0002
> + *
> + * If targetlist is provided, we use it else use targetlist from the root.
>    */
>   static double
>   get_number_of_groups(PlannerInfo *root,
>                        double path_rows,
> -                      grouping_sets_data *gd)
> +                      grouping_sets_data *gd,
> +                      List *tlist)
>   {
>       Query      *parse = root->parse;
>       double      dNumGroups;
> +     List       *targetList = (tlist == NIL) ? parse->targetList : tlist;
>
> May be we should just pass the targetlist always. Instead of passing NIL,
> pass parse->targetList directly. That would save us one conditional
> assignment. May be passing NIL is required for the patches that use
> this refactoring, but that's not clear as-is in this patch.

Done in attached patch-set.

> 0003
> In the documentation of enable_partition_wise_aggregate, we should
> probably explain why the default is off or, like the partition_wise_join
> GUC, explain the consequences of turning it off.

I have updated this. Please have a look.

> I doubt if we could
> accept something like partition_wise_agg_cost_factor. But we can
> discuss this at a later stage. Meantime it may be worthwhile to fix
> the reason why we would require this GUC. If the regular aggregation
> has a cost lesser than partition-wise aggregation in most of the cases,
> then probably we need to fix the cost model.

Yep. I will have a look meanwhile.

> I will continue reviewing rest of the patches.

Thanks

--
Jeevan Chalke
Technical Architect, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

[Attachment: partition-wise-agg-v4.tar.gz (GNU Zip compressed data)]
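For readers following the crash fix above: the relevant distinction is that apply_projection_to_path() can modify the given path in place, while create_projection_path() wraps it in a new ProjectionPath, leaving paths captured earlier (here, by partition-wise aggregate path creation) intact. A minimal sketch of the safer pattern, assuming a grouping_planner()-like context:

    /*
     * Sketch (assumed context, not the actual patch hunk): project the
     * scan/join target without mutating the original subpath, so paths
     * already referenced elsewhere stay valid.
     */
    static Path *
    project_path_safely(PlannerInfo *root, RelOptInfo *rel,
                        Path *subpath, PathTarget *target)
    {
        /*
         * apply_projection_to_path() would modify subpath in place;
         * create_projection_path() builds a separate path on top of it.
         */
        return (Path *) create_projection_path(root, rel, subpath, target);
    }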
Re: [HACKERS] Partition-wise aggregation/grouping
Hi Jeevan,
I have started reviewing these patches.

0001 looks fine. There might be some changes that will be needed, but those will be clear when I review the patch that uses this refactoring.

0002
+ *
+ * If targetlist is provided, we use it else use targetlist from the root.
   */
  static double
  get_number_of_groups(PlannerInfo *root,
                       double path_rows,
-                      grouping_sets_data *gd)
+                      grouping_sets_data *gd,
+                      List *tlist)
  {
      Query      *parse = root->parse;
      double      dNumGroups;
+     List       *targetList = (tlist == NIL) ? parse->targetList : tlist;

May be we should just pass the targetlist always. Instead of passing NIL, pass parse->targetList directly. That would save us one conditional assignment. May be passing NIL is required for the patches that use this refactoring, but that's not clear as-is in this patch.

0003
In the documentation of enable_partition_wise_aggregate, we should probably explain why the default is off or, like the partition_wise_join GUC, explain the consequences of turning it off. I doubt if we could accept something like partition_wise_agg_cost_factor. But we can discuss this at a later stage. Meantime it may be worthwhile to fix the reason why we would require this GUC. If the regular aggregation has a cost lesser than partition-wise aggregation in most of the cases, then probably we need to fix the cost model.

I will continue reviewing rest of the patches.

On Mon, Sep 18, 2017 at 12:37 PM, Jeevan Chalke wrote:
> [...]
>
> Logic for checking whether partition keys lead group by keys needs to be
> updated here. The group by expressions can appear in any order without
> affecting the final result. And thus, requiring the partition keys to be
> the leading group by keys for full aggregation is not necessary. Instead
> we must ensure that the partition keys are part of the group by keys to
> compute full aggregation on a partition.
>
> Attached, revised patch-set with above fix.
>
> [...]

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] Partition-wise aggregation/grouping
On Thu, Sep 21, 2017 at 5:00 PM, Rajkumar Raghuwanshi < rajkumar.raghuwan...@enterprisedb.com> wrote: > Hi Jeevan, > > while testing the latest v3 patches, I am getting a server crash if I reset > partition_wise_agg_cost_factor, please take a look. > > CREATE TABLE lp (a TEXT, b FLOAT, c INT) PARTITION BY LIST(c); > CREATE TABLE lp1 PARTITION OF lp FOR VALUES IN (10,20); > CREATE TABLE lp2 PARTITION OF lp FOR VALUES IN (30,40); > > INSERT INTO lp VALUES ('a1',800, 20); > INSERT INTO lp VALUES ('a2',1250,30); > INSERT INTO lp VALUES ('a3',2975,20); > INSERT INTO lp VALUES ('a3',2850,30); > > postgres=# SET enable_partition_wise_agg TO true; > SET > postgres=# SET partition_wise_agg_cost_factor TO 0.5; > SET > postgres=# > postgres=# SELECT MAX(b), AVG(b) FROM lp GROUP BY a HAVING a = 'a3' ORDER > BY 1,2; > max | avg > ------+-------- > 2975 | 2912.5 > (1 row) > > postgres=# RESET partition_wise_agg_cost_factor; > RESET > postgres=# > postgres=# SELECT MAX(b), AVG(b) FROM lp GROUP BY a HAVING a = 'a3' ORDER > BY 1,2; > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. Attempting reset: Failed. > !> > > Oops. Will fix this. I have added these tests to the regression test-case, and there the test-case works as expected. However, running those steps in psql reproduces the crash (though not consistently). Looking into it. Thanks for reporting. > Thanks & Regards, > Rajkumar Raghuwanshi > QMG, EnterpriseDB Corporation > -- Jeevan Chalke Principal Software Engineer, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise aggregation/grouping
On Mon, Sep 18, 2017 at 12:37 PM, Jeevan Chalke < jeevan.cha...@enterprisedb.com> wrote: > > On Tue, Sep 12, 2017 at 6:21 PM, Jeevan Chalke < > jeevan.cha...@enterprisedb.com> wrote: > >> >> >> On Tue, Sep 12, 2017 at 3:24 PM, Rajkumar Raghuwanshi < >> rajkumar.raghuwan...@enterprisedb.com> wrote: >> >>> >>> Hi Jeevan, >>> >>> I have started testing partition-wise aggregation and have one observation; >>> please take a look. With the v2 patch, if I change the target list order, >>> the query does not pick full partition-wise aggregation. >>> >> >> Thanks Rajkumar for reporting this. >> >> I am looking into this issue and will post an updated patch with the fix. >> > > The logic for checking whether the partition keys lead the group-by keys > needed updating here. The group-by expressions can appear in any order > without affecting the final result, so requiring the partition keys to be > the leading group-by keys is not necessary for full aggregation. Instead, > we must ensure that the partition keys are a subset of the group-by keys > to compute full aggregation on a partition. > > Attached is the revised patch-set with the above fix. > > Also, in the test-cases, I have removed the DROP/ANALYZE commands on child > relations and removed VERBOSE from the EXPLAIN output. > > Notes: > HEAD: 8edacab209957520423770851351ab4013cb0167 > Partition-wise Join patch-set version: v32 > Hi Jeevan, while testing the latest v3 patches, I am getting a server crash if I reset partition_wise_agg_cost_factor, please take a look. CREATE TABLE lp (a TEXT, b FLOAT, c INT) PARTITION BY LIST(c); CREATE TABLE lp1 PARTITION OF lp FOR VALUES IN (10,20); CREATE TABLE lp2 PARTITION OF lp FOR VALUES IN (30,40); INSERT INTO lp VALUES ('a1',800, 20); INSERT INTO lp VALUES ('a2',1250,30); INSERT INTO lp VALUES ('a3',2975,20); INSERT INTO lp VALUES ('a3',2850,30); postgres=# SET enable_partition_wise_agg TO true; SET postgres=# SET partition_wise_agg_cost_factor TO 0.5; SET postgres=# postgres=# SELECT MAX(b), AVG(b) FROM lp GROUP BY a HAVING a = 'a3' ORDER BY 1,2; max | avg ------+-------- 2975 | 2912.5 (1 row) postgres=# RESET partition_wise_agg_cost_factor; RESET postgres=# postgres=# SELECT MAX(b), AVG(b) FROM lp GROUP BY a HAVING a = 'a3' ORDER BY 1,2; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !> Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB Corporation
Re: [HACKERS] Partition-wise aggregation/grouping
On Mon, Sep 18, 2017 at 12:37 PM, Jeevan Chalke < jeevan.cha...@enterprisedb.com> wrote: > > > On Tue, Sep 12, 2017 at 6:21 PM, Jeevan Chalke < > jeevan.cha...@enterprisedb.com> wrote: > >> >> >> On Tue, Sep 12, 2017 at 3:24 PM, Rajkumar Raghuwanshi < >> rajkumar.raghuwan...@enterprisedb.com> wrote: >> >>> >>> Hi Jeevan, >>> >>> I have started testing partition-wise aggregation and have one observation; >>> please take a look. With the v2 patch, if I change the target list order, >>> the query does not pick full partition-wise aggregation. >>> >> >> Thanks Rajkumar for reporting this. >> >> I am looking into this issue and will post an updated patch with the fix. >> > > The logic for checking whether the partition keys lead the group-by keys > needed updating here. The group-by expressions can appear in any order > without affecting the final result, so requiring the partition keys to be > the leading group-by keys is not necessary for full aggregation. Instead, > we must ensure that the partition keys are a subset of the group-by keys > to compute full aggregation on a partition. > > Attached is the revised patch-set with the above fix. > > Also, in the test-cases, I have removed the DROP/ANALYZE commands on child > relations and removed VERBOSE from the EXPLAIN output. > > Notes: > HEAD: 8edacab209957520423770851351ab4013cb0167 > Partition-wise Join patch-set version: v32 > Thanks for the patch. I have tested it and the issue is now fixed.
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Sep 12, 2017 at 6:21 PM, Jeevan Chalke < jeevan.cha...@enterprisedb.com> wrote: > > > On Tue, Sep 12, 2017 at 3:24 PM, Rajkumar Raghuwanshi < > rajkumar.raghuwan...@enterprisedb.com> wrote: > >> >> Hi Jeevan, >> >> I have started testing partition-wise aggregation and have one observation; >> please take a look. With the v2 patch, if I change the target list order, >> the query does not pick full partition-wise aggregation. >> > > Thanks Rajkumar for reporting this. > > I am looking into this issue and will post an updated patch with the fix. > The logic for checking whether the partition keys lead the group-by keys needed updating here. The group-by expressions can appear in any order without affecting the final result, so requiring the partition keys to be the leading group-by keys is not necessary for full aggregation. Instead, we must ensure that the partition keys are a subset of the group-by keys to compute full aggregation on a partition. Attached is the revised patch-set with the above fix. Also, in the test-cases, I have removed the DROP/ANALYZE commands on child relations and removed VERBOSE from the EXPLAIN output. Notes: HEAD: 8edacab209957520423770851351ab4013cb0167 Partition-wise Join patch-set version: v32 Thanks -- Jeevan Chalke Principal Software Engineer, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company Attachment: partition-wise-agg-v3.tar.gz
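To illustrate the fix, here is a sketch using the pagg_tab table from Rajkumar's report below. With the change, both orderings should get full partition-wise aggregation, since a, the partition key, is among the group-by keys either way:

EXPLAIN (COSTS OFF) SELECT a, b, count(*) FROM pagg_tab GROUP BY a, b;
EXPLAIN (COSTS OFF) SELECT b, a, count(*) FROM pagg_tab GROUP BY b, a;
-- Both plans should show per-partition HashAggregate nodes directly under
-- Append, with no Partial/Finalize split.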
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Sep 12, 2017 at 3:24 PM, Rajkumar Raghuwanshi < rajkumar.raghuwan...@enterprisedb.com> wrote: > > On Fri, Sep 8, 2017 at 5:47 PM, Jeevan Chalke < jeevan.cha...@enterprisedb.com> wrote: > >> Here is the new patch-set, re-based on HEAD (f0a0c17) and the >> latest partition-wise join (v29) patches. >> > > Hi Jeevan, > > I have started testing partition-wise aggregation and have one observation; > please take a look. With the v2 patch, if I change the target list order, > the query does not pick full partition-wise aggregation. > Thanks Rajkumar for reporting this. I am looking into this issue and will post an updated patch with the fix. > Thanks & Regards, > Rajkumar Raghuwanshi > QMG, EnterpriseDB Corporation -- Jeevan Chalke Principal Software Engineer, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company
Re: [HACKERS] Partition-wise aggregation/grouping
On Fri, Sep 8, 2017 at 5:47 PM, Jeevan Chalke < jeevan.cha...@enterprisedb.com> wrote: > Here is the new patch-set, re-based on HEAD (f0a0c17) and the > latest partition-wise join (v29) patches. > Hi Jeevan, I have started testing partition-wise aggregation and have one observation; please take a look. With the v2 patch, if I change the target list order, the query does not pick full partition-wise aggregation. SET enable_partition_wise_agg TO true; SET partition_wise_agg_cost_factor TO 0.5; SET enable_partition_wise_join TO true; SET max_parallel_workers_per_gather TO 0; CREATE TABLE pagg_tab (a int, b int, c int) PARTITION BY RANGE(a); CREATE TABLE pagg_tab_p1 PARTITION OF pagg_tab FOR VALUES FROM (0) TO (10); CREATE TABLE pagg_tab_p2 PARTITION OF pagg_tab FOR VALUES FROM (10) TO (20); CREATE TABLE pagg_tab_p3 PARTITION OF pagg_tab FOR VALUES FROM (20) TO (30); INSERT INTO pagg_tab SELECT i % 30, i % 30, i % 50 FROM generate_series(0, 299) i; ANALYZE pagg_tab; postgres=# explain (verbose, costs off) select a,b,count(*) from pagg_tab group by a,b order by 1,2; QUERY PLAN -- Sort Output: pagg_tab_p1.a, pagg_tab_p1.b, (count(*)) Sort Key: pagg_tab_p1.a, pagg_tab_p1.b -> Append -> HashAggregate Output: pagg_tab_p1.a, pagg_tab_p1.b, count(*) Group Key: pagg_tab_p1.a, pagg_tab_p1.b -> Seq Scan on public.pagg_tab_p1 Output: pagg_tab_p1.a, pagg_tab_p1.b -> HashAggregate Output: pagg_tab_p2.a, pagg_tab_p2.b, count(*) Group Key: pagg_tab_p2.a, pagg_tab_p2.b -> Seq Scan on public.pagg_tab_p2 Output: pagg_tab_p2.a, pagg_tab_p2.b -> HashAggregate Output: pagg_tab_p3.a, pagg_tab_p3.b, count(*) Group Key: pagg_tab_p3.a, pagg_tab_p3.b -> Seq Scan on public.pagg_tab_p3 Output: pagg_tab_p3.a, pagg_tab_p3.b (19 rows) -- changing the target list order picks the partial partition-wise aggregation path postgres=# explain (verbose, costs off) select b,a,count(*) from pagg_tab group by a,b order by 1,2; QUERY PLAN -- Finalize GroupAggregate Output: pagg_tab_p1.b, pagg_tab_p1.a, count(*) Group Key: pagg_tab_p1.b, pagg_tab_p1.a -> Sort Output: pagg_tab_p1.b, pagg_tab_p1.a, (PARTIAL count(*)) Sort Key: pagg_tab_p1.b, pagg_tab_p1.a -> Append -> Partial HashAggregate Output: pagg_tab_p1.b, pagg_tab_p1.a, PARTIAL count(*) Group Key: pagg_tab_p1.b, pagg_tab_p1.a -> Seq Scan on public.pagg_tab_p1 Output: pagg_tab_p1.b, pagg_tab_p1.a -> Partial HashAggregate Output: pagg_tab_p2.b, pagg_tab_p2.a, PARTIAL count(*) Group Key: pagg_tab_p2.b, pagg_tab_p2.a -> Seq Scan on public.pagg_tab_p2 Output: pagg_tab_p2.b, pagg_tab_p2.a -> Partial HashAggregate Output: pagg_tab_p3.b, pagg_tab_p3.a, PARTIAL count(*) Group Key: pagg_tab_p3.b, pagg_tab_p3.a -> Seq Scan on public.pagg_tab_p3 Output: pagg_tab_p3.b, pagg_tab_p3.a (22 rows) Thanks & Regards, Rajkumar Raghuwanshi QMG, EnterpriseDB Corporation
Re: [HACKERS] Partition-wise aggregation/grouping
On Wed, Aug 23, 2017 at 4:43 PM, Jeevan Chalke < jeevan.cha...@enterprisedb.com> wrote: > Hi, > > Attached is the patch to implement partition-wise aggregation/grouping. > > As explained earlier, we produce a full aggregation for each partition when > the partition keys are the leading group-by clauses, and then an append is > performed. Otherwise we do a partial aggregation on each partition, append > the results, and then add a finalization step on top. > > I have observed that the cost estimated for partition-wise aggregation and > the cost for the plans without partition-wise aggregation are almost the > same. However, execution time shows significant improvement (as explained > in my very first email) with partition-wise aggregates. The planner chooses > a plan according to the costs, and thus most of the time the plan without > partition-wise aggregation is chosen. Hence, to force partition-wise plans > and for the regression runs, I have added a GUC named > partition_wise_agg_cost_factor to adjust the costings. > > This feature is only used when the enable_partition_wise_agg GUC is set to on. > > Here are the details of the patches in the patch-set: > Here is the new patch-set, re-based on HEAD (f0a0c17) and the latest partition-wise join (v29) patches. > > 0001 - Refactors sort and hash final grouping paths into separate > functions. > Since partition-wise aggregation also builds paths the same way as > create_grouping_paths(), path creation for sort and hash agg is separated > into its own functions. These functions are later used by the main > partition-wise aggregation/grouping patch. > > 0002 - Passes the targetlist to get_number_of_groups(). > We need to estimate groups for individual child relations and thus need to > pass the targetlist corresponding to the child rel. > > 0003 - Adds the enable_partition_wise_agg and partition_wise_agg_cost_factor > GUCs. > > 0004 - Implements partition-wise aggregation. > > 0005 - Adds test-cases. > > 0006 - postgres_fdw changes which enable pushing aggregation for other > upper relations. > 0007 - Provides infrastructure to allow partial aggregation. This will allow us to push the partial aggregation over FDW. With this, one can write SUM(PARTIAL x) to get a partial sum result. Since PARTIAL is used in the syntax, I had to move it to a reserved keyword category. This is a PoC patch and needs input on the approach and the way it is implemented. 0008 - Teaches postgres_fdw to push partial aggregation. With this, we can push the aggregate to the remote server even when the GROUP BY key does not match the partition key. > > > Since this patch is highly dependent on partition-wise join [1], one needs > to apply all those patches on HEAD (my repository head was at: > 66ed3829df959adb47f71d7c903ac59f0670f3e1) before applying these patches in > order. > > Suggestions / feedback / inputs ? > > [1] https://www.postgresql.org/message-id/CAFjFpRd9Vqh_=-Ldv- > XqWY006d07TJ+VXuhXCbdj=p1juky...@mail.gmail.com > > > -- Jeevan Chalke Principal Software Engineer, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company Attachment: partition-wise-agg-v2.tar.gz
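A sketch of how the 0007 PoC syntax would be used (hypothetical and not committed syntax; PARTIAL is a keyword introduced only by these PoC patches):

-- Returns the partial (transition) result of the sum rather than the
-- final value; a Finalize Aggregate above an Append would combine such
-- partial results from each partition or remote server.
SELECT sum(PARTIAL b) FROM pagg_tab;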
Re: [HACKERS] Partition-wise aggregation/grouping
Hi, Attached is the patch to implement partition-wise aggregation/grouping. As explained earlier, we produce a full aggregation for each partition when the partition keys are the leading group-by clauses, and then an append is performed. Otherwise we do a partial aggregation on each partition, append the results, and then add a finalization step on top. I have observed that the cost estimated for partition-wise aggregation and the cost for the plans without partition-wise aggregation are almost the same. However, execution time shows significant improvement (as explained in my very first email) with partition-wise aggregates. The planner chooses a plan according to the costs, and thus most of the time the plan without partition-wise aggregation is chosen. Hence, to force partition-wise plans and for the regression runs, I have added a GUC named partition_wise_agg_cost_factor to adjust the costings. This feature is only used when the enable_partition_wise_agg GUC is set to on. Here are the details of the patches in the patch-set: 0001 - Refactors sort and hash final grouping paths into separate functions. Since partition-wise aggregation also builds paths the same way as create_grouping_paths(), path creation for sort and hash agg is separated into its own functions. These functions are later used by the main partition-wise aggregation/grouping patch. 0002 - Passes the targetlist to get_number_of_groups(). We need to estimate groups for individual child relations and thus need to pass the targetlist corresponding to the child rel. 0003 - Adds the enable_partition_wise_agg and partition_wise_agg_cost_factor GUCs. 0004 - Implements partition-wise aggregation. 0005 - Adds test-cases. 0006 - postgres_fdw changes which enable pushing aggregation for other upper relations. Since this patch is highly dependent on partition-wise join [1], one needs to apply all those patches on HEAD (my repository head was at: 66ed3829df959adb47f71d7c903ac59f0670f3e1) before applying these patches in order. Suggestions / feedback / inputs ? [1] https://www.postgresql.org/message-id/CAFjFpRd9Vqh_=-Ldv-XqWY006d07TJ+VXuhXCbdj=p1juky...@mail.gmail.com On Tue, Mar 21, 2017 at 12:47 PM, Jeevan Chalke < jeevan.cha...@enterprisedb.com> wrote: > Hi all, > > Declarative partitioning is supported in PostgreSQL 10 and work is already > in progress to support partition-wise joins. Here is a proposal for > partition-wise aggregation/grouping. Our initial performance measurement > has shown a 7x improvement when partitions are on foreign servers and > approximately 15% when partitions are local. > > Partition-wise aggregation/grouping computes aggregates for each partition > separately. If the group clause contains the partition key, all the rows > belonging to a given group come from one partition, thus allowing > aggregates to be computed completely for each partition. Otherwise, partial > aggregates computed for each partition are combined across the partitions > to produce the final aggregates. This technique improves performance > because: > i. When partitions are located on a foreign server, we can push down the > aggregate to the foreign server. > ii. If the hash table for each partition fits in memory, but that for the > whole relation does not, each partition-wise aggregate can use an in-memory > hash table. > iii. Aggregation at the level of partitions can exploit properties of > partitions like indexes, their storage etc. > > Attached is an experimental patch for the same based on the partition-wise > join patches posted in [1].
> > This patch currently implements partition-wise aggregation when the group > clause contains the partitioning key. The query below, involving a > partitioned table with 3 partitions containing 1M rows each, producing 30 > groups in total, showed a 15% improvement over non-partition-wise > aggregation. The same query showed a 7x improvement when the partitions > were located on foreign servers. > > Here is the sample plan: > > postgres=# set enable_partition_wise_agg to true; > SET > postgres=# EXPLAIN ANALYZE SELECT a, count(*) FROM plt1 GROUP BY a; > QUERY PLAN > > -- > Append (cost=5100.00..61518.90 rows=30 width=12) (actual > time=324.837..944.804 rows=30 loops=1) > -> Foreign Scan (cost=5100.00..20506.30 rows=10 width=12) (actual > time=324.837..324.838 rows=10 loops=1) > Relations: Aggregate on (public.fplt1_p1 plt1) > -> Foreign Scan (cost=5100.00..20506.30 rows=10 width=12) (actual > time=309.954..309.956 rows=10 loops=1) > Relations: Aggregate on (public.fplt1_p2 plt1) > -> Foreign Scan (cost=5100.00..20506.30 rows=10 width=12) (actual > time=310.002..310.004 rows=10 loops=1) > Relations: Aggregate on (public.fplt1_p3 plt1) > Planning time: 0.370 ms > Execution time: 945.384 ms > (9 rows) > > postgres=# set enable_partition_wise_agg to
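For context, the two strategies from the proposal can be seen by toggling the group-by keys. A sketch, assuming plt1 is partitioned on column a as in the sample plan above and has some other column b (hypothetical here):

SET enable_partition_wise_agg TO true;
-- GROUP BY contains the partition key: full aggregation per partition.
EXPLAIN (COSTS OFF) SELECT a, count(*) FROM plt1 GROUP BY a;
-- GROUP BY does not contain it: Partial Aggregates under the Append,
-- combined by a Finalize Aggregate on top.
EXPLAIN (COSTS OFF) SELECT b, count(*) FROM plt1 GROUP BY b;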
Re: [HACKERS] Partition-wise aggregation/grouping
On Fri, Apr 28, 2017 at 3:03 AM, Antonin Houska wrote: > I think this is not generic enough because the result of the Append plan can > be joined to another relation. As such a join can duplicate the > already-aggregated values, the aggregates should not be finalized below the > top-level plan. If the grouping key matches the partition key, then it's correct to push the entire aggregate down, and there's probably a large performance advantage from avoiding aggregating twice. If the two don't match, then pushing the aggregate down necessarily involves a "partial" and a "finalize" stage, which may or may not be cheaper than doing the aggregation all at once. If you have lots of 2-row groups with 1 row in the first branch of the append and 1 row in the second branch of the append, breaking the aggregate into two steps is probably going to be a loser. If the overall number of groups is small, it's probably going to win. But when the grouping key matches the partition key, so that two-stage aggregation isn't required, I suspect the pushdown should almost always win. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
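A quick way to see both cost regimes Robert describes (a hypothetical demo; the table and data are made up, and the plan choice still depends on costing):

CREATE TABLE t (k int, g int) PARTITION BY RANGE (k);
CREATE TABLE t_p1 PARTITION OF t FOR VALUES FROM (0) TO (500000);
CREATE TABLE t_p2 PARTITION OF t FOR VALUES FROM (500000) TO (1000000);
-- Few large groups: partial aggregation shrinks each partition's output
-- to about 10 rows, so two-stage aggregation below the Append should win.
INSERT INTO t SELECT i, i % 10 FROM generate_series(0, 999999) i;
ANALYZE t;
EXPLAIN SELECT g, count(*) FROM t GROUP BY g;
-- Many tiny groups (reload with g = i % 500000, giving one row per
-- partition per group): partial aggregation barely reduces the data and
-- the finalize step pays most of the cost again.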
Re: [HACKERS] Partition-wise aggregation/grouping
Jeevan Chalke wrote: > On Thu, Apr 27, 2017 at 4:53 PM, Antonin Houska wrote: > > > Robert Haas wrote: > > > Well, I'm confused. I see that there's a relationship between what > > > Antonin is trying to do and what Jeevan is trying to do, but I can't > > > figure out whether one is a subset of the other, whether they're both > > > orthogonal, or something else. This plan looks similar to what I > > > would expect Jeevan's patch to produce, > > The point is that the patch Jeevan wanted to work on is actually a subset > > of [1] combined with [2]. > It seems so, as you are targeting every relation whether or not it is > partitioned. Yes. > With my patch, I am getting the following plan, where we push the entire > aggregation below the Append. > > QUERY PLAN > -- > Append > -> HashAggregate > Group Key: b_1.j > -> Hash Join > Hash Cond: (b_1.j = c_1.k) > -> Seq Scan on b_1 > -> Hash > -> Seq Scan on c_1 > -> HashAggregate > Group Key: b_2.j > -> Hash Join > Hash Cond: (b_2.j = c_2.k) > -> Seq Scan on b_2 > -> Hash > -> Seq Scan on c_2 > (15 rows) I think this is not generic enough because the result of the Append plan can be joined to another relation. As such a join can duplicate the already-aggregated values, the aggregates should not be finalized below the top-level plan. > Antonin, I have tried applying your patch on master but it doesn't apply. > Can you please provide the HEAD and any other changes required to be > applied first? I've lost that information. I'll post a new version to the [1] thread asap. > What does the plan look like when the GROUP BY key does not match the > partitioning key, i.e. GROUP BY b.v? EXPLAIN (COSTS false) SELECT b.v, avg(b.v + c.v) FROM b JOIN c ON b.j = c.k GROUP BY b.v; QUERY PLAN Finalize HashAggregate Group Key: b_1.v -> Append -> Partial HashAggregate Group Key: b_1.v -> Hash Join Hash Cond: (b_1.j = c_1.k) -> Seq Scan on b_1 -> Hash -> Seq Scan on c_1 -> Partial HashAggregate Group Key: b_2.v -> Hash Join Hash Cond: (b_2.j = c_2.k) -> Seq Scan on b_2 -> Hash -> Seq Scan on c_2 > > [1] https://www.postgresql.org/message-id/9666.1491295317%40localhost > > > > [2] https://commitfest.postgresql.org/14/994/ -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at
Re: [HACKERS] Partition-wise aggregation/grouping
On Thu, Apr 27, 2017 at 4:53 PM, Antonin Houska wrote: > Robert Haas wrote: > > > On Wed, Apr 26, 2017 at 6:28 AM, Antonin Houska wrote: > > > Attached is a diff that contains both patches merged. This is just to > prove my assumption, details to be elaborated later. The scripts attached > produce the following plan in my environment: > > > > > > QUERY PLAN > > > > > > Parallel Finalize HashAggregate > > > Group Key: b_1.j > > > -> Append > > > -> Parallel Partial HashAggregate > > > Group Key: b_1.j > > > -> Hash Join > > > Hash Cond: (b_1.j = c_1.k) > > > -> Seq Scan on b_1 > > > -> Hash > > > -> Seq Scan on c_1 > > > -> Parallel Partial HashAggregate > > > Group Key: b_2.j > > > -> Hash Join > > > Hash Cond: (b_2.j = c_2.k) > > > -> Seq Scan on b_2 > > > -> Hash > > > -> Seq Scan on c_2 > > > > Well, I'm confused. I see that there's a relationship between what > > Antonin is trying to do and what Jeevan is trying to do, but I can't > > figure out whether one is a subset of the other, whether they're both > > orthogonal, or something else. This plan looks similar to what I > > would expect Jeevan's patch to produce, > > The point is that the patch Jeevan wanted to work on is actually a subset > of [1] combined with [2]. > It seems so, as you are targeting every relation whether or not it is partitioned, whereas I am targeting only partitioned relations in my patch. > > > except I have no idea what "Parallel" would mean in a plan that contains > no Gather node. > > The parallel_aware field was mistakenly set on the AggPath. A fixed patch is > attached below, producing this plan: > > QUERY PLAN > > Finalize HashAggregate > Group Key: b_1.j > -> Append > -> Partial HashAggregate > Group Key: b_1.j > -> Hash Join > Hash Cond: (b_1.j = c_1.k) > -> Seq Scan on b_1 > -> Hash > -> Seq Scan on c_1 > -> Partial HashAggregate > Group Key: b_2.j > -> Hash Join > Hash Cond: (b_2.j = c_2.k) > -> Seq Scan on b_2 > -> Hash > -> Seq Scan on c_2 > With my patch, I am getting the following plan, where we push the entire aggregation below the Append. QUERY PLAN -- Append -> HashAggregate Group Key: b_1.j -> Hash Join Hash Cond: (b_1.j = c_1.k) -> Seq Scan on b_1 -> Hash -> Seq Scan on c_1 -> HashAggregate Group Key: b_2.j -> Hash Join Hash Cond: (b_2.j = c_2.k) -> Seq Scan on b_2 -> Hash -> Seq Scan on c_2 (15 rows) Antonin, I have tried applying your patch on master but it doesn't apply. Can you please provide the HEAD and any other changes required to be applied first? What does the plan look like when the GROUP BY key does not match the partitioning key, i.e. GROUP BY b.v? > > [1] https://www.postgresql.org/message-id/9666.1491295317%40localhost > > [2] https://commitfest.postgresql.org/14/994/ > > -- > Antonin Houska > Cybertec Schönig & Schönig GmbH > Gröhrmühlgasse 26 > A-2700 Wiener Neustadt > Web: http://www.postgresql-support.de, http://www.cybertec.at > -- Jeevan Chalke Principal Software Engineer, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company Phone: +91 20 66449694 Website: www.enterprisedb.com EnterpriseDB Blog: http://blogs.enterprisedb.com/ Follow us on Twitter: http://www.twitter.com/enterprisedb
Re: [HACKERS] Partition-wise aggregation/grouping
On Wed, Apr 26, 2017 at 6:28 AM, Antonin Houska wrote: > Attached is a diff that contains both patches merged. This is just to prove my > assumption, details to be elaborated later. The scripts attached produce the > following plan in my environment: > > QUERY PLAN > > Parallel Finalize HashAggregate > Group Key: b_1.j > -> Append > -> Parallel Partial HashAggregate > Group Key: b_1.j > -> Hash Join > Hash Cond: (b_1.j = c_1.k) > -> Seq Scan on b_1 > -> Hash > -> Seq Scan on c_1 > -> Parallel Partial HashAggregate > Group Key: b_2.j > -> Hash Join > Hash Cond: (b_2.j = c_2.k) > -> Seq Scan on b_2 > -> Hash > -> Seq Scan on c_2 Well, I'm confused. I see that there's a relationship between what Antonin is trying to do and what Jeevan is trying to do, but I can't figure out whether one is a subset of the other, whether they're both orthogonal, or something else. This plan looks similar to what I would expect Jeevan's patch to produce, except I have no idea what "Parallel" would mean in a plan that contains no Gather node. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
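For contrast, stock parallel aggregation places a Gather between the Finalize and Partial stages, roughly like this (the shape below is from unpatched PostgreSQL; the table name is hypothetical):

EXPLAIN (COSTS OFF) SELECT count(*) FROM t;
--  Finalize Aggregate
--    ->  Gather
--          ->  Partial Aggregate
--                ->  Parallel Seq Scan on t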
Re: [HACKERS] Partition-wise aggregation/grouping
Antonin Houska wrote: > > Jeevan Chalke wrote: > > > Our work will overlap when we are pushing down the aggregate on partitioned > > base relation to its children/partitions. > > > > I think you should continue working on pushing down the aggregate onto > > joins/scans, whereas I will continue my work on pushing down aggregates to > > partitions (joins as well as single tables). Once we are done with these > > tasks, we may need to find a way to integrate them. > > > > [1] > > https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=ead...@mail.gmail.com#CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=ead...@mail.gmail.com > > My patch also creates (partial) aggregation paths below the Append node, > but only expects SeqScan as input. Please check whether your patch can be > based on this or whether there's any conflict. Well, I haven't imposed any explicit restriction on the kind of path to be aggregated below the Append path. Maybe the only thing to do is to merge my patch with the "partition-wise join" patch (which I haven't checked yet). -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at
Re: [HACKERS] Partition-wise aggregation/grouping
The promised new version of my patch is here: https://www.postgresql.org/message-id/9666.1491295317%40localhost Jeevan Chalke wrote: > On Tue, Mar 21, 2017 at 1:47 PM, Antonin Houska wrote: > > Jeevan Chalke wrote: > > IIUC, it seems that you are trying to push down the aggregation into the > joining relations. So basically you are converting > Agg -> Join -> {scan1, scan2} into > FinalAgg -> Join -> {PartialAgg -> scan1, PartialAgg -> scan2}. > In addition to that, your patch pushes aggregates on a base rel to its children, > if any. > > Whereas what I propose here is pushing down aggregation below the Append > node, keeping the join/scan as is. So basically I am converting > Agg -> Append -> Join -> {scan1, scan2} into > Append -> Agg -> Join -> {scan1, scan2}. > This will require partition-wise join as posted in [1]. > But I am planning to make this work for partitioned relations and not for > generic inheritance. > > I treat these two as separate strategies/paths to be considered while planning. > > Our work will overlap when we are pushing down the aggregate on a partitioned > base relation to its children/partitions. > > I think you should continue working on pushing down the aggregate onto > joins/scans, whereas I will continue my work on pushing down aggregates to > partitions (joins as well as single tables). Once we are done with these tasks, > we may need to find a way to integrate them. > > [1] > https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=ead...@mail.gmail.com#CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=ead...@mail.gmail.com My patch also creates (partial) aggregation paths below the Append node, but only expects SeqScan as input. Please check whether your patch can be based on this or whether there's any conflict. (I'll probably be unable to respond before Monday 04/17.) -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at
Re: [HACKERS] Partition-wise aggregation/grouping
On Tue, Mar 21, 2017 at 1:47 PM, Antonin Houska wrote: > Jeevan Chalke wrote: > > > Declarative partitioning is supported in PostgreSQL 10 and work is > already in progress to support partition-wise joins. Here is a proposal for > partition-wise aggregation/grouping. Our initial performance measurement > has shown a 7x improvement when partitions are on foreign servers and > approximately 15% when partitions are local. > > > > Partition-wise aggregation/grouping computes aggregates for each > partition separately. If the group clause contains the partition key, all the rows > > belonging to a given group come from one partition, thus allowing > aggregates to be computed completely for each partition. Otherwise, partial > aggregates computed for each partition are combined across the partitions to > produce the final aggregates. This technique improves performance because: > > > i. When partitions are located on a foreign server, we can push down the > > aggregate to the foreign server. > > > ii. If the hash table for each partition fits in memory, but that for the > whole relation does not, each partition-wise aggregate can use an in-memory > hash table. > > > iii. Aggregation at the level of partitions can exploit properties of > > partitions like indexes, their storage etc. > > I suspect this overlaps with > > https://www.postgresql.org/message-id/29111.1483984605%40localhost > > I'm working on the next version of the patch, which will be able to > aggregate the result of both base relation scans and joins. I'm trying hard to make > the next version available before an urgent vacation that I'll have to take at > a random date between today and early April. I suggest that we coordinate the > effort; it's a lot of work in any case. > IIUC, it seems that you are trying to push down the aggregation into the joining relations. So basically you are converting Agg -> Join -> {scan1, scan2} into FinalAgg -> Join -> {PartialAgg -> scan1, PartialAgg -> scan2}. In addition to that, your patch pushes aggregates on a base rel to its children, if any. Whereas what I propose here is pushing down aggregation below the Append node, keeping the join/scan as is. So basically I am converting Agg -> Append -> Join -> {scan1, scan2} into Append -> Agg -> Join -> {scan1, scan2}. This will require partition-wise join as posted in [1]. But I am planning to make this work for partitioned relations and not for generic inheritance. I treat these two as separate strategies/paths to be considered while planning. Our work will overlap when we are pushing down the aggregate on a partitioned base relation to its children/partitions. I think you should continue working on pushing down the aggregate onto joins/scans, whereas I will continue my work on pushing down aggregates to partitions (joins as well as single tables). Once we are done with these tasks, we may need to find a way to integrate them.
[1] https://www.postgresql.org/message-id/flat/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=ead...@mail.gmail.com#CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=ead...@mail.gmail.com > > -- > Antonin Houska > Cybertec Schönig & Schönig GmbH > Gröhrmühlgasse 26 > A-2700 Wiener Neustadt > Web: http://www.postgresql-support.de, http://www.cybertec.at > -- Jeevan Chalke Principal Software Engineer, Product Development EnterpriseDB Corporation The Enterprise PostgreSQL Company
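The conversion described above can be checked with a query like the following. This is only a sketch: b and c stand for the partitioned tables from Antonin's scripts, which are not reproduced in this thread.

SET enable_partition_wise_join TO true;
SET enable_partition_wise_agg TO true;
-- Grouping by b.j, the join/partition key, should allow full aggregation
-- per partition, i.e. Agg nodes below the Append instead of above it.
EXPLAIN (COSTS OFF)
SELECT b.j, avg(b.v + c.v) FROM b JOIN c ON b.j = c.k GROUP BY b.j;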
Re: [HACKERS] Partition-wise aggregation/grouping
Jeevan Chalke wrote: > Declarative partitioning is supported in PostgreSQL 10 and work is already in > progress to support partition-wise joins. Here is a proposal for > partition-wise aggregation/grouping. Our initial performance measurement has shown a 7x > improvement when partitions are on foreign servers and approximately 15% when > partitions are local. > > Partition-wise aggregation/grouping computes aggregates for each partition > separately. If the group clause contains the partition key, all the rows > belonging to a given group come from one partition, thus allowing aggregates > to be computed completely for each partition. Otherwise, partial aggregates > computed for each partition are combined across the partitions to produce the > final aggregates. This technique improves performance because: > i. When partitions are located on a foreign server, we can push down the > aggregate to the foreign server. > ii. If the hash table for each partition fits in memory, but that for the whole > relation does not, each partition-wise aggregate can use an in-memory hash > table. > iii. Aggregation at the level of partitions can exploit properties of > partitions like indexes, their storage etc. I suspect this overlaps with https://www.postgresql.org/message-id/29111.1483984605%40localhost I'm working on the next version of the patch, which will be able to aggregate the result of both base relation scans and joins. I'm trying hard to make the next version available before an urgent vacation that I'll have to take at a random date between today and early April. I suggest that we coordinate the effort; it's a lot of work in any case. -- Antonin Houska Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de, http://www.cybertec.at