Re: Support run-time partition pruning for hash join
On Tue, Jan 30, 2024 at 10:33 AM Richard Guo wrote:
> Attached is an updated patch. Nothing else has changed.

Here is another rebase over master so it applies again. Nothing else has changed.

Thanks
Richard

v7-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data
Re: Support run-time partition pruning for hash join
On Sat, Jan 27, 2024 at 11:29 AM vignesh C wrote:
> CFBot shows that the patch does not apply anymore as in [1]:
>
> Please post an updated version for the same.

Attached is an updated patch. Nothing else has changed.

Thanks
Richard

v6-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data
Re: Support run-time partition pruning for hash join
On Tue, 7 Nov 2023 at 13:25, Richard Guo wrote:
> On Mon, Nov 6, 2023 at 11:00 PM Alexander Lakhin wrote:
>> Please look at a warning and an assertion failure triggered by the
>> following script:
>> set parallel_setup_cost = 0;
>> set parallel_tuple_cost = 0;
>> set min_parallel_table_scan_size = '1kB';
>>
>> create table t1 (i int) partition by range (i);
>> create table t1_1 partition of t1 for values from (1) to (2);
>> create table t1_2 partition of t1 for values from (2) to (3);
>> insert into t1 values (1), (2);
>>
>> create table t2(i int);
>> insert into t2 values (1), (2);
>> analyze t1, t2;
>>
>> select * from t1 right join t2 on t1.i = t2.i;
>>
>> 2023-11-06 14:11:37.398 UTC|law|regression|6548f419.392cf5|WARNING: Join
>> partition pruning $0 has not been performed yet.
>> TRAP: failed Assert("node->as_prune_state"), File: "nodeAppend.c", Line:
>> 846, PID: 3747061
>
> Thanks for the report! I failed to take care of the parallel-hashjoin
> case, and I have to admit that it's not clear to me yet how we should do
> join partition pruning in that case.
>
> For now I think it's better to just avoid performing join partition
> pruning for parallel hashjoin, so that the patch doesn't become too
> complex for review. We can always extend it in the future.
>
> I have done that in v5. Thanks for testing!

CFBot shows that the patch does not apply anymore as in [1]:

=== Applying patches on top of PostgreSQL commit ID 924d046dcf55887c98a1628675a30f4b0eebe556 ===
=== applying patch ./v5-0001-Support-run-time-partition-pruning-for-hash-join.patch
...
patching file src/include/nodes/plannodes.h
...
patching file src/include/optimizer/cost.h
Hunk #1 FAILED at 211.
1 out of 1 hunk FAILED -- saving rejects to file src/include/optimizer/cost.h.rej

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_4512.log

Regards,
Vignesh
Re: Support run-time partition pruning for hash join
On Mon, Nov 6, 2023 at 11:00 PM Alexander Lakhin wrote:
> Please look at a warning and an assertion failure triggered by the
> following script:
> set parallel_setup_cost = 0;
> set parallel_tuple_cost = 0;
> set min_parallel_table_scan_size = '1kB';
>
> create table t1 (i int) partition by range (i);
> create table t1_1 partition of t1 for values from (1) to (2);
> create table t1_2 partition of t1 for values from (2) to (3);
> insert into t1 values (1), (2);
>
> create table t2(i int);
> insert into t2 values (1), (2);
> analyze t1, t2;
>
> select * from t1 right join t2 on t1.i = t2.i;
>
> 2023-11-06 14:11:37.398 UTC|law|regression|6548f419.392cf5|WARNING: Join
> partition pruning $0 has not been performed yet.
> TRAP: failed Assert("node->as_prune_state"), File: "nodeAppend.c", Line:
> 846, PID: 3747061

Thanks for the report! I failed to take care of the parallel-hashjoin
case, and I have to admit that it's not clear to me yet how we should do
join partition pruning in that case.

For now I think it's better to just avoid performing join partition
pruning for parallel hashjoin, so that the patch doesn't become too
complex for review. We can always extend it in the future.

I have done that in v5. Thanks for testing!

Thanks
Richard

v5-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data
Re: Support run-time partition pruning for hash join
Hello Richard,

06.11.2023 06:05, Richard Guo wrote:
> Fixed this issue in v4.

Please look at a warning and an assertion failure triggered by the
following script:

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = '1kB';

create table t1 (i int) partition by range (i);
create table t1_1 partition of t1 for values from (1) to (2);
create table t1_2 partition of t1 for values from (2) to (3);
insert into t1 values (1), (2);

create table t2(i int);
insert into t2 values (1), (2);
analyze t1, t2;

select * from t1 right join t2 on t1.i = t2.i;

2023-11-06 14:11:37.398 UTC|law|regression|6548f419.392cf5|WARNING: Join partition pruning $0 has not been performed yet.
TRAP: failed Assert("node->as_prune_state"), File: "nodeAppend.c", Line: 846, PID: 3747061

Best regards,
Alexander
Re: Support run-time partition pruning for hash join
On Sat, Nov 4, 2023 at 6:00 PM Alexander Lakhin wrote:
> 02.11.2023 14:19, Richard Guo wrote:
>> However, the cfbot indicates that there are test cases that fail on
>> FreeBSD [1] (no failure on other platforms). So I set up a FreeBSD-13
>> locally but just cannot reproduce the failure. I must be doing
>> something wrong. Can anyone give me some hints or suggestions?
>
> I've managed to reproduce that failure on my Ubuntu with:
> CPPFLAGS="-Og -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES" ./configure ... make check

Wow, thank you so much. You saved me a lot of time. It turns out that
the failure was caused by me not making JoinPartitionPruneInfo a node.
The same issue also exists for JoinPartitionPruneCandidateInfo - if you
pprint(root) at some point you'll see the 'could not dump unrecognized
node type' warning.

Fixed this issue in v4.

Thanks
Richard

v4-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data
Re: Support run-time partition pruning for hash join
Hello Richard,

02.11.2023 14:19, Richard Guo wrote:
> However, the cfbot indicates that there are test cases that fail on
> FreeBSD [1] (no failure on other platforms). So I set up a FreeBSD-13
> locally but just cannot reproduce the failure. I must be doing
> something wrong. Can anyone give me some hints or suggestions?
>
> FYI. The failure looks like:
>
> explain (costs off)
> select p2.a, p1.c from permtest_parent p1 inner join permtest_parent p2
>   on p1.a = p2.a and left(p1.c, 3) ~ 'a1$';
> -                      QUERY PLAN
> -------------------------------------------------------
> - Hash Join
> -   Hash Cond: (p2.a = p1.a)
> -   ->  Seq Scan on permtest_grandchild p2
> -   ->  Hash
> -         ->  Seq Scan on permtest_grandchild p1
> -               Filter: ("left"(c, 3) ~ 'a1$'::text)
> -(6 rows)
> -
> +ERROR:  unrecognized node type: 1130127496

I've managed to reproduce that failure on my Ubuntu with:
CPPFLAGS="-Og -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES" ./configure ... make check
...
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 ORDER BY t1.a, t2.b;
-                   QUERY PLAN
-------------------------------------------------
- Sort
-   Sort Key: t1.a, t2.b
-   ->  Hash Right Join
-         Hash Cond: (t2.b = t1.a)
-         ->  Append
-               ->  Seq Scan on prt2_p1 t2_1
-               ->  Seq Scan on prt2_p2 t2_2
-               ->  Seq Scan on prt2_p3 t2_3
-         ->  Hash
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                           Filter: (b = 0)
-                     ->  Seq Scan on prt1_p2 t1_2
-                           Filter: (b = 0)
-                     ->  Seq Scan on prt1_p3 t1_3
-                           Filter: (b = 0)
-(16 rows)
-
+ERROR:  unrecognized node type: -1465804424
...

As far as I can see from https://cirrus-ci.com/task/6642692659085312,
the FreeBSD host has the following CPPFLAGS specified:
-DRELCACHE_FORCE_RELEASE -DCOPY_PARSE_PLAN_TREES
-DWRITE_READ_PARSE_PLAN_TREES -DRAW_EXPRESSION_COVERAGE_TEST
-DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

Best regards,
Alexander
Re: Support run-time partition pruning for hash join
On Tue, Aug 29, 2023 at 6:41 PM Richard Guo wrote:
> So it seems that the new costing logic is quite crude and tends to be
> very conservative, but it can help avoid the large overhead in the worst
> cases. I think this might be a good start to push this patch forward.
>
> Any thoughts or comments?

I rebased this patch over the latest master. Nothing changed except that
I revised the newly added test case to make it more stable.

However, the cfbot indicates that there are test cases that fail on
FreeBSD [1] (no failure on other platforms). So I set up a FreeBSD-13
locally but just cannot reproduce the failure. I must be doing something
wrong. Can anyone give me some hints or suggestions?

FYI. The failure looks like:

explain (costs off)
select p2.a, p1.c from permtest_parent p1 inner join permtest_parent p2
  on p1.a = p2.a and left(p1.c, 3) ~ 'a1$';
-                      QUERY PLAN
-------------------------------------------------------
- Hash Join
-   Hash Cond: (p2.a = p1.a)
-   ->  Seq Scan on permtest_grandchild p2
-   ->  Hash
-         ->  Seq Scan on permtest_grandchild p1
-               Filter: ("left"(c, 3) ~ 'a1$'::text)
-(6 rows)
-
+ERROR:  unrecognized node type: 1130127496

[1] https://api.cirrus-ci.com/v1/artifact/task/5334808075698176/testrun/build/testrun/regress/regress/regression.diffs

Thanks
Richard

v3-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data
Re: Support run-time partition pruning for hash join
On Fri, Aug 25, 2023 at 11:03 AM David Rowley wrote:
> I'd suggest writing some cost which costs an execution of run-time
> pruning. With LIST and RANGE you probably want something like
> cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
> for the binary search over the sorted datum array. For HASH
> partitions, something like cpu_operator_cost * npartcols once for each
> hashed tuple.
>
> You'll need to then come up with some counter costs to subtract from
> the Append/MergeAppend. This is tricky, as discussed. Just come up
> with something crude for now.
>
> To start with, it could just be as crude as:
>
> total_costs *= (Min(expected_outer_rows, n_append_subnodes) /
>                 n_append_subnodes);
>
> i.e. assume that every outer joined row will require exactly 1 new
> partition up to the total number of partitions. That's pretty much
> worst-case, but it'll at least allow the optimisation to work for
> cases like where the hash table is expected to contain just a tiny
> number of rows (fewer than the number of partitions)
>
> To make it better, you might want to look at join selectivity
> estimation and see if you can find something there to influence
> something better.

I had a go at writing some costing code according to your suggestion.
That's compute_partprune_cost() in the v2 patch.

For the hash side, this function computes the pruning cost as
cpu_operator_cost * LOG2(nparts) * inner_rows for LIST and RANGE, and
cpu_operator_cost * nparts * inner_rows for HASH.

For the Append/MergeAppend side, this function first estimates the size
of the outer side that matches, using the same idea as when we estimate
the joinrel size for JOIN_SEMI. Then it assumes that each outer joined
row occupies one new partition (the worst case) and computes how much
cost can be saved by partition pruning.

If the cost saved on the Append/MergeAppend side is larger than the
pruning cost on the Hash side, then we say that partition pruning is a
win.

Note that this costing logic runs for each Append-Hash pair, so it copes
with the case where we have multiple join levels.

With this costing logic added, I performed the same performance
comparisons of the worst case as in [1], and here is what I got.

tuples    unpatched    patched
1         44.66        44.37     -0.006493506
2         52.41        52.29     -0.002289639
3         61.11        61.12     +0.000163639
4         67.87        68.24     +0.005451599
5         74.51        74.75     +0.003221044
6         82.3         81.55     -0.009113001
7         87.16        86.98     -0.002065168
8         93.49        93.89     +0.004278532
9         101.52       100.83    -0.00679669
10        108.34       108.56    +0.002030644

So the costing logic successfully avoids performing the partition
pruning in the worst case.

I also tested the cases where partition pruning is possible, with
different sizes of the hash side.

tuples    unpatched    patched
100       36.86        2.4       -0.934888768
200       35.87        2.37      -0.933928074
300       35.95        2.55      -0.92906815
400       36.4         2.63      -0.927747253
500       36.39        2.85      -0.921681781
600       36.32        2.97      -0.918226872
700       36.6         3.23      -0.911748634
800       36.88        3.44      -0.906724512
900       37.02        3.46      -0.906537007
1000      37.25        37.21     -0.001073826

The first 9 rows show that the costing logic allows the partition
pruning to be performed, and the pruning turns out to be a big win. The
last row shows that the partition pruning is disallowed by the costing
logic because it thinks no partition can be pruned (we have 1000
partitions in total).

So it seems that the new costing logic is quite crude and tends to be
very conservative, but it can help avoid the large overhead in the worst
cases. I think this might be a good start to push this patch forward.

Any thoughts or comments?

[1] https://www.postgresql.org/message-id/CAMbWs49%2Bp6hBxXJHFiSwOtPCSkAHwhJj3hTpCR_pmMiUUVLZ1Q%40mail.gmail.com

Thanks
Richard

v2-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data
Re: Support run-time partition pruning for hash join
On Fri, Aug 25, 2023 at 11:03 AM David Rowley wrote:
> On Thu, 24 Aug 2023 at 21:27, Richard Guo wrote:
>> I think we need to solve this problem first before we can make this
>> new partition pruning mechanism somewhat useful in practice, but how?
>> Some thoughts currently in my mind include
>>
>> 1) we try our best to estimate the cost of this partition pruning when
>> creating hash join paths, and decide based on the cost whether to use
>> it or not. But this does not seem to be an easy task.
>
> I think we need to consider another Hash Join path when we detect that
> the outer side of the Hash Join involves scanning a partitioned table.
>
> I'd suggest writing some cost which costs an execution of run-time
> pruning. With LIST and RANGE you probably want something like
> cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
> for the binary search over the sorted datum array. For HASH
> partitions, something like cpu_operator_cost * npartcols once for each
> hashed tuple.
>
> You'll need to then come up with some counter costs to subtract from
> the Append/MergeAppend. This is tricky, as discussed. Just come up
> with something crude for now.
>
> To start with, it could just be as crude as:
>
> total_costs *= (Min(expected_outer_rows, n_append_subnodes) /
>                 n_append_subnodes);
>
> i.e. assume that every outer joined row will require exactly 1 new
> partition up to the total number of partitions. That's pretty much
> worst-case, but it'll at least allow the optimisation to work for
> cases like where the hash table is expected to contain just a tiny
> number of rows (fewer than the number of partitions)
>
> To make it better, you might want to look at join selectivity
> estimation and see if you can find something there to influence
> something better.

Thank you for the suggestion. I will take some time to consider it.

When we have multiple join levels, the situation becomes even more
complex. One Append/MergeAppend node might be pruned by more than one
Hash node, and one Hash node might provide pruning for more than one
Append/MergeAppend node. For instance, below is the plan from the test
case added in the v1 patch:

explain (analyze, costs off, summary off, timing off)
select * from tprt p1
   inner join tprt p2 on p1.col1 = p2.col1
   right join tbl1 t on p1.col1 = t.col1 and p2.col1 = t.col1;
                            QUERY PLAN
------------------------------------------------------------------
 Hash Right Join (actual rows=2 loops=1)
   Hash Cond: ((p1.col1 = t.col1) AND (p2.col1 = t.col1))
   ->  Hash Join (actual rows=3 loops=1)
         Hash Cond: (p1.col1 = p2.col1)
         ->  Append (actual rows=3 loops=1)
               ->  Seq Scan on tprt_1 p1_1 (never executed)
               ->  Seq Scan on tprt_2 p1_2 (actual rows=3 loops=1)
               ->  Seq Scan on tprt_3 p1_3 (never executed)
               ->  Seq Scan on tprt_4 p1_4 (never executed)
               ->  Seq Scan on tprt_5 p1_5 (never executed)
               ->  Seq Scan on tprt_6 p1_6 (never executed)
         ->  Hash (actual rows=3 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Append (actual rows=3 loops=1)
                     ->  Seq Scan on tprt_1 p2_1 (never executed)
                     ->  Seq Scan on tprt_2 p2_2 (actual rows=3 loops=1)
                     ->  Seq Scan on tprt_3 p2_3 (never executed)
                     ->  Seq Scan on tprt_4 p2_4 (never executed)
                     ->  Seq Scan on tprt_5 p2_5 (never executed)
                     ->  Seq Scan on tprt_6 p2_6 (never executed)
   ->  Hash (actual rows=2 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on tbl1 t (actual rows=2 loops=1)
(23 rows)

In this plan, the Append node of 'p1' is pruned by two Hash nodes: the
Hash node of 't' and the Hash node of 'p2'. Meanwhile, the Hash node of
't' provides pruning for two Append nodes: the Append node of 'p1' and
the Append node of 'p2'. In this case, meaningfully costing the
partition pruning seems even more difficult. Do you have any suggestion
on that?

>> 2) we use some heuristics when executing hash join, such as when we
>> notice that a $threshold percentage of the partitions must be visited
>> we just abort the pruning and assume that no partitions can be pruned.
>
> You could likely code in something that checks
> bms_num_members(jpstate->part_prune_result) to see if it still remains
> below the total Append/MergeAppend subplans whenever, say whenever the
> lower 8 bits of hashtable->totalTuples are all off. You can just give
> up doing any further pruning when all partitions are already required.

Yeah, we can do that. While this may not help in the tests I performed
for the worst case, because the table on the hash side is designed so
that tuples belonging to the same partition are placed together, we
need to scan almost all of its tuples before we could know that all
partitions are already required.
Re: Support run-time partition pruning for hash join
On Thu, 24 Aug 2023 at 21:27, Richard Guo wrote:
> I performed some performance comparisons of the worst case with two
> tables as below:
>
> 1. The partitioned table has 1000 children, and 100,000 tuples in total.
>
> 2. The other table is designed so that
>    a) its tuples occupy every partition of the partitioned table so
>       that no partitions can be pruned during execution,
>    b) tuples belonging to the same partition are placed together so
>       that we need to scan all its tuples before we could know that no
>       pruning would happen and we could stop trying to prune,
>    c) the tuples are unique on the hash key so as to minimize the cost
>       of hash probes, so that we can highlight the impact of the
>       pruning code.
>
> Here is the execution time (ms) I get with different sizes of the other
> table.
>
> tuples    unpatched    patched
> 1         45.74        53.46 (+0.17)
> 2         54.48        70.18 (+0.29)
> 3         62.57        85.18 (+0.36)
> 4         69.14        99.19 (+0.43)
> 5         76.46        111.09 (+0.45)
> 6         82.68        126.37 (+0.53)
> 7         92.69        137.89 (+0.49)
> 8         94.49        151.46 (+0.60)
> 9         101.53       164.93 (+0.62)
> 10        107.22       178.44 (+0.66)
>
> So the overhead the pruning code adds to Hash Join is too large to be
> accepted :(.

Agreed. Run-time pruning is pretty fast to execute, but so is inserting
a row into a hash table.

> I think we need to solve this problem first before we can make this
> new partition pruning mechanism somewhat useful in practice, but how?
> Some thoughts currently in my mind include
>
> 1) we try our best to estimate the cost of this partition pruning when
> creating hash join paths, and decide based on the cost whether to use
> it or not. But this does not seem to be an easy task.

I think we need to consider another Hash Join path when we detect that
the outer side of the Hash Join involves scanning a partitioned table.

I'd suggest writing some cost which costs an execution of run-time
pruning. With LIST and RANGE you probably want something like
cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
for the binary search over the sorted datum array. For HASH partitions,
something like cpu_operator_cost * npartcols once for each hashed tuple.

You'll need to then come up with some counter costs to subtract from
the Append/MergeAppend. This is tricky, as discussed. Just come up with
something crude for now.

To start with, it could just be as crude as:

total_costs *= (Min(expected_outer_rows, n_append_subnodes) /
                n_append_subnodes);

i.e. assume that every outer joined row will require exactly 1 new
partition, up to the total number of partitions. That's pretty much the
worst case, but it'll at least allow the optimisation to work for cases
where the hash table is expected to contain just a tiny number of rows
(fewer than the number of partitions).

To make it better, you might want to look at join selectivity
estimation and see if you can find something there to influence
something better.

> 2) we use some heuristics when executing hash join, such as when we
> notice that a $threshold percentage of the partitions must be visited
> we just abort the pruning and assume that no partitions can be pruned.

You could likely code in something that checks
bms_num_members(jpstate->part_prune_result) to see if it still remains
below the total number of Append/MergeAppend subplans whenever, say,
the lower 8 bits of hashtable->totalTuples are all off. You can just
give up doing any further pruning when all partitions are already
required.

David
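[Editor's note] The "give up when everything is already required" check David describes could look roughly like the following toy sketch. It is not the patch's code: the struct and function names are made up, and a plain 64-bit mask stands in for PostgreSQL's Bitmapset, but it shows the cheap check on the low 8 bits of the hashed-tuple counter.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct JoinPruneState
{
	uint64_t	total_tuples;		/* tuples inserted into the hash table */
	uint64_t	surviving_parts;	/* bit i set => subplan i still required */
	int			n_subplans;			/* total Append/MergeAppend subplans */
	bool		prune_disabled;		/* set once pruning can no longer help */
} JoinPruneState;

static int
count_bits(uint64_t mask)
{
	int			n = 0;

	while (mask)
	{
		n += mask & 1;
		mask >>= 1;
	}
	return n;
}

/*
 * Called for each tuple hashed. Every 256 tuples (lower 8 bits of the
 * counter all zero) check whether every subplan is already required,
 * and if so stop paying the per-tuple pruning cost from then on.
 */
static void
note_hashed_tuple(JoinPruneState *state, int matched_partition)
{
	if (state->prune_disabled)
		return;
	state->surviving_parts |= UINT64_C(1) << matched_partition;
	state->total_tuples++;
	if ((state->total_tuples & 0xFF) == 0 &&
		count_bits(state->surviving_parts) == state->n_subplans)
		state->prune_disabled = true;
}
```

As the thread notes later, this heuristic caps the overhead but cannot help the adversarial layout where tuples of the last partition all come at the end of the scan.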
Re: Support run-time partition pruning for hash join
On Tue, Aug 22, 2023 at 2:38 PM David Rowley wrote:
> It would be good to see some performance comparisons of the worst case
> to see how much overhead the pruning code adds to Hash Join. It may
> well be that we need to consider two Hash Join paths, one with and one
> without run-time pruning. It's pretty difficult to meaningfully cost,
> as I already mentioned, however.

I performed some performance comparisons of the worst case with two
tables as below:

1. The partitioned table has 1000 children, and 100,000 tuples in total.

2. The other table is designed so that
   a) its tuples occupy every partition of the partitioned table so
      that no partitions can be pruned during execution,
   b) tuples belonging to the same partition are placed together so
      that we need to scan all its tuples before we could know that no
      pruning would happen and we could stop trying to prune,
   c) the tuples are unique on the hash key so as to minimize the cost
      of hash probes, so that we can highlight the impact of the
      pruning code.

Here is the execution time (ms) I get with different sizes of the other
table.

tuples    unpatched    patched
1         45.74        53.46 (+0.17)
2         54.48        70.18 (+0.29)
3         62.57        85.18 (+0.36)
4         69.14        99.19 (+0.43)
5         76.46        111.09 (+0.45)
6         82.68        126.37 (+0.53)
7         92.69        137.89 (+0.49)
8         94.49        151.46 (+0.60)
9         101.53       164.93 (+0.62)
10        107.22       178.44 (+0.66)

So the overhead the pruning code adds to Hash Join is too large to be
accepted :(. I think we need to solve this problem first before we can
make this new partition pruning mechanism somewhat useful in practice,
but how? Some thoughts currently in my mind include

1) we try our best to estimate the cost of this partition pruning when
creating hash join paths, and decide based on the cost whether to use it
or not. But this does not seem to be an easy task.

2) we use some heuristics when executing hash join, such as when we
notice that a $threshold percentage of the partitions must be visited
we just abort the pruning and assume that no partitions can be pruned.

Any thoughts or comments?

Thanks
Richard
Re: Support run-time partition pruning for hash join
>> fwiw, the current master totally ignores the cost reduction for
>> run-time partition pruning, even for init partition pruning. So in
>> some real cases, pg chooses a hash join just because the cost of a
>> nested loop join is highly over-estimated.
>
> This is true about the existing code. It's a very tricky thing to cost
> given that the parameter values are always unknown to the planner.
> The best we have for these today is the various hardcoded constants in
> selfuncs.h. While I do agree that it's not great that the costing code
> knows nothing about run-time pruning, I also think that run-time
> pruning during execution with parameterised nested loops is much more
> likely to be able to prune partitions and save actual work than the
> equivalent with Hash Joins. It's more common for the planner to
> choose a Nested Loop when there are fewer outer rows, so the pruning
> code is likely to be called fewer times with Nested Loop than with
> Hash Join.

Yes, I agree with this. In my 4 years with PostgreSQL, I have run into
only 2 cases of this issue, and 1 of them involved joining 12+ tables
with run-time partition pruning for every join. But this situation
causes more issues than generating a wrong plan. For example, for a
simple

SELECT * FROM p WHERE partkey = $1;

the generic plan will never win, so we have to pay the expensive
planning cost for the partitioned table every time.

If we don't require very accurate costing for every case - say we only
care about the '=' operator, which is the most common case - it should
be easier than the case here, since we just need to know whether only 1
partition will survive the pruning, not which one it is. I'd like to
discuss that in another thread, and leave this thread for Richard's
patch only.

--
Best Regards
Andy Fan
Re: Support run-time partition pruning for hash join
On Tue, Aug 22, 2023 at 5:43 PM Richard Guo wrote:
> On Mon, Aug 21, 2023 at 8:34 PM Andy Fan wrote:
>> This feature looks good, but is it possible to know if we can prune
>> any subnodes before we pay the extra effort (building the Hash table,
>> for each row... stuff)?
>
> It might be possible if we take the partition pruning into
> consideration when estimating costs. But it seems not easy to
> calculate the costs accurately.

This is the one place where I worry about the future of this patch.
Personally, I do like this patch, but I am not sure what happens if
this issue can't be fixed in a way that makes everyone happy, and
fixing it perfectly looks hopeless to me. However, let's see what will
happen.

>> Maybe at least, if we have found no subnodes can be skipped during
>> the hashing, we can stop doing such work anymore.
>
> Yeah, this is what we can do.

Cool.

>> In my current knowledge, we have to build the inner table first for
>> this optimization? so hash join and sort merge should be OK, but
>> nestloop should be impossible unless I missed something.
>
> For nestloop and mergejoin, we'd always execute the outer side first.
> So the Append/MergeAppend nodes need to be on the inner side for the
> join partition pruning to take effect. For a mergejoin that will
> explicitly sort the outer side, the sort node would process all the
> outer rows before scanning the inner side, so we can do the join
> partition pruning with that. For a nestloop, if we have a Material
> node on the outer side, we can do that too, but I wonder if we'd have
> such a plan in the real world, because we only add Material to the
> inner side of a nestloop.

This is more interesting than I expected, thanks for the explanation.

--
Best Regards
Andy Fan
Re: Support run-time partition pruning for hash join
On Tue, Aug 22, 2023 at 2:38 PM David Rowley wrote:
> With Hash Join, it seems to me that the pruning must take place for
> every row that makes it into the hash table. There will maybe be
> cases where the unioned set of partitions simply yields every
> partition and all the work results in no savings. Pruning on a scalar
> value seems much more likely to be able to prune away unneeded
> Append/MergeAppend subnodes.

Yeah, you're right. If we have 'pt HashJoin t', then for a subnode of
'pt' to be pruned, every row of 't' needs to prune that subnode. The
situation may improve if we have more than a 2-way hash join, because
the final surviving subnodes would be the intersection of the matching
subnodes from each Hash.

With a parameterized nestloop I agree that it's more likely to be able
to prune subnodes at rescan of the Append/MergeAppend nodes based on
scalar values. But sometimes we may just not generate a parameterized
nestloop as the final plan, such as when there are no indexes and no
lateral references in the Append/MergeAppend node. In that case I think
it would be great if we can still do some partition pruning. So I think
this new 'join partition pruning mechanism' (maybe this is not a proper
name) should be treated as a supplement to, not a substitute for, the
current run-time partition pruning based on parameterized nestloops,
and it is so implemented in the patch.

> Perhaps there can be something adaptive in Hash Join which stops
> trying to prune when all partitions must be visited. On a quick
> glance of the patch, I don't see any code in ExecJoinPartitionPrune()
> which gives up trying to prune when the number of members in
> part_prune_result is equal to the prunable Append/MergeAppend
> subnodes.

Yeah, we can do that.

> But run-time pruning already works for Nested Loops... I must be
> missing something here.

Here I mean a nestloop with a non-parameterized inner path. As I
explained upthread, we would need to have a Material node on the outer
side for that to work, which seems unlikely in the real world.

Thanks
Richard
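[Editor's note] The intersection effect with more than two joins can be illustrated with a toy bitmask model (not PostgreSQL code; a uint64 stands in for a Bitmapset, and the names are hypothetical): each Hash node contributes the set of subnodes its rows can match, and only subnodes present in every set must actually be scanned.

```c
#include <stdint.h>

typedef uint64_t PartSet;		/* bit i set => Append subnode i may match */

/*
 * Intersect the surviving-subnode sets contributed by each Hash node.
 * Starts from "all subplans required" and narrows it down; with several
 * joins the final set can be much smaller than any single Hash's set.
 */
static PartSet
intersect_surviving(const PartSet *per_hash_sets, int nhashes, int nsubplans)
{
	PartSet		result = (nsubplans >= 64) ? ~UINT64_C(0)
										   : (UINT64_C(1) << nsubplans) - 1;

	for (int i = 0; i < nhashes; i++)
		result &= per_hash_sets[i];
	return result;
}
```

For example, with 4 subplans, if one Hash's rows can match subnodes {0,1,3} and another's can match {1,2}, only subnode 1 needs to be scanned.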
Re: Support run-time partition pruning for hash join
On Mon, Aug 21, 2023 at 8:34 PM Andy Fan wrote:
> This feature looks good, but is it possible to know if we can prune
> any subnodes before we pay the extra effort (building the Hash table,
> for each row... stuff)?

It might be possible if we take the partition pruning into
consideration when estimating costs. But it seems not easy to calculate
the costs accurately.

> Maybe at least, if we have found no subnodes can be skipped during the
> hashing, we can stop doing such work anymore.

Yeah, this is what we can do.

> In my current knowledge, we have to build the inner table first for
> this optimization? so hash join and sort merge should be OK, but
> nestloop should be impossible unless I missed something.

For nestloop and mergejoin, we'd always execute the outer side first.
So the Append/MergeAppend nodes need to be on the inner side for the
join partition pruning to take effect. For a mergejoin that will
explicitly sort the outer side, the sort node would process all the
outer rows before scanning the inner side, so we can do the join
partition pruning with that. For a nestloop, if we have a Material node
on the outer side, we can do that too, but I wonder if we'd have such a
plan in the real world, because we only add Material to the inner side
of a nestloop.

Thanks
Richard
Re: Support run-time partition pruning for hash join
On Tue, 22 Aug 2023 at 00:34, Andy Fan wrote: > > On Mon, Aug 21, 2023 at 11:48 AM Richard Guo wrote: >> 1. All the join partition prunning decisions are made in createplan.c >>where the best path tree has been decided. This is not great. Maybe >>it's better to make it happen when we build up the path tree, so that >>we can take the partition prunning into consideration when estimating >>the costs. > > > fwiw, the current master totally ignores the cost reduction for run-time > partition prune, even for init partition prune. So in some real cases, > pg chooses a hash join just because the cost of nest loop join is > highly over estimated. This is true about the existing code. It's a very tricky thing to cost given that the parameter values are always unknown to the planner. The best we have for these today is the various hardcoded constants in selfuncs.h. While I do agree that it's not great that the costing code knows nothing about run-time pruning, I also think that run-time pruning during execution with parameterised nested loops is much more likely to be able to prune partitions and save actual work than the equivalent with Hash Joins. It's more common for the planner to choose to Nested Loop when there are fewer outer rows, so the pruning code is likely to be called fewer times with Nested Loop than with Hash Join. With Hash Join, it seems to me that the pruning must take place for every row that makes it into the hash table. There will be maybe cases where the unioned set of partitions simply yields every partition and all the work results in no savings. Pruning on a scalar value seems much more likely to be able to prune away unneeded Append/MergeAppend subnodes. Perhaps there can be something adaptive in Hash Join which stops trying to prune when all partitions must be visited. 
On a quick glance at the patch, I don't see any code in ExecJoinPartitionPrune() which gives up trying to prune once the number of members in part_prune_result equals the number of prunable Append/MergeAppend subnodes. It would be good to see some performance comparisons of the worst case to see how much overhead the pruning code adds to Hash Join. It may well be that we need to consider two Hash Join paths, one with and one without run-time pruning. That is pretty difficult to cost meaningfully, however, as I already mentioned.

>> 4. Is it possible and worthwhile to extend the join partition pruning
>>    mechanism to support nestloop and mergejoin also?
>
> In my current knowledge, we have to build the inner table first for this
> optimization? so hash join and sort merge should be OK, but nestloop should
> be impossible unless I missed something.

But run-time pruning already works for Nested Loops... I must be missing something here.

I imagine for Merge Joins a more generic approach would be better: implementing parameterised Merge Joins (a.k.a. zigzag merge joins). The Append/MergeAppend node could then select the correct partition(s) based on the current parameter value at rescan. I don't think any code changes would be needed in node[Merge]Append.c for that to work. This could also speed up Merge Joins to non-partitioned tables when an index is providing presorted input to the join.

David
Re: Support run-time partition pruning for hash join
On Mon, Aug 21, 2023 at 11:48 AM Richard Guo wrote:
> If we have a hash join with an Append node on the outer side, something
> like
>
> Hash Join
>   Hash Cond: (pt.a = t.a)
>   -> Append
>        -> Seq Scan on pt_p1 pt_1
>        -> Seq Scan on pt_p2 pt_2
>        -> Seq Scan on pt_p3 pt_3
>   -> Hash
>        -> Seq Scan on t
>
> We can actually prune those subnodes of the Append that cannot possibly
> contain any matching tuples from the other side of the join. To do
> that, when building the Hash table, for each row from the inner side we
> can compute the minimum set of subnodes that can possibly match the join
> condition. When we have built the Hash table and start to execute the
> Append node, we know which subnodes survive and thus can skip the other
> subnodes.

This feature looks good, but is it possible to know whether we can prune any subnodes before we pay the extra effort (building the Hash table, the per-row work, and so on)? IIUC, it looks like no. If so, I think this area needs more attention. I can't provide any good suggestions yet. Maybe at least, if we find during hashing that no subnodes can be skipped, we can stop doing such work.

> There are several points that need more consideration.
>
> 1. All the join partition pruning decisions are made in createplan.c
>    where the best path tree has been decided. This is not great. Maybe
>    it's better to make it happen when we build up the path tree, so that
>    we can take the partition pruning into consideration when estimating
>    the costs.

fwiw, the current master totally ignores the cost reduction for run-time partition pruning, even for init partition pruning. So in some real cases, pg chooses a hash join just because the cost of a nested loop join is highly overestimated.

> 4. Is it possible and worthwhile to extend the join partition pruning
>    mechanism to support nestloop and mergejoin also?

In my current knowledge, we have to build the inner table first for this optimization?
so hash join and sort merge should be OK, but nestloop should be impossible unless I missed something.

--
Best Regards
Andy Fan
Support run-time partition pruning for hash join
If we have a hash join with an Append node on the outer side, something like

Hash Join
  Hash Cond: (pt.a = t.a)
  -> Append
       -> Seq Scan on pt_p1 pt_1
       -> Seq Scan on pt_p2 pt_2
       -> Seq Scan on pt_p3 pt_3
  -> Hash
       -> Seq Scan on t

we can actually prune those subnodes of the Append that cannot possibly contain any matching tuples from the other side of the join. To do that, while building the Hash table, for each row from the inner side we compute the minimum set of subnodes that can possibly match the join condition. By the time the Hash table is built and we start to execute the Append node, we know which subnodes survive and can thus skip the others.

This kind of partition pruning can be extended to happen across multiple join levels. For instance,

Hash Join
  Hash Cond: (pt.a = t2.a)
  -> Hash Join
       Hash Cond: (pt.a = t1.a)
       -> Append
            -> Seq Scan on pt_p1 pt_1
            -> Seq Scan on pt_p2 pt_2
            -> Seq Scan on pt_p3 pt_3
       -> Hash
            -> Seq Scan on t1
  -> Hash
       -> Seq Scan on t2

We compute the matching subnodes of the Append when building the Hash table for 't1' according to the join condition 'pt.a = t1.a', and again when building the Hash table for 't2' according to the join condition 'pt.a = t2.a'; the final surviving subnodes are the intersection of the two sets.

Greenplum [1] has implemented this kind of partition pruning as a 'Partition Selector'. Attached is a patch that refactors Greenplum's implementation to make it work on PostgreSQL master. Here are some details about the patch.

During planning:

1. When creating a hash join plan in create_hashjoin_plan(), we first collect the information required to build PartitionPruneInfos at this join, which includes the join's RestrictInfos and the join's inner relids, and put this information on a stack.

2.
When we call create_append_plan() for an appendrel, for each of the joins we check whether join partition pruning can take place for this appendrel, based on the information collected at that join, and if so we build a PartitionPruneInfo and add it to the stack entry.

3. After finishing the outer side of the hash join, we have built all the PartitionPruneInfos that can be used to perform join partition pruning at this join, so we pop the stack entry to get the PartitionPruneInfos and add them to the Hash node.

During execution:

When building the hash table for a hash join, we perform the partition pruning for each row according to each of the JoinPartitionPruneStates at this join, and store each result in a special executor parameter to make it available to the Append nodes. When executing an Append node, we can directly use the pre-computed pruning results to skip subnodes that cannot contain any matching rows.

Here is a query that shows the effect of the join partition pruning.
CREATE TABLE pt (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE pt_p1 PARTITION OF pt FOR VALUES FROM (0) TO (250);
CREATE TABLE pt_p2 PARTITION OF pt FOR VALUES FROM (250) TO (500);
CREATE TABLE pt_p3 PARTITION OF pt FOR VALUES FROM (500) TO (600);
INSERT INTO pt SELECT i, i % 25, to_char(i, 'FM') FROM generate_series(0, 599) i WHERE i % 2 = 0;

CREATE TABLE t1 (a int, b int);
INSERT INTO t1 VALUES (10, 10);

CREATE TABLE t2 (a int, b int);
INSERT INTO t2 VALUES (300, 300);

ANALYZE pt, t1, t2;

SET enable_nestloop TO off;

explain (analyze, costs off, summary off, timing off)
select * from pt join t1 on pt.a = t1.a right join t2 on pt.a = t2.a;
                         QUERY PLAN
------------------------------------------------------------
 Hash Right Join (actual rows=1 loops=1)
   Hash Cond: (pt.a = t2.a)
   ->  Hash Join (actual rows=0 loops=1)
         Hash Cond: (pt.a = t1.a)
         ->  Append (actual rows=0 loops=1)
               ->  Seq Scan on pt_p1 pt_1 (never executed)
               ->  Seq Scan on pt_p2 pt_2 (never executed)
               ->  Seq Scan on pt_p3 pt_3 (never executed)
         ->  Hash (actual rows=1 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Seq Scan on t1 (actual rows=1 loops=1)
   ->  Hash (actual rows=1 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on t2 (actual rows=1 loops=1)
(14 rows)

There are several points that need more consideration.

1. All the join partition pruning decisions are made in createplan.c, where the best path tree has already been decided. This is not great. Maybe it's better to make it happen when we build up the path tree, so that we can take the partition pruning into consideration when estimating the costs.

2. In order to make the join partition pruning take effect, the patch hacks the empty-outer optimization in ExecHashJoinImpl(). Not sure if this is a good practice.

3. This patch does not support parallel