Re: Support run-time partition pruning for hash join

2024-03-19 Thread Richard Guo
On Tue, Jan 30, 2024 at 10:33 AM Richard Guo  wrote:

> Attached is an updated patch.  Nothing else has changed.
>

Here is another rebase over master so it applies again.  Nothing else
has changed.

Thanks
Richard


v7-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data


Re: Support run-time partition pruning for hash join

2024-01-29 Thread Richard Guo
On Sat, Jan 27, 2024 at 11:29 AM vignesh C  wrote:

> CFBot shows that the patch does not apply anymore as in [1]:
>
> Please post an updated version for the same.


Attached is an updated patch.  Nothing else has changed.

Thanks
Richard


v6-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data


Re: Support run-time partition pruning for hash join

2024-01-26 Thread vignesh C
On Tue, 7 Nov 2023 at 13:25, Richard Guo  wrote:
>
>
> On Mon, Nov 6, 2023 at 11:00 PM Alexander Lakhin  wrote:
>>
>> Please look at a warning and an assertion failure triggered by the
>> following script:
>> set parallel_setup_cost = 0;
>> set parallel_tuple_cost = 0;
>> set min_parallel_table_scan_size = '1kB';
>>
>> create table t1 (i int) partition by range (i);
>> create table t1_1 partition of t1 for values from (1) to (2);
>> create table t1_2 partition of t1 for values from (2) to (3);
>> insert into t1 values (1), (2);
>>
>> create table t2(i int);
>> insert into t2 values (1), (2);
>> analyze t1, t2;
>>
>> select * from t1 right join t2 on t1.i = t2.i;
>>
>> 2023-11-06 14:11:37.398 UTC|law|regression|6548f419.392cf5|WARNING:  Join 
>> partition pruning $0 has not been performed yet.
>> TRAP: failed Assert("node->as_prune_state"), File: "nodeAppend.c", Line: 
>> 846, PID: 3747061
>
>
> Thanks for the report!  I failed to take care of the parallel-hashjoin
> case, and I have to admit that it's not clear to me yet how we should do
> join partition pruning in that case.
>
> For now I think it's better to just avoid performing join partition
> pruning for parallel hashjoin, so that the patch doesn't become too
> complex for review.  We can always extend it in the future.
>
> I have done that in v5.  Thanks for testing!

CFBot shows that the patch does not apply anymore as in [1]:
=== Applying patches on top of PostgreSQL commit ID
924d046dcf55887c98a1628675a30f4b0eebe556 ===
=== applying patch
./v5-0001-Support-run-time-partition-pruning-for-hash-join.patch
...
patching file src/include/nodes/plannodes.h
...
patching file src/include/optimizer/cost.h
Hunk #1 FAILED at 211.
1 out of 1 hunk FAILED -- saving rejects to file
src/include/optimizer/cost.h.rej

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_4512.log

Regards,
Vignesh




Re: Support run-time partition pruning for hash join

2023-11-06 Thread Richard Guo
On Mon, Nov 6, 2023 at 11:00 PM Alexander Lakhin 
wrote:

> Please look at a warning and an assertion failure triggered by the
> following script:
> set parallel_setup_cost = 0;
> set parallel_tuple_cost = 0;
> set min_parallel_table_scan_size = '1kB';
>
> create table t1 (i int) partition by range (i);
> create table t1_1 partition of t1 for values from (1) to (2);
> create table t1_2 partition of t1 for values from (2) to (3);
> insert into t1 values (1), (2);
>
> create table t2(i int);
> insert into t2 values (1), (2);
> analyze t1, t2;
>
> select * from t1 right join t2 on t1.i = t2.i;
>
> 2023-11-06 14:11:37.398 UTC|law|regression|6548f419.392cf5|WARNING:  Join
> partition pruning $0 has not been performed yet.
> TRAP: failed Assert("node->as_prune_state"), File: "nodeAppend.c", Line:
> 846, PID: 3747061
>

Thanks for the report!  I failed to take care of the parallel-hashjoin
case, and I have to admit that it's not clear to me yet how we should do
join partition pruning in that case.

For now I think it's better to just avoid performing join partition
pruning for parallel hashjoin, so that the patch doesn't become too
complex for review.  We can always extend it in the future.

I have done that in v5.  Thanks for testing!

Thanks
Richard


v5-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data


Re: Support run-time partition pruning for hash join

2023-11-06 Thread Alexander Lakhin

Hello Richard,

06.11.2023 06:05, Richard Guo wrote:


Fixed this issue in v4.



Please look at a warning and an assertion failure triggered by the
following script:
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = '1kB';

create table t1 (i int) partition by range (i);
create table t1_1 partition of t1 for values from (1) to (2);
create table t1_2 partition of t1 for values from (2) to (3);
insert into t1 values (1), (2);

create table t2(i int);
insert into t2 values (1), (2);
analyze t1, t2;

select * from t1 right join t2 on t1.i = t2.i;

2023-11-06 14:11:37.398 UTC|law|regression|6548f419.392cf5|WARNING: Join 
partition pruning $0 has not been performed yet.
TRAP: failed Assert("node->as_prune_state"), File: "nodeAppend.c", Line: 846, 
PID: 3747061

Best regards,
Alexander

Re: Support run-time partition pruning for hash join

2023-11-05 Thread Richard Guo
On Sat, Nov 4, 2023 at 6:00 PM Alexander Lakhin  wrote:

> 02.11.2023 14:19, Richard Guo wrote:
>
> However, the cfbot indicates that there are test cases that fail on
> FreeBSD [1] (no failure on other platforms).  So I set up a FreeBSD-13
> locally but just cannot reproduce the failure.  I must be doing
> something wrong.  Can anyone give me some hints or suggestions?
>
> I've managed to reproduce that failure on my Ubuntu with:
> CPPFLAGS="-Og -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES"
> ./configure ... make check
>

Wow, thank you so much.  You saved me a lot of time.  It turns out that
it was caused by me not making JoinPartitionPruneInfo a node.  The same
issue can also exist for JoinPartitionPruneCandidateInfo - if you
pprint(root) at some point you'll see a 'could not dump unrecognized node
type' warning.
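As a loose analogy (plain Python, not PostgreSQL internals): the WRITE_READ_PARSE_PLAN_TREES build flag makes the planner write each plan tree out and read it back, and any struct that was never registered as a node cannot be reconstructed, which is exactly the 'unrecognized node type' symptom. A minimal model of such a round trip:

```python
READ_FUNCS = {}  # registry mapping node tags to constructors


def register_node(cls):
    READ_FUNCS[cls.__name__] = cls
    return cls


@register_node
class HashJoin:
    def __init__(self, cond):
        self.cond = cond

    def write(self):
        return (type(self).__name__, {"cond": self.cond})


class JoinPartitionPruneInfo:  # deliberately NOT registered as a node
    def __init__(self, part_indexes):
        self.part_indexes = part_indexes

    def write(self):
        return (type(self).__name__, {"part_indexes": self.part_indexes})


def read_node(written):
    # Reading back fails loudly for any tag missing from the registry.
    tag, fields = written
    if tag not in READ_FUNCS:
        raise ValueError(f"unrecognized node type: {tag}")
    return READ_FUNCS[tag](**fields)
```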

Fixed this issue in v4.

Thanks
Richard


v4-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data


Re: Support run-time partition pruning for hash join

2023-11-04 Thread Alexander Lakhin

Hello Richard,

02.11.2023 14:19, Richard Guo wrote:


However, the cfbot indicates that there are test cases that fail on
FreeBSD [1] (no failure on other platforms).  So I set up a FreeBSD-13
locally but just cannot reproduce the failure.  I must be doing
something wrong.  Can anyone give me some hints or suggestions?

FYI. The failure looks like:

 explain (costs off)
   select p2.a, p1.c from permtest_parent p1 inner join permtest_parent p2
   on p1.a = p2.a and left(p1.c, 3) ~ 'a1$';
-                     QUERY PLAN
--------------------------------------------------
- Hash Join
-   Hash Cond: (p2.a = p1.a)
-   ->  Seq Scan on permtest_grandchild p2
-   ->  Hash
-         ->  Seq Scan on permtest_grandchild p1
-               Filter: ("left"(c, 3) ~ 'a1$'::text)
-(6 rows)
-
+ERROR:  unrecognized node type: 1130127496


I've managed to reproduce that failure on my Ubuntu with:
CPPFLAGS="-Og -DWRITE_READ_PARSE_PLAN_TREES -DCOPY_PARSE_PLAN_TREES" 
./configure ... make check
...
 SELECT t1, t2 FROM prt1 t1 LEFT JOIN prt2 t2 ON t1.a = t2.b WHERE t1.b = 0 
ORDER BY t1.a, t2.b;
-                 QUERY PLAN
---------------------------------------------------
- Sort
-   Sort Key: t1.a, t2.b
-   ->  Hash Right Join
-         Hash Cond: (t2.b = t1.a)
-         ->  Append
-               ->  Seq Scan on prt2_p1 t2_1
-               ->  Seq Scan on prt2_p2 t2_2
-               ->  Seq Scan on prt2_p3 t2_3
-         ->  Hash
-               ->  Append
-                     ->  Seq Scan on prt1_p1 t1_1
-                           Filter: (b = 0)
-                     ->  Seq Scan on prt1_p2 t1_2
-                           Filter: (b = 0)
-                     ->  Seq Scan on prt1_p3 t1_3
-                           Filter: (b = 0)
-(16 rows)
-
+ERROR:  unrecognized node type: -1465804424
...

As far as I can see from https://cirrus-ci.com/task/6642692659085312,
the FreeBSD host has the following CPPFLAGS specified:
-DRELCACHE_FORCE_RELEASE
-DCOPY_PARSE_PLAN_TREES
-DWRITE_READ_PARSE_PLAN_TREES
-DRAW_EXPRESSION_COVERAGE_TEST
-DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS

Best regards,
Alexander

Re: Support run-time partition pruning for hash join

2023-11-02 Thread Richard Guo
On Tue, Aug 29, 2023 at 6:41 PM Richard Guo  wrote:

> So it seems that the new costing logic is quite crude and tends to be
> very conservative, but it can help avoid the large overhead in the worst
> cases.  I think this might be a good start to push this patch forward.
>
> Any thoughts or comments?
>

I rebased this patch over the latest master.  Nothing changed except
that I revised the newly added test case to make it more stable.

However, the cfbot indicates that there are test cases that fail on
FreeBSD [1] (no failure on other platforms).  So I set up a FreeBSD-13
locally but just cannot reproduce the failure.  I must be doing
something wrong.  Can anyone give me some hints or suggestions?

FYI. The failure looks like:

 explain (costs off)
   select p2.a, p1.c from permtest_parent p1 inner join permtest_parent p2
   on p1.a = p2.a and left(p1.c, 3) ~ 'a1$';
-                     QUERY PLAN
--------------------------------------------------
- Hash Join
-   Hash Cond: (p2.a = p1.a)
-   ->  Seq Scan on permtest_grandchild p2
-   ->  Hash
-         ->  Seq Scan on permtest_grandchild p1
-               Filter: ("left"(c, 3) ~ 'a1$'::text)
-(6 rows)
-
+ERROR:  unrecognized node type: 1130127496

[1]
https://api.cirrus-ci.com/v1/artifact/task/5334808075698176/testrun/build/testrun/regress/regress/regression.diffs

Thanks
Richard


v3-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data


Re: Support run-time partition pruning for hash join

2023-08-29 Thread Richard Guo
On Fri, Aug 25, 2023 at 11:03 AM David Rowley  wrote:

> I'd suggest writing some cost which costs an execution of run-time
> pruning.  With LIST and RANGE you probably want something like
> cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
> for the binary search over the sorted datum array. For HASH
> partitions, something like cpu_operator_cost * npartcols once for each
> hashed tuple.
>
> You'll need to then come up with some counter costs to subtract from
> the Append/MergeAppend.  This is tricky, as discussed.  Just come up
> with something crude for now.
>
> To start with, it could just be as crude as:
>
> total_costs *= (Min(expected_outer_rows, n_append_subnodes)  /
> n_append_subnodes);
>
> i.e. assume that every outer joined row will require exactly 1 new
> partition, up to the total number of partitions.  That's pretty much
> the worst case, but it'll at least allow the optimisation to work for
> cases where the hash table is expected to contain just a tiny
> number of rows (fewer than the number of partitions)
>
> To make it better, you might want to look at join selectivity
> estimation and see if you can find something there to influence
> something better.


I had a go at writing some costing code according to your suggestion.
That's compute_partprune_cost() in the v2 patch.

For the hash side, this function computes the pruning cost as
cpu_operator_cost * LOG2(nparts) * inner_rows for LIST and RANGE, and
cpu_operator_cost * nparts * inner_rows for HASH.

For the Append/MergeAppend side, this function first estimates the size
of outer side that matches, using the same idea as we estimate the
joinrel size for JOIN_SEMI.  Then it assumes that each outer joined row
occupies one new partition (the worst case) and computes how much cost
can be saved from partition pruning.

If the cost saved from the Append/MergeAppend side is larger than the
pruning cost from the Hash side, then we say that partition pruning is a
win.

Note that this costing logic runs for each Append-Hash pair, so it copes
with the case where we have multiple join levels.
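As a rough numeric sketch of the decision rule described above (the function and parameter names are my own simplifications, not the patch's actual code, and the JOIN_SEMI-style matching estimate is reduced to a plain row count):

```python
import math

CPU_OPERATOR_COST = 0.0025  # PostgreSQL's default cpu_operator_cost


def hash_side_prune_cost(strategy, nparts, inner_rows):
    """Pruning cost charged once per hashed tuple on the Hash side."""
    if strategy in ("LIST", "RANGE"):
        # binary search over the sorted datum array
        return CPU_OPERATOR_COST * math.log2(nparts) * inner_rows
    # HASH partitioning
    return CPU_OPERATOR_COST * nparts * inner_rows


def append_side_savings(append_total_cost, matched_outer_rows, n_subnodes):
    """Cost saved on the Append/MergeAppend side, assuming the worst case
    that each matched outer row occupies one new partition."""
    surviving = min(matched_outer_rows, n_subnodes)
    return append_total_cost * (1.0 - surviving / n_subnodes)


def pruning_is_win(strategy, nparts, inner_rows,
                   append_total_cost, matched_outer_rows, n_subnodes):
    """Pruning wins if the Append-side savings exceed the Hash-side cost."""
    return (append_side_savings(append_total_cost,
                                matched_outer_rows, n_subnodes)
            > hash_side_prune_cost(strategy, nparts, inner_rows))
```

With a small hash side (say 100 matched rows against 1000 subnodes) the savings dominate; once the matched rows reach the number of partitions, the savings drop to zero and pruning is rejected.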

With this costing logic added, I performed the same performance
comparisons of the worst case as in [1], and here is what I got.

tuples  unpatched   patched
1   44.66   44.37   -0.006493506
2   52.41   52.29   -0.002289639
3   61.11   61.12   +0.000163639
4   67.87   68.24   +0.005451599
5   74.51   74.75   +0.003221044
6   82.3    81.55   -0.009113001
7   87.16   86.98   -0.002065168
8   93.49   93.89   +0.004278532
9   101.52  100.83  -0.00679669
10  108.34  108.56  +0.002030644

So the costing logic successfully avoids performing the partition
pruning in the worst case.

I also tested the cases where partition pruning is possible with
different sizes of the hash side.

tuples  unpatched   patched
100 36.86   2.4 -0.934888768
200 35.87   2.37-0.933928074
300 35.95   2.55-0.92906815
400 36.4    2.63    -0.927747253
500 36.39   2.85-0.921681781
600 36.32   2.97-0.918226872
700 36.6    3.23    -0.911748634
800 36.88   3.44-0.906724512
900 37.02   3.46-0.906537007
1000    37.25   37.21   -0.001073826

The first 9 rows show that the costing logic allows the partition
pruning to be performed and the pruning turns out to be a big win.  The
last row shows that the partition pruning is disallowed by the costing
logic because it thinks no partition can be pruned (we have 1000
partitions in total).

So it seems that the new costing logic is quite crude and tends to be
very conservative, but it can help avoid the large overhead in the worst
cases.  I think this might be a good start to push this patch forward.

Any thoughts or comments?

[1]
https://www.postgresql.org/message-id/CAMbWs49%2Bp6hBxXJHFiSwOtPCSkAHwhJj3hTpCR_pmMiUUVLZ1Q%40mail.gmail.com

Thanks
Richard


v2-0001-Support-run-time-partition-pruning-for-hash-join.patch
Description: Binary data


Re: Support run-time partition pruning for hash join

2023-08-25 Thread Richard Guo
On Fri, Aug 25, 2023 at 11:03 AM David Rowley  wrote:

> On Thu, 24 Aug 2023 at 21:27, Richard Guo  wrote:
> > I think we need to solve this problem first before we can
> > make this new partition pruning mechanism somewhat useful in practice, but
> > how?  Some thoughts currently on my mind include
> >
> > 1) we try our best to estimate the cost of this partition pruning when
> > creating hash join paths, and decide based on the cost whether to use it
> > or not.  But this does not seem to be an easy task.
>
> I think we need to consider another Hash Join path when we detect that
> the outer side of the Hash Join involves scanning a partitioned table.
>
> I'd suggest writing some cost which costs an execution of run-time
> pruning.  With LIST and RANGE you probably want something like
> cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
> for the binary search over the sorted datum array. For HASH
> partitions, something like cpu_operator_cost * npartcols once for each
> hashed tuple.
>
> You'll need to then come up with some counter costs to subtract from
> the Append/MergeAppend.  This is tricky, as discussed.  Just come up
> with something crude for now.
>
> To start with, it could just be as crude as:
>
> total_costs *= (Min(expected_outer_rows, n_append_subnodes)  /
> n_append_subnodes);
>
> i.e. assume that every outer joined row will require exactly 1 new
> partition, up to the total number of partitions.  That's pretty much
> the worst case, but it'll at least allow the optimisation to work for
> cases where the hash table is expected to contain just a tiny
> number of rows (fewer than the number of partitions)
>
> To make it better, you might want to look at join selectivity
> estimation and see if you can find something there to influence
> something better.


Thank you for the suggestion.  I will take some time to consider it.

When we have multiple join levels, it seems the situation becomes even
more complex.  One Append/MergeAppend node might be pruned by more than
one Hash node, and one Hash node might provide pruning for more than one
Append/MergeAppend node.  For instance, below is the plan from the test
case added in the v1 patch:

explain (analyze, costs off, summary off, timing off)
select * from tprt p1
   inner join tprt p2 on p1.col1 = p2.col1
   right join tbl1 t on p1.col1 = t.col1 and p2.col1 = t.col1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Hash Right Join (actual rows=2 loops=1)
   Hash Cond: ((p1.col1 = t.col1) AND (p2.col1 = t.col1))
   ->  Hash Join (actual rows=3 loops=1)
         Hash Cond: (p1.col1 = p2.col1)
         ->  Append (actual rows=3 loops=1)
               ->  Seq Scan on tprt_1 p1_1 (never executed)
               ->  Seq Scan on tprt_2 p1_2 (actual rows=3 loops=1)
               ->  Seq Scan on tprt_3 p1_3 (never executed)
               ->  Seq Scan on tprt_4 p1_4 (never executed)
               ->  Seq Scan on tprt_5 p1_5 (never executed)
               ->  Seq Scan on tprt_6 p1_6 (never executed)
         ->  Hash (actual rows=3 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Append (actual rows=3 loops=1)
                     ->  Seq Scan on tprt_1 p2_1 (never executed)
                     ->  Seq Scan on tprt_2 p2_2 (actual rows=3 loops=1)
                     ->  Seq Scan on tprt_3 p2_3 (never executed)
                     ->  Seq Scan on tprt_4 p2_4 (never executed)
                     ->  Seq Scan on tprt_5 p2_5 (never executed)
                     ->  Seq Scan on tprt_6 p2_6 (never executed)
   ->  Hash (actual rows=2 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on tbl1 t (actual rows=2 loops=1)
(23 rows)

In this plan, the Append node of 'p1' is pruned by two Hash nodes: Hash
node of 't' and Hash node of 'p2'.  Meanwhile, the Hash node of 't'
provides pruning for two Append nodes: Append node of 'p1' and Append
node of 'p2'.

In this case, meaningfully costing the partition pruning seems even
more difficult.  Do you have any suggestions on that?


> > 2) we use some heuristics when executing hash join, such as when we
> > notice that a $threshold percentage of the partitions must be visited,
> > we just abort the pruning and assume that no partitions can be pruned.
>
> You could likely code in something that checks
> bms_num_members(jpstate->part_prune_result) to see if it still remains
> below the total number of Append/MergeAppend subplans, say, whenever the
> lower 8 bits of hashtable->totalTuples are all off.  You can just give
> up doing any further pruning when all partitions are already required.


Yeah, we can do that.  While this may not help in the tests I performed
for the worst case, because the table on the hash side is designed so that
tuples belonging to the same partition are placed together, meaning we need
to scan almost all its tuples before we could know that all

Re: Support run-time partition pruning for hash join

2023-08-24 Thread David Rowley
On Thu, 24 Aug 2023 at 21:27, Richard Guo  wrote:
> I performed some performance comparisons of the worst case with two
> tables as below:
>
> 1. The partitioned table has 1000 children, and 100,000 tuples in total.
>
> 2. The other table is designed such that
> a) its tuples occupy every partition of the partitioned table so
>    that no partitions can be pruned during execution,
> b) tuples belonging to the same partition are placed together, so that
>    we need to scan all its tuples before we could know that no
>    pruning would happen and we could stop trying to prune,
> c) the tuples are unique on the hash key so as to minimize the cost
>    of hash probes, so that we can highlight the impact of the pruning
>    code.
>
> Here is the execution time (ms) I get with different sizes of the other
> table.
>
> tuples  unpatched   patched
> 1   45.74   53.46   (+0.17)
> 2   54.48   70.18   (+0.29)
> 3   62.57   85.18   (+0.36)
> 4   69.14   99.19   (+0.43)
> 5   76.46   111.09  (+0.45)
> 6   82.68   126.37  (+0.53)
> 7   92.69   137.89  (+0.49)
> 8   94.49   151.46  (+0.60)
> 9   101.53  164.93  (+0.62)
> 10  107.22  178.44  (+0.66)
>
> So the overhead the pruning code adds to Hash Join is too large to be
> accepted :(.

Agreed.  Run-time pruning is pretty fast to execute, but so is
inserting a row into a hash table.

> I think we need to solve this problem first before we can
> make this new partition pruning mechanism somewhat useful in practice, but
> how?  Some thoughts currently on my mind include
>
> 1) we try our best to estimate the cost of this partition pruning when
> creating hash join paths, and decide based on the cost whether to use it
> or not.  But this does not seem to be an easy task.

I think we need to consider another Hash Join path when we detect that
the outer side of the Hash Join involves scanning a partitioned table.

I'd suggest writing some cost which costs an execution of run-time
pruning.  With LIST and RANGE you probably want something like
cpu_operator_cost * LOG2(nparts) once for each hashed tuple to account
for the binary search over the sorted datum array. For HASH
partitions, something like cpu_operator_cost * npartcols once for each
hashed tuple.

You'll need to then come up with some counter costs to subtract from
the Append/MergeAppend.  This is tricky, as discussed.  Just come up
with something crude for now.

To start with, it could just be as crude as:

total_costs *= (Min(expected_outer_rows, n_append_subnodes)  /
n_append_subnodes);

i.e. assume that every outer joined row will require exactly 1 new
partition, up to the total number of partitions.  That's pretty much
the worst case, but it'll at least allow the optimisation to work for
cases where the hash table is expected to contain just a tiny
number of rows (fewer than the number of partitions)

To make it better, you might want to look at join selectivity
estimation and see if you can find something there to influence
something better.
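Under the stated worst-case assumption, the crude discount above can be written out as follows (a sketch; the function name is mine):

```python
def discounted_append_cost(total_cost, expected_outer_rows, n_append_subnodes):
    # Crude worst case: every outer joined row requires exactly one new
    # partition, up to the total number of Append subnodes, so that
    # fraction of the Append's total cost is still expected to be paid.
    surviving = min(expected_outer_rows, n_append_subnodes)
    return total_cost * surviving / n_append_subnodes
```

So a hash table expected to hold 10 rows against 100 subnodes keeps only 10% of the Append's cost, while anything at or above 100 rows keeps all of it.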

> 2) we use some heuristics when executing hash join, such as when we
> notice that a $threshold percentage of the partitions must be visited,
> we just abort the pruning and assume that no partitions can be pruned.

You could likely code in something that checks
bms_num_members(jpstate->part_prune_result) to see if it still remains
below the total number of Append/MergeAppend subplans, say, whenever the
lower 8 bits of hashtable->totalTuples are all off.  You can just give
up doing any further pruning when all partitions are already required.
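That give-up check might look roughly like this (hypothetical names; the executor state is reduced to plain integers):

```python
def should_keep_pruning(total_tuples, n_matched_parts, n_subnodes):
    """Decide whether the Hash side should keep attempting pruning.

    The check itself only runs when the lower 8 bits of the tuple
    counter are all zero, i.e. once every 256 hashed tuples, so it
    stays cheap; once every subnode is already required, further
    pruning work is pointless and we give up.
    """
    if (total_tuples & 0xFF) == 0 and n_matched_parts >= n_subnodes:
        return False
    return True
```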

David




Re: Support run-time partition pruning for hash join

2023-08-24 Thread Richard Guo
On Tue, Aug 22, 2023 at 2:38 PM David Rowley  wrote:

> It would be good to see some performance comparisons of the worst case
> to see how much overhead the pruning code adds to Hash Join.  It may
> well be that we need to consider two Hash Join paths, one with and one
> without run-time pruning. It's pretty difficult to meaningfully cost,
> as I already mentioned, however.


I performed some performance comparisons of the worst case with two
tables as below:

1. The partitioned table has 1000 children, and 100,000 tuples in total.

2. The other table is designed such that
a) its tuples occupy every partition of the partitioned table so
   that no partitions can be pruned during execution,
b) tuples belonging to the same partition are placed together, so that
   we need to scan all its tuples before we could know that no
   pruning would happen and we could stop trying to prune,
c) the tuples are unique on the hash key so as to minimize the cost
   of hash probes, so that we can highlight the impact of the pruning
   code.
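The effect of layout (b) can be illustrated with a toy model (my own construction, not the actual benchmark tables): with partition-clustered input, nearly all probe-side tuples must be hashed before an adaptive give-up heuristic could learn that every partition is needed.

```python
def tuples_until_all_partitions_seen(partition_of_each_row, nparts):
    """Number of tuples scanned before every partition has been seen,
    i.e. the earliest point at which pruning could give up."""
    seen = set()
    for i, part in enumerate(partition_of_each_row, start=1):
        seen.add(part)
        if len(seen) == nparts:
            return i
    return len(partition_of_each_row)


# Clustered layout from (b): all rows of partition 0 first, then 1, ...
clustered = [p for p in range(4) for _ in range(25)]
# Interleaved layout for contrast: partitions cycle on every row.
interleaved = [i % 4 for i in range(100)]
```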

Here is the execution time (ms) I get with different sizes of the other
table.

tuples  unpatched   patched
1   45.74   53.46   (+0.17)
2   54.48   70.18   (+0.29)
3   62.57   85.18   (+0.36)
4   69.14   99.19   (+0.43)
5   76.46   111.09  (+0.45)
6   82.68   126.37  (+0.53)
7   92.69   137.89  (+0.49)
8   94.49   151.46  (+0.60)
9   101.53  164.93  (+0.62)
10  107.22  178.44  (+0.66)

So the overhead the pruning code adds to Hash Join is too large to be
accepted :(.  I think we need to solve this problem first before we can
make this new partition pruning mechanism somewhat useful in practice, but
how?  Some thoughts currently on my mind include

1) we try our best to estimate the cost of this partition pruning when
creating hash join paths, and decide based on the cost whether to use it
or not.  But this does not seem to be an easy task.

2) we use some heuristics when executing hash join, such as when we
notice that a $threshold percentage of the partitions must be visited,
we just abort the pruning and assume that no partitions can be pruned.

Any thoughts or comments?

Thanks
Richard


Re: Support run-time partition pruning for hash join

2023-08-22 Thread Andy Fan
>
> > fwiw, the current master totally ignores the cost reduction from run-time
> > partition pruning, even for init partition pruning.  So in some real cases,
> > pg chooses a hash join just because the cost of a nested-loop join is
> > highly overestimated.
>
> This is true about the existing code. It's a very tricky thing to cost
> given that the parameter values are always unknown to the planner.
> The best we have for these today is the various hardcoded constants in
> selfuncs.h. While I do agree that it's not great that the costing code
> knows nothing about run-time pruning, I also think that run-time
> pruning during execution with parameterised nested loops is much more
> likely to be able to prune partitions and save actual work than the
> equivalent with Hash Joins.  It's more common for the planner to
> choose a Nested Loop when there are fewer outer rows, so the pruning
> code is likely to be called fewer times with Nested Loop than with
> Hash Join.
>

Yes, I agree with this.  In my 4 years of PostgreSQL, I have run into
only 2 cases of this issue, and 1 of them involved joining 12+ tables with
run-time partition pruning for every join.  But this situation causes more
issues than generating a wrong plan: for a simple SELECT * FROM p WHERE
partkey = $1, the generic plan will never win, so we have to pay the
expensive planning cost for the partitioned table.

If we don't require very accurate costing for every case, e.g. if we only
care about the '=' operator, which is the most common case, it should be
easier than the case here, since we just need to know whether only 1
partition will survive after pruning, without caring about which one it
is.  I'd like to discuss that in another thread, and leave this thread for
Richard's patch only.

-- 
Best Regards
Andy Fan


Re: Support run-time partition pruning for hash join

2023-08-22 Thread Andy Fan
On Tue, Aug 22, 2023 at 5:43 PM Richard Guo  wrote:

>
> On Mon, Aug 21, 2023 at 8:34 PM Andy Fan  wrote:
>
>> This feature looks good, but is it possible to know if we can prune
>> any subnodes before we pay the extra effort (building the Hash
>> table, for each row... stuff)?
>>
>
> It might be possible if we take the partition pruning into
> consideration when estimating costs.  But it does not seem easy to
> calculate the costs accurately.
>

This is the one place where I am worried about the future of this patch.
Personally, I do like this patch, but I am not sure what happens if this
issue can't be fixed to everyone's satisfaction, and fixing it perfectly
looks hopeless to me.  However, let's see what happens.


>
>
>> Maybe, at least, if we find that no subnodes can be skipped
>> during the hashing, we can stop doing such work.
>>
>
> Yeah, this is what we can do.
>

cool.


>
>
>> To my current knowledge, we have to build the inner table first for this
>> optimization, so hash join and sort-merge should be OK, but nestloop
>> should be impossible unless I missed something.
>>
>
> For nestloop and mergejoin, we'd always execute the outer side first.
> So the Append/MergeAppend nodes need to be on the inner side for the
> join partition pruning to take effect.  For a mergejoin that will
> explicitly sort the outer side, the sort node would process all the
> outer rows before scanning the inner side, so we can do the join
> partition pruning with that.  For a nestloop, if we have a Material
> node on the outer side, we can do that too, but I wonder if we'd have
> such a plan in the real world, because we only add Material to the inner
> side of a nestloop.
>

This is more interesting than I expected; thanks for the explanation.

-- 
Best Regards
Andy Fan


Re: Support run-time partition pruning for hash join

2023-08-22 Thread Richard Guo
On Tue, Aug 22, 2023 at 2:38 PM David Rowley  wrote:

> With Hash Join, it seems to me that the pruning must take place for
> every row that makes it into the hash table.  There may well be
> cases where the unioned set of partitions simply yields every
> partition and all the work results in no savings. Pruning on a scalar
> value seems much more likely to be able to prune away unneeded
> Append/MergeAppend subnodes.


Yeah, you're right.  If we have 'pt HashJoin t', then for a subnode of
'pt' to be pruned, every row of 't' must be able to prune that
subnode.  The situation may improve if we have more than 2-way hash
joins, because the final surviving subnodes would be the intersection of
matching subnodes in each Hash.
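The set arithmetic in the last sentence can be sketched with made-up partition index sets:

```python
# Toy example with invented partition numbers: each Hash node keeps the
# set of Append subnodes its rows could match; a subnode survives only
# if every pruning Hash node keeps it.
matched_by_t = {1, 3}        # subnodes matched by rows of 't'
matched_by_p2 = {1, 2, 3}    # subnodes matched by rows of 'p2'

surviving = matched_by_t & matched_by_p2  # set intersection
```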

With parameterized nestloop I agree that it's more likely to be able to
prune subnodes at rescan of Append/MergeAppend nodes based on scalar
values.

Sometimes we may just not generate a parameterized nestloop as the final
plan, such as when there are no indexes and no lateral references in the
Append/MergeAppend node.  In this case I think it would be great if we
could still do some partition pruning.  So I think this new 'join
partition pruning mechanism' (maybe this is not a proper name) should
be treated as a supplement to, not a substitute for, the current
run-time partition pruning based on parameterized nestloops, and it is
implemented that way in the patch.


> Perhaps there can be something adaptive in Hash Join which stops
> trying to prune when all partitions must be visited.  On a quick
> glance of the patch, I don't see any code in ExecJoinPartitionPrune()
> which gives up trying to prune when the number of members in
> part_prune_result is equal to the prunable Append/MergeAppend
> subnodes.


Yeah, we can do that.


> But run-time pruning already works for Nested Loops... I must be
> missing something here.


Here I mean a nestloop with a non-parameterized inner path.  As I explained
upthread, we need a Material node on the outer side for that to
work, which seems unlikely in the real world.

Thanks
Richard


Re: Support run-time partition pruning for hash join

2023-08-22 Thread Richard Guo
On Mon, Aug 21, 2023 at 8:34 PM Andy Fan  wrote:

> This feature looks good, but is it possible to know if we can prune
> any subnodes before we pay the extra effort (building the Hash
> table, for each row... stuff)?
>

It might be possible if we take the partition pruning into
consideration when estimating costs.  But it does not seem easy to
calculate the costs accurately.


> Maybe, at least, if we find that no subnodes can be skipped
> during the hashing, we can stop doing such work.
>

Yeah, this is what we can do.


> To my current knowledge, we have to build the inner table first for this
> optimization, so hash join and sort-merge should be OK, but nestloop
> should be impossible unless I missed something.
>

For nestloop and mergejoin, we'd always execute the outer side first.
So the Append/MergeAppend nodes need to be on the inner side for the
join partition pruning to take effect.  For a mergejoin that will
explicitly sort the outer side, the sort node would process all the
outer rows before scanning the inner side, so we can do the join
partition pruning with that.  For a nestloop, if we have a Material
node on the outer side, we can do that too, but I wonder if we'd have
such a plan in the real world, because we only add Material to the inner
side of a nestloop.

Thanks
Richard


Re: Support run-time partition pruning for hash join

2023-08-22 Thread David Rowley
On Tue, 22 Aug 2023 at 00:34, Andy Fan  wrote:
>
> On Mon, Aug 21, 2023 at 11:48 AM Richard Guo  wrote:
>> 1. All the join partition pruning decisions are made in createplan.c
>>    where the best path tree has been decided.  This is not great.  Maybe
>>    it's better to make it happen when we build up the path tree, so that
>>    we can take the partition pruning into consideration when estimating
>>    the costs.
>
>
> fwiw, the current master totally ignores the cost reduction from run-time
> partition pruning, even for init partition pruning.  So in some real cases,
> pg chooses a hash join just because the cost of a nested-loop join is
> highly overestimated.

This is true about the existing code. It's a very tricky thing to cost
given that the parameter values are always unknown to the planner.
The best we have for these today is the various hardcoded constants in
selfuncs.h. While I do agree that it's not great that the costing code
knows nothing about run-time pruning, I also think that run-time
pruning during execution with parameterised nested loops is much more
likely to be able to prune partitions and save actual work than the
equivalent with Hash Joins.  It's more common for the planner to
choose to Nested Loop when there are fewer outer rows, so the pruning
code is likely to be called fewer times with Nested Loop than with
Hash Join.

With Hash Join, it seems to me that the pruning must take place for
every row that makes it into the hash table.  There may be
cases where the unioned set of partitions simply yields every
partition and all the work results in no savings. Pruning on a scalar
value seems much more likely to be able to prune away unneeded
Append/MergeAppend subnodes.

Perhaps there can be something adaptive in Hash Join which stops
trying to prune when all partitions must be visited.  On a quick
glance of the patch, I don't see any code in ExecJoinPartitionPrune()
which gives up trying to prune when the number of members in
part_prune_result is equal to the prunable Append/MergeAppend
subnodes.
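
The adaptive behaviour suggested above could look roughly like this (a
minimal Python sketch with invented names such as `prune_while_building`
and `match_subnode`, not code from the patch): give up evaluating the
pruning expressions once every prunable subnode is already known to be
needed.

```python
# Conceptual sketch: stop per-row pruning work once no subnode can be
# skipped anymore.  `match_subnode` is a hypothetical callback mapping
# a join key to the index of the matching Append subnode.

def prune_while_building(inner_keys, n_subnodes, match_subnode):
    survivors = set()
    for key in inner_keys:
        survivors.add(match_subnode(key))
        # All subnodes already needed: further pruning work is wasted,
        # so bail out and just report everything as surviving.
        if len(survivors) == n_subnodes:
            break
    return survivors

# Keys 10, 300 and 550 each land in a different range partition, so the
# loop stops as soon as the third partition is seen:
print(prune_while_building([10, 300, 550, 999], 3,
                           lambda k: min(k // 250, 2)))  # -> {0, 1, 2}
```

The early exit keeps the worst case (every partition needed) close to the
cost of not pruning at all, which is the overhead concern raised above.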

It would be good to see some performance comparisons of the worst case
to see how much overhead the pruning code adds to Hash Join.  It may
well be that we need to consider two Hash Join paths, one with and one
without run-time pruning. It's pretty difficult to meaningfully cost,
as I already mentioned, however.

>> 4. Is it possible and worthwhile to extend the join partition pruning
>>mechanism to support nestloop and mergejoin also?
>
>
> In my current knowledge, we have to build the inner table first for this
> optimization?  so hash join and sort merge should be OK, but nestloop should
> be impossible unless I missed something.

But run-time pruning already works for Nested Loops... I must be
missing something here.

I imagine for Merge Joins a more generic approach would be better by
implementing parameterised Merge Joins (a.k.a zigzag merge joins).
The Append/MergeAppend node could then select the correct partition(s)
based on the current parameter value at rescan. I don't think any code
changes would be needed in node[Merge]Append.c for that to work.  This
could also speed up Merge Joins to non-partitioned tables when an
index is providing presorted input to the join.

David




Re: Support run-time partition pruning for hash join

2023-08-21 Thread Andy Fan
On Mon, Aug 21, 2023 at 11:48 AM Richard Guo  wrote:

> If we have a hash join with an Append node on the outer side, something
> like
>
>  Hash Join
>Hash Cond: (pt.a = t.a)
>->  Append
>  ->  Seq Scan on pt_p1 pt_1
>  ->  Seq Scan on pt_p2 pt_2
>  ->  Seq Scan on pt_p3 pt_3
>->  Hash
>  ->  Seq Scan on t
>
> We can actually prune those subnodes of the Append that cannot possibly
> contain any matching tuples from the other side of the join.  To do
> that, when building the Hash table, for each row from the inner side we
> can compute the minimum set of subnodes that can possibly match the join
> condition.  When we have built the Hash table and start to execute the
> Append node, we already know which subnodes survive and can thus
> skip the others.
>

This feature looks good, but is it possible to know if we can prune
any subnodes before we pay the extra effort (building the Hash
table, for each row... stuff)?  IIUC, it looks like no.  If so, I think
this area needs more attention.  I can't provide any good suggestions yet.

Maybe at least,  if we have found no subnodes can be skipped
during the hashing, we can stop doing such work anymore.

There are several points that need more consideration.
>
> 1. All the join partition pruning decisions are made in createplan.c
>where the best path tree has been decided.  This is not great.  Maybe
>it's better to make it happen when we build up the path tree, so that
>we can take partition pruning into consideration when estimating
>the costs.
>

fwiw, the current master totally ignores the cost reduction from run-time
partition pruning, even for init partition pruning.  So in some real cases,
pg chooses a hash join just because the cost of a nested loop join is
highly overestimated.

4. Is it possible and worthwhile to extend the join partition pruning
>mechanism to support nestloop and mergejoin also?
>

In my current knowledge, we have to build the inner table first for this
optimization?  so hash join and sort merge should be OK, but nestloop should
be impossible unless I missed something.

-- 
Best Regards
Andy Fan


Support run-time partition pruning for hash join

2023-08-20 Thread Richard Guo
If we have a hash join with an Append node on the outer side, something
like

 Hash Join
   Hash Cond: (pt.a = t.a)
   ->  Append
 ->  Seq Scan on pt_p1 pt_1
 ->  Seq Scan on pt_p2 pt_2
 ->  Seq Scan on pt_p3 pt_3
   ->  Hash
 ->  Seq Scan on t

We can actually prune those subnodes of the Append that cannot possibly
contain any matching tuples from the other side of the join.  To do
that, when building the Hash table, for each row from the inner side we
can compute the minimum set of subnodes that can possibly match the join
condition.  When we have built the Hash table and start to execute the
Append node, we already know which subnodes survive and can thus
skip the others.
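
As a rough illustration (plain Python with hypothetical names, not the
patch's C implementation), the per-row computation over the lower bounds
of range partitions might look like the following, ignoring upper bounds
for brevity:

```python
import bisect

def surviving_subnodes(inner_keys, lower_bounds):
    """lower_bounds: sorted lower bounds of range partitions, e.g.
    [0, 250, 500] for partitions [0,250), [250,500), [500,...).
    Returns the indexes of the Append subnodes that can possibly
    contain a join match for some inner-side row."""
    survivors = set()
    for key in inner_keys:
        # Locate the partition whose range contains this join key.
        idx = bisect.bisect_right(lower_bounds, key) - 1
        if idx >= 0:
            survivors.add(idx)
    return survivors

# One inner row with a = 10 falls into the first partition only:
print(surviving_subnodes([10], [0, 250, 500]))  # -> {0}
```

Every Append subnode whose index is absent from the result can be
skipped entirely when the outer side is scanned.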

This kind of partition pruning can be extended to happen across multiple
join levels.  For instance,

 Hash Join
   Hash Cond: (pt.a = t2.a)
   ->  Hash Join
 Hash Cond: (pt.a = t1.a)
 ->  Append
   ->  Seq Scan on pt_p1 pt_1
   ->  Seq Scan on pt_p2 pt_2
   ->  Seq Scan on pt_p3 pt_3
 ->  Hash
   ->  Seq Scan on t1
   ->  Hash
 ->  Seq Scan on t2

We can compute the matching subnodes of the Append when building Hash
table for 't1' according to the join condition 'pt.a = t1.a', and when
building Hash table for 't2' according to join condition 'pt.a = t2.a',
and the final surviving subnodes would be their intersection.
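
The intersection step across join levels can be sketched as follows
(illustrative Python with an invented helper name, not code from the
patch):

```python
def combine_prune_results(*per_join_survivors):
    """Each argument is the set of Append subnode indexes that survived
    pruning at one join level; only their intersection needs scanning."""
    result = None
    for survivors in per_join_survivors:
        result = set(survivors) if result is None else result & set(survivors)
    return result if result is not None else set()

# If t1 only matches partition 0 and t2 only matches partition 1,
# no partition of pt needs to be scanned at all:
print(combine_prune_results({0}, {1}))  # -> set()
```

An empty intersection is exactly the case shown in the EXPLAIN output
below, where every partition scan reports "never executed".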

Greenplum [1] has implemented this kind of partition pruning as
'Partition Selector'.  Attached is a patch that refactors Greenplum's
implementation to make it work on PostgreSQL master.  Here are some
details about the patch.

During planning:

1. When creating a hash join plan in create_hashjoin_plan() we first
   collect information required to build PartitionPruneInfos at this
   join, which includes the join's RestrictInfos and the join's inner
   relids, and put this information in a stack.

2. When we call create_append_plan() for an appendrel, for each of the
   joins we check whether join partition pruning can take place for
   this appendrel, based on the information collected at that join,
   and if so build a PartitionPruneInfo and add it to the stack entry.

3. After finishing the outer side of the hash join, we should have built
   all the PartitionPruneInfos that can be used to perform join
   partition pruning at this join.  So we pop the stack entry to get
   the PartitionPruneInfos and add them to the Hash node.
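
The three planning steps above can be sketched roughly as follows
(Python pseudocode with invented names such as `JoinPruneContext`; the
real patch manipulates planner structures in createplan.c):

```python
# A stack of per-hash-join contexts, mirroring the nesting of joins
# being planned.
prune_stack = []

class JoinPruneContext:
    def __init__(self, restrictinfos, inner_relids):
        self.restrictinfos = restrictinfos   # join clauses, e.g. 'pt.a = t1.a'
        self.inner_relids = inner_relids
        self.prune_infos = []                # filled while planning children

def begin_hashjoin(restrictinfos, inner_relids):       # step 1
    prune_stack.append(JoinPruneContext(restrictinfos, inner_relids))

def visit_appendrel(partkey):                          # step 2
    for ctx in prune_stack:
        for clause in ctx.restrictinfos:
            if partkey in clause:   # crude stand-in for a real match test
                ctx.prune_infos.append(("PartitionPruneInfo", clause))

def finish_hashjoin():                                 # step 3
    # Attach the collected PartitionPruneInfos to the Hash node.
    return prune_stack.pop().prune_infos
```

The stack is what lets pruning work across multiple join levels: an
appendrel deep in the tree registers a PartitionPruneInfo with every
enclosing hash join whose clauses mention its partition key.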

During execution:

When building the hash table for a hash join, we perform partition
pruning for each row according to each of the JoinPartitionPruneStates
at this join, and store each result in a special executor parameter to
make it available to Append nodes.  When executing an Append node, we
can directly use the pre-computed pruning results to skip subnodes that
cannot contain any matching rows.
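
Conceptually, the hand-off between the hash build and the Append looks
like this (a Python sketch, not executor C code; the `exec_params` dict
stands in for the executor-parameter mechanism):

```python
# Shared storage standing in for special executor parameters.
exec_params = {}

def store_prune_result(param_id, survivors):
    # The hash build publishes the set of surviving subnode indexes
    # through a shared executor parameter.
    exec_params[param_id] = survivors

def exec_append(param_id, subnodes):
    survivors = exec_params[param_id]
    # Subnodes outside the pruning result are never executed.
    return [scan() for i, scan in enumerate(subnodes) if i in survivors]

store_prune_result(0, {2})
print(exec_append(0, [lambda: "p1", lambda: "p2", lambda: "p3"]))  # -> ['p3']
```

Because the pruning result is fully computed before the Append starts,
the skipped subnodes show up as "never executed" in EXPLAIN ANALYZE.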

Here is a query that shows the effect of join partition pruning.

CREATE TABLE pt (a int, b int, c varchar) PARTITION BY RANGE(a);
CREATE TABLE pt_p1 PARTITION OF pt FOR VALUES FROM (0) TO (250);
CREATE TABLE pt_p2 PARTITION OF pt FOR VALUES FROM (250) TO (500);
CREATE TABLE pt_p3 PARTITION OF pt FOR VALUES FROM (500) TO (600);
INSERT INTO pt SELECT i, i % 25, to_char(i, 'FM') FROM
generate_series(0, 599) i WHERE i % 2 = 0;

CREATE TABLE t1 (a int, b int);
INSERT INTO t1 values (10, 10);

CREATE TABLE t2 (a int, b int);
INSERT INTO t2 values (300, 300);

ANALYZE pt, t1, t2;

SET enable_nestloop TO off;

explain (analyze, costs off, summary off, timing off)
select * from pt join t1 on pt.a = t1.a right join t2 on pt.a = t2.a;
 QUERY PLAN
-----------------------------------------------------------
 Hash Right Join (actual rows=1 loops=1)
   Hash Cond: (pt.a = t2.a)
   ->  Hash Join (actual rows=0 loops=1)
 Hash Cond: (pt.a = t1.a)
 ->  Append (actual rows=0 loops=1)
   ->  Seq Scan on pt_p1 pt_1 (never executed)
   ->  Seq Scan on pt_p2 pt_2 (never executed)
   ->  Seq Scan on pt_p3 pt_3 (never executed)
 ->  Hash (actual rows=1 loops=1)
   Buckets: 1024  Batches: 1  Memory Usage: 9kB
   ->  Seq Scan on t1 (actual rows=1 loops=1)
   ->  Hash (actual rows=1 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 9kB
 ->  Seq Scan on t2 (actual rows=1 loops=1)
(14 rows)


There are several points that need more consideration.

1. All the join partition pruning decisions are made in createplan.c
   where the best path tree has been decided.  This is not great.  Maybe
   it's better to make it happen when we build up the path tree, so that
   we can take partition pruning into consideration when estimating
   the costs.

2. In order to make the join partition pruning take effect, the patch
   hacks the empty-outer optimization in ExecHashJoinImpl().  Not sure
   if this is a good practice.

3. This patch does not support parallel