Re: [PR] perfect hash join [datafusion]

2026-01-09 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3728238191

   Thank you @UBarney !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-09 Thread via GitHub


Dandandan merged PR #19411:
URL: https://github.com/apache/datafusion/pull/19411


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-09 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3728200782

   πŸ€–: Benchmark completed
   
   Details
   
   
   
   ```
   Comparing HEAD and perfect_hj
   
   Benchmark tpch_sf1.json
   
   ┏━━┳━━━┳┳━━━┓
   ┃ Query┃  HEAD ┃ perfect_hj ┃Change ┃
   ┑━━╇━━━╇╇━━━┩
   β”‚ QQuery 1 β”‚ 197.41 ms β”‚  193.73 ms β”‚ no change β”‚
   β”‚ QQuery 2 β”‚  92.23 ms β”‚   78.21 ms β”‚ +1.18x faster β”‚
   β”‚ QQuery 3 β”‚ 118.66 ms β”‚  119.31 ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚  77.39 ms β”‚   75.23 ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚ 172.35 ms β”‚  171.62 ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚  63.31 ms β”‚   63.36 ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚ 211.99 ms β”‚  201.61 ms β”‚ no change β”‚
   β”‚ QQuery 8 β”‚ 162.42 ms β”‚  162.11 ms β”‚ no change β”‚
   β”‚ QQuery 9 β”‚ 224.05 ms β”‚  217.71 ms β”‚ no change β”‚
   β”‚ QQuery 10β”‚ 194.29 ms β”‚  182.13 ms β”‚ +1.07x faster β”‚
   β”‚ QQuery 11β”‚  73.85 ms β”‚   53.93 ms β”‚ +1.37x faster β”‚
   β”‚ QQuery 12β”‚ 116.60 ms β”‚  115.72 ms β”‚ no change β”‚
   β”‚ QQuery 13β”‚ 219.95 ms β”‚  210.38 ms β”‚ no change β”‚
   β”‚ QQuery 14β”‚  92.07 ms β”‚   86.02 ms β”‚ +1.07x faster β”‚
   β”‚ QQuery 15β”‚ 118.29 ms β”‚  119.84 ms β”‚ no change β”‚
   β”‚ QQuery 16β”‚  55.78 ms β”‚   57.05 ms β”‚ no change β”‚
   β”‚ QQuery 17β”‚ 272.93 ms β”‚  271.63 ms β”‚ no change β”‚
   β”‚ QQuery 18β”‚ 324.23 ms β”‚  317.09 ms β”‚ no change β”‚
   β”‚ QQuery 19β”‚ 131.86 ms β”‚  128.52 ms β”‚ no change β”‚
   β”‚ QQuery 20β”‚ 125.68 ms β”‚  122.17 ms β”‚ no change β”‚
   β”‚ QQuery 21β”‚ 257.67 ms β”‚  248.54 ms β”‚ no change β”‚
   β”‚ QQuery 22β”‚  43.63 ms β”‚   36.74 ms β”‚ +1.19x faster β”‚
   β””β”€β”€β”΄β”€β”€β”€β”΄β”΄β”€β”€β”€β”˜
   ┏━━━┳━━━┓
   ┃ Benchmark Summary ┃   ┃
   ┑━━━╇━━━┩
   β”‚ Total Time (HEAD) β”‚ 3346.65ms β”‚
   β”‚ Total Time (perfect_hj)   β”‚ 3232.65ms β”‚
   β”‚ Average Time (HEAD)   β”‚  152.12ms β”‚
   β”‚ Average Time (perfect_hj) β”‚  146.94ms β”‚
   β”‚ Queries Fasterβ”‚ 5 β”‚
   β”‚ Queries Slowerβ”‚ 0 β”‚
   β”‚ Queries with No Changeβ”‚17 β”‚
   β”‚ Queries with Failure  β”‚ 0 β”‚
   β””β”€β”€β”€β”΄β”€β”€β”€β”˜
   ```
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-09 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3728198176

   πŸ€– `./gh_compare_branch.sh` 
[gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh)
 Running
   Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing perfect_hj (c772070b6690093fc094644b618e52bd414c70a9) to 
1037f0aa205e0d04ade7aa9973b4b4c485b696ab 
[diff](https://github.com/apache/datafusion/compare/1037f0aa205e0d04ade7aa9973b4b4c485b696ab..c772070b6690093fc094644b618e52bd414c70a9)
 using:  tpch
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-09 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3728197862

   πŸ€–: Benchmark completed
   
   Details
   
   
   
   ```
   Comparing HEAD and perfect_hj
   
   Benchmark clickbench_extended.json
   
   ┏━━┳━┳━┳━━━┓
   ┃ Query┃HEAD ┃  perfect_hj ┃Change ┃
   ┑━━╇━╇━╇━━━┩
   β”‚ QQuery 0 β”‚  2454.70 ms β”‚  2388.05 ms β”‚ no change β”‚
   β”‚ QQuery 1 β”‚   972.85 ms β”‚   921.28 ms β”‚ +1.06x faster β”‚
   β”‚ QQuery 2 β”‚  1927.43 ms β”‚  1840.69 ms β”‚ no change β”‚
   β”‚ QQuery 3 β”‚  1207.26 ms β”‚  1175.84 ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚  2293.93 ms β”‚  2285.89 ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚ 28351.12 ms β”‚ 28179.93 ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚  3856.37 ms β”‚  3907.41 ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚  3715.67 ms β”‚  3538.47 ms β”‚ no change β”‚
   β””β”€β”€β”΄β”€β”΄β”€β”΄β”€β”€β”€β”˜
   ┏━━━┳┓
   ┃ Benchmark Summary ┃┃
   ┑━━━╇┩
   β”‚ Total Time (HEAD) β”‚ 44779.32ms β”‚
   β”‚ Total Time (perfect_hj)   β”‚ 44237.56ms β”‚
   β”‚ Average Time (HEAD)   β”‚  5597.41ms β”‚
   β”‚ Average Time (perfect_hj) β”‚  5529.70ms β”‚
   β”‚ Queries Fasterβ”‚  1 β”‚
   β”‚ Queries Slowerβ”‚  0 β”‚
   β”‚ Queries with No Changeβ”‚  7 β”‚
   β”‚ Queries with Failure  β”‚  0 β”‚
   β””β”€β”€β”€β”΄β”˜
   
   Benchmark clickbench_partitioned.json
   
   ┏━━┳━┳━┳━━━┓
   ┃ Query┃HEAD ┃  perfect_hj ┃Change ┃
   ┑━━╇━╇━╇━━━┩
   β”‚ QQuery 0 β”‚ 1.43 ms β”‚ 1.42 ms β”‚ no change β”‚
   β”‚ QQuery 1 β”‚49.63 ms β”‚47.90 ms β”‚ no change β”‚
   β”‚ QQuery 2 β”‚   137.51 ms β”‚   133.46 ms β”‚ no change β”‚
   β”‚ QQuery 3 β”‚   159.31 ms β”‚   150.20 ms β”‚ +1.06x faster β”‚
   β”‚ QQuery 4 β”‚  1114.41 ms β”‚  1126.40 ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚  1357.08 ms β”‚  1386.74 ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚ 1.43 ms β”‚ 1.45 ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚53.61 ms β”‚54.27 ms β”‚ no change β”‚
   β”‚ QQuery 8 β”‚  1474.17 ms β”‚  1468.98 ms β”‚ no change β”‚
   β”‚ QQuery 9 β”‚  1818.60 ms β”‚  1789.35 ms β”‚ no change β”‚
   β”‚ QQuery 10β”‚   357.03 ms β”‚   353.62 ms β”‚ no change β”‚
   β”‚ QQuery 11β”‚   400.27 ms β”‚   394.83 ms β”‚ no change β”‚
   β”‚ QQuery 12β”‚  1270.13 ms β”‚  1273.58 ms β”‚ no change β”‚
   β”‚ QQuery 13β”‚  1969.54 ms β”‚  1953.45 ms β”‚ no change β”‚
   β”‚ QQuery 14β”‚  1259.80 ms β”‚  1245.24 ms β”‚ no change β”‚
   β”‚ QQuery 15β”‚  1272.31 ms β”‚  1221.68 ms β”‚ no change β”‚
   β”‚ QQuery 16β”‚  2573.32 ms β”‚  2591.69 ms β”‚ no change β”‚
   β”‚ QQuery 17β”‚  2535.10 ms β”‚  2579.22 ms β”‚ no change β”‚
   β”‚ QQuery 18β”‚  5445.15 ms β”‚  5005.38 ms β”‚ +1.09x faster β”‚
   β”‚ QQuery 19β”‚   123.16 ms β”‚   121.03 ms β”‚ no change β”‚
   β”‚ QQuery 20β”‚  1933.61 ms β”‚  1888.59 ms β”‚ no change β”‚
   β”‚ QQuery 21β”‚  2115.66 ms β”‚  2211.40 ms β”‚ no change β”‚
   β”‚ QQuery 22β”‚  3720.38 ms β”‚  3756.81 ms β”‚ no change β”‚
   β”‚ QQuery 23β”‚ 22330.25 ms β”‚ 12098.87 ms β”‚ +1.85x faster β”‚
   β”‚ QQuery 24β”‚   213.70 ms β”‚   209.57 ms β”‚ no change β”‚
   β”‚ QQuery 25β”‚   456.72 ms β”‚   463.82 ms β”‚ no change β”‚
   β”‚ QQuery 26β”‚   217.99 ms β”‚   206.74 ms β”‚ +1.05x faster β”‚
   β”‚ QQuery 27β”‚  2849.58 ms β”‚  2672.49 ms β”‚ +1.07x faster β”‚
   β”‚ QQuery 28β”‚ 22643.07 ms β”‚ 23254.82 ms β”‚ no change β”‚
   β”‚ QQuery 29β”‚  1003.18 ms β”‚   965.96 ms β”‚ no change β”‚
   β”‚ QQuery 30β”‚  1337.18 ms β”‚  1317.95 ms β”‚ no change β”‚
   β”‚ QQuery 31β”‚  1362.68 ms β”‚  1337.85 ms β”‚ no change β”‚
   β”‚ QQuery 32β”‚  5349.86 ms β”‚  4728.39 ms β”‚ +1.13x faster β”‚
   β”‚ QQuery 33β”‚  5896.39 ms β”‚  5499.63 ms β”‚ +1.07x faster β”‚
   β”‚ QQuery 34β”‚  5921.03 ms β”‚  5651.70 ms β”‚ no change β”‚
   β”‚ QQuery 35β”‚  1954.84 ms β”‚  1860.00 ms β”‚ no change β”‚
   β”‚ QQuery 36β”‚64.52 ms β”‚65.88 ms β”‚ no change β”‚
   β”‚ QQuery 37β”‚44.40 ms β”‚43.29 ms β”‚ no change β”‚
   β”‚ QQuery 38β”‚65.27 ms β”‚65.04 ms β”‚ no change β”‚
   β”‚ QQuery 39β”‚   101.86 ms β”‚99.54 ms β”‚ no change β”‚
   β”‚ QQuery 40β”‚24.38 ms β”‚24.85 ms β”‚ no change β”‚
   β”‚ QQuery 41β”‚22.19 ms β”‚21.72 ms β”‚ no change β”‚
   β”‚ QQuery 42β”‚19.30 ms β”‚19.20 ms β”‚ no change β”‚
   β””β”€β”€β”΄β”€β”΄β”€β”΄β”€β”€β”€β”˜
   ┏━━━┳━┓
   ┃ Benchmark Summary ┃ ┃
   ┑━━━╇━┩
   β”‚ Total Time (HEAD) β”‚ 103021.02ms β”‚
   β”‚ Total Time (perfect_hj)   β”‚  91364.01ms β”‚
   β”‚ Average Time (HEAD)   β”‚   2395.84ms β”‚
   β”‚ 

Re: [PR] perfect hash join [datafusion]

2026-01-09 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3728022461

   πŸ€– `./gh_compare_branch.sh` 
[gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh)
 Running
   Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing perfect_hj (c772070b6690093fc094644b618e52bd414c70a9) to 
1037f0aa205e0d04ade7aa9973b4b4c485b696ab 
[diff](https://github.com/apache/datafusion/compare/1037f0aa205e0d04ade7aa9973b4b4c485b696ab..c772070b6690093fc094644b618e52bd414c70a9)
 using:  tpch_mem clickbench_partitioned clickbench_extended
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-09 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3727828861

   run benchmark tpch


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-09 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3727828403

   run benchmarks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-08 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2672460238


##
datafusion/common/src/config.rs:
##
@@ -468,6 +468,25 @@ config_namespace! {
 /// metadata memory consumption
 pub batch_size: usize, default = 8192
 
+/// A perfect hash join will be considered if the number of rows on 
the build
+/// side is below this threshold. This provides a fast path for joins 
with
+/// very small build sides, bypassing the density check.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+/// The minimum required density of join keys on the build side to 
consider a
+/// perfect hash join. Density is calculated as:
+/// `(number of rows) / (max_key - min_key + 1)`.
+/// A perfect hash join may be used if the actual key density exceeds 
this
+/// value. For example, a value of 0.99 means the keys must fill at 
least
+/// 99% of their value range.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_min_key_density: f64, default = 0.99

Review Comment:
   Agreed. Setting it to ~15% works wellβ€”it not only improves performance but 
also reduces the memory footprint (space usage). I've updated the threshold.
   
   ```
   Memory Comparison Matrix (num_rows = 100)
   | Density | Dup Rate | ArrayMap (MB) | JoinHashMap (MB) | Ratio (AM/JHM) |
   |-|--|---|--||
   ...
   | 15% |   0% | 25.43 |37.81 |   0.67x |
   | 15% |  25% | 29.25 |37.81 |   0.77x |
   | 15% |  50% | 29.25 |37.81 |   0.77x |
   | 15% |  75% | 29.25 |37.81 |   0.77x |
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-08 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2672453954


##
datafusion/physical-plan/src/joins/hash_join/exec.rs:
##
@@ -168,6 +249,34 @@ impl JoinLeftData {
 /// ` != `) are known as "filter expressions" and are evaluated
 /// after the equijoin predicates.
 ///
+/// # ArrayMap Optimization
+///
+/// For joins with a single integer-based join key, `HashJoinExec` may use an 
[`ArrayMap`]
+/// (also known as a "perfect hash join") instead of a general-purpose hash 
map.
+/// This optimization is used when:
+/// 1. There is exactly one join key.
+/// 2. The join key can be any integer type convertible to u64 (excluding i128 
and u128).

Review Comment:
I've applied the suggested change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-08 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2672451547


##
datafusion/physical-plan/src/joins/hash_join/exec.rs:
##
@@ -1601,10 +1778,37 @@ mod tests {
 
 #[template]
 #[rstest]
-fn batch_sizes(#[values(8192, 10, 5, 2, 1)] batch_size: usize) {}
+fn hash_join_exec_configs(
+#[values(8192, 10, 5, 2, 1)] batch_size: usize,
+#[values(true, false)] use_perfect_hash_join_as_possible: bool,
+) {
+}
 
-fn prepare_task_ctx(batch_size: usize) -> Arc {
-let session_config = 
SessionConfig::default().with_batch_size(batch_size);
+fn prepare_task_ctx(
+batch_size: usize,
+use_perfect_hash_join_as_possible: bool,
+) -> Arc {
+let mut session_config = 
SessionConfig::default().with_batch_size(batch_size);
+
+if use_perfect_hash_join_as_possible {
+session_config
+.options_mut()
+.execution
+.perfect_hash_join_small_build_threshold = 819200;
+session_config
+.options_mut()
+.execution
+.perfect_hash_join_min_key_density = 0.0;
+} else {
+session_config
+.options_mut()
+.execution
+.perfect_hash_join_small_build_threshold = 0;
+session_config
+.options_mut()
+.execution
+.perfect_hash_join_min_key_density = 1.0 / 0.0;

Review Comment:
   Accepted. Using f64::INFINITY is indeed much clearer.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-08 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2672449674


##
datafusion/physical-plan/src/joins/array_map.rs:
##
@@ -0,0 +1,540 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::DataType;
+use num_traits::AsPrimitive;
+use std::mem::size_of;
+
+use crate::joins::MapOffset;
+use crate::joins::chain::traverse_chain;
+use arrow::array::{Array, ArrayRef, AsArray, BooleanArray};
+use arrow::buffer::BooleanBuffer;
+use arrow::datatypes::ArrowNumericType;
+use datafusion_common::{Result, ScalarValue, internal_err};
+
+/// A macro to downcast only supported integer types (up to 64-bit) and invoke 
a generic function.
+///
+/// Usage: `downcast_supported_integer!(data_type => (Method, arg1, arg2, 
...))`
+///
+/// The `Method` must be an associated method of [`ArrayMap`] that is generic 
over
+/// `` and allow `T::Native: AsPrimitive`.
+macro_rules! downcast_supported_integer {
+($DATA_TYPE:expr => ($METHOD:ident $(, $ARGS:expr)*)) => {
+match $DATA_TYPE {
+arrow::datatypes::DataType::Int8 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::Int16 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::Int32 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::Int64 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt8 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt16 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt32 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt64 => 
ArrayMap::$METHOD::($($ARGS),*),
+_ => {
+return internal_err!(
+"Unsupported type for ArrayMap: {:?}",
+$DATA_TYPE
+);
+}
+}
+};
+}
+
+/// A dense map for single-column integer join keys within a limited range.
+///
+/// Maps join keys to build-side indices using direct array indexing:
+/// `data[val - min_val_in_build_side] -> val_idx_in_build_side + 1`.
+///
+/// NULL values are ignored on both the build side and the probe side.
+///
+/// # Handling Negative Numbers with `wrapping_sub`
+///
+/// This implementation supports signed integer ranges (e.g., `[-5, 5]`) 
efficiently by
+/// treating them as `u64` (Two's Complement) and relying on the bitwise 
properties of
+/// wrapping arithmetic (`wrapping_sub`).
+///
+/// In Two's Complement representation, `a_signed - b_signed` produces the 
same bit pattern
+/// as `a_unsigned.wrapping_sub(b_unsigned)` (modulo 2^N). This allows us to 
perform
+/// range calculations and zero-based index mapping uniformly for both signed 
and unsigned
+/// types without branching.
+///
+/// ## Examples
+///
+/// Consider an `Int64` range `[-5, 5]`.
+/// * `min_val (-5)` casts to `u64`: `...1011` (`u64::MAX - 4`)
+/// * `max_val (5)` casts to `u64`: `...0101` (`5`)
+///
+/// **1. Range Calculation**
+///
+/// ```text
+/// In modular arithmetic, this is equivalent to:
+///   (5 - (2^64 - 5)) mod 2^64
+/// = (5 - 2^64 + 5) mod 2^64
+/// = (10 - 2^64) mod 2^64
+/// = 10
+///
+/// ```
+/// The resulting `range` (10) correctly represents the size of the interval 
`[-5, 5]`.
+///
+/// **2. Index Lookup (in `get_matched_indices`)**
+///
+/// For a probe value of `0` (which is stored as `0u64`):
+/// ```text
+/// In modular arithmetic, this is equivalent to:
+///   (0 - (2^64 - 5)) mod 2^64
+/// = (-2^64 + 5) mod 2^64
+/// = 5
+/// ```
+/// This correctly maps `-5` to index `0`, `0` to index `5`, etc.
+#[derive(Debug)]
+pub struct ArrayMap {
+// data[probSideVal-offset] -> valIdxInBuildSide + 1; 0 for absent
+data: Vec,
+// min val in buildSide
+offset: u64,
+// next[buildSideIdx] -> next matching valIdxInBuildSide + 1; 0 for end of 
chain.
+// If next is empty, it means there are no duplicate keys (no conflicts).
+// It uses the same chain-based conflict resolution as [`JoinHashMapType`].
+next: Vec,
+num_of_distinct_key: usize,
+}
+
+impl ArrayMap {
+pub fn is_supported_type(data_type: &Data

Re: [PR] perfect hash join [datafusion]

2026-01-08 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2672448148


##
datafusion/physical-plan/src/joins/array_map.rs:
##
@@ -0,0 +1,540 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::DataType;
+use num_traits::AsPrimitive;
+use std::mem::size_of;
+
+use crate::joins::MapOffset;
+use crate::joins::chain::traverse_chain;
+use arrow::array::{Array, ArrayRef, AsArray, BooleanArray};
+use arrow::buffer::BooleanBuffer;
+use arrow::datatypes::ArrowNumericType;
+use datafusion_common::{Result, ScalarValue, internal_err};
+
+/// A macro to downcast only supported integer types (up to 64-bit) and invoke 
a generic function.
+///
+/// Usage: `downcast_supported_integer!(data_type => (Method, arg1, arg2, 
...))`
+///
+/// The `Method` must be an associated method of [`ArrayMap`] that is generic 
over
+/// `` and allow `T::Native: AsPrimitive`.
+macro_rules! downcast_supported_integer {
+($DATA_TYPE:expr => ($METHOD:ident $(, $ARGS:expr)*)) => {
+match $DATA_TYPE {
+arrow::datatypes::DataType::Int8 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::Int16 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::Int32 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::Int64 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt8 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt16 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt32 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt64 => 
ArrayMap::$METHOD::($($ARGS),*),
+_ => {
+return internal_err!(
+"Unsupported type for ArrayMap: {:?}",
+$DATA_TYPE
+);
+}
+}
+};
+}
+
+/// A dense map for single-column integer join keys within a limited range.
+///
+/// Maps join keys to build-side indices using direct array indexing:
+/// `data[val - min_val_in_build_side] -> val_idx_in_build_side + 1`.
+///
+/// NULL values are ignored on both the build side and the probe side.
+///
+/// # Handling Negative Numbers with `wrapping_sub`
+///
+/// This implementation supports signed integer ranges (e.g., `[-5, 5]`) 
efficiently by
+/// treating them as `u64` (Two's Complement) and relying on the bitwise 
properties of
+/// wrapping arithmetic (`wrapping_sub`).
+///
+/// In Two's Complement representation, `a_signed - b_signed` produces the 
same bit pattern
+/// as `a_unsigned.wrapping_sub(b_unsigned)` (modulo 2^N). This allows us to 
perform
+/// range calculations and zero-based index mapping uniformly for both signed 
and unsigned
+/// types without branching.
+///
+/// ## Examples
+///
+/// Consider an `Int64` range `[-5, 5]`.
+/// * `min_val (-5)` casts to `u64`: `...1011` (`u64::MAX - 4`)
+/// * `max_val (5)` casts to `u64`: `...0101` (`5`)
+///
+/// **1. Range Calculation**
+///
+/// ```text
+/// In modular arithmetic, this is equivalent to:
+///   (5 - (2^64 - 5)) mod 2^64
+/// = (5 - 2^64 + 5) mod 2^64
+/// = (10 - 2^64) mod 2^64
+/// = 10
+///
+/// ```
+/// The resulting `range` (10) correctly represents the size of the interval 
`[-5, 5]`.
+///
+/// **2. Index Lookup (in `get_matched_indices`)**
+///
+/// For a probe value of `0` (which is stored as `0u64`):
+/// ```text
+/// In modular arithmetic, this is equivalent to:
+///   (0 - (2^64 - 5)) mod 2^64
+/// = (-2^64 + 5) mod 2^64
+/// = 5
+/// ```
+/// This correctly maps `-5` to index `0`, `0` to index `5`, etc.
+#[derive(Debug)]
+pub struct ArrayMap {
+// data[probSideVal-offset] -> valIdxInBuildSide + 1; 0 for absent
+data: Vec,
+// min val in buildSide
+offset: u64,
+// next[buildSideIdx] -> next matching valIdxInBuildSide + 1; 0 for end of 
chain.
+// If next is empty, it means there are no duplicate keys (no conflicts).
+// It uses the same chain-based conflict resolution as [`JoinHashMapType`].
+next: Vec,
+num_of_distinct_key: usize,
+}
+
+impl ArrayMap {
+pub fn is_supported_type(data_type: &Data

Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


Dandandan commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2668825491


##
datafusion/common/src/config.rs:
##
@@ -468,6 +468,25 @@ config_namespace! {
 /// metadata memory consumption
 pub batch_size: usize, default = 8192
 
+/// A perfect hash join will be considered if the number of rows on 
the build
+/// side is below this threshold. This provides a fast path for joins 
with
+/// very small build sides, bypassing the density check.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+/// The minimum required density of join keys on the build side to 
consider a
+/// perfect hash join. Density is calculated as:
+/// `(number of rows) / (max_key - min_key + 1)`.
+/// A perfect hash join may be used if the actual key density exceeds 
this
+/// value. For example, a value of 0.99 means the keys must fill at 
least
+/// 99% of their value range.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_min_key_density: f64, default = 0.99

Review Comment:
   What about ~15%?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


Dandandan commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2668825491


##
datafusion/common/src/config.rs:
##
@@ -468,6 +468,25 @@ config_namespace! {
 /// metadata memory consumption
 pub batch_size: usize, default = 8192
 
+/// A perfect hash join will be considered if the number of rows on 
the build
+/// side is below this threshold. This provides a fast path for joins 
with
+/// very small build sides, bypassing the density check.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+/// The minimum required density of join keys on the build side to 
consider a
+/// perfect hash join. Density is calculated as:
+/// `(number of rows) / (max_key - min_key + 1)`.
+/// A perfect hash join may be used if the actual key density exceeds 
this
+/// value. For example, a value of 0.99 means the keys must fill at 
least
+/// 99% of their value range.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_min_key_density: f64, default = 0.99

Review Comment:
   What about ~0.15?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


Copilot commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2668300012


##
datafusion/physical-plan/src/joins/hash_join/exec.rs:
##
@@ -168,6 +249,34 @@ impl JoinLeftData {
 /// ` != `) are known as "filter expressions" and are evaluated
 /// after the equijoin predicates.
 ///
+/// # ArrayMap Optimization
+///
+/// For joins with a single integer-based join key, `HashJoinExec` may use an 
[`ArrayMap`]
+/// (also known as a "perfect hash join") instead of a general-purpose hash 
map.
+/// This optimization is used when:
+/// 1. There is exactly one join key.
+/// 2. The join key can be any integer type convertible to u64 (excluding i128 
and u128).

Review Comment:
   The documentation refers to "excluding i128 and u128" types, but these types 
are not actually checked or handled anywhere in the code. The 
`is_supported_type` function in `array_map.rs` only supports up to 64-bit 
integers (Int8-Int64, UInt8-UInt64). Consider removing the mention of i128/u128 
from the documentation or clarifying that they're not supported due to design 
limitations, not just current implementation.
   ```suggestion
   /// 2. The join key is an integer type up to 64 bits wide that can be 
losslessly converted
   ///to `u64` (128-bit integer types such as `i128` and `u128` are not 
supported).
   ```



##
datafusion/physical-plan/src/joins/array_map.rs:
##
@@ -0,0 +1,540 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow_schema::DataType;
+use num_traits::AsPrimitive;
+use std::mem::size_of;
+
+use crate::joins::MapOffset;
+use crate::joins::chain::traverse_chain;
+use arrow::array::{Array, ArrayRef, AsArray, BooleanArray};
+use arrow::buffer::BooleanBuffer;
+use arrow::datatypes::ArrowNumericType;
+use datafusion_common::{Result, ScalarValue, internal_err};
+
+/// A macro to downcast only supported integer types (up to 64-bit) and invoke 
a generic function.
+///
+/// Usage: `downcast_supported_integer!(data_type => (Method, arg1, arg2, 
...))`
+///
+/// The `Method` must be an associated method of [`ArrayMap`] that is generic 
over
+/// `` and allow `T::Native: AsPrimitive`.
+macro_rules! downcast_supported_integer {
+($DATA_TYPE:expr => ($METHOD:ident $(, $ARGS:expr)*)) => {
+match $DATA_TYPE {
+arrow::datatypes::DataType::Int8 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::Int16 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::Int32 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::Int64 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt8 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt16 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt32 => 
ArrayMap::$METHOD::($($ARGS),*),
+arrow::datatypes::DataType::UInt64 => 
ArrayMap::$METHOD::($($ARGS),*),
+_ => {
+return internal_err!(
+"Unsupported type for ArrayMap: {:?}",
+$DATA_TYPE
+);
+}
+}
+};
+}
+
+/// A dense map for single-column integer join keys within a limited range.
+///
+/// Maps join keys to build-side indices using direct array indexing:
+/// `data[val - min_val_in_build_side] -> val_idx_in_build_side + 1`.
+///
+/// NULL values are ignored on both the build side and the probe side.
+///
+/// # Handling Negative Numbers with `wrapping_sub`
+///
+/// This implementation supports signed integer ranges (e.g., `[-5, 5]`) 
efficiently by
+/// treating them as `u64` (Two's Complement) and relying on the bitwise 
properties of
+/// wrapping arithmetic (`wrapping_sub`).
+///
+/// In Two's Complement representation, `a_signed - b_signed` produces the 
same bit pattern
+/// as `a_unsigned.wrapping_sub(b_unsigned)` (modulo 2^N). This allows us to 
perform
+/// range calculations and zero-based index mapping uniformly for both signed 
and unsigned
+/// types without branching.
+///
+/// ## Examples
+///
+/// Consider an `Int64` range `[-5, 5]`.
+/// * `min_val (-5)` casts to `u64`: `...1011` (`

Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718680693

   πŸ€– Hi @Dandandan, thanks for the request 
(https://github.com/apache/datafusion/pull/19411#issuecomment-3718680622).
   
   
[`scrape_comments.py`](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/scrape_comments.py)
 only supports whitelisted benchmarks.
   - Standard: clickbench_1, clickbench_extended, clickbench_partitioned, 
clickbench_pushdown, external_aggr, tpcds, tpch, tpch10, tpch_mem, tpch_mem10
   - Criterion: aggregate_query_sql, aggregate_vectorized, case_when, 
character_length, in_list, range_and_generate_series, sort, sql_planner, 
strpos, substr_index, with_hashes
   
   Please choose one or more of these with `run benchmark ` or `run 
benchmark  ...`
   Unsupported benchmarks: hj.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718680622

   run benchmark hj


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718354201

   πŸ€– `./gh_compare_branch.sh` 
[gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh)
 Running
   Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing perfect_hj (e9b35e5ac9e764eeab29aa6ea237e400647466d7) to 
1037f0aa205e0d04ade7aa9973b4b4c485b696ab 
[diff](https://github.com/apache/datafusion/compare/1037f0aa205e0d04ade7aa9973b4b4c485b696ab..e9b35e5ac9e764eeab29aa6ea237e400647466d7)
 using:  tpch
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718372339

   ```
   ┏━━┳━━━┳┳━━━┓
   ┃ Query┃  HEAD ┃ perfect_hj ┃Change ┃
   ┑━━╇━━━╇╇━━━┩
   β”‚ QQuery 1 β”‚ 120.09 ms β”‚  112.47 ms β”‚ +1.07x faster β”‚
   β”‚ QQuery 2 β”‚  29.95 ms β”‚   25.31 ms β”‚ +1.18x faster β”‚
   β”‚ QQuery 3 β”‚  36.86 ms β”‚   36.60 ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚  28.68 ms β”‚   29.61 ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚  85.84 ms β”‚   86.63 ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚  19.61 ms β”‚   19.47 ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚ 243.13 ms β”‚  145.00 ms β”‚ +1.68x faster β”‚
   β”‚ QQuery 8 β”‚  37.03 ms β”‚   31.49 ms β”‚ +1.18x faster β”‚
   β”‚ QQuery 9 β”‚ 117.04 ms β”‚   95.96 ms β”‚ +1.22x faster β”‚
   β”‚ QQuery 10β”‚  75.18 ms β”‚   63.06 ms β”‚ +1.19x faster β”‚
   β”‚ QQuery 11β”‚  19.93 ms β”‚   12.30 ms β”‚ +1.62x faster β”‚
   β”‚ QQuery 12β”‚  49.38 ms β”‚   49.85 ms β”‚ no change β”‚
   β”‚ QQuery 13β”‚  46.20 ms β”‚   45.37 ms β”‚ no change β”‚
   β”‚ QQuery 14β”‚  13.04 ms β”‚   13.40 ms β”‚ no change β”‚
   β”‚ QQuery 15β”‚  23.47 ms β”‚   23.69 ms β”‚ no change β”‚
   β”‚ QQuery 16β”‚  24.37 ms β”‚   23.86 ms β”‚ no change β”‚
   β”‚ QQuery 17β”‚ 148.67 ms β”‚  147.84 ms β”‚ no change β”‚
   β”‚ QQuery 18β”‚ 284.66 ms β”‚  272.35 ms β”‚ no change β”‚
   β”‚ QQuery 19β”‚  35.58 ms β”‚   37.97 ms β”‚  1.07x slower β”‚
   β”‚ QQuery 20β”‚  48.85 ms β”‚   51.08 ms β”‚ no change β”‚
   β”‚ QQuery 21β”‚ 295.14 ms β”‚  184.30 ms β”‚ +1.60x faster β”‚
   β”‚ QQuery 22β”‚  17.43 ms β”‚   17.36 ms β”‚ no change β”‚
   β””β”€β”€β”΄β”€β”€β”€β”΄β”΄β”€β”€β”€β”˜
   ┏━━━┳━━━┓
   ┃ Benchmark Summary ┃   ┃
   ┑━━━╇━━━┩
   β”‚ Total Time (HEAD) β”‚ 1800.12ms β”‚
   β”‚ Total Time (perfect_hj)   β”‚ 1524.97ms β”‚
   β”‚ Average Time (HEAD)   β”‚   81.82ms β”‚
   β”‚ Average Time (perfect_hj) β”‚   69.32ms β”‚
   β”‚ Queries Fasterβ”‚ 8 β”‚
   β”‚ Queries Slowerβ”‚ 1 β”‚
   β”‚ Queries with No Changeβ”‚13 β”‚
   β”‚ Queries with Failure  β”‚ 0 β”‚
   β””β”€β”€β”€β”΄β”€β”€β”€β”˜
   
   ```
   
   Very nice!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718356562

   πŸ€– `./gh_compare_branch.sh` 
[gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh)
 Running
   Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing perfect_hj (e9b35e5ac9e764eeab29aa6ea237e400647466d7) to 
1037f0aa205e0d04ade7aa9973b4b4c485b696ab 
[diff](https://github.com/apache/datafusion/compare/1037f0aa205e0d04ade7aa9973b4b4c485b696ab..e9b35e5ac9e764eeab29aa6ea237e400647466d7)
 using:  tpcds
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718356620

   Benchmark script failed with exit code 1.
   
   Last 10 lines of output:
   Click to expand
   
   ```
   BRANCH_NAME: HEAD
   DATA_DIR: /home/alamb/arrow-datafusion/benchmarks/data
   RESULTS_DIR: /home/alamb/arrow-datafusion/benchmarks/results/HEAD
   CARGO_COMMAND: cargo run --release
   PREFER_HASH_JOIN: true
   ***
   
   Please prepare TPC-DS data first by following instructions:
 ./bench.sh data tpcds
   ```
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718356441

   πŸ€–: Benchmark completed
   
   Details
   
   
   
   ```
   Comparing HEAD and perfect_hj
   
   Benchmark tpch_sf1.json
   
   ┏━━┳━━━┳┳━━━┓
   ┃ Query┃  HEAD ┃ perfect_hj ┃Change ┃
   ┑━━╇━━━╇╇━━━┩
   β”‚ QQuery 1 β”‚ 191.42 ms β”‚  194.17 ms β”‚ no change β”‚
   β”‚ QQuery 2 β”‚  92.90 ms β”‚   79.24 ms β”‚ +1.17x faster β”‚
   β”‚ QQuery 3 β”‚ 122.93 ms β”‚  118.45 ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚  76.58 ms β”‚   76.39 ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚ 169.79 ms β”‚  167.60 ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚  65.74 ms β”‚   62.99 ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚ 213.34 ms β”‚  204.62 ms β”‚ no change β”‚
   β”‚ QQuery 8 β”‚ 159.39 ms β”‚  158.60 ms β”‚ no change β”‚
   β”‚ QQuery 9 β”‚ 224.42 ms β”‚  220.08 ms β”‚ no change β”‚
   β”‚ QQuery 10β”‚ 184.10 ms β”‚  187.50 ms β”‚ no change β”‚
   β”‚ QQuery 11β”‚  73.24 ms β”‚   57.11 ms β”‚ +1.28x faster β”‚
   β”‚ QQuery 12β”‚ 116.93 ms β”‚  116.71 ms β”‚ no change β”‚
   β”‚ QQuery 13β”‚ 218.62 ms β”‚  214.30 ms β”‚ no change β”‚
   β”‚ QQuery 14β”‚  94.08 ms β”‚   89.32 ms β”‚ +1.05x faster β”‚
   β”‚ QQuery 15β”‚ 115.42 ms β”‚  117.40 ms β”‚ no change β”‚
   β”‚ QQuery 16β”‚  55.32 ms β”‚   56.42 ms β”‚ no change β”‚
   β”‚ QQuery 17β”‚ 269.67 ms β”‚  269.94 ms β”‚ no change β”‚
   β”‚ QQuery 18β”‚ 318.44 ms β”‚  316.34 ms β”‚ no change β”‚
   β”‚ QQuery 19β”‚ 131.59 ms β”‚  133.16 ms β”‚ no change β”‚
   β”‚ QQuery 20β”‚ 121.70 ms β”‚  124.62 ms β”‚ no change β”‚
   β”‚ QQuery 21β”‚ 259.46 ms β”‚  252.28 ms β”‚ no change β”‚
   β”‚ QQuery 22β”‚  41.13 ms β”‚   36.72 ms β”‚ +1.12x faster β”‚
   β””β”€β”€β”΄β”€β”€β”€β”΄β”΄β”€β”€β”€β”˜
   ┏━━━┳━━━┓
   ┃ Benchmark Summary ┃   ┃
   ┑━━━╇━━━┩
   β”‚ Total Time (HEAD) β”‚ 3316.23ms β”‚
   β”‚ Total Time (perfect_hj)   β”‚ 3253.95ms β”‚
   β”‚ Average Time (HEAD)   β”‚  150.74ms β”‚
   β”‚ Average Time (perfect_hj) β”‚  147.91ms β”‚
   β”‚ Queries Fasterβ”‚ 4 β”‚
   β”‚ Queries Slowerβ”‚ 0 β”‚
   β”‚ Queries with No Changeβ”‚18 β”‚
   β”‚ Queries with Failure  β”‚ 0 β”‚
   β””β”€β”€β”€β”΄β”€β”€β”€β”˜
   ```
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718353932

   πŸ€–: Benchmark completed
   
   Details
   
   
   
   ```
   Comparing HEAD and perfect_hj
   
   Benchmark clickbench_extended.json
   
   ┏━━┳━┳━┳━━━┓
   ┃ Query┃HEAD ┃  perfect_hj ┃Change ┃
   ┑━━╇━╇━╇━━━┩
   β”‚ QQuery 0 β”‚  2361.73 ms β”‚  2300.54 ms β”‚ no change β”‚
   β”‚ QQuery 1 β”‚   934.19 ms β”‚   912.49 ms β”‚ no change β”‚
   β”‚ QQuery 2 β”‚  1885.10 ms β”‚  1860.90 ms β”‚ no change β”‚
   β”‚ QQuery 3 β”‚  1202.09 ms β”‚  1180.71 ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚  2306.91 ms β”‚  2298.40 ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚ 28041.32 ms β”‚ 28094.82 ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚  3849.98 ms β”‚  3872.67 ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚  3794.60 ms β”‚  3815.13 ms β”‚ no change β”‚
   β””β”€β”€β”΄β”€β”΄β”€β”΄β”€β”€β”€β”˜
   ┏━━━┳┓
   ┃ Benchmark Summary ┃┃
   ┑━━━╇┩
   β”‚ Total Time (HEAD) β”‚ 44375.93ms β”‚
   β”‚ Total Time (perfect_hj)   β”‚ 44335.67ms β”‚
   β”‚ Average Time (HEAD)   β”‚  5546.99ms β”‚
   β”‚ Average Time (perfect_hj) β”‚  5541.96ms β”‚
   β”‚ Queries Fasterβ”‚  0 β”‚
   β”‚ Queries Slowerβ”‚  0 β”‚
   β”‚ Queries with No Changeβ”‚  8 β”‚
   β”‚ Queries with Failure  β”‚  0 β”‚
   β””β”€β”€β”€β”΄β”˜
   
   Benchmark clickbench_partitioned.json
   
   ┏━━┳━┳━┳━━━┓
   ┃ Query┃HEAD ┃  perfect_hj ┃Change ┃
   ┑━━╇━╇━╇━━━┩
   β”‚ QQuery 0 β”‚ 1.44 ms β”‚ 1.46 ms β”‚ no change β”‚
   β”‚ QQuery 1 β”‚48.94 ms β”‚49.13 ms β”‚ no change β”‚
   β”‚ QQuery 2 β”‚   131.83 ms β”‚   132.88 ms β”‚ no change β”‚
   β”‚ QQuery 3 β”‚   154.94 ms β”‚   151.89 ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚  1129.27 ms β”‚  1094.93 ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚  1364.06 ms β”‚  1368.14 ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚ 1.41 ms β”‚ 1.42 ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚52.53 ms β”‚53.49 ms β”‚ no change β”‚
   β”‚ QQuery 8 β”‚  1469.70 ms β”‚  1415.32 ms β”‚ no change β”‚
   β”‚ QQuery 9 β”‚  1804.95 ms β”‚  1800.38 ms β”‚ no change β”‚
   β”‚ QQuery 10β”‚   344.22 ms β”‚   345.91 ms β”‚ no change β”‚
   β”‚ QQuery 11β”‚   387.44 ms β”‚   396.02 ms β”‚ no change β”‚
   β”‚ QQuery 12β”‚  1237.81 ms β”‚  1286.77 ms β”‚ no change β”‚
   β”‚ QQuery 13β”‚  1912.18 ms β”‚  1972.25 ms β”‚ no change β”‚
   β”‚ QQuery 14β”‚  1233.69 ms β”‚  1247.50 ms β”‚ no change β”‚
   β”‚ QQuery 15β”‚  1254.11 ms β”‚  1243.60 ms β”‚ no change β”‚
   β”‚ QQuery 16β”‚  2567.98 ms β”‚  2570.61 ms β”‚ no change β”‚
   β”‚ QQuery 17β”‚  2550.81 ms β”‚  2580.33 ms β”‚ no change β”‚
   β”‚ QQuery 18β”‚  5612.71 ms β”‚  4913.15 ms β”‚ +1.14x faster β”‚
   β”‚ QQuery 19β”‚   120.74 ms β”‚   119.55 ms β”‚ no change β”‚
   β”‚ QQuery 20β”‚  1913.07 ms β”‚  1904.79 ms β”‚ no change β”‚
   β”‚ QQuery 21β”‚  2146.63 ms β”‚  2192.36 ms β”‚ no change β”‚
   β”‚ QQuery 22β”‚  3703.55 ms β”‚  3729.43 ms β”‚ no change β”‚
   β”‚ QQuery 23β”‚ 22049.38 ms β”‚ 12242.00 ms β”‚ +1.80x faster β”‚
   β”‚ QQuery 24β”‚   207.87 ms β”‚   203.23 ms β”‚ no change β”‚
   β”‚ QQuery 25β”‚   456.99 ms β”‚   469.41 ms β”‚ no change β”‚
   β”‚ QQuery 26β”‚   214.05 ms β”‚   204.73 ms β”‚ no change β”‚
   β”‚ QQuery 27β”‚  2767.10 ms β”‚  2680.87 ms β”‚ no change β”‚
   β”‚ QQuery 28β”‚ 22195.62 ms β”‚ 23141.30 ms β”‚ no change β”‚
   β”‚ QQuery 29β”‚   991.32 ms β”‚   939.56 ms β”‚ +1.06x faster β”‚
   β”‚ QQuery 30β”‚  1349.48 ms β”‚  1319.89 ms β”‚ no change β”‚
   β”‚ QQuery 31β”‚  1368.06 ms β”‚  1339.89 ms β”‚ no change β”‚
   β”‚ QQuery 32β”‚  5091.91 ms β”‚  5138.39 ms β”‚ no change β”‚
   β”‚ QQuery 33β”‚  5512.57 ms β”‚  5484.79 ms β”‚ no change β”‚
   β”‚ QQuery 34β”‚  5638.93 ms β”‚  6020.19 ms β”‚  1.07x slower β”‚
   β”‚ QQuery 35β”‚  1992.03 ms β”‚  1922.04 ms β”‚ no change β”‚
   β”‚ QQuery 36β”‚63.99 ms β”‚64.41 ms β”‚ no change β”‚
   β”‚ QQuery 37β”‚43.36 ms β”‚45.51 ms β”‚ no change β”‚
   β”‚ QQuery 38β”‚66.10 ms β”‚64.66 ms β”‚ no change β”‚
   β”‚ QQuery 39β”‚   100.63 ms β”‚   101.04 ms β”‚ no change β”‚
   β”‚ QQuery 40β”‚28.11 ms β”‚26.02 ms β”‚ +1.08x faster β”‚
   β”‚ QQuery 41β”‚21.33 ms β”‚22.12 ms β”‚ no change β”‚
   β”‚ QQuery 42β”‚18.14 ms β”‚19.02 ms β”‚ no change β”‚
   β””β”€β”€β”΄β”€β”΄β”€β”΄β”€β”€β”€β”˜
   ┏━━━┳━┓
   ┃ Benchmark Summary ┃ ┃
   ┑━━━╇━┩
   β”‚ Total Time (HEAD) β”‚ 101320.99ms β”‚
   β”‚ Total Time (perfect_hj)   β”‚  92020.38ms β”‚
   β”‚ Average Time (HEAD)   β”‚   2356.30ms β”‚
   β”‚ Average Time (perfect_hj) β”‚   2140.01ms β”‚
   β”‚ Q

Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718215175

   run benchmark tpcds


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718214577

   run benchmark tpch


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718214414

   πŸ€– `./gh_compare_branch.sh` 
[gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh)
 Running
   Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing perfect_hj (e9b35e5ac9e764eeab29aa6ea237e400647466d7) to 
1037f0aa205e0d04ade7aa9973b4b4c485b696ab 
[diff](https://github.com/apache/datafusion/compare/1037f0aa205e0d04ade7aa9973b4b4c485b696ab..e9b35e5ac9e764eeab29aa6ea237e400647466d7)
 using:  tpch_mem clickbench_partitioned clickbench_extended
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718214064

   run benchmarks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-07 Thread via GitHub


UBarney commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3718078855

   > > > Can you solve the conflicts?
   > > 
   > > 
   > > I've resolved the conflicts. Once #19602 is merged, I will update 
`ArrayMap::mark_existing_probes` to follow logic similar to 
`JoinHashMapType::contain_hashes`.
   > 
   > Done
   
   I have added contain_keys to ArrayMap and updated the logic accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-06 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3714076071

   > > Can you solve the conflicts?
   > 
   > I've resolved the conflicts. Once #19602 is merged, I will update 
`ArrayMap::mark_existing_probes` to follow logic similar to 
`JoinHashMapType::contain_hashes`.
   
   DonE


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-06 Thread via GitHub


UBarney commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3713987665

   > Can you solve the conflicts?
   
   
   
   I've resolved the conflicts. Once 
https://github.com/apache/datafusion/pull/19602 is merged, I will update 
`ArrayMap::mark_existing_probes` to follow logic similar to 
`JoinHashMapType::contain_hashes`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-05 Thread via GitHub


Dandandan commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2662706360


##
datafusion/common/src/config.rs:
##
@@ -468,6 +468,25 @@ config_namespace! {
 /// metadata memory consumption
 pub batch_size: usize, default = 8192
 
+/// A perfect hash join will be considered if the number of rows on 
the build
+/// side is below this threshold. This provides a fast path for joins 
with
+/// very small build sides, bypassing the density check.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+/// The minimum required density of join keys on the build side to 
consider a
+/// perfect hash join. Density is calculated as:
+/// `(number of rows) / (max_key - min_key + 1)`.
+/// A perfect hash join may be used if the actual key density exceeds 
this
+/// value. For example, a value of 0.99 means the keys must fill at 
least
+/// 99% of their value range.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_min_key_density: f64, default = 0.99

Review Comment:
   Cool!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2026-01-05 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3711965227

   Can you solve the conflicts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-30 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2654932168


##
datafusion/common/src/config.rs:
##
@@ -468,6 +468,25 @@ config_namespace! {
 /// metadata memory consumption
 pub batch_size: usize, default = 8192
 
+/// A perfect hash join will be considered if the number of rows on 
the build
+/// side is below this threshold. This provides a fast path for joins 
with
+/// very small build sides, bypassing the density check.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+/// The minimum required density of join keys on the build side to 
consider a
+/// perfect hash join. Density is calculated as:
+/// `(number of rows) / (max_key - min_key + 1)`.
+/// A perfect hash join may be used if the actual key density exceeds 
this
+/// value. For example, a value of 0.99 means the keys must fill at 
least
+/// 99% of their value range.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_min_key_density: f64, default = 0.99

Review Comment:
   I think we can set it to 20%.
   
   - In terms of memory usage ([run with this 
code](https://github.com/UBarney/datafusion/commit/d901b70347e24a271464aefb64fe08f64d35a652#diff-1e8db68656c3040876f6740d3bc42251f6d1f9de1f5e3c14d257e72b48a0f68dR10):
 ), `ArrayMap` consumes less memory than `JoinHashMap` at a 20% density, even 
with duplicate keys.
   - Based on the hj.rs benchmark(results in the PR description), `ArrayMap` is 
also faster than `JoinHashMap` at 20% density, regardless of whether there are 
duplicate keys.
   
   ```
   Memory Comparison Matrix (num_rows = 100)
   | Density | Dup Rate | ArrayMap (MB) | JoinHashMap (MB) | Ratio (AM/JHM) |
   |-|--|---|--||
   |100% |   0% |  3.81 |37.81 |   0.10x |
   |100% |  25% |  7.63 |37.81 |   0.20x |
   |100% |  50% |  7.63 |37.81 |   0.20x |
   |100% |  75% |  7.63 |37.81 |   0.20x |
   | 75% |   0% |  5.09 |37.81 |   0.13x |
   | 75% |  25% |  8.90 |37.81 |   0.24x |
   | 75% |  50% |  8.90 |37.81 |   0.24x |
   | 75% |  75% |  8.90 |37.81 |   0.24x |
   | 50% |   0% |  7.63 |37.81 |   0.20x |
   | 50% |  25% | 11.44 |37.81 |   0.30x |
   | 50% |  50% | 11.44 |37.81 |   0.30x |
   | 50% |  75% | 11.44 |37.81 |   0.30x |
   | 20% |   0% | 19.07 |37.81 |   0.50x |
   | 20% |  25% | 22.89 |37.81 |   0.61x |
   | 20% |  50% | 22.89 |37.81 |   0.61x |
   | 20% |  75% | 22.89 |37.81 |   0.61x |
   | 10% |   0% | 38.15 |37.81 |   1.01x |
   | 10% |  25% | 41.96 |37.81 |   1.11x |
   | 10% |  50% | 41.96 |37.81 |   1.11x |
   | 10% |  75% | 41.96 |37.81 |   1.11x |
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-30 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2654932168


##
datafusion/common/src/config.rs:
##
@@ -468,6 +468,25 @@ config_namespace! {
 /// metadata memory consumption
 pub batch_size: usize, default = 8192
 
+/// A perfect hash join will be considered if the number of rows on 
the build
+/// side is below this threshold. This provides a fast path for joins 
with
+/// very small build sides, bypassing the density check.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+/// The minimum required density of join keys on the build side to 
consider a
+/// perfect hash join. Density is calculated as:
+/// `(number of rows) / (max_key - min_key + 1)`.
+/// A perfect hash join may be used if the actual key density exceeds 
this
+/// value. For example, a value of 0.99 means the keys must fill at 
least
+/// 99% of their value range.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_min_key_density: f64, default = 0.99

Review Comment:
   I think we can set it to 20%.
   
   - In terms of memory usage ([run with this 
code](https://github.com/UBarney/datafusion/commit/209c5002d3273e4ad762077d017c35049c9e4d6f#diff-1e8db68656c3040876f6740d3bc42251f6d1f9de1f5e3c14d257e72b48a0f68dR10):
 ), `ArrayMap` consumes less memory than `JoinHashMap` at a 20% density, even 
with duplicate keys.
   - Based on the hj.rs benchmark(results in the PR description), `ArrayMap` is 
also faster than `JoinHashMap` at 20% density, regardless of whether there are 
duplicate keys.
   
   ```
   Memory Comparison Matrix (num_rows = 100)
   | Density | Dup Rate | Distinct |  Range   | ArrayMap (MB) | JoinHashMap 
(MB) | Ratio (AM/JHM) |
   
|-|--|--|--|---|--||
   |100% |   0% |  100 |  100 |  3.81 |
37.81 |   0.10x |
   |100% |  25% |   75 |  100 |  7.63 |
37.81 |   0.20x |
   |100% |  50% |   50 |  100 |  7.63 |
37.81 |   0.20x |
   |100% |  75% |   25 |  100 |  7.63 |
37.81 |   0.20x |
   | 75% |   0% |  100 |  133 |  5.09 |
37.81 |   0.13x |
   | 75% |  25% |   75 |  133 |  8.90 |
37.81 |   0.24x |
   | 75% |  50% |   50 |  133 |  8.90 |
37.81 |   0.24x |
   | 75% |  75% |   25 |  133 |  8.90 |
37.81 |   0.24x |
   | 50% |   0% |  100 |  200 |  7.63 |
37.81 |   0.20x |
   | 50% |  25% |   75 |  200 | 11.44 |
37.81 |   0.30x |
   | 50% |  50% |   50 |  200 | 11.44 |
37.81 |   0.30x |
   | 50% |  75% |   25 |  200 | 11.44 |
37.81 |   0.30x |
   | 20% |   0% |  100 |  500 | 19.07 |
37.81 |   0.50x |
   | 20% |  25% |   75 |  500 | 22.89 |
37.81 |   0.61x |
   | 20% |  50% |   50 |  500 | 22.89 |
37.81 |   0.61x |
   | 20% |  75% |   25 |  500 | 22.89 |
37.81 |   0.61x |
   | 10% |   0% |  100 | 1000 | 38.15 |
37.81 |   1.01x |
   | 10% |  25% |   75 | 1000 | 41.96 |
37.81 |   1.11x |
   | 10% |  50% |   50 | 1000 | 41.96 |
37.81 |   1.11x |
   | 10% |  75% |   25 | 1000 | 41.96 |
37.81 |   1.11x |
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-28 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2649548041


##
datafusion/physical-plan/src/joins/array_map.rs:
##
@@ -0,0 +1,565 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::buffer::MutableBuffer;
+use num_traits::AsPrimitive;
+use std::fmt;
+use std::sync::Arc;
+
+use crate::joins::chain::traverse_chain;
+use crate::joins::join_hash_map::JoinHashMapOffset;
+use crate::joins::utils::JoinHashMapType;
+use arrow::array::{Array, ArrayRef, AsArray};
+use arrow::datatypes::DataType;
+use arrow::datatypes::{
+Int8Type, Int16Type, Int32Type, Int64Type, UInt8Type, UInt16Type, 
UInt32Type,
+UInt64Type,
+};
+use datafusion_common::{Result, internal_err};
+
+/// A "perfect" hash map for single-column integer join keys, represented as a 
dense array.
+///
+/// This structure is highly optimized for joins where the keys are integers 
within a limited
+/// range. Instead of calculating hashes, it uses the integer key itself as an 
index into a
+/// `Vec`, achieving O(1) lookup performance.
+///
+/// # NULL Handling
+///
+/// This optimization can be used for joins with 
`NullEquality::NullEqualsNothing` even if the
+/// join keys contain `NULL`s. This is because:
+///
+/// 1. `try_new` (build side): Ignores rows with `NULL` keys when creating the 
map. This is
+///correct as `NULL` keys would not match anything anyway.
+/// 2. `get_matched_indices_with_limit_offset` (probe side): Skips any `NULL` 
keys encountered
+///in the probe side input.
+///
+/// This structure **cannot** be used for joins with 
`NullEquality::NullEqualsNull` if the
+/// build side contains `NULL`s, as it does not have a mechanism to store and 
match `NULL` values.
+#[derive(Debug)]
+pub struct ArrayMap {
+// data[probSideVal-offset] -> valIdxInBuildSide + 1; 0 for absent
+data: Vec,
+offset: u64, // min val in buildSide
+next: Option>,
+}
+
+impl ArrayMap {
+/// Creates a new [`ArrayKV`] from the given array of join keys.
+///
+/// Note: This function processes only the non-null values in the input 
`array`,
+/// effectively ignoring any rows where the key is `NULL`.
+///
+/// TODO: Support `NullEquality::NullEqualsNull` by storing null indices 
in a
+/// separate `Vec` to allow for `NULL=NULL` matching in the future.
+pub(crate) fn try_new(
+array: &ArrayRef,
+offset_val: u64,
+range: usize,
+) -> Result {
+// Initialize with 0 (sentinel for not found)
+let mut data: Vec = vec![0; range];
+let mut next: Option> = None;

Review Comment:
   Done, it's cleaner this way



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-21 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2637851683


##
datafusion/common/src/config.rs:
##
@@ -468,6 +468,25 @@ config_namespace! {
 /// metadata memory consumption
 pub batch_size: usize, default = 8192
 
+/// A perfect hash join will be considered if the number of rows on 
the build
+/// side is below this threshold. This provides a fast path for joins 
with
+/// very small build sides, bypassing the density check.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+/// The minimum required density of join keys on the build side to 
consider a
+/// perfect hash join. Density is calculated as:
+/// `(number of rows) / (max_key - min_key + 1)`.
+/// A perfect hash join may be used if the actual key density exceeds 
this
+/// value. For example, a value of 0.99 means the keys must fill at 
least
+/// 99% of their value range.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_min_key_density: f64, default = 0.99

Review Comment:
   I ran a quick test on performance (time taken) at different densities, and 
even at a density of 0.1, ArrayMap was still faster than HashMap πŸ€”. (I haven't 
measured memory usage yet, though.)
   
   ```
   
┏━━┳━━━┳━┳━━━┓
   ┃ Query┃ density=6 ┃ density=0.1 ┃   
 Change ┃
   
┑━━╇━━━╇━╇━━━┩
   β”‚ QQuery 1_density=1_prob_hit=1_25*1.5Mβ”‚   5.83 ms β”‚ 4.96 ms β”‚ 
+1.18x faster β”‚
   β”‚ QQuery 2_density=0.026_prob_hit=1_25*1.5Mβ”‚   6.56 ms β”‚ 4.58 ms β”‚ 
+1.43x faster β”‚
   β”‚ QQuery 3_density=1_prob_hit=1_100K*60M   β”‚ 134.90 ms β”‚94.36 ms β”‚ 
+1.43x faster β”‚
   β”‚ QQuery 4_density=1_prob_hit=0.1_100K*60M β”‚ 167.25 ms β”‚   106.93 ms β”‚ 
+1.56x faster β”‚
   β”‚ QQuery 5_density=0.75_prob_hit=1_100K*60Mβ”‚ 141.52 ms β”‚93.04 ms β”‚ 
+1.52x faster β”‚
   β”‚ QQuery 6_density=0.75_prob_hit=0.1_100K*60M  β”‚ 248.02 ms β”‚   193.13 ms β”‚ 
+1.28x faster β”‚
   β”‚ QQuery 7_density=0.5_prob_hit=1_100K*60M β”‚ 132.51 ms β”‚82.83 ms β”‚ 
+1.60x faster β”‚
   β”‚ QQuery 8_density=0.5_prob_hit=0.1_100K*60M   β”‚ 226.32 ms β”‚   171.06 ms β”‚ 
+1.32x faster β”‚
   β”‚ QQuery 9_density=0.2_prob_hit=1_100K*60M β”‚ 130.31 ms β”‚96.29 ms β”‚ 
+1.35x faster β”‚
   β”‚ QQuery 10_density=0.2_prob_hit=0.1_100K*60M  β”‚ 231.85 ms β”‚   184.09 ms β”‚ 
+1.26x faster β”‚
   β”‚ QQuery 11_density=0.1_prob_hit=1_100K*60Mβ”‚ 129.65 ms β”‚   105.23 ms β”‚ 
+1.23x faster β”‚
   β”‚ QQuery 12_density=0.1_prob_hit=0.1_100K*60M  β”‚ 235.93 ms β”‚   191.80 ms β”‚ 
+1.23x faster β”‚
   β”‚ QQuery 13_density=0.01_prob_hit=1_100K*60M   β”‚ 128.74 ms β”‚   129.35 ms β”‚   
  no change β”‚
   β”‚ QQuery 14_density=0.01_prob_hit=0.1_100K*60M β”‚ 230.82 ms β”‚   233.62 ms β”‚   
  no change β”‚
   
β””β”€β”€β”΄β”€β”€β”€β”΄β”€β”΄β”€β”€β”€β”˜
   ┏┳━━━┓
   ┃ Benchmark Summary  ┃   ┃
   ┑╇━━━┩
   β”‚ Total Time (density=6) β”‚ 2150.21ms β”‚
   β”‚ Total Time (density=0.1)   β”‚ 1691.27ms β”‚
   β”‚ Average Time (density=6)   β”‚  153.59ms β”‚
   β”‚ Average Time (density=0.1) β”‚  120.80ms β”‚
   β”‚ Queries Faster β”‚12 β”‚
   β”‚ Queries Slower β”‚ 0 β”‚
   β”‚ Queries with No Change β”‚ 2 β”‚
   β”‚ Queries with Failure   β”‚ 0 β”‚
   β””β”΄β”€β”€β”€β”˜
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-21 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2637856134


##
benchmarks/compare.py:
##
@@ -154,17 +154,17 @@ def compare(
 baseline = BenchmarkRun.load_from_file(baseline_path)
 comparison = BenchmarkRun.load_from_file(comparison_path)
 
-console = Console()
+console = Console(width=200)

Review Comment:
   I've increased the console width to 200. I added more information like 
'density' to the queryName, which made it longer and caused it to be cut off in 
the output before



##
benchmarks/compare.py:
##
@@ -154,17 +154,17 @@ def compare(
 baseline = BenchmarkRun.load_from_file(baseline_path)
 comparison = BenchmarkRun.load_from_file(comparison_path)
 
-console = Console()
+console = Console(width=200)
 
 # use basename as the column names
-baseline_header = baseline_path.parent.stem
-comparison_header = comparison_path.parent.stem
+baseline_header = baseline_path.parent.name
+comparison_header = comparison_path.parent.name

Review Comment:
   Before, a path like `.../density=0.1/...` was incorrectly shortened to 
`density=0`. Now, by using `.parent.name`, we correctly get the full directory 
name, `density=0.1`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-20 Thread via GitHub


Dandandan commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2636932448


##
datafusion/common/src/config.rs:
##
@@ -468,6 +468,25 @@ config_namespace! {
 /// metadata memory consumption
 pub batch_size: usize, default = 8192
 
+/// A perfect hash join will be considered if the number of rows on 
the build
+/// side is below this threshold. This provides a fast path for joins 
with
+/// very small build sides, bypassing the density check.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+/// The minimum required density of join keys on the build side to 
consider a
+/// perfect hash join. Density is calculated as:
+/// `(number of rows) / (max_key - min_key + 1)`.
+/// A perfect hash join may be used if the actual key density exceeds 
this
+/// value. For example, a value of 0.99 means the keys must fill at 
least
+/// 99% of their value range.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_min_key_density: f64, default = 0.99

Review Comment:
   Sounds good!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-19 Thread via GitHub


UBarney commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2636743834


##
datafusion/common/src/config.rs:
##
@@ -468,6 +468,25 @@ config_namespace! {
 /// metadata memory consumption
 pub batch_size: usize, default = 8192
 
+/// A perfect hash join will be considered if the number of rows on 
the build
+/// side is below this threshold. This provides a fast path for joins 
with
+/// very small build sides, bypassing the density check.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+/// The minimum required density of join keys on the build side to 
consider a
+/// perfect hash join. Density is calculated as:
+/// `(number of rows) / (max_key - min_key + 1)`.
+/// A perfect hash join may be used if the actual key density exceeds 
this
+/// value. For example, a value of 0.99 means the keys must fill at 
least
+/// 99% of their value range.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_min_key_density: f64, default = 0.99

Review Comment:
   That's a great point. 
   I'll add some benchmarks to compare the performance at different densities, 
including 75%, to find the optimal value for our use case. I'll update this 
based on the results.
   Thanks for the suggestion!
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-19 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3676314727

   πŸ€–: Benchmark completed
   
   Details
   
   
   
   ```
   Comparing HEAD and perfect_hj
   
   Benchmark tpcds_sf1.json
   
   ┏━━┳━┳━┳━━━┓
   ┃ Query┃HEAD ┃  perfect_hj ┃Change ┃
   ┑━━╇━╇━╇━━━┩
   β”‚ QQuery 1 β”‚64.32 ms β”‚64.03 ms β”‚ no change β”‚
   β”‚ QQuery 2 β”‚   202.76 ms β”‚   209.93 ms β”‚ no change β”‚
   β”‚ QQuery 3 β”‚   160.17 ms β”‚   160.30 ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚  2076.33 ms β”‚  2016.14 ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚   271.72 ms β”‚   267.86 ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚  1539.80 ms β”‚  1528.76 ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚   501.34 ms β”‚   500.09 ms β”‚ no change β”‚
   β”‚ QQuery 8 β”‚   172.26 ms β”‚   172.88 ms β”‚ no change β”‚
   β”‚ QQuery 9 β”‚   283.58 ms β”‚   274.82 ms β”‚ no change β”‚
   β”‚ QQuery 10β”‚   169.65 ms β”‚   172.34 ms β”‚ no change β”‚
   β”‚ QQuery 11β”‚  1414.43 ms β”‚  1436.15 ms β”‚ no change β”‚
   β”‚ QQuery 12β”‚74.91 ms β”‚73.81 ms β”‚ no change β”‚
   β”‚ QQuery 13β”‚   554.37 ms β”‚   557.49 ms β”‚ no change β”‚
   β”‚ QQuery 14β”‚  2016.91 ms β”‚  1803.46 ms β”‚ +1.12x faster β”‚
   β”‚ QQuery 15β”‚28.50 ms β”‚27.30 ms β”‚ no change β”‚
   β”‚ QQuery 16β”‚56.60 ms β”‚57.60 ms β”‚ no change β”‚
   β”‚ QQuery 17β”‚   354.64 ms β”‚   357.88 ms β”‚ no change β”‚
   β”‚ QQuery 18β”‚   191.69 ms β”‚   190.50 ms β”‚ no change β”‚
   β”‚ QQuery 19β”‚   228.28 ms β”‚   229.67 ms β”‚ no change β”‚
   β”‚ QQuery 20β”‚23.36 ms β”‚23.50 ms β”‚ no change β”‚
   β”‚ QQuery 21β”‚33.71 ms β”‚32.15 ms β”‚ no change β”‚
   β”‚ QQuery 22β”‚   985.29 ms β”‚   906.05 ms β”‚ +1.09x faster β”‚
   β”‚ QQuery 23β”‚  1845.20 ms β”‚  1713.30 ms β”‚ +1.08x faster β”‚
   β”‚ QQuery 24β”‚   642.89 ms β”‚   636.64 ms β”‚ no change β”‚
   β”‚ QQuery 25β”‚   523.12 ms β”‚   516.07 ms β”‚ no change β”‚
   β”‚ QQuery 26β”‚   123.94 ms β”‚   126.15 ms β”‚ no change β”‚
   β”‚ QQuery 27β”‚   504.93 ms β”‚   500.48 ms β”‚ no change β”‚
   β”‚ QQuery 28β”‚   303.81 ms β”‚   302.04 ms β”‚ no change β”‚
   β”‚ QQuery 29β”‚   444.98 ms β”‚   447.70 ms β”‚ no change β”‚
   β”‚ QQuery 30β”‚65.01 ms β”‚63.94 ms β”‚ no change β”‚
   β”‚ QQuery 31β”‚   306.15 ms β”‚   306.66 ms β”‚ no change β”‚
   β”‚ QQuery 32β”‚77.86 ms β”‚78.00 ms β”‚ no change β”‚
   β”‚ QQuery 33β”‚   193.25 ms β”‚   194.57 ms β”‚ no change β”‚
   β”‚ QQuery 34β”‚   159.04 ms β”‚   162.10 ms β”‚ no change β”‚
   β”‚ QQuery 35β”‚   169.34 ms β”‚   171.45 ms β”‚ no change β”‚
   β”‚ QQuery 36β”‚   296.09 ms β”‚   286.26 ms β”‚ no change β”‚
   β”‚ QQuery 37β”‚   257.06 ms β”‚   254.49 ms β”‚ no change β”‚
   β”‚ QQuery 38β”‚   155.26 ms β”‚   151.49 ms β”‚ no change β”‚
   β”‚ QQuery 39β”‚   216.46 ms β”‚   194.61 ms β”‚ +1.11x faster β”‚
   β”‚ QQuery 40β”‚   197.40 ms β”‚   172.47 ms β”‚ +1.14x faster β”‚
   β”‚ QQuery 41β”‚16.30 ms β”‚16.66 ms β”‚ no change β”‚
   β”‚ QQuery 42β”‚   142.70 ms β”‚   145.52 ms β”‚ no change β”‚
   β”‚ QQuery 43β”‚   126.26 ms β”‚   129.13 ms β”‚ no change β”‚
   β”‚ QQuery 44β”‚15.68 ms β”‚14.96 ms β”‚ no change β”‚
   β”‚ QQuery 45β”‚83.14 ms β”‚80.86 ms β”‚ no change β”‚
   β”‚ QQuery 46β”‚   329.53 ms β”‚   329.04 ms β”‚ no change β”‚
   β”‚ QQuery 47β”‚  1279.95 ms β”‚  1186.17 ms β”‚ +1.08x faster β”‚
   β”‚ QQuery 48β”‚   416.35 ms β”‚   420.97 ms β”‚ no change β”‚
   β”‚ QQuery 49β”‚   355.14 ms β”‚   354.37 ms β”‚ no change β”‚
   β”‚ QQuery 50β”‚   338.48 ms β”‚   343.17 ms β”‚ no change β”‚
   β”‚ QQuery 51β”‚   298.83 ms β”‚   294.59 ms β”‚ no change β”‚
   β”‚ QQuery 52β”‚   147.14 ms β”‚   143.27 ms β”‚ no change β”‚
   β”‚ QQuery 53β”‚   149.53 ms β”‚   146.40 ms β”‚ no change β”‚
   β”‚ QQuery 54β”‚   207.43 ms β”‚   216.19 ms β”‚ no change β”‚
   β”‚ QQuery 55β”‚   144.21 ms β”‚   144.60 ms β”‚ no change β”‚
   β”‚ QQuery 56β”‚   193.16 ms β”‚   190.45 ms β”‚ no change β”‚
   β”‚ QQuery 57β”‚   319.98 ms β”‚   302.28 ms β”‚ +1.06x faster β”‚
   β”‚ QQuery 58β”‚   515.69 ms β”‚   482.75 ms β”‚ +1.07x faster β”‚
   β”‚ QQuery 59β”‚   285.93 ms β”‚   288.57 ms β”‚ no change β”‚
   β”‚ QQuery 60β”‚   199.98 ms β”‚   199.54 ms β”‚ no change β”‚
   β”‚ QQuery 61β”‚   240.08 ms β”‚   235.47 ms β”‚ no change β”‚
   β”‚ QQuery 62β”‚  1378.44 ms β”‚  1398.12 ms β”‚ no change β”‚
   β”‚ QQuery 63β”‚   149.43 ms β”‚   148.88 ms β”‚ no change β”‚
   β”‚ QQuery 64β”‚  1142.94 ms β”‚  1161.26 ms β”‚ no change β”‚
   β”‚ QQuery 65β”‚   368.38 ms β”‚   363.41 ms β”‚ no change β”‚
   β”‚ QQuery 66β”‚   389.40 ms β”‚   393.14 ms β”‚ no change β”‚
   β”‚ QQuery 67β”‚   646.76 ms β”‚   627.61 ms β”‚ no change β”‚
   β”‚ QQuery 68β”‚   383.87 ms β”‚   384.78 ms β”‚ no change β”‚
   β”‚ QQuery 69β”‚   167.41 ms β”‚   170.00 ms β”‚ no change β”‚
   β”‚ QQuery 70β”‚   526.93 ms β”‚   528.35 ms β”‚ no change β”‚
   β”‚ QQuery 71β”‚   183.79 ms β”‚   184.70

Re: [PR] perfect hash join [datafusion]

2025-12-19 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3676292704

   πŸ€– `./gh_compare_branch.sh` 
[gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh)
 Running
   Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing perfect_hj (a442460deb5764caf51a454fcc0959292e0cf530) to 
8550010bd55ccae217b08240664c44fa8bda8261 
[diff](https://github.com/apache/datafusion/compare/8550010bd55ccae217b08240664c44fa8bda8261..a442460deb5764caf51a454fcc0959292e0cf530)
 using:  tpcds
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-19 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3676292478

   πŸ€–: Benchmark completed
   
   Details
   
   
   
   ```
   Comparing HEAD and perfect_hj
   
   Benchmark clickbench_extended.json
   
   ┏━━┳━┳━┳━━━┓
   ┃ Query┃HEAD ┃  perfect_hj ┃Change ┃
   ┑━━╇━╇━╇━━━┩
   β”‚ QQuery 0 β”‚  2728.43 ms β”‚  2746.77 ms β”‚ no change β”‚
   β”‚ QQuery 1 β”‚  1259.76 ms β”‚  1286.32 ms β”‚ no change β”‚
   β”‚ QQuery 2 β”‚  2442.97 ms β”‚  2404.14 ms β”‚ no change β”‚
   β”‚ QQuery 3 β”‚  1146.01 ms β”‚  1175.40 ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚  2324.58 ms β”‚  2297.95 ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚ 28847.71 ms β”‚ 28651.86 ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚  3925.44 ms β”‚  4071.92 ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚  3480.92 ms β”‚  3513.25 ms β”‚ no change β”‚
   β””β”€β”€β”΄β”€β”΄β”€β”΄β”€β”€β”€β”˜
   ┏━━━┳┓
   ┃ Benchmark Summary ┃┃
   ┑━━━╇┩
   β”‚ Total Time (HEAD) β”‚ 46155.84ms β”‚
   β”‚ Total Time (perfect_hj)   β”‚ 46147.61ms β”‚
   β”‚ Average Time (HEAD)   β”‚  5769.48ms β”‚
   β”‚ Average Time (perfect_hj) β”‚  5768.45ms β”‚
   β”‚ Queries Fasterβ”‚  0 β”‚
   β”‚ Queries Slowerβ”‚  0 β”‚
   β”‚ Queries with No Changeβ”‚  8 β”‚
   β”‚ Queries with Failure  β”‚  0 β”‚
   β””β”€β”€β”€β”΄β”˜
   
   Benchmark clickbench_partitioned.json
   
   ┏━━┳━┳━┳━━━┓
   ┃ Query┃HEAD ┃  perfect_hj ┃Change ┃
   ┑━━╇━╇━╇━━━┩
   β”‚ QQuery 0 β”‚ 2.55 ms β”‚ 2.28 ms β”‚ +1.12x faster β”‚
   β”‚ QQuery 1 β”‚49.94 ms β”‚49.72 ms β”‚ no change β”‚
   β”‚ QQuery 2 β”‚   136.75 ms β”‚   135.77 ms β”‚ no change β”‚
   β”‚ QQuery 3 β”‚   157.55 ms β”‚   156.38 ms β”‚ no change β”‚
   β”‚ QQuery 4 β”‚  1127.59 ms β”‚  1099.95 ms β”‚ no change β”‚
   β”‚ QQuery 5 β”‚  1513.28 ms β”‚  1540.48 ms β”‚ no change β”‚
   β”‚ QQuery 6 β”‚ 2.09 ms β”‚ 2.18 ms β”‚ no change β”‚
   β”‚ QQuery 7 β”‚54.92 ms β”‚54.54 ms β”‚ no change β”‚
   β”‚ QQuery 8 β”‚  1435.36 ms β”‚  1429.02 ms β”‚ no change β”‚
   β”‚ QQuery 9 β”‚  1826.50 ms β”‚  1818.96 ms β”‚ no change β”‚
   β”‚ QQuery 10β”‚   362.05 ms β”‚   359.18 ms β”‚ no change β”‚
   β”‚ QQuery 11β”‚   413.64 ms β”‚   415.45 ms β”‚ no change β”‚
   β”‚ QQuery 12β”‚  1346.02 ms β”‚  1355.55 ms β”‚ no change β”‚
   β”‚ QQuery 13β”‚  2003.64 ms β”‚  2030.93 ms β”‚ no change β”‚
   β”‚ QQuery 14β”‚  1258.62 ms β”‚  1264.24 ms β”‚ no change β”‚
   β”‚ QQuery 15β”‚  1236.44 ms β”‚  1220.63 ms β”‚ no change β”‚
   β”‚ QQuery 16β”‚  2660.06 ms β”‚  2679.40 ms β”‚ no change β”‚
   β”‚ QQuery 17β”‚  2646.48 ms β”‚  2669.57 ms β”‚ no change β”‚
   β”‚ QQuery 18β”‚  5032.89 ms β”‚  5243.53 ms β”‚ no change β”‚
   β”‚ QQuery 19β”‚   119.08 ms β”‚   120.08 ms β”‚ no change β”‚
   β”‚ QQuery 20β”‚  1909.83 ms β”‚  1955.97 ms β”‚ no change β”‚
   β”‚ QQuery 21β”‚  2219.56 ms β”‚  2228.45 ms β”‚ no change β”‚
   β”‚ QQuery 22β”‚  3754.10 ms β”‚  3742.17 ms β”‚ no change β”‚
   β”‚ QQuery 23β”‚ 12284.83 ms β”‚ 12304.70 ms β”‚ no change β”‚
   β”‚ QQuery 24β”‚   200.89 ms β”‚   216.80 ms β”‚  1.08x slower β”‚
   β”‚ QQuery 25β”‚   459.79 ms β”‚   477.52 ms β”‚ no change β”‚
   β”‚ QQuery 26β”‚   217.46 ms β”‚   226.68 ms β”‚ no change β”‚
   β”‚ QQuery 27β”‚  2733.98 ms β”‚  2714.81 ms β”‚ no change β”‚
   β”‚ QQuery 28β”‚ 24706.02 ms β”‚ 23446.26 ms β”‚ +1.05x faster β”‚
   β”‚ QQuery 29β”‚   983.86 ms β”‚   951.63 ms β”‚ no change β”‚
   β”‚ QQuery 30β”‚  1325.66 ms β”‚  1334.96 ms β”‚ no change β”‚
   β”‚ QQuery 31β”‚  1361.59 ms β”‚  1319.42 ms β”‚ no change β”‚
   β”‚ QQuery 32β”‚  4654.91 ms β”‚  4894.57 ms β”‚  1.05x slower β”‚
   β”‚ QQuery 33β”‚  5895.01 ms β”‚  5580.74 ms β”‚ +1.06x faster β”‚
   β”‚ QQuery 34β”‚  5921.45 ms β”‚  5968.72 ms β”‚ no change β”‚
   β”‚ QQuery 35β”‚  1944.76 ms β”‚  1894.39 ms β”‚ no change β”‚
   β”‚ QQuery 36β”‚66.49 ms β”‚65.88 ms β”‚ no change β”‚
   β”‚ QQuery 37β”‚45.71 ms β”‚47.05 ms β”‚ no change β”‚
   β”‚ QQuery 38β”‚66.55 ms β”‚67.85 ms β”‚ no change β”‚
   β”‚ QQuery 39β”‚   105.16 ms β”‚   104.80 ms β”‚ no change β”‚
   β”‚ QQuery 40β”‚28.03 ms β”‚27.95 ms β”‚ no change β”‚
   β”‚ QQuery 41β”‚23.88 ms β”‚24.08 ms β”‚ no change β”‚
   β”‚ QQuery 42β”‚19.99 ms β”‚20.73 ms β”‚ no change β”‚
   β””β”€β”€β”΄β”€β”΄β”€β”΄β”€β”€β”€β”˜
   ┏━━━┳┓
   ┃ Benchmark Summary ┃┃
   ┑━━━╇┩
   β”‚ Total Time (HEAD) β”‚ 94314.96ms β”‚
   β”‚ Total Time (perfect_hj)   β”‚ 93263.97ms β”‚
   β”‚ Average Time (HEAD)   β”‚  2193.37ms β”‚
   β”‚ Average Time (perfect_hj) β”‚  2168.93ms β”‚
   β”‚ Queries 

Re: [PR] perfect hash join [datafusion]

2025-12-19 Thread via GitHub


Dandandan commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2636002211


##
datafusion/common/src/config.rs:
##
@@ -468,6 +468,25 @@ config_namespace! {
 /// metadata memory consumption
 pub batch_size: usize, default = 8192
 
+/// A perfect hash join will be considered if the number of rows on 
the build
+/// side is below this threshold. This provides a fast path for joins 
with
+/// very small build sides, bypassing the density check.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_small_build_threshold: usize, default = 1024
+
+/// The minimum required density of join keys on the build side to 
consider a
+/// perfect hash join. Density is calculated as:
+/// `(number of rows) / (max_key - min_key + 1)`.
+/// A perfect hash join may be used if the actual key density exceeds 
this
+/// value. For example, a value of 0.99 means the keys must fill at 
least
+/// 99% of their value range.
+/// 
+/// TODO: Currently only supports cases where left_side.num_rows() < 
u32::MAX.
+/// Support for left_side.num_rows() >= u32::MAX will be added in the 
future.
+pub perfect_hash_join_min_key_density: f64, default = 0.99

Review Comment:
   This seems very high? For a hashmap I believe it's ~75% default (plus it has 
some more overhead per key), so I think a 75% probably could still be better 
overall?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-19 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3676159917

   run benchmark tpcds


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-19 Thread via GitHub


alamb-ghbot commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3676159863

   πŸ€– `./gh_compare_branch.sh` 
[gh_compare_branch.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/gh_compare_branch.sh)
 Running
   Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing perfect_hj (a442460deb5764caf51a454fcc0959292e0cf530) to 
8550010bd55ccae217b08240664c44fa8bda8261 
[diff](https://github.com/apache/datafusion/compare/8550010bd55ccae217b08240664c44fa8bda8261..a442460deb5764caf51a454fcc0959292e0cf530)
 using:  tpch_mem clickbench_partitioned clickbench_extended
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-19 Thread via GitHub


Dandandan commented on PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#issuecomment-3676159501

   run benchmarks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] perfect hash join [datafusion]

2025-12-19 Thread via GitHub


Dandandan commented on code in PR #19411:
URL: https://github.com/apache/datafusion/pull/19411#discussion_r2635957210


##
datafusion/physical-plan/src/joins/array_map.rs:
##
@@ -0,0 +1,565 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::buffer::MutableBuffer;
+use num_traits::AsPrimitive;
+use std::fmt;
+use std::sync::Arc;
+
+use crate::joins::chain::traverse_chain;
+use crate::joins::join_hash_map::JoinHashMapOffset;
+use crate::joins::utils::JoinHashMapType;
+use arrow::array::{Array, ArrayRef, AsArray};
+use arrow::datatypes::DataType;
+use arrow::datatypes::{
+Int8Type, Int16Type, Int32Type, Int64Type, UInt8Type, UInt16Type, 
UInt32Type,
+UInt64Type,
+};
+use datafusion_common::{Result, internal_err};
+
+/// A "perfect" hash map for single-column integer join keys, represented as a 
dense array.
+///
+/// This structure is highly optimized for joins where the keys are integers 
within a limited
+/// range. Instead of calculating hashes, it uses the integer key itself as an 
index into a
+/// `Vec`, achieving O(1) lookup performance.
+///
+/// # NULL Handling
+///
+/// This optimization can be used for joins with 
`NullEquality::NullEqualsNothing` even if the
+/// join keys contain `NULL`s. This is because:
+///
+/// 1. `try_new` (build side): Ignores rows with `NULL` keys when creating the 
map. This is
+///correct as `NULL` keys would not match anything anyway.
+/// 2. `get_matched_indices_with_limit_offset` (probe side): Skips any `NULL` 
keys encountered
+///in the probe side input.
+///
+/// This structure **cannot** be used for joins with 
`NullEquality::NullEqualsNull` if the
+/// build side contains `NULL`s, as it does not have a mechanism to store and 
match `NULL` values.
+#[derive(Debug)]
+pub struct ArrayMap {
+// data[probSideVal-offset] -> valIdxInBuildSide + 1; 0 for absent
+data: Vec,
+offset: u64, // min val in buildSide
+next: Option>,
+}
+
+impl ArrayMap {
+/// Creates a new [`ArrayKV`] from the given array of join keys.
+///
+/// Note: This function processes only the non-null values in the input 
`array`,
+/// effectively ignoring any rows where the key is `NULL`.
+///
+/// TODO: Support `NullEquality::NullEqualsNull` by storing null indices 
in a
+/// separate `Vec` to allow for `NULL=NULL` matching in the future.
+pub(crate) fn try_new(
+array: &ArrayRef,
+offset_val: u64,
+range: usize,
+) -> Result {
+// Initialize with 0 (sentinel for not found)
+let mut data: Vec = vec![0; range];
+let mut next: Option> = None;

Review Comment:
   ```suggestion
   let mut next: Vec = vec![];
   ```
   
   I think this should work as well



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]