Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-24 Thread via GitHub


kumarUjjawal commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3957451327

   > Also, for the inconsistencies, can you describe what is "fixed" by 
resolving them. Is there a potential wrong results bug? Something else? If it 
is wrong results, I wonder why we haven't seen it before
   
   As per my understanding The old nested-type hashers always called 
combine_hashes(*hash, value), they always behaved as rehash = true, even when 
called as the first column (where hashes_buffer was pre-zeroed to 0).
   
   So they computed combine_hashes(0, inner_hash) instead of inner_hash 
directly. Results were internally consistent within a version. Joins and 
partitioning worked correctly.
   
   What this PR fixes is purely semantic inconsistency and clarity (per the 
original issue). Nested types always behaved as rehash = true regardless of 
context, while primitive and dictionary types correctly skipped combine_hashes 
on the first column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-24 Thread via GitHub


alamb commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3954657715

   I find this PR hard to review as it has both documentation and code changes
   
   Perhaps we can break the clarification parts into one PR and then make a 
second PR to address any potential inconsistencies
   
   Also, for the inconsistencies, can you describe what is "fixed" by resolving 
them. Is there a potential wrong results bug? Something else? If it is wrong 
results, I wonder why we haven't seen it before


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-11 Thread via GitHub


alamb-ghbot commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3889090859

   🤖: Benchmark completed
   
   Details
   
   
   
   ```
   groupchore_clarify_rehash
   main
   -
   
   dictionary_utf8_int32: multiple, no nulls1.00 75.4±0.77µs? 
?/sec1.01 76.5±0.37µs? ?/sec
   dictionary_utf8_int32: multiple, nulls   1.00111.0±0.21µs? 
?/sec1.02113.0±0.36µs? ?/sec
   dictionary_utf8_int32: single, no nulls  1.00 21.3±0.22µs? 
?/sec1.26 26.9±0.13µs? ?/sec
   dictionary_utf8_int32: single, nulls 1.00 36.2±0.13µs? 
?/sec1.06 38.3±0.16µs? ?/sec
   int64: multiple, no nulls1.00 38.7±0.08µs? 
?/sec1.00 38.7±0.10µs? ?/sec
   int64: multiple, nulls   1.23 73.5±0.14µs? 
?/sec1.00 59.8±0.41µs? ?/sec
   int64: single, no nulls  1.00 11.3±0.02µs? 
?/sec1.00 11.3±0.03µs? ?/sec
   int64: single, nulls 1.08 23.8±0.03µs? 
?/sec1.00 22.1±0.38µs? ?/sec
   large_utf8: multiple, no nulls   1.00228.7±1.33µs? 
?/sec1.00228.7±1.41µs? ?/sec
   large_utf8: multiple, nulls  1.05277.8±0.99µs? 
?/sec1.00263.9±0.93µs? ?/sec
   large_utf8: single, no nulls 1.01 72.2±0.31µs? 
?/sec1.00 71.6±0.49µs? ?/sec
   large_utf8: single, nulls1.01 83.0±0.59µs? 
?/sec1.00 82.4±0.38µs? ?/sec
   run_array_int32: multiple, no nulls  1.00 11.8±0.04µs? 
?/sec1.00 11.8±0.14µs? ?/sec
   run_array_int32: multiple, nulls 1.03 13.4±0.09µs? 
?/sec1.00 13.1±0.08µs? ?/sec
   run_array_int32: single, no nulls1.00  3.9±0.02µs? 
?/sec1.02  4.0±0.14µs? ?/sec
   run_array_int32: single, nulls   1.00  4.4±0.04µs? 
?/sec1.00  4.4±0.04µs? ?/sec
   struct_array: multiple, no nulls 1.09417.8±1.11µs? 
?/sec1.00382.8±4.91µs? ?/sec
   struct_array: multiple, nulls1.09452.8±2.98µs? 
?/sec1.00416.3±4.33µs? ?/sec
   struct_array: single, no nulls   1.06135.3±0.37µs? 
?/sec1.00128.0±1.24µs? ?/sec
   struct_array: single, nulls  1.05147.6±0.53µs? 
?/sec1.00140.1±3.03µs? ?/sec
   utf8: multiple, no nulls 1.06243.2±1.34µs? 
?/sec1.00228.9±0.74µs? ?/sec
   utf8: multiple, nulls1.08282.8±1.43µs? 
?/sec1.00262.2±0.90µs? ?/sec
   utf8: single, no nulls   1.00 72.2±0.30µs? 
?/sec1.00 72.0±0.40µs? ?/sec
   utf8: single, nulls  1.00 82.2±0.53µs? 
?/sec1.00 82.6±0.57µs? ?/sec
   utf8_view (small): multiple, no nulls1.00 47.6±0.09µs? 
?/sec1.00 47.5±0.09µs? ?/sec
   utf8_view (small): multiple, nulls   1.00 78.4±0.16µs? 
?/sec1.00 78.2±0.10µs? ?/sec
   utf8_view (small): single, no nulls  1.00 13.9±0.03µs? 
?/sec1.00 13.9±0.05µs? ?/sec
   utf8_view (small): single, nulls 1.00 23.7±0.15µs? 
?/sec1.00 23.7±0.05µs? ?/sec
   utf8_view: multiple, no nulls1.01231.6±0.49µs? 
?/sec1.00229.9±1.48µs? ?/sec
   utf8_view: multiple, nulls   1.00239.6±0.77µs? 
?/sec1.00239.7±0.76µs? ?/sec
   utf8_view: single, no nulls  1.00 74.3±0.51µs? 
?/sec1.01 75.1±0.47µs? ?/sec
   utf8_view: single, nulls 1.00 83.1±0.47µs? 
?/sec1.00 83.0±2.37µs? ?/sec
   ```
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-11 Thread via GitHub


alamb-ghbot commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3889048823

   🤖 `./gh_compare_branch_bench.sh` 
[compare_branch_bench.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/compare_branch_bench.sh)
 Running
   Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing chore/clarify_rehash (0fc807a4204482e5376bb5ed408f945fc1262bb4) to 
aa9520e6e260410e10a01eff5c202dfe018772ac 
[diff](https://github.com/apache/datafusion/compare/aa9520e6e260410e10a01eff5c202dfe018772ac..0fc807a4204482e5376bb5ed408f945fc1262bb4)
   BENCH_NAME=with_hashes
   BENCH_COMMAND=cargo bench --features=parquet --bench with_hashes
   BENCH_FILTER=
   BENCH_BRANCH_NAME=chore_clarify_rehash
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-11 Thread via GitHub


Jefffrey commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3889048670

   run benchmark with_hashes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-11 Thread via GitHub


alamb-ghbot commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3889012702

   🤖: Benchmark completed
   
   Details
   
   
   
   ```
   groupchore_clarify_rehash
   main
   -
   
   dictionary_utf8_int32: multiple, no nulls1.00 75.3±0.22µs? 
?/sec1.02 76.6±0.21µs? ?/sec
   dictionary_utf8_int32: multiple, nulls   1.00110.9±0.97µs? 
?/sec1.02113.1±0.92µs? ?/sec
   dictionary_utf8_int32: single, no nulls  1.00 21.3±0.07µs? 
?/sec1.27 27.0±0.24µs? ?/sec
   dictionary_utf8_int32: single, nulls 1.00 36.2±0.11µs? 
?/sec1.06 38.3±0.24µs? ?/sec
   int64: multiple, no nulls1.00 38.7±0.08µs? 
?/sec1.00 38.8±0.21µs? ?/sec
   int64: multiple, nulls   1.23 73.5±0.11µs? 
?/sec1.00 59.8±0.25µs? ?/sec
   int64: single, no nulls  1.00 11.3±0.02µs? 
?/sec1.00 11.3±0.04µs? ?/sec
   int64: single, nulls 1.08 23.8±0.05µs? 
?/sec1.00 22.1±0.11µs? ?/sec
   large_utf8: multiple, no nulls   1.00228.1±0.55µs? 
?/sec1.01229.6±3.18µs? ?/sec
   large_utf8: multiple, nulls  1.05277.4±0.59µs? 
?/sec1.00265.4±7.30µs? ?/sec
   large_utf8: single, no nulls 1.00 72.0±0.22µs? 
?/sec1.00 71.9±0.83µs? ?/sec
   large_utf8: single, nulls1.01 83.4±0.58µs? 
?/sec1.00 82.6±0.93µs? ?/sec
   run_array_int32: multiple, no nulls  1.00 11.8±0.05µs? 
?/sec1.00 11.9±0.07µs? ?/sec
   run_array_int32: multiple, nulls 1.03 13.5±0.06µs? 
?/sec1.00 13.1±0.06µs? ?/sec
   run_array_int32: single, no nulls1.00  3.9±0.02µs? 
?/sec1.02  3.9±0.02µs? ?/sec
   run_array_int32: single, nulls   1.02  4.5±0.03µs? 
?/sec1.00  4.4±0.03µs? ?/sec
   struct_array: multiple, no nulls 1.09419.7±4.15µs? 
?/sec1.00386.3±0.88µs? ?/sec
   struct_array: multiple, nulls1.08451.5±0.71µs? 
?/sec1.00419.5±0.86µs? ?/sec
   struct_array: single, no nulls   1.05135.9±1.78µs? 
?/sec1.00129.3±0.32µs? ?/sec
   struct_array: single, nulls  1.05147.0±0.87µs? 
?/sec1.00140.6±0.37µs? ?/sec
   utf8: multiple, no nulls 1.06243.8±1.67µs? 
?/sec1.00229.3±1.05µs? ?/sec
   utf8: multiple, nulls1.08282.9±1.61µs? 
?/sec1.00261.8±1.34µs? ?/sec
   utf8: single, no nulls   1.00 72.2±0.29µs? 
?/sec1.00 72.2±0.54µs? ?/sec
   utf8: single, nulls  1.00 82.2±0.23µs? 
?/sec1.00 82.4±0.49µs? ?/sec
   utf8_view (small): multiple, no nulls1.00 47.5±0.05µs? 
?/sec1.00 47.6±0.30µs? ?/sec
   utf8_view (small): multiple, nulls   1.00 78.3±0.50µs? 
?/sec1.00 78.4±1.08µs? ?/sec
   utf8_view (small): single, no nulls  1.00 13.9±0.03µs? 
?/sec1.00 13.9±0.17µs? ?/sec
   utf8_view (small): single, nulls 1.00 23.6±0.03µs? 
?/sec1.00 23.7±0.33µs? ?/sec
   utf8_view: multiple, no nulls1.01231.5±1.28µs? 
?/sec1.00229.9±1.87µs? ?/sec
   utf8_view: multiple, nulls   1.00238.5±0.72µs? 
?/sec1.01239.7±2.19µs? ?/sec
   utf8_view: single, no nulls  1.00 74.1±0.47µs? 
?/sec1.02 75.4±2.55µs? ?/sec
   utf8_view: single, nulls 1.00 82.4±0.81µs? 
?/sec1.01 82.9±0.75µs? ?/sec
   ```
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-11 Thread via GitHub


alamb-ghbot commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3888971533

   🤖 `./gh_compare_branch_bench.sh` 
[compare_branch_bench.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/compare_branch_bench.sh)
 Running
   Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing chore/clarify_rehash (0fc807a4204482e5376bb5ed408f945fc1262bb4) to 
aa9520e6e260410e10a01eff5c202dfe018772ac 
[diff](https://github.com/apache/datafusion/compare/aa9520e6e260410e10a01eff5c202dfe018772ac..0fc807a4204482e5376bb5ed408f945fc1262bb4)
   BENCH_NAME=with_hashes
   BENCH_COMMAND=cargo bench --features=parquet --bench with_hashes
   BENCH_FILTER=
   BENCH_BRANCH_NAME=chore_clarify_rehash
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-11 Thread via GitHub


Jefffrey commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3888971375

   run benchmark with_hashes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-11 Thread via GitHub


alamb-ghbot commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3888740090

   🤖: Benchmark completed
   
   Details
   
   
   
   ```
   groupchore_clarify_rehash
   main
   -
   
   dictionary_utf8_int32: multiple, no nulls1.00 76.7±1.03µs? 
?/sec1.15 88.0±1.21µs? ?/sec
   dictionary_utf8_int32: multiple, nulls   1.02113.1±0.87µs? 
?/sec1.00111.1±1.44µs? ?/sec
   dictionary_utf8_int32: single, no nulls  1.12 27.1±0.13µs? 
?/sec1.00 24.2±0.32µs? ?/sec
   dictionary_utf8_int32: single, nulls 1.06 38.3±0.35µs? 
?/sec1.00 36.1±0.40µs? ?/sec
   int64: multiple, no nulls1.00 38.7±0.06µs? 
?/sec1.00 38.7±0.17µs? ?/sec
   int64: multiple, nulls   1.00 54.5±0.58µs? 
?/sec1.00 54.5±0.25µs? ?/sec
   int64: single, no nulls  1.00 11.5±0.04µs? 
?/sec1.01 11.6±0.42µs? ?/sec
   int64: single, nulls 1.00 16.5±0.18µs? 
?/sec1.02 16.7±0.48µs? ?/sec
   large_utf8: multiple, no nulls   1.00228.4±0.54µs? 
?/sec1.00228.0±1.50µs? ?/sec
   large_utf8: multiple, nulls  1.04275.6±0.64µs? 
?/sec1.00264.0±3.07µs? ?/sec
   large_utf8: single, no nulls 1.00 72.0±0.38µs? 
?/sec1.00 72.0±1.25µs? ?/sec
   large_utf8: single, nulls1.00 81.4±0.26µs? 
?/sec1.03 83.6±0.30µs? ?/sec
   utf8: multiple, no nulls 1.00230.7±1.69µs? 
?/sec1.06243.4±2.93µs? ?/sec
   utf8: multiple, nulls1.00263.1±4.67µs? 
?/sec1.07281.3±3.29µs? ?/sec
   utf8: single, no nulls   1.00 72.4±0.69µs? 
?/sec1.00 72.2±1.01µs? ?/sec
   utf8: single, nulls  1.00 81.2±0.30µs? 
?/sec1.01 82.1±0.51µs? ?/sec
   utf8_view (small): multiple, no nulls1.00 47.6±0.19µs? 
?/sec1.01 47.9±1.65µs? ?/sec
   utf8_view (small): multiple, nulls   1.03 66.2±0.23µs? 
?/sec1.00 64.4±1.78µs? ?/sec
   utf8_view (small): single, no nulls  1.00 13.9±0.10µs? 
?/sec1.00 13.9±0.04µs? ?/sec
   utf8_view (small): single, nulls 1.10 23.7±1.20µs? 
?/sec1.00 21.6±0.04µs? ?/sec
   utf8_view: multiple, no nulls1.01245.2±1.56µs? 
?/sec1.00243.4±0.33µs? ?/sec
   utf8_view: multiple, nulls   1.00251.1±0.51µs? 
?/sec1.02257.2±0.66µs? ?/sec
   utf8_view: single, no nulls  1.00 80.4±0.32µs? 
?/sec1.00 80.6±0.23µs? ?/sec
   utf8_view: single, nulls 1.00 82.6±0.29µs? 
?/sec1.03 85.4±0.26µs? ?/sec
   ```
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-11 Thread via GitHub


alamb-ghbot commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3888714932

   🤖 `./gh_compare_branch_bench.sh` 
[compare_branch_bench.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/compare_branch_bench.sh)
 Running
   Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 
2025 x86_64 x86_64 x86_64 GNU/Linux
   Comparing chore/clarify_rehash (98a1fd9a1c773b2bc963d302c69bb9392259d0f5) to 
a02e6836e7d449008eef592b177ea282595e027a 
[diff](https://github.com/apache/datafusion/compare/a02e6836e7d449008eef592b177ea282595e027a..98a1fd9a1c773b2bc963d302c69bb9392259d0f5)
   BENCH_NAME=with_hashes
   BENCH_COMMAND=cargo bench --features=parquet --bench with_hashes
   BENCH_FILTER=
   BENCH_BRANCH_NAME=chore_clarify_rehash
   Results will be posted here when complete
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-11 Thread via GitHub


Jefffrey commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3888714768

   run benchmark with_hashes


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-09 Thread via GitHub


kumarUjjawal commented on PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3870745673

   > Might need some benchmarks to see what effect this has
   
   Do you want the benchmark for just functions-aggregate or end to end sql 
benchmar, also should i commit the benchmark code?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-08 Thread via GitHub


Jefffrey commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2780978390


##
datafusion/common/src/hash_utils.rs:
##
@@ -484,22 +505,49 @@ fn hash_map_array(
 let mut values_hashes = vec![0u64; array.entries().len()];
 create_hashes(array.entries().columns(), random_state, &mut 
values_hashes)?;
 
-// Combine the hashes for entries on each row with each other and previous 
hash for that row
-if let Some(nulls) = nulls {
+// Combine the hashes for entries on each row with each other.
+// When `rehash=true`, combine the per-row map hash into the existing 
accumulator.

Review Comment:
   Refer to my previous comment about not duplicating this explanation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-05 Thread via GitHub


Jefffrey commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2771981300


##
datafusion/common/src/hash_utils.rs:
##
@@ -448,7 +462,13 @@ fn hash_struct_array(
 array: &StructArray,
 random_state: &RandomState,
 hashes_buffer: &mut [u64],
+rehash: bool,
 ) -> Result<()> {
+// Hashing for nested types follows the same initialize-vs-combine 
convention as

Review Comment:
   We don't need to repeat this explanation in each function; the top level 
explanation is sufficient



##
datafusion/common/src/hash_utils.rs:
##
@@ -462,9 +482,23 @@ fn hash_struct_array(
 let mut values_hashes = vec![0u64; row_len];
 create_hashes(array.columns(), random_state, &mut values_hashes)?;
 
-for i in valid_row_indices {
-let hash = &mut hashes_buffer[i];
-*hash = combine_hashes(*hash, values_hashes[i]);
+if rehash {
+for i in valid_row_indices {
+hashes_buffer[i] = combine_hashes(values_hashes[i], 
hashes_buffer[i]);
+}
+} else {
+for i in valid_row_indices {
+hashes_buffer[i] = values_hashes[i];
+}
+}
+
+// Hash null struct values consistently with other array types

Review Comment:
   I thought other array types ignore nulls when hashing? Except for 
`DataType::Null`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-05 Thread via GitHub


Jefffrey commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2768884529


##
datafusion/common/src/hash_utils.rs:
##
@@ -400,6 +417,8 @@ fn update_hash_for_dict_key(
 dict_hashes: &[u64],
 dict_values: &dyn Array,
 idx: usize,
+// `multi_col` is the historical name for what is now referred to as 
`rehash` elsewhere

Review Comment:
   I don't think we should take the approach of having a `apply_row_hash` 
function as it currently is, per my reasoning above



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-05 Thread via GitHub


kumarUjjawal commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2768824594


##
datafusion/common/src/hash_utils.rs:
##
@@ -400,6 +417,8 @@ fn update_hash_for_dict_key(
 dict_hashes: &[u64],
 dict_values: &dyn Array,
 idx: usize,
+// `multi_col` is the historical name for what is now referred to as 
`rehash` elsewhere

Review Comment:
   Oh, Thanks for the heads up. Does the changes look good?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-05 Thread via GitHub


Jefffrey commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2768721673


##
datafusion/common/src/hash_utils.rs:
##
@@ -400,6 +417,8 @@ fn update_hash_for_dict_key(
 dict_hashes: &[u64],
 dict_values: &dyn Array,
 idx: usize,
+// `multi_col` is the historical name for what is now referred to as 
`rehash` elsewhere

Review Comment:
   I'd be careful of this approach, as we'd want to pull as many checks outside 
the hotloop as we can; checking `rehash` each time we compute a hash is 
inefficient compared to checking once before the loop, as many other functions 
here do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-04 Thread via GitHub


kumarUjjawal commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2767273412


##
datafusion/common/src/hash_utils.rs:
##
@@ -400,6 +417,8 @@ fn update_hash_for_dict_key(
 dict_hashes: &[u64],
 dict_values: &dyn Array,
 idx: usize,
+// `multi_col` is the historical name for what is now referred to as 
`rehash` elsewhere

Review Comment:
   Made some changes, also did a minor refactor, there is another opportunity 
for the refactor to centralize the common “compute per-row nested hash → apply 
rehash (init vs combine) → hash null rows” logic in a shared helper, so each 
nested hasher mostly just computes a row_hash and delegates the buffer 
update/null handling. What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]

2026-02-04 Thread via GitHub


Jefffrey commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2766542459


##
datafusion/common/src/hash_utils.rs:
##
@@ -449,6 +472,14 @@ fn hash_struct_array(
 random_state: &RandomState,
 hashes_buffer: &mut [u64],
 ) -> Result<()> {
+// This nested-type hasher currently always combines its computed 
struct-row hash
+// into `hashes_buffer` (equivalent to `rehash=true`). This preserves 
existing
+// behavior for single-column hashing of nested types.
+//
+// If we were to add a `rehash` flag here and make `rehash=false` 
overwrite the
+// buffer, it would change the numeric hash values produced for standalone
+// Struct columns.

Review Comment:
   I think we should look into fixing this instead of leaving the reasoning as 
"keep existing behaviour", especially when we don't know the root cause of why 
this existing behaviour is the way it is



##
datafusion/common/src/hash_utils.rs:
##
@@ -400,6 +417,8 @@ fn update_hash_for_dict_key(
 dict_hashes: &[u64],
 dict_values: &dyn Array,
 idx: usize,
+// `multi_col` is the historical name for what is now referred to as 
`rehash` elsewhere

Review Comment:
   It's better to fix the naming than add a comment trying to explain the 
discrepency



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]