Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
kumarUjjawal commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3957451327 > Also, for the inconsistencies, can you describe what is "fixed" by resolving them. Is there a potential wrong results bug? Something else? If it is wrong results, I wonder why we haven't seen it before As per my understanding The old nested-type hashers always called combine_hashes(*hash, value), they always behaved as rehash = true, even when called as the first column (where hashes_buffer was pre-zeroed to 0). So they computed combine_hashes(0, inner_hash) instead of inner_hash directly. Results were internally consistent within a version. Joins and partitioning worked correctly. What this PR fixes is purely semantic inconsistency and clarity (per the original issue). Nested types always behaved as rehash = true regardless of context, while primitive and dictionary types correctly skipped combine_hashes on the first column. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
alamb commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3954657715 I find this PR hard to review as it has both documentation and code changes Perhaps we can break the clarification parts into one PR and then make a second PR to address any potential inconsistencies Also, for the inconsistencies, can you describe what is "fixed" by resolving them. Is there a potential wrong results bug? Something else? If it is wrong results, I wonder why we haven't seen it before -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
alamb-ghbot commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3889090859 🤖: Benchmark completed Details ``` groupchore_clarify_rehash main - dictionary_utf8_int32: multiple, no nulls1.00 75.4±0.77µs? ?/sec1.01 76.5±0.37µs? ?/sec dictionary_utf8_int32: multiple, nulls 1.00111.0±0.21µs? ?/sec1.02113.0±0.36µs? ?/sec dictionary_utf8_int32: single, no nulls 1.00 21.3±0.22µs? ?/sec1.26 26.9±0.13µs? ?/sec dictionary_utf8_int32: single, nulls 1.00 36.2±0.13µs? ?/sec1.06 38.3±0.16µs? ?/sec int64: multiple, no nulls1.00 38.7±0.08µs? ?/sec1.00 38.7±0.10µs? ?/sec int64: multiple, nulls 1.23 73.5±0.14µs? ?/sec1.00 59.8±0.41µs? ?/sec int64: single, no nulls 1.00 11.3±0.02µs? ?/sec1.00 11.3±0.03µs? ?/sec int64: single, nulls 1.08 23.8±0.03µs? ?/sec1.00 22.1±0.38µs? ?/sec large_utf8: multiple, no nulls 1.00228.7±1.33µs? ?/sec1.00228.7±1.41µs? ?/sec large_utf8: multiple, nulls 1.05277.8±0.99µs? ?/sec1.00263.9±0.93µs? ?/sec large_utf8: single, no nulls 1.01 72.2±0.31µs? ?/sec1.00 71.6±0.49µs? ?/sec large_utf8: single, nulls1.01 83.0±0.59µs? ?/sec1.00 82.4±0.38µs? ?/sec run_array_int32: multiple, no nulls 1.00 11.8±0.04µs? ?/sec1.00 11.8±0.14µs? ?/sec run_array_int32: multiple, nulls 1.03 13.4±0.09µs? ?/sec1.00 13.1±0.08µs? ?/sec run_array_int32: single, no nulls1.00 3.9±0.02µs? ?/sec1.02 4.0±0.14µs? ?/sec run_array_int32: single, nulls 1.00 4.4±0.04µs? ?/sec1.00 4.4±0.04µs? ?/sec struct_array: multiple, no nulls 1.09417.8±1.11µs? ?/sec1.00382.8±4.91µs? ?/sec struct_array: multiple, nulls1.09452.8±2.98µs? ?/sec1.00416.3±4.33µs? ?/sec struct_array: single, no nulls 1.06135.3±0.37µs? ?/sec1.00128.0±1.24µs? ?/sec struct_array: single, nulls 1.05147.6±0.53µs? ?/sec1.00140.1±3.03µs? ?/sec utf8: multiple, no nulls 1.06243.2±1.34µs? ?/sec1.00228.9±0.74µs? ?/sec utf8: multiple, nulls1.08282.8±1.43µs? ?/sec1.00262.2±0.90µs? ?/sec utf8: single, no nulls 1.00 72.2±0.30µs? ?/sec1.00 72.0±0.40µs? ?/sec utf8: single, nulls 1.00 82.2±0.53µs? ?/sec1.00 82.6±0.57µs? ?/sec utf8_view (small): multiple, no nulls1.00 47.6±0.09µs? ?/sec1.00 47.5±0.09µs? ?/sec utf8_view (small): multiple, nulls 1.00 78.4±0.16µs? ?/sec1.00 78.2±0.10µs? ?/sec utf8_view (small): single, no nulls 1.00 13.9±0.03µs? ?/sec1.00 13.9±0.05µs? ?/sec utf8_view (small): single, nulls 1.00 23.7±0.15µs? ?/sec1.00 23.7±0.05µs? ?/sec utf8_view: multiple, no nulls1.01231.6±0.49µs? ?/sec1.00229.9±1.48µs? ?/sec utf8_view: multiple, nulls 1.00239.6±0.77µs? ?/sec1.00239.7±0.76µs? ?/sec utf8_view: single, no nulls 1.00 74.3±0.51µs? ?/sec1.01 75.1±0.47µs? ?/sec utf8_view: single, nulls 1.00 83.1±0.47µs? ?/sec1.00 83.0±2.37µs? ?/sec ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
alamb-ghbot commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3889048823 🤖 `./gh_compare_branch_bench.sh` [compare_branch_bench.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/compare_branch_bench.sh) Running Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Comparing chore/clarify_rehash (0fc807a4204482e5376bb5ed408f945fc1262bb4) to aa9520e6e260410e10a01eff5c202dfe018772ac [diff](https://github.com/apache/datafusion/compare/aa9520e6e260410e10a01eff5c202dfe018772ac..0fc807a4204482e5376bb5ed408f945fc1262bb4) BENCH_NAME=with_hashes BENCH_COMMAND=cargo bench --features=parquet --bench with_hashes BENCH_FILTER= BENCH_BRANCH_NAME=chore_clarify_rehash Results will be posted here when complete -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
Jefffrey commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3889048670 run benchmark with_hashes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
alamb-ghbot commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3889012702 🤖: Benchmark completed Details ``` groupchore_clarify_rehash main - dictionary_utf8_int32: multiple, no nulls1.00 75.3±0.22µs? ?/sec1.02 76.6±0.21µs? ?/sec dictionary_utf8_int32: multiple, nulls 1.00110.9±0.97µs? ?/sec1.02113.1±0.92µs? ?/sec dictionary_utf8_int32: single, no nulls 1.00 21.3±0.07µs? ?/sec1.27 27.0±0.24µs? ?/sec dictionary_utf8_int32: single, nulls 1.00 36.2±0.11µs? ?/sec1.06 38.3±0.24µs? ?/sec int64: multiple, no nulls1.00 38.7±0.08µs? ?/sec1.00 38.8±0.21µs? ?/sec int64: multiple, nulls 1.23 73.5±0.11µs? ?/sec1.00 59.8±0.25µs? ?/sec int64: single, no nulls 1.00 11.3±0.02µs? ?/sec1.00 11.3±0.04µs? ?/sec int64: single, nulls 1.08 23.8±0.05µs? ?/sec1.00 22.1±0.11µs? ?/sec large_utf8: multiple, no nulls 1.00228.1±0.55µs? ?/sec1.01229.6±3.18µs? ?/sec large_utf8: multiple, nulls 1.05277.4±0.59µs? ?/sec1.00265.4±7.30µs? ?/sec large_utf8: single, no nulls 1.00 72.0±0.22µs? ?/sec1.00 71.9±0.83µs? ?/sec large_utf8: single, nulls1.01 83.4±0.58µs? ?/sec1.00 82.6±0.93µs? ?/sec run_array_int32: multiple, no nulls 1.00 11.8±0.05µs? ?/sec1.00 11.9±0.07µs? ?/sec run_array_int32: multiple, nulls 1.03 13.5±0.06µs? ?/sec1.00 13.1±0.06µs? ?/sec run_array_int32: single, no nulls1.00 3.9±0.02µs? ?/sec1.02 3.9±0.02µs? ?/sec run_array_int32: single, nulls 1.02 4.5±0.03µs? ?/sec1.00 4.4±0.03µs? ?/sec struct_array: multiple, no nulls 1.09419.7±4.15µs? ?/sec1.00386.3±0.88µs? ?/sec struct_array: multiple, nulls1.08451.5±0.71µs? ?/sec1.00419.5±0.86µs? ?/sec struct_array: single, no nulls 1.05135.9±1.78µs? ?/sec1.00129.3±0.32µs? ?/sec struct_array: single, nulls 1.05147.0±0.87µs? ?/sec1.00140.6±0.37µs? ?/sec utf8: multiple, no nulls 1.06243.8±1.67µs? ?/sec1.00229.3±1.05µs? ?/sec utf8: multiple, nulls1.08282.9±1.61µs? ?/sec1.00261.8±1.34µs? ?/sec utf8: single, no nulls 1.00 72.2±0.29µs? ?/sec1.00 72.2±0.54µs? ?/sec utf8: single, nulls 1.00 82.2±0.23µs? ?/sec1.00 82.4±0.49µs? ?/sec utf8_view (small): multiple, no nulls1.00 47.5±0.05µs? ?/sec1.00 47.6±0.30µs? ?/sec utf8_view (small): multiple, nulls 1.00 78.3±0.50µs? ?/sec1.00 78.4±1.08µs? ?/sec utf8_view (small): single, no nulls 1.00 13.9±0.03µs? ?/sec1.00 13.9±0.17µs? ?/sec utf8_view (small): single, nulls 1.00 23.6±0.03µs? ?/sec1.00 23.7±0.33µs? ?/sec utf8_view: multiple, no nulls1.01231.5±1.28µs? ?/sec1.00229.9±1.87µs? ?/sec utf8_view: multiple, nulls 1.00238.5±0.72µs? ?/sec1.01239.7±2.19µs? ?/sec utf8_view: single, no nulls 1.00 74.1±0.47µs? ?/sec1.02 75.4±2.55µs? ?/sec utf8_view: single, nulls 1.00 82.4±0.81µs? ?/sec1.01 82.9±0.75µs? ?/sec ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
alamb-ghbot commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3888971533 🤖 `./gh_compare_branch_bench.sh` [compare_branch_bench.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/compare_branch_bench.sh) Running Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Comparing chore/clarify_rehash (0fc807a4204482e5376bb5ed408f945fc1262bb4) to aa9520e6e260410e10a01eff5c202dfe018772ac [diff](https://github.com/apache/datafusion/compare/aa9520e6e260410e10a01eff5c202dfe018772ac..0fc807a4204482e5376bb5ed408f945fc1262bb4) BENCH_NAME=with_hashes BENCH_COMMAND=cargo bench --features=parquet --bench with_hashes BENCH_FILTER= BENCH_BRANCH_NAME=chore_clarify_rehash Results will be posted here when complete -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
Jefffrey commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3888971375 run benchmark with_hashes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
alamb-ghbot commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3888740090 🤖: Benchmark completed Details ``` groupchore_clarify_rehash main - dictionary_utf8_int32: multiple, no nulls1.00 76.7±1.03µs? ?/sec1.15 88.0±1.21µs? ?/sec dictionary_utf8_int32: multiple, nulls 1.02113.1±0.87µs? ?/sec1.00111.1±1.44µs? ?/sec dictionary_utf8_int32: single, no nulls 1.12 27.1±0.13µs? ?/sec1.00 24.2±0.32µs? ?/sec dictionary_utf8_int32: single, nulls 1.06 38.3±0.35µs? ?/sec1.00 36.1±0.40µs? ?/sec int64: multiple, no nulls1.00 38.7±0.06µs? ?/sec1.00 38.7±0.17µs? ?/sec int64: multiple, nulls 1.00 54.5±0.58µs? ?/sec1.00 54.5±0.25µs? ?/sec int64: single, no nulls 1.00 11.5±0.04µs? ?/sec1.01 11.6±0.42µs? ?/sec int64: single, nulls 1.00 16.5±0.18µs? ?/sec1.02 16.7±0.48µs? ?/sec large_utf8: multiple, no nulls 1.00228.4±0.54µs? ?/sec1.00228.0±1.50µs? ?/sec large_utf8: multiple, nulls 1.04275.6±0.64µs? ?/sec1.00264.0±3.07µs? ?/sec large_utf8: single, no nulls 1.00 72.0±0.38µs? ?/sec1.00 72.0±1.25µs? ?/sec large_utf8: single, nulls1.00 81.4±0.26µs? ?/sec1.03 83.6±0.30µs? ?/sec utf8: multiple, no nulls 1.00230.7±1.69µs? ?/sec1.06243.4±2.93µs? ?/sec utf8: multiple, nulls1.00263.1±4.67µs? ?/sec1.07281.3±3.29µs? ?/sec utf8: single, no nulls 1.00 72.4±0.69µs? ?/sec1.00 72.2±1.01µs? ?/sec utf8: single, nulls 1.00 81.2±0.30µs? ?/sec1.01 82.1±0.51µs? ?/sec utf8_view (small): multiple, no nulls1.00 47.6±0.19µs? ?/sec1.01 47.9±1.65µs? ?/sec utf8_view (small): multiple, nulls 1.03 66.2±0.23µs? ?/sec1.00 64.4±1.78µs? ?/sec utf8_view (small): single, no nulls 1.00 13.9±0.10µs? ?/sec1.00 13.9±0.04µs? ?/sec utf8_view (small): single, nulls 1.10 23.7±1.20µs? ?/sec1.00 21.6±0.04µs? ?/sec utf8_view: multiple, no nulls1.01245.2±1.56µs? ?/sec1.00243.4±0.33µs? ?/sec utf8_view: multiple, nulls 1.00251.1±0.51µs? ?/sec1.02257.2±0.66µs? ?/sec utf8_view: single, no nulls 1.00 80.4±0.32µs? ?/sec1.00 80.6±0.23µs? ?/sec utf8_view: single, nulls 1.00 82.6±0.29µs? ?/sec1.03 85.4±0.26µs? ?/sec ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
alamb-ghbot commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3888714932 🤖 `./gh_compare_branch_bench.sh` [compare_branch_bench.sh](https://github.com/alamb/datafusion-benchmarking/blob/main/scripts/compare_branch_bench.sh) Running Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux Comparing chore/clarify_rehash (98a1fd9a1c773b2bc963d302c69bb9392259d0f5) to a02e6836e7d449008eef592b177ea282595e027a [diff](https://github.com/apache/datafusion/compare/a02e6836e7d449008eef592b177ea282595e027a..98a1fd9a1c773b2bc963d302c69bb9392259d0f5) BENCH_NAME=with_hashes BENCH_COMMAND=cargo bench --features=parquet --bench with_hashes BENCH_FILTER= BENCH_BRANCH_NAME=chore_clarify_rehash Results will be posted here when complete -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
Jefffrey commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3888714768 run benchmark with_hashes -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
kumarUjjawal commented on PR #20154: URL: https://github.com/apache/datafusion/pull/20154#issuecomment-3870745673 > Might need some benchmarks to see what effect this has Do you want the benchmark for just functions-aggregate or end to end sql benchmar, also should i commit the benchmark code? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
Jefffrey commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2780978390
##
datafusion/common/src/hash_utils.rs:
##
@@ -484,22 +505,49 @@ fn hash_map_array(
let mut values_hashes = vec![0u64; array.entries().len()];
create_hashes(array.entries().columns(), random_state, &mut
values_hashes)?;
-// Combine the hashes for entries on each row with each other and previous
hash for that row
-if let Some(nulls) = nulls {
+// Combine the hashes for entries on each row with each other.
+// When `rehash=true`, combine the per-row map hash into the existing
accumulator.
Review Comment:
Refer to my previous comment about not duplicating this explanation
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
Jefffrey commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2771981300
##
datafusion/common/src/hash_utils.rs:
##
@@ -448,7 +462,13 @@ fn hash_struct_array(
array: &StructArray,
random_state: &RandomState,
hashes_buffer: &mut [u64],
+rehash: bool,
) -> Result<()> {
+// Hashing for nested types follows the same initialize-vs-combine
convention as
Review Comment:
We don't need to repeat this explanation in each function; the top level
explanation is sufficient
##
datafusion/common/src/hash_utils.rs:
##
@@ -462,9 +482,23 @@ fn hash_struct_array(
let mut values_hashes = vec![0u64; row_len];
create_hashes(array.columns(), random_state, &mut values_hashes)?;
-for i in valid_row_indices {
-let hash = &mut hashes_buffer[i];
-*hash = combine_hashes(*hash, values_hashes[i]);
+if rehash {
+for i in valid_row_indices {
+hashes_buffer[i] = combine_hashes(values_hashes[i],
hashes_buffer[i]);
+}
+} else {
+for i in valid_row_indices {
+hashes_buffer[i] = values_hashes[i];
+}
+}
+
+// Hash null struct values consistently with other array types
Review Comment:
I thought other array types ignore nulls when hashing? Except for
`DataType::Null`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
Jefffrey commented on code in PR #20154: URL: https://github.com/apache/datafusion/pull/20154#discussion_r2768884529 ## datafusion/common/src/hash_utils.rs: ## @@ -400,6 +417,8 @@ fn update_hash_for_dict_key( dict_hashes: &[u64], dict_values: &dyn Array, idx: usize, +// `multi_col` is the historical name for what is now referred to as `rehash` elsewhere Review Comment: I don't think we should take the approach of having a `apply_row_hash` function as it currently is, per my reasoning above -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
kumarUjjawal commented on code in PR #20154: URL: https://github.com/apache/datafusion/pull/20154#discussion_r2768824594 ## datafusion/common/src/hash_utils.rs: ## @@ -400,6 +417,8 @@ fn update_hash_for_dict_key( dict_hashes: &[u64], dict_values: &dyn Array, idx: usize, +// `multi_col` is the historical name for what is now referred to as `rehash` elsewhere Review Comment: Oh, Thanks for the heads up. Does the changes look good? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
Jefffrey commented on code in PR #20154: URL: https://github.com/apache/datafusion/pull/20154#discussion_r2768721673 ## datafusion/common/src/hash_utils.rs: ## @@ -400,6 +417,8 @@ fn update_hash_for_dict_key( dict_hashes: &[u64], dict_values: &dyn Array, idx: usize, +// `multi_col` is the historical name for what is now referred to as `rehash` elsewhere Review Comment: I'd be careful of this approach, as we'd want to pull as many checks outside the hotloop as we can; checking `rehash` each time we compute a hash is inefficient compared to checking once before the loop, as many other functions here do -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
kumarUjjawal commented on code in PR #20154: URL: https://github.com/apache/datafusion/pull/20154#discussion_r2767273412 ## datafusion/common/src/hash_utils.rs: ## @@ -400,6 +417,8 @@ fn update_hash_for_dict_key( dict_hashes: &[u64], dict_values: &dyn Array, idx: usize, +// `multi_col` is the historical name for what is now referred to as `rehash` elsewhere Review Comment: Made some changes, also did a minor refactor, there is another opportunity for the refactor to centralize the common “compute per-row nested hash → apply rehash (init vs combine) → hash null rows” logic in a shared helper, so each nested hasher mostly just computes a row_hash and delegates the buffer update/null handling. What do you think? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] chore: Clarify rehash setting in hash utils [datafusion]
Jefffrey commented on code in PR #20154:
URL: https://github.com/apache/datafusion/pull/20154#discussion_r2766542459
##
datafusion/common/src/hash_utils.rs:
##
@@ -449,6 +472,14 @@ fn hash_struct_array(
random_state: &RandomState,
hashes_buffer: &mut [u64],
) -> Result<()> {
+// This nested-type hasher currently always combines its computed
struct-row hash
+// into `hashes_buffer` (equivalent to `rehash=true`). This preserves
existing
+// behavior for single-column hashing of nested types.
+//
+// If we were to add a `rehash` flag here and make `rehash=false`
overwrite the
+// buffer, it would change the numeric hash values produced for standalone
+// Struct columns.
Review Comment:
I think we should look into fixing this instead of leaving the reasoning as
"keep existing behaviour", especially when we don't know the root cause of why
this existing behaviour is the way it is
##
datafusion/common/src/hash_utils.rs:
##
@@ -400,6 +417,8 @@ fn update_hash_for_dict_key(
dict_hashes: &[u64],
dict_values: &dyn Array,
idx: usize,
+// `multi_col` is the historical name for what is now referred to as
`rehash` elsewhere
Review Comment:
It's better to fix the naming than add a comment trying to explain the
discrepency
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
