Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-08-13 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2275332927 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter:

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-16 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3078462264 > > select t1.value from range(100) t1 join range(819200) t2 on t1.value + t2.value < t1.value * t2.value; > > I'm happy to include this benchmark in the bench suite this week,

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-15 Thread via GitHub
2010YOUY01 commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3076513755 > select t1.value from range(100) t1 join range(819200) t2 on t1.value + t2.value < t1.value * t2.value; I'm happy to include this benchmark in the bench suite this week, un

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-14 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3071663279 > * Refactor so we limit building the entire cartesian product of both batches (this is already covered in the issue and I believe @UBarney is willing to work on this) Yes. I'

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-14 Thread via GitHub
alamb commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3070765444 Awesome -- thanks @jonathanc-n and @UBarney -- I am very happy to see this moving along -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-14 Thread via GitHub
alamb merged PR #16443: URL: https://github.com/apache/datafusion/pull/16443

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-14 Thread via GitHub
jonathanc-n commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3070761175 @alamb Yes I believe all comments have been addressed. I think we have two notable follow ups: - Refactor so we limit building the entire cartesian product of both batches (t

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-14 Thread via GitHub
alamb commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3070585433 Is this one ready to merge?

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-13 Thread via GitHub
2010YOUY01 commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2203255450 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filte

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-13 Thread via GitHub
Dandandan commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2203255187 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-13 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2203230458 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter:

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-12 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2202485385 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,15 +845,125 @@ impl NestedLoopJoinStream { let poll = handle_state!(self

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-12 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2202485179 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -705,8 +696,29 @@ impl NestedLoopJoinStreamState { } } +/// Tracks incremental output of j

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-12 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2202340283 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +833,127 @@ impl NestedLoopJoinStream { handle_state!(self.process_pr

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-11 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2202340283 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +833,127 @@ impl NestedLoopJoinStream { handle_state!(self.process_pr

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-10 Thread via GitHub
2010YOUY01 commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2199484385 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filte

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-10 Thread via GitHub
2010YOUY01 commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2199423074 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +833,127 @@ impl NestedLoopJoinStream { handle_state!(self.process

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-10 Thread via GitHub
jonathanc-n commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3058063741 @2010YOUY01 Special types need to only return the matching rows, so only one side needs to return rows while the other side can return a null array and not be projected in the fi

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-10 Thread via GitHub
2010YOUY01 commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3056490809 > I have addressed all of your comments. @2010YOUY01 please take another look > > > I recommend to doc more high-level ideas to key functions, to make this module easier to

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-09 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3055598311 I have addressed all of your comments. @2010YOUY01 please take another look > I recommend to doc more high-level ideas to key functions, to make this module easier to

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-09 Thread via GitHub
2010YOUY01 commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2194486005 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -689,6 +674,8 @@ enum NestedLoopJoinStreamState { ProcessProbeBatch(RecordBatch), ///

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-09 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2194337073 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter:

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
alamb commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2192449966 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter: &J

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
alamb commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2192449082 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter: &J

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
Dandandan commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2191954803 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
Dandandan commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2191939442 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
Dandandan commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2191937168 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
Dandandan commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2191935939 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
2010YOUY01 commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3047980169 > @korowa @2010YOUY01 Are you able to take a quick look? Thanks! Thank you so much for this optimization. It's on my list, but due to the complexity of the join operator, I

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-07 Thread via GitHub
jonathanc-n commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3045016104 @korowa @2010YOUY01 Are you able to take a quick look? Thanks!

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-05 Thread via GitHub
jonathanc-n commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2181119355 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +833,127 @@ impl NestedLoopJoinStream { handle_state!(self.proces

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-05 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3038952572 > Thanks @UBarney, just some comments Thanks @jonathanc-n for reviewing. I have addressed all of your comments.

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-03 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2182935284 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter:

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-03 Thread via GitHub
jonathanc-n commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2182837993 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filt

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-03 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2182032475 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter:

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-02 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2181428292 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -883,44 +1000,63 @@ impl NestedLoopJoinStream { let visited_left_side = left_data.bitmap(

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-02 Thread via GitHub
jonathanc-n commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2181125973 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +828,127 @@ impl NestedLoopJoinStream { handle_state!(self.proces

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-02 Thread via GitHub
jonathanc-n commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2181102938 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +833,127 @@ impl NestedLoopJoinStream { handle_state!(self.proces

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-01 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2177799623 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -883,44 +1002,66 @@ impl NestedLoopJoinStream { let visited_left_side = left_data.bitmap(

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-01 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2177790602 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -1215,16 +1324,25 @@ pub(crate) mod tests { batches.extend( more_bat

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-01 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r216496 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +833,127 @@ impl NestedLoopJoinStream { handle_state!(self.process_pr

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-01 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r210988 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -729,10 +716,26 @@ struct NestedLoopJoinStream { right_side_ordered: bool, /// Current s

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-29 Thread via GitHub
2010YOUY01 commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2173715452 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -883,44 +1002,66 @@ impl NestedLoopJoinStream { let visited_left_side = left_data.bitm

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-29 Thread via GitHub
Copilot commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2173686344 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -729,10 +716,26 @@ struct NestedLoopJoinStream { right_side_ordered: bool, /// Current s

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-29 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2173642138 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -883,44 +1002,66 @@ impl NestedLoopJoinStream { let visited_left_side = left_data.bitmap(

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-29 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2173635381 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +833,127 @@ impl NestedLoopJoinStream { handle_state!(self.process_pr

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-29 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2173634499 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter:

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-28 Thread via GitHub
korowa commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2173258366 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -729,10 +716,26 @@ struct NestedLoopJoinStream { right_side_ordered: bool, /// Current st

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-22 Thread via GitHub
jonathanc-n commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2160347506 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +833,127 @@ impl NestedLoopJoinStream { handle_state!(self.proces

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-22 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-2994063902 > * `apply_join_filter_to_indices` showed a reduction in execution time (sample count reduced from 528 million to 241 million). The benchmark results indicate that restricting th

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-21 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-2993919183 > When you are running the benchmarks do they stay consistent? Yes. The benchmark results are almost consistent. I ran the benchmarks a few minutes ago on commit. It's wort

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-21 Thread via GitHub
jonathanc-n commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-2993895538 When you are running the benchmarks do they stay consistent?

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-21 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-2993893069 > I'll find out why there is a performance improvement From the flame graph (when executing the SQL `select t1.value from range(8192) t1 join range(8192) t2 on t1.value + t2.va

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-19 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2156308803 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -510,8 +511,6 @@ impl ExecutionPlan for NestedLoopJoinExec { })?; let batch_si

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-19 Thread via GitHub
jonathanc-n commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-2988168626 Those benchmark helper functions are really cool, I'll see if I can take a look today.

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-19 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2156308803 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -510,8 +511,6 @@ impl ExecutionPlan for NestedLoopJoinExec { })?; let batch_si

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-18 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-2986400295 # benchmark I use this [script](https://gist.github.com/UBarney/9dcbf304e65f061d3352b34abd0f0e05#file-sql_bench-py) to run the benchmarks | ID | SQL | join_base Time(s) | join_li

[PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-18 Thread via GitHub
UBarney opened a new pull request, #16443: URL: https://github.com/apache/datafusion/pull/16443 ## Which issue does this PR close? part of #16364 ## Rationale for this change see issue ## What changes are included in this PR? 1. Limit intermediate_batch Siz
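
For readers unfamiliar with the idea behind the PR title, here is a minimal Rust sketch of bounding the intermediate batch size when a nested loop join enumerates build/probe row pairs. It is not the code from PR #16443. `CartesianChunks`, its fields, and `max_rows` are hypothetical names invented for illustration, and plain `Vec<u32>` stands in for the Arrow index arrays that the real `apply_join_filter_to_indices` takes. The sketch shows only that the cartesian product can be walked in slices of at most `max_rows` pairs, so a join filter never has to see the full product at once.

```rust
/// Hypothetical helper (not DataFusion's API): yields the (build, probe)
/// index pairs of a cartesian product in chunks of at most `max_rows` pairs.
struct CartesianChunks {
    build_rows: usize,
    probe_rows: usize,
    max_rows: usize,
    /// Linear position within the build_rows * probe_rows product.
    cursor: usize,
}

impl CartesianChunks {
    fn new(build_rows: usize, probe_rows: usize, max_rows: usize) -> Self {
        assert!(max_rows > 0, "max_rows must be positive");
        Self { build_rows, probe_rows, max_rows, cursor: 0 }
    }
}

impl Iterator for CartesianChunks {
    /// One chunk: parallel vectors of build-side and probe-side row indices.
    type Item = (Vec<u32>, Vec<u32>);

    fn next(&mut self) -> Option<Self::Item> {
        let total = self.build_rows * self.probe_rows;
        if self.cursor >= total {
            return None; // also covers an empty build or probe side
        }
        let end = (self.cursor + self.max_rows).min(total);
        let mut build = Vec::with_capacity(end - self.cursor);
        let mut probe = Vec::with_capacity(end - self.cursor);
        for pos in self.cursor..end {
            // Row-major order: the probe index varies fastest.
            build.push((pos / self.probe_rows) as u32);
            probe.push((pos % self.probe_rows) as u32);
        }
        self.cursor = end;
        Some((build, probe))
    }
}
```

In this sketch a caller would evaluate the join filter on each chunk and append the surviving pairs to the output, so peak intermediate memory scales with `max_rows` rather than with the product of the two input sizes.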