Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-25 Thread via GitHub


alamb merged PR #13469:
URL: https://github.com/apache/datafusion/pull/13469


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-25 Thread via GitHub


alamb commented on PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#issuecomment-2499081979

   I changed the PR description to say "part of" rather than closing 
https://github.com/apache/datafusion/issues/10357
   
   Thanks @athultr1997 @comphead  and @viirya 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-21 Thread via GitHub


comphead commented on code in PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#discussion_r1853015607


##
datafusion/physical-plan/src/joins/sort_merge_join.rs:
##
@@ -68,8 +68,46 @@ use crate::{
 RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 
-/// join execution plan executes partitions in parallel and combines them into 
a set of
-/// partitions.
+/// Join execution plan that executes equi-join predicates on multiple 
partitions using Sort-Merge
+/// join algorithm and applies an optional filter post join. Can be used to 
join arbitrarily large
+/// inputs where one or both of the inputs don't fit in the available memory.
+///  
+/// # Join Expressions
+///
+/// Equi-join predicate (e.g. ` = `) expressions are represented 
by [`Self::on`].
+///
+/// Non-equality predicates, which can not be pushed down to join inputs (e.g.
+/// ` != `) are known as "filter expressions" and are evaluated
+/// after the equijoin predicates. They are represented by [`Self::filter`]. 
These are optional
+/// expressions.
+///
+/// # Sorting
+///
+/// Assumes that both the left and right input to the join are pre-sorted. It 
is not the
+/// responisibility of this execution plan to sort the inputs.
+///
+/// # "Streamed" vs "Buffered"
+///
+/// Buffered input is buffered for all record batches having the same value of 
join key.
+/// If the memory limit increases beyond the specified value and spilling is 
enabled,
+/// buffered batches could be spilled to disk. If spilling is disabled, the 
execution

Review Comment:
   Maybe we can point to `DiskManager` documentation here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-18 Thread via GitHub


viirya commented on code in PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#discussion_r1847130362


##
datafusion/physical-plan/src/joins/sort_merge_join.rs:
##
@@ -68,8 +68,46 @@ use crate::{
 RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 
-/// join execution plan executes partitions in parallel and combines them into 
a set of
-/// partitions.
+/// Join execution plan that executes equi-join predicates on multiple 
partitions using Sort-Merge
+/// join algorithm and applies an optional filter post join. Can be used to 
join arbitrarily large
+/// inputs where one or both of the inputs don't fit in the available memory.

Review Comment:
   > where one or both of the inputs don't fit in the available memory.
   
   Hmm, is this true?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-18 Thread via GitHub


athultr1997 commented on code in PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#discussion_r1847632368


##
datafusion/physical-plan/src/joins/sort_merge_join.rs:
##
@@ -68,8 +68,43 @@ use crate::{
 RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 
-/// join execution plan executes partitions in parallel and combines them into 
a set of
-/// partitions.
+/// Join execution plan that executes equi-join predicates on multiple 
partitions using Sort-Merge
+/// join algorithm and applies an optional filter post join. Can be used to 
join arbitrarily large
+/// inputs where one or both of the inputs don't fit in the available memory.
+///  
+/// # Join Expressions
+///
+/// Equi-join predicate (e.g. ` = `) expressions are represented 
by [`Self::on`].
+///
+/// Non-equality predicates, which can not be pushed down to join inputs (e.g.
+/// ` != `) are known as "filter expressions" and are evaluated
+/// after the equijoin predicates. They are represented by [`Self::filter`]. 
These are optional
+/// expressions.
+///
+/// # Sorting
+///
+/// Assumes that both the left and right input to the join are pre-sorted. It 
is not the
+/// responisibility of this execution plan to sort the inputs.
+///
+/// # "Streamed" vs "Buffered"
+///
+/// Buffered input is buffered for all record batches having the same value of 
join key.
+/// If the memory limit increases beyond the specified value and spilling is 
enabled,
+/// buffered batches could be spilled to disk. If spilling is disabled, the 
execution
+/// will fail under the same conditions. Multiple record batches of buffered 
could be
+/// present in memory/disk during the exectution.
+///
+/// Only one record batch of streamed input will be present in the memory at 
all times. There is no
+/// spilling support for streamed input. The comparisons are performed from 
values of join keys in
+/// streamed input with the values of join keys in buffered input. One row in 
streamed record
+/// batch could be matched with multiple rows in buffered input batches.
+///
+/// Depending on the type of join left or right input may be selected as 
streamed or buffered
+/// respectively. For example, in a left-outer join, the left execution plan 
will be selected as
+/// streamed input.
+///
+/// Reference for the algorithm:
+/// 

Review Comment:
   Changed `https://www.youtube.com/watch?v=jiWCPJtDE2c` to 
``



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-18 Thread via GitHub


athultr1997 commented on code in PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#discussion_r1847572086


##
datafusion/physical-plan/src/joins/sort_merge_join.rs:
##
@@ -68,8 +68,46 @@ use crate::{
 RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 
-/// join execution plan executes partitions in parallel and combines them into 
a set of
-/// partitions.
+/// Join execution plan that executes equi-join predicates on multiple 
partitions using Sort-Merge
+/// join algorithm and applies an optional filter post join. Can be used to 
join arbitrarily large
+/// inputs where one or both of the inputs don't fit in the available memory.
+///  
+/// # Join Expressions
+///
+/// Equi-join predicate (e.g. ` = `) expressions are represented 
by [`Self::on`].
+///
+/// Non-equality predicates, which can not be pushed down to join inputs (e.g.
+/// ` != `) are known as "filter expressions" and are evaluated
+/// after the equijoin predicates. They are represented by [`Self::filter`]. 
These are optional
+/// expressions.
+///
+/// # Sorting
+///
+/// Assumes that both the left and right input to the join are pre-sorted. It 
is not the
+/// responisibility of this execution plan to sort the inputs.
+///
+/// # "Streamed" vs "Buffered"
+///
+/// Buffered input is buffered for all record batches having the same value of 
join key.
+/// If the memory limit increases beyond the specified value and spilling is 
enabled,
+/// buffered batches could be spilled to disk. If spilling is disabled, the 
execution

Review Comment:
   One has to enable the disk manager when creating the `RuntimeEnv` during the 
creation of `TaskContext`.
   ```
   let runtime = RuntimeEnvBuilder::new()
   .with_memory_limit(100, 1.0)
   .with_disk_manager(DiskManagerConfig::NewOs)
   .build_arc()?;
   ```
   I am not sure if we should mention it here, since this is all the way over 
in `RuntimeEnv` and there are multiple strategies to enable the disk manager.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-18 Thread via GitHub


athultr1997 commented on code in PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#discussion_r1847567666


##
datafusion/physical-plan/src/joins/sort_merge_join.rs:
##
@@ -68,8 +68,46 @@ use crate::{
 RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 
-/// join execution plan executes partitions in parallel and combines them into 
a set of
-/// partitions.
+/// Join execution plan that executes equi-join predicates on multiple 
partitions using Sort-Merge
+/// join algorithm and applies an optional filter post join. Can be used to 
join arbitrarily large
+/// inputs where one or both of the inputs don't fit in the available memory.

Review Comment:
   streamed - Always taken one batch at a time.
   buffered - Has spilling support.
   Hence, the inputs don't have to fit in memory.
   
   Also, I think this was the vision behind SMJ: 
https://github.com/apache/datafusion/issues/1599#issue-1105949955.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-18 Thread via GitHub


viirya commented on code in PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#discussion_r1847131367


##
datafusion/physical-plan/src/joins/sort_merge_join.rs:
##
@@ -68,8 +68,46 @@ use crate::{
 RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 
-/// join execution plan executes partitions in parallel and combines them into 
a set of
-/// partitions.
+/// Join execution plan that executes equi-join predicates on multiple 
partitions using Sort-Merge
+/// join algorithm and applies an optional filter post join. Can be used to 
join arbitrarily large
+/// inputs where one or both of the inputs don't fit in the available memory.
+///  
+/// # Join Expressions
+///
+/// Equi-join predicate (e.g. ` = `) expressions are represented 
by [`Self::on`].
+///
+/// Non-equality predicates, which can not be pushed down to join inputs (e.g.
+/// ` != `) are known as "filter expressions" and are evaluated
+/// after the equijoin predicates. They are represented by [`Self::filter`]. 
These are optional
+/// expressions.
+///
+/// # Sorting
+///
+/// Assumes that both the left and right input to the join are pre-sorted. It 
is not the
+/// responisibility of this execution plan to sort the inputs.
+///
+/// # "Streamed" vs "Buffered"
+///
+/// Buffered input is buffered for all record batches having the same value of 
join key.
+/// If the memory limit increases beyond the specified value and spilling is 
enabled,
+/// buffered batches could be spilled to disk. If spilling is disabled, the 
execution

Review Comment:
   Is there a config for spilling? Shall we mention it here too?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-18 Thread via GitHub


comphead commented on code in PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#discussion_r1847124820


##
datafusion/physical-plan/src/joins/sort_merge_join.rs:
##
@@ -68,8 +68,46 @@ use crate::{
 RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 
-/// join execution plan executes partitions in parallel and combines them into 
a set of
-/// partitions.
+/// Join execution plan that executes equi-join predicates on multiple 
partitions using Sort-Merge
+/// join algorithm and applies an optional filter post join. Can be used to 
join arbitrarily large
+/// inputs where one or both of the inputs don't fit in the available memory.
+///  
+/// # Join Expressions
+///
+/// Equi-join predicate (e.g. ` = `) expressions are represented 
by [`Self::on`].
+///
+/// Non-equality predicates, which can not be pushed down to join inputs (e.g.
+/// ` != `) are known as "filter expressions" and are evaluated
+/// after the equijoin predicates. They are represented by [`Self::filter`]. 
These are optional
+/// expressions.
+///
+/// # Sorting
+///
+/// Assumes that both the left and right input to the join are pre-sorted. It 
is not the
+/// responisibility of this execution plan to sort the inputs.
+///
+/// # "Streamed" vs "Buffered"
+///
+/// Buffered input is buffered for all record batches having the same value of 
join key.
+/// If the memory limit increases beyond the specified value and spilling is 
enabled,
+/// buffered batches could be spilled to disk. If spilling is disabled, the 
execution
+/// will fail under the same conditions. Multiple record batches of buffered 
could be
+/// present in memory/disk during the exectution.
+///
+/// Only one record batch of streamed input will be present in the memory at 
all times. There is no

Review Comment:
   it depends on the batch size and there can `batch_size` of streamed rows 
currently lives in memory.
   For the buffered the buffered batch lives until BufferedState switches and 
due to presorted inputs  the algorithm understands the buffered batch is not 
needed anymore and it gets released



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-18 Thread via GitHub


comphead commented on code in PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#discussion_r1847120705


##
datafusion/physical-plan/src/joins/sort_merge_join.rs:
##
@@ -68,8 +68,43 @@ use crate::{
 RecordBatchStream, SendableRecordBatchStream, Statistics,
 };
 
-/// join execution plan executes partitions in parallel and combines them into 
a set of
-/// partitions.
+/// Join execution plan that executes equi-join predicates on multiple 
partitions using Sort-Merge
+/// join algorithm and applies an optional filter post join. Can be used to 
join arbitrarily large
+/// inputs where one or both of the inputs don't fit in the available memory.
+///  
+/// # Join Expressions
+///
+/// Equi-join predicate (e.g. ` = `) expressions are represented 
by [`Self::on`].
+///
+/// Non-equality predicates, which can not be pushed down to join inputs (e.g.
+/// ` != `) are known as "filter expressions" and are evaluated
+/// after the equijoin predicates. They are represented by [`Self::filter`]. 
These are optional
+/// expressions.
+///
+/// # Sorting
+///
+/// Assumes that both the left and right input to the join are pre-sorted. It 
is not the
+/// responisibility of this execution plan to sort the inputs.
+///
+/// # "Streamed" vs "Buffered"
+///
+/// Buffered input is buffered for all record batches having the same value of 
join key.
+/// If the memory limit increases beyond the specified value and spilling is 
enabled,
+/// buffered batches could be spilled to disk. If spilling is disabled, the 
execution
+/// will fail under the same conditions. Multiple record batches of buffered 
could be
+/// present in memory/disk during the exectution.
+///
+/// Only one record batch of streamed input will be present in the memory at 
all times. There is no
+/// spilling support for streamed input. The comparisons are performed from 
values of join keys in
+/// streamed input with the values of join keys in buffered input. One row in 
streamed record
+/// batch could be matched with multiple rows in buffered input batches.
+///
+/// Depending on the type of join left or right input may be selected as 
streamed or buffered
+/// respectively. For example, in a left-outer join, the left execution plan 
will be selected as
+/// streamed input.
+///
+/// Reference for the algorithm:
+/// 

Review Comment:
   ```suggestion
   /// 
   ///
   /// Helpful short video demonstration
   /// https://www.youtube.com/watch?v=jiWCPJtDE2c
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org



Re: [PR] Added documentation for SortMergeJoin [datafusion]

2024-11-18 Thread via GitHub


comphead commented on PR #13469:
URL: https://github.com/apache/datafusion/pull/13469#issuecomment-2483613007

   I would be adding this short video to make the dev/user familiar with SMJ 
concepts https://www.youtube.com/watch?v=jiWCPJtDE2c


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org