Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-06-25 Thread via GitHub


alamb commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-3006040173

   > Extend the take (or interleave) kernels to eagerly allocate new data 
buffers for string views (i.e. add a compute::TakeOption - unfortunately also a 
breaking change).
   
   BTW this eagerly allocating buffers / copying string views when needed is 
what the (in progress) `coalesce` kernel does:
   - https://github.com/apache/arrow-rs/blob/main/arrow-select/src/coalesce.rs
   
   Here is the logic that does it for StringView 
   - 
https://github.com/apache/arrow-rs/blob/main/arrow-select/src/coalesce/byte_view.rs
   
   I hope to migrate DataFusion to use the BatchCoalescer soon. 
   
   There is a `BatchCoalescer::push_batch_with_filter`  (that does the 
equivalent of `filter` + `concat`) that I am actively working to improve
   
   I imagine implementing something like 
`BatchCoalescer::push_batch_with_indices` that does the equivalent of `take` + 
`concat`
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-06-23 Thread via GitHub


ctsk commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2994380522

   I think the issue on the take+concat pattern is only tangentially related to 
this issue. Ultimately, you would need a version of the take operation, that 
does not simply clone the vector of data buffers, but rather eagerly allocates 
a new data buffer to hold the string view data. This modification could be 
applied to the current compute::take kernel too.
   
   The interleave kernel already does something like this: It only takes 
references to the data buffers that are actually referenced in the new array - 
that is why it would help somewhat alleviate this issue. If the build side is 
*really large*, this could still mean allocating a vector of 1000+ buffer 
references. The subsequent gc (in CoalesceBatchesExec) drops the vector and 
allocate a new data buffer.
   
   Instead, I think there are 2 direct solutions for this:
   1. Extend the take (or interleave) kernels to eagerly allocate new data 
buffers for string views  (i.e. add a `compute::TakeOption` - unfortunately 
also a breaking change).
   2. Switch from `Vec` to `Arc<[Buffer]>` -  also a breaking change in 
the arrow crate. This would not require writing new kernel-style code - the 
take/interleave operations would remain unchanged.
   
   My PR on executing a GC on the hash join build-side is unfortunately not a 
good solution: This GC operation is expensive, and whether it pays off depends 
on the number of probe side batches - which is not known in advance. The 
benchmark shows that it significantly slows down smaller joins. It would also 
not prevent the CoalesceBatchesExec after the HJ from issuing another GC.
   
   For now, the best way to fix this issue is to work on getting 
https://github.com/apache/arrow-rs/pull/6427 merged. As far as I can tell, it 
is already fully implemented and got stuck due to lack of time of the author to 
add the `ViewBuffers` abstraction.
   
   Lastly, I believe that using interleave, or a good solution to the 
take+concat pattern would be very helpful in making the HJ more efficient in 
general. Most importantly, it could allow avoiding the concat on the build-side 
batches, and remove the CoalesceBatchesExec after the join.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-06-20 Thread via GitHub


alamb commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2992632434

   > Thanks for sharing the insight [@ctsk](https://github.com/ctsk)
   > 
   > I think there is some overlap here to the work 
[@alamb](https://github.com/alamb) to improve (and maybe remove over time) 
`CoalesceBatchesExec` and gc of views.
   
   The relevant arrow ticket is
   - https://github.com/apache/arrow-rs/issues/6692
   
   @Dandandan  also made a PR that improves some view code here (but I don't 
think it affects take_view) here: 
   - https://github.com/apache/arrow-rs/pull/7695


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-06-19 Thread via GitHub


Dandandan commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2988836122

   Thanks for sharing the insight @ctsk 
   
   I think there is some overlap here to the work @alamb to improve (and maybe 
remove over time) `CoalesceBatchesExec` and gc of views.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-06-19 Thread via GitHub


ctsk commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2988799373

   I've re-encountered this issue and it (obviously, duh) gets amplified with 
larger scale facctors (>100). For example, the top 3 items by cycles when 
running tpch query 18 @ sf 300 are:
   
   ```
 18.07%  tokio-runtime-w  tpch  
[.] arrow_select::take::take_byte_view  

  ◆
 14.52%  tokio-runtime-w  tpch  
[.] ::poll_next
▒
 12.96%  tokio-runtime-w  tpch  
[.] 
core::ptr::drop_in_place>

   ``` 
   
   That is close to 50% of cycles spent due to this issue. Because the batches 
that get produced by the hash join have a lot of data buffers, the 
CoalesceBatchesExec gets more expensive too: Inside the operator, we iterate 
over the data buffers to determine whether to perform a garbage collection.
   
   I believe a reasonable fix for this issue is  to perform a garbage 
collection on the build side after the concatenation. The build side would then 
only hold a single data buffer per string-view / byte-view column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-31 Thread via GitHub


Dandandan commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2924659989

   So if we're to change `RecordBatch` to `Vec` in the hashjoin, I 
think this is what's roughly needed:
   
   * Change `batch` to `Vec` in `JoinLeftData`
   * Initialize `JoinLeftData` with a `Vec<(usize, usize>)` which has the 
batch_id and row_id for each index
   * Pass a `Vec<(usize, usize>)` to methods based on the indices from the 
build side
   * Change code to use `interleave`
   * Adapt other places where needed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-30 Thread via GitHub


jonathanc-n commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2923567825

   I'll try to give that a try, thanks @Dandandan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-30 Thread via GitHub


Dandandan commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2923538632

   This might help here as well I think?
   
   https://github.com/apache/arrow-rs/pull/6427/files


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-30 Thread via GitHub


Dandandan commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2923526973

   Yeah - one of the options to optimize join is by avoiding the `take` + 
`concat` (which happens when generating fewer rows per input batch in join, so 
I think that could use a similar strategy).
   
   Ideally, we also would avoid `concat_batches` in the start (and rather 
maintain a`Vec`).
   This was changed in the early days (by me) as it used to be slow.
   
   But now that the hashmap is basically only comparing hash values during 
probe, I think we can try and avoid the `concat_batches` (and store a u32 / u32 
of batch_id + index instead of a u64 index).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-30 Thread via GitHub


alamb commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2922383864

   I have noticed something similar -- that the pattern of "take / concat" 
results in potentially non trival overhead.
   
   I have been recently working on some related code (the same basic pattern 
happens for `filter`) Related:
   - https://github.com/apache/arrow-rs/issues/6692
   - https://github.com/apache/arrow-rs/pull/7513


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-05-28 Thread via GitHub


jonathanc-n commented on issue #16206:
URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2917999199

   take


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]