Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
alamb commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-3006040173 > Extend the take (or interleave) kernels to eagerly allocate new data buffers for string views (i.e. add a compute::TakeOption - unfortunately also a breaking change). BTW this eagerly allocating buffers / copying string views when needed is what the (in progress) `coalesce` kernel does: - https://github.com/apache/arrow-rs/blob/main/arrow-select/src/coalesce.rs Here is the logic that does it for StringView - https://github.com/apache/arrow-rs/blob/main/arrow-select/src/coalesce/byte_view.rs I hope to migrate DataFusion to use the BatchCoalescer soon. There is a `BatchCoalescer::push_batch_with_filter` (that does the equivalent of `filter` + `concat`) that I am actively working to improve I imagine implementing something like `BatchCoalescer::push_batch_with_indices` that does the equivalent of `take` + `concat` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
ctsk commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2994380522 I think the issue on the take+concat pattern is only tangentially related to this issue. Ultimately, you would need a version of the take operation, that does not simply clone the vector of data buffers, but rather eagerly allocates a new data buffer to hold the string view data. This modification could be applied to the current compute::take kernel too. The interleave kernel already does something like this: It only takes references to the data buffers that are actually referenced in the new array - that is why it would help somewhat alleviate this issue. If the build side is *really large*, this could still mean allocating a vector of 1000+ buffer references. The subsequent gc (in CoalesceBatchesExec) drops the vector and allocate a new data buffer. Instead, I think there are 2 direct solutions for this: 1. Extend the take (or interleave) kernels to eagerly allocate new data buffers for string views (i.e. add a `compute::TakeOption` - unfortunately also a breaking change). 2. Switch from `Vec` to `Arc<[Buffer]>` - also a breaking change in the arrow crate. This would not require writing new kernel-style code - the take/interleave operations would remain unchanged. My PR on executing a GC on the hash join build-side is unfortunately not a good solution: This GC operation is expensive, and whether it pays off depends on the number of probe side batches - which is not known in advance. The benchmark shows that it significantly slows down smaller joins. It would also not prevent the CoalesceBatchesExec after the HJ from issuing another GC. For now, the best way to fix this issue is to work on getting https://github.com/apache/arrow-rs/pull/6427 merged. As far as I can tell, it is already fully implemented and got stuck due to lack of time of the author to add the `ViewBuffers` abstraction. Lastly, I believe that using interleave, or a good solution to the take+concat pattern would be very helpful in making the HJ more efficient in general. Most importantly, it could allow avoiding the concat on the build-side batches, and remove the CoalesceBatchesExec after the join. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
alamb commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2992632434 > Thanks for sharing the insight [@ctsk](https://github.com/ctsk) > > I think there is some overlap here to the work [@alamb](https://github.com/alamb) to improve (and maybe remove over time) `CoalesceBatchesExec` and gc of views. The relevant arrow ticket is - https://github.com/apache/arrow-rs/issues/6692 @Dandandan also made a PR that improves some view code here (but I don't think it affects take_view) here: - https://github.com/apache/arrow-rs/pull/7695 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
Dandandan commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2988836122 Thanks for sharing the insight @ctsk I think there is some overlap here to the work @alamb to improve (and maybe remove over time) `CoalesceBatchesExec` and gc of views. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
ctsk commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2988799373 I've re-encountered this issue and it (obviously, duh) gets amplified with larger scale facctors (>100). For example, the top 3 items by cycles when running tpch query 18 @ sf 300 are: ``` 18.07% tokio-runtime-w tpch [.] arrow_select::take::take_byte_view ◆ 14.52% tokio-runtime-w tpch [.] ::poll_next ▒ 12.96% tokio-runtime-w tpch [.] core::ptr::drop_in_place> ``` That is close to 50% of cycles spent due to this issue. Because the batches that get produced by the hash join have a lot of data buffers, the CoalesceBatchesExec gets more expensive too: Inside the operator, we iterate over the data buffers to determine whether to perform a garbage collection. I believe a reasonable fix for this issue is to perform a garbage collection on the build side after the concatenation. The build side would then only hold a single data buffer per string-view / byte-view column. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
Dandandan commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2924659989 So if we're to change `RecordBatch` to `Vec` in the hashjoin, I think this is what's roughly needed: * Change `batch` to `Vec` in `JoinLeftData` * Initialize `JoinLeftData` with a `Vec<(usize, usize>)` which has the batch_id and row_id for each index * Pass a `Vec<(usize, usize>)` to methods based on the indices from the build side * Change code to use `interleave` * Adapt other places where needed -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
jonathanc-n commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2923567825 I'll try to give that a try, thanks @Dandandan -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
Dandandan commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2923538632 This might help here as well I think? https://github.com/apache/arrow-rs/pull/6427/files -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
Dandandan commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2923526973 Yeah - one of the options to optimize join is by avoiding the `take` + `concat` (which happens when generating fewer rows per input batch in join, so I think that could use a similar strategy). Ideally, we also would avoid `concat_batches` in the start (and rather maintain a`Vec`). This was changed in the early days (by me) as it used to be slow. But now that the hashmap is basically only comparing hash values during probe, I think we can try and avoid the `concat_batches` (and store a u32 / u32 of batch_id + index instead of a u64 index). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
alamb commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2922383864 I have noticed something similar -- that the pattern of "take / concat" results in potentially non trival overhead. I have been recently working on some related code (the same basic pattern happens for `filter`) Related: - https://github.com/apache/arrow-rs/issues/6692 - https://github.com/apache/arrow-rs/pull/7513 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]
jonathanc-n commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2917999199 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
