EnricoMi commented on PR #37360:
URL: https://github.com/apache/spark/pull/37360#issuecomment-1477427083
@xinrong-meng what do you think about this?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
EnricoMi commented on PR #37360:
URL: https://github.com/apache/spark/pull/37360#issuecomment-1274838048
@zhengruifeng how do you feel about this potential performance improvement?
EnricoMi commented on PR #37360:
URL: https://github.com/apache/spark/pull/37360#issuecomment-1272901268
@HyukjinKwon Two options here:
- provide an alternative for `applyInPandas` that takes the same user function signature in batch mode
- Python
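A hedged sketch of what the first option means, using plain pandas with no Spark dependency (`iter_batches` and the data are purely illustrative; only the per-group signature matches `applyInPandas`):

```python
import pandas as pd

# Current applyInPandas contract: the user function receives one whole
# group as a single pandas DataFrame.
def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(total=pdf["v"].sum())

# Hypothetical batched alternative: same user-function signature, but each
# group may be delivered as several smaller DataFrames.
def iter_batches(pdf: pd.DataFrame, batch_size: int):
    for start in range(0, len(pdf), batch_size):
        yield pdf.iloc[start:start + batch_size]

df = pd.DataFrame({"k": [1, 1, 1, 2], "v": [10, 20, 30, 40]})
# One call per group (current behaviour)...
groups = [per_group(g) for _, g in df.groupby("k")]
# ...versus several calls per group in batch mode.
batches = [b for _, g in df.groupby("k") for b in iter_batches(g, 2)]
```

The point of keeping the signature identical is that existing user functions could opt into batching without being rewritten, as long as they are batch-safe (see the semantics concern quoted below in the thread).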
EnricoMi commented on PR #37360:
URL: https://github.com/apache/spark/pull/37360#issuecomment-1228293766
Here is a benchmark (core seconds for 10m rows) on the batched `applyInPandasBatched` with batch sizes `65536`, `1024`, `16`:

| group size | no batch | 65536 | 1024 | 16 |
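The truncated table can't be reproduced here, but the shape of such a benchmark can be sketched without Spark (pure pandas, row count shrunk from 10m; `apply_batched` is illustrative, not the PR's implementation):

```python
import time
import numpy as np
import pandas as pd

def apply_batched(pdf: pd.DataFrame, func, batch_size: int) -> pd.DataFrame:
    # Call `func` once per batch instead of once per whole DataFrame,
    # mimicking the batching under discussion.
    parts = [func(pdf.iloc[i:i + batch_size])
             for i in range(0, len(pdf), batch_size)]
    return pd.concat(parts, ignore_index=True)

pdf = pd.DataFrame({"v": np.arange(100_000)})
for batch_size in (65536, 1024, 16):
    start = time.perf_counter()
    result = apply_batched(pdf, lambda b: b * 2, batch_size)
    elapsed = time.perf_counter() - start
    # Smaller batches mean more Python-level calls per group, so per-call
    # overhead grows as batch size shrinks.
```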
EnricoMi commented on PR #37360:
URL: https://github.com/apache/spark/pull/37360#issuecomment-1202140475
> Hm, the general idea might be fine but I think the implementation is the problem. For example, the current design is that the user defined `function` always takes one group for `pdf`.
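The concern in the quote is concrete: a function written against the one-group-per-`pdf` contract can silently change meaning if it is instead fed batches. A small pandas illustration (names and data hypothetical):

```python
import pandas as pd

def demean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Only correct if `pdf` is the whole group: the mean is group-wide.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

group = pd.DataFrame({"v": [1.0, 2.0, 3.0, 4.0]})
whole = demean(group)  # mean taken over all four rows
halves = pd.concat(    # same function fed the group as two batches
    [demean(group.iloc[:2]), demean(group.iloc[2:])],
    ignore_index=True,
)
# whole and halves differ: each batch was de-meaned against its own mean,
# so batching changed the semantics of `demean` without any error.
```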