[GitHub] [spark] EnricoMi commented on pull request #37360: [SPARK-39931][PYTHON][WIP] Improve applyInPandas performance for very small groups

2023-03-21 Thread via GitHub
EnricoMi commented on PR #37360: URL: https://github.com/apache/spark/pull/37360#issuecomment-1477427083 @xinrong-meng what do you think about this? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[GitHub] [spark] EnricoMi commented on pull request #37360: [SPARK-39931][PYTHON][WIP] Improve applyInPandas performance for very small groups

2022-10-11 Thread GitBox
EnricoMi commented on PR #37360: URL: https://github.com/apache/spark/pull/37360#issuecomment-1274838048 @zhengruifeng how do you feel about this potential performance improvement?

[GitHub] [spark] EnricoMi commented on pull request #37360: [SPARK-39931][PYTHON][WIP] Improve applyInPandas performance for very small groups

2022-10-10 Thread GitBox
EnricoMi commented on PR #37360: URL: https://github.com/apache/spark/pull/37360#issuecomment-1272901268 @HyukjinKwon Two options here:
- provide an alternative for `applyInPandas` that takes the same user function signature in batch mode
- Python

[GitHub] [spark] EnricoMi commented on pull request #37360: [SPARK-39931][PYTHON][WIP] Improve applyInPandas performance for very small groups

2022-08-26 Thread GitBox
EnricoMi commented on PR #37360: URL: https://github.com/apache/spark/pull/37360#issuecomment-1228293766 Here is a benchmark (core seconds for 10m rows) on the batched `applyInPandasBatched` with batch sizes `65536`, `1024`, `16`: | group size | no batch | 65535 | 1024 | 16 | |

[GitHub] [spark] EnricoMi commented on pull request #37360: [SPARK-39931][PYTHON][WIP] Improve applyInPandas performance for very small groups

2022-08-02 Thread GitBox
EnricoMi commented on PR #37360: URL: https://github.com/apache/spark/pull/37360#issuecomment-1202140475 > Hm, the general idea might be fine but I think the implementation is the problem. For example, the current design is that the user defined `function` always takes one group for `pdf`.
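
To make the quoted point concrete, here is a minimal sketch in plain pandas (not Spark, and not the code from PR #37360) of a user function that always receives one group as `pdf`, together with one possible way to keep that contract when several small groups arrive in a single batch. The helper `apply_batched`, the function `per_group`, and the column names `id` and `v` are hypothetical and exist only for this illustration.

```python
import pandas as pd


def per_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # User function with the one-group contract: pdf holds exactly one group.
    return pdf.assign(centered=pdf["v"] - pdf["v"].mean())


def apply_batched(batch: pd.DataFrame, key: str, func) -> pd.DataFrame:
    # Hypothetical helper (an assumption, not Spark API): the batch contains
    # several small groups; split it locally by key and call func once per
    # group, so the user function still only ever sees a single group.
    parts = [func(group) for _, group in batch.groupby(key, sort=False)]
    return pd.concat(parts, ignore_index=True)


# One batch carrying three very small groups (ids 1, 2 and 3).
batch = pd.DataFrame({"id": [1, 1, 2, 2, 3], "v": [1.0, 3.0, 10.0, 20.0, 5.0]})
result = apply_batched(batch, "id", per_group)
print(result)
```

The sketch only illustrates the calling contract; it says nothing about how the PR transfers batches between the JVM and Python workers, which is where the measured cost for very small groups would arise.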