[GitHub] [spark] zhengruifeng commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-21 Thread via GitHub
zhengruifeng commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1478764552 @srowen sounds reasonable

[GitHub] [spark] zhengruifeng commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-20 Thread via GitHub
zhengruifeng commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1477193966 TL;DR: I want to apply a scalar subquery to optimize `FPGrowthModel.transform`. There are two options: 1) create temp views and use `spark.sql`, see
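Option 1 (temp views plus `spark.sql`), as far as it can be reconstructed from the truncated comment, can be sketched as follows. The view names, column names, and the exact subquery predicate below are illustrative assumptions, not the PR's actual code:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy stand-ins for the real inputs (names and schemas are assumptions).
val dataset = Seq((0L, Seq("a", "b", "c")), (1L, Seq("a"))).toDF("id", "items")
val rules   = Seq((Seq("a"), Seq("b"))).toDF("antecedent", "consequent")

dataset.createOrReplaceTempView("txns")
rules.createOrReplaceTempView("rules")

// Correlated scalar subquery: for each transaction, collect the consequents
// of every rule whose antecedent is fully contained in the row's items.
val predicted = spark.sql("""
  SELECT t.*,
         (SELECT flatten(collect_list(r.consequent))
          FROM rules r
          WHERE size(array_intersect(t.items, r.antecedent)) = size(r.antecedent)
         ) AS prediction
  FROM txns t""")
```

Whether the optimizer can decorrelate this exact non-equality predicate depends on the Spark version; the follow-up comments in this thread discuss precisely those trade-offs.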

[GitHub] [spark] zhengruifeng commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-16 Thread via GitHub
zhengruifeng commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1473064231 @srowen if the latest performance test seems fine, then I'd ask the SQL guys whether we can have a subquery method in DataFrame APIs.
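To make the wish concrete: a "subquery method in DataFrame APIs" might look something like the sketch below. This is purely hypothetical; no such public method existed in Spark's Dataset API at the time of this thread, and `scalarSubquery` here is an invented name:

```scala
import org.apache.spark.sql.functions._

// Hypothetical API only: something like this would remove the need for the
// temp-view + spark.sql workaround described in the earlier comments.
val predicted = dataset.select(
  col("*"),
  scalarSubquery {                       // hypothetical constructor
    rules
      .where(size(array_intersect(col("items"), col("antecedent"))) ===
             size(col("antecedent")))
      .agg(flatten(collect_list(col("consequent"))))
  }.as("prediction"))
```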

[GitHub] [spark] zhengruifeng commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-14 Thread via GitHub
zhengruifeng commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1467821846 yes, the `BroadcastNestedLoopJoin` is slower. I then tried a subquery, which is faster in both execution and analysis, but I have to create temp views and write the SQL
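The join-based path the comment calls slower would look roughly like this (column names assumed). Because "antecedent is a subset of items" is a non-equi condition, Spark cannot plan a broadcast hash join and falls back to `BroadcastNestedLoopJoin`:

```scala
import org.apache.spark.sql.functions._

// Sketch of the join-based alternative (the slower path, per the comment).
// The subset test is a non-equi join condition, so Spark plans a
// BroadcastNestedLoopJoin rather than a hash join.
val joined = dataset
  .join(broadcast(rules),
    size(array_intersect(col("items"), col("antecedent"))) === size(col("antecedent")),
    "left_outer")
  .groupBy(col("id"), col("items"))
  .agg(flatten(collect_list(col("consequent"))).as("prediction"))
```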

[GitHub] [spark] zhengruifeng commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-13 Thread via GitHub
zhengruifeng commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1465662299 I did a quick test with the `T10I4D100K` dataset from http://fimi.uantwerpen.be/data/ fit: ``` scala> val df =
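The quoted fit snippet is cut off by the archive. A minimal reconstruction of loading T10I4D100K (one space-separated transaction per line) and fitting `FPGrowth` might look like this; the file path and the support/confidence thresholds are placeholders, not the values the comment's benchmark used:

```scala
import org.apache.spark.ml.fpm.FPGrowth
import org.apache.spark.sql.functions._

// T10I4D100K stores one transaction per line as space-separated item ids.
val df = spark.read.text("/tmp/T10I4D100K.dat")
  .select(split(trim(col("value")), "\\s+").as("items"))

// Thresholds below are assumptions for illustration only.
val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.001)
  .setMinConfidence(0.8)
  .fit(df)
```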