[GitHub] [spark] srowen commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-20 Thread via GitHub

srowen commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1477246959 I don't know enough to say whether it's worth a new method. Can we start with the change that needs no new API? Is it a big enough win?

[GitHub] [spark] srowen commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-16 Thread via GitHub

srowen commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1473071461 If it's faster and gives the right answers, sure

[GitHub] [spark] srowen commented on pull request #40263: [SPARK-42659][ML] Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-13 Thread via GitHub

srowen commented on PR #40263: URL: https://github.com/apache/spark/pull/40263#issuecomment-1466280043 So this seems slower on a medium-sized data set. I don't know if delaying the collect() matters much; the overall execution time matters. I'm worried that this gets much slower on 1M or 10