[ https://issues.apache.org/jira/browse/SPARK-29832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975799#comment-16975799 ]
Sean R. Owen commented on SPARK-29832: -------------------------------------- [~spark_cachecheck] some of these may be valid, but a lot of them don't appear to be. Let's tackle a few before opening the 30 JIRAs you did -- that's very noisy. run() is going to do use (a transform of) this input many times in a loop. I don't think this analysis is accurate. You could say that it's more optimal to persist stuff closer to where the loop is; sometimes it's better, sometimes, not, depends on how big and expensive the result is. > Unnecessary persist on instances in ml.regression.IsotonicRegression.fit > ------------------------------------------------------------------------ > > Key: SPARK-29832 > URL: https://issues.apache.org/jira/browse/SPARK-29832 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 3.0.0 > Reporter: Dong Wang > Priority: Major > > Persist on instances in ml.regression.IsotonicRegression.fit() is > unnecessary, because it is only used once in run(instances). > {code:scala} > override def fit(dataset: Dataset[_]): IsotonicRegressionModel = > instrumented { instr => > transformSchema(dataset.schema, logging = true) > // Extract columns from data. If dataset is persisted, do not persist > oldDataset. > val instances = extractWeightedLabeledPoints(dataset) > val handlePersistence = dataset.storageLevel == StorageLevel.NONE > // Unnecessary persist > if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK) > instr.logPipelineStage(this) > instr.logDataset(dataset) > instr.logParams(this, labelCol, featuresCol, weightCol, predictionCol, > featureIndex, isotonic) > instr.logNumFeatures(1) > val isotonicRegression = new > MLlibIsotonicRegression().setIsotonic($(isotonic)) > val oldModel = isotonicRegression.run(instances) // Only use once here > if (handlePersistence) instances.unpersist() > {code} > This issue is reported by our tool CacheCheck, which is used to dynamically > detecting persist()/unpersist() api misuses. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org