Dong Wang created SPARK-29856:
---------------------------------

             Summary: Conditional unnecessary persist on RDDs in ML algorithms
                 Key: SPARK-29856
                 URL: https://issues.apache.org/jira/browse/SPARK-29856
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
    Affects Versions: 3.0.0
            Reporter: Dong Wang


When I run example.ml.GradientBoostedTreeRegressorExample, I find that RDD 
_{color:#DE350B}baggedInput{color}_ in _ml.tree.impl.RandomForest.run()_ is 
persisted but only used once, so in this case the persist operation is 
unnecessary.

{code:scala}
    val baggedInput = BaggedPoint
      .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
        withReplacement, (tp: TreePoint) => tp.weight, seed = seed)
      .persist(StorageLevel.MEMORY_AND_DISK)
    ...
    while (nodeStack.nonEmpty) {
      ...
      timer.start("findBestSplits")
      RandomForest.findBestSplits(baggedInput, metadata, topNodesForGroup,
        nodesForGroup, treeToNodeToIndexInfo, splits, nodeStack, timer,
        nodeIdCache)
      timer.stop("findBestSplits")
    }
    baggedInput.unpersist()
{code}

However, the action on {color:#DE350B}_baggedInput_{color} occurs inside a 
while loop. In GradientBoostedTreeRegressorExample, this loop executes only 
once, so only one action uses {color:#DE350B}_baggedInput_{color}.
In most ML applications, the loop executes many times, which means 
{color:#DE350B}_baggedInput_{color} is used in many actions, and the persist 
is then necessary.
That is why the persist operation is "conditionally" unnecessary.
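
The idea can be sketched as gating the persist on whether more than one 
action will reuse the RDD. This is a hypothetical illustration, not Spark's 
actual code: the name _numIterations_ stands in for whatever loop bound is 
known before the persist decision, and in RandomForest.run() the real loop 
count over _nodeStack_ is not known up front, so such a guard would need a 
different condition in practice.

{code:scala}
// Hypothetical sketch: persist baggedInput only when more than one
// action will consume it. `numIterations` is an assumed, illustrative
// parameter, not an actual field of RandomForest.run().
val shouldPersist = numIterations > 1

val baggedInput = BaggedPoint
  .convertToBaggedRDD(treeInput, strategy.subsamplingRate, numTrees,
    withReplacement, (tp: TreePoint) => tp.weight, seed = seed)

if (shouldPersist) {
  baggedInput.persist(StorageLevel.MEMORY_AND_DISK)
}

while (nodeStack.nonEmpty) {
  // ... each findBestSplits call triggers an action on baggedInput
}

if (shouldPersist) {
  baggedInput.unpersist()
}
{code}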

The same situation exists in many other ML algorithms, e.g., RDD 
{color:#DE350B}_instances_{color} in ml.clustering.KMeans.fit() and RDD 
{color:#DE350B}_indices_{color} in mllib.clustering.BisectingKMeans.run().

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
