[ 
https://issues.apache.org/jira/browse/SPARK-29827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971336#comment-16971336
 ] 

shahid commented on SPARK-29827:
--------------------------------

I would like to analyze this issue

> Wrong persist strategy in mllib.clustering.BisectingKMeans.run
> --------------------------------------------------------------
>
>                 Key: SPARK-29827
>                 URL: https://issues.apache.org/jira/browse/SPARK-29827
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.4.3
>            Reporter: Dong Wang
>            Priority: Major
>
> There are three persist misuses in mllib.clustering.BisectingKMeans.run.
>  * First, the rdd {color:#de350b}_input_{color} should be persisted, because 
> it was not only used by the action _first(),_ but also used by other __ 
> actions in the following code.
>  * Second, the rdd {color:#de350b}_assignments_{color} should be persisted. 
> It was used in the fuction _summarize()_ more than once, which containts an 
> action on _assignments_.
>  * Third, once the rdd _{color:#de350b}assignments{color}_ is persisted_,_ 
> persisting the rdd {color:#de350b}_norms_{color} would be unnecessary. 
> Because {color:#de350b}_norms_ {color} is an intermediate rdd. Since its 
> child rdd {color:#de350b}_assignments_{color} is persisted, it is unnecessary 
> to persist {color:#de350b}_norms_{color} anymore.
> {code:scala}
>   private[spark] def run(
>       input: RDD[Vector],
>       instr: Option[Instrumentation]): BisectingKMeansModel = {
>     if (input.getStorageLevel == StorageLevel.NONE) {
>       logWarning(s"The input RDD ${input.id} is not directly cached, which 
> may hurt performance if"
>         + " its parent RDDs are also not cached.")
>     }
>     // Needs to persist input
>     val d = input.map(_.size).first() 
>     logInfo(s"Feature dimension: $d.")
>     val dMeasure: DistanceMeasure = 
> DistanceMeasure.decodeFromString(this.distanceMeasure)
>     // Compute and cache vector norms for fast distance computation.
>     val norms = input.map(v => Vectors.norm(v, 
> 2.0)).persist(StorageLevel.MEMORY_AND_DISK)  // Unnecessary persist
>     val vectors = input.zip(norms).map { case (x, norm) => new 
> VectorWithNorm(x, norm) }
>     var assignments = vectors.map(v => (ROOT_INDEX, v))  // Needs to persist
>     var activeClusters = summarize(d, assignments, dMeasure)
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically 
> detecting persist()/unpersist() api misuses.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to