[ https://issues.apache.org/jira/browse/SPARK-29827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971336#comment-16971336 ]
shahid commented on SPARK-29827:
--------------------------------

I would like to analyze this issue.

> Wrong persist strategy in mllib.clustering.BisectingKMeans.run
> --------------------------------------------------------------
>
>                 Key: SPARK-29827
>                 URL: https://issues.apache.org/jira/browse/SPARK-29827
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.4.3
>            Reporter: Dong Wang
>            Priority: Major
>
> There are three persist misuses in mllib.clustering.BisectingKMeans.run:
> * First, the RDD _input_ should be persisted: it is used not only by the action _first()_ but also by other actions later in the method.
> * Second, the RDD _assignments_ should be persisted. It is passed to the function _summarize()_ more than once, and each call runs an action on _assignments_.
> * Third, once the RDD _assignments_ is persisted, persisting the RDD _norms_ becomes unnecessary. _norms_ is only an intermediate RDD; since its child RDD _assignments_ is persisted, there is no need to persist _norms_ as well.
> {code:scala}
> private[spark] def run(
>     input: RDD[Vector],
>     instr: Option[Instrumentation]): BisectingKMeansModel = {
>   if (input.getStorageLevel == StorageLevel.NONE) {
>     logWarning(s"The input RDD ${input.id} is not directly cached, which may hurt performance if"
>       + " its parent RDDs are also not cached.")
>   }
>   // Needs to persist input
>   val d = input.map(_.size).first()
>   logInfo(s"Feature dimension: $d.")
>   val dMeasure: DistanceMeasure = DistanceMeasure.decodeFromString(this.distanceMeasure)
>   // Compute and cache vector norms for fast distance computation.
>   val norms = input.map(v => Vectors.norm(v, 2.0)).persist(StorageLevel.MEMORY_AND_DISK) // Unnecessary persist
>   val vectors = input.zip(norms).map { case (x, norm) => new VectorWithNorm(x, norm) }
>   var assignments = vectors.map(v => (ROOT_INDEX, v)) // Needs to persist
>   var activeClusters = summarize(d, assignments, dMeasure)
> {code}
> This issue is reported by our tool CacheCheck, which dynamically detects persist()/unpersist() API misuses.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
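A minimal sketch of the persist strategy the report suggests, not the actual Spark patch. It assumes a SparkContext is available, and the names `ROOT_INDEX`, `VectorWithNorm`, and the omitted bisecting loop stand in for the private MLlib members; only the persist()/unpersist() placement is the point:

{code:scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical stand-ins for the private MLlib members referenced above.
val ROOT_INDEX: Long = 1L
case class VectorWithNorm(vector: Vector, norm: Double)

def runSketch(input: RDD[Vector]): Unit = {
  // 1. Persist `input`: it feeds the action first() here and further
  //    actions later in the method.
  input.persist(StorageLevel.MEMORY_AND_DISK)
  val d = input.map(_.size).first()

  // 2. Leave `norms` unpersisted: it is an intermediate RDD whose child
  //    (`assignments`) is cached instead, so caching it too is redundant.
  val norms = input.map(v => Vectors.norm(v, 2.0))
  val vectors = input.zip(norms).map { case (x, norm) => VectorWithNorm(x, norm) }

  // 3. Persist `assignments`: each summarize() call runs an action on it.
  val assignments = vectors.map(v => (ROOT_INDEX, v))
    .persist(StorageLevel.MEMORY_AND_DISK)

  // ... the bisecting loop would call summarize(d, assignments, ...) here ...

  // Release the caches once no downstream computation needs them.
  assignments.unpersist()
  input.unpersist()
}
{code}

The caching rule of thumb this follows: persist an RDD that multiple actions will recompute, and cache the furthest-downstream RDD that is reused rather than its ancestors.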