Dong Wang created SPARK-29813:
---------------------------------

             Summary: Missing persist in mllib.PrefixSpan.findFrequentItems()
                 Key: SPARK-29813
                 URL: https://issues.apache.org/jira/browse/SPARK-29813
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 2.4.3
            Reporter: Dong Wang

This piece of code chains reduceByKey, sortBy, and collect. Both sortBy (which samples its input to compute range-partition bounds) and collect trigger a job, but the intermediate data is not persisted, so it can be recomputed.

{code:scala}
  private[fpm] def findFrequentItems[Item: ClassTag](
      data: RDD[Array[Array[Item]]],
      minCount: Long): Array[Item] = {
    data.flatMap { itemsets =>
      val uniqItems = mutable.Set.empty[Item]
      itemsets.foreach(set => uniqItems ++= set)
      uniqItems.toIterator.map((_, 1L))
    }.reduceByKey(_ + _).filter { case (_, count) =>
      count >= minCount
    }.sortBy(-_._2).map(_._1).collect()
  }
{code}

This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() API.
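
One possible shape for a fix is sketched below. It is an illustration only, not the patch applied to Spark. The object name FindFrequentItemsSketch, the method name findFrequentItemsPersisted, the choice to cache the filtered counts (rather than the input data), and the MEMORY_AND_DISK storage level are all assumptions made for this example.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object FindFrequentItemsSketch {
  // Hypothetical variant of findFrequentItems; not the actual Spark change.
  def findFrequentItemsPersisted[Item: ClassTag](
      data: RDD[Array[Array[Item]]],
      minCount: Long): Array[Item] = {
    // Assumption: the filtered counts are the RDD worth caching, because they
    // are typically read twice, once by the sampling job that sortBy runs to
    // pick range-partition bounds and once by the job that collect() runs.
    val freqCounts = data.flatMap { itemsets =>
      val uniqItems = mutable.Set.empty[Item]
      itemsets.foreach(set => uniqItems ++= set)
      uniqItems.iterator.map((_, 1L))
    }.reduceByKey(_ + _).filter { case (_, count) =>
      count >= minCount
    }.persist(StorageLevel.MEMORY_AND_DISK)

    try {
      // Sort by descending count and return only the items.
      freqCounts.sortBy(-_._2).map(_._1).collect()
    } finally {
      // Release the cached blocks once the result is on the driver.
      freqCounts.unpersist()
    }
  }
}
```

Whether the cache pays off depends on how large the filtered counts are compared with the cost of re-reading the shuffle output, so this would need to be measured before adopting it.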