Nick Pentreath created SPARK-18608:
--------------------------------------

             Summary: Spark ML algorithms that check RDD cache level for internal caching double-cache data
                 Key: SPARK-18608
                 URL: https://issues.apache.org/jira/browse/SPARK-18608
             Project: Spark
          Issue Type: Bug
          Components: ML
            Reporter: Nick Pentreath
Some algorithms in Spark ML (e.g. {{LogisticRegression}}, {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence internally. They check whether the input dataset is cached, and if not they cache it for performance. However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. This check is effectively always true: even when the dataset itself is cached, {{dataset.rdd}} returns a freshly created RDD, which is not cached. Hence if the input dataset is cached, the data ends up being cached twice, which is wasteful. To see this:

{code}
scala> import org.apache.spark.storage.StorageLevel
import org.apache.spark.storage.StorageLevel

scala> val df = spark.range(10).toDF("num")
df: org.apache.spark.sql.DataFrame = [num: bigint]

scala> df.storageLevel == StorageLevel.NONE
res0: Boolean = true

scala> df.persist
res1: df.type = [num: bigint]

scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
res2: Boolean = true

scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
res3: Boolean = false

scala> df.rdd.getStorageLevel == StorageLevel.NONE
res4: Boolean = true
{code}

Before SPARK-16063, there was no way to check the storage level of the input {{Dataset}}, but now there is, so the checks should be migrated to use {{dataset.storageLevel}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
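As a rough sketch of the proposed migration (the {{handlePersistence}} / {{instances}} names follow the pattern used in the existing {{LogisticRegression}} code; the exact shape varies per algorithm):

{code}
// Current check -- effectively always true, since dataset.rdd returns a
// fresh, uncached RDD even when the Dataset itself is persisted:
val handlePersistence = dataset.rdd.getStorageLevel == StorageLevel.NONE

// Proposed check, using the Dataset-level storage level added by SPARK-16063:
val handlePersistence = dataset.storageLevel == StorageLevel.NONE

if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
// ... iterative training over instances ...
if (handlePersistence) instances.unpersist()
{code}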