[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23016 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...
Github user shahidki31 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23016#discussion_r234396276

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
@@ -174,6 +174,10 @@ class PrefixSpan private (
     val freqSequences = results.map { case (seq: Array[Int], count: Long) =>
       new FreqSequence(toPublicRepr(seq), count)
     }
+    // Cache the final RDD to the same storage level as input
+    freqSequences.persist(data.getStorageLevel)
--- End diff --

@srowen Yes, that is the correct approach. I have updated the code. Thanks.
[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23016#discussion_r234395721

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
@@ -174,6 +174,10 @@ class PrefixSpan private (
     val freqSequences = results.map { case (seq: Array[Int], count: Long) =>
       new FreqSequence(toPublicRepr(seq), count)
     }
+    // Cache the final RDD to the same storage level as input
+    freqSequences.persist(data.getStorageLevel)
--- End diff --

The problem here is that freqSequences won't actually be persisted until something materializes it, and by that point the RDD it depends on, dataInternalRepr, has already been unpersisted. I'd say that _if_ the input's storage level isn't NONE, then persist freqSequences at the same level and .count() it to materialize it. Then unpersist dataInternalRepr in either case.
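The ordering suggested in this comment can be sketched as below. This is a minimal illustration under stated assumptions, not the code that was merged: the helper `finish`, the object name, and the toy RDDs in `main` are hypothetical, and the local master setting is only there so the sketch can run standalone. It uses only `RDD.getStorageLevel`, `persist`, `count`, and `unpersist`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistBeforeUnpersist {

  // Hypothetical helper for illustration; not the code merged in the pull request.
  // input        -> the caller's data (plays the role of `data`)
  // intermediate -> an internally cached RDD (plays the role of `dataInternalRepr`)
  // result       -> the RDD handed back to the caller (plays the role of `freqSequences`)
  def finish[T](input: RDD[_], intermediate: RDD[_], result: RDD[T]): RDD[T] = {
    val level = input.getStorageLevel
    if (level != StorageLevel.NONE) {
      // persist() only marks the RDD; nothing is cached until an action runs.
      result.persist(level)
      // Materialize now, while the cached intermediate RDD is still available.
      result.count()
    }
    // Release the intermediate RDD whether or not the result was persisted.
    intermediate.unpersist()
    result
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-before-unpersist").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val input = sc.parallelize(1 to 100).persist(StorageLevel.MEMORY_ONLY)
      val intermediate = input.map(_ * 2).persist(StorageLevel.MEMORY_AND_DISK)
      val result = finish(input, intermediate, intermediate.filter(_ % 3 == 0))
      println(s"intermediate storage level: ${intermediate.getStorageLevel}")
      println(s"result storage level: ${result.getStorageLevel}")
    } finally {
      sc.stop()
    }
  }
}
```

The point of the design is the order of operations: the action on the result runs before the intermediate RDD is released, so the result is computed from cached data rather than recomputed from scratch later.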
[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...
GitHub user shahidki31 opened a pull request:

    https://github.com/apache/spark/pull/23016

    [SPARK-26006][mllib] unpersist 'dataInternalRepr' in the PrefixSpan

    ## What changes were proposed in this pull request?

    In MLlib's PrefixSpan, the run method caches an RDD ('dataInternalRepr') and never releases it, so the RDD remains in the cache after run completes. We need to unpersist the cached RDD at the end of the run method.

    ## How was this patch tested?

    Existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shahidki31/spark SPARK-26006

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/23016.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23016

commit 2e3b0891b54dd12441d8c55230837bc182e11608
Author: Shahid
Date:   2018-11-12T14:08:33Z

    unpersist 'dataInternalRepr' after run method
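One way to observe the behaviour described in this pull request is to list the RDDs that the SparkContext still considers persisted after run returns. The sketch below is an assumption-laden illustration, not part of the patch: the object name, the toy sequences, and the local master setting are hypothetical. The input is deliberately left uncached, so anything reported was persisted inside run itself. What it prints depends on the Spark version in use.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.fpm.PrefixSpan

object PrefixSpanCacheCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prefixspan-cache-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy input, left uncached on purpose (illustrative data only).
      val sequences = sc.parallelize(Seq(
        Array(Array(1, 2), Array(3)),
        Array(Array(1), Array(3, 2), Array(1, 2)),
        Array(Array(1, 2), Array(5)),
        Array(Array(6))
      ), 2)

      val model = new PrefixSpan()
        .setMinSupport(0.5)
        .setMaxPatternLength(5)
        .run(sequences)
      model.freqSequences.count()

      // getPersistentRDDs maps RDD id -> RDD for everything still marked as persisted.
      val leftovers = sc.getPersistentRDDs
      if (leftovers.isEmpty) {
        println("No RDDs remain persisted after run")
      } else {
        leftovers.foreach { case (id, rdd) =>
          println(s"RDD $id is still persisted at ${rdd.getStorageLevel}")
        }
      }
    } finally {
      sc.stop()
    }
  }
}
```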