[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...

2018-11-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/23016


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...

2018-11-16 Thread shahidki31
Github user shahidki31 commented on a diff in the pull request:

https://github.com/apache/spark/pull/23016#discussion_r234396276
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
@@ -174,6 +174,10 @@ class PrefixSpan private (
 val freqSequences = results.map { case (seq: Array[Int], count: Long) =>
   new FreqSequence(toPublicRepr(seq), count)
 }
+// Cache the final RDD to the same storage level as input
+freqSequences.persist(data.getStorageLevel)
--- End diff --

@srowen Yes, that is the correct approach. I have updated the code. Thanks.





[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...

2018-11-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/23016#discussion_r234395721
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala ---
@@ -174,6 +174,10 @@ class PrefixSpan private (
 val freqSequences = results.map { case (seq: Array[Int], count: Long) =>
   new FreqSequence(toPublicRepr(seq), count)
 }
+// Cache the final RDD to the same storage level as input
+freqSequences.persist(data.getStorageLevel)
--- End diff --

The problem here is that it won't get persisted until something 
materializes it, and at that point its dependent RDD dataInternalRepr is 
already unpersisted.

I'd say that _if_ the input's storage level isn't NONE, then persist 
freqSequences at the same level and .count() it to materialize it. Then 
unpersist dataInternalRepr in all events.
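
The sequence suggested above can be sketched as follows. This is only an illustration under stated assumptions, not the code that was merged: `PersistSketch` and `materializeThenRelease` are hypothetical names, and the three parameters stand in for `data`, `dataInternalRepr`, and `freqSequences`. It relies only on the existing RDD calls `getStorageLevel`, `persist`, `count`, and `unpersist`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper for illustration; not part of Spark or of this PR.
object PersistSketch {

  /**
   * input        - the caller's RDD (stands in for `data`)
   * intermediate - a cached internal RDD (stands in for `dataInternalRepr`)
   * output       - the result derived from `intermediate` (stands in for `freqSequences`)
   */
  def materializeThenRelease[T](
      input: RDD[_],
      intermediate: RDD[_],
      output: RDD[T]): RDD[T] = {
    val level = input.getStorageLevel
    if (level != StorageLevel.NONE) {
      // persist() only marks the RDD; nothing is stored until an action runs.
      // Assumes `output` has no storage level assigned yet.
      output.persist(level)
      // Run an action now, while `intermediate` is still cached.
      output.count()
    }
    // Release the intermediate RDD whether or not the output was persisted.
    intermediate.unpersist()
    output
  }
}
```

If the input is not cached, the output is left unpersisted and is recomputed from its lineage whenever an action is run on it.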





[GitHub] spark pull request #23016: [SPARK-26006][mllib] unpersist 'dataInternalRepr'...

2018-11-12 Thread shahidki31
GitHub user shahidki31 opened a pull request:

https://github.com/apache/spark/pull/23016

[SPARK-26006][mllib] unpersist 'dataInternalRepr' in the PrefixSpan

## What changes were proposed in this pull request?
In MLlib's PrefixSpan, the `run` method caches an RDD (`dataInternalRepr`) that 
remains in the cache after `run` has completed.
We need to unpersist the cached RDD at the end of the `run` method.


## How was this patch tested?
Existing tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/shahidki31/spark SPARK-26006

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/23016.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #23016


commit 2e3b0891b54dd12441d8c55230837bc182e11608
Author: Shahid 
Date:   2018-11-12T14:08:33Z

unpersist 'dataInternalRepr' after run method



