[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395599#comment-14395599 ]

Sean Owen commented on SPARK-3625:
----------------------------------

Why would this change be necessary in order to use checkpointing? See the discussion above.

> In some cases, the RDD.checkpoint does not work
> -----------------------------------------------
>
>                 Key: SPARK-3625
>                 URL: https://issues.apache.org/jira/browse/SPARK-3625
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.0.2, 1.1.0
>            Reporter: Guoqiang Li
>            Assignee: Guoqiang Li
>
> Code to reproduce:
> {code}
> sc.setCheckpointDir(checkpointDir)
> val c = sc.parallelize((1 to 1000)).map(_ + 1)
> c.count
> val dep = c.dependencies.head.rdd
> c.checkpoint()
> c.count
> assert(dep != c.dependencies.head.rdd)
> {code}
> This limit is too strict; it makes it difficult to implement SPARK-3623.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395613#comment-14395613 ]

Guoqiang Li commented on SPARK-3625:
------------------------------------

Sometimes we cannot decide whether to call RDD.checkpoint before any job has been executed on the RDD. Just like [PeriodicGraphCheckpointer|https://github.com/apache/spark/blob/branch-1.3/mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointer.scala].
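The pattern referenced here can be sketched as follows. This is only an illustration of the use case, not code from PeriodicGraphCheckpointer or any PR: `initialRdd`, `step`, `numIterations`, and `checkpointInterval` are hypothetical names, and a live SparkContext `sc` with a checkpoint directory already set is assumed. Each iteration derives a new RDD and immediately runs a job on it, so the decision to checkpoint must be made inside the loop, before that iteration's first action:

{code}
// Hypothetical iterative job, in the style of PeriodicGraphCheckpointer.
// Assumes sc.setCheckpointDir(...) has already been called.
var data = initialRdd                // RDD produced by earlier setup
for (i <- 1 to numIterations) {
  data = step(data)                  // derive this iteration's RDD
  data.cache()                       // avoid recomputation on checkpoint write
  if (i % checkpointInterval == 0) {
    data.checkpoint()                // must precede the first job on data
  }
  data.count()                       // the job for this iteration
}
{code}

Because checkpoint() is only honored before the first job on an RDD, the checkpointing decision has to be threaded into the loop itself rather than made after observing the lineage grow.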
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395492#comment-14395492 ]

Guoqiang Li commented on SPARK-3625:
------------------------------------

When we run machine learning and graph algorithms, this feature is very necessary. I think we should merge PR 2480 to master.
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14156663#comment-14156663 ]

Apache Spark commented on SPARK-3625:
-------------------------------------

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2631
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143153#comment-14143153 ]

Sean Owen commented on SPARK-3625:
----------------------------------

This prints 1000 both times for me, which is correct. When you say it doesn't work, could you please elaborate? A different count? An exception? What is your environment?
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143342#comment-14143342 ]

Sean Owen commented on SPARK-3625:
----------------------------------

It still prints 1000 both times, which is correct. Your assertion is about something different. The assertion fails, but the behavior you are asserting is not what the javadoc suggests:

{quote}
Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint directory set with SparkContext.setCheckpointDir() and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended that this RDD is persisted in memory, otherwise saving it on a file will require recomputation.
{quote}

This example calls count() before checkpoint(). If you don't, I think you get the expected behavior, since the dependency becomes a CheckpointRDD. This does not look like a bug.
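The ordering the javadoc requires can be sketched by reordering the reproduction: a minimal sketch, assuming the same `sc` and `checkpointDir` as in the issue description. With checkpoint() marked before any job runs on the RDD, the next action materializes the checkpoint and the parent dependency is replaced:

{code}
sc.setCheckpointDir(checkpointDir)
val c = sc.parallelize(1 to 1000).map(_ + 1)
val dep = c.dependencies.head.rdd       // original parent RDD
c.cache()                               // recommended, so the checkpoint
                                        // write does not recompute c
c.checkpoint()                          // marked before any job on c
c.count                                 // first action: computes, then
                                        // writes the checkpoint
assert(dep != c.dependencies.head.rdd)  // parent replaced by a CheckpointRDD
{code}

The only difference from the reproduction in the issue description is that count() is not called before checkpoint().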
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143410#comment-14143410 ]

Guoqiang Li commented on SPARK-3625:
------------------------------------

OK, it has been changed to an improvement. This limit is too strict; SPARK-3623 relies on this.
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142471#comment-14142471 ]

Apache Spark commented on SPARK-3625:
-------------------------------------

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2480