[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r113021061

--- Diff: docs/graphx-programming-guide.md ---
@@ -709,7 +709,8 @@ messages remaining.
 The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch*
 of its implementation (note: to avoid stackOverflowError due to long lineage chains, pregel support periodcally
-checkpoint graph and messages by setting "spark.graphx.pregel.checkpointInterval"):
+checkpoint graph and messages by setting "spark.graphx.pregel.checkpointInterval" to a positive number,
+say 10. And set checkpoint directory as well using SparkContext.setCheckpointDir(directory: String)):
--- End diff --

I referenced the value used in LDA and other ML algorithms in Spark; by default their checkpointInterval is set to 10.

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
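For readers following along, here is a minimal sketch of the configuration being discussed, assuming the property name and the default of 10 suggested in this thread; the application name and checkpoint directory path are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Enable periodic checkpointing in Pregel by setting the interval to a
// positive number (the thread suggests 10, matching LDA and other ML
// algorithms) and providing a checkpoint directory.
val conf = new SparkConf()
  .setAppName("PregelCheckpointExample") // hypothetical app name
  .set("spark.graphx.pregel.checkpointInterval", "10")

val sc = new SparkContext(conf)
// Required whenever checkpointing is enabled; without a checkpoint
// directory, checkpoint attempts fail at runtime.
sc.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path
```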
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user dding3 commented on the issue: https://github.com/apache/spark/pull/15125

OK, agreed. If the user didn't set a checkpoint directory while we turn on checkpointing in Pregel by default, there may be an exception. I will change the default value of spark.graphx.pregel.checkpointInterval to -1.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r103516676

--- Diff: docs/graphx-programming-guide.md ---
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function. These constraints allow additional optimization within GraphX.
 The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long lineage chains, graph and
--- End diff --

OK. I have removed the references to checkpointing from the sketch and documented the config property in the Spark configuration document in the latest update.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102294384

--- Diff: docs/graphx-programming-guide.md ---
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function. These constraints allow additional optimization within GraphX.
 The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long lineage chains, graph and
--- End diff --

About documenting this config property in the Spark configuration document: is it OK if I add a new GraphX section for the config, or should I just add it under an existing section, say "Execution Behavior"?
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102067500

--- Diff: docs/graphx-programming-guide.md ---
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function. These constraints allow additional optimization within GraphX.
 The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long lineage chains, graph and
--- End diff --

OK. My thinking was that in the original graphx-programming-guide only cache is called; now that we have added checkpointing and note it in the sketch, do we still need to show that information in the sketch itself? If there is agreement on removing all references to checkpointing, I will revert the changes.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102065803

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -154,7 +169,9 @@ object Pregel extends Logging {
 // count the iteration
 i += 1
 }
-messages.unpersist(blocking = false)
+messageCheckpointer.unpersistDataSet()
--- End diff --

I think the point is that we use messageCheckpointer.update to do the caching, so to pair with it we should use the checkpointer to unpersist the data as well. Please correct me if I misunderstand. I think it's fine to add this new method: since there is already a public method to cache data in PersistQueue, we should provide a public method to clean the queue.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101952678

--- Diff: docs/graphx-programming-guide.md ---
@@ -720,25 +722,53 @@ class GraphOps[VD, ED] {
   sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
   mergeMsg: (A, A) => A)
 : Graph[VD, ED] = {
-// Receive the initial message at each vertex
-var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
+val checkpointInterval = graph.vertices.sparkContext.getConf
--- End diff --

OK. I will change it back then.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101938997

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -122,27 +125,39 @@ object Pregel extends Logging {
 require(maxIterations > 0, s"Maximum number of iterations must be greater than 0," +
   s" but got ${maxIterations}")
-var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
+val checkpointInterval = graph.vertices.sparkContext.getConf
+  .getInt("spark.graphx.pregel.checkpointInterval", 10)
--- End diff --

I think so. Currently I added the documentation in graphx-programming-guide.md, but I am not sure if it's the right place; please let me know if there is a better place for it.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101838127

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -155,6 +169,8 @@ object Pregel extends Logging {
 i += 1
 }
 messages.unpersist(blocking = false)
--- End diff --

It looks like unpersist is a protected method and we cannot access it from Pregel. I added a new public method to unpersist the dataset to work around this.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101828986

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala ---
@@ -362,12 +362,14 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]) extends Seriali
 def pregel[A: ClassTag](
   initialMsg: A,
   maxIterations: Int = Int.MaxValue,
+  checkpointInterval: Int = 25,
--- End diff --

About the default value: I think we should set it to a positive value so the checkpoint operation is turned on by default, to avoid the stack-overflow exception. To align with other implementations in Spark, I would like to set 10 as the default. Please let me know if there are any thoughts on this.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101827588

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala ---
@@ -362,12 +362,14 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]) extends Seriali
 def pregel[A: ClassTag](
   initialMsg: A,
   maxIterations: Int = Int.MaxValue,
+  checkpointInterval: Int = 25,
--- End diff --

Agree with @mallman: we don't need to change the API interface if we use a config value to control the checkpoint interval. I will update the PR soon.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101823330

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -155,6 +169,8 @@ object Pregel extends Logging {
 i += 1
 }
 messages.unpersist(blocking = false)
+graphCheckpointer.deleteAllCheckpoints()
+messageCheckpointer.deleteAllCheckpoints()
--- End diff --

I think that when there is an exception during training, keeping the checkpoints gives the user a chance to recover from them. I checked RandomForest/GBT in Spark; it looks like they only delete the checkpoints when training has successfully finished.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101822138

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala ---
@@ -362,12 +362,14 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]) extends Seriali
 def pregel[A: ClassTag](
   initialMsg: A,
   maxIterations: Int = Int.MaxValue,
+  checkpointInterval: Int = 25,
--- End diff --

25 is the value from my test. I checked this value in LDA/ALS; it looks like their default is 10. Change it to 10?
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user dding3 commented on the issue: https://github.com/apache/spark/pull/15125

Both are OK for me. Please let me know if any update is needed from me.
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user dding3 commented on the issue: https://github.com/apache/spark/pull/15125

Thank you guys for reviewing the code. I have updated it based on the comments.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100680651

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
 var activeMessages = messages.count()
+
 // Loop
 var prevG: Graph[VD, ED] = null
 var i = 0
 while (activeMessages > 0 && i < maxIterations) {
 // Receive the messages and update the vertices.
 prevG = g
 g = g.joinVertices(messages)(vprog).cache()
--- End diff --

OK
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user dding3 commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100680648

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
--- End diff --

I think we need to cache graph/messages here so they don't have to be computed again in the loop. I agree with you: I will keep the checkpointer update calls and remove all .cache calls.
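The pattern being agreed on above can be sketched in isolation. This is only an illustrative model of a periodic checkpointer as described in this thread (update persists the new dataset, checkpoints it every checkpointInterval calls, and trims older persisted copies), not Spark's actual PeriodicCheckpointer code; the class and method names here are assumptions:

```scala
import scala.collection.mutable

// Sketch of the generic shape of a periodic checkpointer, as discussed in
// this thread. Calling update() replaces explicit .cache() calls, which is
// why the reviewers agree to drop the .cache calls in Pregel.
abstract class PeriodicCheckpointerSketch[T](checkpointInterval: Int) {
  private val persistedQueue = new mutable.Queue[T]()
  private var updateCount = 0

  protected def persist(data: T): Unit
  protected def unpersist(data: T): Unit
  protected def checkpoint(data: T): Unit

  /** Persist the new dataset and checkpoint it every `checkpointInterval`
    * updates to cut the lineage chain; keep only a few persisted copies. */
  def update(newData: T): Unit = {
    persist(newData)
    persistedQueue.enqueue(newData)
    updateCount += 1
    if (checkpointInterval > 0 && updateCount % checkpointInterval == 0) {
      checkpoint(newData)
    }
    while (persistedQueue.size > 3) {
      unpersist(persistedQueue.dequeue())
    }
  }

  /** The public unpersist method discussed in this thread: since update()
    * did the caching, this cleans out the whole persisted queue. */
  def unpersistDataSet(): Unit = {
    while (persistedQueue.nonEmpty) {
      unpersist(persistedQueue.dequeue())
    }
  }
}
```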
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user dding3 commented on the issue: https://github.com/apache/spark/pull/15125

Sure. I will work on the rebase and update the PR soon.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
GitHub user dding3 opened a pull request: https://github.com/apache/spark/pull/15125 [SPARK-5484][GraphX] Periodically do checkpoint in Pregel

## What changes were proposed in this pull request?

Pregel-based iterative algorithms with more than ~50 iterations begin to slow down and eventually fail with a StackOverflowError due to Spark's lack of support for long lineage chains. This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set. It also moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.

## How was this patch tested?

Unit tests and manual tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dding3/spark cp2_pregel
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15125.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #15125

commit 3268ca2144519a1d84484c116e16e6064553a6c1
Author: ding <ding@localhost.localdomain>
Date: 2016-09-16T17:51:58Z
test commit

commit ffeadbe96b7fdb491962e065f97ece09fd1a1282
Author: ding <ding@localhost.localdomain>
Date: 2016-09-16T18:13:22Z
period do checkpoint in pregel

commit 720a741db037a94ab86750fb3c9d5a54732da5e1
Author: ding <ding@localhost.localdomain>
Date: 2016-09-16T18:37:50Z
remove unused code
[GitHub] spark pull request #15124: [SPARK-17559][MLLIB]persist edges if their storag...
GitHub user dding3 opened a pull request: https://github.com/apache/spark/pull/15124 [SPARK-17559][MLLIB] Persist edges if their storage level is none in PeriodicGraphCheckpointer

## What changes were proposed in this pull request?

When using PeriodicGraphCheckpointer to persist a graph, sometimes the edges aren't persisted, because currently the graph is persisted only when the vertices' storage level is none. However, there is a chance that the vertices' storage level is not none while the edges' is none, e.g. a graph created by an outerJoinVertices operation, where the vertices are automatically cached while the edges are not. In that case the edges will not be persisted when we use PeriodicGraphCheckpointer. We need to check the edges' storage level separately and persist them if it is none.

## How was this patch tested?

Manual tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dding3/spark spark-persisitEdge
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15124.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #15124

commit bf94b4dbcc4e8e0602715dce92f5053608674b43
Author: ding <ding@localhost.localdomain>
Date: 2016-09-16T17:04:27Z
persist edges if their storage level is non in PeriodicGraphCheckpointer
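The fix described above can be sketched as follows, assuming the public GraphX API (`Graph.vertices`, `Graph.edges`, `RDD.getStorageLevel`, `RDD.persist`); the helper name is illustrative, not the PR's exact code:

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.storage.StorageLevel

// Sketch of the fix described in the PR: check vertices and edges
// independently, since e.g. outerJoinVertices can leave the vertices
// cached while the edges remain unpersisted.
def persistGraphIfNeeded[VD, ED](graph: Graph[VD, ED]): Unit = {
  if (graph.vertices.getStorageLevel == StorageLevel.NONE) {
    graph.vertices.persist()
  }
  if (graph.edges.getStorageLevel == StorageLevel.NONE) {
    graph.edges.persist()
  }
}
```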
[GitHub] spark issue #15116: [SPARK-17559][MLLIB]persist edges if their storage level...
Github user dding3 commented on the issue: https://github.com/apache/spark/pull/15116

Closing the PR since it is mixed with another unrelated commit.
[GitHub] spark pull request #15116: [SPARK-17559][MLLIB]persist edges if their storag...
Github user dding3 closed the pull request at: https://github.com/apache/spark/pull/15116
[GitHub] spark pull request #15116: [SPARK-17559][MLLIB]persist edges if their storag...
GitHub user dding3 opened a pull request: https://github.com/apache/spark/pull/15116 [SPARK-17559][MLLIB] Persist edges if their storage level is none in PeriodicGraphCheckpointer

## What changes were proposed in this pull request?

When using PeriodicGraphCheckpointer to persist a graph, sometimes the edges aren't persisted, because currently the graph is persisted only when the vertices' storage level is none. However, there is a chance that the vertices' storage level is not none while the edges' is none, e.g. a graph created by an outerJoinVertices operation, where the vertices are automatically cached while the edges are not. In that case the edges will not be persisted when we use PeriodicGraphCheckpointer. We need to check the edges' storage level separately and persist them if it is none.

## How was this patch tested?

Manual tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dding3/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15116.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #15116

commit ad29af46b34d2d156078aba48b8e0427136fc6dd
Author: ding <ding@localhost.localdomain>
Date: 2016-09-15T21:39:10Z
persist edges if their storage level is none
[GitHub] spark pull request: [SPARK-15562][ML] Delete temp directory after ...
Github user dding3 commented on the pull request: https://github.com/apache/spark/pull/13328#issuecomment-222070319

Jenkins retest this please
[GitHub] spark pull request: [SPARK-15562][ML] Delete temp directory after ...
Github user dding3 commented on the pull request: https://github.com/apache/spark/pull/13328#issuecomment-222039995

Thanks for pointing it out. I think you are right: deleting the checkpoint directory right after it's created is on purpose. I will add "delete checkpoint directory" back. Besides, the temp directory will be deleted after the test finishes, since it was registered for deletion on VM shutdown when it was created. I have verified this: with the previous code the temp checkpoint directory was not deleted after the test finished, while with the new code it is deleted.
[GitHub] spark pull request: [SPARK-15562][ML] Delete temp directory after ...
GitHub user dding3 opened a pull request: https://github.com/apache/spark/pull/13328 [SPARK-15562][ML] Delete temp directory after program exit in DataFrameExample

## What changes were proposed in this pull request?

The temp directory used to save records is not deleted after program exit in DataFrameExample. Although deleteOnExit is called, it doesn't work because the directory is not empty. A similar thing happened in ContextCleanerSuite. Update the code to make sure the temp directory is deleted after program exit.

## How was this patch tested?

Unit tests and local build.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dding3/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/13328.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #13328
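The deleteOnExit pitfall described above can be sketched as follows. This is an illustrative fix under the usual JVM semantics (File.deleteOnExit only removes an empty directory), not necessarily the PR's exact code:

```scala
import java.io.File
import java.nio.file.Files

// File.deleteOnExit removes a directory at JVM exit only if it is empty,
// so a temp dir that still contains record files survives the program.
// One common fix: register a shutdown hook that deletes recursively.
def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) f.listFiles().foreach(deleteRecursively)
  f.delete()
}

val tmpDir = Files.createTempDirectory("example-records").toFile
Runtime.getRuntime.addShutdownHook(new Thread(() => deleteRecursively(tmpDir)))
```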
[GitHub] spark pull request: [SPARK-15172][ML] Explicitly tell user initial...
Github user dding3 commented on the pull request: https://github.com/apache/spark/pull/12948#issuecomment-217716898

Jenkins retest this please
[GitHub] spark pull request: [SPARK-15172][ML] Improve LogisticRegression w...
GitHub user dding3 opened a pull request: https://github.com/apache/spark/pull/12948

[SPARK-15172][ML] Improve LogisticRegression warning message

## What changes were proposed in this pull request?

Explicitly tell the user that the initial coefficients are ignored if their size doesn't match the expected size in LogisticRegression.

## How was this patch tested?

Local build.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dding3/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12948.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12948

commit 54cfaed5d307513cfd94a5807cf26a16695313ef
Author: dding3 <dingd...@dingding-ubuntu.sh.intel.com>
Date: 2016-05-06T06:15:10Z
Improve LogisticRegression warning message
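The behavior being documented is a silent fallback: a user-supplied initial model of the wrong size is discarded. A hypothetical Java helper sketching the guard-and-warn pattern the PR's message describes (this is not LogisticRegression's actual code; the method name and fallback are assumptions for illustration):

```java
import java.util.logging.Logger;

public class InitialModelCheck {
    private static final Logger log = Logger.getLogger("InitialModelCheck");

    // Hypothetical helper: returns the coefficients actually used for
    // training, warning loudly when the user-supplied vector has the
    // wrong size instead of ignoring it silently.
    static double[] resolveInitialCoefficients(double[] provided, int expectedSize) {
        if (provided != null && provided.length == expectedSize) {
            return provided;
        }
        log.warning("Initial coefficients will be ignored: provided size "
                + (provided == null ? 0 : provided.length)
                + " did not match expected size " + expectedSize);
        return new double[expectedSize]; // fall back to a zero vector
    }
}
```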
[GitHub] spark pull request: [SPARK-14969][MLLib] Remove duplicate implemen...
Github user dding3 commented on the pull request: https://github.com/apache/spark/pull/12747#issuecomment-215609146

@srowen Thanks for your review. I have removed it in ANNGradient. Besides, I checked all subclasses of Gradient, and it looks like there is no duplicate implementation now.
[GitHub] spark pull request: [SPARK-14969][MLLib] Remove duplicate implemen...
Github user dding3 commented on the pull request: https://github.com/apache/spark/pull/12747#issuecomment-215280895

Thanks for your comments. Removed the PR description.
[GitHub] spark pull request: Remove duplicate implementation of compute in ...
GitHub user dding3 opened a pull request: https://github.com/apache/spark/pull/12747

Remove duplicate implementation of compute in LogisticGradient

## What changes were proposed in this pull request?

This PR removes the duplicate implementation of compute in the LogisticGradient class.

## How was this patch tested?

Unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dding3/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/12747.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12747

commit a88b7f40eb6e363de8e530a2e377d6027cb91b16
Author: dding3 <dingd...@dingding-ubuntu.sh.intel.com>
Date: 2016-04-26T08:31:28Z
remove unnecessary compute method in LogisticGRadient

commit 3695a52b7e3d7e95cbc0ec30ea8a76da53d59a70
Author: dding3 <dingd...@dingding-ubuntu.sh.intel.com>
Date: 2016-04-26T08:33:08Z
Merge branch 'upstream_master'
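The pattern behind this PR is a template-method refactoring: the base class implements one compute overload in terms of the other, so a subclass that overrides both is duplicating work. A Java sketch under that assumption (signatures are simplified for illustration and are not Spark MLlib's actual Gradient API):

```java
// Holds the result of a gradient computation for one example.
class GradientResult {
    final double[] gradient;
    final double loss;
    GradientResult(double[] gradient, double loss) {
        this.gradient = gradient;
        this.loss = loss;
    }
}

abstract class Gradient {
    // In-place variant: adds this example's gradient into cumGradient and
    // returns the loss. Subclasses implement only this method.
    abstract double compute(double[] features, double label,
                            double[] weights, double[] cumGradient);

    // Allocating variant, implemented once here in terms of the in-place
    // one; overriding it again in a subclass would be the kind of
    // duplication this PR removes.
    GradientResult compute(double[] features, double label, double[] weights) {
        double[] grad = new double[weights.length];
        double loss = compute(features, label, weights, grad);
        return new GradientResult(grad, loss);
    }
}

// Example subclass: a least-squares gradient with a single compute
// implementation; the allocating overload is inherited for free.
class LeastSquaresGradient extends Gradient {
    @Override
    double compute(double[] features, double label,
                   double[] weights, double[] cumGradient) {
        double dot = 0.0;
        for (int i = 0; i < features.length; i++) {
            dot += features[i] * weights[i];
        }
        double diff = dot - label;
        for (int i = 0; i < features.length; i++) {
            cumGradient[i] += diff * features[i];
        }
        return 0.5 * diff * diff;
    }
}
```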
[GitHub] spark pull request: [SPARK-4094][CORE] checkpoint should still be ...
Github user dding3 commented on the pull request: https://github.com/apache/spark/pull/2956#issuecomment-67931182

We have tested the patch in the scenarios below and found it works:

1. Apply checkpoint: the RDD has been flushed to disk as expected.
2. Don't apply checkpoint: there is no performance degradation; our app (PageRank) only spent 1 more second compared to Spark without the patch.