[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 @jkbradley @srowen Can anyone check my latest commit please? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 @jkbradley Actually, I see quite a few ways to improve MLlib GBDT. First, we should stop using the RandomForest API and create a private API for GBDT, because the RandomForest API performs a lot of redundant or duplicated operations, such as buildMetadata, findSplitsBins, and convertToTreeRDD. For GBDT these operations only need to run once, not for every tree. I'd like to change that, but I need some advice on how to avoid breaking the existing architecture.
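The redundancy described in this comment can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Spark code: `Setup`, `boost_with_per_tree_setup`, and `boost_with_shared_setup` are hypothetical names, and the sketch assumes the preprocessing depends only on the input data and not on the per-iteration residuals.

```python
from typing import Dict, List


class Setup:
    """Hypothetical stand-in for per-dataset preprocessing; counts how often it runs."""

    def __init__(self) -> None:
        self.calls = 0

    def build(self, data: List[float]) -> Dict[str, float]:
        self.calls += 1
        return {"count": len(data), "lo": min(data), "hi": max(data)}


def boost_with_per_tree_setup(data: List[float], num_trees: int, setup: Setup) -> int:
    """Repeat the setup for every tree, as happens when each tree goes through a generic API."""
    for _ in range(num_trees):
        metadata = setup.build(data)
        _ = metadata  # a real learner would fit one tree here
    return setup.calls


def boost_with_shared_setup(data: List[float], num_trees: int, setup: Setup) -> int:
    """Run the setup once and reuse its result for every tree."""
    metadata = setup.build(data)
    for _ in range(num_trees):
        _ = metadata  # a real learner would fit one tree here
    return setup.calls
```

With 10 trees, the first variant runs the setup 10 times and the second runs it once; the saving grows with the number of boosting iterations.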
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 @jkbradley I've changed the code to specify the storageLevel via a Param. Please check.
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 @jkbradley OK, I will close the issue. Do I need to close the PR?
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16415 @zdh2292390 Thanks for the update. Given that this will change behavior for existing workloads, I'll ask that we specify it via a Param. Also, I'm going to create a new JIRA for this since it has changed significantly since the original JIRA reporting the failure. Could you please close this issue and instead work from this new JIRA? https://issues.apache.org/jira/browse/SPARK-19063 Thank you! I'll add the other TODOs to your original JIRA.
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 @jkbradley Yes, changing the storageLevel in predErrorCheckpointer fixes the problem. Could you please merge my latest commit, which only changes the storageLevel in predErrorCheckpointer? When you begin fixing the remaining problems, can I join in?
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16415 Thanks for checking! Does changing the storageLevel in predErrorCheckpointer fix the problem? "other use cases": Well, I remember thinking about this a lot when adding the periodic checkpointer, and it had to do with the fact that RDDs may be materialized later than checkpointer.update() gets called. Now that I look again, it's possible that we could maintain 2 instead of 3 cached RDDs in the checkpointer's persistedQueue, but I'd want to check this more carefully. Problem with 2 RDDs left cached after the loop: This could be fixed by adding a finalize() method to trait LDAOptimizer which can clean up the extra cached RDD. Unfortunately, now that the trait is public, we cannot change it. This fix will need to wait until we move the implementation to the spark.ml package, at which time we can fix the API.
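The queue behavior discussed in this comment can be sketched with a toy model. This is a minimal illustration in plain Python, not Spark's implementation: `ToyDataset`, `ToyPeriodicCache`, `unpersist_all`, and `run_loop` are hypothetical names, checkpointing and lazy materialization are not modeled, and the number of datasets left cached in the real code may differ from what this toy shows.

```python
from collections import deque
from typing import Deque, List


class ToyDataset:
    """Hypothetical stand-in for an RDD; only tracks whether it is cached."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cached = False


class ToyPeriodicCache:
    """Keeps at most max_persisted datasets cached, evicting the oldest first."""

    def __init__(self, max_persisted: int = 3) -> None:
        self.max_persisted = max_persisted
        self.persisted_queue: Deque[ToyDataset] = deque()

    def update(self, dataset: ToyDataset) -> None:
        # Cache the new dataset, then unpersist the oldest ones beyond the limit.
        dataset.cached = True
        self.persisted_queue.append(dataset)
        while len(self.persisted_queue) > self.max_persisted:
            self.persisted_queue.popleft().cached = False

    def unpersist_all(self) -> None:
        # Explicit cleanup step (an assumption of this sketch) for use after the loop.
        while self.persisted_queue:
            self.persisted_queue.popleft().cached = False


def run_loop(num_iterations: int, max_persisted: int, cleanup: bool) -> List[str]:
    """Return the names of datasets still cached once the training loop has ended."""
    cache = ToyPeriodicCache(max_persisted)
    datasets = [ToyDataset(f"predError-{i}") for i in range(num_iterations)]
    for dataset in datasets:
        cache.update(dataset)
    if cleanup:
        cache.unpersist_all()
    return [dataset.name for dataset in datasets if dataset.cached]
```

Without the cleanup call, whatever is still in the queue when the loop ends stays cached; lowering the limit shrinks that leftover but does not remove it, which is why an explicit cleanup hook is being discussed.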
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 @jkbradley There is another problem: predErrorCheckpointer does not unpersist the RDDs in its queue after the loop, so 2 RDDs are left cached once the loop finishes.
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 @jkbradley I tested with MEMORY_ONLY and got the same problem: `(ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.)` and it took more than 1 hour. So should my code just change the storageLevel in predErrorCheckpointer?
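For context on the error message above, the arithmetic behind the YARN container limit can be sketched as follows. This is a rough illustration that assumes the Spark 2.x default for spark.yarn.executor.memoryOverhead (10% of executor memory, with a 384 MB minimum); the executor sizes used are hypothetical, since the thread does not state the actual settings, and any rounding YARN applies to the request is not modeled.

```python
from typing import Optional


def default_overhead_mb(executor_memory_mb: int,
                        factor: float = 0.10,
                        minimum_mb: int = 384) -> int:
    """Default off-heap overhead: a fraction of executor memory, with a floor."""
    return max(int(executor_memory_mb * factor), minimum_mb)


def container_request_mb(executor_memory_mb: int,
                         overhead_mb: Optional[int] = None) -> int:
    """Memory requested per executor container: heap plus overhead."""
    if overhead_mb is None:
        overhead_mb = default_overhead_mb(executor_memory_mb)
    return executor_memory_mb + overhead_mb
```

For a hypothetical 9216 MB executor, the default overhead is 921 MB and the request is 10137 MB; setting the overhead explicitly to 2048 MB raises the request to 11264 MB. YARN kills the container when its physical memory use exceeds the limit, which is why the message suggests raising the overhead.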
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 @jkbradley Can I ask what the other use cases are in "The periodic checkpointer caches more because of some other use cases"?
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 Thanks, that is very clear. I found that I did something stupid: my test code actually doesn't include predErrorCheckpointer (I disabled it), so my committed code is not right; it should delete the persist in predErrorCheckpointer. I am now running my test with MEMORY_ONLY instead of MEMORY_AND_DISK, and in about 1 hour we will know what really caused the improvement.
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16415 predErrorCheckpointer should already be persisting and unpersisting predError. This PR's changes will mean:
* persist will use MEMORY_AND_DISK instead of MEMORY_ONLY
* 1 (instead of 3) predError RDDs will be persisted at any one time. (The periodic checkpointer caches more because of some other use cases.)

I recommend we figure out which of the 2 changes above is causing the improvement for your use case. Could you please try your test with MEMORY_ONLY instead of MEMORY_AND_DISK?
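The difference between the two storage levels named above can be sketched with a toy placement model. This is a simplified illustration in plain Python, not Spark code: `place_partitions` is a hypothetical function, partition sizes and the memory budget are made up, and real Spark also evicts and recomputes blocks dynamically, which is not modeled here.

```python
from typing import Dict, List


def place_partitions(sizes_mb: List[int],
                     memory_budget_mb: int,
                     storage_level: str) -> Dict[str, List[int]]:
    """Decide where each partition ends up under a fixed memory budget."""
    if storage_level not in ("MEMORY_ONLY", "MEMORY_AND_DISK"):
        raise ValueError(f"unsupported storage level: {storage_level}")

    placement: Dict[str, List[int]] = {"memory": [], "disk": [], "not_cached": []}
    used_mb = 0
    for index, size_mb in enumerate(sizes_mb):
        if used_mb + size_mb <= memory_budget_mb:
            # The partition fits in the remaining memory budget.
            placement["memory"].append(index)
            used_mb += size_mb
        elif storage_level == "MEMORY_AND_DISK":
            # Does not fit: spill to disk and read it back when needed.
            placement["disk"].append(index)
        else:
            # MEMORY_ONLY: not cached, so it is recomputed each time it is needed.
            placement["not_cached"].append(index)
    return placement
```

With three 40 MB partitions and a 100 MB budget, both levels keep the first two partitions in memory; the third is spilled to disk under MEMORY_AND_DISK and left uncached under MEMORY_ONLY. Separating this effect from the change in the number of persisted RDDs is the point of the test requested above.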
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user zdh2292390 commented on the issue: https://github.com/apache/spark/pull/16415 Can one of the admins verify this patch?
[GitHub] spark issue #16415: [SPARK-19007]Speedup and optimize the GradientBoostedTre...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16415 Can one of the admins verify this patch?