[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16775 I think we can solve this issue by addressing the code in SQL, so I am closing this for now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16775 @viirya I believe this PR meshes with the refactoring and application to Pregel GraphX algorithms in #15125. Basically, that PR moves the periodic checkpointing code from mllib into core and uses it in GraphX to checkpoint long lineages. This is essential to scale GraphX to huge graphs, as described in my comment in the PR, and solves a very real problem for us. Can you take a look at that PR?
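To illustrate why periodic checkpointing keeps long lineages manageable, here is a minimal toy sketch. It is plain Python, not Spark code and not the implementation from either PR: `Node`, `checkpoint`, and `iterate` are hypothetical names made up for this illustration, and "checkpointing" is modeled simply as dropping the parent pointer.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """One step in a toy lineage chain (a stand-in for an RDD or Dataset plan)."""
    value: int
    parent: Optional["Node"] = None  # None means the data is materialized

    def depth(self) -> int:
        """Count how many ancestors would have to be replayed to rebuild this node."""
        d, node = 0, self
        while node.parent is not None:
            d += 1
            node = node.parent
        return d


def checkpoint(node: Node) -> Node:
    """Toy checkpoint: keep the computed value, drop the lineage."""
    return Node(node.value, parent=None)


def iterate(start: Node, steps: int, interval: int) -> Tuple[Node, int]:
    """Apply `steps` transformations, checkpointing every `interval` steps.

    An interval of 0 disables checkpointing. Returns the final node and the
    longest lineage observed along the way.
    """
    current, max_depth = start, 0
    for i in range(1, steps + 1):
        current = Node(current.value + 1, parent=current)  # one transformation
        max_depth = max(max_depth, current.depth())
        if interval > 0 and i % interval == 0:
            current = checkpoint(current)
    return current, max_depth
```

In this toy model the longest lineage is bounded by the checkpoint interval rather than by the total number of iterations, which is the property the thread is after for long ML pipelines and iterative graph algorithms.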
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16775 ping @mengxr @jkbradley @liancheng @MLnick Could you take a look at this? Thanks.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16775 For the issue reported on the mailing list, I found the root cause of the significant difference between 1.6 and the current branch. The fix is in #16785. However, I think this patch is still useful, so I will keep it open for a while for reviewers.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16775 `StringIndexer` and `OneHotEncoder` are just used as examples here. The idea is to have a pipeline with sufficiently many stages.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user DavidArenburg commented on the issue: https://github.com/apache/spark/pull/16775 Wouldn't it be better to vectorize `StringIndexer` and `OneHotEncoder`? For instance, `.na.fill` or `.na.replace` operate over the whole data set at once instead of running in a loop. I feel that even with this patch this isn't scalable to, let's say, 1MM covariates (unless I'm missing something), as this process isn't executed on all nodes/cores at the same time.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16775 Merged build finished. Test PASSed.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16775 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72277/ Test PASSed.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16775 **[Test build #72277 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72277/testReport)** for PR 16775 at commit [`32c90dd`](https://github.com/apache/spark/commit/32c90dd0817778d3a1a0d1a955463d656dd92d60). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16775 **[Test build #72277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72277/testReport)** for PR 16775 at commit [`32c90dd`](https://github.com/apache/spark/commit/32c90dd0817778d3a1a0d1a955463d656dd92d60).
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/16775 retest this please.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16775 Merged build finished. Test FAILed.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16775 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72275/ Test FAILed.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16775 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72276/ Test FAILed.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16775 Merged build finished. Test FAILed.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16775 **[Test build #72276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72276/testReport)** for PR 16775 at commit [`32c90dd`](https://github.com/apache/spark/commit/32c90dd0817778d3a1a0d1a955463d656dd92d60).
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16775 Merged build finished. Test PASSed.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16775 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72274/ Test PASSed.
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16775 **[Test build #72275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72275/testReport)** for PR 16775 at commit [`7a1b300`](https://github.com/apache/spark/commit/7a1b3008a5873600016ebe0649285a724c6f4d7c).
[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16775 **[Test build #72274 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72274/testReport)** for PR 16775 at commit [`5ed5c2a`](https://github.com/apache/spark/commit/5ed5c2a65c31c78b7845bbb8a3ef859590453ba9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.