[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-22 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16775
  
I think we can solve this issue by fixing the code in SQL, so I am closing 
this for now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16775
  
@viirya I believe this PR meshes with the refactoring and application to 
Pregel GraphX algorithms in #15125. Basically, #15125 moves the periodic 
checkpointing code from mllib into core and uses it in GraphX to checkpoint 
long lineages. This is essential to scale GraphX to huge graphs, as described 
in my comment in that PR, and solves a very real problem for us. Can you take a 
look at that PR?
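
To make the idea of periodic checkpointing concrete, here is a minimal sketch in plain Python. It is not Spark code and not the implementation in either PR: `ToyDataset`, `run_iterations`, and `checkpoint_interval` are hypothetical names used only for illustration. The sketch assumes that every transformation lengthens a dataset's lineage by one step and that a checkpoint keeps the data while resetting the lineage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for a dataset; this is not a Spark class."""
    values: tuple
    lineage_depth: int = 0  # number of steps recorded since the last checkpoint

    def transform(self, fn):
        # Each transformation extends the lineage by one step.
        return ToyDataset(tuple(fn(v) for v in self.values), self.lineage_depth + 1)

    def checkpoint(self):
        # A checkpoint keeps the data but truncates the lineage.
        return ToyDataset(self.values, 0)


def run_iterations(data, step, iterations, checkpoint_interval):
    """Apply `step` repeatedly, checkpointing every `checkpoint_interval` steps."""
    max_depth = 0
    for i in range(1, iterations + 1):
        data = data.transform(step)
        max_depth = max(max_depth, data.lineage_depth)
        if i % checkpoint_interval == 0:
            data = data.checkpoint()
    return data, max_depth


if __name__ == "__main__":
    result, max_depth = run_iterations(ToyDataset((1, 2, 3)), lambda v: v + 1, 25, 10)
    print(result.values, result.lineage_depth, max_depth)
```

In this toy model the lineage never grows beyond `checkpoint_interval` steps, no matter how many iterations run, which is the property that matters for long-running iterative jobs.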





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-08 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16775
  
ping @mengxr @jkbradley @liancheng @MLnick Could you take a look at this? 
Thanks.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16775
  
For the issue reported on the mailing list, I found the root cause of the 
significant difference between 1.6 and the current branch. The fix is in #16785.

However, I think this patch is still useful, so I will keep it open for a 
while for reviewers.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16775
  
`StringIndexer` and `OneHotEncoder` are used only as examples here. The 
idea is to have a pipeline with a long enough chain of stages.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread DavidArenburg
Github user DavidArenburg commented on the issue:

https://github.com/apache/spark/pull/16775
  
Wouldn't it be better to vectorize `StringIndexer` and `OneHotEncoder`? For 
instance, `.na.fill` and `.na.replace` operate over the whole data set at 
once instead of running in a loop. I feel that even with this patch this 
isn't scalable to, let's say, 1MM covariates (unless I'm missing something), as 
this process isn't executed on all nodes/cores at the same time.
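
The contrast raised in the comment above, one pass per column versus one pass over the whole data set, can be sketched in plain Python. This is only an illustration of the general idea and makes no claim about how `StringIndexer` or `OneHotEncoder` are implemented; the function names and the row format (a list of dicts) are assumptions made for the example.

```python
def index_columns_one_by_one(rows, columns):
    """Looped approach: one full scan of `rows` for every column."""
    scans = 0
    labels = {}
    for col in columns:
        scans += 1  # each column triggers its own scan of the data
        labels[col] = sorted({row[col] for row in rows})
    return labels, scans


def index_columns_single_pass(rows, columns):
    """Whole-data-set approach: a single scan collects labels for all columns."""
    seen = {col: set() for col in columns}
    for row in rows:
        for col in columns:
            seen[col].add(row[col])
    return {col: sorted(vals) for col, vals in seen.items()}, 1


if __name__ == "__main__":
    rows = [{"a": "x", "b": "p"}, {"a": "y", "b": "p"}]
    print(index_columns_one_by_one(rows, ["a", "b"]))
    print(index_columns_single_pass(rows, ["a", "b"]))
```

Both functions return the same labels; they differ only in how many scans of the data they need, which grows with the number of columns in the looped version.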





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16775
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16775
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72277/
Test PASSed.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16775
  
**[Test build #72277 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72277/testReport)** for PR 16775 at commit [`32c90dd`](https://github.com/apache/spark/commit/32c90dd0817778d3a1a0d1a955463d656dd92d60).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16775
  
**[Test build #72277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72277/testReport)** for PR 16775 at commit [`32c90dd`](https://github.com/apache/spark/commit/32c90dd0817778d3a1a0d1a955463d656dd92d60).





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread viirya
Github user viirya commented on the issue:

https://github.com/apache/spark/pull/16775
  
retest this please.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16775
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16775
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72275/
Test FAILed.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16775
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72276/
Test FAILed.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16775
  
Merged build finished. Test FAILed.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16775
  
**[Test build #72276 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72276/testReport)** for PR 16775 at commit [`32c90dd`](https://github.com/apache/spark/commit/32c90dd0817778d3a1a0d1a955463d656dd92d60).





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16775
  
Merged build finished. Test PASSed.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/16775
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72274/
Test PASSed.





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16775
  
**[Test build #72275 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72275/testReport)** for PR 16775 at commit [`7a1b300`](https://github.com/apache/spark/commit/7a1b3008a5873600016ebe0649285a724c6f4d7c).





[GitHub] spark issue #16775: [SPARK-19433][ML] Periodic checkout datasets for long ml...

2017-02-01 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/16775
  
**[Test build #72274 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72274/testReport)** for PR 16775 at commit [`5ed5c2a`](https://github.com/apache/spark/commit/5ed5c2a65c31c78b7845bbb8a3ef859590453ba9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.

