[GitHub] spark issue #21067: [SPARK-23980][K8S] Resilient Spark driver on Kubernetes

2018-07-27 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
Thanks for the responses, I learned a lot from this :) I am going to close 
this PR for now, and perhaps collaborate on the Kubernetes ticket raised by this 
PR. Thanks.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org




2018-07-19 Thread liyinan926
Github user liyinan926 commented on the issue:

https://github.com/apache/spark/pull/21067
  
+1 on what @foxish said. If using a Job is ultimately the right way to go, 
it would be good to open a discussion with sig-apps on adding an option to the Job 
API & controller to use deterministic pod names, as well as to offer exactly-once 
semantics. Spark is probably not the only use case needing such a semantic 
guarantee.


---


2018-07-19 Thread foxish
Github user foxish commented on the issue:

https://github.com/apache/spark/pull/21067
  
> ReadWriteOnce storage can only be attached to one node.

This is well known. Using the RWO volume for fencing here would work, but 
this is not representative of all users. It breaks down if you include 
checkpointing to object storage (s3), HDFS, or ReadWriteMany volumes like 
NFS. In all of those cases there will be a problem with correctness. 

For folks that need it right away, the same restart feature can be 
realized safely using an approach like the 
[spark-operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) 
without any of this hassle, so why are we trying to fit this into Spark with 
caveats around how volumes must be used to ensure fencing? That seems more 
error prone and harder to explain, and I can't see the gain from it. One way 
forward is proposing to the k8s community a new option on Jobs that lets us get 
fencing from the k8s apiserver through deterministic names. I think that would 
be a good way forward. 
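The deterministic-name fencing proposed here can be sketched with a toy model. The `FakeApiServer` class and pod name below are illustrative stand-ins, not the real Kubernetes API: the point is only that because object names are unique within a namespace, a second create of the same name is rejected, which gives at-most-one driver.

```python
# Hypothetical sketch of fencing via deterministic pod names: every retry
# creates a pod with the *same* name, so the API server's uniqueness check
# rejects duplicates. `FakeApiServer` is illustrative, not the real k8s API.

class AlreadyExists(Exception):
    pass

class FakeApiServer:
    def __init__(self):
        self.pods = {}

    def create_pod(self, name):
        # Names are unique per namespace; a second create is rejected.
        if name in self.pods:
            raise AlreadyExists(name)
        self.pods[name] = "Running"
        return name

api = FakeApiServer()
api.create_pod("spark-driver-job1")          # first controller attempt wins
try:
    api.create_pod("spark-driver-job1")      # a racing retry is fenced out
    fenced = False
except AlreadyExists:
    fenced = True

print(fenced)  # True: deterministic names give at-most-one semantics
```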


---


2018-07-19 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/21067
  
@baluchicken yeah I thought of that but I was hoping for more automation. 


---


2018-07-19 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
@skonto if the node never becomes available again, the new driver will stay 
in Pending state until, as @foxish said, "the user explicitly force-kills the 
old driver".


---


2018-07-19 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/21067
  
> Once the partitioned node becomes available again, the old Unknown driver 
pod is terminated, the volume is detached and reattached to the new driver pod, 
whose state then changes from Pending to Running.

What if the node never becomes available again?


---


2018-07-17 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
I ran some more tests on this. I think we can say that this change adds 
resiliency to Spark batch jobs: just as on YARN, Spark will retry the job from 
the beginning if an error occurs. 

It can also add resiliency to Spark Streaming apps. I fully understand 
your concerns, but anyone submitting a resilient Spark Streaming app will use 
Spark's checkpointing feature. For checkpointing they should use some kind of 
PersistentVolume, otherwise everything saved to the checkpoint directory is 
lost on a node failure. The PVC accessMode should be ReadWriteOnce, because for 
this amount of data it is much faster than the ReadWriteMany options. 

My new tests used the same approach described above with one modification: 
I enabled a checkpointing dir backed by a ReadWriteOnce PVC. ReadWriteOnce 
storage can only be attached to one node. I thought Kubernetes would detach this 
volume once the node became NotReady, but something else happened. Kubernetes 
does not detach the volume from the unknown node, so even though the Job 
controller created a new driver pod to replace the Unknown one, it remained in 
Pending state because the required PVC was still attached to a different node. 
Once the partitioned node becomes available again, the old Unknown driver pod 
is terminated, the volume is detached and reattached to the new driver pod, 
whose state then changes from Pending to Running.

So I think there is no problem with correctness here. We could add a 
warning to the documentation that correctness issues may arise with a 
ReadWriteMany-backed checkpoint dir; otherwise, unless I am still missing 
something, I think we are fine.
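The attach behavior described above can be sketched as a toy model. The class, pod, and node names are illustrative and the semantics are assumed, not taken from the Kubernetes codebase: an RWO volume attached to one node keeps the replacement pod in Pending until the old node releases it.

```python
# Minimal sketch (assumed semantics) of why a ReadWriteOnce PVC kept the
# replacement driver in Pending: an RWO volume can be attached to only one
# node at a time, and it is not force-detached from a NotReady node.

class Cluster:
    def __init__(self):
        self.volume_attached_to = "node-a"   # old driver's node
        self.pods = {"driver-old": ("node-a", "Unknown")}

    def schedule(self, pod, node):
        # The pod can only start once its PVC can attach to the target node.
        if self.volume_attached_to not in (None, node):
            self.pods[pod] = (node, "Pending")
        else:
            self.volume_attached_to = node
            self.pods[pod] = (node, "Running")

    def node_recovers(self):
        # Old pod is terminated, the volume detaches, replacement can attach.
        del self.pods["driver-old"]
        self.volume_attached_to = None

c = Cluster()
c.schedule("driver-new", "node-b")
print(c.pods["driver-new"][1])   # Pending: PVC still attached to node-a
c.node_recovers()
c.schedule("driver-new", "node-b")
print(c.pods["driver-new"][1])   # Running
```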


---


2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/93201/
Test PASSed.


---


2018-07-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Merged build finished. Test PASSed.


---


2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #93201 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93201/testReport)**
 for PR 21067 at commit 
[`c04179b`](https://github.com/apache/spark/commit/c04179b48056b14912a95016d5040777bdb1007c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2018-07-17 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #93201 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/93201/testReport)**
 for PR 21067 at commit 
[`c04179b`](https://github.com/apache/spark/commit/c04179b48056b14912a95016d5040777bdb1007c).


---


2018-07-12 Thread foxish
Github user foxish commented on the issue:

https://github.com/apache/spark/pull/21067
  
> After a short/configurable delay the driver pod state changed to Unknown
and the Job controller initiated a new spark driver.

This is dangerous behavior. The old Spark driver can still be perfectly
functional and running within the cluster even though its state is marked
Unknown. It could also still be making progress with its own executors.
A network connection to the K8s master is not a prerequisite for pods to
continue running.




---


2018-07-12 Thread promiseofcake
Github user promiseofcake commented on the issue:

https://github.com/apache/spark/pull/21067
  
@baluchicken, did that test involve using checkpointing in a shared 
location?


---


2018-07-12 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
@foxish I just checked on a Google Kubernetes cluster with Kubernetes 
version 1.10.4-gke.2. I created a two-node cluster and emulated a "network 
partition" with iptables rules (the node running the Spark driver became NotReady). 
After a short/configurable delay the driver pod's state changed to Unknown and 
the Job controller initiated a new Spark driver. After that I removed the 
iptables rules denying the kubelet access to the master (the node with 
status NotReady became Ready again). Once the node became Ready, the driver pod 
in the Unknown state was terminated, along with all of its executors. In this case 
there are no Spark drivers running in parallel, so I think we are not sacrificing 
correctness. Am I missing something?
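The recovery step of this test can be sketched as a toy model (function and pod names are illustrative, and the behavior is assumed from the description above, not taken from the kubelet's actual protocol): once the partitioned node reconnects, its kubelet learns the Unknown pod was marked for deletion and tears it down, leaving a single driver.

```python
# Sketch of the recovery step in the test above (assumed behavior): when the
# partitioned node reconnects, its kubelet sees that the Unknown pod was
# marked for deletion and terminates it plus its executors.

def node_reconnects(pods, marked_for_deletion):
    for pod in marked_for_deletion:
        pods.pop(pod, None)          # kubelet tears the pod down
    return pods

pods = {"driver-old": "Unknown", "driver-new": "Running"}
pods = node_reconnects(pods, marked_for_deletion=["driver-old"])
print(list(pods))                    # ['driver-new']: no parallel drivers left
```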


---


2018-07-06 Thread liyinan926
Github user liyinan926 commented on the issue:

https://github.com/apache/spark/pull/21067
  
+1 on what @foxish said. I would also like to see a detailed discussion on 
the semantic differences this brings onto the table first before committing to 
this approach.


---


2018-07-06 Thread foxish
Github user foxish commented on the issue:

https://github.com/apache/spark/pull/21067
  
I don't think this current approach will suffice. Correctness is important 
here, especially for folks using spark streaming. I understand that we're 
proposing the use of backoff limits but there is **no guarantee** that a job 
controller **won't** spin up 2 driver pods when we ask for 1. That by 
definition is how the job controller works, by being greedy and working towards 
desired completions. For example, in the case of a network partition, the job 
controller logic in the Kubernetes master will not differentiate between:

1. Losing contact with the driver pod temporarily
2. Finding no driver pod and starting a new one

This has been the reason why in the past I've proposed using a StatefulSet. 
However, getting termination semantics with a StatefulSet will be more work. I 
don't think we should sacrifice correctness in this layer as it would surprise 
the application author who now has to reason about whether the operation they 
are performing is idempotent.

Can we have a proposal and understand all the subtleties before trying to 
change this behavior? For example, if we end up with more than one driver for a 
single job, I'd like to ensure that only one of them is making progress (e.g. 
by using a lease in ZK). 
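The hazard described above can be sketched as a toy reconciliation loop (pod names and states are illustrative, not the Job controller's actual implementation): a controller that counts only pods it can see as running will greedily create a replacement, even though the Unknown pod may still be executing on its node.

```python
# Sketch of the hazard: a greedy controller reconciling toward
# "desired running = 1" cannot tell a partitioned-but-alive pod from a dead
# one, so during a partition two drivers may run concurrently.

def reconcile(pods, desired=1):
    # Pods the controller has lost contact with don't count as running...
    visible = [p for p in pods if pods[p] == "Running"]
    # ...so it greedily creates replacements to reach the desired count.
    for _ in range(desired - len(visible)):
        pods[f"driver-{len(pods)}"] = "Running"
    return pods

pods = {"driver-0": "Running"}
pods["driver-0"] = "Unknown"      # network partition: state only *looks* lost
reconcile(pods)                   # controller adds driver-1

# The partitioned pod may in fact still be executing on its node:
actually_running = [p for p in pods if pods[p] in ("Running", "Unknown")]
print(len(actually_running))      # 2 -> both drivers can make progress
```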




---


2018-07-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Merged build finished. Test PASSed.


---


2018-07-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92685/
Test PASSed.


---


2018-07-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #92685 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92685/testReport)**
 for PR 21067 at commit 
[`0f280f4`](https://github.com/apache/spark/commit/0f280f4bbd3943bb1dd02040a76adf846ed2a8e9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2018-07-06 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #92685 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92685/testReport)**
 for PR 21067 at commit 
[`0f280f4`](https://github.com/apache/spark/commit/0f280f4bbd3943bb1dd02040a76adf846ed2a8e9).


---


2018-07-06 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
@skonto thanks, I am going to check it.


---


2018-07-06 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/21067
  
@baluchicken probably this is covered here: 
https://github.com/apache/spark/pull/21260. I kind of missed that, as I thought 
it was only for hostpaths but it also covers PVs.


---


2018-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92650/
Test FAILed.


---


2018-07-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Merged build finished. Test FAILed.


---


2018-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #92650 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92650/testReport)**
 for PR 21067 at commit 
[`4e0b3b0`](https://github.com/apache/spark/commit/4e0b3b0f6ef33918549c17403ab19bb6a92b6643).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2018-07-05 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
@mccheah rebased to master and updated the PR; the KubernetesDriverBuilder 
now creates the driver Job instead of the configuration steps.


---


2018-07-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #92650 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92650/testReport)**
 for PR 21067 at commit 
[`4e0b3b0`](https://github.com/apache/spark/commit/4e0b3b0f6ef33918549c17403ab19bb6a92b6643).


---


2018-07-04 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
@skonto sorry, I have a couple of other things to do, but I am updating 
this as my time allows.
Yes, we are planning to create a PR for the PV-related work as soon as 
this one goes in.


---


2018-07-03 Thread skonto
Github user skonto commented on the issue:

https://github.com/apache/spark/pull/21067
  
@baluchicken @foxish any update on this? The HA story is pretty critical for 
production in many cases.



---


2018-06-11 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
@felixcheung rebased to master and fixed failing unit tests


---


2018-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Merged build finished. Test PASSed.


---


2018-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91660/
Test PASSed.


---


2018-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #91660 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91660/testReport)**
 for PR 21067 at commit 
[`00a149a`](https://github.com/apache/spark/commit/00a149a046aa87e0ba9a621df7068d134a999a9f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2018-06-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #91660 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91660/testReport)**
 for PR 21067 at commit 
[`00a149a`](https://github.com/apache/spark/commit/00a149a046aa87e0ba9a621df7068d134a999a9f).


---


2018-06-10 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/21067
  
any update?


---


2018-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Can one of the admins verify this patch?


---


2018-05-21 Thread liyinan926
Github user liyinan926 commented on the issue:

https://github.com/apache/spark/pull/21067
  
+1 on @foxish's concerns about the lack of exactly-once semantics.


---


2018-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90898/
Test FAILed.


---


2018-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Merged build finished. Test FAILed.


---


2018-05-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #90898 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90898/testReport)**
 for PR 21067 at commit 
[`2b1de38`](https://github.com/apache/spark/commit/2b1de389e195026ebc94ed939f299c92661f384a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2018-05-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #90898 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90898/testReport)**
 for PR 21067 at commit 
[`2b1de38`](https://github.com/apache/spark/commit/2b1de389e195026ebc94ed939f299c92661f384a).


---


2018-05-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #90895 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90895/testReport)**
 for PR 21067 at commit 
[`95f6886`](https://github.com/apache/spark/commit/95f6886a29ed09eeeb0254c5289fb832328f1581).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2018-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90895/
Test FAILed.


---


2018-05-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Merged build finished. Test FAILed.


---


2018-05-21 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #90895 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90895/testReport)**
 for PR 21067 at commit 
[`95f6886`](https://github.com/apache/spark/commit/95f6886a29ed09eeeb0254c5289fb832328f1581).


---


2018-05-21 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
@felixcheung fixed the Scala style violations, sorry.


---


2018-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/90868/
Test FAILed.


---


2018-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #90868 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90868/testReport)**
 for PR 21067 at commit 
[`f19bf1a`](https://github.com/apache/spark/commit/f19bf1af3830482069c168856656fc6e6928fe3f).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---


2018-05-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Merged build finished. Test FAILed.


---


2018-05-20 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21067
  
**[Test build #90868 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90868/testReport)**
 for PR 21067 at commit 
[`f19bf1a`](https://github.com/apache/spark/commit/f19bf1af3830482069c168856656fc6e6928fe3f).


---


2018-05-20 Thread felixcheung
Github user felixcheung commented on the issue:

https://github.com/apache/spark/pull/21067
  
Jenkins, ok to test


---


2018-05-13 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
Rebased again to master.


---


2018-05-07 Thread baluchicken
Github user baluchicken commented on the issue:

https://github.com/apache/spark/pull/21067
  
@mccheah Rebased to master, and added support for a configurable backoffLimit.


---


2018-04-14 Thread stoader
Github user stoader commented on the issue:

https://github.com/apache/spark/pull/21067
  
@mccheah 

> But whether or not the driver should be relaunchable should be determined 
by the application submitter, and not necessarily done all the time. Can we 
make this behavior configurable?

This should be easy by configuring the [Pod backoff failure 
policy](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy)
 of the Job such that it executes the pod only once.

> We don't have a solid story for checkpointing streaming computation right 
now

We've done work to store checkpoints on a persistent volume, but we thought 
that should be a separate PR, as it's not strictly linked to this change.

> you'll certainly lose all progress from batch jobs

Agreed that the batch job would be rerun from scratch. Still, I think there 
is value in being able to run a batch job unattended, without intervening on 
machine failure, since the job will be rescheduled to another node.
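The backoff-limit idea can be sketched as a toy simulation (the function below is illustrative, not the Job controller's actual algorithm): with `backoffLimit = 0`, the first pod failure fails the Job, so the driver is executed at most once, while a larger limit allows relaunches.

```python
# Hedged sketch of the Job backoff-limit policy referenced above: with
# spec.backoffLimit = 0 the controller gives up after the first pod failure,
# i.e. the driver is never relaunched. Pure simulation, not the k8s API.

def run_job(attempt_results, backoff_limit):
    """Run pods until one succeeds or failures exceed backoff_limit."""
    failures = 0
    for succeeded in attempt_results:
        if succeeded:
            return "Complete", failures
        failures += 1
        if failures > backoff_limit:
            return "Failed", failures
    return "Failed", failures

# backoffLimit=0: a single failure fails the Job -> pod executed only once
print(run_job([False, True], backoff_limit=0))   # ('Failed', 1)
# a larger limit allows the driver to be relaunched after a failure
print(run_job([False, True], backoff_limit=6))   # ('Complete', 1)
```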



---


2018-04-13 Thread mccheah
Github user mccheah commented on the issue:

https://github.com/apache/spark/pull/21067
  
> We don't have a solid story for checkpointing streaming computation right 
now, and even if we did, you'll certainly lose all progress from batch jobs.

Should probably clarify re: streaming - we don't do any Kubernetes-specific 
actions (e.g. Persistent Volumes) to do Streaming checkpointing. But anything 
built-in to Spark should work, such as DFS checkpointing - barring anything 
that requires using the pod's local disk.


---


2018-04-13 Thread mccheah
Github user mccheah commented on the issue:

https://github.com/apache/spark/pull/21067
  
Looks like there are a lot of conflicts from the refactor that was just 
merged.

In general though I don't think this buys us too much. The problem is that 
when the driver fails, you'll lose any and all state of progress done so far. 
We don't have a solid story for checkpointing streaming computation right now, 
and even if we did, you'll certainly lose all progress from batch jobs.

Also, restarting the driver might not be the right thing to do in all 
cases. This assumes that it's always ok to have the driver re-launch itself 
automatically. But whether or not the driver should be relaunchable should be 
determined by the application submitter, and not necessarily done all the time. 
Can we make this behavior configurable?


---


2018-04-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21067
  
Can one of the admins verify this patch?


---
