[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2015-05-18 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-103157190
  
I think this has timed out. Would you close this PR for now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2015-05-18 Thread liyezhang556520
Github user liyezhang556520 closed the pull request at:

https://github.com/apache/spark/pull/3825


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2015-01-05 Thread liyezhang556520
Github user liyezhang556520 commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68684212
  
@JoshRosen , If we want to use the supervision mechanism. We need to add 
another actor level as parent of the current Master actor. I don't know if that 
is suitable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2015-01-05 Thread liyezhang556520
Github user liyezhang556520 commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68828355
  
@markhamstra , thanks for reminder, I'll update this PR by making a try to 
introduce the supervision.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2015-01-05 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68721665
  
@liyezhang556520 That's been done already in the DAGScheduler.  If we need 
another level of supervision for Master or other actors, we should consider 
whether these actors need separate Supervisors or whether they can be combined 
in the supervision hierarchy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2015-01-04 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68648457
  
@markhamstra From that page:

 Depending on the nature of the work to be supervised and the nature of 
the failure, the supervisor has a choice of the following four options:
 
 1. Resume the subordinate, keeping its accumulated internal state
 2. Restart the subordinate, clearing out its accumulated internal state
 3. Stop the subordinate permanently
 4. Escalate the failure, thereby failing itself

It sounds like we want approach 1, resuming the subordinate without losing 
its state, so I'd be in favor of reworking this PR to use that type of 
supervision strategy.  I don't think that extending our actors to support 
restart necessarily makes sense, since it adds a lot of complexity (for 
instance, I don't think that this PR handles loss of the `appInfo` or 
`driverInfo` data structures).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2015-01-04 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68650323
  
@JoshRosen your thinking is that Master will be in good shape even though 
an exception has been thrown?  If you can guarantee that, then resuming the 
actor while keeping the accumulated state should do the job.  Otherwise, things 
get more complicated.  Within the lengthy process of handling exceptions thrown 
within the DAGScheduler (https://github.com/apache/spark/pull/186), we ended up 
taking the conservative approach of restarting the whole system instead of 
trying to restart the DAGScheduler actor with fixed or reconstructed state.  I 
haven't dug into the details of this PR yet, so I can't say for certain, but 
there are probably lessons to be learned from that DAGScheduler epic PR.

Something else that we'll need to consider at some point if other actors 
start requiring supervision strategies other than the default is what the 
overall structure of the supervision hierarchy should be.  Right now, only the 
DAGScheduler has another level of supervision, but perhaps Spark actors from 
outside the DAGScheduler should also be handled under one or more levels of 
common supervision.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2014-12-31 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68429462
  
I'd rather use fewer Akka features than more, since this will make it 
easier to replace Akka with our own RPC layer in the future.  Therefore, I'd 
much prefer to not allow exceptions to trigger actor restarts / state clearing. 
 I think that adding an experimental Akka feature like persistence would be a 
huge risk for little obvious gain.

I'm not sure if the heartbeat from unknown worker can ever occur if we 
don't clear the master's state because I think that workers only begin sending 
heartbeats once a master has ack'd their registration in which case the master 
would know that it was a previously-registered worker and instruct it to 
reconnect.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2014-12-31 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68445280
  
It doesn't seem to me that usage of the newer Akka persistence API is 
called for, but it does seem that wrapping the `receive` in a try-catch is 
trying to do the job for which Akka's `SupervisorStrategy` is intended.  I 
can't recommend the hand-rolled try-catch approach.

http://doc.akka.io/docs/akka/2.3.4/general/supervision.html#supervision


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2014-12-30 Thread liyezhang556520
Github user liyezhang556520 commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68424069
  
Hi @JoshRosen , it'll be a little weird to wrap the `receive` or 
`receiveWithLogging` method with try-catch blocks. And also this is conflict 
with the fault tolerance design of Akka, which is supposed to handle exception 
within supervisor. If you think it's fine to add a try-catch block to `receive` 
method. Then I'll make an additional commit.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2014-12-30 Thread liyezhang556520
Github user liyezhang556520 commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68428197
  
I think a better way is to use the Akka's 
[persistence](http://doc.akka.io/docs/akka/snapshot/scala/persistence.html) 
feature, recover the actor's state when actor restart. 

Anyway, still this PR has it's value that when an unregistered worker has 
heartbeat to this Master, this master should not just ignore it, at least 
should tell the worker it has disconnected. What do you think @JoshRosen ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2014-12-29 Thread liyezhang556520
GitHub user liyezhang556520 opened a pull request:

https://github.com/apache/spark/pull/3825

[SPARK-4991][CORE] Worker should reconnect to Master when Master actor 
restart

This is a following JIRA of 
[SPARK-4989](https://issues.apache.org/jira/browse/SPARK-4991). when Master 
akka actor encounter an exception, the Master will restart (akka actor restart 
not JVM restart). And all old information are cleared on Master (including 
workers, applications, etc). However, the workers are not aware of this at all. 
The state of the cluster is that: the master is on, and all workers are also 
on, but master is not aware of the exists of workers, and will ignore all 
worker's heartbeat because all workers are not registered. So that the whole 
cluster is not available.

In this PR, master will tell worker the connection is disconnected, so that 
worker will register to master again. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liyezhang556520/spark workerReconn

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3825.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3825


commit 107e5c58fdbe143fe6eabcfdb5d91d7b1184bb35
Author: Zhang, Liye liye.zh...@intel.com
Date:   2014-12-29T07:35:45Z

worker reconnect to master when master restart for exception

commit e9c99e3969f6e058e46d65575d796d1289351318
Author: Zhang, Liye liye.zh...@intel.com
Date:   2014-12-29T08:51:50Z

add log info




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2014-12-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68241590
  
  [Test build #24859 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24859/consoleFull)
 for   PR 3825 at commit 
[`e9c99e3`](https://github.com/apache/spark/commit/e9c99e3969f6e058e46d65575d796d1289351318).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2014-12-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68246470
  
  [Test build #24859 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24859/consoleFull)
 for   PR 3825 at commit 
[`e9c99e3`](https://github.com/apache/spark/commit/e9c99e3969f6e058e46d65575d796d1289351318).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class MasterDisconnected(masterUrl: String) extends 
DeployMessage`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2014-12-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68246474
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24859/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2014-12-29 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68284859
  
Wouldn't it be better to ensure that actors like Master and DAGScheduler 
never die due to uncaught exceptions?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4991][CORE] Worker should reconnect to ...

2014-12-29 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/3825#issuecomment-68286943
  
More specifically, I guess I'm suggesting that we modify wrap the `receive` 
and `receiveWithLogging` methods of our actors with try-catch blocks to log any 
exceptions that bubble up to the top of the actors.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org