[jira] [Resolved] (SPARK-19900) [Standalone] Master registers application again when driver relaunched
[ https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-19900.
---------------------------------
    Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 18084
[https://github.com/apache/spark/pull/18084]

> [Standalone] Master registers application again when driver relaunched
> ----------------------------------------------------------------------
>
>                  Key: SPARK-19900
>                  URL: https://issues.apache.org/jira/browse/SPARK-19900
>              Project: Spark
>           Issue Type: Bug
>           Components: Deploy, Spark Core
>     Affects Versions: 1.6.2
>          Environment: CentOS 6.5, Spark standalone
>             Reporter: Sergey
>             Priority: Critical
>               Labels: Spark, network, standalone, supervise
>              Fix For: 2.3.0
>
> I've found a problem that occurs when the node where the driver is running has an unstable network: two identical applications can end up running on the cluster at the same time.
>
> *Steps to Reproduce:*
> # Prepare 3 nodes: one for the Spark master and two for the Spark workers.
> # Submit an application with the parameter spark.driver.supervise = true.
> # Go to the node where the driver is running (for example spark-worker-1) and block port 7077:
> {code}
> # iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # Wait more than 60 seconds.
> # Look at the Spark master UI. There are two Spark applications and one driver. The new application is in WAITING state and the second application is in RUNNING state. The driver is in RUNNING or RELAUNCHING state (it depends on the resources available, as I understand it) and it is launched on another node (for example spark-worker-2).
> # Open the port again:
> {code}
> # iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
> # Look at the Spark UI again. There are no changes.
> In addition, if you look at the processes on the node spark-worker-1
> {code}
> # ps ax | grep spark
> {code}
> you will see that the old driver is still running!
> *Spark master logs:*
> {code}
> 17/03/10 05:26:27 WARN Master: Removing worker-20170310052240-spark-worker-1-35039 because we got no heartbeat in 60 seconds
> 17/03/10 05:26:27 INFO Master: Removing worker worker-20170310052240-spark-worker-1-35039 on spark-worker-1:35039
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 1
> 17/03/10 05:26:27 INFO Master: Telling app of lost executor: 0
> 17/03/10 05:26:27 INFO Master: Re-launching driver-20170310052347-
> 17/03/10 05:26:27 INFO Master: Launching driver driver-20170310052347- on worker worker-20170310052411-spark-worker-2-40473
> 17/03/10 05:26:35 INFO Master: Registering app TestApplication
> 17/03/10 05:26:35 INFO Master: Registered app TestApplication with ID app-20170310052635-0001
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> (the same heartbeat warning repeats three more times)
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor app-20170310052354-/1
> 17/03/10 05:31:07 WARN Master: Got status update for unknown executor app-20170310052354-/0
> 17/03/10 05:31:07 WARN Master: Got heartbeat from unregistered worker worker-20170310052240-spark-worker-1-35039. Asking it to re-register.
> (the same heartbeat warning repeats nine more times; the log is truncated here)
> {code}
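The reproduction steps quoted in the report can be collected into a small helper script. This is an illustrative sketch, not part of the original report: the host names, jar path, and function names are placeholders, and the iptables calls must be run as root on the worker node that hosts the driver. Port 7077 is the default Spark standalone master port used in the report.

```shell
#!/usr/bin/env bash
# Sketch of the network-partition reproduction from the report above.
# Placeholders (assumptions): spark-master host name, /path/to/test-application.jar.

MASTER_PORT=7077   # default standalone master port, as in the report

submit_supervised_app() {
  # Submit in cluster deploy mode with supervision enabled, so the master
  # will relaunch the driver if the worker running it is lost.
  spark-submit \
    --master "spark://spark-master:${MASTER_PORT}" \
    --deploy-mode cluster \
    --conf spark.driver.supervise=true \
    /path/to/test-application.jar
}

block_master_port() {
  # Drop outgoing TCP traffic to the master, simulating an unstable network
  # on the node where the driver runs.
  iptables -A OUTPUT -p tcp --dport "$MASTER_PORT" -j DROP
}

unblock_master_port() {
  # Remove the DROP rule, restoring connectivity to the master.
  iptables -D OUTPUT -p tcp --dport "$MASTER_PORT" -j DROP
}

check_stale_driver() {
  # After the partition heals, the old driver process may still be alive here.
  ps ax | grep spark
}
```

Running `block_master_port`, waiting past the 60-second heartbeat timeout, and then running `unblock_master_port` should reproduce the state described above: a relaunched driver on another worker plus the original, now-orphaned driver still visible via `check_stale_driver`.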
[jira] [Resolved] (SPARK-19900) [Standalone] Master registers application again when driver relaunched
[ https://issues.apache.org/jira/browse/SPARK-19900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-19900.
-------------------------------
    Resolution: Cannot Reproduce

> [Standalone] Master registers application again when driver relaunched
> ----------------------------------------------------------------------