Hi there,

I checked the source code and found that org.apache.spark.deploy.client.AppClient defines two constants (around line 52):

    val REGISTRATION_TIMEOUT = 20.seconds
    val REGISTRATION_RETRIES = 3

As far as I know, if I want to increase the number of retries, must I modify this value, rebuild the entire Spark project, and then redeploy the cluster with my modified version? Or is there a better way to solve this issue?

Thanks.
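In case it is useful: while REGISTRATION_RETRIES above is a compile-time constant in this version, the Akka-level message retries can be raised through configuration without a rebuild. A sketch in conf/spark-defaults.conf, assuming the spark.akka.* properties documented for Spark 1.x apply to the version in use (the chosen values here are illustrative, not recommendations):

    # conf/spark-defaults.conf -- sketch, assuming Spark 1.x Akka settings
    spark.akka.num.retries   6      # times to retry a failed Akka message (default 3)
    spark.akka.retry.wait    3000   # milliseconds to wait between retries (default 3000)
    spark.akka.timeout       100    # seconds allowed for communication between nodes (default 100)

These affect master/worker RPC behavior generally; they do not change the application-registration retry count hardcoded in AppClient.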
--------------------------------
Thanks & best regards!
San.Luo

----- Original message -----
From: <luohui20...@sina.com>
To: "user" <user@spark.apache.org>
Subject: All masters are unresponsive issue
Date: 2015-07-02 17:31

Hi there:

I got a problem: "Application has been killed. Reason: All masters are unresponsive! Giving up." I checked the network I/O and found that it is sometimes really high while my app is running; please refer to the attached picture for more info. I also checked http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/connectivity_issues.html and set SPARK_LOCAL_IP in every node's spark-env.sh of my Spark cluster, but it did not help solve the problem. I am not sure whether this parameter is set correctly; my setting looks like this:

On node1: export SPARK_LOCAL_IP={node1's IP}
On node2: export SPARK_LOCAL_IP={node2's IP}
......

BTW, I guess Akka retries 3 times when the master and slaves communicate; is it possible to increase the Akka retries? And apart from expanding the network bandwidth, is there another way to solve this problem? Thanks for any coming ideas.

--------------------------------
Thanks & best regards!
San.Luo
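For reference, the per-node binding described in the message could look like this in each node's conf/spark-env.sh; the IP addresses below are placeholders I made up for illustration, not values from the original message:

    # conf/spark-env.sh on node1 (address is a hypothetical example)
    export SPARK_LOCAL_IP=192.168.1.101

    # conf/spark-env.sh on node2 (address is a hypothetical example)
    export SPARK_LOCAL_IP=192.168.1.102

Each node should bind to its own address, as in the original setup; the file is sourced by the Spark start scripts, so no restart-independent mechanism applies and the daemons must be restarted after a change.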