[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2018-10-08 Thread Jorge Machado (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16641468#comment-16641468
 ] 

Jorge Machado commented on MESOS-2186:
--

Hi guys,

I think this needs to be re-opened. I am hitting the same failure on a Mesos 
1.3.2 cluster:

Running on machine: mesosAgentNode
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
F1008 07:34:48.771600 12897 zookeeper.cpp:132] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]

We have five ZooKeeper servers configured, and the last of them was removed 
from our network. The cluster is now completely broken.

This should not happen.
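
For anyone scanning agent logs for this failure, the log line format quoted above can be split into fields with a small parser. This is only a sketch: `GlogLine` and `parse_glog_line` are names made up for illustration, and the regular expression assumes the header layout shown in the "Log line format" line, with a mail-wrapped message joined back onto a single line first.

```python
import re
from typing import NamedTuple, Optional


class GlogLine(NamedTuple):
    """Fields of one log line (hypothetical type, for illustration only)."""
    severity: str      # one of I, W, E, F
    month: int
    day: int
    time: str          # hh:mm:ss.uuuuuu, kept as text
    thread_id: int
    source_file: str
    source_line: int
    message: str


# Layout assumed from the log header: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
_GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<msg>.*)$"
)


def parse_glog_line(line: str) -> Optional[GlogLine]:
    """Return the parsed fields, or None if the line does not match the layout."""
    m = _GLOG_RE.match(line)
    if m is None:
        return None
    return GlogLine(
        severity=m["sev"],
        month=int(m["month"]),
        day=int(m["day"]),
        time=m["time"],
        thread_id=int(m["tid"]),
        source_file=m["file"],
        source_line=int(m["line"]),
        message=m["msg"],
    )
```

Filtering on `severity == "F"` and `source_file == "zookeeper.cpp"` would pick out the fatal line above; lines that do not follow the layout, such as the ZOO_ERROR lines from the ZooKeeper client, simply return None.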

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] 

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-22 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969565#comment-14969565
 ] 

Neil Conway commented on MESOS-2186:


FWIW, this sounds like pretty weird DNS behavior: a host being down shouldn't 
result in getaddrinfo() returning EAI_NONAME. You could possibly work around 
this by doing your own hostname resolution and passing IPs into Mesos, but I 
think the root problem is that DNS in this environment behaves weirdly.
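
A rough sketch of that workaround, in case it helps someone: resolve the hostnames yourself before starting Mesos and pass the result in. Everything here is an assumption for illustration. `resolve_zk_url` is a hypothetical helper, it expects the plain `zk://host:port,host:port/path` form without credentials, and skipping hosts that fail to resolve is a choice made in this sketch, not something Mesos does.

```python
import socket
from typing import Callable, List


def resolve_zk_url(url: str,
                   resolver: Callable[[str], str] = socket.gethostbyname) -> str:
    """Rewrite a zk:// URL so each hostname is replaced by an IPv4 address.

    Hosts whose lookup fails are left out instead of aborting, so one stale
    DNS entry does not block the remaining servers (assumption of this sketch).
    """
    prefix = "zk://"
    if not url.startswith(prefix):
        raise ValueError("expected a zk:// URL")

    # Split "host1:2181,host2:2181/mesos" into the host list and the znode path.
    hosts_part, slash, path = url[len(prefix):].partition("/")

    resolved: List[str] = []
    for entry in hosts_part.split(","):
        host, colon, port = entry.partition(":")
        try:
            ip = resolver(host)
        except OSError:
            # socket.gaierror is a subclass of OSError: skip unresolvable hosts.
            continue
        resolved.append(ip + colon + port)

    if not resolved:
        raise RuntimeError("no ZooKeeper host could be resolved")
    return prefix + ",".join(resolved) + slash + path
```

The rewritten URL would then be handed to Mesos in place of the original one. The obvious drawback is that the addresses go stale if a ZooKeeper host is rebuilt with a new IP, so the resolution has to be repeated on every restart.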

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-22 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968622#comment-14968622
 ] 

Steven Schlansker commented on MESOS-2186:
--

That's a bummer.  Thank you, everyone, for looking into this and for your time.

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968076#comment-14968076
 ] 

Raul Gutierrez Segales commented on MESOS-2186:
---

I would think so... 

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968075#comment-14968075
 ] 

Raul Gutierrez Segales commented on MESOS-2186:
---

Yeah, at least for the 3.4 branch we'll probably not have the constructor 
(zookeeper_init) retry the failed getaddrinfo() calls, so it's up to the caller.

(Ignore the part about the locks not being properly initialized in the 
description of ZOOKEEPER-1029; that has nothing to do with this bug.)
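
To make "up to the caller" concrete, a caller-side retry could look roughly like the sketch below. This is not the Mesos code. `init_with_retry` is a hypothetical helper, and `init` stands in for whatever wraps zookeeper_init(); it is assumed to return None on failure, mirroring the NULL return of the C call. The attempt count and delays are arbitrary illustration values.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def init_with_retry(init: Callable[[], Optional[T]],
                    max_attempts: int = 5,
                    initial_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call init() until it returns a handle or the attempts run out.

    The delay doubles after every failed attempt. Returns None when every
    attempt failed, leaving the final decision (abort or continue) to the caller.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        handle = init()
        if handle is not None:
            return handle
        if attempt < max_attempts:
            # Back off before the next attempt; no sleep after the last one.
            sleep(delay)
            delay *= 2
    return None
```

Note that this only papers over short DNS hiccups. If the name stays unresolvable, every attempt fails and the caller ends up in the same place, just later.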

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968131#comment-14968131
 ] 

Neil Conway commented on MESOS-2186:


If the DNS resolution failure lasts for a long time, zookeeper_init() will 
continue to return NULL and hence Mesos will still be unable to make progress.

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968027#comment-14968027
 ] 

Steven Schlansker commented on MESOS-2186:
--

I reopened the ticket since it is still a crasher in master.  I hope that is 
appropriate; I apologize in advance if not.  I'm not trying to be a stick in 
the mud, but this compromises the "high availability" of Mesos, which is a 
critical piece of infrastructure.

> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
> 28640 master.cpp:366] Master 

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968034#comment-14968034
 ] 

Neil Conway commented on MESOS-2186:


Hi Steven,

The current theory is that this is a ZooKeeper bug; from a quick look at the ZK 
bug ([ZOOKEEPER-1029]), that seems likely correct to me. When there is a 
ZooKeeper patch for the problem, we can discuss whether to backport it to Mesos 
in the time before a new ZK stable release is made. Other than that, I'm not 
sure what else we can do.

> Mesos crashes if any configured zookeeper does not resolve.
> ---
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.21.0, 0.26.0
> Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>Reporter: Daniel Hall
>Priority: Critical
>  Labels: mesosphere
>
> When starting Mesos, if one of the configured zookeeper servers does not 
> resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
> while we were rebuilding one of our zookeeper hosts in Google compute (which 
> bases the DNS on the machines running).
> Here is a log from a failed startup (hostnames and ip addresses have been 
> sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
> 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
> 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
> 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
> such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
> 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
> file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
> create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
> 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
> zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
> trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
> google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
> google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
> 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
> (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968056#comment-14968056
 ] 

Steven Schlansker commented on MESOS-2186:
--

Well, rgs above called into question whether that is truly the case. 
Additionally, at least as of now, the "check failure stack trace" is entirely 
in C++ code, seemingly not in the ZooKeeper library (pure C).


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968073#comment-14968073
 ] 

Steven Schlansker commented on MESOS-2186:
--

If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is 
unrelated, yeah?


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968143#comment-14968143
 ] 

Steven Schlansker commented on MESOS-2186:
--

Maybe this will end up being too hard to fix, since it seems to be a limitation 
of the ZK C API.  It's just surprising from an end-user perspective that a 
single name failing to resolve (even when two are still happy) causes such a 
disruptive failure.
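
To make the end-user expectation concrete, here is a minimal C++ sketch of 
filtering a server list down to the entries that still resolve before handing 
it to a client. It is not how Mesos or the ZK C client behaves. The names 
splitServers, hostOf, resolvableServers and the Resolver callback are all 
hypothetical, and the resolver is injected so that the sketch does not depend 
on real DNS.

```cpp
#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical resolver callback: returns true if the host name resolves.
// A real caller might wrap getaddrinfo() here; this sketch does not.
using Resolver = std::function<bool(const std::string& host)>;

// Splits a comma-separated list such as "zk1:2181,zk2:2181" into entries,
// skipping empty ones.
std::vector<std::string> splitServers(const std::string& servers) {
  std::vector<std::string> entries;
  std::stringstream stream(servers);
  std::string entry;
  while (std::getline(stream, entry, ',')) {
    if (!entry.empty()) {
      entries.push_back(entry);
    }
  }
  return entries;
}

// Returns the part of "host:port" before the first ':' (the whole entry if
// there is no port). IPv6 literals are not handled in this sketch.
std::string hostOf(const std::string& entry) {
  return entry.substr(0, entry.find(':'));
}

// Keeps only the entries whose host resolves, so that one bad name does not
// make the whole list unusable.
std::vector<std::string> resolvableServers(const std::string& servers,
                                           const Resolver& resolves) {
  std::vector<std::string> usable;
  for (const std::string& entry : splitServers(servers)) {
    if (resolves(hostOf(entry))) {
      usable.push_back(entry);
    }
  }
  return usable;
}
```

With three configured servers and one name that fails to resolve, 
resolvableServers would return the two remaining entries; an empty result 
would still have to be treated as a failure.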


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968063#comment-14968063
 ] 

Neil Conway commented on MESOS-2186:


The check failure trace happens because the call to zookeeper_init() returns 
NULL; Mesos checks for this and aborts with an error and a stack trace.
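
As a rough illustration of that check-and-abort pattern, here is a 
self-contained C++ sketch. It is not the Mesos source: StubHandle, 
initFailureMessage and checkInitialized are hypothetical names, and the real 
code uses a logging library's fatal check rather than std::abort() directly. 
The message only mirrors the shape of the fatal line in the quoted log.

```cpp
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>

// Hypothetical stand-in for the opaque handle returned by the ZooKeeper C
// client on success.
struct StubHandle {};

// Formats a fatal message from an errno value, in the shape
// "Failed to create ZooKeeper, zookeeper_init: <strerror text> [<errno>]".
std::string initFailureMessage(int err) {
  return std::string("Failed to create ZooKeeper, zookeeper_init: ") +
         std::strerror(err) + " [" + std::to_string(err) + "]";
}

// Returns the handle unchanged if initialisation succeeded; otherwise logs
// the failure and aborts the process, which is what produces a crash rather
// than a retry.
StubHandle* checkInitialized(StubHandle* handle, int err) {
  if (handle == nullptr) {
    std::cerr << initFailureMessage(err) << std::endl;
    std::abort();
  }
  return handle;
}
```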


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968072#comment-14968072
 ] 

Steven Schlansker commented on MESOS-2186:
--

If zookeeper_init() returns NULL, that in fact means that ZOOKEEPER-1029 is 
unrelated, yeah?


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-21 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968099#comment-14968099
 ] 

Neil Conway commented on MESOS-2186:


Ah, okay. So the situation seems to be:

(1) zookeeper_init() returns NULL when getaddrinfo() fails, as intended.
(2) Mesos is _designed_ to loop and retry zookeeper_init(), but it doesn't do 
this: we use a gross hack to determine whether the zookeeper_init() failure was 
due to a hostname resolution failure, and apparently it doesn't account for 
this case (we're expecting errno == EINVAL, but apparently we see ENOENT 
instead).
(3) Hence, we abort the process.

We can revise the condition we're checking in #2 slightly, but that is only 
intended as a convenience anyway: as discussed above, you should be running 
Mesos under process supervision and restarting it when it fails. (The question 
is just whether we do the retry loop in Mesos itself or in the process 
supervisor.) If Mesos exiting unexpectedly "compromises the 'high availability' 
of Mesos", your Mesos installation is not configured correctly.
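
The retry condition described in (2) can be sketched as follows. This is a minimal illustration, not the actual code in zookeeper.cpp: {{Handle}}, {{isRetryable}} and {{initWithRetry}} are hypothetical names, the init call is passed in as a callback rather than calling {{zookeeper_init}} directly, and treating ENOENT like EINVAL is the assumed revision under discussion.

```cpp
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for the handle type returned by the init call.
struct Handle {};

// Assumption: both EINVAL and ENOENT indicate a hostname resolution
// failure that is worth retrying. Any other errno is treated as fatal.
bool isRetryable(int error) {
  return error == EINVAL || error == ENOENT;
}

// Calls `init` until it returns a non-null handle, fails with a
// non-retryable errno, or `maxAttempts` attempts have been made.
// Returns nullptr if no handle could be obtained.
Handle* initWithRetry(const std::function<Handle*()>& init,
                      int maxAttempts,
                      std::chrono::milliseconds delay) {
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    errno = 0;
    Handle* handle = init();
    const int error = errno;  // Save before any other call can change it.

    if (handle != nullptr) {
      return handle;
    }
    if (!isRetryable(error)) {
      return nullptr;  // The caller decides whether to abort.
    }
    if (attempt < maxAttempts) {
      std::this_thread::sleep_for(delay);
    }
  }
  return nullptr;
}
```

Whether this loop lives in Mesos or the equivalent restart policy lives in the process supervisor is the design choice discussed above; the sketch only shows the in-process variant.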


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-19 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14963868#comment-14963868
 ] 

Vinod Kone commented on MESOS-2186:
---

Linking the ticket that added retry logic around zookeeper_init in Mesos.

[~rgs] The above ticket also includes production traces showing the issue. Let 
me know if that's helpful.


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-10-18 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962629#comment-14962629
 ] 

Raul Gutierrez Segales commented on MESOS-2186:
---

[~stevenschlansker]: I am not sure that what's described in ZOOKEEPER-1029 is 
what you are seeing here. Could you please provide a repro? 

For instance, this small test program does not crash when using bad hostnames:

https://gist.github.com/rgs1/998684a0e93c072cb65d

Thanks!


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-08-11 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682072#comment-14682072
 ] 

Steven Schlansker commented on MESOS-2186:
--

I strongly disagree with closing this bug: it is not fixed, and it is a serious 
issue affecting multiple end users.  The ZOOKEEPER- bug tracks the actual fix; 
IMO this bug should then track integrating a fixed library into Mesos.


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-06-15 Thread Raul Gutierrez Segales (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14586492#comment-14586492
 ] 

Raul Gutierrez Segales commented on MESOS-2186:
---

The ZooKeeper client should probably just keep retrying the lookups from the IO 
thread (i.e., the failure could very well be transient). I don't think a failed 
DNS lookup should be coupled with failing the ZK handler (at all). 

Let's follow up in ZOOKEEPER-1029. Thanks!


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-03-10 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354783#comment-14354783
 ] 

Alexander Rojas commented on MESOS-2186:


A related issue from the ZooKeeper tracker: ZOOKEEPER-1029


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-03-06 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350433#comment-14350433
 ] 

Alexander Rojas commented on MESOS-2186:


So, I was looking into this, and it is definitely a ZooKeeper client bug. When 
we call {{zookeeper_init}}, the ZooKeeper client ends up calling a function 
called {{getaddrs}}, which goes through the list of given hosts calling 
[{{getaddrinfo}}|http://linux.die.net/man/3/getaddrinfo]. If any of the calls 
fails, it marks the whole initialisation as failed.

Since no more information about the failed initialisation is given to the 
caller, apart from the return value being null and {{errno}} being set to 
{{EINVAL}}, Mesos just aborts.

So, I guess the idea would be to report a bug to the ZooKeeper project. The 
question then is, what is the desired behaviour?
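
To make the two candidate behaviours concrete, here is a minimal sketch that resolves each host independently with {{getaddrinfo}} and then applies either a strict or a lenient policy. It is not the ZooKeeper client's {{getaddrs}} implementation; {{ResolveResult}}, {{resolveEach}}, {{strictOk}} and {{lenientOk}} are hypothetical names introduced only for illustration.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <string>
#include <vector>

// Outcome of resolving a list of hosts one by one.
struct ResolveResult {
  std::vector<std::string> resolved;    // Hosts getaddrinfo() accepted.
  std::vector<std::string> unresolved;  // Hosts getaddrinfo() rejected.
};

// Resolves every host on its own so one failure does not hide the others.
// `flags` is copied into ai_flags (e.g. AI_NUMERICHOST to skip DNS).
ResolveResult resolveEach(const std::vector<std::string>& hosts,
                          const std::string& port,
                          int flags = 0) {
  ResolveResult result;
  for (const std::string& host : hosts) {
    addrinfo hints{};
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = flags;

    addrinfo* addrs = nullptr;
    const int rc = getaddrinfo(host.c_str(), port.c_str(), &hints, &addrs);
    if (rc == 0) {
      result.resolved.push_back(host);
      freeaddrinfo(addrs);
    } else {
      result.unresolved.push_back(host);
    }
  }
  return result;
}

// Strict policy, matching the behaviour described above: any host that
// fails to resolve fails the whole initialisation.
bool strictOk(const ResolveResult& result) {
  return result.unresolved.empty();
}

// Lenient policy, one possible answer to the "desired behaviour" question:
// proceed as long as at least one host resolved.
bool lenientOk(const ResolveResult& result) {
  return !result.resolved.empty();
}
```

With a host list in which only the last entry fails to resolve, {{strictOk}} reports failure while {{lenientOk}} would let the client start with the remaining hosts.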


[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-02-18 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325597#comment-14325597
 ] 

Benjamin Mahler commented on MESOS-2186:


Aha, thanks for clarifying, and apologies that you were bitten by this! I believe 
they fixed this issue in the Java client (ZOOKEEPER-1576), but we'll need to 
investigate further for the C client.

{quote}
Absolutely upstart should give up if it has tried restarting the process many 
times in a short period
{quote}

Is upstart restarting the process immediately, or after a delay? I would 
caution against immediate restart for the reason you mentioned. You can alert 
while it remains slowly flapping, instead of allowing upstart to give up. I 
mention this because giving up transitions the system into a state that 
_requires_ human intervention to right the ship.
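
As an illustration of restarting after a delay rather than immediately, here is a minimal Python sketch of a supervisor loop with capped exponential backoff. It is not upstart configuration and not part of Mesos; the names `backoff_delays` and `supervise`, and the default delay values, are assumptions made for this example.

```python
import itertools
import subprocess
import time


def backoff_delays(initial=1.0, factor=2.0, maximum=60.0):
    """Yield restart delays that grow geometrically up to a cap."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def supervise(command, attempts=None, run=subprocess.call, sleep=time.sleep):
    """Run command repeatedly, waiting longer after each exit.

    attempts=None means never give up, so the process keeps flapping
    slowly (and can be alerted on) instead of staying down.
    """
    counter = itertools.count() if attempts is None else range(attempts)
    for _, delay in zip(counter, backoff_delays()):
        status = run(command)
        print(f"{command[0]} exited with status {status}; "
              f"restarting in {delay:.0f}s")
        sleep(delay)
```

A real supervisor would probably also reset the delay once the process has stayed up for a while; that is left out to keep the sketch short.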

For the education of others that encounter this ticket, how did this kill the 
entire cluster? All of the tasks should remain running in such a situation, 
was that not the case? What happened on the slave-side?

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-02-18 Thread Daniel Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326975#comment-14326975
 ] 

Daniel Hall commented on MESOS-2186:


Let's just call the upstart giving-up setting a personal preference. In our 
environment we have plenty of spare capacity, so having a single machine give up 
is okay, and if the whole cluster is having issues then you are going to need a 
human regardless. In any case, it's inconsequential to this bug.

In our environment the whole cluster (Mesos and its frameworks, Marathon and 
Chronos) loses the ability to hold elections. This breaks the cluster for 
both the masters and the slaves. Indeed, all of the tasks remain running. Once 
you either remove the zookeeper from all the client lists, or provision a new 
server (and hence DNS) with the same name, everything starts connecting again, 
and all the masters return and elect a leader. However, if you look at the 
task list in Mesos, lots of tasks are missing that are still running on the 
slaves. Marathon also sees all the old tasks running until the next 
reconciliation. The only way we have found to recover from this situation is to 
restart all the mesos-slave processes, which kills all the tasks on each 
slave. I imagine that this is a separate bug, but since I can reproduce it 
reliably in our staging environment I'll be able to file a better bug report 
once I get some spare time.

We encountered this bug while reprovisioning one of the zookeepers in our 
cluster. It seems you can work around the issue by adding an `/etc/hosts` entry 
on the cluster machines for the machine that is about to be removed. The IP 
address you give doesn't even need to be running zookeeper; it just has to be 
resolvable.
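
Since the trigger is a hostname that fails to resolve, a pre-flight check before removing or rebuilding a zookeeper host can show which machines need the `/etc/hosts` entry. The following is a minimal Python sketch, assuming the usual `zk://host1:port1,host2:port2/path` URL form; the function names and the example hostnames are hypothetical and not part of Mesos.

```python
import socket


def parse_zk_hosts(zk_url):
    """Split a zk://h1:2181,h2:2181/path URL into (host, port) pairs."""
    if not zk_url.startswith("zk://"):
        raise ValueError("expected a zk:// URL")
    servers = zk_url[len("zk://"):].split("/", 1)[0]
    # Drop optional "user:pass@" credentials in front of the host list.
    servers = servers.rsplit("@", 1)[-1]
    pairs = []
    for server in servers.split(","):
        host, _, port = server.partition(":")
        # Assumption: default to the standard ZooKeeper client port.
        pairs.append((host, int(port) if port else 2181))
    return pairs


def unresolvable_hosts(zk_url, resolve=socket.getaddrinfo):
    """Return the hosts in zk_url whose names fail to resolve."""
    failed = []
    for host, port in parse_zk_hosts(zk_url):
        try:
            resolve(host, port)
        except socket.gaierror:
            failed.append(host)
    return failed


if __name__ == "__main__":
    # Hypothetical ensemble; replace with the URL passed to --zk.
    url = "zk://zk1.internal:2181,zk2.internal:2181,zk3.internal:2181/mesos"
    for host in unresolvable_hosts(url):
        print(f"{host} does not resolve; add an /etc/hosts entry first")
```

The sketch only checks name resolution, which matches the observation above that the address does not need to be running zookeeper. It does not handle IPv6 literals.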

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-02-17 Thread Daniel Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325503#comment-14325503
 ] 

Daniel Hall commented on MESOS-2186:


We are running it under upstart, which restarts the daemon when it crashes. 
However, if any of the configured zookeepers is not resolving, it crashes again, 
so restarting the process has no effect.

Eventually the respawn limit gets hit and upstart (understandably) gives up.

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-02-17 Thread Daniel Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325540#comment-14325540
 ] 

Daniel Hall commented on MESOS-2186:


The zookeeper cluster is still available. We run three zookeeper servers. If a 
single one is down, the cluster can still operate because the quorum only 
requires two machines to be up. However, if a single server is not resolving in 
DNS, then this bug is triggered and Mesos is unable to connect to the cluster 
despite it having quorum.
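
The quorum arithmetic here is the usual majority rule: an ensemble of N servers needs more than half of them up. A small Python sketch of that rule follows; the function names are made up for this illustration.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size, servers_up):
    """True if enough servers are up for the ensemble to operate."""
    return servers_up >= quorum_size(ensemble_size)


# Three servers with one down still have quorum (two are required).
print(quorum_size(3), has_quorum(3, 2), has_quorum(3, 1))
```

With three servers, losing one leaves the ensemble healthy, which is why a client crashing over a single unresolvable name is surprising.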

Absolutely upstart should give up if it has tried restarting the process many 
times in a short period. If it didn't, it could be responsible for a thundering 
herd issue. We would rather it stop trying and alert a human operator.

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-02-17 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325528#comment-14325528
 ] 

Benjamin Mahler commented on MESOS-2186:


Ok that's great, though you'll not want upstart to give up like that. Even if 
we fixed this, no master can be elected during the time of ZK unavailability.

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-02-17 Thread Daniel Hall (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325475#comment-14325475
 ] 

Daniel Hall commented on MESOS-2186:


This issue just killed our entire cluster. Is there anything I can do to get 
some priority on this?

[jira] [Commented] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

2015-02-17 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325496#comment-14325496
 ] 

Benjamin Mahler commented on MESOS-2186:


Please protect yourself by ensuring that the masters are run under a process 
which will restart them when they terminate for _any_ reason; this is required 
to operate a cluster that is highly available. Is there somewhere that you 
think this could be documented to help others avoid getting bitten?

 Mesos crashes if any configured zookeeper does not resolve.
 ---

 Key: MESOS-2186
 URL: https://issues.apache.org/jira/browse/MESOS-2186
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.21.0
 Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
 Mesos: 0.21.0-1.0.centos65
 CentOS: CentOS release 6.6 (Final)
Reporter: Daniel Hall
Priority: Critical
  Labels: mesosphere

 When starting Mesos, if one of the configured zookeeper servers does not 
 resolve in DNS Mesos will crash and refuse to start. We noticed this issue 
 while we were rebuilding one of our zookeeper hosts in Google compute (which 
 bases the DNS on the machines running).
 Here is a log from a failed startup (hostnames and IP addresses have been 
 sanitised).
 {noformat}
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 
 28627 main.cpp:292] Starting Mesos master
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
 such file or directory
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 
 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
 file or directory [2]
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
 trace: ***
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
 such file or directory
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
 file or directory [2]
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
 trace: ***
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
 google::LogMessage::Fail()
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
 google::LogMessage::Fail()
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
 google::LogMessage::SendToLog()
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
 such file or directory
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
 file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
 create ZooKeeper, zookeeper_init: No such file or directory [2]
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
 trace: ***
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
 google::LogMessage::Fail()
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 
 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No 
 such file or directory
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 
 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such 
 file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to 
 create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 
 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, 
 zookeeper_init: No such file or directory [2]
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack 
 trace: ***
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160  
 google::LogMessage::Fail()
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
 google::LogMessage::SendToLog()
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9  
 google::LogMessage::SendToLog()
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 
 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 
 (mesosmaster-2.internal) started on 10.x.x.x:5050
 Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 
 28640 master.cpp:366] Master allowing unauthenticated frameworks to