Hi Ben, let me know if there is a new issue created for this; I would like to add myself as a watcher. Thanks.
On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <[email protected]> wrote:

> Hi Ben,
>
> That is accurate, with one additional line:
>
> -Agent running fine with 0.24.1
> -Transient ZK issues, slave flapping with zookeeper_init failure
> -ZK issue resolved
> -Most agents stop flapping and function correctly
> -Some agents continue flapping, but silent exit after printing the detector.cpp:481 log line.
> -The agents that continue to flap repaired with manual removal of contents in mesos-slave's working dir
>
>
> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <[email protected]> wrote:
>
>> Hey Sharma,
>>
>> I didn't quite follow the timeline of events here or how the agent logs you posted fit into the timeline of events. Here's how I interpreted:
>>
>> -Agent running fine with 0.24.1
>> -Transient ZK issues, slave flapping with zookeeper_init failure
>> -ZK issue resolved
>> -Most agents stop flapping and function correctly
>> -Some agents continue flapping, but silent exit after printing the detector.cpp:481 log line.
>>
>> Is this accurate? What is the exit code from the silent exit?
>>
>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <[email protected]> wrote:
>>
>>> Maybe related, but, maybe different since a new process seems to find the master leader and still aborts, never recovering with restarts until work dir data is removed.
>>> It is happening in 0.24.1.
>>>
>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <[email protected]> wrote:
>>>
>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I guess you are saying it is somehow related but not exactly the same issue?
>>>>
>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <[email protected]> wrote:
>>>>
>>>>> On 9 February 2016 at 11:04, Sharma Podila <[email protected]> wrote:
>>>>>
>>>>>> We had a few mesos agents stuck in an unrecoverable state after a transient ZK init error. Is this a known problem? I wasn't able to find an existing jira item for this. We are on 0.24.1 at this time.
>>>>>>
>>>>>> Most agents were fine, except a handful. These handful of agents had their mesos-slave process constantly restarting. The .INFO logfile had the following contents below, before the process exited, with no error messages. The restarts were happening constantly due to an existing service keep alive strategy.
>>>>>>
>>>>>> To fix it, we manually stopped the service, removed the data in the working dir, and then restarted it. The mesos-slave process was able to restart then. The manual intervention needed to resolve it is problematic.
>>>>>>
>>>>>> Here's the contents of the various log files on the agent:
>>>>>>
>>>>>> The .INFO logfile for one of the restarts before mesos-slave process exited with no other error messages:
>>>>>>
>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
>>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
>>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
>>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
>>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
>>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
>>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
>>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
>>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master ([email protected]:7103) is detected
>>>>>>
>>>>>> The .FATAL log file when the original transient ZK error occurred:
>>>>>>
>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>
>>>>>> The .ERROR log file:
>>>>>>
>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>
>>>>>> The .WARNING file had the same content.
>>>>>
>>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>>
>>>>> -rgs


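A minimal sketch of the manual recovery described in the thread (stop mesos-slave, remove the contents of its work dir, restart), wrapped in a small supervisor loop that also records the exit code Ben asked about. The work dir path comes from the posted logs; the slave command line, flap thresholds, and the wipe-on-flap policy are illustrative assumptions, not anything confirmed in the thread.

    #!/usr/bin/env python
    """Illustrative sketch only: supervise mesos-slave, log its exit code
    (the thread notes the exits were silent), and clear the agent work dir
    contents after repeated rapid exits, mirroring the manual fix above."""

    import os
    import shutil
    import subprocess
    import time

    # Taken from the posted logs ("Recovering state from '/mnt/data/mesos/meta'");
    # adjust to match the agent's actual --work_dir.
    WORK_DIR = "/mnt/data/mesos"

    # Hypothetical command line; the real flags are in the agent's startup log.
    SLAVE_CMD = ["mesos-slave", "--work_dir=%s" % WORK_DIR]

    FLAP_WINDOW_SECS = 60   # exits faster than this count as "flapping"
    MAX_RAPID_EXITS = 5     # clear the work dir after this many rapid exits


    def clear_work_dir():
        # Remove the contents of the work dir but keep the directory itself,
        # matching the manual removal described in the thread.
        for entry in os.listdir(WORK_DIR):
            path = os.path.join(WORK_DIR, entry)
            if os.path.isdir(path):
                shutil.rmtree(path, ignore_errors=True)
            else:
                os.remove(path)


    def main():
        rapid_exits = 0
        while True:
            started = time.time()
            proc = subprocess.Popen(SLAVE_CMD)
            code = proc.wait()
            uptime = time.time() - started

            # Record the exit code so a "silent exit" is at least visible here.
            print("mesos-slave exited with code %d after %.0fs" % (code, uptime))

            if uptime < FLAP_WINDOW_SECS:
                rapid_exits += 1
            else:
                rapid_exits = 0

            if rapid_exits >= MAX_RAPID_EXITS:
                print("flapping detected; clearing %s before restart" % WORK_DIR)
                clear_work_dir()
                rapid_exits = 0

            time.sleep(2)


    if __name__ == "__main__":
        main()

Note that removing the work dir contents discards the agent's checkpointed state, so any tasks checkpointed on that agent are lost on the next start; that is part of why the manual intervention is described as problematic in the thread.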