Sure, will do.
On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <[email protected]> wrote:

> Very surprising... I don't have any ideas other than trying to
> replicate the scenario in a test.
>
> Please do keep us posted if you encounter it again and gain more
> information.
>
> On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <[email protected]>
> wrote:
>
>> MESOS-4795 created.
>>
>> I don't have the exit status. We haven't seen a repeat yet; we will
>> catch the exit status next time it happens.
>>
>> Yes, removing the metadata directory was the only way it was
>> resolved. This happened on multiple hosts, requiring the same
>> resolution.
>>
>> On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <[email protected]>
>> wrote:
>>
>>> Feel free to create one. I don't have enough information to know
>>> what the issue is without doing some further investigation, but if
>>> the situation you described is accurate, it seems like there are
>>> two strange bugs:
>>>
>>> - the silent exit (do you not have the exit status?), and
>>> - the flapping from ZK errors that needed the metadata directory to
>>>   be removed to resolve (are you convinced the removal of the meta
>>>   directory is what solved it?)
>>>
>>> It would be good to track these issues in case they crop up again.
>>>
>>> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <[email protected]>
>>> wrote:
>>>
>>>> Hi Ben,
>>>>
>>>> Let me know if there is a new issue created for this; I would like
>>>> to add myself to watch it.
>>>> Thanks.
>>>>
>>>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Ben,
>>>>>
>>>>> That is accurate, with one additional line:
>>>>>
>>>>> - Agent running fine with 0.24.1
>>>>> - Transient ZK issues, slave flapping with zookeeper_init failure
>>>>> - ZK issue resolved
>>>>> - Most agents stop flapping and function correctly
>>>>> - Some agents continue flapping, but exit silently after printing
>>>>>   the detector.cpp:481 log line
>>>>> - The agents that continue to flap are repaired by manual removal
>>>>>   of the contents of the mesos-slave's working dir
>>>>>
>>>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hey Sharma,
>>>>>>
>>>>>> I didn't quite follow the timeline of events here or how the
>>>>>> agent logs you posted fit into it. Here's how I interpreted it:
>>>>>>
>>>>>> - Agent running fine with 0.24.1
>>>>>> - Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>> - ZK issue resolved
>>>>>> - Most agents stop flapping and function correctly
>>>>>> - Some agents continue flapping, but exit silently after printing
>>>>>>   the detector.cpp:481 log line
>>>>>>
>>>>>> Is this accurate? What is the exit code from the silent exit?
>>>>>>
>>>>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Maybe related, but maybe different, since a new process seems to
>>>>>>> find the master leader and still aborts, never recovering across
>>>>>>> restarts until the work dir data is removed.
>>>>>>> It is happening in 0.24.1.
>>>>>>>
>>>>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I
>>>>>>>> guess you are saying it is somehow related but not exactly the
>>>>>>>> same issue?
>>>>>>>>
>>>>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> On 9 February 2016 at 11:04, Sharma Podila <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> We had a few Mesos agents stuck in an unrecoverable state
>>>>>>>>>> after a transient ZK init error. Is this a known problem? I
>>>>>>>>>> wasn't able to find an existing JIRA item for it. We are on
>>>>>>>>>> 0.24.1 at this time.
>>>>>>>>>>
>>>>>>>>>> Most agents were fine, except a handful whose mesos-slave
>>>>>>>>>> process was constantly restarting. The .INFO logfile had the
>>>>>>>>>> contents below before the process exited, with no error
>>>>>>>>>> messages. The restarts were happening constantly due to an
>>>>>>>>>> existing service keep-alive strategy.
>>>>>>>>>>
>>>>>>>>>> To fix it, we manually stopped the service, removed the data
>>>>>>>>>> in the working dir, and then restarted it. The mesos-slave
>>>>>>>>>> process was then able to restart. The manual intervention
>>>>>>>>>> needed to resolve this is problematic.
>>>>>>>>>>
>>>>>>>>>> Here are the contents of the various log files on the agent.
>>>>>>>>>>
>>>>>>>>>> The .INFO logfile for one of the restarts, before the
>>>>>>>>>> mesos-slave process exited with no other error messages:
>>>>>>>>>>
>>>>>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
>>>>>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
>>>>>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
>>>>>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
>>>>>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
>>>>>>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>>>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
>>>>>>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>>>>>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
>>>>>>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
>>>>>>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
>>>>>>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
>>>>>>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>>>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
>>>>>>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
>>>>>>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master ([email protected]:7103) is detected
>>>>>>>>>>
>>>>>>>>>> The .FATAL log file when the original transient ZK error
>>>>>>>>>> occurred:
>>>>>>>>>>
>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>
>>>>>>>>>> The .ERROR log file:
>>>>>>>>>>
>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>
>>>>>>>>>> The .WARNING file had the same content.
>>>>>>>>>
>>>>>>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>>>>>>
>>>>>>>>> -rgs
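
A note on catching the exit status asked about in the thread: a
keep-alive wrapper can log how the mesos-slave process exits before
restarting it. A minimal sketch in Python follows; the binary path,
flags, and backoff interval are illustrative assumptions, not the
actual service configuration from the thread.

    #!/usr/bin/env python3
    # Minimal keep-alive wrapper that records how mesos-slave exits.
    import subprocess
    import sys
    import time

    # Illustrative command line; substitute the real agent binary and flags.
    CMD = ["/usr/sbin/mesos-slave", "--work_dir=/mnt/data/mesos"]

    while True:
        rc = subprocess.call(CMD)
        if rc < 0:
            # subprocess reports death-by-signal as a negative return code.
            print("mesos-slave killed by signal", -rc, file=sys.stderr)
        else:
            print("mesos-slave exited with status", rc, file=sys.stderr)
        time.sleep(5)  # brief backoff before the keep-alive restart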
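
Similarly, a minimal sketch of the manual remediation described in the
thread (stop the service, remove the work dir data, restart). It
assumes a systemd-managed service named mesos-slave and the
--work_dir=/mnt/data/mesos seen in the agent logs; both names are
assumptions about the environment.

    #!/usr/bin/env python3
    # Sketch of the manual remediation: stop the agent, wipe its work
    # dir, restart it. Run as root; service and path names are assumed.
    import shutil
    import subprocess

    SERVICE = "mesos-slave"       # assumed systemd unit name
    WORK_DIR = "/mnt/data/mesos"  # --work_dir seen in the agent logs

    subprocess.check_call(["systemctl", "stop", SERVICE])
    # Removing the work dir discards checkpointed state (including the
    # agent ID), so the agent re-registers as a new agent and any
    # recoverable tasks on the host are lost.
    shutil.rmtree(WORK_DIR)
    subprocess.check_call(["systemctl", "start", SERVICE])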

