Vinod, MESOS-5854 <https://issues.apache.org/jira/browse/MESOS-5854> created. Feel free to change the priority appropriately.
Yes, the workaround I mentioned for disk size is based on resource
specification, so that works for now.

On Fri, Jul 15, 2016 at 11:48 AM, Andrew Leung <[email protected]> wrote:

> Hi Jie,
>
> Yes, that is how we are working around this issue. However, we wanted to
> see if others were hitting this issue as well. If others have a similar
> Mesos slave on ZFS setup, it might be worth considering a disk space
> calculation approach that works more reliably with ZFS, or at least
> calling out the need to specify the disk resource explicitly.
>
> Thanks for the help.
> Andrew
>
> On Jul 15, 2016, at 11:41 AM, Jie Yu <[email protected]> wrote:
>
>> Can you hard-code your disk size using the --resources flag?
>
> On Fri, Jul 15, 2016 at 11:31 AM, Sharma Podila <[email protected]> wrote:
>
>> We had this issue happen again and were able to debug further. The cause
>> of the agent not being able to restart is that one of its resources
>> (disk) changed total size since the last restart. However, this error
>> does not show up in the INFO/WARN/ERROR files; we saw it on stdout only
>> when manually restarting the agent. It would be good to have all
>> messages going to stdout/stderr show up in the logs. Is there a config
>> setting for that which I missed?
>>
>> The total disk size is changing sometimes on our agents. It is off by a
>> few bytes (we see a ~10-byte difference out of, say, 600 GB). We use ZFS
>> on our agents to manage the disk partition. From my colleague, Andrew
>> (copied here):
>>
>>> The current Mesos approach (i.e., `statvfs()` for total blocks and
>>> assume that never changes) won't work reliably on ZFS.
>>
>> Has anyone else experienced this? We can likely hack a workaround by
>> reporting the "whole GBs" of the disk so we are insensitive to small
>> changes in the total size. But we are not sure the changes can't be
>> larger, per Andrew's point above.
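For reference, here is a minimal sketch of the kind of `statvfs()`-based
total-size calculation being discussed, together with the "whole GBs"
rounding idea. This is illustrative only, not the actual Mesos source; the
path is the agent work dir that appears in the logs further down the thread.

    // Illustrative sketch only -- not the actual Mesos source.
    #include <sys/statvfs.h>

    #include <cstdint>
    #include <cstdio>

    // Total size in bytes of the filesystem containing 'path', or 0 on error.
    static uint64_t totalBytes(const char* path)
    {
      struct statvfs s;
      if (statvfs(path, &s) != 0) {
        return 0;
      }

      // f_blocks is the total data block count in units of f_frsize. On ZFS
      // this product can drift by a few bytes between calls, which is the
      // mismatch that breaks agent recovery.
      return static_cast<uint64_t>(s.f_blocks) * s.f_frsize;
    }

    int main()
    {
      const uint64_t GB = 1024ULL * 1024 * 1024;

      // '/mnt/data/mesos' is the agent work dir shown in the logs below.
      const uint64_t total = totalBytes("/mnt/data/mesos");

      // Rounding down to whole GBs absorbs a ~10-byte drift in a ~600 GB
      // total, i.e. the "whole GBs" workaround mentioned above.
      std::printf("disk total: %llu GB\n",
                  static_cast<unsigned long long>(total / GB));

      return 0;
    }

For the explicit-specification workaround Jie suggests, the size can instead
be pinned at agent startup, e.g. --resources="disk:586104" (the value is in
MB, matching the "Slave resources" log line further down).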
>> On Mon, Mar 7, 2016 at 6:00 PM, Sharma Podila <[email protected]> wrote:
>>
>>> Sure, will do.
>>>
>>> On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <[email protected]> wrote:
>>>
>>>> Very surprising... I don't have any ideas other than trying to
>>>> replicate the scenario in a test.
>>>>
>>>> Please do keep us posted if you encounter it again and gain more
>>>> information.
>>>>
>>>> On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <[email protected]> wrote:
>>>>
>>>>> MESOS-4795 created.
>>>>>
>>>>> I don't have the exit status. We haven't seen a repeat yet; we will
>>>>> catch the exit status the next time it happens.
>>>>>
>>>>> Yes, removing the metadata directory was the only way it was
>>>>> resolved. This happened on multiple hosts, requiring the same
>>>>> resolution.
>>>>>
>>>>> On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <[email protected]> wrote:
>>>>>
>>>>>> Feel free to create one. I don't have enough information to know
>>>>>> what the issue is without doing some further investigation, but if
>>>>>> the situation you described is accurate, it seems like there are two
>>>>>> strange bugs:
>>>>>>
>>>>>> - the silent exit (do you not have the exit status?), and
>>>>>> - the flapping from ZK errors that needed the metadata directory to
>>>>>>   be removed to resolve (are you convinced the removal of the meta
>>>>>>   directory is what solved it?)
>>>>>>
>>>>>> It would be good to track these issues in case they crop up again.
>>>>>>
>>>>>> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Ben,
>>>>>>>
>>>>>>> Let me know if there is a new issue created for this; I would like
>>>>>>> to add myself to watch it.
>>>>>>> Thanks.
>>>>>>>
>>>>>>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Ben,
>>>>>>>>
>>>>>>>> That is accurate, with one additional line:
>>>>>>>>
>>>>>>>> - Agent running fine with 0.24.1
>>>>>>>> - Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>>>> - ZK issue resolved
>>>>>>>> - Most agents stop flapping and function correctly
>>>>>>>> - Some agents continue flapping, but silently exit after printing
>>>>>>>>   the detector.cpp:481 log line
>>>>>>>> - The agents that continue to flap are repaired by manual removal
>>>>>>>>   of the contents of mesos-slave's working dir
>>>>>>>>
>>>>>>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey Sharma,
>>>>>>>>>
>>>>>>>>> I didn't quite follow the timeline of events here or how the
>>>>>>>>> agent logs you posted fit into it. Here's how I interpreted it:
>>>>>>>>>
>>>>>>>>> - Agent running fine with 0.24.1
>>>>>>>>> - Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>>>>> - ZK issue resolved
>>>>>>>>> - Most agents stop flapping and function correctly
>>>>>>>>> - Some agents continue flapping, but silently exit after printing
>>>>>>>>>   the detector.cpp:481 log line
>>>>>>>>>
>>>>>>>>> Is this accurate? What is the exit code from the silent exit?
>>>>>>>>>
>>>>>>>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Maybe related, but maybe different, since a new process seems to
>>>>>>>>>> find the master leader and still aborts, never recovering across
>>>>>>>>>> restarts until the work dir data is removed.
>>>>>>>>>> It is happening in 0.24.1.
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I
>>>>>>>>>>> guess you are saying it is somehow related but not exactly the
>>>>>>>>>>> same issue?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 9 February 2016 at 11:04, Sharma Podila <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We had a few Mesos agents stuck in an unrecoverable state
>>>>>>>>>>>>> after a transient ZK init error. Is this a known problem? I
>>>>>>>>>>>>> wasn't able to find an existing JIRA item for this. We are on
>>>>>>>>>>>>> 0.24.1 at this time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most agents were fine, except a handful. These agents had
>>>>>>>>>>>>> their mesos-slave process constantly restarting. The .INFO
>>>>>>>>>>>>> logfile had the contents shown below before the process
>>>>>>>>>>>>> exited, with no error messages. The restarts were happening
>>>>>>>>>>>>> constantly due to an existing service keep-alive strategy.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To fix it, we manually stopped the service, removed the data
>>>>>>>>>>>>> in the working dir, and then restarted it. The mesos-slave
>>>>>>>>>>>>> process was then able to restart.
>>>>>>>>>>>>> The manual intervention needed to resolve it is problematic.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are the contents of the various log files on the agent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .INFO logfile for one of the restarts before the
>>>>>>>>>>>>> mesos-slave process exited with no other error messages:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
>>>>>>>>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by builds
>>>>>>>>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>>>>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation: posix/cpu,posix/mem,filesystem/posix
>>>>>>>>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>>>>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@10.138.146.230:7101
>>>>>>>>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup: --appc_store_dir="/tmp/mesos/store/appc" --attributes="region:us-east-1;<snip>" --authenticatee="<snip>" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" <snip>"
>>>>>>>>>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources: ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>>>>>>>>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: <snip>
>>>>>>>>>>>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>>>>>>>>>>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@10.138.146.230:7101) connected to ZooKeeper
>>>>>>>>>>>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
>>>>>>>>>>>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path '/titus/main/mesos' in ZooKeeper
>>>>>>>>>>>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader: (id='209')
>>>>>>>>>>>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get '/titus/main/mesos/json.info_0000000209' in ZooKeeper
>>>>>>>>>>>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from '/mnt/data/mesos/meta'
>>>>>>>>>>>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file '/mnt/data/mesos/meta/resources/resources.info'
>>>>>>>>>>>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master ([email protected]:7103) is detected
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .FATAL log file when the original transient ZK error occurred:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .ERROR log file:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Log file created at: 2016/02/05 17:21:37
>>>>>>>>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>>>>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>>>>>>>>>>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
>>>>>>>>>>>>>
>>>>>>>>>>>>> The .WARNING file had the same content.
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>>>>>>>>>>>
>>>>>>>>>>>> -rgs
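For completeness, the manual recovery described earlier in the thread looks
roughly like the following. This is a sketch only: the service name and the
work dir path ('/mnt/data/mesos', per the logs above) are specific to this
setup. Removing the meta directory discards the agent's checkpointed state,
so the agent re-registers with a new agent ID.

    # Sketch of the manual recovery; service name and paths are illustrative.
    sudo service mesos-slave stop

    # Discard the agent's checkpointed state. The agent will come back with
    # a new agent ID, so tasks that were recoverable on this host are lost.
    sudo rm -rf /mnt/data/mesos/meta

    sudo service mesos-slave start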

