Re: Agent won't start
Hello Paul,

A few things to note here:

1. Whenever you change the value of any *resource* or any *attribute* (description: http://mesos.apache.org/documentation/latest/attributes-resources/), you need to clean up the work_dir (rm -rf /tmp/mesos) and restart the slave.

2. As you may already know, all Mesos tasks/executors started by mesos-slave keep running even if the mesos-slave process dies. Once you clean up the work_dir, you will no longer be able to recover those executors/tasks, so all tasks/executors running on that slave will be killed. Ideally, then, you *shouldn't* do this routinely. But if, as in your case, that doesn't matter, you can add the work_dir cleanup to a sysvinit/systemd/upstart script. (I can't think of a reason why stopping all services on all Mesos nodes would be a routine task, unless your slaves are very temporary in nature, e.g. AWS spot instances.)

3. If your use case is changing resources dynamically on each mesos-slave, I would suggest checking the Mesos dynamic reservation APIs (http://mesos.apache.org/documentation/latest/reservation/).

Hope this answers your questions. Let me know if I can help further.

On Wed, Mar 30, 2016 at 8:20 PM, Paul Bell wrote:

> Greg, thanks again - I am planning on moving my work_dir.
>
> Pradeep, thanks again. In a slightly different scenario, namely,
>
> service mesos-slave stop
> edit /etc/default/mesos-slave (add a port resource)
> service mesos-slave start
>
> I noticed that the slave did not start and - again - the log shows the same
> phenomenon as in my original post. Per your suggestion, I did a
>
> rm -Rf /tmp/mesos
>
> and the slave service started correctly.
>
> Questions:
>
> 1. Did editing /etc/default/mesos-slave cause the failure of the
> service to start?
> 2. Given that starting/stopping the entire cluster (stopping all
> services on all nodes) is a standard feature in our product, should I
> routinely do the above "rm" command when the mesos services are stopped?
> Thanks for your help.
>
> Cordially,
>
> Paul
>
> On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann wrote:
>
>> Check out this link for info on /tmp cleanup in Ubuntu:
>> http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up
>>
>> And check out this link for information on some of the work_dir's
>> contents on a Mesos agent:
>> http://mesos.apache.org/documentation/latest/sandbox/
>>
>> The work_dir contains important application state for the Mesos agent, so
>> it should not be placed in a location that will be automatically
>> garbage-collected by the OS. The choice of /tmp/mesos as a default location
>> is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to
>> change it. Ideally you should be able to leave the work_dir alone and let
>> the Mesos agent manage it for you.
>>
>> In any case, I would recommend that you set the work_dir to something
>> outside of /tmp; /var/lib/mesos is a commonly-used location.
>>
>> Cheers,
>> Greg

--
Regards,
Pradeep Chhetri
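Pradeep's suggestion to fold the work_dir cleanup into an init script might look like the following on Paul's upstart-based Ubuntu 14.04. The override filename and stanza are illustrative assumptions, not something from this thread:

```
# /etc/init/mesos-slave.override -- hypothetical upstart override.
# Wipes the agent's checkpointed state before every start, so the slave
# never attempts (and fails) recovery after a resource/attribute change.
# Caveat from the thread: this forfeits recovery, so any executors/tasks
# still running on the node will be killed.
pre-start script
    rm -rf /tmp/mesos
end script
```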
Re: Agent won't start
Greg, thanks again - I am planning on moving my work_dir.

Pradeep, thanks again. In a slightly different scenario, namely,

service mesos-slave stop
edit /etc/default/mesos-slave (add a port resource)
service mesos-slave start

I noticed that the slave did not start and - again - the log shows the same phenomenon as in my original post. Per your suggestion, I did a

rm -Rf /tmp/mesos

and the slave service started correctly.

Questions:

1. Did editing /etc/default/mesos-slave cause the failure of the service to start?
2. Given that starting/stopping the entire cluster (stopping all services on all nodes) is a standard feature in our product, should I routinely do the above "rm" command when the mesos services are stopped?

Thanks for your help.

Cordially,

Paul

On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann wrote:

> Check out this link for info on /tmp cleanup in Ubuntu:
> http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up
>
> And check out this link for information on some of the work_dir's contents
> on a Mesos agent: http://mesos.apache.org/documentation/latest/sandbox/
>
> The work_dir contains important application state for the Mesos agent, so
> it should not be placed in a location that will be automatically
> garbage-collected by the OS. The choice of /tmp/mesos as a default location
> is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to
> change it. Ideally you should be able to leave the work_dir alone and let
> the Mesos agent manage it for you.
>
> In any case, I would recommend that you set the work_dir to something
> outside of /tmp; /var/lib/mesos is a commonly-used location.
>
> Cheers,
> Greg
Re: Agent won't start
Check out this link for info on /tmp cleanup in Ubuntu:
http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up

And check out this link for information on some of the work_dir's contents on a Mesos agent: http://mesos.apache.org/documentation/latest/sandbox/

The work_dir contains important application state for the Mesos agent, so it should not be placed in a location that will be automatically garbage-collected by the OS. The choice of /tmp/mesos as a default location is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to change it. Ideally you should be able to leave the work_dir alone and let the Mesos agent manage it for you.

In any case, I would recommend that you set the work_dir to something outside of /tmp; /var/lib/mesos is a commonly-used location.

Cheers,
Greg
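Following Greg's recommendation, relocating the work_dir might look like this. This is a sketch that assumes the Mesosphere .deb packaging on Ubuntu, where each file under /etc/mesos-slave/ is turned into a command-line flag; treat the exact paths and packaging behavior as assumptions about your install:

```
# Sketch, assuming the Mesosphere packaging on Ubuntu 14.04.
sudo service mesos-slave stop
sudo mkdir -p /var/lib/mesos

# The packaging turns each file in /etc/mesos-slave/ into a flag, so this
# becomes --work_dir=/var/lib/mesos on the next start.
echo /var/lib/mesos | sudo tee /etc/mesos-slave/work_dir

# Old checkpoint state under /tmp/mesos cannot be recovered after the move;
# removing it avoids a failed recovery attempt (running executors are lost).
sudo rm -rf /tmp/mesos
sudo service mesos-slave start
```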
Re: Agent won't start
Hi Pradeep,

And thank you for your reply! That, too, is very interesting. I think I need to synthesize what you and Greg are telling me and come up with a clean solution.

Agent nodes can crash. Moreover, I can stop the mesos-slave service and start it later with a reboot in between. So I am interested in fully understanding the causal chain here before I try to fix anything.

-Paul

On Tue, Mar 29, 2016 at 5:51 PM, Paul Bell wrote:

> Whoa...interessant!
>
> The node *may* have been rebooted. Uptime says 2 days. I'll need to check
> my notes.
>
> Can you point me to a reference re the Ubuntu behavior?
>
> Based on what you've told me so far, it sounds as if the sequence:
>
> stop service
> reboot agent node
> start service
>
> could lead to trouble - or do I misunderstand?
>
> Thank you again for your help.
>
> -Paul
>
> On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann wrote:
>
>> Paul,
>> This would be relevant for any system which is automatically deleting
>> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to
>> be completely nuked at boot time. Was the agent node rebooted prior to this
>> problem?
>>
>> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell wrote:
>>
>>> Hi Greg,
>>>
>>> Thanks very much for your quick reply.
>>>
>>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
>>> systemd. I will look at the link you provide.
>>>
>>> Is there any chance that it might apply to non-systemd platforms?
>>>
>>> Cordially,
>>>
>>> Paul
>>>
>>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann wrote:
>>>
Hi Paul,

Noticing the logging output, "Failed to find resources file '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may be related to the location of your agent's work_dir. See this ticket: https://issues.apache.org/jira/browse/MESOS-4541

Some users have reported issues resulting from the systemd-tmpfiles service garbage collecting files in /tmp, perhaps this is related? What platform is your agent running on?
You could try specifying a different agent work directory outside of /tmp/ via the `--work_dir` command-line flag. Cheers, Greg On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell wrote: > Hi, > > I am hoping someone can shed some light on this. > > An agent node failed to start, that is, when I did "service > mesos-slave start" the service came up briefly & then stopped. Before > stopping it produced the log shown below. The last thing it wrote is > "Trying to create path '/mesos' in Zookeeper". > > This mention of the mesos znode prompted me to go for a clean slate by > removing the mesos znode from Zookeeper. > > After doing this, the mesos-slave service started perfectly. > > What might be happening here, and also what's the right way to > trouble-shoot such a problem? Mesos is version 0.23.0. > > Thanks for your help. > > -Paul > > > Log file created at: 2016/03/29 14:19:39 > Running on machine: 71.100.202.193 > Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg > I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging > started! 
> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 > by root > I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 > I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 > I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: > 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 > I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: > posix/cpu,posix/mem > I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave > I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ > 71.100.202.193:5051 > I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: > --attributes="hostType:shard1" --authenticatee="crammd5" > --cgroups_cpu_enable_pids_and_tids_count="false" > --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" > --cgroups_limit_swap="false" --cgroups_root="mesos" > --container_disk_watch_interval="15secs" --containerizers="docker,mesos" > --default_role="*" --disk_watch_interval="1mins" > --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" > --docker_remove_delay="6hrs" > --docker_sandbox_directory="/mnt/mesos/sandbox" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" > --enforce_container_disk_quota="false" > --executor_registration_timeout="5mins" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" --hostname="71.100.202.193" >
Re: Agent won't start
Whoa...interessant!

The node *may* have been rebooted. Uptime says 2 days. I'll need to check my notes.

Can you point me to a reference re the Ubuntu behavior?

Based on what you've told me so far, it sounds as if the sequence:

stop service
reboot agent node
start service

could lead to trouble - or do I misunderstand?

Thank you again for your help.

-Paul

On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann wrote:

> Paul,
> This would be relevant for any system which is automatically deleting
> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to
> be completely nuked at boot time. Was the agent node rebooted prior to this
> problem?
>
> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell wrote:
>
>> Hi Greg,
>>
>> Thanks very much for your quick reply.
>>
>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
>> systemd. I will look at the link you provide.
>>
>> Is there any chance that it might apply to non-systemd platforms?
>>
>> Cordially,
>>
>> Paul
>>
>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann wrote:
>>
>>> Hi Paul,
>>> Noticing the logging output, "Failed to find resources file
>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>>> may be related to the location of your agent's work_dir. See this ticket:
>>> https://issues.apache.org/jira/browse/MESOS-4541
>>>
>>> Some users have reported issues resulting from the systemd-tmpfiles
>>> service garbage collecting files in /tmp, perhaps this is related? What
>>> platform is your agent running on?
>>>
>>> You could try specifying a different agent work directory outside of
>>> /tmp/ via the `--work_dir` command-line flag.
>>>
>>> Cheers,
>>> Greg
>>>
>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell wrote:
>>>
Hi,

I am hoping someone can shed some light on this.

An agent node failed to start, that is, when I did "service mesos-slave start" the service came up briefly & then stopped. Before stopping it produced the log shown below.
The last thing it wrote is "Trying to create path '/mesos' in Zookeeper". This mention of the mesos znode prompted me to go for a clean slate by removing the mesos znode from Zookeeper. After doing this, the mesos-slave service started perfectly. What might be happening here, and also what's the right way to trouble-shoot such a problem? Mesos is version 0.23.0. Thanks for your help. -Paul Log file created at: 2016/03/29 14:19:39 Running on machine: 71.100.202.193 Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started! I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by root I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: posix/cpu,posix/mem I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ 71.100.202.193:5051 I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: --attributes="hostType:shard1" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" --docker_remove_delay="6hrs" --docker_sandbox_directory="/mnt/mesos/sandbox" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" --enforce_container_disk_quota="false" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
--hadoop_home="" --help="false" --hostname="71.100.202.193" --initialize_driver_logging="true" --ip="71.100.202.193" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://71.100.202.191:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
Re: Agent won't start
Hello Paul,

From the logs, it looks like, on starting, the mesos-slave is trying to do slave recovery (http://mesos.apache.org/documentation/latest/slave-recovery/), but since the resources.info file is unavailable, it is unable to perform the recovery and hence ends up killing itself.

If you are fine with losing any existing running Mesos tasks/executors, you can just clean up the default Mesos working directory, where it keeps the checkpoint information (rm -rf /tmp/mesos), and then try restarting the mesos-slave.

On Tue, Mar 29, 2016 at 10:08 PM, Paul Bell wrote:

> Hi,
>
> I am hoping someone can shed some light on this.
>
> An agent node failed to start, that is, when I did "service mesos-slave
> start" the service came up briefly & then stopped. Before stopping it
> produced the log shown below. The last thing it wrote is "Trying to create
> path '/mesos' in Zookeeper".
>
> This mention of the mesos znode prompted me to go for a clean slate by
> removing the mesos znode from Zookeeper.
>
> After doing this, the mesos-slave service started perfectly.
>
> What might be happening here, and also what's the right way to
> trouble-shoot such a problem? Mesos is version 0.23.0.
>
> Thanks for your help.
>
> -Paul
>
> Log file created at: 2016/03/29 14:19:39
> Running on machine: 71.100.202.193
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started!
> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by > root > I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 > I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 > I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: > 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 > I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: > posix/cpu,posix/mem > I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave > I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ > 71.100.202.193:5051 > I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: > --attributes="hostType:shard1" --authenticatee="crammd5" > --cgroups_cpu_enable_pids_and_tids_count="false" > --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" > --cgroups_limit_swap="false" --cgroups_root="mesos" > --container_disk_watch_interval="15secs" --containerizers="docker,mesos" > --default_role="*" --disk_watch_interval="1mins" > --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" > --docker_remove_delay="6hrs" > --docker_sandbox_directory="/mnt/mesos/sandbox" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" > --enforce_container_disk_quota="false" > --executor_registration_timeout="5mins" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" --hostname="71.100.202.193" > --initialize_driver_logging="true" --ip="71.100.202.193" > --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" > --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" > --master="zk://71.100.202.191:2181/mesos" > --oversubscribed_resources_interval="15secs" --perf_duration="10secs" > --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" > --quiet="false" --recover="reconnect" --recovery_timeout="15mins" > 
--registration_backoff_factor="1secs" > --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" > --strict="true" --switch_user="true" --version="false" > --work_dir="/tmp/mesos" > I0329 14:19:39.616835 5870 slave.cpp:354] Slave resources: cpus(*):4; > mem(*):23089; disk(*):122517; ports(*):[31000-32000] > I0329 14:19:39.617032 5870 slave.cpp:384] Slave hostname: 71.100.202.193 > I0329 14:19:39.617046 5870 slave.cpp:389] Slave checkpoint: true > I0329 14:19:39.618841 5894 state.cpp:36] Recovering state from > '/tmp/mesos/meta' > I0329 14:19:39.618872 5894 state.cpp:672] Failed to find resources file > '/tmp/mesos/meta/resources/resources.info' > I0329 14:19:39.619730 5898 group.cpp:313] Group process (group(1)@ > 71.100.202.193:5051) connected to ZooKeeper > I0329 14:19:39.619760 5898 group.cpp:787] Syncing group operations: queue > size (joins, cancels, datas) = (0, 0, 0) > I0329 14:19:39.619773 5898 group.cpp:385] Trying to create path '/mesos' > in ZooKeeper > > -- Regards, Pradeep Chhetri
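Pradeep's explanation of the failed recovery comes down to the checkpoint files under the work_dir's meta/ tree. A small stand-in sketch of what the suggested cleanup removes, using a throwaway directory instead of the real /tmp/mesos (the resources.info contents here are illustrative):

```shell
# Stand-in for the agent work_dir; in this thread the real path is /tmp/mesos.
work_dir="$(mktemp -d)"

# The agent checkpoints its resources here; per the log, recovery fails
# when this checkpoint no longer matches the newly configured resources.
mkdir -p "$work_dir/meta/resources"
echo 'ports(*):[31000-32000]' > "$work_dir/meta/resources/resources.info"

# The suggested cleanup: removing the whole work_dir deletes the checkpoint
# (and with it any chance of recovering previously running executors).
rm -rf "$work_dir"

[ ! -d "$work_dir" ] && echo "checkpoint state cleared"
```

On a real agent, the next start then registers as a brand-new slave instead of attempting recovery.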
Re: Agent won't start
Paul,
This would be relevant for any system which is automatically deleting files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to be completely nuked at boot time. Was the agent node rebooted prior to this problem?

On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell wrote:

> Hi Greg,
>
> Thanks very much for your quick reply.
>
> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
> systemd. I will look at the link you provide.
>
> Is there any chance that it might apply to non-systemd platforms?
>
> Cordially,
>
> Paul
>
> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann wrote:
>
>> Hi Paul,
>> Noticing the logging output, "Failed to find resources file
>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>> may be related to the location of your agent's work_dir. See this ticket:
>> https://issues.apache.org/jira/browse/MESOS-4541
>>
>> Some users have reported issues resulting from the systemd-tmpfiles
>> service garbage collecting files in /tmp, perhaps this is related? What
>> platform is your agent running on?
>>
>> You could try specifying a different agent work directory outside of
>> /tmp/ via the `--work_dir` command-line flag.
>>
>> Cheers,
>> Greg
>>
>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell wrote:
>>
>>> Hi,
>>>
>>> I am hoping someone can shed some light on this.
>>>
>>> An agent node failed to start, that is, when I did "service mesos-slave
>>> start" the service came up briefly & then stopped. Before stopping it
>>> produced the log shown below. The last thing it wrote is "Trying to create
>>> path '/mesos' in Zookeeper".
>>>
>>> This mention of the mesos znode prompted me to go for a clean slate by
>>> removing the mesos znode from Zookeeper.
>>>
>>> After doing this, the mesos-slave service started perfectly.
>>>
>>> What might be happening here, and also what's the right way to
>>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>>
>>> Thanks for your help.
>>> >>> -Paul >>> >>> >>> Log file created at: 2016/03/29 14:19:39 >>> Running on machine: 71.100.202.193 >>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg >>> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started! >>> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by >>> root >>> I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 >>> I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 >>> I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: >>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 >>> I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: >>> posix/cpu,posix/mem >>> I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave >>> I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ >>> 71.100.202.193:5051 >>> I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: >>> --attributes="hostType:shard1" --authenticatee="crammd5" >>> --cgroups_cpu_enable_pids_and_tids_count="false" >>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" >>> --cgroups_limit_swap="false" --cgroups_root="mesos" >>> --container_disk_watch_interval="15secs" --containerizers="docker,mesos" >>> --default_role="*" --disk_watch_interval="1mins" >>> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" >>> --docker_remove_delay="6hrs" >>> --docker_sandbox_directory="/mnt/mesos/sandbox" >>> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" >>> --enforce_container_disk_quota="false" >>> --executor_registration_timeout="5mins" >>> --executor_shutdown_grace_period="5secs" >>> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" >>> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" >>> --hadoop_home="" --help="false" --hostname="71.100.202.193" >>> --initialize_driver_logging="true" --ip="71.100.202.193" >>> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" >>> --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" >>> --master="zk://71.100.202.191:2181/mesos" >>> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" >>> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" >>> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" >>> --registration_backoff_factor="1secs" >>> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" >>> --strict="true" --switch_user="true" --version="false" >>> --work_dir="/tmp/mesos" >>> I0329 14:19:39.616835 5870 slave.cpp:354] Slave resources: cpus(*):4; >>> mem(*):23089; disk(*):122517; ports(*):[31000-32000] >>> I0329 14:19:39.617032 5870 slave.cpp:384] Slave hostname: 71.100.202.193 >>> I0329 14:19:39.617046 5870 slave.cpp:389] Slave checkpoint: true >>> I0329 14:19:39.618841 5894 state.cpp:36] Recovering state from >>> '/tmp/mesos/meta' >>> I0329 14:19:39.618872 5894 state.cpp:672] Failed to find resources file >>>
Re: Agent won't start
Hi Greg,

Thanks very much for your quick reply.

I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not systemd. I will look at the link you provide.

Is there any chance that it might apply to non-systemd platforms?

Cordially,

Paul

On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann wrote:

> Hi Paul,
> Noticing the logging output, "Failed to find resources file
> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may
> be related to the location of your agent's work_dir. See this ticket:
> https://issues.apache.org/jira/browse/MESOS-4541
>
> Some users have reported issues resulting from the systemd-tmpfiles
> service garbage collecting files in /tmp, perhaps this is related? What
> platform is your agent running on?
>
> You could try specifying a different agent work directory outside of /tmp/
> via the `--work_dir` command-line flag.
>
> Cheers,
> Greg
>
> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell wrote:
>
>> Hi,
>>
>> I am hoping someone can shed some light on this.
>>
>> An agent node failed to start, that is, when I did "service mesos-slave
>> start" the service came up briefly & then stopped. Before stopping it
>> produced the log shown below. The last thing it wrote is "Trying to create
>> path '/mesos' in Zookeeper".
>>
>> This mention of the mesos znode prompted me to go for a clean slate by
>> removing the mesos znode from Zookeeper.
>>
>> After doing this, the mesos-slave service started perfectly.
>>
>> What might be happening here, and also what's the right way to
>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>
>> Thanks for your help.
>>
>> -Paul
>>
>> Log file created at: 2016/03/29 14:19:39
>> Running on machine: 71.100.202.193
>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started!
>> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by >> root >> I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 >> I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 >> I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: >> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 >> I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: >> posix/cpu,posix/mem >> I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave >> I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ >> 71.100.202.193:5051 >> I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: >> --attributes="hostType:shard1" --authenticatee="crammd5" >> --cgroups_cpu_enable_pids_and_tids_count="false" >> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" >> --cgroups_limit_swap="false" --cgroups_root="mesos" >> --container_disk_watch_interval="15secs" --containerizers="docker,mesos" >> --default_role="*" --disk_watch_interval="1mins" >> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" >> --docker_remove_delay="6hrs" >> --docker_sandbox_directory="/mnt/mesos/sandbox" >> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" >> --enforce_container_disk_quota="false" >> --executor_registration_timeout="5mins" >> --executor_shutdown_grace_period="5secs" >> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" >> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" >> --hadoop_home="" --help="false" --hostname="71.100.202.193" >> --initialize_driver_logging="true" --ip="71.100.202.193" >> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" >> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" >> --master="zk://71.100.202.191:2181/mesos" >> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" >> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" >> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" >> 
--registration_backoff_factor="1secs" >> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" >> --strict="true" --switch_user="true" --version="false" >> --work_dir="/tmp/mesos" >> I0329 14:19:39.616835 5870 slave.cpp:354] Slave resources: cpus(*):4; >> mem(*):23089; disk(*):122517; ports(*):[31000-32000] >> I0329 14:19:39.617032 5870 slave.cpp:384] Slave hostname: 71.100.202.193 >> I0329 14:19:39.617046 5870 slave.cpp:389] Slave checkpoint: true >> I0329 14:19:39.618841 5894 state.cpp:36] Recovering state from >> '/tmp/mesos/meta' >> I0329 14:19:39.618872 5894 state.cpp:672] Failed to find resources file >> '/tmp/mesos/meta/resources/resources.info' >> I0329 14:19:39.619730 5898 group.cpp:313] Group process (group(1)@ >> 71.100.202.193:5051) connected to ZooKeeper >> I0329 14:19:39.619760 5898 group.cpp:787] Syncing group operations: >> queue size (joins, cancels, datas) = (0, 0, 0) >> I0329 14:19:39.619773 5898 group.cpp:385] Trying to create path '/mesos' >> in ZooKeeper >> >> >
Re: Agent won't start
Hi Paul,

Noticing the logging output, "Failed to find resources file '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may be related to the location of your agent's work_dir. See this ticket: https://issues.apache.org/jira/browse/MESOS-4541

Some users have reported issues resulting from the systemd-tmpfiles service garbage collecting files in /tmp, perhaps this is related? What platform is your agent running on?

You could try specifying a different agent work directory outside of /tmp/ via the `--work_dir` command-line flag.

Cheers,
Greg

On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell wrote:

> Hi,
>
> I am hoping someone can shed some light on this.
>
> An agent node failed to start, that is, when I did "service mesos-slave
> start" the service came up briefly & then stopped. Before stopping it
> produced the log shown below. The last thing it wrote is "Trying to create
> path '/mesos' in Zookeeper".
>
> This mention of the mesos znode prompted me to go for a clean slate by
> removing the mesos znode from Zookeeper.
>
> After doing this, the mesos-slave service started perfectly.
>
> What might be happening here, and also what's the right way to
> trouble-shoot such a problem? Mesos is version 0.23.0.
>
> Thanks for your help.
>
> -Paul
>
> Log file created at: 2016/03/29 14:19:39
> Running on machine: 71.100.202.193
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started!
> I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by > root > I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0 > I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0 > I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: > 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66 > I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: > posix/cpu,posix/mem > I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave > I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@ > 71.100.202.193:5051 > I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: > --attributes="hostType:shard1" --authenticatee="crammd5" > --cgroups_cpu_enable_pids_and_tids_count="false" > --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" > --cgroups_limit_swap="false" --cgroups_root="mesos" > --container_disk_watch_interval="15secs" --containerizers="docker,mesos" > --default_role="*" --disk_watch_interval="1mins" > --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" > --docker_remove_delay="6hrs" > --docker_sandbox_directory="/mnt/mesos/sandbox" > --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" > --enforce_container_disk_quota="false" > --executor_registration_timeout="5mins" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" --hostname="71.100.202.193" > --initialize_driver_logging="true" --ip="71.100.202.193" > --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" > --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" > --master="zk://71.100.202.191:2181/mesos" > --oversubscribed_resources_interval="15secs" --perf_duration="10secs" > --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" > --quiet="false" --recover="reconnect" --recovery_timeout="15mins" > 
--registration_backoff_factor="1secs" > --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" > --strict="true" --switch_user="true" --version="false" > --work_dir="/tmp/mesos" > I0329 14:19:39.616835 5870 slave.cpp:354] Slave resources: cpus(*):4; > mem(*):23089; disk(*):122517; ports(*):[31000-32000] > I0329 14:19:39.617032 5870 slave.cpp:384] Slave hostname: 71.100.202.193 > I0329 14:19:39.617046 5870 slave.cpp:389] Slave checkpoint: true > I0329 14:19:39.618841 5894 state.cpp:36] Recovering state from > '/tmp/mesos/meta' > I0329 14:19:39.618872 5894 state.cpp:672] Failed to find resources file > '/tmp/mesos/meta/resources/resources.info' > I0329 14:19:39.619730 5898 group.cpp:313] Group process (group(1)@ > 71.100.202.193:5051) connected to ZooKeeper > I0329 14:19:39.619760 5898 group.cpp:787] Syncing group operations: queue > size (joins, cancels, datas) = (0, 0, 0) > I0329 14:19:39.619773 5898 group.cpp:385] Trying to create path '/mesos' > in ZooKeeper > >
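[For reference, the work_dir relocation suggested above could be sketched as
the following procedure. This is only an illustration, not a tested recipe:
the target directory /var/lib/mesos is an arbitrary choice, and the
MESOS_WORK_DIR line assumes a Debian/Ubuntu-style package whose init script
reads environment variables from /etc/default/mesos-slave, as in Paul's setup.]

```shell
# Illustrative sketch: /var/lib/mesos is an arbitrary target directory.
service mesos-slave stop

# Pick a work_dir outside /tmp so systemd-tmpfiles cannot garbage-collect it.
mkdir -p /var/lib/mesos

# Mesos agent flags can also be supplied as MESOS_-prefixed environment
# variables, which the packaged init scripts typically read from
# /etc/default/mesos-slave.
echo 'MESOS_WORK_DIR=/var/lib/mesos' >> /etc/default/mesos-slave

# Checkpointed state in the old location cannot be recovered after the move,
# so clear it; note this forfeits recovery of executors started from /tmp/mesos.
rm -rf /tmp/mesos

service mesos-slave start
```

[Alternatively, if systemd-tmpfiles does turn out to be the culprit and the
work_dir must stay under /tmp, a drop-in such as /etc/tmpfiles.d/mesos.conf
containing the single line `x /tmp/mesos` should exclude that tree from
periodic cleanup; see the tmpfiles.d(5) man page.]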