Re: Agent won't start
Check out this link for info on /tmp cleanup in Ubuntu:
http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up

And this link for information on some of the work_dir's contents on a Mesos agent:
http://mesos.apache.org/documentation/latest/sandbox/

The work_dir contains important application state for the Mesos agent, so it should not be placed in a location that will be automatically garbage-collected by the OS. The choice of /tmp/mesos as a default location is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to change it.

Ideally you should be able to leave the work_dir alone and let the Mesos agent manage it for you. In any case, I would recommend that you set the work_dir to something outside of /tmp; /var/lib/mesos is a commonly-used location.

Cheers,
Greg
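As a concrete sketch of that recommendation (this assumes a packaged install whose init scripts read per-flag files from /etc/mesos-slave/, the same mechanism used for /etc/mesos-slave/resources elsewhere in this digest; adjust paths to your setup):

  sudo service mesos-slave stop
  sudo mkdir -p /var/lib/mesos
  echo /var/lib/mesos | sudo tee /etc/mesos-slave/work_dir
  sudo service mesos-slave start

The agent then starts with a fresh work_dir at the new location; checkpointed state still sitting under /tmp/mesos is not migrated.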
Re: Agent won't start
Hi Pradeep,

And thank you for your reply! That, too, is very interesting. I think I need to synthesize what you and Greg are telling me and come up with a clean solution.

Agent nodes can crash. Moreover, I can stop the mesos-slave service, and start it later with a reboot in between. So I am interested in fully understanding the causal chain here before I try to fix anything.

-Paul

On Tue, Mar 29, 2016 at 5:51 PM, Paul Bell wrote:

> Whoa...interessant!
>
> The node *may* have been rebooted. Uptime says 2 days. I'll need to check my notes.
>
> Can you point me to reference re Ubuntu behavior?
>
> Based on what you've told me so far, it sounds as if the sequence:
>
> stop service
> reboot agent node
> start service
>
> could lead to trouble - or do I misunderstand?
>
> Thank you again for your help.
>
> -Paul
>
> On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann wrote:
>
>> Paul,
>> This would be relevant for any system which is automatically deleting files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to be completely nuked at boot time. Was the agent node rebooted prior to this problem?
>>
>> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell wrote:
>>
>>> Hi Greg,
>>>
>>> Thanks very much for your quick reply.
>>>
>>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not systemd. I will look at the link you provide.
>>>
>>> Is there any chance that it might apply to non-systemd platforms?
>>>
>>> Cordially,
>>>
>>> Paul
>>>
>>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann wrote:
>>>
>>>> Hi Paul,
>>>> Noticing the logging output, "Failed to find resources file '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may be related to the location of your agent's work_dir. See this ticket: https://issues.apache.org/jira/browse/MESOS-4541
>>>>
>>>> Some users have reported issues resulting from the systemd-tmpfiles service garbage collecting files in /tmp, perhaps this is related? What platform is your agent running on?
>>>>
>>>> You could try specifying a different agent work directory outside of /tmp/ via the `--work_dir` command-line flag.
>>>>
>>>> Cheers,
>>>> Greg
Re: Agent won't start
Whoa...interessant!

The node *may* have been rebooted. Uptime says 2 days. I'll need to check my notes.

Can you point me to a reference re the Ubuntu behavior?

Based on what you've told me so far, it sounds as if the sequence:

stop service
reboot agent node
start service

could lead to trouble - or do I misunderstand?

Thank you again for your help.

-Paul

On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann wrote:

> Paul,
> This would be relevant for any system which is automatically deleting files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to be completely nuked at boot time. Was the agent node rebooted prior to this problem?
>
> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell wrote:
>
>> Hi Greg,
>>
>> Thanks very much for your quick reply.
>>
>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not systemd. I will look at the link you provide.
>>
>> Is there any chance that it might apply to non-systemd platforms?
>>
>> Cordially,
>>
>> Paul
>>
>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann wrote:
>>
>>> Hi Paul,
>>> Noticing the logging output, "Failed to find resources file '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may be related to the location of your agent's work_dir. See this ticket: https://issues.apache.org/jira/browse/MESOS-4541
>>>
>>> Some users have reported issues resulting from the systemd-tmpfiles service garbage collecting files in /tmp, perhaps this is related? What platform is your agent running on?
>>>
>>> You could try specifying a different agent work directory outside of /tmp/ via the `--work_dir` command-line flag.
>>>
>>> Cheers,
>>> Greg
Re: Agent won't start
Hello Paul,

From the logs, it looks like, on starting the mesos slave, it is trying to do slave recovery (http://mesos.apache.org/documentation/latest/slave-recovery/), but since the resources.info file is unavailable, it is unable to perform the recovery and hence ends up killing itself.

If you are fine with losing any existing running Mesos tasks/executors, then you can just clean up the Mesos default working directory where it keeps the checkpoint information ($ rm -rf /tmp/mesos) and then try to restart the mesos slave.

On Tue, Mar 29, 2016 at 10:08 PM, Paul Bell wrote:

> Hi,
>
> I am hoping someone can shed some light on this.
>
> An agent node failed to start, that is, when I did "service mesos-slave start" the service came up briefly & then stopped. Before stopping it produced the log shown below. The last thing it wrote is "Trying to create path '/mesos' in Zookeeper".
>
> This mention of the mesos znode prompted me to go for a clean slate by removing the mesos znode from Zookeeper.
>
> After doing this, the mesos-slave service started perfectly.
>
> What might be happening here, and also what's the right way to trouble-shoot such a problem? Mesos is version 0.23.0.
>
> Thanks for your help.
>
> -Paul

--
Regards,
Pradeep Chhetri
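Spelled out as a rough sketch, the clean-slate sequence above would be (note that wiping /tmp/mesos discards the agent's checkpointed state, so any tasks or executors still running on that node are abandoned):

  sudo service mesos-slave stop
  sudo rm -rf /tmp/mesos          # removes the checkpoint/recovery state under the default work_dir
  sudo service mesos-slave start

If preserving running tasks matters, see the slave-recovery documentation linked above instead.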
Re: Agent won't start
Paul,
This would be relevant for any system which is automatically deleting files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to be completely nuked at boot time. Was the agent node rebooted prior to this problem?

On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell wrote:

> Hi Greg,
>
> Thanks very much for your quick reply.
>
> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not systemd. I will look at the link you provide.
>
> Is there any chance that it might apply to non-systemd platforms?
>
> Cordially,
>
> Paul
>
> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann wrote:
>
>> Hi Paul,
>> Noticing the logging output, "Failed to find resources file '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may be related to the location of your agent's work_dir. See this ticket: https://issues.apache.org/jira/browse/MESOS-4541
>>
>> Some users have reported issues resulting from the systemd-tmpfiles service garbage collecting files in /tmp, perhaps this is related? What platform is your agent running on?
>>
>> You could try specifying a different agent work directory outside of /tmp/ via the `--work_dir` command-line flag.
>>
>> Cheers,
>> Greg
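A quick way to confirm this failure mode on Ubuntu 14.04, as a sketch (the TMPTIME check assumes the stock boot-time cleanup scripts on that release):

  ls -ld /tmp/mesos               # missing right after a reboot => boot-time /tmp cleanup removed the work_dir
  grep TMPTIME /etc/default/rcS   # TMPTIME=0 generally means /tmp is cleared on every boot

Even if TMPTIME is raised, moving the work_dir out of /tmp (as suggested elsewhere in the thread) is the more robust fix.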
Re: Agent won't start
Hi Greg,

Thanks very much for your quick reply.

I simply forgot to mention the platform. It's Ubuntu 14.04 LTS and it's not systemd. I will look at the link you provided.

Is there any chance that it might apply to non-systemd platforms?

Cordially,

Paul

On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann wrote:

> Hi Paul,
> Noticing the logging output, "Failed to find resources file '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may be related to the location of your agent's work_dir. See this ticket: https://issues.apache.org/jira/browse/MESOS-4541
>
> Some users have reported issues resulting from the systemd-tmpfiles service garbage collecting files in /tmp, perhaps this is related? What platform is your agent running on?
>
> You could try specifying a different agent work directory outside of /tmp/ via the `--work_dir` command-line flag.
>
> Cheers,
> Greg
Re: Agent won't start
Hi Paul,

Noticing the logging output, "Failed to find resources file '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may be related to the location of your agent's work_dir. See this ticket: https://issues.apache.org/jira/browse/MESOS-4541

Some users have reported issues resulting from the systemd-tmpfiles service garbage collecting files in /tmp; perhaps this is related? What platform is your agent running on?

You could try specifying a different agent work directory outside of /tmp/ via the `--work_dir` command-line flag.

Cheers,
Greg

On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell wrote:

> Hi,
>
> I am hoping someone can shed some light on this.
>
> An agent node failed to start, that is, when I did "service mesos-slave start" the service came up briefly & then stopped. Before stopping it produced the log shown below. The last thing it wrote is "Trying to create path '/mesos' in Zookeeper".
>
> This mention of the mesos znode prompted me to go for a clean slate by removing the mesos znode from Zookeeper.
>
> After doing this, the mesos-slave service started perfectly.
>
> What might be happening here, and also what's the right way to trouble-shoot such a problem? Mesos is version 0.23.0.
>
> Thanks for your help.
>
> -Paul
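For a quick manual test of that suggestion, the flag can be passed directly when launching the agent by hand (binary name and paths here are illustrative; adjust to your install):

  mesos-slave --master=zk://71.100.202.191:2181/mesos \
              --log_dir=/var/log/mesos \
              --work_dir=/var/lib/mesos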
Agent won't start
Hi,

I am hoping someone can shed some light on this.

An agent node failed to start, that is, when I did "service mesos-slave start" the service came up briefly & then stopped. Before stopping it produced the log shown below. The last thing it wrote is "Trying to create path '/mesos' in Zookeeper".

This mention of the mesos znode prompted me to go for a clean slate by removing the mesos znode from Zookeeper.

After doing this, the mesos-slave service started perfectly.

What might be happening here, and also what's the right way to trouble-shoot such a problem? Mesos is version 0.23.0.

Thanks for your help.

-Paul

Log file created at: 2016/03/29 14:19:39
Running on machine: 71.100.202.193
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
I0329 14:19:39.512249 5870 logging.cpp:172] INFO level logging started!
I0329 14:19:39.512564 5870 main.cpp:162] Build: 2015-07-24 10:05:39 by root
I0329 14:19:39.512588 5870 main.cpp:164] Version: 0.23.0
I0329 14:19:39.512600 5870 main.cpp:167] Git tag: 0.23.0
I0329 14:19:39.512612 5870 main.cpp:171] Git SHA: 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
I0329 14:19:39.615172 5870 containerizer.cpp:111] Using isolation: posix/cpu,posix/mem
I0329 14:19:39.615697 5870 main.cpp:249] Starting Mesos slave
I0329 14:19:39.616267 5870 slave.cpp:190] Slave started on 1)@71.100.202.193:5051
I0329 14:19:39.616286 5870 slave.cpp:191] Flags at startup: --attributes="hostType:shard1" --authenticatee="crammd5" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true" --docker_remove_delay="6hrs" --docker_sandbox_directory="/mnt/mesos/sandbox" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs" --enforce_container_disk_quota="false" --executor_registration_timeout="5mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="71.100.202.193" --initialize_driver_logging="true" --ip="71.100.202.193" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://71.100.202.191:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true" --strict="true" --switch_user="true" --version="false" --work_dir="/tmp/mesos"
I0329 14:19:39.616835 5870 slave.cpp:354] Slave resources: cpus(*):4; mem(*):23089; disk(*):122517; ports(*):[31000-32000]
I0329 14:19:39.617032 5870 slave.cpp:384] Slave hostname: 71.100.202.193
I0329 14:19:39.617046 5870 slave.cpp:389] Slave checkpoint: true
I0329 14:19:39.618841 5894 state.cpp:36] Recovering state from '/tmp/mesos/meta'
I0329 14:19:39.618872 5894 state.cpp:672] Failed to find resources file '/tmp/mesos/meta/resources/resources.info'
I0329 14:19:39.619730 5898 group.cpp:313] Group process (group(1)@71.100.202.193:5051) connected to ZooKeeper
I0329 14:19:39.619760 5898 group.cpp:787] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0329 14:19:39.619773 5898 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
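For anyone hitting the same symptom, two quick checks before wiping anything, as a sketch (the log file name assumes glog's usual symlinks under the --log_dir shown in the flags above):

  ls -l /tmp/mesos/meta                        # checkpointed state the agent tries to recover at startup
  tail -n 50 /var/log/mesos/mesos-slave.INFO   # last entries the agent wrote before exiting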
Re: Port Resource Offers
Hello Erik, Thank you for clarifying the doubt. That was the exact concern I was having. On Tue, Mar 29, 2016 at 9:05 PM, Erik Weatherswrote: > hi Pradeep, > > Yes, that would *definitely* be a problem. e.g., the Storm Framework > could easily assign Storm Workers to use those unavailable ports, and then > they would fail to come up since they wouldn't be able to bind to their > assigned port. I've answered a similar question before: > > > https://unix.stackexchange.com/questions/211647/how-safe-is-it-to-change-the-linux-ephemeral-port-range/237543#237543 > > - Erik > > On Tue, Mar 29, 2016 at 3:07 AM, Pradeep Chhetri < > pradeep.chhetr...@gmail.com> wrote: > >> Hi Klaus, >> >> Thank you for the quick reply. >> >> One quick question: >> >> I have some of the ports like 8400,8500,8600 which are already in use by >> consul agent running on each mesos slave. But they are also being announced >> by each mesos slave. Will this cause any problem to tasks which maybe >> assigned those ports in future by mesos ? >> >> Thanks >> >> On Tue, Mar 29, 2016 at 11:01 AM, Klaus Ma >> wrote: >> >>> Yes, all port resources must be ranges for now, e.g. 31000-35000. >>> >>> There’s already JIRA (MESOS-4627: Improve Ranges parsing to handle >>> single values) on that, patches are pending on review :). >>> >>> >>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer >>> Platform OpenSource Technology, STG, IBM GCG >>> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me >>> >>> >>> -- >>> Date: Tue, 29 Mar 2016 10:51:44 +0100 >>> Subject: Port Resource Offers >>> From: pradeep.chhetr...@gmail.com >>> To: user@mesos.apache.org >>> >>> >>> Hello, >>> >>> I am running mesos slaves with the modified port announcement. >>> >>> $ cat /etc/mesos-slave/resources >>> ports(*):[6379, 9200, 9300, 27017, 31000-35000] >>> >>> I can that this is being picked up when starting the mesos slaves in ps >>> output: >>> >>> --resources=ports(*):[6379, 9200, 9300, 27017, 31000-35000] >>> >>> However, when i hit the /state.json endpoint of mesos-master, I am >>> seeing this: >>> >>> >>> >>> I can see the tasks are being assigned ports in the range of 9300-27017. >>> There are some of these ports which are already used by other applications >>> running on each mesos slaves but are being announced. I am not sure if this >>> will cause some issue. I am assuming that it will always check if the port >>> is already binded by some other process before assigning port to a task. >>> >>> By going through the code and test cases, it looks like it always expect >>> port resource in ranges. >>> >>> >>> https://github.com/apache/mesos/blob/master/src/v1/resources.cpp#L1255-L1263 >>> >>> So I guess, I should always define ports in ranges rather than >>> individual port. >>> >>> It will be helpful if someone can confirm if it is the expected >>> behaviour and my configuration is wrong. >>> >>> -- >>> Regards, >>> Pradeep Chhetri >>> >> >> >> >> -- >> Regards, >> Pradeep Chhetri >> > > -- Regards, Pradeep Chhetri
Re: Port Resource Offers
hi Pradeep, Yes, that would *definitely* be a problem. e.g., the Storm Framework could easily assign Storm Workers to use those unavailable ports, and then they would fail to come up since they wouldn't be able to bind to their assigned port. I've answered a similar question before: https://unix.stackexchange.com/questions/211647/how-safe-is-it-to-change-the-linux-ephemeral-port-range/237543#237543 - Erik On Tue, Mar 29, 2016 at 3:07 AM, Pradeep Chhetri < pradeep.chhetr...@gmail.com> wrote: > Hi Klaus, > > Thank you for the quick reply. > > One quick question: > > I have some of the ports like 8400,8500,8600 which are already in use by > consul agent running on each mesos slave. But they are also being announced > by each mesos slave. Will this cause any problem to tasks which maybe > assigned those ports in future by mesos ? > > Thanks > > On Tue, Mar 29, 2016 at 11:01 AM, Klaus Mawrote: > >> Yes, all port resources must be ranges for now, e.g. 31000-35000. >> >> There’s already JIRA (MESOS-4627: Improve Ranges parsing to handle single >> values) on that, patches are pending on review :). >> >> >> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer >> Platform OpenSource Technology, STG, IBM GCG >> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me >> >> >> -- >> Date: Tue, 29 Mar 2016 10:51:44 +0100 >> Subject: Port Resource Offers >> From: pradeep.chhetr...@gmail.com >> To: user@mesos.apache.org >> >> >> Hello, >> >> I am running mesos slaves with the modified port announcement. >> >> $ cat /etc/mesos-slave/resources >> ports(*):[6379, 9200, 9300, 27017, 31000-35000] >> >> I can that this is being picked up when starting the mesos slaves in ps >> output: >> >> --resources=ports(*):[6379, 9200, 9300, 27017, 31000-35000] >> >> However, when i hit the /state.json endpoint of mesos-master, I am seeing >> this: >> >> >> >> I can see the tasks are being assigned ports in the range of 9300-27017. >> There are some of these ports which are already used by other applications >> running on each mesos slaves but are being announced. I am not sure if this >> will cause some issue. I am assuming that it will always check if the port >> is already binded by some other process before assigning port to a task. >> >> By going through the code and test cases, it looks like it always expect >> port resource in ranges. >> >> >> https://github.com/apache/mesos/blob/master/src/v1/resources.cpp#L1255-L1263 >> >> So I guess, I should always define ports in ranges rather than individual >> port. >> >> It will be helpful if someone can confirm if it is the expected behaviour >> and my configuration is wrong. >> >> -- >> Regards, >> Pradeep Chhetri >> > > > > -- > Regards, > Pradeep Chhetri >
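A rough way to sanity-check an announced port list against what is already bound on an agent (tool choice is illustrative; netstat works equally well on 14.04):

  ss -ltn | grep -E ':(8400|8500|8600) '        # anything already listening on ports you plan to offer?
  cat /proc/sys/net/ipv4/ip_local_port_range    # keep offered ranges clear of the kernel's ephemeral range too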
Re: Port Resource Offers
Hi Klaus,

Thank you for the quick reply. One quick question:

Some of the ports, like 8400, 8500, and 8600, are already in use by the consul agent running on each mesos slave, but they are also being announced by each slave. Will this cause any problems for tasks which may be assigned those ports by Mesos in the future?

Thanks

On Tue, Mar 29, 2016 at 11:01 AM, Klaus Ma wrote:

> Yes, all port resources must be ranges for now, e.g. 31000-35000.
>
> There's already JIRA (MESOS-4627: Improve Ranges parsing to handle single values) on that, patches are pending on review :).
>
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform OpenSource Technology, STG, IBM GCG
> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

--
Regards,
Pradeep Chhetri
RE: Port Resource Offers
Yes, all port resources must be ranges for now, e.g. 31000-35000.

There's already a JIRA (MESOS-4627: Improve Ranges parsing to handle single values) on that, patches are pending on review :).

Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

Date: Tue, 29 Mar 2016 10:51:44 +0100
Subject: Port Resource Offers
From: pradeep.chhetr...@gmail.com
To: user@mesos.apache.org

Hello,

I am running mesos slaves with the modified port announcement:

$ cat /etc/mesos-slave/resources
ports(*):[6379, 9200, 9300, 27017, 31000-35000]

I can see that this is being picked up when starting the mesos slaves in the ps output:

--resources=ports(*):[6379, 9200, 9300, 27017, 31000-35000]

However, when I hit the /state.json endpoint of mesos-master, I am seeing this:

I can see the tasks are being assigned ports in the range of 9300-27017. Some of these ports are already used by other applications running on each mesos slave but are being announced anyway. I am not sure if this will cause an issue. I am assuming that it will always check whether a port is already bound by some other process before assigning it to a task.

By going through the code and test cases, it looks like it always expects port resources as ranges:

https://github.com/apache/mesos/blob/master/src/v1/resources.cpp#L1255-L1263

So I guess I should always define ports as ranges rather than as individual ports.

It would be helpful if someone could confirm whether this is the expected behaviour and my configuration is wrong.

--
Regards,
Pradeep Chhetri
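Until MESOS-4627 lands, one way to keep the individual ports above in range form, as an untested sketch (it assumes the parser accepts one-port ranges of the form N-N):

$ cat /etc/mesos-slave/resources
ports(*):[6379-6379, 9200-9200, 9300-9300, 27017-27017, 31000-35000]

Each port becomes its own explicit range, and ports already bound by other services (consul, etc.) can simply be left out of the list so they are never offered.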