Re: Agent won't start

2016-03-30 Thread Pradeep Chhetri
Hello Paul,

Few things to note here:

1. Whenever you change the value of any *resource* or any *attribute*
(description:
http://mesos.apache.org/documentation/latest/attributes-resources/), you
need to clean up the work_dir (rm -rf /tmp/mesos) and restart the slave.

2. You probably already know that all mesos tasks/executors started by
mesos-slave keep running even if the mesos-slave process dies. Once you
clean up the work_dir, you will no longer be able to recover those
executors/tasks, and hence all mesos tasks/executors running on that mesos
slave will get killed. So ideally you *shouldn't* do it routinely. But if,
as in your case, it doesn't matter, you can add this work_dir cleanup to
your sysvinit/systemd/upstart script (see the sketch after this list). (I
can't think of a reason why stopping all services on all mesos nodes would
be a routine task unless your slaves are very temporary in nature, e.g.
AWS spot instances.)

3. If your use case is that you want to change resources dynamically on
each mesos slave, I would suggest checking the mesos dynamic reservation
APIs (http://mesos.apache.org/documentation/latest/reservation/); see the
curl sketch below.
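
For item 2, a minimal sketch of such a cleanup hook on Ubuntu 14.04 with
upstart. The job file path is an assumption based on the usual mesosphere
packaging, so adapt it to your init system; note that it wipes the
checkpoint state before every start, so any tasks/executors from the
previous run are lost:

# added to /etc/init/mesos-slave.conf
pre-start script
  rm -rf /tmp/mesos
end script

And for item 3, the operator HTTP endpoint for reservations looks roughly
like the following. Caveat: that endpoint was added in a Mesos release
newer than the 0.23.0 you are running, and the principal, agent id and
role below are placeholders, so double-check against the linked docs:

curl -i -u operator:password \
  -d slaveId=<agent-id> \
  -d resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":1},"role":"myrole","reservation":{"principal":"operator"}}]' \
  -X POST http://<master-ip>:5050/master/reserve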

Hope this answers your questions. Let me know if I can help further.


On Wed, Mar 30, 2016 at 8:20 PM, Paul Bell  wrote:

> Greg, thanks again - I am planning on moving my work_dir.
>
>
>
> Pradeep, thanks again. In a slightly different scenario, namely,
>
> service mesos-slave stop
> edit /etc/default/mesos-slave   (add a port resource)
> service mesos-slave start
>
>
> I noticed that slave did not start and - again - the log shows the same
> phenomena as in my original post. Per your suggestion, I did a
>
> rm -Rf /tmp/mesos
>
> and the slave service started correctly.
>
> Questions:
>
>
>1. Did editing /etc/default/mesos-slave cause the failure of the
>service to start?
>2. given that starting/stopping the entire cluster (stopping all
>services on all nodes) is a standard feature in our product, should I
>routinely to the above "rm" command when the mesos services are stopped?
>
>
> Thanks for your help.
>
> Cordially,
>
> Paul
>
> On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann  wrote:
>
>> Check out this link for info on /tmp cleanup in Ubuntu:
>> http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up
>>
>> And check out this link for information on some of the work_dir's
>> contents on a Mesos agent:
>> http://mesos.apache.org/documentation/latest/sandbox/
>>
>> The work_dir contains important application state for the Mesos agent, so
>> it should not be placed in a location that will be automatically
>> garbage-collected by the OS. The choice of /tmp/mesos as a default location
>> is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to
>> change it. Ideally you should be able to leave the work_dir alone and let
>> the Mesos agent manage it for you.
>>
>> In any case, I would recommend that you set the work_dir to something
>> outside of /tmp; /var/lib/mesos is a commonly-used location.
>>
>> Cheers,
>> Greg
>>
>
>


-- 
Regards,
Pradeep Chhetri


Re: Agent won't start

2016-03-30 Thread Paul Bell
Greg, thanks again - I am planning on moving my work_dir.



Pradeep, thanks again. In a slightly different scenario, namely,

service mesos-slave stop
edit /etc/default/mesos-slave   (add a port resource)
service mesos-slave start
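
Here "add a port resource" means widening the ports range in the agent's
resources string, along these lines (illustrative values only; the exact
variable name and file your packaging reads may differ):

MESOS_RESOURCES='cpus(*):4;mem(*):23089;ports(*):[31000-32000,8500-8500]'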


I noticed that the slave did not start and - again - the log shows the same
phenomenon as in my original post. Per your suggestion, I did a

rm -Rf /tmp/mesos

and the slave service started correctly.

Questions:


   1. Did editing /etc/default/mesos-slave cause the failure of the service
   to start?
   2. Given that starting/stopping the entire cluster (stopping all
   services on all nodes) is a standard feature in our product, should I
   routinely run the above "rm" command when the mesos services are stopped?


Thanks for your help.

Cordially,

Paul

On Tue, Mar 29, 2016 at 6:16 PM, Greg Mann  wrote:

> Check out this link for info on /tmp cleanup in Ubuntu:
> http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up
>
> And check out this link for information on some of the work_dir's contents
> on a Mesos agent: http://mesos.apache.org/documentation/latest/sandbox/
>
> The work_dir contains important application state for the Mesos agent, so
> it should not be placed in a location that will be automatically
> garbage-collected by the OS. The choice of /tmp/mesos as a default location
> is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to
> change it. Ideally you should be able to leave the work_dir alone and let
> the Mesos agent manage it for you.
>
> In any case, I would recommend that you set the work_dir to something
> outside of /tmp; /var/lib/mesos is a commonly-used location.
>
> Cheers,
> Greg
>


Re: Agent won't start

2016-03-29 Thread Greg Mann
Check out this link for info on /tmp cleanup in Ubuntu:
http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up

And check out this link for information on some of the work_dir's contents
on a Mesos agent: http://mesos.apache.org/documentation/latest/sandbox/

The work_dir contains important application state for the Mesos agent, so
it should not be placed in a location that will be automatically
garbage-collected by the OS. The choice of /tmp/mesos as a default location
is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to
change it. Ideally you should be able to leave the work_dir alone and let
the Mesos agent manage it for you.

In any case, I would recommend that you set the work_dir to something
outside of /tmp; /var/lib/mesos is a commonly-used location.
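
For example, with the mesosphere-style packaging that turns files under
/etc/mesos-slave into agent flags (an assumption about your setup),
something like this should do it:

mkdir -p /var/lib/mesos
echo /var/lib/mesos > /etc/mesos-slave/work_dir
service mesos-slave restart

Otherwise, just pass --work_dir=/var/lib/mesos however you start the agent.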

Cheers,
Greg


Re: Agent won't start

2016-03-29 Thread Paul Bell
Hi Pradeep,

And thank you for your reply!

That, too, is very interesting. I think I need to synthesize what you and
Greg are telling me and come up with a clean solution. Agent nodes can
crash. Moreover, I can stop the mesos-slave service, and start it later
with a reboot in between.

So I am interested in fully understanding the causal chain here before I
try to fix anything.

-Paul



On Tue, Mar 29, 2016 at 5:51 PM, Paul Bell  wrote:

> Whoa...interessant!
>
> The node *may* have been rebooted. Uptime says 2 days. I'll need to check
> my notes.
>
> Can you point me to reference re Ubuntu behavior?
>
> Based on what you've told me so far, it sounds as if the sequence:
>
> stop service
> reboot agent node
> start service
>
>
> could lead to trouble - or do I misunderstand?
>
>
> Thank you again for your help.
>
> -Paul
>
> On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann  wrote:
>
>> Paul,
>> This would be relevant for any system which is automatically deleting
>> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to
>> be completely nuked at boot time. Was the agent node rebooted prior to this
>> problem?
>>
>> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell  wrote:
>>
>>> Hi Greg,
>>>
>>> Thanks very much for your quick reply.
>>>
>>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
>>> systemd. I will look at the link you provide.
>>>
>>> Is there any chance that it might apply to non-systemd platforms?
>>>
>>> Cordially,
>>>
>>> Paul
>>>
>>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:
>>>
 Hi Paul,
 Noticing the logging output, "Failed to find resources file
 '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
 may be related to the location of your agent's work_dir. See this ticket:
 https://issues.apache.org/jira/browse/MESOS-4541

 Some users have reported issues resulting from the systemd-tmpfiles
 service garbage collecting files in /tmp, perhaps this is related? What
 platform is your agent running on?

 You could try specifying a different agent work directory outside of
 /tmp/ via the `--work_dir` command-line flag.

 Cheers,
 Greg


 On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:

> Hi,
>
> I am hoping someone can shed some light on this.
>
> An agent node failed to start, that is, when I did "service
> mesos-slave start" the service came up briefly & then stopped. Before
> stopping it produced the log shown below. The last thing it wrote is
> "Trying to create path '/mesos' in Zookeeper".
>
> This mention of the mesos znode prompted me to go for a clean slate by
> removing the mesos znode from Zookeeper.
>
> After doing this, the mesos-slave service started perfectly.
>
> What might be happening here, and also what's the right way to
> trouble-shoot such a problem? Mesos is version 0.23.0.
>
> Thanks for your help.
>
> -Paul
>
>
> Log file created at: 2016/03/29 14:19:39
> Running on machine: 71.100.202.193
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging
> started!
> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39
> by root
> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
> posix/cpu,posix/mem
> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
> 71.100.202.193:5051
> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
> --attributes="hostType:shard1" --authenticatee="crammd5"
> --cgroups_cpu_enable_pids_and_tids_count="false"
> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
> --cgroups_limit_swap="false" --cgroups_root="mesos"
> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
> --default_role="*" --disk_watch_interval="1mins"
> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
> --docker_remove_delay="6hrs"
> --docker_sandbox_directory="/mnt/mesos/sandbox"
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
> --enforce_container_disk_quota="false"
> --executor_registration_timeout="5mins"
> --executor_shutdown_grace_period="5secs"
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
> --hadoop_home="" --help="false" --hostname="71.100.202.193"
> 

Re: Agent won't start

2016-03-29 Thread Paul Bell
Whoa...interesting!

The node *may* have been rebooted. Uptime says 2 days. I'll need to check
my notes.

Can you point me to a reference on the Ubuntu behavior?

Based on what you've told me so far, it sounds as if the sequence:

stop service
reboot agent node
start service


could lead to trouble - or do I misunderstand?


Thank you again for your help.

-Paul

On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann  wrote:

> Paul,
> This would be relevant for any system which is automatically deleting
> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to
> be completely nuked at boot time. Was the agent node rebooted prior to this
> problem?
>
> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell  wrote:
>
>> Hi Greg,
>>
>> Thanks very much for your quick reply.
>>
>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
>> systemd. I will look at the link you provide.
>>
>> Is there any chance that it might apply to non-systemd platforms?
>>
>> Cordially,
>>
>> Paul
>>
>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:
>>
>>> Hi Paul,
>>> Noticing the logging output, "Failed to find resources file
>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>>> may be related to the location of your agent's work_dir. See this ticket:
>>> https://issues.apache.org/jira/browse/MESOS-4541
>>>
>>> Some users have reported issues resulting from the systemd-tmpfiles
>>> service garbage collecting files in /tmp, perhaps this is related? What
>>> platform is your agent running on?
>>>
>>> You could try specifying a different agent work directory outside of
>>> /tmp/ via the `--work_dir` command-line flag.
>>>
>>> Cheers,
>>> Greg
>>>
>>>
>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:
>>>
 Hi,

 I am hoping someone can shed some light on this.

 An agent node failed to start, that is, when I did "service mesos-slave
 start" the service came up briefly & then stopped. Before stopping it
 produced the log shown below. The last thing it wrote is "Trying to create
 path '/mesos' in Zookeeper".

 This mention of the mesos znode prompted me to go for a clean slate by
 removing the mesos znode from Zookeeper.

 After doing this, the mesos-slave service started perfectly.

 What might be happening here, and also what's the right way to
 trouble-shoot such a problem? Mesos is version 0.23.0.

 Thanks for your help.

 -Paul


 Log file created at: 2016/03/29 14:19:39
 Running on machine: 71.100.202.193
 Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
 I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
 I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by
 root
 I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
 I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
 I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
 I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
 posix/cpu,posix/mem
 I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
 I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
 71.100.202.193:5051
 I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
 --attributes="hostType:shard1" --authenticatee="crammd5"
 --cgroups_cpu_enable_pids_and_tids_count="false"
 --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
 --cgroups_limit_swap="false" --cgroups_root="mesos"
 --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
 --default_role="*" --disk_watch_interval="1mins"
 --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
 --docker_remove_delay="6hrs"
 --docker_sandbox_directory="/mnt/mesos/sandbox"
 --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
 --enforce_container_disk_quota="false"
 --executor_registration_timeout="5mins"
 --executor_shutdown_grace_period="5secs"
 --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
 --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
 --hadoop_home="" --help="false" --hostname="71.100.202.193"
 --initialize_driver_logging="true" --ip="71.100.202.193"
 --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
 --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
 --master="zk://71.100.202.191:2181/mesos"
 --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
 --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
 --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
 --registration_backoff_factor="1secs"
 --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"

Re: Agent won't start

2016-03-29 Thread Pradeep Chhetri
Hello Paul,

From the logs, it looks like, on starting, the mesos slave is trying to do
slave recovery (http://mesos.apache.org/documentation/latest/slave-recovery/)
but since resources.info is unavailable, it is unable to perform the recovery
and hence ends up killing itself.

If you are fine with losing any existing running mesos tasks/executors,
then you can just clean up the mesos default working directory, where it
keeps the checkpoint information ($ rm -rf /tmp/mesos), and then try to
restart the mesos slave.
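
In other words, a minimal reset sequence (this throws away everything
checkpointed under the default work_dir):

service mesos-slave stop
rm -rf /tmp/mesos
service mesos-slave start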

On Tue, Mar 29, 2016 at 10:08 PM, Paul Bell  wrote:

> Hi,
>
> I am hoping someone can shed some light on this.
>
> An agent node failed to start, that is, when I did "service mesos-slave
> start" the service came up briefly & then stopped. Before stopping it
> produced the log shown below. The last thing it wrote is "Trying to create
> path '/mesos' in Zookeeper".
>
> This mention of the mesos znode prompted me to go for a clean slate by
> removing the mesos znode from Zookeeper.
>
> After doing this, the mesos-slave service started perfectly.
>
> What might be happening here, and also what's the right way to
> trouble-shoot such a problem? Mesos is version 0.23.0.
>
> Thanks for your help.
>
> -Paul
>
>
> Log file created at: 2016/03/29 14:19:39
> Running on machine: 71.100.202.193
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by
> root
> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
> posix/cpu,posix/mem
> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
> 71.100.202.193:5051
> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
> --attributes="hostType:shard1" --authenticatee="crammd5"
> --cgroups_cpu_enable_pids_and_tids_count="false"
> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
> --cgroups_limit_swap="false" --cgroups_root="mesos"
> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
> --default_role="*" --disk_watch_interval="1mins"
> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
> --docker_remove_delay="6hrs"
> --docker_sandbox_directory="/mnt/mesos/sandbox"
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
> --enforce_container_disk_quota="false"
> --executor_registration_timeout="5mins"
> --executor_shutdown_grace_period="5secs"
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
> --hadoop_home="" --help="false" --hostname="71.100.202.193"
> --initialize_driver_logging="true" --ip="71.100.202.193"
> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
> --master="zk://71.100.202.191:2181/mesos"
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
> --registration_backoff_factor="1secs"
> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
> --strict="true" --switch_user="true" --version="false"
> --work_dir="/tmp/mesos"
> I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
> mem(*):23089; disk(*):122517; ports(*):[31000-32000]
> I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
> I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
> I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
> '/tmp/mesos/meta'
> I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
> '/tmp/mesos/meta/resources/resources.info'
> I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
> 71.100.202.193:5051) connected to ZooKeeper
> I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path '/mesos'
> in ZooKeeper
>
>


-- 
Regards,
Pradeep Chhetri


Re: Agent won't start

2016-03-29 Thread Greg Mann
Paul,
This would be relevant for any system which is automatically deleting files
in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to be
completely nuked at boot time. Was the agent node rebooted prior to this
problem?
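
If I remember correctly, on non-systemd Ubuntu that boot-time cleanup is
governed by TMPTIME in /etc/default/rcS (TMPTIME=0 means /tmp is emptied
on every boot), so a quick check on the agent node would be:

grep TMPTIME /etc/default/rcS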

On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell  wrote:

> Hi Greg,
>
> Thanks very much for your quick reply.
>
> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
> systemd. I will look at the link you provide.
>
> Is there any chance that it might apply to non-systemd platforms?
>
> Cordially,
>
> Paul
>
> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:
>
>> Hi Paul,
>> Noticing the logging output, "Failed to find resources file
>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>> may be related to the location of your agent's work_dir. See this ticket:
>> https://issues.apache.org/jira/browse/MESOS-4541
>>
>> Some users have reported issues resulting from the systemd-tmpfiles
>> service garbage collecting files in /tmp, perhaps this is related? What
>> platform is your agent running on?
>>
>> You could try specifying a different agent work directory outside of
>> /tmp/ via the `--work_dir` command-line flag.
>>
>> Cheers,
>> Greg
>>
>>
>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:
>>
>>> Hi,
>>>
>>> I am hoping someone can shed some light on this.
>>>
>>> An agent node failed to start, that is, when I did "service mesos-slave
>>> start" the service came up briefly & then stopped. Before stopping it
>>> produced the log shown below. The last thing it wrote is "Trying to create
>>> path '/mesos' in Zookeeper".
>>>
>>> This mention of the mesos znode prompted me to go for a clean slate by
>>> removing the mesos znode from Zookeeper.
>>>
>>> After doing this, the mesos-slave service started perfectly.
>>>
>>> What might be happening here, and also what's the right way to
>>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>>
>>> Thanks for your help.
>>>
>>> -Paul
>>>
>>>
>>> Log file created at: 2016/03/29 14:19:39
>>> Running on machine: 71.100.202.193
>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>>> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
>>> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by
>>> root
>>> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
>>> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
>>> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
>>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
>>> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
>>> posix/cpu,posix/mem
>>> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
>>> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
>>> 71.100.202.193:5051
>>> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
>>> --attributes="hostType:shard1" --authenticatee="crammd5"
>>> --cgroups_cpu_enable_pids_and_tids_count="false"
>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>>> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
>>> --default_role="*" --disk_watch_interval="1mins"
>>> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
>>> --docker_remove_delay="6hrs"
>>> --docker_sandbox_directory="/mnt/mesos/sandbox"
>>> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
>>> --enforce_container_disk_quota="false"
>>> --executor_registration_timeout="5mins"
>>> --executor_shutdown_grace_period="5secs"
>>> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
>>> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
>>> --hadoop_home="" --help="false" --hostname="71.100.202.193"
>>> --initialize_driver_logging="true" --ip="71.100.202.193"
>>> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
>>> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
>>> --master="zk://71.100.202.191:2181/mesos"
>>> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
>>> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
>>> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
>>> --registration_backoff_factor="1secs"
>>> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
>>> --strict="true" --switch_user="true" --version="false"
>>> --work_dir="/tmp/mesos"
>>> I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
>>> mem(*):23089; disk(*):122517; ports(*):[31000-32000]
>>> I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
>>> I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
>>> I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
>>> '/tmp/mesos/meta'
>>> I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
>>> 

Re: Agent won't start

2016-03-29 Thread Paul Bell
Hi Greg,

Thanks very much for your quick reply.

I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
systemd. I will look at the link you provide.

Is there any chance that it might apply to non-systemd platforms?

Cordially,

Paul

On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:

> Hi Paul,
> Noticing the logging output, "Failed to find resources file
> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may
> be related to the location of your agent's work_dir. See this ticket:
> https://issues.apache.org/jira/browse/MESOS-4541
>
> Some users have reported issues resulting from the systemd-tmpfiles
> service garbage collecting files in /tmp, perhaps this is related? What
> platform is your agent running on?
>
> You could try specifying a different agent work directory outside of /tmp/
> via the `--work_dir` command-line flag.
>
> Cheers,
> Greg
>
>
> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:
>
>> Hi,
>>
>> I am hoping someone can shed some light on this.
>>
>> An agent node failed to start, that is, when I did "service mesos-slave
>> start" the service came up briefly & then stopped. Before stopping it
>> produced the log shown below. The last thing it wrote is "Trying to create
>> path '/mesos' in Zookeeper".
>>
>> This mention of the mesos znode prompted me to go for a clean slate by
>> removing the mesos znode from Zookeeper.
>>
>> After doing this, the mesos-slave service started perfectly.
>>
>> What might be happening here, and also what's the right way to
>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>
>> Thanks for your help.
>>
>> -Paul
>>
>>
>> Log file created at: 2016/03/29 14:19:39
>> Running on machine: 71.100.202.193
>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
>> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by
>> root
>> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
>> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
>> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
>> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
>> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
>> posix/cpu,posix/mem
>> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
>> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
>> 71.100.202.193:5051
>> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
>> --attributes="hostType:shard1" --authenticatee="crammd5"
>> --cgroups_cpu_enable_pids_and_tids_count="false"
>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
>> --default_role="*" --disk_watch_interval="1mins"
>> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
>> --docker_remove_delay="6hrs"
>> --docker_sandbox_directory="/mnt/mesos/sandbox"
>> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
>> --enforce_container_disk_quota="false"
>> --executor_registration_timeout="5mins"
>> --executor_shutdown_grace_period="5secs"
>> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
>> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
>> --hadoop_home="" --help="false" --hostname="71.100.202.193"
>> --initialize_driver_logging="true" --ip="71.100.202.193"
>> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
>> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
>> --master="zk://71.100.202.191:2181/mesos"
>> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
>> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
>> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
>> --registration_backoff_factor="1secs"
>> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
>> --strict="true" --switch_user="true" --version="false"
>> --work_dir="/tmp/mesos"
>> I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
>> mem(*):23089; disk(*):122517; ports(*):[31000-32000]
>> I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
>> I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
>> I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
>> '/tmp/mesos/meta'
>> I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
>> '/tmp/mesos/meta/resources/resources.info'
>> I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
>> 71.100.202.193:5051) connected to ZooKeeper
>> I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations:
>> queue size (joins, cancels, datas) = (0, 0, 0)
>> I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path '/mesos'
>> in ZooKeeper
>>
>>
>


Re: Agent won't start

2016-03-29 Thread Greg Mann
Hi Paul,
Noticing the logging output, "Failed to find resources file
'/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may
be related to the location of your agent's work_dir. See this ticket:
https://issues.apache.org/jira/browse/MESOS-4541

Some users have reported issues resulting from the systemd-tmpfiles service
garbage collecting files in /tmp, perhaps this is related? What platform is
your agent running on?

You could try specifying a different agent work directory outside of /tmp/
via the `--work_dir` command-line flag.
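
For example (a sketch; adapt to however you launch the agent, and create
the directory first):

mesos-slave --master=zk://71.100.202.191:2181/mesos --work_dir=/var/lib/mesos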

Cheers,
Greg


On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:

> Hi,
>
> I am hoping someone can shed some light on this.
>
> An agent node failed to start, that is, when I did "service mesos-slave
> start" the service came up briefly & then stopped. Before stopping it
> produced the log shown below. The last thing it wrote is "Trying to create
> path '/mesos' in Zookeeper".
>
> This mention of the mesos znode prompted me to go for a clean slate by
> removing the mesos znode from Zookeeper.
>
> After doing this, the mesos-slave service started perfectly.
>
> What might be happening here, and also what's the right way to
> trouble-shoot such a problem? Mesos is version 0.23.0.
>
> Thanks for your help.
>
> -Paul
>
>
> Log file created at: 2016/03/29 14:19:39
> Running on machine: 71.100.202.193
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by
> root
> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
> posix/cpu,posix/mem
> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
> 71.100.202.193:5051
> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
> --attributes="hostType:shard1" --authenticatee="crammd5"
> --cgroups_cpu_enable_pids_and_tids_count="false"
> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
> --cgroups_limit_swap="false" --cgroups_root="mesos"
> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
> --default_role="*" --disk_watch_interval="1mins"
> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
> --docker_remove_delay="6hrs"
> --docker_sandbox_directory="/mnt/mesos/sandbox"
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
> --enforce_container_disk_quota="false"
> --executor_registration_timeout="5mins"
> --executor_shutdown_grace_period="5secs"
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
> --hadoop_home="" --help="false" --hostname="71.100.202.193"
> --initialize_driver_logging="true" --ip="71.100.202.193"
> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
> --master="zk://71.100.202.191:2181/mesos"
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
> --registration_backoff_factor="1secs"
> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
> --strict="true" --switch_user="true" --version="false"
> --work_dir="/tmp/mesos"
> I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
> mem(*):23089; disk(*):122517; ports(*):[31000-32000]
> I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
> I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
> I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
> '/tmp/mesos/meta'
> I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
> '/tmp/mesos/meta/resources/resources.info'
> I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
> 71.100.202.193:5051) connected to ZooKeeper
> I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path '/mesos'
> in ZooKeeper
>
>


Agent won't start

2016-03-29 Thread Paul Bell
Hi,

I am hoping someone can shed some light on this.

An agent node failed to start, that is, when I did "service mesos-slave
start" the service came up briefly & then stopped. Before stopping it
produced the log shown below. The last thing it wrote is "Trying to create
path '/mesos' in Zookeeper".

This mention of the mesos znode prompted me to go for a clean slate by
removing the mesos znode from Zookeeper.

After doing this, the mesos-slave service started perfectly.

What might be happening here, and also what's the right way to
trouble-shoot such a problem? Mesos is version 0.23.0.

Thanks for your help.

-Paul


Log file created at: 2016/03/29 14:19:39
Running on machine: 71.100.202.193
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by root
I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
posix/cpu,posix/mem
I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
71.100.202.193:5051
I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
--attributes="hostType:shard1" --authenticatee="crammd5"
--cgroups_cpu_enable_pids_and_tids_count="false"
--cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
--cgroups_limit_swap="false" --cgroups_root="mesos"
--container_disk_watch_interval="15secs" --containerizers="docker,mesos"
--default_role="*" --disk_watch_interval="1mins"
--docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
--docker_remove_delay="6hrs"
--docker_sandbox_directory="/mnt/mesos/sandbox"
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
--enforce_container_disk_quota="false"
--executor_registration_timeout="5mins"
--executor_shutdown_grace_period="5secs"
--fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
--frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
--hadoop_home="" --help="false" --hostname="71.100.202.193"
--initialize_driver_logging="true" --ip="71.100.202.193"
--isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
--log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
--master="zk://71.100.202.191:2181/mesos"
--oversubscribed_resources_interval="15secs" --perf_duration="10secs"
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
--quiet="false" --recover="reconnect" --recovery_timeout="15mins"
--registration_backoff_factor="1secs"
--resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
--strict="true" --switch_user="true" --version="false"
--work_dir="/tmp/mesos"
I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
mem(*):23089; disk(*):122517; ports(*):[31000-32000]
I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
'/tmp/mesos/meta'
I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
'/tmp/mesos/meta/resources/resources.info'
I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
71.100.202.193:5051) connected to ZooKeeper
I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path '/mesos'
in ZooKeeper