Re: Agent won't start

2016-03-29 Thread Greg Mann
Check out this link for info on /tmp cleanup in Ubuntu:
http://askubuntu.com/questions/20783/how-is-the-tmp-directory-cleaned-up

And check out this link for information on some of the work_dir's contents
on a Mesos agent: http://mesos.apache.org/documentation/latest/sandbox/

The work_dir contains important application state for the Mesos agent, so
it should not be placed in a location that will be automatically
garbage-collected by the OS. The choice of /tmp/mesos as a default location
is a bit unfortunate, and hopefully we can resolve that JIRA issue soon to
change it. Ideally you should be able to leave the work_dir alone and let
the Mesos agent manage it for you.

In any case, I would recommend that you set the work_dir to something
outside of /tmp; /var/lib/mesos is a commonly-used location.
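
For example, a rough sketch assuming a Mesosphere-style package install, where
the init wrapper reads one flag per file under /etc/mesos-slave/ (adjust if you
launch mesos-slave some other way):

$ sudo service mesos-slave stop
$ sudo mkdir -p /var/lib/mesos
$ echo /var/lib/mesos | sudo tee /etc/mesos-slave/work_dir
$ sudo service mesos-slave start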

Cheers,
Greg


Re: Agent won't start

2016-03-29 Thread Paul Bell
Hi Pradeep,

And thank you for your reply!

That, too, is very interesting. I think I need to synthesize what you and
Greg are telling me and come up with a clean solution. Agent nodes can
crash. Moreover, I can stop the mesos-slave service, and start it later
with a reboot in between.

So I am interested in fully understanding the causal chain here before I
try to fix anything.

-Paul



On Tue, Mar 29, 2016 at 5:51 PM, Paul Bell  wrote:

> Whoa...interesting!
>
> The node *may* have been rebooted. Uptime says 2 days. I'll need to check
> my notes.
>
> Can you point me to a reference on the Ubuntu behavior?
>
> Based on what you've told me so far, it sounds as if the sequence:
>
> stop service
> reboot agent node
> start service
>
>
> could lead to trouble - or do I misunderstand?
>
>
> Thank you again for your help.
>
> -Paul
>
> On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann  wrote:
>
>> Paul,
>> This would be relevant for any system which is automatically deleting
>> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to
>> be completely nuked at boot time. Was the agent node rebooted prior to this
>> problem?
>>
>> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell  wrote:
>>
>>> Hi Greg,
>>>
>>> Thanks very much for your quick reply.
>>>
>>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
>>> systemd. I will look at the link you provide.
>>>
>>> Is there any chance that it might apply to non-systemd platforms?
>>>
>>> Cordially,
>>>
>>> Paul
>>>
>>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:
>>>
 Hi Paul,
 Noticing the logging output, "Failed to find resources file
 '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
 may be related to the location of your agent's work_dir. See this ticket:
 https://issues.apache.org/jira/browse/MESOS-4541

 Some users have reported issues resulting from the systemd-tmpfiles
 service garbage collecting files in /tmp, perhaps this is related? What
 platform is your agent running on?

 You could try specifying a different agent work directory outside of
 /tmp/ via the `--work_dir` command-line flag.

 Cheers,
 Greg


 On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:

> Hi,
>
> I am hoping someone can shed some light on this.
>
> An agent node failed to start, that is, when I did "service
> mesos-slave start" the service came up briefly & then stopped. Before
> stopping it produced the log shown below. The last thing it wrote is
> "Trying to create path '/mesos' in Zookeeper".
>
> This mention of the mesos znode prompted me to go for a clean slate by
> removing the mesos znode from Zookeeper.
>
> After doing this, the mesos-slave service started perfectly.
>
> What might be happening here, and also what's the right way to
> trouble-shoot such a problem? Mesos is version 0.23.0.
>
> Thanks for your help.
>
> -Paul
>
>

Re: Agent won't start

2016-03-29 Thread Paul Bell
Whoa...interesting!

The node *may* have been rebooted. Uptime says 2 days. I'll need to check
my notes.

Can you point me to a reference on the Ubuntu behavior?

Based on what you've told me so far, it sounds as if the sequence:

stop service
reboot agent node
start service


could lead to trouble - or do I misunderstand?


Thank you again for your help.

-Paul

On Tue, Mar 29, 2016 at 5:36 PM, Greg Mann  wrote:

> Paul,
> This would be relevant for any system which is automatically deleting
> files in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to
> be completely nuked at boot time. Was the agent node rebooted prior to this
> problem?
>
> On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell  wrote:
>
>> Hi Greg,
>>
>> Thanks very much for your quick reply.
>>
>> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
>> systemd. I will look at the link you provide.
>>
>> Is there any chance that it might apply to non-systemd platforms?
>>
>> Cordially,
>>
>> Paul
>>
>> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:
>>
>>> Hi Paul,
>>> Noticing the logging output, "Failed to find resources file
>>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>>> may be related to the location of your agent's work_dir. See this ticket:
>>> https://issues.apache.org/jira/browse/MESOS-4541
>>>
>>> Some users have reported issues resulting from the systemd-tmpfiles
>>> service garbage collecting files in /tmp, perhaps this is related? What
>>> platform is your agent running on?
>>>
>>> You could try specifying a different agent work directory outside of
>>> /tmp/ via the `--work_dir` command-line flag.
>>>
>>> Cheers,
>>> Greg
>>>
>>>
>>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:
>>>
 Hi,

 I am hoping someone can shed some light on this.

 An agent node failed to start, that is, when I did "service mesos-slave
 start" the service came up briefly & then stopped. Before stopping it
 produced the log shown below. The last thing it wrote is "Trying to create
 path '/mesos' in Zookeeper".

 This mention of the mesos znode prompted me to go for a clean slate by
 removing the mesos znode from Zookeeper.

 After doing this, the mesos-slave service started perfectly.

 What might be happening here, and also what's the right way to
 trouble-shoot such a problem? Mesos is version 0.23.0.

 Thanks for your help.

 -Paul



Re: Agent won't start

2016-03-29 Thread Pradeep Chhetri
Hello Paul,

From the logs, it looks like, on starting the mesos slave, it is trying to
do slave recovery
(http://mesos.apache.org/documentation/latest/slave-recovery/), but since
the resources.info file is unavailable, it is unable to perform the recovery
and hence ends up killing itself.

If you are fine with losing any existing running mesos tasks/executors,
then you can just clean up the mesos agent's default working directory,
where it keeps the checkpoint information ($ rm -rf /tmp/mesos), and then
try to restart the mesos slave.
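
Roughly, a sketch of that sequence (it throws away all checkpointed task and
executor state on that agent, so only do it if that is acceptable):

# WARNING: discards the agent's checkpointed state under the default work_dir
$ sudo service mesos-slave stop
$ sudo rm -rf /tmp/mesos
$ sudo service mesos-slave start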

On Tue, Mar 29, 2016 at 10:08 PM, Paul Bell  wrote:

> Hi,
>
> I am hoping someone can shed some light on this.
>
> An agent node failed to start, that is, when I did "service mesos-slave
> start" the service came up briefly & then stopped. Before stopping it
> produced the log shown below. The last thing it wrote is "Trying to create
> path '/mesos' in Zookeeper".
>
> This mention of the mesos znode prompted me to go for a clean slate by
> removing the mesos znode from Zookeeper.
>
> After doing this, the mesos-slave service started perfectly.
>
> What might be happening here, and also what's the right way to
> trouble-shoot such a problem? Mesos is version 0.23.0.
>
> Thanks for your help.
>
> -Paul
>
>
> Log file created at: 2016/03/29 14:19:39
> Running on machine: 71.100.202.193
> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
> I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
> I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by
> root
> I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
> I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
> I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
> 4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
> I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
> posix/cpu,posix/mem
> I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
> I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
> 71.100.202.193:5051
> I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
> --attributes="hostType:shard1" --authenticatee="crammd5"
> --cgroups_cpu_enable_pids_and_tids_count="false"
> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
> --cgroups_limit_swap="false" --cgroups_root="mesos"
> --container_disk_watch_interval="15secs" --containerizers="docker,mesos"
> --default_role="*" --disk_watch_interval="1mins"
> --docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
> --docker_remove_delay="6hrs"
> --docker_sandbox_directory="/mnt/mesos/sandbox"
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
> --enforce_container_disk_quota="false"
> --executor_registration_timeout="5mins"
> --executor_shutdown_grace_period="5secs"
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
> --hadoop_home="" --help="false" --hostname="71.100.202.193"
> --initialize_driver_logging="true" --ip="71.100.202.193"
> --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
> --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
> --master="zk://71.100.202.191:2181/mesos"
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs"
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins"
> --registration_backoff_factor="1secs"
> --resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
> --strict="true" --switch_user="true" --version="false"
> --work_dir="/tmp/mesos"
> I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
> mem(*):23089; disk(*):122517; ports(*):[31000-32000]
> I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
> I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
> I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
> '/tmp/mesos/meta'
> I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
> '/tmp/mesos/meta/resources/resources.info'
> I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
> 71.100.202.193:5051) connected to ZooKeeper
> I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations: queue
> size (joins, cancels, datas) = (0, 0, 0)
> I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path '/mesos'
> in ZooKeeper
>
>


-- 
Regards,
Pradeep Chhetri


Re: Agent won't start

2016-03-29 Thread Greg Mann
Paul,
This would be relevant for any system which is automatically deleting files
in /tmp. It looks like in Ubuntu, the default behavior is for /tmp to be
completely nuked at boot time. Was the agent node rebooted prior to this
problem?
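
If I remember right, on 14.04 that boot-time cleanup honors the TMPTIME setting
in /etc/default/rcS, so you can check what the box is configured to do (worth
double-checking, since I may be misremembering the exact mechanism):

# TMPTIME=0 (the usual default, I believe) wipes /tmp on every boot;
# a negative value or "infinite" is supposed to disable the cleanup
$ grep TMPTIME /etc/default/rcS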

On Tue, Mar 29, 2016 at 2:29 PM, Paul Bell  wrote:

> Hi Greg,
>
> Thanks very much for your quick reply.
>
> I simply forgot to mention platform. It's Ubuntu 14.04 LTS and it's not
> systemd. I will look at the link you provide.
>
> Is there any chance that it might apply to non-systemd platforms?
>
> Cordially,
>
> Paul
>
> On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:
>
>> Hi Paul,
>> Noticing the logging output, "Failed to find resources file
>> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble
>> may be related to the location of your agent's work_dir. See this ticket:
>> https://issues.apache.org/jira/browse/MESOS-4541
>>
>> Some users have reported issues resulting from the systemd-tmpfiles
>> service garbage collecting files in /tmp, perhaps this is related? What
>> platform is your agent running on?
>>
>> You could try specifying a different agent work directory outside of
>> /tmp/ via the `--work_dir` command-line flag.
>>
>> Cheers,
>> Greg
>>
>>
>> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:
>>
>>> Hi,
>>>
>>> I am hoping someone can shed some light on this.
>>>
>>> An agent node failed to start, that is, when I did "service mesos-slave
>>> start" the service came up briefly & then stopped. Before stopping it
>>> produced the log shown below. The last thing it wrote is "Trying to create
>>> path '/mesos' in Zookeeper".
>>>
>>> This mention of the mesos znode prompted me to go for a clean slate by
>>> removing the mesos znode from Zookeeper.
>>>
>>> After doing this, the mesos-slave service started perfectly.
>>>
>>> What might be happening here, and also what's the right way to
>>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>>
>>> Thanks for your help.
>>>
>>> -Paul
>>>
>>>

Re: Agent won't start

2016-03-29 Thread Paul Bell
Hi Greg,

Thanks very much for your quick reply.

I simply forgot to mention the platform. It's Ubuntu 14.04 LTS and it's not
systemd. I will look at the link you provided.

Is there any chance that it might apply to non-systemd platforms?

Cordially,

Paul

On Tue, Mar 29, 2016 at 5:18 PM, Greg Mann  wrote:

> Hi Paul,
> Noticing the logging output, "Failed to find resources file
> '/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may
> be related to the location of your agent's work_dir. See this ticket:
> https://issues.apache.org/jira/browse/MESOS-4541
>
> Some users have reported issues resulting from the systemd-tmpfiles
> service garbage collecting files in /tmp, perhaps this is related? What
> platform is your agent running on?
>
> You could try specifying a different agent work directory outside of /tmp/
> via the `--work_dir` command-line flag.
>
> Cheers,
> Greg
>
>
> On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:
>
>> Hi,
>>
>> I am hoping someone can shed some light on this.
>>
>> An agent node failed to start, that is, when I did "service mesos-slave
>> start" the service came up briefly & then stopped. Before stopping it
>> produced the log shown below. The last thing it wrote is "Trying to create
>> path '/mesos' in Zookeeper".
>>
>> This mention of the mesos znode prompted me to go for a clean slate by
>> removing the mesos znode from Zookeeper.
>>
>> After doing this, the mesos-slave service started perfectly.
>>
>> What might be happening here, and also what's the right way to
>> trouble-shoot such a problem? Mesos is version 0.23.0.
>>
>> Thanks for your help.
>>
>> -Paul
>>
>>
>


Re: Agent won't start

2016-03-29 Thread Greg Mann
Hi Paul,
Noticing the logging output, "Failed to find resources file
'/tmp/mesos/meta/resources/resources.info'", I wonder if your trouble may
be related to the location of your agent's work_dir. See this ticket:
https://issues.apache.org/jira/browse/MESOS-4541

Some users have reported issues resulting from the systemd-tmpfiles service
garbage collecting files in /tmp; perhaps this is related? What platform is
your agent running on?

You could try specifying a different agent work directory outside of /tmp/
via the `--work_dir` command-line flag.

Cheers,
Greg


On Tue, Mar 29, 2016 at 2:08 PM, Paul Bell  wrote:

> Hi,
>
> I am hoping someone can shed some light on this.
>
> An agent node failed to start, that is, when I did "service mesos-slave
> start" the service came up briefly & then stopped. Before stopping it
> produced the log shown below. The last thing it wrote is "Trying to create
> path '/mesos' in Zookeeper".
>
> This mention of the mesos znode prompted me to go for a clean slate by
> removing the mesos znode from Zookeeper.
>
> After doing this, the mesos-slave service started perfectly.
>
> What might be happening here, and also what's the right way to
> trouble-shoot such a problem? Mesos is version 0.23.0.
>
> Thanks for your help.
>
> -Paul
>
>


Agent won't start

2016-03-29 Thread Paul Bell
Hi,

I am hoping someone can shed some light on this.

An agent node failed to start, that is, when I did "service mesos-slave
start" the service came up briefly & then stopped. Before stopping it
produced the log shown below. The last thing it wrote is "Trying to create
path '/mesos' in Zookeeper".

This mention of the mesos znode prompted me to go for a clean slate by
removing the mesos znode from Zookeeper.

After doing this, the mesos-slave service started perfectly.

What might be happening here, and also what's the right way to
trouble-shoot such a problem? Mesos is version 0.23.0.

Thanks for your help.

-Paul


Log file created at: 2016/03/29 14:19:39
Running on machine: 71.100.202.193
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
I0329 14:19:39.512249  5870 logging.cpp:172] INFO level logging started!
I0329 14:19:39.512564  5870 main.cpp:162] Build: 2015-07-24 10:05:39 by root
I0329 14:19:39.512588  5870 main.cpp:164] Version: 0.23.0
I0329 14:19:39.512600  5870 main.cpp:167] Git tag: 0.23.0
I0329 14:19:39.512612  5870 main.cpp:171] Git SHA:
4ce5475346a0abb7ef4b7ffc9836c5836d7c7a66
I0329 14:19:39.615172  5870 containerizer.cpp:111] Using isolation:
posix/cpu,posix/mem
I0329 14:19:39.615697  5870 main.cpp:249] Starting Mesos slave
I0329 14:19:39.616267  5870 slave.cpp:190] Slave started on 1)@
71.100.202.193:5051
I0329 14:19:39.616286  5870 slave.cpp:191] Flags at startup:
--attributes="hostType:shard1" --authenticatee="crammd5"
--cgroups_cpu_enable_pids_and_tids_count="false"
--cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
--cgroups_limit_swap="false" --cgroups_root="mesos"
--container_disk_watch_interval="15secs" --containerizers="docker,mesos"
--default_role="*" --disk_watch_interval="1mins"
--docker="/usr/local/ecxmcc/weaveShim" --docker_kill_orphans="true"
--docker_remove_delay="6hrs"
--docker_sandbox_directory="/mnt/mesos/sandbox"
--docker_socket="/var/run/docker.sock" --docker_stop_timeout="15secs"
--enforce_container_disk_quota="false"
--executor_registration_timeout="5mins"
--executor_shutdown_grace_period="5secs"
--fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB"
--frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1"
--hadoop_home="" --help="false" --hostname="71.100.202.193"
--initialize_driver_logging="true" --ip="71.100.202.193"
--isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos"
--log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO"
--master="zk://71.100.202.191:2181/mesos"
--oversubscribed_resources_interval="15secs" --perf_duration="10secs"
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns"
--quiet="false" --recover="reconnect" --recovery_timeout="15mins"
--registration_backoff_factor="1secs"
--resource_monitoring_interval="1secs" --revocable_cpu_low_priority="true"
--strict="true" --switch_user="true" --version="false"
--work_dir="/tmp/mesos"
I0329 14:19:39.616835  5870 slave.cpp:354] Slave resources: cpus(*):4;
mem(*):23089; disk(*):122517; ports(*):[31000-32000]
I0329 14:19:39.617032  5870 slave.cpp:384] Slave hostname: 71.100.202.193
I0329 14:19:39.617046  5870 slave.cpp:389] Slave checkpoint: true
I0329 14:19:39.618841  5894 state.cpp:36] Recovering state from
'/tmp/mesos/meta'
I0329 14:19:39.618872  5894 state.cpp:672] Failed to find resources file
'/tmp/mesos/meta/resources/resources.info'
I0329 14:19:39.619730  5898 group.cpp:313] Group process (group(1)@
71.100.202.193:5051) connected to ZooKeeper
I0329 14:19:39.619760  5898 group.cpp:787] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I0329 14:19:39.619773  5898 group.cpp:385] Trying to create path '/mesos'
in ZooKeeper


Re: Port Resource Offers

2016-03-29 Thread Pradeep Chhetri
Hello Erik,

Thank you for clarifying. That was exactly the concern I had.



On Tue, Mar 29, 2016 at 9:05 PM, Erik Weathers 
wrote:

> hi Pradeep,
>
> Yes, that would *definitely* be a problem.  e.g., the Storm Framework
> could easily assign Storm Workers to use those unavailable ports, and then
> they would fail to come up since they wouldn't be able to bind to their
> assigned port.  I've answered a similar question before:
>
>
> https://unix.stackexchange.com/questions/211647/how-safe-is-it-to-change-the-linux-ephemeral-port-range/237543#237543
>
> - Erik
>
> On Tue, Mar 29, 2016 at 3:07 AM, Pradeep Chhetri <
> pradeep.chhetr...@gmail.com> wrote:
>
>> Hi Klaus,
>>
>> Thank you for the quick reply.
>>
>> One quick question:
>>
>> I have some of the ports like 8400,8500,8600 which are already in use by
>> consul agent running on each mesos slave. But they are also being announced
>> by each mesos slave. Will this cause any problem to tasks which may be
>> assigned those ports in the future by mesos?
>>
>> Thanks
>>
>> On Tue, Mar 29, 2016 at 11:01 AM, Klaus Ma 
>> wrote:
>>
>>> Yes, all port resources must be ranges for now, e.g. 31000-35000.
>>>
>>> There’s already JIRA (MESOS-4627: Improve Ranges parsing to handle
>>> single values) on that, patches are pending on review :).
>>>
>>> 
>>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>>> Platform OpenSource Technology, STG, IBM GCG
>>> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>>>
>>>
>>> --
>>> Date: Tue, 29 Mar 2016 10:51:44 +0100
>>> Subject: Port Resource Offers
>>> From: pradeep.chhetr...@gmail.com
>>> To: user@mesos.apache.org
>>>
>>>
>>> Hello,
>>>
>>> I am running mesos slaves with the modified port announcement.
>>>
>>> $ cat /etc/mesos-slave/resources
>>> ports(*):[6379, 9200, 9300, 27017, 31000-35000]
>>>
>>> I can see that this is being picked up when starting the mesos slaves in ps
>>> output:
>>>
>>> --resources=ports(*):[6379, 9200, 9300, 27017, 31000-35000]
>>>
>>> However, when I hit the /state.json endpoint of mesos-master, I am
>>> seeing this:
>>>
>>> [inline screenshot of the /state.json output omitted]
>>>
>>> I can see the tasks are being assigned ports in the range of 9300-27017.
>>> There are some of these ports which are already used by other applications
>>> running on each mesos slave but are being announced. I am not sure if this
>>> will cause some issue. I am assuming that it will always check if the port
>>> is already bound by some other process before assigning a port to a task.
>>>
>>> By going through the code and test cases, it looks like it always expects
>>> port resources in ranges.
>>>
>>>
>>> https://github.com/apache/mesos/blob/master/src/v1/resources.cpp#L1255-L1263
>>>
>>> So I guess I should always define ports in ranges rather than
>>> individual ports.
>>>
>>> It will be helpful if someone can confirm if it is the expected
>>> behaviour and my configuration is wrong.
>>>
>>> --
>>> Regards,
>>> Pradeep Chhetri
>>>
>>
>>
>>
>> --
>> Regards,
>> Pradeep Chhetri
>>
>
>


-- 
Regards,
Pradeep Chhetri


Re: Port Resource Offers

2016-03-29 Thread Erik Weathers
hi Pradeep,

Yes, that would *definitely* be a problem.  e.g., the Storm Framework could
easily assign Storm Workers to use those unavailable ports, and then they
would fail to come up since they wouldn't be able to bind to their assigned
port.  I've answered a similar question before:

https://unix.stackexchange.com/questions/211647/how-safe-is-it-to-change-the-linux-ephemeral-port-range/237543#237543
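
For reference, the knob that answer discusses is the kernel's ephemeral port
range; a quick sketch for inspecting it and, if you choose, shrinking it so it
stays clear of the ports you offer to Mesos (example values only):

# show the current ephemeral (client-side) port range
$ cat /proc/sys/net/ipv4/ip_local_port_range
# example: keep ephemeral ports below a 31000-35000 block offered to Mesos
$ sudo sysctl -w net.ipv4.ip_local_port_range="10000 30999"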

- Erik

On Tue, Mar 29, 2016 at 3:07 AM, Pradeep Chhetri <
pradeep.chhetr...@gmail.com> wrote:

> Hi Klaus,
>
> Thank you for the quick reply.
>
> One quick question:
>
> I have some of the ports like 8400,8500,8600 which are already in use by
> consul agent running on each mesos slave. But they are also being announced
> by each mesos slave. Will this cause any problem to tasks which may be
> assigned those ports in the future by mesos?
>
> Thanks
>
> On Tue, Mar 29, 2016 at 11:01 AM, Klaus Ma  wrote:
>
>> Yes, all port resources must be ranges for now, e.g. 31000-35000.
>>
>> There’s already JIRA (MESOS-4627: Improve Ranges parsing to handle single
>> values) on that, patches are pending on review :).
>>
>> 
>> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
>> Platform OpenSource Technology, STG, IBM GCG
>> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>>
>>
>> --
>> Date: Tue, 29 Mar 2016 10:51:44 +0100
>> Subject: Port Resource Offers
>> From: pradeep.chhetr...@gmail.com
>> To: user@mesos.apache.org
>>
>>
>> Hello,
>>
>> I am running mesos slaves with the modified port announcement.
>>
>> $ cat /etc/mesos-slave/resources
>> ports(*):[6379, 9200, 9300, 27017, 31000-35000]
>>
>> I can see that this is being picked up when starting the mesos slaves in ps
>> output:
>>
>> --resources=ports(*):[6379, 9200, 9300, 27017, 31000-35000]
>>
>> However, when I hit the /state.json endpoint of mesos-master, I am seeing
>> this:
>>
>> [inline screenshot of the /state.json output omitted]
>>
>> I can see the tasks are being assigned ports in the range of 9300-27017.
>> There are some of these ports which are already used by other applications
>> running on each mesos slave but are being announced. I am not sure if this
>> will cause some issue. I am assuming that it will always check if the port
>> is already bound by some other process before assigning a port to a task.
>>
>> By going through the code and test cases, it looks like it always expects
>> port resources in ranges.
>>
>>
>> https://github.com/apache/mesos/blob/master/src/v1/resources.cpp#L1255-L1263
>>
>> So I guess I should always define ports in ranges rather than individual
>> ports.
>>
>> It will be helpful if someone can confirm if it is the expected behaviour
>> and my configuration is wrong.
>>
>> --
>> Regards,
>> Pradeep Chhetri
>>
>
>
>
> --
> Regards,
> Pradeep Chhetri
>


Re: Port Resource Offers

2016-03-29 Thread Pradeep Chhetri
Hi Klaus,

Thank you for the quick reply.

One quick question:

I have some ports, like 8400, 8500, and 8600, which are already in use by the
consul agent running on each mesos slave, but they are also being announced
by each mesos slave. Will this cause any problem for tasks which may be
assigned those ports in the future by mesos?

Thanks

On Tue, Mar 29, 2016 at 11:01 AM, Klaus Ma  wrote:

> Yes, all port resources must be ranges for now, e.g. 31000-35000.
>
> There’s already JIRA (MESOS-4627: Improve Ranges parsing to handle single
> values) on that, patches are pending on review :).
>
> 
> Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
> Platform OpenSource Technology, STG, IBM GCG
> +86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me
>
>
> --
> Date: Tue, 29 Mar 2016 10:51:44 +0100
> Subject: Port Resource Offers
> From: pradeep.chhetr...@gmail.com
> To: user@mesos.apache.org
>
>
> Hello,
>
> I am running mesos slaves with the modified port announcement.
>
> $ cat /etc/mesos-slave/resources
> ports(*):[6379, 9200, 9300, 27017, 31000-35000]
>
> I can see that this is being picked up when starting the mesos slaves in ps
> output:
>
> --resources=ports(*):[6379, 9200, 9300, 27017, 31000-35000]
>
> However, when I hit the /state.json endpoint of mesos-master, I am seeing
> this:
>
> [inline screenshot of the /state.json output omitted]
>
> I can see the tasks are being assigned ports in the range of 9300-27017.
> There are some of these ports which are already used by other applications
> running on each mesos slave but are being announced. I am not sure if this
> will cause some issue. I am assuming that it will always check if the port
> is already bound by some other process before assigning a port to a task.
>
> By going through the code and test cases, it looks like it always expects
> port resources in ranges.
>
>
> https://github.com/apache/mesos/blob/master/src/v1/resources.cpp#L1255-L1263
>
> So I guess I should always define ports in ranges rather than individual
> ports.
>
> It will be helpful if someone can confirm if it is the expected behaviour
> and my configuration is wrong.
>
> --
> Regards,
> Pradeep Chhetri
>



-- 
Regards,
Pradeep Chhetri


RE: Port Resource Offers

2016-03-29 Thread Klaus Ma
Yes, all port resources must be ranges for now, e.g. 31000-35000.
There's already a JIRA ticket (MESOS-4627: Improve Ranges parsing to handle
single values) for that; patches are pending review :).
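
Until that lands, a degenerate range (begin equal to end) should express the
same intent as a single value, e.g. something like:

$ cat /etc/mesos-slave/resources
ports(*):[6379-6379, 9200-9200, 9300-9300, 27017-27017, 31000-35000]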
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform OpenSource Technology, STG, IBM GCG
+86-10-8245 4084 | klaus1982...@gmail.com | http://k82.me

Date: Tue, 29 Mar 2016 10:51:44 +0100
Subject: Port Resource Offers
From: pradeep.chhetr...@gmail.com
To: user@mesos.apache.org

Hello,

I am running mesos slaves with the modified port announcement.

$ cat /etc/mesos-slave/resources
ports(*):[6379, 9200, 9300, 27017, 31000-35000]

I can see that this is being picked up when starting the mesos slaves in ps
output:

--resources=ports(*):[6379, 9200, 9300, 27017, 31000-35000]

However, when I hit the /state.json endpoint of mesos-master, I am seeing this:

[inline screenshot of the /state.json output omitted]

I can see the tasks are being assigned ports in the range of 9300-27017. There
are some of these ports which are already used by other applications running on
each mesos slave but are being announced. I am not sure if this will cause
some issue. I am assuming that it will always check if the port is already
bound by some other process before assigning a port to a task.

By going through the code and test cases, it looks like it always expects port
resources in ranges.

https://github.com/apache/mesos/blob/master/src/v1/resources.cpp#L1255-L1263

So I guess I should always define ports in ranges rather than individual ports.

It will be helpful if someone can confirm whether this is the expected behaviour
or my configuration is wrong.
-- 
Regards,
Pradeep Chhetri