Fwd: mesos-slave configuration file with Ports Offer

2015-11-13 Thread Miguel Ángel Ausó
Hi

I'm trying to configure "port offers" in mesos-slave using a configuration
file (I'm using the mesosphere package). I know that I can use the "--resources"
option on the start command, but I need to know if it's possible to do this with a
configuration file, the same way as other options, for example gc-delay or the
master configuration.

I tried different formats in
/etc/mesos-slave/resources/ports
[21000-24000]
ports:[21000-24000]
*:[21000-24000]
21000-24000

thanks!!


Re: mesos-slave configuration file with Ports Offer

2015-11-13 Thread craig w
Miguel, create a file named: /etc/mesos-slave/resources/ports(*)

It should contain: [21000-24000]
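
For completeness, here is a rough sketch of the whole thing, assuming the mesosphere
packaging (where the init wrapper turns files under /etc/mesos-slave into command-line
flags); the port range is just the one from your example:

    # file name = resource name, file content = value
    echo "[21000-24000]" > /etc/mesos-slave/resources/ports

    # restart the slave so the wrapper picks up the new resource
    sudo service mesos-slave restart

    # this should end up equivalent to starting the slave by hand with:
    #   mesos-slave ... --resources="ports:[21000-24000]"

I haven't re-checked exactly how the wrapper assembles the flag, so treat the
equivalence above as an assumption and verify against the slave log.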

On Fri, Nov 13, 2015 at 6:30 AM, Miguel Ángel Ausó wrote:

> Hi
>
> I'm trying to configure "port offers" in mesos-slave using a configuration
> file (I'm using the mesosphere package). I know that I can use the "--resources"
> option on the start command, but I need to know if it's possible to do this with a
> configuration file, the same way as other options, for example gc-delay or the
> master configuration.
>
> I tried different formats in
> /etc/mesos-slave/resources/ports
> [21000-24000]
> ports:[21000-24000]
> *:[21000-24000]
> 21000-24000
>
> thanks!!
>
>
>


-- 

https://github.com/mindscratch
https://www.google.com/+CraigWickesser
https://twitter.com/mind_scratch
https://twitter.com/craig_links


Re: Fate of slave node after timeout

2015-11-13 Thread Jie Yu
>
> Can that slave never again be added into the cluster, i.e., what happens
> if it comes up 1 second after exceeding the timeout product?


It'll not be added to the cluster. The master will send a Shutdown message
to the slave if it comes up after the timeout.
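
For reference, a back-of-the-envelope sketch of that removal window, assuming the
stock master flag values (I believe these are the defaults, but please check
`mesos-master --help` on your build):

    # master flags (assumed defaults)
    #   --slave_ping_timeout=15secs
    #   --max_slave_ping_timeouts=5
    #
    # removal window = slave_ping_timeout * max_slave_ping_timeouts
    #                = 15s * 5 = 75s without a ping response

So a slave that reconnects even one second after that window closes will get the
Shutdown message.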

- Jie

On Fri, Nov 13, 2015 at 1:44 PM, Paul Bell  wrote:

> Hi All,
>
> IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded
> without a response from a mesos-slave, the master will remove the slave. In
> the Mesos UI I can see slave state transition from 1 deactivated to 0.
>
> Can that slave never again be added into the cluster, i.e., what happens
> if it comes up 1 second after exceeding the timeout product?
>
> (I'm dusting off some old notes and trying to refresh my memory about
> problems I haven't seen in quite some time).
>
> Thank you.
>
> -Paul
>


Re: Fate of slave node after timeout

2015-11-13 Thread Jie Yu
Paul, the slave will terminate after receiving a Shutdown message. The
slave will be restarted (e.g., by monit or systemd) and register with the
master as a new slave (a different slaveId).
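
In case it helps, a minimal sketch of putting the slave under systemd supervision.
This is not the unit file the packages ship, and the master address below is a
placeholder:

    # write a sketch unit file, then enable it (placeholders throughout)
    cat > /etc/systemd/system/mesos-slave.service <<'EOF'
    [Unit]
    Description=Mesos Slave (supervised)
    After=network.target

    [Service]
    ExecStart=/usr/sbin/mesos-slave --master=zk://zk1:2181,zk2:2181,zk3:2181/mesos
    Restart=always
    RestartSec=5

    [Install]
    WantedBy=multi-user.target
    EOF
    systemctl daemon-reload && systemctl enable --now mesos-slave

With Restart=always the slave comes straight back after the Shutdown and registers
under a fresh slaveId, as described above.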

- Jie

On Fri, Nov 13, 2015 at 2:53 PM, Paul  wrote:

> Jie,
>
> Thank you.
>
> That's odd behavior, no? That would seem to mean that the slave can never
> again join the cluster, at least not from its original IP address.
>
> What if the master bounces? Will it then tolerate the slave?
>
> -Paul
>
> On Nov 13, 2015, at 4:46 PM, Jie Yu  wrote:
>
>> Can that slave never again be added into the cluster, i.e., what happens
>> if it comes up 1 second after exceeding the timeout product?
>
>
> It'll not be added to the cluster. The master will send a Shutdown message
> to the slave if it comes up after the timeout.
>
> - Jie
>
> On Fri, Nov 13, 2015 at 1:44 PM, Paul Bell  wrote:
>
>> Hi All,
>>
>> IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded
>> without a response from a mesos-slave, the master will remove the slave. In
>> the Mesos UI I can see slave state transition from 1 deactivated to 0.
>>
>> Can that slave never again be added into the cluster, i.e., what happens
>> if it comes up 1 second after exceeding the timeout product?
>>
>> (I'm dusting off some old notes and trying to refresh my memory about
>> problems I haven't seen in quite some time).
>>
>> Thank you.
>>
>> -Paul
>>
>
>


Re: Fate of slave node after timeout

2015-11-13 Thread Paul
Jie,

Thank you.

That's odd behavior, no? That would seem to mean that the slave can never again
join the cluster, at least not from its original IP address.

What if the master bounces? Will it then tolerate the slave?

-Paul

On Nov 13, 2015, at 4:46 PM, Jie Yu  wrote:

>> Can that slave never again be added into the cluster, i.e., what happens if 
>> it comes up 1 second after exceeding the timeout product?
> 
> It'll not be added to the cluster. The master will send a Shutdown message to 
> the slave if it comes up after the timeout.
> 
> - Jie 
> 
>> On Fri, Nov 13, 2015 at 1:44 PM, Paul Bell  wrote:
>> Hi All,
>> 
>> IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded 
>> without a response from a mesos-slave, the master will remove the slave. In 
>> the Mesos UI I can see slave state transition from 1 deactivated to 0.
>> 
>> Can that slave never again be added into the cluster, i.e., what happens if 
>> it comes up 1 second after exceeding the timeout product?
>> 
>> (I'm dusting off some old notes and trying to refresh my memory about 
>> problems I haven't seen in quite some time).
>> 
>> Thank you.
>> 
>> -Paul
> 


Re: Fate of slave node after timeout

2015-11-13 Thread Paul
Ah, now I get it.

And this comports with the behavior I am observing right now.

Thanks again, Jie.

-Paul

> On Nov 13, 2015, at 5:55 PM, Jie Yu  wrote:
> 
> Paul, the slave will terminate after receiving a Shutdown message. The slave 
> will be restarted (e.g., by monit or systemd) and register with the master as 
> a new slave (a different slaveId).
> 
> - Jie
> 
>> On Fri, Nov 13, 2015 at 2:53 PM, Paul  wrote:
>> Jie,
>> 
>> Thank you.
>> 
>> That's odd behavior, no? That would seem to mean that the slave can never
>> again join the cluster, at least not from its original IP address.
>> 
>> What if the master bounces? Will it then tolerate the slave?
>> 
>> -Paul
>> 
>> On Nov 13, 2015, at 4:46 PM, Jie Yu  wrote:
>> 
>>>> Can that slave never again be added into the cluster, i.e., what happens
>>>> if it comes up 1 second after exceeding the timeout product?
>>> 
>>> It'll not be added to the cluster. The master will send a Shutdown message 
>>> to the slave if it comes up after the timeout.
>>> 
>>> - Jie 
>>> 
>>>> On Fri, Nov 13, 2015 at 1:44 PM, Paul Bell wrote:
>>>> Hi All,
>>>>
>>>> IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded
>>>> without a response from a mesos-slave, the master will remove the slave.
>>>> In the Mesos UI I can see slave state transition from 1 deactivated to 0.
>>>>
>>>> Can that slave never again be added into the cluster, i.e., what happens
>>>> if it comes up 1 second after exceeding the timeout product?
>>>>
>>>> (I'm dusting off some old notes and trying to refresh my memory about
>>>> problems I haven't seen in quite some time).
>>>>
>>>> Thank you.
>>>>
>>>> -Paul
> 


Fate of slave node after timeout

2015-11-13 Thread Paul Bell
Hi All,

IIRC, after (max_slave_ping_timeouts * slave_ping_timeout) is exceeded
without a response from a mesos-slave, the master will remove the slave. In
the Mesos UI I can see slave state transition from 1 deactivated to 0.

Can that slave never again be added into the cluster, i.e., what happens if
it comes up 1 second after exceeding the timeout product?

(I'm dusting off some old notes and trying to refresh my memory about
problems I haven't seen in quite some time).

Thank you.

-Paul


Re: Mesos and Zookeeper TCP keepalive

2015-11-13 Thread Jeremy Olexa
Jojy,


I will eventually be able to try adjusting those options, but not this moment, 
as it is the busy time.


Thanks again for all the help!

-Jeremy



From: tommy xiao 
Sent: Thursday, November 12, 2015 10:42 PM
To: user@mesos.apache.org
Subject: Re: Mesos and Zookeeper TCP keepalive

Jojy, thanks for the clarification! Cool!

2015-11-13 9:00 GMT+08:00 Jojy Varghese:
Sorry for confusing you. I meant that you could maybe change your 
“max_slave_ping_timeouts” / “slave_ping_timeout” values and re-enable snapshots.
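
For example (just a sketch, assuming the mesosphere packaging convention of one file
per flag under /etc/mesos-master, and with purely illustrative values):

    # widen the ping window on the masters
    echo "30secs" > /etc/mesos-master/slave_ping_timeout
    echo "10"     > /etc/mesos-master/max_slave_ping_timeouts
    # restart mesos-master afterwards; this should be equivalent to passing
    #   --slave_ping_timeout=30secs --max_slave_ping_timeouts=10

That would stretch the removal window to 10 * 30s = 300s, which might be enough to
ride out a snapshot-induced pause.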

-Jojy

On Nov 12, 2015, at 3:30 PM, tommy xiao wrote:

Hi Jojy

What do you mean by keeping the “snapshot/backup”? Could you please point me to some
docs for reference?

2015-11-13 1:59 GMT+08:00 Jojy Varghese:
Hi Jeremy
 Good to hear that you have a solution. Was curious about the correlation 
between snapshot creation and timeouts. Wondering if you can change 
“max_slave_ping_timeouts” / "slave_ping_timeout" as Joris suggested and keep 
the “snapshot/backup” also.

thanks
Jojy


> On Nov 11, 2015, at 6:04 PM, Jeremy Olexa wrote:
>
> Hi Joris, all,
>
> We are still at the default timeout values for those that you linked. In the 
> meantime, since the community pushed us to look at other things besides 
> evading firewall timeouts, we have disabled snapshot/backups on the VMs and 
> this has resolved the issue for the past 24 hours on the control group that 
> we have disabled, which has been the best behavior that we have ever seen. 
> There was a very close correlation between snapshot creation and mesos-slave 
> process restart (within minutes) that got us to this point. Apparently, the 
> snapshot creation and quiesce of the filesystem cause enough disruption to 
> trigger the default timeouts within mesos.
>
> We are fine with this solution because Mesos has enabled us to have a more 
> heterogeneous fleet of servers and backups aren't needed on these hosts. 
> Mesos for the win, there.
>
> Thanks to everyone that has contributed on this thread! It was a fun exercise 
> for me, in the code. It was also useful to hear feedback from the list on 
> places to look, eventually pushing me to a solution.
> -Jeremy
>
> From: Joris Van Remoortere
> Sent: Wednesday, November 11, 2015 12:56 AM
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>
> Hi Jeremy,
>
> Can you read the description of these parameters on the master, and possibly 
> share your values for these flags?
>
>
> It seems from the re-registration attempt on the agent, that the master has 
> already treated the agent as "failed", and so will tell it to shut down on 
> any re-registration attempt.
>
> I'm curious if there is a conflict (or too narrow of a time gap) of timeouts 
> in your environment to allow re-registration by the agent after the agent 
> notices it needs to re-establish the connection.
>
> —
> Joris Van Remoortere
> Mesosphere
>
> On Tue, Nov 10, 2015 at 5:02 AM, Jeremy Olexa wrote:
> Hi Tommy, Erik, all,
>
> You are correct in your assumption that I'm trying to solve for a one hour 
> session expire time on a firewall. For some more background info, our master 
> cluster is in datacenter X, the slaves in X will stay "up" for days and days. 
> The slaves in a different datacenter, Y, connected to that master cluster 
> will stay "up" for about a few days and restart. The master cluster is 
> healthy, with a stable leader for months (no flapping), same for the ZK 
> "leader". There are about 35 slaves in datacenter Y. Maybe the firewall 
> session timer is a red herring because the slave restart is seemingly random 
> (the slave with the highest uptime is 6 days, but a handful only have uptime 
> of a day)
>
> I started debugging this a while ago, and the gist of the logs is here:
> https://gist.github.com/jolexa/1a80e26a4b017846d083 I've posted this back in 
> October seeking help and Benjamin suggested network issues in both 
> directions, so I thought firewall.
>
> Thanks for any hints,
> Jeremy
>
> From: tommy xiao
> Sent: Tuesday, November 10, 2015 3:07 AM
>
> To: user@mesos.apache.org
> Subject: Re: Mesos and Zookeeper TCP keepalive
>
> Same here, same question as Erik. Could you please provide more background
> info? Thanks.
>
> 2015-11-10 15:56 GMT+08:00 Erik Weathers:
> It would really help if you (Jeremy) explained the *actual* problem you are 
> facing.  I'm *guessing* that it's a firewall timing out the sessions because 
> there isn't activity on them for whatever the timeout of the 

[VOTE] Release Apache Mesos 0.26.0 (rc1)

2015-11-13 Thread Till Toenshoff
Hi friends,

Please vote on releasing the following candidate as Apache Mesos 0.26.0.


The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=0.26.0-rc1


The candidate for Mesos 0.26.0 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/0.26.0-rc1/mesos-0.26.0.tar.gz

The tag to be voted on is 0.26.0-rc1:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=0.26.0-rc1

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/0.26.0-rc1/mesos-0.26.0.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/0.26.0-rc1/mesos-0.26.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1085
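
For anyone who wants to check the artifacts before voting, here is a rough sketch of
one way to do it (not an official procedure):

    # fetch the candidate, its signature, and the KEYS file
    wget https://dist.apache.org/repos/dist/dev/mesos/0.26.0-rc1/mesos-0.26.0.tar.gz
    wget https://dist.apache.org/repos/dist/dev/mesos/0.26.0-rc1/mesos-0.26.0.tar.gz.asc
    wget https://dist.apache.org/repos/dist/release/mesos/KEYS

    # verify the PGP signature
    gpg --import KEYS
    gpg --verify mesos-0.26.0.tar.gz.asc mesos-0.26.0.tar.gz

    # compare against the published .md5 file
    md5sum mesos-0.26.0.tar.gz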

Please vote on releasing this package as Apache Mesos 0.26.0!

The vote is open until Sun Nov 15 20:13:46 CET 2015 and passes if a majority of 
at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 0.26.0
[ ] -1 Do not release this package because ...

Thanks,
Bernd & Till 

Re: Zookeeper cluster changes

2015-11-13 Thread Donald Laidlaw
That is great stuff Joseph!

What I am trying to understand at the moment is how mesos (master and
slave) uses the list of zookeepers passed to it at startup. For example:

zk://10.1.1.10:2181,10.1.2.10:2181,10.1.3.10:2181/mesos

At startup, does mesos attempt to connect to these servers in order and then 
use the first one that works? Later on, if the connection is lost, does mesos 
continue trying to reconnect with the servers in the list? Or does it fail fast 
as was mentioned earlier by Erik Weathers?

I ask, because that affects how I would like to try to recover the servers. If 
it is failing fast, then I can just check for changes to the ensemble at mesos 
startup. If it is not failing fast, I need more complex code to recognize the 
ensemble change and then do a rolling restart.

Thanks!

Don Laidlaw
866 Cobequid Rd.
Lower Sackville, NS
B4C 4E1
Phone: 1-902-576-5185
Cell: 1-902-401-6771

> On Nov 11, 2015, at 4:29 PM, Joseph Smith  wrote:
> 
> We have live-migrated an entire cluster of 10s of thousands of Mesos Agents
> to point at a new ZK ensemble without killing our cluster, or the tasks we
> were running. (twice)
> 
> We started by shutting off all of the Mesos Masters. I’ve heard rumors that 
> some people have found their Mesos Agents will kill themselves without a 
> master, but this has never been my experience. If you find this to be the 
> case, please reach out as I’d love to avoid that fate at all costs.
> 
> Once the masters were down, we submitted a change to modify the configuration 
> for the agents (we set up an automatic restart of the slave for some 
> configuration values such as this one to make it easier to roll out). It took 
> our current configuration management system the better part of an hour to get 
> the change propagated across the cluster, but while that was happening, the 
> agents were happily running, and user tasks were serving traffic. Once we saw 
> zk_watch_count (check under the mntr command)
> increase to the expected number of agents on the new ensemble, we turned on
> the masters (also pointing at the new ensemble now) and the agents sent 
> status updates back to the masters. If you haven’t taken a look at zktraffic,
> I’d recommend it for improved
> visibility into your ensemble as well.
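>
> (For reference, a quick way to watch that counter, assuming plain netcat and that
> ZooKeeper's four-letter-word commands are enabled; the host below is a placeholder:
>
>     echo mntr | nc new-zk-host 2181 | grep zk_watch_count
>
> Run it against each member of the new ensemble and sum the counts.)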
> 
> Please note - there’s a bug in the C ZooKeeper library where the members of
> an ensemble will only be resolved once. There isn’t conclusive proof that it
> affects the agents. We
> were fine, but you may want to validate.
> 
>> On Nov 10, 2015, at 11:23 AM, Donald Laidlaw wrote:
>> 
>> I agree, you want to apply the changes gradually so as not to lose a quorum. 
>> The problem is automating this so that it happens in a lights-out 
>> environment, in the cloud, without some poor slob's pager going off in the 
>> middle of the night :)
>> 
>> While health checks can detect and replace a dead server reliably on any 
>> number of clouds, the new server comes up with a new IP address. This server 
>> can reliably join the zookeeper ensemble. However, it is tough to automate
>> the rolling restart of the other mesos servers, both masters and slaves, 
>> that needs to occur to keep them happy. 
>> 
>> One thing I have not tried is to just ignore the change, and use something 
>> to detect the masters just prior to starting mesos. If they truly fail fast,
>> then maybe, when they lose a zookeeper connection, they don’t care that
>> they were started with an out-of-date list of zookeeper servers.
>> 
>> What does mesos-master and mesos-slave do with a list of zookeeper servers 
>> to connect to? Just try them in order until one works, then use that one 
>> until it fails? If so, and it fails fast, then letting it continue to run 
>> with a stale list will have no ill effects. Or does it keep trying the 
>> servers in the list when a connection fails? 
>> 
>> Don Laidlaw
>> 
>> 
>>> On Nov 10, 2015, at 4:42 AM, Erik Weathers wrote:
>>> 
>>> Keep in mind that mesos is designed to "fail fast".  So when there are 
>>> problems (such as losing connectivity to the resolved ZooKeeper IP) the 
>>> daemon(s) (master & slave) die.
>>> 
>>> Due to this design, we are all supposed to run the mesos daemons under 
>>> "supervision", which means auto-restart after they crash.  This can be done 
>>> with monit/god/runit/etc.
>>> 
>>> So, to perform maintenance on ZooKeeper, I would firstly ensure the 
>>> mesos-master processes are running under "supervision" so that they restart 
>>> quickly after a ZK connectivity failure occurs.  Then proceed with standard 
>>>