Hey Dick,

Regarding slave recovery: any change in SlaveInfo (see mesos.proto) causes
the slave to be treated as a new slave, so recovery doesn't proceed.
This is because the master caches SlaveInfo and reconciling differences
in it is quite complex, so we decided to fail on any SlaveInfo change
for now.

In your particular case, https://issues.apache.org/jira/browse/MESOS-672,
which fixed redirection of the WebUI, was committed in 0.18.0. Included in
this fix is https://reviews.apache.org/r/17573/ which changed how
SlaveInfo.hostname is calculated. Since you are not providing a hostname via
the "--hostname" flag, the slave now deduces the hostname from the "--ip"
flag. It looks like in your cluster the hostname corresponding to that IP is
different from what 'os::hostname()' gives.

There are a couple of options to move forward. If you want slave recovery,
provide a "--hostname" that matches the previous hostname. If you don't care
about recovery, just remove the meta directory ("rm -rf /var/mesos/meta") so
that the slave starts as a fresh one (since you are not using cgroups, you
will have to manually kill any old executors/tasks that are still alive on
the slave).
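
As a rough sketch of those two options: the snippet below only prints the
commands rather than running them. The flags other than "--hostname" are
copied from your command line, and "slave1.example.com" is a made-up
placeholder for whatever hostname the slave registered with on 0.17.0.

```shell
#!/bin/sh
# Sketch only: prints the command for each option instead of running it.
# PREV_HOSTNAME is a hypothetical placeholder; substitute the hostname the
# slave registered with before the upgrade.
PREV_HOSTNAME="slave1.example.com"

# Option 1: keep recovery by pinning the hostname to its previous value.
recover_cmd() {
  echo "/usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos" \
       "--log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos" \
       "--hostname=$PREV_HOSTNAME"
}

# Option 2: remove the checkpointed state so the slave starts as a fresh one.
# Old executors/tasks still have to be killed by hand when cgroups are not used.
fresh_cmd() {
  echo "rm -rf /var/mesos/meta"
}

recover_cmd
fresh_cmd
```

Only one of the two applies to a given slave: pin the hostname if you want
the running tasks recovered, or wipe the meta directory if you don't.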

Not sure about your comment on CFS. Enabling CFS shouldn't change how much
memory the slave sees as available. More details/logs would help diagnose
the issue.

HTH,



On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies <d...@hellooperator.net> wrote:

> Should have said, the CLI for this is :
>
> /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos
> --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos
>
> (note IP is specified, hostname is not - docs indicated hostname arg
> will default to the fqdn of host, but it appears to be using the value
> passed as 'ip' instead.)
>
> On 18 June 2014 12:00, Dick Davies <d...@hellooperator.net> wrote:
> > Hi, we recently bumped 0.17.0 -> 0.18.2 and the slaves
> > now show their IPs rather than their FQDNs on the mesos UI.
> >
> > This broke slave recovery with the error:
> >
> > "Failed to perform recovery: Incompatible slave info detected"
> >
> >
> > cpu, mem, disk, ports are all the same. so is the 'id' field.
> >
> > the only thing that's changed is are the 'hostname' and webui_hostname
> > arguments
> > (the CLI we're passing in is exactly the same as it was on 0.17.0, so
> > presumably this is down to a change in mesos conventions).
> >
> > I've had similar issues enabling CFS in test environments (slaves show
> > less free memory and refuse to recover).
> >
> > is the 'id' field not enough to uniquely identify a slave?
>
