Re: "Failed to perform recovery: Incompatible slave info detected"

Vinod Kone Wed, 18 Jun 2014 11:48:06 -0700

Filed https://issues.apache.org/jira/browse/MESOS-1506 for fixing
flags/documentation.



On Wed, Jun 18, 2014 at 11:33 AM, Dick Davies <[email protected]>
wrote:

> Thanks, it might be worth correcting the docs in that case then.
> This URL says it'll use the system hostname, not the reverse DNS of
> the ip argument:
>
> http://mesos.apache.org/documentation/latest/configuration/
>
> Re: the CFS thing - this was while running Docker on the slaves. Docker
> also uses cgroups, so maybe resources were getting split with Mesos or
> something? (I'm still reading up on cgroups.) It definitely wasn't the
> case until CFS was enabled.
>
>
> On 18 June 2014 18:34, Vinod Kone <[email protected]> wrote:
> > Hey Dick,
> >
> > Regarding slave recovery, any change in the SlaveInfo (see mesos.proto)
> > causes the slave to be treated as a new slave, and hence recovery doesn't
> > proceed. This is because the Master caches SlaveInfo and it is quite
> > complex to reconcile the differences in SlaveInfo. So we decided to fail
> > on any SlaveInfo changes for now.
> >
> > In your particular case, https://issues.apache.org/jira/browse/MESOS-672
> > was committed in 0.18.0, which fixed redirection of the WebUI. Included in
> > this fix is https://reviews.apache.org/r/17573/ which changed how
> > SlaveInfo.hostname is calculated. Since you are not providing a hostname
> > via the "--hostname" flag, the slave now deduces the hostname from the
> > "--ip" flag. It looks like in your cluster the hostname corresponding to
> > that IP is different from what 'os::hostname()' gives.
> >
> > There are a couple of options to move forward. If you want slave recovery,
> > provide a "--hostname" that matches the previous hostname. If you don't
> > care about recovery, just remove the meta directory
> > ("rm -rf /var/mesos/meta") so that the slave starts as a fresh one (since
> > you are not using cgroups, you will have to manually kill any old
> > executors/tasks that are still alive on the slave).
> >
> > Not sure about your comment on CFS. Enabling CFS shouldn't change how much
> > memory the slave sees as available. More details/logs would help diagnose
> > the issue.
> >
> > HTH,
> >
> >
> >
> > On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies <[email protected]> wrote:
> >>
> >> I should have said, the CLI for this is:
> >>
> >> /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos
> >> --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos
> >>
> >> (note IP is specified, hostname is not - docs indicated hostname arg
> >> will default to the fqdn of host, but it appears to be using the value
> >> passed as 'ip' instead.)
> >>
> >> On 18 June 2014 12:00, Dick Davies <[email protected]> wrote:
> >> > Hi, we recently bumped 0.17.0 -> 0.18.2 and the slaves
> >> > now show their IPs rather than their FQDNs on the mesos UI.
> >> >
> >> > This broke slave recovery with the error:
> >> >
> >> > "Failed to perform recovery: Incompatible slave info detected"
> >> >
> >> >
> >> > cpu, mem, disk, ports are all the same. so is the 'id' field.
> >> >
> >> > the only things that have changed are the 'hostname' and
> >> > 'webui_hostname' arguments (the CLI we're passing in is exactly the
> >> > same as it was on 0.17.0, so presumably this is down to a change in
> >> > mesos conventions).
> >> >
> >> > I've had similar issues enabling CFS in test environments (slaves show
> >> > less free memory and refuse to recover).
> >> >
> >> > is the 'id' field not enough to uniquely identify a slave?
> >
> >
>
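
Vinod's point that any SlaveInfo change blocks recovery can be illustrated
with a small sketch. This is not Mesos code: the SlaveInfo class below is a
hypothetical stand-in holding only the fields mentioned in this thread, the
function names are made up for illustration, and the example values are
assumptions. The idea is simply that a matching 'id' is not sufficient, since
a difference in any field is treated as incompatible.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class SlaveInfo:
    # Hypothetical stand-in for the SlaveInfo message in mesos.proto;
    # only the fields discussed in this thread are modelled.
    id: str
    hostname: str
    webui_hostname: str
    resources: str


def changed_fields(checkpointed: SlaveInfo, current: SlaveInfo) -> list:
    """Return the names of the fields that differ between the two infos."""
    return [
        f.name
        for f in fields(SlaveInfo)
        if getattr(checkpointed, f.name) != getattr(current, f.name)
    ]


def can_recover(checkpointed: SlaveInfo, current: SlaveInfo) -> bool:
    """Recovery is allowed only when no field has changed at all."""
    return not changed_fields(checkpointed, current)


# Example values are assumptions, not taken from a real cluster.
old = SlaveInfo(
    id="slave-1",
    hostname="slave1.example.com",
    webui_hostname="slave1.example.com",
    resources="cpus:4;mem:8192",
)
new = replace(old, hostname="10.10.10.101", webui_hostname="10.10.10.101")

print(changed_fields(old, new))  # ['hostname', 'webui_hostname']
print(can_recover(old, new))     # False
```

Listing the differing fields rather than returning only a boolean makes it
easier to see which startup flag needs attention when recovery is refused.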

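The hostname selection described in the thread (an explicit "--hostname"
wins, otherwise the name is deduced from "--ip") can be sketched as below.
This is a rough model and not the Mesos implementation: effective_hostname
and its parameters are hypothetical names, the fallback to the system
hostname when neither flag is given is an assumption, and the Python socket
calls only stand in for whatever lookup the slave really performs.

```python
import socket


def effective_hostname(hostname_flag=None, ip_flag=None,
                       reverse_lookup=None,
                       system_hostname=socket.gethostname):
    """Pick the hostname a slave would advertise, per the thread's description.

    reverse_lookup and system_hostname are injectable so the logic can be
    exercised without touching real DNS.
    """
    if reverse_lookup is None:
        # socket.gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
        reverse_lookup = lambda ip: socket.gethostbyaddr(ip)[0]

    if hostname_flag is not None:
        # An explicit --hostname always wins.
        return hostname_flag
    if ip_flag is not None:
        # No --hostname: deduce the name from --ip (0.18.x behaviour
        # as described by Vinod).
        return reverse_lookup(ip_flag)
    # Assumption: with neither flag set, fall back to the system hostname.
    return system_hostname()
```

Comparing effective_hostname(ip_flag="10.10.10.101") with
socket.gethostname() on a slave is one way to spot the kind of mismatch
reported above before upgrading; if the two differ, passing the old name via
"--hostname" keeps SlaveInfo unchanged.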