Fair enough, appreciate the explanation (and that you've clearly
thought hard about this in the design).

The cluster I hit this on was still being built and had no tasks
deployed; it just violated my Principle of Least Astonishment that
dropping more cores into the slaves seemed to kill them off.

I can see there must be cases where this design choice is the right
thing to do, and now that we know about it we can work around it easily
enough - so thanks for the lesson :)

On 19 June 2014 18:43, Vinod Kone <vinodk...@gmail.com> wrote:
> Yes. The idea behind storing the whole slave info is to provide safety.
>
> Imagine the slave's resources were reduced on a restart. What does this
> mean for already-running tasks that are using more resources than the
> newly configured limits? Should the slave kill them? If so, which ones?
> Similarly, what happens when the slave attributes change (e.g., "secure"
> to "unsecure")? Is it safe to keep running the existing tasks?
>
> As you can see, reconciliation of slave info is a complex problem. While
> there are some smarts we could add to the slave (e.g., an increase of
> resources is OK while a decrease is not), we haven't really seen a need
> for it yet.
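>
> For illustration, a rough sketch of that rule in shell - purely
> hypothetical, since today the slave simply refuses recovery on any
> SlaveInfo change:
>
> #!/bin/sh
> # Hypothetical "increase is OK, decrease is not" reconciliation rule.
> # OLD_CPUS would come from the checkpointed SlaveInfo and NEW_CPUS from
> # the newly configured resources; the values here are just examples.
> OLD_CPUS=4
> NEW_CPUS=8
> if [ "$NEW_CPUS" -ge "$OLD_CPUS" ]; then
>   echo "resources grew: running tasks still fit, recovery could proceed"
> else
>   echo "resources shrank: running tasks may exceed limits, refuse recovery"
> fi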
>
>
> On Thu, Jun 19, 2014 at 3:03 AM, Dick Davies <d...@hellooperator.net> wrote:
>>
>> Fab, thanks Vinod. Turns out that feature (serving the UI on a
>> different FQDN) might well be really useful for us, so every cloud has
>> a silver lining :)
>>
>> Back to the metadata feature though - do you know why just the 'id' of
>> the slaves isn't used? As it stands, adding disk storage, cores or RAM
>> to a slave will cause it to drop out of the cluster - does checking the
>> whole metadata provide any benefit over checking the id?
>>
>> On 18 June 2014 19:46, Vinod Kone <vinodk...@gmail.com> wrote:
>> > Filed https://issues.apache.org/jira/browse/MESOS-1506 for fixing
>> > flags/documentation.
>> >
>> >
>> > On Wed, Jun 18, 2014 at 11:33 AM, Dick Davies <d...@hellooperator.net>
>> > wrote:
>> >>
>> Thanks - it might be worth correcting the docs in that case, then.
>> This page says the slave will use the system hostname, not the reverse
>> DNS of the --ip argument:
>> >>
>> >> http://mesos.apache.org/documentation/latest/configuration/
>> >>
>> Re: the CFS thing - this was while running Docker on the slaves, which
>> also uses cgroups, so maybe resources were getting split with Mesos or
>> something? (I'm still reading up on cgroups.) It definitely wasn't
>> happening until CFS was enabled.
>> >>
>> >>
>> >> On 18 June 2014 18:34, Vinod Kone <vinodk...@gmail.com> wrote:
>> >> > Hey Dick,
>> >> >
>> >> > Regarding slave recovery: any change in the SlaveInfo (see
>> >> > mesos.proto) is treated as a new slave, and hence recovery doesn't
>> >> > proceed. This is because the Master caches the SlaveInfo and it is
>> >> > quite complex to reconcile differences in it, so we decided to fail
>> >> > on any SlaveInfo change for now.
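>> >> >
>> >> > If you want to see what was checkpointed, the SlaveInfo is stored
>> >> > as a serialized protobuf under the work directory. Assuming the
>> >> > default layout and your --work_dir ('latest' should be a symlink
>> >> > to the current slave id), something like:
>> >> >
>> >> > # Binary protobuf, so just look for readable fields such as the
>> >> > # hostname and resources:
>> >> > strings /var/mesos/meta/slaves/latest/slave.info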
>> >> >
>> >> > In your particular case,
>> >> > https://issues.apache.org/jira/browse/MESOS-672 was committed in
>> >> > 0.18.0, which fixed redirection of the WebUI. Included in that fix
>> >> > is https://reviews.apache.org/r/17573/, which changed how
>> >> > SlaveInfo.hostname is calculated. Since you are not providing a
>> >> > hostname via the "--hostname" flag, the slave now deduces the
>> >> > hostname from the "--ip" flag. It looks like in your cluster the
>> >> > hostname corresponding to that IP is different from what
>> >> > 'os::hostname()' gives.
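>> >> >
>> >> > You can compare the two values yourself on the slave, e.g. (using
>> >> > the IP from your earlier mail):
>> >> >
>> >> > # Roughly what os::hostname() resolves to:
>> >> > hostname --fqdn
>> >> > # What the slave now derives from --ip (a reverse DNS lookup):
>> >> > getent hosts 10.10.10.101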
>> >> >
>> >> > A couple of options to move forward. If you want slave recovery,
>> >> > provide a "--hostname" that matches the previous hostname. If you
>> >> > don't care about recovery, just remove the meta directory
>> >> > ("rm -rf /var/mesos/meta") so that the slave starts as a fresh one
>> >> > (since you are not using cgroups, you will have to manually kill
>> >> > any old executors/tasks that are still alive on the slave).
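>> >> >
>> >> > Concretely, something like this - the FQDN below is only a
>> >> > placeholder for whatever hostname the slave reported before the
>> >> > upgrade:
>> >> >
>> >> > # Option 1: pin the hostname to the pre-upgrade value, keeping recovery.
>> >> > /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos \
>> >> >   --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos \
>> >> >   --hostname=slave1.example.com
>> >> >
>> >> > # Option 2: discard the checkpoint and start fresh (no recovery).
>> >> > rm -rf /var/mesos/meta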
>> >> >
>> >> > Not sure about your comment on CFS. Enabling CFS shouldn't change
>> >> > how much memory the slave sees as available. More details/logs
>> >> > would help diagnose the issue.
>> >> >
>> >> > HTH,
>> >> >
>> >> >
>> >> >
>> >> > On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies <d...@hellooperator.net>
>> >> > wrote:
>> >> >>
>> >> >> Should have said, the CLI for this is:
>> >> >>
>> >> >> /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos
>> >> >> --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos
>> >> >>
>> >> >> (Note the IP is specified but the hostname is not - the docs
>> >> >> indicated the hostname arg would default to the FQDN of the host,
>> >> >> but it appears to be using the value passed as --ip instead.)
>> >> >>
>> >> >> On 18 June 2014 12:00, Dick Davies <d...@hellooperator.net> wrote:
>> >> >> > Hi, we recently bumped 0.17.0 -> 0.18.2, and the slaves now
>> >> >> > show their IPs rather than their FQDNs in the Mesos UI.
>> >> >> >
>> >> >> > This broke slave recovery with the error:
>> >> >> >
>> >> >> > "Failed to perform recovery: Incompatible slave info detected"
>> >> >> >
>> >> >> >
>> >> >> > CPU, mem, disk and ports are all the same, and so is the 'id' field.
>> >> >> >
>> >> >> > The only things that have changed are the 'hostname' and
>> >> >> > 'webui_hostname' arguments (the CLI we're passing in is exactly
>> >> >> > the same as it was on 0.17.0, so presumably this is down to a
>> >> >> > change in Mesos conventions).
>> >> >> >
>> >> >> > I've had similar issues enabling CFS in test environments
>> >> >> > (slaves show less free memory and refuse to recover).
>> >> >> >
>> >> >> > Is the 'id' field not enough to uniquely identify a slave?
>> >> >
>> >> >
>> >
>> >
>
>
