Filed https://issues.apache.org/jira/browse/MESOS-1506 for fixing flags/documentation.
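
For anyone wanting to check for the same mismatch before upgrading, here is a rough Python sketch that compares the system hostname with the name obtained by reverse-resolving the address passed as --ip. It is only an illustration, not how the slave computes SlaveInfo.hostname internally: hostname_candidates and its parameters are made-up names, and falling back to the literal IP string when the reverse lookup fails is an assumption based on the slaves showing IPs in the UI.

```python
import socket


def hostname_candidates(ip,
                        reverse_lookup=socket.gethostbyaddr,
                        system_hostname=socket.gethostname):
    """Return (system_name, name_from_ip) for a slave's --ip value.

    Both lookups are injectable so the comparison can be exercised
    without touching DNS. Hypothetical helper, not a Mesos API.
    """
    system_name = system_hostname()
    try:
        # gethostbyaddr returns (primary_name, aliases, addresses)
        name_from_ip = reverse_lookup(ip)[0]
    except OSError:
        # Assumption: with no reverse DNS entry, the IP itself is used.
        name_from_ip = ip
    return system_name, name_from_ip


def hostnames_match(ip, **lookups):
    """True if the system hostname equals the name derived from ip."""
    system_name, name_from_ip = hostname_candidates(ip, **lookups)
    return system_name == name_from_ip
```

Calling hostnames_match("10.10.10.101") on a slave and getting False would suggest passing an explicit --hostname that matches the previous value if recovery matters.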

On Wed, Jun 18, 2014 at 11:33 AM, Dick Davies <[email protected]> wrote:

> Thanks, it might be worth correcting the docs in that case then.
> This URL says it'll use the system hostname, not the reverse DNS of
> the ip argument:
>
> http://mesos.apache.org/documentation/latest/configuration/
>
> re: the CFS thing - this was while running Docker on the slaves - that
> also uses cgroups so maybe resources were getting split with mesos or
> something? (I'm still reading up on cgroups) - definitely wasn't the
> case until cfs was enabled.
>
>
> On 18 June 2014 18:34, Vinod Kone <[email protected]> wrote:
> > Hey Dick,
> >
> > Regarding slave recovery, any changes in the SlaveInfo (see mesos.proto) are
> > considered as a new slave and hence recovery doesn't proceed forward. This
> > is because Master caches SlaveInfo and it is quite complex to reconcile the
> > differences in SlaveInfo. So we decided to fail on any SlaveInfo changes for
> > now.
> >
> > In your particular case, https://issues.apache.org/jira/browse/MESOS-672 was
> > committed in 0.18.0 which fixed redirection of WebUI. Included in this fix
> > is https://reviews.apache.org/r/17573/ which changed how SlaveInfo.hostname
> > is calculated. Since you are not providing a hostname via "--hostname" flag,
> > slave now deduces the hostname from "--ip" flag. Looks like in your cluster
> > the hostname corresponding to that ip is different than what
> > 'os::hostname()' gives.
> >
> > Couple of options to move forward. If you want slave recovery, provide
> > "--hostname" that matches the previous hostname. If you don't care above
> > recovery, just remove the meta directory ("rm -rf /var/mesos/meta") so that
> > the slave starts as a fresh one (since you are not using cgroups, you will
> > have to manually kill any old executors/tasks that are still alive on the
> > slave).
> >
> > Not sure about your comment on CFS. Enabling CFS shouldn't change how much
> > memory the slave sees as available. More details/logs would help diagnose
> > the issue.
> >
> > HTH,
> >
> >
> > On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies <[email protected]> wrote:
> >>
> >> Should have said, the CLI for this is :
> >>
> >> /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos
> >> --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos
> >>
> >> (note IP is specified, hostname is not - docs indicated hostname arg
> >> will default to the fqdn of host, but it appears to be using the value
> >> passed as 'ip' instead.)
> >>
> >> On 18 June 2014 12:00, Dick Davies <[email protected]> wrote:
> >> > Hi, we recently bumped 0.17.0 -> 0.18.2 and the slaves
> >> > now show their IPs rather than their FQDNs on the mesos UI.
> >> >
> >> > This broke slave recovery with the error:
> >> >
> >> > "Failed to perform recovery: Incompatible slave info detected"
> >> >
> >> >
> >> > cpu, mem, disk, ports are all the same. so is the 'id' field.
> >> >
> >> > the only thing that's changed is are the 'hostname' and webui_hostname
> >> > arguments
> >> > (the CLI we're passing in is exactly the same as it was on 0.17.0, so
> >> > presumably this is down to a change in mesos conventions).
> >> >
> >> > I've had similar issues enabling CFS in test environments (slaves show
> >> > less free memory and refuse to recover).
> >> >
> >> > is the 'id' field not enough to uniquely identify a slave?

