Fab, thanks Vinod. Turns out that feature (a different FQDN to serve the UI on) might well be really useful for us, so every cloud has a silver lining :)

Back to the metadata feature though: do you know why just the 'id' of the
slaves isn't used? As it stands, adding disk storage, cores or RAM to a
slave will cause it to drop out of the cluster. Does checking the whole
metadata provide any benefit vs. checking the id?

On 18 June 2014 19:46, Vinod Kone <vinodk...@gmail.com> wrote:
> Filed https://issues.apache.org/jira/browse/MESOS-1506 for fixing
> flags/documentation.
>
> On Wed, Jun 18, 2014 at 11:33 AM, Dick Davies <d...@hellooperator.net>
> wrote:
>>
>> Thanks, it might be worth correcting the docs in that case then.
>> This URL says it'll use the system hostname, not the reverse DNS of
>> the ip argument:
>>
>> http://mesos.apache.org/documentation/latest/configuration/
>>
>> re: the CFS thing - this was while running Docker on the slaves - that
>> also uses cgroups, so maybe resources were getting split with mesos or
>> something? (I'm still reading up on cgroups) - definitely wasn't the
>> case until cfs was enabled.
>>
>> On 18 June 2014 18:34, Vinod Kone <vinodk...@gmail.com> wrote:
>> > Hey Dick,
>> >
>> > Regarding slave recovery, any changes in the SlaveInfo (see
>> > mesos.proto) are considered as a new slave and hence recovery doesn't
>> > proceed forward. This is because Master caches SlaveInfo and it is
>> > quite complex to reconcile the differences in SlaveInfo. So we decided
>> > to fail on any SlaveInfo changes for now.
>> >
>> > In your particular case, https://issues.apache.org/jira/browse/MESOS-672
>> > was committed in 0.18.0 which fixed redirection of WebUI. Included in
>> > this fix is https://reviews.apache.org/r/17573/ which changed how
>> > SlaveInfo.hostname is calculated. Since you are not providing a
>> > hostname via "--hostname" flag, slave now deduces the hostname from
>> > "--ip" flag. Looks like in your cluster the hostname corresponding to
>> > that ip is different than what 'os::hostname()' gives.
>> >
>> > Couple of options to move forward. If you want slave recovery,
>> > provide "--hostname" that matches the previous hostname. If you don't
>> > care about recovery, just remove the meta directory
>> > ("rm -rf /var/mesos/meta") so that the slave starts as a fresh one
>> > (since you are not using cgroups, you will have to manually kill any
>> > old executors/tasks that are still alive on the slave).
>> >
>> > Not sure about your comment on CFS. Enabling CFS shouldn't change how
>> > much memory the slave sees as available. More details/logs would help
>> > diagnose the issue.
>> >
>> > HTH,
>> >
>> > On Wed, Jun 18, 2014 at 4:26 AM, Dick Davies <d...@hellooperator.net>
>> > wrote:
>> >>
>> >> Should have said, the CLI for this is:
>> >>
>> >> /usr/local/sbin/mesos-slave --master=zk://10.10.10.105:2181/mesos
>> >> --log_dir=/var/log/mesos --ip=10.10.10.101 --work_dir=/var/mesos
>> >>
>> >> (note IP is specified, hostname is not - docs indicated hostname arg
>> >> will default to the fqdn of host, but it appears to be using the value
>> >> passed as 'ip' instead.)
>> >>
>> >> On 18 June 2014 12:00, Dick Davies <d...@hellooperator.net> wrote:
>> >> > Hi, we recently bumped 0.17.0 -> 0.18.2 and the slaves
>> >> > now show their IPs rather than their FQDNs on the mesos UI.
>> >> >
>> >> > This broke slave recovery with the error:
>> >> >
>> >> > "Failed to perform recovery: Incompatible slave info detected"
>> >> >
>> >> > cpu, mem, disk, ports are all the same. so is the 'id' field.
>> >> >
>> >> > the only things that have changed are the 'hostname' and
>> >> > webui_hostname arguments (the CLI we're passing in is exactly the
>> >> > same as it was on 0.17.0, so presumably this is down to a change in
>> >> > mesos conventions).
>> >> >
>> >> > I've had similar issues enabling CFS in test environments (slaves
>> >> > show less free memory and refuse to recover).
>> >> >
>> >> > is the 'id' field not enough to uniquely identify a slave?
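
For anyone hitting the same error, here is a rough Python sketch of the behaviour described in this thread. It is not Mesos code. The precedence in deduce_hostname (--hostname first, then a reverse lookup of --ip, then the system hostname) is my reading of Vinod's explanation, and the dict fields, the function names and the example hostname are hypothetical stand-ins for SlaveInfo.

```python
import socket


def deduce_hostname(hostname_flag=None, ip_flag=None,
                    reverse_lookup=None, system_hostname=None):
    """Pick a hostname the way the thread describes for 0.18.x.

    Assumption: --hostname wins, otherwise the name is reverse-resolved
    from --ip, otherwise the system hostname is used. This is a reading
    of the thread, not the Mesos source.
    """
    # Default resolvers come from the standard library; both can be
    # injected so the logic can be exercised without touching DNS.
    reverse_lookup = reverse_lookup or (lambda ip: socket.gethostbyaddr(ip)[0])
    system_hostname = system_hostname or socket.gethostname
    if hostname_flag:
        return hostname_flag
    if ip_flag:
        return reverse_lookup(ip_flag)
    return system_hostname()


def changed_fields(checkpointed, current):
    """Return the sorted names of fields that differ between two
    slave-info dicts (hypothetical stand-ins for SlaveInfo)."""
    keys = set(checkpointed) | set(current)
    return sorted(k for k in keys if checkpointed.get(k) != current.get(k))


def can_recover(checkpointed, current):
    """Strict check as described in the thread: any difference fails."""
    return not changed_fields(checkpointed, current)


if __name__ == "__main__":
    # Hypothetical checkpointed info from before the upgrade.
    old = {"id": "S1", "hostname": "slave1.example.com", "cpus": 4}
    # After the upgrade the hostname is derived from --ip; the fake
    # resolver simply echoes the IP back to keep the example offline.
    new = dict(old, hostname=deduce_hostname(ip_flag="10.10.10.101",
                                             reverse_lookup=lambda ip: ip))
    print(changed_fields(old, new), can_recover(old, new))
```

Passing the old name explicitly, as Vinod suggests with "--hostname", keeps the two records identical in this sketch, so the strict comparison passes even though the id alone would already have matched.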