Hi Cody,

On Wed, Oct 03, 2018 at 11:46:52AM -0400, Cody wrote:
> Hi everyone,
>
> My cluster is deployed with both Controller and Instance HA. The deployment
> completed without errors, but I noticed something strange in the 'pcs
> status' output on the controllers:
>
>  Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
>      Started: [ overcloud-novacompute-0 ]
>      Stopped: [ overcloud-controller-0 overcloud-controller-1
>                overcloud-controller-2 overcloud-novacompute-1 ]
>  nova-evacuate (ocf::openstack:NovaEvacuate): Started overcloud-controller-0
>  stonith-fence_ipmilan-002590a2d2c7 (stonith:fence_ipmilan): Started overcloud-controller-1
>  stonith-fence_ipmilan-002590a1c641 (stonith:fence_ipmilan): Started overcloud-controller-2
>  stonith-fence_ipmilan-002590f25822 (stonith:fence_ipmilan): Started overcloud-controller-0
>  stonith-fence_ipmilan-002590f3977a (stonith:fence_ipmilan): Started overcloud-controller-2
>  stonith-fence_ipmilan-002590f2631a (stonith:fence_ipmilan): Started overcloud-controller-1
>
> Notice the stonith-fence_ipmilan lines show incorrect hosts for the last
> two devices. The MAC addresses belong to overcloud-novacompute-0 and
> overcloud-novacompute-1, but the devices got started on the controller
> nodes. Is this right?

That is correct. Stonith resources run only on full (corosync) cluster
nodes, i.e. the controllers, never on pacemaker remote nodes such as the
computes.

> There are also some failed actions in the status output:
>
> Failed Actions:
> * overcloud-novacompute-1_start_0 on overcloud-controller-2 'unknown error'
>   (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3
>   03:48:55 2018', queued=0ms, exec=0ms
> * overcloud-novacompute-1_start_0 on overcloud-controller-0 'unknown error'
>   (1): call=23, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3
>   14:50:25 2018', queued=0ms, exec=0ms
> * overcloud-novacompute-1_start_0 on overcloud-controller-1 'unknown error'
>   (1): call=3, status=Timed Out, exitreason='', last-rc-change='Wed Oct 3
>   03:47:51 2018', queued=0ms, exec=0ms

Are these from the fresh deployment, or did they appear after you
triggered a crash on a compute node?

> I can spin up VMs, but cannot do failover. If I manually trigger a crash
> on one of the compute nodes, the affected VMs remain in ERROR state and
> the affected compute node is unable to rejoin the cluster afterwards.
>
> After a manual reboot of the affected compute node, it cannot start the
> pcs cluster service. Its 'nova_compute' container also remains unhealthy
> after the reboot, with the latest 'docker logs' message being:
>
> ++ cat /run_command
> + CMD='/var/lib/nova/instanceha/check-run-nova-compute '
> + ARGS=
> + [[ ! -n '' ]]
> + . kolla_extend_start
> ++ [[ ! -d /var/log/kolla/nova ]]
> +++ stat -c %a /var/log/kolla/nova
> ++ [[ 2755 != \7\5\5 ]]
> ++ chmod 755 /var/log/kolla/nova
> ++ . /usr/local/bin/kolla_nova_extend_start
> +++ [[ ! -d /var/lib/nova/instances ]]
> + echo 'Running command: '\''/var/lib/nova/instanceha/check-run-nova-compute '\'''
> Running command: '/var/lib/nova/instanceha/check-run-nova-compute '
> + exec /var/lib/nova/instanceha/check-run-nova-compute
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> Waiting for fence-down flag to be cleared
> ...
>
> So I guess something may be wrong with fencing, but I have no idea what
> caused it or how to fix it. Any help/suggestions/opinions would be
> greatly appreciated. Thank you very much.

So when a compute node crashes, one of the things that happens is that
its nova-compute service gets forcefully marked as down. You should be
able to unblock it manually: use 'nova service-list' to find the uuid of
the forced-down service and then run 'nova service-force-down --unset
<uuid>'.
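In case it is useful, here is that unblock sequence spelled out. This is
only a minimal sketch: it assumes you run it from a node that can reach
the overcloud API with admin credentials sourced (e.g. the undercloud
with overcloudrc), and <uuid> is a placeholder for the Id of the
affected nova-compute service:

"""
# Load the overcloud admin credentials, e.g.:
source ~/overcloudrc

# Find the nova-compute service that is marked as (forced) down
nova service-list

# Clear the forced-down flag so the compute can report in again;
# <uuid> is the Id of the affected nova-compute service from above
nova service-force-down --unset <uuid>
"""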
I am not sure I understand the exact picture of the problem. Is it:

A) After I crash a compute node, the VMs do not get resurrected on
   another compute node?
B) The compute node I just crashed hangs at boot, with the nova
   container waiting for the fence-down flag to be cleared?

Is only B) the issue, or also A)?

For B), can you try the following?

"""
pcs resource update overcloud-novacompute-0 meta reconnect_interval=180s
pcs resource update overcloud-novacompute-1 meta reconnect_interval=180s
pcs resource cleanup --all
"""

and then retry the process and report back?
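For context: reconnect_interval controls how long the cluster waits
before retrying the connection to a remote node after it has been lost,
rather than retrying immediately, which should help with the start
timeouts you saw in the Failed Actions. If you want to check that the
value took effect, something like this should show it (pcs 0.9 syntax;
on newer pcs releases the subcommand is 'pcs resource config'):

"""
pcs resource show overcloud-novacompute-0
"""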
cheers,
Michele
--
Michele Baldessari <[email protected]>
C2A5 9DA3 9961 4FFB E01B D0BC DDD4 DCCB 7515 5C6D