----- Original Message ----- > From: "Yuriy Demchenko" <[email protected]> > To: "Eli Mesika" <[email protected]> > Cc: [email protected] > Sent: Monday, May 19, 2014 2:13:02 PM > Subject: Re: [ovirt-users] power outage: HA vms not restarted > > On 05/19/2014 02:36 PM, Eli Mesika wrote: > > > > > > Hi, > > > > i'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC san storage, 46 HA > > vms in 2 datacenters (3 hosts uses different storage with no > > connectivity to first storage, that's why second DC) > > Recently (2014-05-17) i had a double power outage: first blackout at > > 00:16, went back at ~00:19, second blackout at 00:26, went back at 10:06 > > When finally all went up (after approx. 10:16) - only 2 vms were > > restarted from 46. > > From browsing engine log i saw failed restart attemts of almost all vms > > after first blackout with error 'Failed with error ENGINE and code > > 5001', but after second blackout i saw no attempts to restart vms, and > > only error was 'connect timeout' (probably to srv5 - that host > > physically died after blackouts). > > And i cant figure why HA vms were not restarted? Please advice > > > > engine and (supposedly) spm host logs in attach. > >>> Hi Yuriy > >>> > >>> What I see is that the log for 2014-05-17 is started at 2014-05-17 > >>> 00:23:03 > >>> so I can not track the first interval you had mentioned (00:19 to 00:26) > >> 00:23 is the time when engine booted up after first outage, that's why > >> logs started at 00:23:03 > >>> However, I can clearly see that at 2014-05-17 00:23:03 the engine was > >>> restarted and at 2014-05-17 00:23:09,423 we had started to get connection > >>> errors. > >>> We had tried to solve the problem by doing Soft-Fencing (actually vdsmd > >>> service restart) on the problematic hosts, but ssh to the host failed so > >>> we had tried > >>> to hard-fence the host (restart/reboot), but this was due the > >>> configurable > >>> "quite time" in which we are preventing fencing operation after an engine > >>> restart which > >>> is set by default to 5 min (DisableFenceAtStartupInSec key in > >>> engine-config) and therefor we had skipped the fencing operation... > >> my hosts usually boots slower than engine, long bios checks + random > >> power on delay (120s), thats why at first engine reports connect errors > >> however, in logs i see, that some of them were allready up and engine > >> successfully contacted them: > > So maybe you should put DisableFenceAtStartupInSec to a higher value > > instead of the default 5 min... > > > >> 2014-05-17 00:23:10,450 INFO > >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > >> (DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack: > >> null, Custom Event ID: -1, Message: State was set to Up for host srv11. > >> 2014-05-17 00:23:10,456 INFO > >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > >> (DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack: > >> null, Custom Event ID: -1, Message: State was set to Up for host srv4. > >> 2014-05-17 00:23:10,458 INFO > >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > >> (DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack: > >> null, Custom Event ID: -1, Message: State was set to Up for host srv7. > >> 2014-05-17 00:23:10,460 INFO > >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > >> (DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack: > >> null, Custom Event ID: -1, Message: State was set to Up for host srv9. > >> > >> and after 00:23:11 i saw no fencing-related messages, only vm restart > >> attemts that failed with strange errors like: > >> 'Failed with error ENGINE and code 5001' > >> 'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) was filtered > >> out by VAR__FILTERTYPE__INTERNAL filter Memory' > >> 'CanDoAction of action RunVm failed. > >> Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName > >> srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON' > > Strange that you didn't got the message itself (this is only the message > > key) > > The original message is : > > Cannot Run VM. There are no available running Hosts with sufficient memory > > in VM's Cluster . > > So, it failed on the RunVM command validation where no host with enough > > memory to run the VM was found > yea, but how that can be? all my hosts have 256gb ram and only 2 vms are > ram-hungry (128gb and 240gb), other uses much less (2-64gb) > > > >>> For the first period as I said I only can guess that one of your hosts > >>> fencing attempts was after those 5 minutes and therefor it was rebooted > >>> and the HA VMs were freed to run on other host. > >>> For the second period on which I have logs, the host fencing failed due > >>> to > >>> the required "quite time" and in this situation the only hing you can do > >>> in order to have the HA VMs running again is to > >>> right-click on each host and press "Confirm that host has been rebooted" > >> but i see in logs - after second period, at 10:14+ all hosts but one > >> (srv5) were up and power management verified successfully, isn't that > >> should be enough for engine to verify all ha vms down and restart them? > > How should engine know that the host was rebooted?? > > The fact that power management verified successfully is not enough in order > > to run the VMs on another host. > > As I see the fence commands that intend to reboot the host holding the VMs > > were failed ... > but in like 30secs after engine set holding hosts to non-responsive they > all went up - 'state was set to up for host' > so, engine can contact that hosts and query vm status and restart if > needed, shouldn't it? > > 2014-05-17 10:09:23,978 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id = > 7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0, > spm_status = None, non-responsive_timeout (seconds) = 60, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:09:40,143 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: State was set to Up for host srv9. > > 2014-05-17 10:09:23,980 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id = > 57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0, > spm_status = None, non-responsive_timeout (seconds) = 60, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:09:43,214 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: State was set to Up for host srv10. > > 2014-05-17 10:09:24,029 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id = > bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0, > spm_status = None, non-responsive_timeout (seconds) = 60, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:09:43,239 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: State was set to Up for host srv11. > > 2014-05-17 10:09:23,991 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id = > eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5, > spm_status = Contending, non-responsive_timeout (seconds) = 82, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:09:43,320 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: State was set to Up for host srv12. > > 2014-05-17 10:09:23,987 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id = > 712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4, > spm_status = None, non-responsive_timeout (seconds) = 62, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:09:44,934 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null, Call > Stack: null, Custom Event ID: -1, Message: State was set to Up for host > srv13. > > 2014-05-17 10:09:24,016 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id = > 9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0, > spm_status = None, non-responsive_timeout (seconds) = 60, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:09:43,930 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: State was set to Up for host srv14. > > 2014-05-17 10:09:24,019 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id = > 3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0, > spm_status = None, non-responsive_timeout (seconds) = 60, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:09:40,285 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: State was set to Up for host srv15. > > 2014-05-17 10:09:23,996 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id = > c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0, > spm_status = None, non-responsive_timeout (seconds) = 60, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:09:40,188 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: State was set to Up for host srv16 > > 2014-05-17 10:09:23,963 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id = > 8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0, > spm_status = None, non-responsive_timeout (seconds) = 60, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:09:46,358 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: State was set to Up for host srv17. > > 2014-05-17 10:09:23,949 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id = > 2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0, > spm_status = None, non-responsive_timeout (seconds) = 60, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:10:25,643 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: State was set to Up for host srv18. > > 2014-05-17 10:09:23,991 INFO > [org.ovirt.engine.core.vdsbroker.VdsManager] > (DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id = > e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3, > spm_status = None, non-responsive_timeout (seconds) = 61, error = > java.net.ConnectException: Connection refused > 2014-05-17 10:09:40,260 INFO > [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > (DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack: > null, Custom Event ID: -1, Message: State was set to Up for host srv19. > > >> 2014-05-17 10:11:56,946 INFO > >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] > >> (DefaultQuartzScheduler_Worker-75) [dbc315d] Correla > >> tion ID: 73469dd6, Call Stack: null, Custom Event ID: -1, Message: Host > >> srv17 power management was verified successfully. > >> > >> in fact, at ~17:55 my colleague restarted engine, forced spm selection > >> and started all vms - all started without errors, and he didn't had to > >> click 'confirm host has been rebooted' > > But here you had restarted the VMs on the original Host in which they were > > running after the host is UP, this will work > > The problem is to run a HA VM on another host, to do so, we must insure > > that the Host that run the VMs until now is not running then anymore > > If the fence operation is failed, we can say nothing on the Host status and > > then as I said you should manually approve that the host was rebooted in > > order > > to prevent VMs to run on multiple hosts which will lead to data corruption > but shouldn't engine restart corresponded vms after holding host came > up? (without manual fence) > because they up - so engine can query them about running/not running vms > and get actual state of vms - running or not > the only host were down at that point is srv5, which holded only 1 vm - > and it were correctly put in 'unknown' state, other vms were just 'down' > until we manually started them
Are you sure that those VMs are defined as Highly Available VMs ??? > _______________________________________________ Users mailing list [email protected] http://lists.ovirt.org/mailman/listinfo/users

