Re: [ovirt-users] power outage: HA vms not restarted

Yuriy Demchenko Mon, 19 May 2014 05:57:19 -0700

saw another strange moment in logs:

for example, vm 'prod.gui', according to logs engine tried to restart iton host srv19 at 00:23:11 and it failed with error 'Failed to run VMprod.gui: VdcBLLException: java.lang.NullPointerException (Failed witherror ENGINE and code 5001)'


in vdsm.log on srv19 i can see this error:

StoragePoolUnknown: Unknown pool id, pool not connected:('5849b030-626e-47cb-ad90-3ce782d831b3',)Thread-18::DEBUG::2014-05-1700:23:13,022::task::869::TaskManager.Task::(_run)Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Task._run:27a323de-ac42-4d68-992f-eb5c84d8baea('5849b030-626e-47cb-ad90-3ce782d831b3',) {} failed - stopping taskThread-18::DEBUG::2014-05-1700:23:13,022::task::1194::TaskManager.Task::(stop)Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::stopping in state preparing(force False)Thread-18::DEBUG::2014-05-1700:23:13,023::task::974::TaskManager.Task::(_decref)Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 1 aborting TrueThread-18::INFO::2014-05-1700:23:13,023::task::1151::TaskManager.Task::(prepare)Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::aborting: Task is aborted:'Unknown pool id, pool not connected' - code 309Thread-18::DEBUG::2014-05-1700:23:13,024::task::1156::TaskManager.Task::(prepare)Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::Prepare: aborted: Unknownpool id, pool not connectedThread-18::DEBUG::2014-05-1700:23:13,024::task::974::TaskManager.Task::(_decref)Task=`27a323de-ac42-4d68-992f-eb5c84d8baea`::ref 0 aborting True

according to engine database - storage pool'5849b030-626e-47cb-ad90-3ce782d831b3' is datacenter 'CODD'.

but - according to engine logs SPM start for datacenter CODD was runonly at 00:23:27, i.e. _after_ vm restart attemts2014-05-17 00:23:27,585 INFO[org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand](DefaultQuartzScheduler_Worker-43) [1bb8f3c0] starting spm on vds srv12,storage pool CODD, prevId 9, LVER 112014-05-17 00:23:27,610 INFO[org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand](DefaultQuartzScheduler_Worker-43) [1bb8f3c0] START,SpmStartVDSCommand(HostName = srv12, HostId =eac8d084-1be0-4e8d-a589-385f71fe3afa, storagePoolId =5849b030-626e-47cb-ad90-3ce782d831b3, prevId=9, prevLVER=11,storagePoolFormatType=V3, recoveryMode=Manual, SCSIFencing=false), logid: 1844d391

i thought no VM can be started until SPM host selected, why in my casevm restart occurred before that?and what 'pool not connected' means? no spm running, storage unavailablefrom some host or something else?



Yuriy Demchenko

On 05/19/2014 03:13 PM, Yuriy Demchenko wrote:

On 05/19/2014 02:36 PM, Eli Mesika wrote:
Hi,

i'm running ovirt-3.2.2-el6 on 18 el6 hosts with FC san storage, 46 HA
vms in 2 datacenters (3 hosts uses different storage with no
connectivity to first storage, that's why second DC)
Recently (2014-05-17) i had a double power outage: first blackout at
00:16, went back at ~00:19, second blackout at 00:26, went back at 10:06
When finally all went up (after approx. 10:16) - only 2 vms were
restarted from 46.
From browsing engine log i saw failed restart attemts of almostall vms
after first blackout with error 'Failed with error ENGINE and code
5001', but after second blackout i saw no attempts to restart vms, and
only error was 'connect timeout' (probably to srv5 - that host
physically died after blackouts).
And i cant figure why HA vms were not restarted? Please advice

engine and (supposedly) spm host logs in attach.
Hi Yuriy
What I see is that the log for 2014-05-17 is started at 2014-05-1700:23:03so I can not track the first interval you had mentioned (00:19 to00:26)
00:23 is the time when engine booted up after first outage, that's why
logs started at 00:23:03
However, I can clearly see that at 2014-05-17 00:23:03 the engine was
restarted and at 2014-05-17 00:23:09,423 we had started to getconnection
errors.
We had tried to solve the problem by doing Soft-Fencing (actuallyvdsmdservice restart) on the problematic hosts, but ssh to the hostfailed so
we had tried
to hard-fence the host (restart/reboot), but this was due theconfigurable"quite time" in which we are preventing fencing operation after anengine
restart which
is set by default to 5 min (DisableFenceAtStartupInSec key in
engine-config) and therefor we had skipped the fencing operation...
my hosts usually boots slower than engine, long bios checks + random
power on delay (120s), thats why at first engine reports connect errors
however, in logs i see, that some of them were allready up and engine
successfully contacted them:
So maybe you should put DisableFenceAtStartupInSec to a higher valueinstead of the default 5 min...
2014-05-17 00:23:10,450 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler_Worker-18) Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: State was set to Up for host srv11.
2014-05-17 00:23:10,456 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler_Worker-4) Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: State was set to Up for host srv4.
2014-05-17 00:23:10,458 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler_Worker-11) Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: State was set to Up for host srv7.
2014-05-17 00:23:10,460 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler_Worker-20) Correlation ID: null, Call Stack:
null, Custom Event ID: -1, Message: State was set to Up for host srv9.

and after 00:23:11 i saw no fencing-related messages, only vm restart
attemts that failed with strange errors like:
'Failed with error ENGINE and code 5001'
'Candidate host srv1 (2a89e565-aa4e-4a19-82e3-e72e4edee111) wasfiltered
out by VAR__FILTERTYPE__INTERNAL filter Memory'
'CanDoAction of action RunVm failed.
Reasons:VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,VAR__ACTION__RUN,VAR__TYPE__VM,ACTION_TYPE_FAILED_VDS_VM_MEMORY,SCHEDULING_ALL_HOSTS_FILTERED_OUT,VAR__FILTERTYPE__INTERNAL,$hostName
srv1,$filterName Memory,SCHEDULING_HOST_FILTERED_REASON'
Strange that you didn't got the message itself (this is only themessage key)
The original message is :
Cannot Run VM. There are no available running Hosts with sufficientmemory in VM's Cluster .So, it failed on the RunVM command validation where no host withenough memory to run the VM was found
yea, but how that can be? all my hosts have 256gb ram and only 2 vmsare ram-hungry (128gb and 240gb), other uses much less (2-64gb)
For the first period as I said I only can guess that one of your hosts
fencing attempts was after those 5 minutes and therefor it wasrebooted
and the HA VMs were freed to run on other host.
For the second period on which I have logs, the host fencing faileddue tothe required "quite time" and in this situation the only hing youcan do
in order to have the HA VMs running again is to
right-click on each host and press "Confirm that host has beenrebooted"
but i see in logs - after second period, at 10:14+ all hosts but one
(srv5) were up and power management verified successfully, isn't that
should be enough for engine to verify all ha vms down and restart them?
How should engine know that the host was rebooted??
The fact that power management verified successfully is not enough inorder to run the VMs on another host.As I see the fence commands that intend to reboot the host holdingthe VMs were failed ...
but in like 30secs after engine set holding hosts to non-responsivethey all went up - 'state was set to up for host'so, engine can contact that hosts and query vm status and restart ifneeded, shouldn't it?
2014-05-17 10:09:23,978 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-48) Server failed to respond, vds_id =7bf30c8b-7eb2-4b1e-ba0e-13093bca57bf, vds_name = srv9, vm_count = 0,spm_status = None, non-responsive_timeout (seconds) = 60, error =java.net.ConnectException: Connection refused2014-05-17 10:09:40,143 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-42) Correlation ID: null, Call Stack:null, Custom Event ID: -1, Message: State was set to Up for host srv9.
2014-05-17 10:09:23,980 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-51) Server failed to respond, vds_id =57f58a56-a7ad-4917-b32d-f5dfaf326419, vds_name = srv10, vm_count = 0,spm_status = None, non-responsive_timeout (seconds) = 60, error =java.net.ConnectException: Connection refused2014-05-17 10:09:43,214 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-64) Correlation ID: null, Call Stack:null, Custom Event ID: -1, Message: State was set to Up for host srv10.
2014-05-17 10:09:24,029 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-55) Server failed to respond, vds_id =bd07bc6d-37e8-4b55-810c-f41c78f7a8ad, vds_name = srv11, vm_count = 0,spm_status = None, non-responsive_timeout (seconds) = 60, error =java.net.ConnectException: Connection refused2014-05-17 10:09:43,239 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-63) Correlation ID: null, Call Stack:null, Custom Event ID: -1, Message: State was set to Up for host srv11.
2014-05-17 10:09:23,991 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-50) Server failed to respond, vds_id =eac8d084-1be0-4e8d-a589-385f71fe3afa, vds_name = srv12, vm_count = 5,spm_status = Contending, non-responsive_timeout (seconds) = 82, error= java.net.ConnectException: Connection refused2014-05-17 10:09:43,320 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-60) Correlation ID: null, Call Stack:null, Custom Event ID: -1, Message: State was set to Up for host srv12.
2014-05-17 10:09:23,987 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-49) Server failed to respond, vds_id =712e822b-d3c6-4cb7-a6f1-58da5039bda1, vds_name = srv13, vm_count = 4,spm_status = None, non-responsive_timeout (seconds) = 62, error =java.net.ConnectException: Connection refused2014-05-17 10:09:44,934 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-58) [77463af9] Correlation ID: null,Call Stack: null, Custom Event ID: -1, Message: State was set to Upfor host srv13.
2014-05-17 10:09:24,016 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-53) Server failed to respond, vds_id =9d99c8eb-754e-4446-8a0f-af7327d19989, vds_name = srv14, vm_count = 0,spm_status = None, non-responsive_timeout (seconds) = 60, error =java.net.ConnectException: Connection refused2014-05-17 10:09:43,930 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-61) Correlation ID: null, Call Stack:null, Custom Event ID: -1, Message: State was set to Up for host srv14.
2014-05-17 10:09:24,019 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-54) Server failed to respond, vds_id =3ca82f5b-1343-451f-81d0-d2ed111b5f51, vds_name = srv15, vm_count = 0,spm_status = None, non-responsive_timeout (seconds) = 60, error =java.net.ConnectException: Connection refused2014-05-17 10:09:40,285 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-50) Correlation ID: null, Call Stack:null, Custom Event ID: -1, Message: State was set to Up for host srv15.
2014-05-17 10:09:23,996 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-45) Server failed to respond, vds_id =c9d9735b-d75f-4026-afcd-94491ee66a4f, vds_name = srv16, vm_count = 0,spm_status = None, non-responsive_timeout (seconds) = 60, error =java.net.ConnectException: Connection refused2014-05-17 10:09:40,188 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-47) Correlation ID: null, Call Stack:null, Custom Event ID: -1, Message: State was set to Up for host srv16
2014-05-17 10:09:23,963 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-47) Server failed to respond, vds_id =8daa3751-f347-41eb-b6a6-a27f4fda2def, vds_name = srv17, vm_count = 0,spm_status = None, non-responsive_timeout (seconds) = 60, error =java.net.ConnectException: Connection refused2014-05-17 10:09:46,358 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-75) Correlation ID: null, Call Stack:null, Custom Event ID: -1, Message: State was set to Up for host srv17.
2014-05-17 10:09:23,949 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-46) Server failed to respond, vds_id =2e833c27-fd9f-4750-ab84-2fb629b66671, vds_name = srv18, vm_count = 0,spm_status = None, non-responsive_timeout (seconds) = 60, error =java.net.ConnectException: Connection refused2014-05-17 10:10:25,643 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-7) Correlation ID: null, Call Stack:null, Custom Event ID: -1, Message: State was set to Up for host srv18.
2014-05-17 10:09:23,991 INFO[org.ovirt.engine.core.vdsbroker.VdsManager](DefaultQuartzScheduler_Worker-52) Server failed to respond, vds_id =e7cedec5-9335-45bf-ae58-0744a9b93f41, vds_name = srv19, vm_count = 3,spm_status = None, non-responsive_timeout (seconds) = 61, error =java.net.ConnectException: Connection refused2014-05-17 10:09:40,260 INFO[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector](DefaultQuartzScheduler_Worker-54) Correlation ID: null, Call Stack:null, Custom Event ID: -1, Message: State was set to Up for host srv19.
2014-05-17 10:11:56,946 INFO
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector]
(DefaultQuartzScheduler_Worker-75) [dbc315d] Correla
tion ID: 73469dd6, Call Stack: null, Custom Event ID: -1, Message: Host
srv17 power management was verified successfully.

in fact, at ~17:55 my colleague restarted engine, forced spm selection
and started all vms - all started without errors, and he didn't had to
click 'confirm host has been rebooted'
But here you had restarted the VMs on the original Host in which theywere running after the host is UP, this will workThe problem is to run a HA VM on another host, to do so, we mustinsure that the Host that run the VMs until now is not running thenanymoreIf the fence operation is failed, we can say nothing on the Hoststatus and then as I said you should manually approve that the hostwas rebooted in orderto prevent VMs to run on multiple hosts which will lead to datacorruption
but shouldn't engine restart corresponded vms after holding host cameup? (without manual fence)because they up - so engine can query them about running/not runningvms and get actual state of vms - running or notthe only host were down at that point is srv5, which holded only 1 vm- and it were correctly put in 'unknown' state, other vms were just'down' until we manually started them


_______________________________________________
Users mailing list
[email protected]
http://lists.ovirt.org/mailman/listinfo/users

Re: [ovirt-users] power outage: HA vms not restarted

Reply via email to