Hi Roel,

You can change this setting in two ways:
- update it in the db directly, as you stated (not recommended)
- use engine-config -s vdsHeartbeatInSeconds=20

Prior to running that command, however, you need to add vdsHeartbeatInSeconds.type=Integer to the config file /etc/ovirt-engine/engine-config/engine-config.properties, because this config value is not exposed by default.
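For reference, the full sequence would look roughly like this (a sketch; the verification and restart steps are my additions, on the assumption that engine-config changes are only picked up once the engine is restarted):

    # expose the key to engine-config (it is not listed by default)
    echo "vdsHeartbeatInSeconds.type=Integer" >> /etc/ovirt-engine/engine-config/engine-config.properties

    # set the new heartbeat timeout (20 seconds here; pick what suits your network)
    engine-config -s vdsHeartbeatInSeconds=20

    # verify what was stored
    engine-config -g vdsHeartbeatInSeconds

    # assumption: restart the engine so the new value takes effect
    service ovirt-engine restart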
Thanks, Piotr On Mon, Mar 16, 2015 at 11:18 PM, Roel de Rooy <rder...@motto.nl> wrote: > Hi Piotr, > > Thanks for your reply! > > If I would like to change the heartbeat value, do I have to update the value > within the vdc_options table directly, or should this be done in another way > (e.g. a config file)? > > Regards, > Roel > > -----Original Message----- > From: Piotr Kliczewski [mailto:piotr.kliczew...@gmail.com] > Sent: Monday, March 16, 2015 12:16 > To: Roel de Rooy > CC: Michal Skrivanek; users@ovirt.org > Subject: Re: [ovirt-users] Communication errors between engine and nodes? > > Unfortunately the log entries that you copied give me almost no information about > the nature of your issue. > There are a few things that we can do to understand what is going on with your > setup. > > The heartbeat functionality provides a means to detect whether we still have a > connection with a host. By default the heartbeat timeout is set to 10 seconds, but > it can be modified by setting vdsHeartbeatInSeconds. > > In general, whenever there are no incoming responses and no heartbeat frame is > received, the engine will invalidate the connection and attempt to > recover. If the reconnection is successful you won't see any other consequences > of losing a single heartbeat. I would explore the stability of your network: if > the network is busy or you lose network packets from time to time, this kind > of log entry is expected. You can increase the heartbeat value and see > whether it works better for your environment. > > If you confirm that your network is stable we could explore the issue further > by setting debug-level logging for your engine to understand exactly how the > messages are processed by a host and when we receive responses. > > > > On Mon, Mar 16, 2015 at 11:34 AM, Roel de Rooy <rder...@motto.nl> wrote: >> Received the "Heartbeat exeeded" error continuously this morning (seems to be >> quiet again for now). >> VMs still continue to work correctly and the storage domains (NFS shares) >> are still connected and reachable on the nodes, at the exact time that this >> issue is happening. >> >> Contacted our network engineer to see if he could see a load increase on our >> network, or could see any latency, errors, etc. >> Unfortunately he could not detect anything yet (he is still investigating >> this). >> >> >> I have attached both the engine and vdsm logs. >> >> Engine.log: >> >> 2015-03-16 10:10:10,506 ERROR >> [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Command >> ListVDSCommand(HostName = <HOST>, HostId = >> 3b87597e-081b-4c89-9b1e-cb04203259f5, >> vds=Host[<HOST>,3b87597e-081b-4c89-9b1e-cb04203259f5]) execution >> failed. Exception: VDSNetworkException: VDSGenericException: >> VDSNetworkException: Heartbeat exeeded >> 2015-03-16 10:10:10,507 ERROR >> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] >> (DefaultQuartzScheduler_Worker-35) [2c53103c] Command >> SpmStatusVDSCommand(HostName = <HOST>, HostId = >> 3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId = >> 124ae76f-8acb-412e-91cc-dff9f6ec665d) execution failed. Exception: >> VDSNetworkException: VDSGenericException: VDSNetworkException: >> Heartbeat exeeded >> 2015-03-16 10:10:10,506 WARN >> [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] >> (ResponseWorker) Exception thrown during message processing >> 2015-03-16 10:10:10,507 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Host <HOST> is not responding. 
>> It will stay in Connecting state for a grace period of 88 seconds and after >> that an attempt to fence the host will be issued. >> 2015-03-16 10:10:10,510 INFO >> [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Running command: >> SetStoragePoolStatusCommand internal: true. Entities affected : ID: >> 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool >> 2015-03-16 10:10:10,512 INFO >> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool >> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain >> bfa86142-6f2e-44fe-8a9c-cf4390f3b8ae status from Active to Unknown, >> reason : null >> 2015-03-16 10:10:10,513 INFO >> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool >> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain >> 178a38d9-245c-43d3-bff9-6f3a5983bf03 status from Active to Unknown, >> reason : null >> 2015-03-16 10:10:10,514 INFO >> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool >> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain >> 3b0b4f26-bec9-4730-a8ba-40965a228932 status from Active to Unknown, >> reason : null >> 2015-03-16 10:10:10,526 WARN >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Correlation ID: null, Call >> Stack: null, Custom Event ID: -1, Message: Host <HOST> is not responding. It >> will stay in Connecting state for a grace period of 88 seconds and after >> that an attempt to fence the host will be issued. >> 2015-03-16 10:10:10,527 ERROR >> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Failure to refresh Vds runtime >> info: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: >> VDSGenericException: VDSNetworkException: Heartbeat exeeded >> at >> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand.executeVdsBrokerCommand(ListVDSCommand.java:24) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:96) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) >> [dal.jar:] >> at >> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:418) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.fetchRunningVms(VdsUpdateRunTimeInfo.java:991) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVmStats(VdsUpdateRunTimeInfo.java:940) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:658) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) >> [vdsbroker.jar:] >> at sun.reflect.GeneratedMethodAccessor70.invoke(Unknown Source) >> [:1.7.0_75] >> at >> 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> [rt.jar:1.7.0_75] >> at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75] >> at >> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) >> [scheduler.jar:] >> at org.quartz.core.JobRunShell.run(JobRunShell.java:213) >> [quartz.jar:] >> at >> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.ja >> va:557) [quartz.jar:] >> >> 2015-03-16 10:10:10,544 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Failed to refresh VDS , vds = >> 3b87597e-081b-4c89-9b1e-cb04203259f5 : <HOST>, VDS Network Error, continuing. >> VDSGenericException: VDSNetworkException: Heartbeat exeeded >> 2015-03-16 10:10:10,547 WARN >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Correlation ID: 7e61eee, Call >> Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center >> <DC>. Setting Data Center status to Non Responsive (On host <HOST>, Error: >> Network error during communication with the Host.). >> 2015-03-16 10:10:10,566 INFO >> [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp >> Reactor) Connecting to /<IP> >> 2015-03-16 10:10:15,804 INFO >> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] >> (DefaultQuartzScheduler_Worker-44) [55f2c760] START, >> GetHardwareInfoVDSCommand(HostName = <HOST>, HostId = >> 3b87597e-081b-4c89-9b1e-cb04203259f5, >> vds=Host[<HOST>,3b87597e-081b-4c89-9b1e-cb04203259f5]), log id: >> 69516bf >> 2015-03-16 10:10:15,823 INFO >> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] >> (DefaultQuartzScheduler_Worker-44) [55f2c760] FINISH, >> GetHardwareInfoVDSCommand, log id: 69516bf >> 2015-03-16 10:10:15,866 INFO >> [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] >> (DefaultQuartzScheduler_Worker-44) [16f754a9] Running command: >> HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities >> affected : ID: 3b87597e-081b-4c89-9b1e-cb04203259f5 Type: VDS >> 2015-03-16 10:10:16,924 INFO >> [org.ovirt.engine.core.bll.InitVdsOnUpCommand] >> (DefaultQuartzScheduler_Worker-44) [6f19cd0c] Running command: >> InitVdsOnUpCommand internal: true. Entities affected : ID: >> 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool >> 2015-03-16 10:10:16,931 INFO >> [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersComm >> and] (DefaultQuartzScheduler_Worker-44) [64352136] Running command: >> ConnectHostToStoragePoolServersCommand internal: true. 
Entities >> affected : ID: 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool >> 2015-03-16 10:10:17,052 INFO >> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSComm >> and] (DefaultQuartzScheduler_Worker-44) [64352136] START, >> ConnectStorageServerVDSCommand(HostName = <HOST>, HostId = >> 3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId = >> 124ae76f-8acb-412e-91cc-dff9f6ec665d, storageType = NFS, >> connectionList = [{ id: 03ea1ab7-e96c-410b-911e-905e988b0dc7, >> connection: <IP>:/export, iqn: null, vfsType: null, mountOptions: >> null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: >> 65744a96-5f4c-4d5f-898b-932eaf97084c, connection: <IP>:/iso, iqn: >> null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: >> null, nfsTimeo: null };{ id: 6ca291fc-0a20-4047-9aac-9d166a4c5300, >> connection: <IP>:/mnt/storage, iqn: null, vfsType: null, mountOptions: >> null, nfsVersion: AUTO, nfsRetrans: null, nfsTimeo: null };]), log id: >> 5369ca8f >> 2015-03-16 10:10:17,113 INFO >> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSComm >> and] (DefaultQuartzScheduler_Worker-44) [64352136] FINISH, >> ConnectStorageServerVDSCommand, return: >> {6ca291fc-0a20-4047-9aac-9d166a4c5300=0, >> 65744a96-5f4c-4d5f-898b-932eaf97084c=0, >> 03ea1ab7-e96c-410b-911e-905e988b0dc7=0}, log id: 5369ca8f >> >> >> >> Corresponding vdsm.log: (these are the only lines around the same timeframe): >> >> Thread-52::DEBUG::2015-03-16 >> 10:10:10,977::task::595::Storage.TaskManager.Task::(_updateState) >> Task=`89a0021d-9d5a-4563-ad44-d320aacbc551`::moving from state init -> >> state preparing JsonRpc (StompReactor)::DEBUG::2015-03-16 >> 10:10:10,982::stompReactor::98::Broker.StompAdapter::(handle_frame) >> Handling message <StompFrame command='SEND'> >> Thread-52::INFO::2015-03-16 >> 10:10:10,983::logUtils::44::dispatcher::(wrapper) Run and protect: >> getVolumeSize(sdUUID=u'178a38d9-245c-43d3-bff9-6f3a5983bf03', >> spUUID=u'124ae76f-8acb-412e-91cc-dff9f6ec665d', >> imgUUID=u'fb58d38b-9965-40f3-af45-915a4968a3aa', >> volUUID=u'0c28ab0e-b1a0-42b6-8eac-71de1faa6827', options=None) >> Thread-27::DEBUG::2015-03-16 >> 10:10:10,985::fileSD::261::Storage.Misc.excCmd::(getReadDelay) >> /usr/bin/dd >> if=/rhev/data-center/mnt/<IP>:_mnt_storage/178a38d9-245c-43d3-bff9-6f3 >> a5983bf03/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 >> (cwd None) >> >> >> -----Original Message----- >> From: users-boun...@ovirt.org [mailto:users-boun...@ovirt.org] On behalf of >> Piotr Kliczewski >> Sent: 16 March 2015 08:39 >> To: Michal Skrivanek >> CC: users@ovirt.org >> Subject: Re: [ovirt-users] Communication errors between engine and nodes? >> >> Can you please provide logs from both ends? >> >> On Fri, Mar 13, 2015 at 3:17 PM, Michal Skrivanek >> <michal.skriva...@redhat.com> wrote: >>> >>> On 13 Mar 2015, at 14:39, Chris Adams wrote: >>> >>>> Once upon a time, Roel de Rooy <rder...@motto.nl> said: >>>>> We are observing the same thing with our oVirt environment. >>>>> At random moments (could be a couple of times a day, once a day or even >>>>> once every couple of days), we receive the "VDSNetworkException" message >>>>> on one of our nodes. >>>>> Haven't seen the "heartbeat exceeded" message, but it could be that I >>>>> overlooked it within our logs. >>>>> On some rare occasions, we also see "Host cannot access the Storage >>>>> Domain(s) <UNKNOWN> attached to the Data Center", within the GUI. 
>>>>> >>>>> VMs will continue to run normally and most of the time the nodes will >>>>> be in "UP" state again within the same minute. >>>>> >>>>> We still haven't found the root cause of this issue. >>>>> Our engine is CentOS 6.6 based and it's happening with both CentOS 6 and >>>>> Fedora 20 nodes. >>>>> We are using an LACP bond of 1Gbit ports for our management network. >>>>> >>>>> As we didn't see any reports about this before, we are currently looking >>>>> into whether something network-related is causing this. >>>> >>>> I just opened a BZ on it (since it isn't just me): >>>> >>>> https://bugzilla.redhat.com/show_bug.cgi?id=1201779 >>>> >>>> My cluster went a couple of days without hitting this (as soon as I >>>> posted to the list of course), but then it happened several times >>>> overnight. Interestingly, one error logged was communicating with >>>> the node currently running my hosted engine. That should rule out >>>> external network (e.g. switch and such) issues, as those packets >>>> should not have left the physical box. >>> >>> well, hosted engine complicates things as you'd need to be able to >>> see the status of the engine guest; running a standalone engine installation, >>> or at least running that hosted engine on a single node without any other >>> VM, may help…. >>> >>> Thanks, >>> michal >>> >>>> >>>> -- >>>> Chris Adams <c...@cmadams.net> _______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users