Hi Roel,

You can change this setting in two ways:
- update it in the db directly, as you stated (not recommended)
- use engine-config -s vdsHeartbeatInSeconds=20

Prior to running that command, however, you need to add vdsHeartbeatInSeconds.type=Integer to the config file /etc/ovirt-engine/engine-config/engine-config.properties, because this config value is not exposed by default.
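For reference, the full sequence would look roughly like this (a sketch; the verification and restart steps are my additions, on the assumption that engine-config changes are only picked up once the engine is restarted):

    # expose the key to engine-config (it is not listed by default)
    echo "vdsHeartbeatInSeconds.type=Integer" >> /etc/ovirt-engine/engine-config/engine-config.properties

    # set the new heartbeat timeout (20 seconds here; pick what suits your network)
    engine-config -s vdsHeartbeatInSeconds=20

    # verify what was stored
    engine-config -g vdsHeartbeatInSeconds

    # assumption: restart the engine so the new value takes effect
    service ovirt-engine restart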
Thanks, Piotr On Mon, Mar 16, 2015 at 11:18 PM, Roel de Rooy <rder...@motto.nl> wrote: > Hi Piotr, > > Thanks for your reply! > > If I would like to change the heartbeat value, do I have to update the value > within the vdc_options table directly, or should this be done in another way > (e.g. a config file)? > > Regards, > Roel > > -----Original Message----- > From: Piotr Kliczewski [mailto:piotr.kliczew...@gmail.com] > Sent: Monday, March 16, 2015 12:16 > To: Roel de Rooy > CC: Michal Skrivanek; users@ovirt.org > Subject: Re: [ovirt-users] Communication errors between engine and nodes? > > Unfortunately the log entries that you copied give me almost no information about > the nature of your issue. > There are a few things that we can do to understand what is going on with your > setup. > > The heartbeat functionality provides a means to detect whether we still have a > connection with a host. By default the heartbeat timeout is set to 10 seconds, but > it can be modified by setting vdsHeartbeatInSeconds. > > In general, whenever there are no incoming responses and no heartbeat frame is > received, the engine will invalidate the connection and attempt to > recover. If the reconnection is successful you won't see any other consequences > of losing a single heartbeat. I would explore the stability of your network: if > the network is busy or you lose network packets from time to time, this kind > of log entry is expected. You can increase the heartbeat value and see > whether it works better for your environment. > > If you confirm that your network is stable we could explore the issue further > by setting debug-level logging for your engine to understand exactly how the > messages are processed by a host and when we receive responses. > > > > On Mon, Mar 16, 2015 at 11:34 AM, Roel de Rooy <rder...@motto.nl> wrote: >> Received the "Heartbeat exeeded" error continuously this morning (seems to be >> quiet again for now). >> VMs still continue to work correctly and the storage domains (NFS shares) >> are still connected and reachable on the nodes, at the exact time that this >> issue is happening. >> >> Contacted our network engineer to see if he could see a load increase on our >> network, or could see any latency, errors, etc. >> Unfortunately he could not detect anything yet (he is still investigating >> this). >> >> >> I have attached both the engine and vdsm logs. >> >> Engine.log: >> >> 2015-03-16 10:10:10,506 ERROR >> [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Command >> ListVDSCommand(HostName = <HOST>, HostId = >> 3b87597e-081b-4c89-9b1e-cb04203259f5, >> vds=Host[<HOST>,3b87597e-081b-4c89-9b1e-cb04203259f5]) execution >> failed. Exception: VDSNetworkException: VDSGenericException: >> VDSNetworkException: Heartbeat exeeded >> 2015-03-16 10:10:10,507 ERROR >> [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] >> (DefaultQuartzScheduler_Worker-35) [2c53103c] Command >> SpmStatusVDSCommand(HostName = <HOST>, HostId = >> 3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId = >> 124ae76f-8acb-412e-91cc-dff9f6ec665d) execution failed. Exception: >> VDSNetworkException: VDSGenericException: VDSNetworkException: >> Heartbeat exeeded >> 2015-03-16 10:10:10,506 WARN >> [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] >> (ResponseWorker) Exception thrown during message processing >> 2015-03-16 10:10:10,507 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Host <HOST> is not responding. 
>> It will stay in Connecting state for a grace period of 88 seconds and after >> that an attempt to fence the host will be issued. >> 2015-03-16 10:10:10,510 INFO >> [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Running command: >> SetStoragePoolStatusCommand internal: true. Entities affected : ID: >> 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool >> 2015-03-16 10:10:10,512 INFO >> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool >> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain >> bfa86142-6f2e-44fe-8a9c-cf4390f3b8ae status from Active to Unknown, >> reason : null >> 2015-03-16 10:10:10,513 INFO >> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool >> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain >> 178a38d9-245c-43d3-bff9-6f3a5983bf03 status from Active to Unknown, >> reason : null >> 2015-03-16 10:10:10,514 INFO >> [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool >> 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain >> 3b0b4f26-bec9-4730-a8ba-40965a228932 status from Active to Unknown, >> reason : null >> 2015-03-16 10:10:10,526 WARN >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Correlation ID: null, Call >> Stack: null, Custom Event ID: -1, Message: Host <HOST> is not responding. It >> will stay in Connecting state for a grace period of 88 seconds and after >> that an attempt to fence the host will be issued. >> 2015-03-16 10:10:10,527 ERROR >> [org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo] >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Failure to refresh Vds runtime >> info: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: >> VDSGenericException: VDSNetworkException: Heartbeat exeeded >> at >> org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:183) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand.executeVdsBrokerCommand(ListVDSCommand.java:24) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:96) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:56) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:31) >> [dal.jar:] >> at >> org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:418) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.fetchRunningVms(VdsUpdateRunTimeInfo.java:991) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVmStats(VdsUpdateRunTimeInfo.java:940) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refreshVdsRunTimeInfo(VdsUpdateRunTimeInfo.java:658) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VdsUpdateRunTimeInfo.refresh(VdsUpdateRunTimeInfo.java:494) >> [vdsbroker.jar:] >> at >> org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:236) >> [vdsbroker.jar:] >> at sun.reflect.GeneratedMethodAccessor70.invoke(Unknown Source) >> [:1.7.0_75] >> at >> 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> [rt.jar:1.7.0_75] >> at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_75] >> at >> org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:60) >> [scheduler.jar:] >> at org.quartz.core.JobRunShell.run(JobRunShell.java:213) >> [quartz.jar:] >> at >> org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.ja >> va:557) [quartz.jar:] >> >> 2015-03-16 10:10:10,544 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] >> (DefaultQuartzScheduler_Worker-45) [6d40f562] Failed to refresh VDS , vds = >> 3b87597e-081b-4c89-9b1e-cb04203259f5 : <HOST>, VDS Network Error, continuing. >> VDSGenericException: VDSNetworkException: Heartbeat exeeded >> 2015-03-16 10:10:10,547 WARN >> [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] >> (DefaultQuartzScheduler_Worker-35) [7e61eee] Correlation ID: 7e61eee, Call >> Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center >> <DC>. Setting Data Center status to Non Responsive (On host <HOST>, Error: >> Network error during communication with the Host.). >> 2015-03-16 10:10:10,566 INFO >> [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp >> Reactor) Connecting to /<IP> >> 2015-03-16 10:10:15,804 INFO >> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] >> (DefaultQuartzScheduler_Worker-44) [55f2c760] START, >> GetHardwareInfoVDSCommand(HostName = <HOST>, HostId = >> 3b87597e-081b-4c89-9b1e-cb04203259f5, >> vds=Host[<HOST>,3b87597e-081b-4c89-9b1e-cb04203259f5]), log id: >> 69516bf >> 2015-03-16 10:10:15,823 INFO >> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] >> (DefaultQuartzScheduler_Worker-44) [55f2c760] FINISH, >> GetHardwareInfoVDSCommand, log id: 69516bf >> 2015-03-16 10:10:15,866 INFO >> [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] >> (DefaultQuartzScheduler_Worker-44) [16f754a9] Running command: >> HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities >> affected : ID: 3b87597e-081b-4c89-9b1e-cb04203259f5 Type: VDS >> 2015-03-16 10:10:16,924 INFO >> [org.ovirt.engine.core.bll.InitVdsOnUpCommand] >> (DefaultQuartzScheduler_Worker-44) [6f19cd0c] Running command: >> InitVdsOnUpCommand internal: true. Entities affected : ID: >> 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool >> 2015-03-16 10:10:16,931 INFO >> [org.ovirt.engine.core.bll.storage.ConnectHostToStoragePoolServersComm >> and] (DefaultQuartzScheduler_Worker-44) [64352136] Running command: >> ConnectHostToStoragePoolServersCommand internal: true. 
Entities >> affected : ID: 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool >> 2015-03-16 10:10:17,052 INFO >> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSComm >> and] (DefaultQuartzScheduler_Worker-44) [64352136] START, >> ConnectStorageServerVDSCommand(HostName = <HOST>, HostId = >> 3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId = >> 124ae76f-8acb-412e-91cc-dff9f6ec665d, storageType = NFS, >> connectionList = [{ id: 03ea1ab7-e96c-410b-911e-905e988b0dc7, >> connection: <IP>:/export, iqn: null, vfsType: null, mountOptions: >> null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };{ id: >> 65744a96-5f4c-4d5f-898b-932eaf97084c, connection: <IP>:/iso, iqn: >> null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: >> null, nfsTimeo: null };{ id: 6ca291fc-0a20-4047-9aac-9d166a4c5300, >> connection: <IP>:/mnt/storage, iqn: null, vfsType: null, mountOptions: >> null, nfsVersion: AUTO, nfsRetrans: null, nfsTimeo: null };]), log id: >> 5369ca8f >> 2015-03-16 10:10:17,113 INFO >> [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSComm >> and] (DefaultQuartzScheduler_Worker-44) [64352136] FINISH, >> ConnectStorageServerVDSCommand, return: >> {6ca291fc-0a20-4047-9aac-9d166a4c5300=0, >> 65744a96-5f4c-4d5f-898b-932eaf97084c=0, >> 03ea1ab7-e96c-410b-911e-905e988b0dc7=0}, log id: 5369ca8f >> >> >> >> Corresponding vdsm.log: (these are the only lines around the same timeframe): >> >> Thread-52::DEBUG::2015-03-16 >> 10:10:10,977::task::595::Storage.TaskManager.Task::(_updateState) >> Task=`89a0021d-9d5a-4563-ad44-d320aacbc551`::moving from state init -> >> state preparing JsonRpc (StompReactor)::DEBUG::2015-03-16 >> 10:10:10,982::stompReactor::98::Broker.StompAdapter::(handle_frame) >> Handling message <StompFrame command='SEND'> >> Thread-52::INFO::2015-03-16 >> 10:10:10,983::logUtils::44::dispatcher::(wrapper) Run and protect: >> getVolumeSize(sdUUID=u'178a38d9-245c-43d3-bff9-6f3a5983bf03', >> spUUID=u'124ae76f-8acb-412e-91cc-dff9f6ec665d', >> imgUUID=u'fb58d38b-9965-40f3-af45-915a4968a3aa', >> volUUID=u'0c28ab0e-b1a0-42b6-8eac-71de1faa6827', options=None) >> Thread-27::DEBUG::2015-03-16 >> 10:10:10,985::fileSD::261::Storage.Misc.excCmd::(getReadDelay) >> /usr/bin/dd >> if=/rhev/data-center/mnt/<IP>:_mnt_storage/178a38d9-245c-43d3-bff9-6f3 >> a5983bf03/dom_md/metadata iflag=direct of=/dev/null bs=4096 count=1 >> (cwd None) >> >> >> -----Original Message----- >> From: users-boun...@ovirt.org [mailto:users-boun...@ovirt.org] On behalf of >> Piotr Kliczewski >> Sent: 16 March 2015 08:39 >> To: Michal Skrivanek >> CC: users@ovirt.org >> Subject: Re: [ovirt-users] Communication errors between engine and nodes? >> >> Can you please provide logs from both ends? >> >> On Fri, Mar 13, 2015 at 3:17 PM, Michal Skrivanek >> <michal.skriva...@redhat.com> wrote: >>> >>> On 13 Mar 2015, at 14:39, Chris Adams wrote: >>> >>>> Once upon a time, Roel de Rooy <rder...@motto.nl> said: >>>>> We are observing the same thing with our oVirt environment. >>>>> At random moments (could be a couple of times a day, once a day or even >>>>> once every couple of days), we receive the "VDSNetworkException" message >>>>> on one of our nodes. >>>>> Haven't seen the "heartbeat exceeded" message, but it could be that I >>>>> overlooked it within our logs. >>>>> On some rare occasions, we also see "Host cannot access the Storage >>>>> Domain(s) <UNKNOWN> attached to the Data Center", within the GUI. 
>>>>> >>>>> VMs will continue to run normally and most of the time the nodes will >>>>> be in "UP" state again within the same minute. >>>>> >>>>> We still haven't found the root cause of this issue. >>>>> Our engine is CentOS 6.6 based and it's happening with both CentOS 6 and >>>>> Fedora 20 nodes. >>>>> We are using an LACP bond of 1Gbit ports for our management network. >>>>> >>>>> As we didn't see any reports about this before, we are currently looking >>>>> into whether something network-related is causing this. >>>> >>>> I just opened a BZ on it (since it isn't just me): >>>> >>>> https://bugzilla.redhat.com/show_bug.cgi?id=1201779 >>>> >>>> My cluster went a couple of days without hitting this (as soon as I >>>> posted to the list of course), but then it happened several times >>>> overnight. Interestingly, one error logged was communicating with >>>> the node currently running my hosted engine. That should rule out >>>> external network (e.g. switch and such) issues, as those packets >>>> should not have left the physical box. >>> >>> well, hosted engine complicates things as you'd need to be able to >>> see the status of the engine guest; running a standalone engine installation, >>> or at least running that hosted engine on a single node without any other >>> VM, may help…. >>> >>> Thanks, >>> michal >>> >>>> >>>> -- >>>> Chris Adams <c...@cmadams.net> _______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users