Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-20 Thread Chris Adams
So, in my case, I'm wondering if maybe there is some kind of weird
network issue happening.

The node that seems to be showing up most for the last day or two is one
of the two nodes running the hosted-engine HA, and is _not_ currently
hosting the engine.  It seems that, at the same time the engine has
trouble communicating with that node, the hosted-engine HA running on
that node has trouble seeing the engine.

I still can't find any actual network problem.  Using another physical
system, I ran fping to all the nodes and the engine with a 0.2 second
interval, and that didn't show any problem (I ran it until I also saw an
instance of the engine-node communication error).  I'm watching ARP
traffic now to see if something is sending bad answers.  At this point
I'm pretty stumped about what to look at next.

-- 
Chris Adams c...@cmadams.net
___
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users


Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-19 Thread Gabi C
Hello!

FYI: two days ago I updated the 3 hypervisors in my setup to the latest
3.5-patternfly, rebooted the nodes and the engine, and the error seems to
be gone: I no longer get "Heartbeat exeeded".


Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-17 Thread Piotr Kliczewski
Hi Roel,

You can change this setting in two ways:
- Update it in the database directly, as you suggested (not recommended).
- Run engine-config -s vdsHeartbeatInSeconds=20. Before running this
  command, you need to add vdsHeartbeatInSeconds.type=Integer to the
  config file /etc/ovirt-engine/engine-config/engine-config.properties,
  because this config value is not exposed by default.
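
The preparation step for the engine-config route can be sketched in Python. This is only an illustration under stated assumptions: the helper name ensure_property is hypothetical, it handles only plain key=value lines, and it should be tried on a copy of the file before touching the real one.

```python
from pathlib import Path


def ensure_property(path: str, key: str, value: str) -> bool:
    """Append ``key=value`` to a properties file unless the key is already set.

    Returns True if the file was changed, False if the key was already present.
    Hypothetical helper for illustration; it is not part of oVirt.
    """
    file = Path(path)
    lines = file.read_text().splitlines() if file.exists() else []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines
        # Only the simple "key=value" form is recognised here.
        if stripped.split("=", 1)[0].strip() == key:
            return False
    lines.append(f"{key}={value}")
    file.write_text("\n".join(lines) + "\n")
    return True


# Example call (path and key taken from the message above):
# ensure_property(
#     "/etc/ovirt-engine/engine-config/engine-config.properties",
#     "vdsHeartbeatInSeconds.type",
#     "Integer",
# )
```

Once the type line is present, the engine-config -s vdsHeartbeatInSeconds=20 command from the message can be run. Values set through engine-config normally take effect only after the ovirt-engine service has been restarted.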

Thanks,
Piotr


Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-16 Thread Piotr Kliczewski
  
 [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] 
 (DefaultQuartzScheduler_Worker-44) [64352136] FINISH, 
 ConnectStorageServerVDSCommand, return: 
 {6ca291fc-0a20-4047-9aac-9d166a4c5300=0, 
 65744a96-5f4c-4d5f-898b-932eaf97084c=0, 
 03ea1ab7-e96c-410b-911e-905e988b0dc7=0}, log id: 5369ca8f



 Corresponding vdsm.log: (these are the only lines around the same timeframe):

 Thread-52::DEBUG::2015-03-16 
 10:10:10,977::task::595::Storage.TaskManager.Task::(_updateState) 
 Task=`89a0021d-9d5a-4563-ad44-d320aacbc551`::moving from state init -> state 
 preparing
 JsonRpc (StompReactor)::DEBUG::2015-03-16 
 10:10:10,982::stompReactor::98::Broker.StompAdapter::(handle_frame) Handling 
 message StompFrame command='SEND'
 Thread-52::INFO::2015-03-16 10:10:10,983::logUtils::44::dispatcher::(wrapper) 
 Run and protect: 
 getVolumeSize(sdUUID=u'178a38d9-245c-43d3-bff9-6f3a5983bf03', 
 spUUID=u'124ae76f-8acb-412e-91cc-dff9f6ec665d', 
 imgUUID=u'fb58d38b-9965-40f3-af45-915a4968a3aa', 
 volUUID=u'0c28ab0e-b1a0-42b6-8eac-71de1faa6827', options=None)
 Thread-27::DEBUG::2015-03-16 
 10:10:10,985::fileSD::261::Storage.Misc.excCmd::(getReadDelay) /usr/bin/dd 
 if=/rhev/data-center/mnt/IP:_mnt_storage/178a38d9-245c-43d3-bff9-6f3a5983bf03/dom_md/metadata
  iflag=direct of=/dev/null bs=4096 count=1 (cwd None)




Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-16 Thread Roel de Rooy
Hi Piotr,

Thanks for your reply!

If I want to change the heartbeat value, do I have to update it directly in 
the vdc_options table, or should this be done another way (e.g. a config 
file)?

Regards,
Roel

-----Original Message-----
From: Piotr Kliczewski [mailto:piotr.kliczew...@gmail.com] 
Sent: Monday 16 March 2015 12:16
To: Roel de Rooy
CC: Michal Skrivanek; users@ovirt.org
Subject: Re: [ovirt-users] Communication errors between engine and nodes?

Unfortunately, the log entries that you copied give me almost no information 
about the nature of your issue.
There are a few things that we can do to understand what is going on with your 
setup.

The heartbeat functionality provides a means to detect whether we still have a 
connection with a host. By default the heartbeat timeout is set to 10 seconds, 
but it can be modified by setting vdsHeartbeatInSeconds.

In general, whenever neither a response nor a heartbeat frame is received 
within the timeout, the engine will invalidate the connection and attempt to 
recover. If the reconnection is successful, you won't see any other 
consequences of losing a single heartbeat. I would explore the stability of 
your network: if the network is busy or you lose packets from time to time, 
this kind of entry in the log is expected. You can increase the heartbeat 
value and see whether that works better for your environment.

If you confirm that your network is stable, we could explore the issue further 
by setting debug-level logging for your engine to understand exactly how the 
messages are processed by a host and when we receive responses.
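
The heartbeat rule described above can be sketched in a few lines of Python. This is an illustration of the general idea only, not the engine's actual implementation: the class and method names are hypothetical, and the only value taken from the thread is the 10-second default.

```python
class HeartbeatMonitor:
    """Tracks the last time anything (a response or a heartbeat frame) arrived."""

    def __init__(self, timeout_seconds: float, now: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_seen = now

    def record_incoming(self, now: float) -> None:
        # Any incoming frame counts as proof of life, not only heartbeats.
        self.last_seen = now

    def is_exceeded(self, now: float) -> bool:
        # True once the connection has been silent for longer than the timeout.
        return now - self.last_seen > self.timeout_seconds


monitor = HeartbeatMonitor(timeout_seconds=10, now=0.0)
monitor.record_incoming(now=4.0)
print(monitor.is_exceeded(now=13.0))  # 9 s of silence: still within the timeout
print(monitor.is_exceeded(now=14.5))  # 10.5 s of silence: timeout exceeded
```

Under this model, raising the timeout to 20 seconds would let the connection survive stalls of up to 20 seconds, at the cost of noticing a host that has really gone away more slowly.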



On Mon, Mar 16, 2015 at 11:34 AM, Roel de Rooy rder...@motto.nl wrote:
 We received "Heartbeat exeeded" continuously this morning (it seems to be 
 quiet again for now).
 The VMs continue to work correctly, and the storage domains (NFS shares) 
 are still connected and reachable on the nodes at the exact time that this 
 issue is happening.

 I contacted our network engineer to see whether he could see a load increase 
 on our network, or any latency, errors, etc.
 Unfortunately he has not detected anything yet (he is still investigating 
 this).


 I have attached both the engine and vdsm logs.

 Engine.log:

 2015-03-16 10:10:10,506 ERROR 
 [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] 
 (DefaultQuartzScheduler_Worker-45) [6d40f562] Command 
 ListVDSCommand(HostName = HOST, HostId = 
 3b87597e-081b-4c89-9b1e-cb04203259f5, 
 vds=Host[HOST,3b87597e-081b-4c89-9b1e-cb04203259f5]) execution 
 failed. Exception: VDSNetworkException: VDSGenericException: 
 VDSNetworkException: Heartbeat exeeded
 2015-03-16 10:10:10,507 ERROR 
 [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] 
 (DefaultQuartzScheduler_Worker-35) [2c53103c] Command 
 SpmStatusVDSCommand(HostName = HOST, HostId = 
 3b87597e-081b-4c89-9b1e-cb04203259f5, storagePoolId = 
 124ae76f-8acb-412e-91cc-dff9f6ec665d) execution failed. Exception: 
 VDSNetworkException: VDSGenericException: VDSNetworkException: 
 Heartbeat exeeded
 2015-03-16 10:10:10,506 WARN  
 [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] 
 (ResponseWorker) Exception thrown during message processing
 2015-03-16 10:10:10,507 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] 
 (DefaultQuartzScheduler_Worker-45) [6d40f562] Host HOST is not responding. 
 It will stay in Connecting state for a grace period of 88 seconds and after 
 that an attempt to fence the host will be issued.
 2015-03-16 10:10:10,510 INFO  
 [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] 
 (DefaultQuartzScheduler_Worker-35) [7e61eee] Running command: 
 SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 
 124ae76f-8acb-412e-91cc-dff9f6ec665d Type: StoragePool
 2015-03-16 10:10:10,512 INFO  
 [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] 
 (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool 
 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain 
 bfa86142-6f2e-44fe-8a9c-cf4390f3b8ae status from Active to Unknown, 
 reason : null
 2015-03-16 10:10:10,513 INFO  
 [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] 
 (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool 
 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain 
 178a38d9-245c-43d3-bff9-6f3a5983bf03 status from Active to Unknown, 
 reason : null
 2015-03-16 10:10:10,514 INFO  
 [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] 
 (DefaultQuartzScheduler_Worker-35) [7e61eee] Storage Pool 
 124ae76f-8acb-412e-91cc-dff9f6ec665d - Updating Storage Domain 
 3b0b4f26-bec9-4730-a8ba-40965a228932 status from Active to Unknown, 
 reason : null
 2015-03-16 10:10:10,526 WARN  
 [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
 (DefaultQuartzScheduler_Worker-45) [6d40f562] Correlation ID: null, Call 
 Stack: null, Custom Event ID: -1, Message: Host HOST is not responding. It 
 will stay in Connecting state for a grace period

Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-16 Thread Piotr Kliczewski
Can you please provide logs from both ends?



Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-13 Thread Roel de Rooy
We are observing the same thing in our oVirt environment.
At random moments (a couple of times a day, once a day, or even once every 
couple of days), we receive the VDSNetworkException message on one of our 
nodes.
I haven't seen the "Heartbeat exceeded" message, but it could be that I 
overlooked it in our logs.
On rare occasions, we also see "Host cannot access the Storage Domain(s) 
UNKNOWN attached to the Data Center" in the GUI.

The VMs continue to run normally, and most of the time the nodes are back in 
the UP state within the same minute.

We still haven't found the root cause of this issue.
Our engine is CentOS 6.6 based, and it's happening with both CentOS 6 and 
Fedora 20 nodes.
We are using an LACP bond of 1 Gbit ports for our management network.

As we didn't see any reports about this before, we are currently checking 
whether something network-related is causing this.

-----Original Message-----
From: users-boun...@ovirt.org [mailto:users-boun...@ovirt.org] On Behalf Of Chris Adams
Sent: 12 March 2015 14:23
To: users@ovirt.org
Subject: Re: [ovirt-users] Communication errors between engine and nodes?

Once upon a time, Lior Vernia lver...@redhat.com said:
 If I'm not mistaken, heartbeat intervals are configured to 10 seconds 
 by default.

Okay, thanks.

 The command times out queries for the status of VMs on a host - any 
 reason to suspect why that's taking long? Does it happen on specific hosts?

No idea.  It seemed to happen on node5 a bunch over a week, but then there were 
errors on other nodes as well.  It isn't always "Heartbeat exeeded"; sometimes 
it is "VDSNetworkException: Message timeout which can be caused by 
communication issues".  I haven't been able to find any network issues that 
could cause this (no errors logged anywhere).

There doesn't seem to be any pattern to when it happens either.  The log entry 
I posted was from 04:42 local time, and a bunch of the VMs are CentOS 5, which 
does log rotation at 04:00 by default (which can spike the CPU and disk I/O), 
but they are all done long before 04:42.  It happened in the middle of the 
afternoon a couple of days ago, while I was logged-in to the web UI, and I 
didn't notice any unusual behavior.
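
One way to look for a pattern by host and time of day is to tally the timeout errors in engine.log. The sketch below assumes that each log entry sits on a single line in the format of the excerpts in this thread; the function name count_timeouts is hypothetical.

```python
import re
from collections import Counter

# Assumes one log entry per line, starting with a "YYYY-MM-DD HH:MM:SS,mmm"
# timestamp, as in the engine.log excerpts quoted in this thread.
ENTRY = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2},\d{3} ERROR .*"
    r"HostName = (?P<host>[^,]+),.*"
    r"(?P<error>Heartbeat exeeded|Message timeout)"
)


def count_timeouts(lines):
    """Count network-timeout errors per (host, hour of day)."""
    counts = Counter()
    for line in lines:
        match = ENTRY.search(line)
        if match:
            counts[(match["host"], match["hour"])] += 1
    return counts


# Example use against a live engine log (path assumed to be the default):
# with open("/var/log/ovirt-engine/engine.log") as log:
#     for (host, hour), n in sorted(count_timeouts(log).items()):
#         print(host, hour, n)
```

If the errors cluster on one host or around one hour, that narrows the search; an even spread across hosts and hours would point away from a per-node or scheduled-job cause.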

One other odd thing: I have also been experiencing an issue where I randomly 
get logged out of the web UI.  Usually nothing else was going on, but a couple 
of times it seemed to correspond with one of the node errors (hard to tell).  
It looked like the same error as BZ 1198493 (I'd see a bunch of "Failed to log 
User null@N/A out" messages).  I don't know if these issues are related or if 
that was just a coincidence.

To try to rule out any unseen network issues, I started an fping to all seven 
nodes and the engine from another physical system on the same VLAN.  It is 
sending one ping to each of the eight hosts every 0.2 seconds.  That has not 
shown a dropped packet since I started yesterday afternoon.  However, during 
that time, I also have not seen any engine/vdsm timeouts.  I was going to say I 
had not been logged out of the web UI, but that just happened while I was 
typing the previous sentence.

--
Chris Adams c...@cmadams.net


Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-13 Thread Chris Adams
I just opened a BZ on it (since it isn't just me):

https://bugzilla.redhat.com/show_bug.cgi?id=1201779

My cluster went a couple of days without hitting this (as soon as I
posted to the list of course), but then it happened several times
overnight.  Interestingly, one error logged was communicating with the
node currently running my hosted engine.  That should rule out external
network (e.g. switch and such) issues, as those packets should not have
left the physical box.

-- 
Chris Adams c...@cmadams.net


Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-13 Thread Michal Skrivanek

On 13 Mar 2015, at 14:39, Chris Adams wrote:

 My cluster went a couple of days without hitting this (as soon as I
 posted to the list of course), but then it happened several times
 overnight.  Interestingly, one error logged was communicating with the
 node currently running my hosted engine.  That should rule out external
 network (e.g. switch and such) issues, as those packets should not have
 left the physical box.

Well, hosted engine complicates things, as you'd need to be able to see the 
status of the engine guest. Running a standalone engine installation, or at 
least running the hosted engine on a single node without any other VMs, may 
help.

Thanks,
michal

 
 -- 
 Chris Adams c...@cmadams.net



Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-12 Thread Lior Vernia
If I'm not mistaken, heartbeat intervals are configured to 10 seconds by
default.

The command that times out queries the status of VMs on a host. Any idea
why that would take so long? Does it happen on specific hosts?

On 11/03/15 18:40, Chris Adams wrote:
 Once upon a time, Chris Adams c...@cmadams.net said:
 2015-03-10 04:42:23,310 ERROR 
 [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] 
 (DefaultQuartzScheduler_Worker-40) [75b9e6d9] Command 
 ListVDSCommand(HostName = node5, HostId = 
 8dfd0195-f386-4e16-9379-a5287221d5bd, 
 vds=Host[node5,8dfd0195-f386-4e16-9379-a5287221d5bd]) execution failed.  
 Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: 
 Heartbeat exeeded 
 
 I'm trying to dig into this some on my own (without knowing about
 oVirt's internals); can somebody tell me the timeout for the dispatching
 of commands to vdsm?  I get different things happening when the engine
 thinks a node has gone away, but they all start with the same
 org.ovirt.engine.core.vdsbroker.vdsbroker bit (and have a network
 timeout of some type).
 
 I don't see anything in common in any of the logs at the time of the
 error, so I'm trying to roll back to when the request was sent (but I
 don't know how long it took for the engine to time out before the error
 was logged).
 


Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-12 Thread Chris Adams
Once upon a time, Lior Vernia lver...@redhat.com said:
 If I'm not mistaken, heartbeat intervals are configured to 10 seconds by
 default.

Okay, thanks.

 The command that times out queries the status of VMs on a host. Any idea
 why that would take so long? Does it happen on specific hosts?

No idea.  It seemed to happen on node5 a bunch over a week, but then
there were errors on other nodes as well.  It isn't always "Heartbeat
exeeded"; sometimes it is "VDSNetworkException: Message timeout which
can be caused by communication issues".  I haven't been able to find any
network issues that could cause this (no errors logged anywhere).

There doesn't seem to be any pattern to when it happens either.  The log
entry I posted was from 04:42 local time, and a bunch of the VMs are
CentOS 5, which does log rotation at 04:00 by default (which can spike
the CPU and disk I/O), but they are all done long before 04:42.  It
happened in the middle of the afternoon a couple of days ago, while I
was logged in to the web UI, and I didn't notice any unusual behavior.
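
One way to look for a pattern across hosts and times of day is to tally
the errors straight from the engine log.  The following is only a rough
sketch: the regular expression is modelled on the single log entry
quoted in this thread, so treating every such error as one unwrapped
line in that shape is an assumption, and the names parse_error and
summarize are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern modelled on the one log entry quoted in this thread; other
# engine versions may format these lines differently (assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ERROR .*?"
    r"HostName = (?P<host>[^,]+),.*?"
    r"VDSNetworkException: (?P<reason>[^:]+?)\s*$"
)


def parse_error(line):
    """Return (timestamp, host, reason) for a matching line, else None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
    return when, match.group("host"), match.group("reason")


def summarize(lines):
    """Count matching errors per host and per hour of day."""
    by_host, by_hour = Counter(), Counter()
    for line in lines:
        parsed = parse_error(line)
        if parsed is None:
            continue  # not an error line of the expected shape
        when, host, _reason = parsed
        by_host[host] += 1
        by_hour[when.hour] += 1
    return by_host, by_hour
```

Feeding it the lines of an engine log (for example, summarize(open(path))
with whatever path the log lives at) gives two counters that make a
per-host or time-of-day cluster easy to spot, if one exists.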

One other odd thing: I have also been experiencing an issue where I
randomly get logged out of the web UI.  Usually nothing else was going
on, but a couple of times it seemed to correspond with one of the node
errors (hard to tell).  It looked like the same error as BZ 1198493 (I'd
see a bunch of "Failed to log User null@N/A out" messages).  I don't
know if these issues are related or if that was just a coincidence.

To try to rule out any unseen network issues, I started an fping to all
seven nodes and the engine from another physical system on the same
VLAN.  It is sending one ping to each of the eight hosts every 0.2
seconds.  That has not shown a dropped packet since I started yesterday
afternoon.  However, during that time, I also have not seen any
engine/vdsm timeouts.  I was going to say I had not been logged out of
the web UI, but that just happened while I was typing the previous
sentence.
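
Since ICMP pings only show that the hosts are reachable, a complementary
check is to probe the TCP port the engine actually talks to.  This is a
different technique from fping, shown only as a minimal sketch: the
function names are hypothetical, and port 54321 is assumed to be where
vdsm listens on the nodes (adjust it if your setup differs).

```python
import socket
import time
from collections import Counter


def tcp_probe(host, port=54321, timeout=0.2):
    """Return True if a TCP connection to host:port opens within timeout.

    Port 54321 is assumed to be the vdsm port; this is not confirmed
    anywhere in this thread.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def count_failures(hosts, probe=tcp_probe, rounds=10, interval=0.2):
    """Probe every host once per round and count failed probes per host."""
    failures = Counter()
    for _ in range(rounds):
        for host in hosts:
            if not probe(host):
                failures[host] += 1
        time.sleep(interval)
    return failures
```

Running count_failures over the node names for a long stretch, and
comparing any failures against the engine log timestamps, would show
whether TCP connections stall at the same moments the engine complains.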

-- 
Chris Adams c...@cmadams.net


Re: [ovirt-users] Communication errors between engine and nodes?

2015-03-11 Thread Chris Adams
Once upon a time, Chris Adams c...@cmadams.net said:
 2015-03-10 04:42:23,310 ERROR 
 [org.ovirt.engine.core.vdsbroker.vdsbroker.ListVDSCommand] 
 (DefaultQuartzScheduler_Worker-40) [75b9e6d9] Command ListVDSCommand(HostName 
 = node5, HostId = 8dfd0195-f386-4e16-9379-a5287221d5bd, 
 vds=Host[node5,8dfd0195-f386-4e16-9379-a5287221d5bd]) execution failed.  
 Exception: VDSNetworkException: VDSGenericException: VDSNetworkException: 
 Heartbeat exeeded 

I'm trying to dig into this some on my own (without knowing about
oVirt's internals); can somebody tell me the timeout for the dispatching
of commands to vdsm?  I get different things happening when the engine
thinks a node has gone away, but they all start with the same
org.ovirt.engine.core.vdsbroker.vdsbroker bit (and have a network
timeout of some type).

I don't see anything in common in any of the logs at the time of the
error, so I'm trying to roll back to when the request was sent (but I
don't know how long it took for the engine to time out before the error
was logged).
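
For rolling back from the logged error to the time the request was
probably sent, the arithmetic can be sketched as below.  The timeout
passed in is a guess, not a known value: the 10 seconds used in the
example is the default heartbeat interval mentioned elsewhere in this
thread, and the extra slack and the function name search_window are
assumptions for illustration.

```python
from datetime import datetime, timedelta


def search_window(error_stamp, timeout_seconds, slack_seconds=5):
    """Return (start, end) of the log window worth searching.

    error_stamp uses the engine log format, e.g. "2015-03-10 04:42:23,310".
    timeout_seconds is whatever timeout you believe applies (a guess here).
    slack_seconds widens the window to allow for clock skew between hosts.
    """
    logged = datetime.strptime(error_stamp, "%Y-%m-%d %H:%M:%S,%f")
    start = logged - timedelta(seconds=timeout_seconds + slack_seconds)
    return start, logged


# Example: assume a 10 second timeout for the error quoted above.
start, end = search_window("2015-03-10 04:42:23,310", 10)
print(start.isoformat(sep=" "), "to", end.isoformat(sep=" "))
```

The resulting window is where to look in the vdsm log on the node for
the matching request, assuming the node and engine clocks roughly agree.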
-- 
Chris Adams c...@cmadams.net