Bryan, I checked your logs and I see extensive activity of ioprocess at that time which seems to cause getAllVmStats to be processed 7 seconds later. The call was invoked on vdsm side much later than we saw heartbeat exceeded exception on the engine side.
@Nir please take a look Thanks, Piotr 2017-09-13 04:07:07,126-0500 INFO (itmap/0) [IOProcessClient] Starting client ioprocess-10031 (__init__:330) 2017-09-13 04:07:07,176-0500 INFO (itmap/1) [IOProcessClient] Starting client ioprocess-10032 (__init__:330) 2017-09-13 04:07:07,222-0500 INFO (ioprocess/9466) [IOProcess] Starting ioprocess (__init__:452) 2017-09-13 04:07:07,226-0500 INFO (itmap/2) [IOProcessClient] Starting client ioprocess-10033 (__init__:330) 2017-09-13 04:07:07,279-0500 INFO (ioprocess/9474) [IOProcess] Starting ioprocess (__init__:452) 2017-09-13 04:07:07,284-0500 INFO (itmap/3) [IOProcessClient] Starting client ioprocess-10034 (__init__:330) 2017-09-13 04:07:07,332-0500 INFO (ioprocess/9482) [IOProcess] Starting ioprocess (__init__:452) 2017-09-13 04:07:07,340-0500 INFO (itmap/4) [IOProcessClient] Starting client ioprocess-10035 (__init__:330) 2017-09-13 04:07:07,395-0500 INFO (ioprocess/9502) [IOProcess] Starting ioprocess (__init__:452) 2017-09-13 04:07:07,401-0500 INFO (itmap/5) [IOProcessClient] Starting client ioprocess-10036 (__init__:330) 2017-09-13 04:07:07,461-0500 INFO (ioprocess/9523) [IOProcess] Starting ioprocess (__init__:452) 2017-09-13 04:07:07,469-0500 INFO (itmap/6) [IOProcessClient] Starting client ioprocess-10037 (__init__:330) 2017-09-13 04:07:07,525-0500 INFO (ioprocess/9538) [IOProcess] Starting ioprocess (__init__:452) 2017-09-13 04:07:07,532-0500 INFO (itmap/7) [IOProcessClient] Starting client ioprocess-10038 (__init__:330) 2017-09-13 04:07:07,579-0500 INFO (ioprocess/9556) [IOProcess] Starting ioprocess (__init__:452) 2017-09-13 04:07:07,583-0500 INFO (itmap/8) [IOProcessClient] Starting client ioprocess-10039 (__init__:330) 2017-09-13 04:07:07,628-0500 INFO (ioprocess/9569) [IOProcess] Starting ioprocess (__init__:452) 2017-09-13 04:07:07,642-0500 INFO (monitor/cacbac0) [storage.StorageDomain] Resource namespace 01_img_cacbac0c-4b08-46cd-ace8-b23266c92291 already registered (sd:727) 2017-09-13 04:07:07,642-0500 INFO (monitor/cacbac0) [storage.StorageDomain] Resource namespace 02_vol_cacbac0c-4b08-46cd-ace8-b23266c92291 already registered (sd:736) 2017-09-13 04:07:07,646-0500 INFO (monitor/e371d38) [storage.StorageDomain] Resource namespace 01_img_e371d380-7194-4950-b901-5f2aed5dfb35 already registered (sd:727) 2017-09-13 04:07:07,646-0500 INFO (monitor/e371d38) [storage.StorageDomain] Resource namespace 02_vol_e371d380-7194-4950-b901-5f2aed5dfb35 already registered (sd:736) 2017-09-13 04:07:07,655-0500 INFO (monitor/ea15319) [storage.StorageDomain] Resource namespace 01_img_ea153191-ac25-4984-aaec-4a59cbf383a5 already registered (sd:727) 2017-09-13 04:07:07,655-0500 INFO (monitor/ea15319) [storage.StorageDomain] Resource namespace 02_vol_ea153191-ac25-4984-aaec-4a59cbf383a5 already registered (sd:736) 2017-09-13 04:07:07,661-0500 INFO (monitor/58a7c5d) [storage.StorageDomain] Resource namespace 01_img_58a7c5dd-0b31-4066-ae05-8f541614dfde already registered (sd:727) 2017-09-13 04:07:07,661-0500 INFO (monitor/58a7c5d) [storage.StorageDomain] Resource namespace 02_vol_58a7c5dd-0b31-4066-ae05-8f541614dfde already registered (sd:736) 2017-09-13 04:07:07,665-0500 INFO (ioprocess/9582) [IOProcess] Starting ioprocess (__init__:452) 2017-09-13 04:07:07,668-0500 INFO (monitor/f9cab69) [storage.StorageDomain] Resource namespace 01_img_f9cab69e-ae3d-4b6d-8a02-389e4b1d8764 already registered (sd:727) 2017-09-13 04:07:07,668-0500 INFO (monitor/f9cab69) [storage.StorageDomain] Resource namespace 02_vol_f9cab69e-ae3d-4b6d-8a02-389e4b1d8764 already registered (sd:736) 2017-09-13 04:07:07,669-0500 INFO (monitor/f927ceb) [storage.StorageDomain] Resource namespace 01_img_f927ceb8-91d2-41bd-ba42-dc795395b6d0 already registered (sd:727) 2017-09-13 04:07:07,669-0500 INFO (monitor/f927ceb) [storage.StorageDomain] Resource namespace 02_vol_f927ceb8-91d2-41bd-ba42-dc795395b6d0 already registered (sd:736) 2017-09-13 04:07:07,688-0500 INFO (monitor/031c06e) [storage.StorageDomain] Resource namespace 01_img_031c06e4-9ba6-4572-8800-39da4a5420f0 already registered (sd:727) 2017-09-13 04:07:07,688-0500 INFO (monitor/031c06e) [storage.StorageDomain] Resource namespace 02_vol_031c06e4-9ba6-4572-8800-39da4a5420f0 already registered (sd:736) 2017-09-13 04:07:07,725-0500 INFO (monitor/804be30) [storage.StorageDomain] Resource namespace 01_img_804be302-3763-403b-96db-febebcbfd778 already registered (sd:727) 2017-09-13 04:07:07,725-0500 INFO (monitor/804be30) [storage.StorageDomain] Resource namespace 02_vol_804be302-3763-403b-96db-febebcbfd778 already registered (sd:736) 2017-09-13 04:07:14,688-0500 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.02 seconds (__init__:539) On Mon, Sep 18, 2017 at 6:43 PM, Bryan Sockel <bryan.soc...@altn.com> wrote: > Here are the logs as requested. > > > Bryan Sockel > Systems Administrator > Alt-N Technologies | Grapevine, TX > Office 817.601.3222 x233 > > Sent using Alt-N's own MDaemon Messaging Server > Get to know the Alt-N family by liking us on Facebook! > > > > -----Original Message----- > From: "piotr.kliczew...@gmail.com -- Piotr Kliczewski" > <piotr.kliczew...@gmail.com> > To: Bryan Sockel <bryan.soc...@altn.com> > Cc: "users@ovirt.org" <users@ovirt.org> > Date: Thu, 14 Sep 2017 21:30:19 +0200 > Subject: Re: [ovirt-users] Server Not Responding > > Bryan, > > In your engine logs I see : > > 2017-09-13 04:07:07,599-05 ERROR > [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] > (DefaultQuartzScheduler3) [] Command 'GetAllVmStatsVDSCommand(HostName > = vm-host-colo-1, VdsIdVDSCommandParametersBase:{runAsync='true', > hostId='e75d4446-9bfc-47cb-8bf8-a2e681720b66'})' execution failed: > VDSGenericzException: VDSNetworkException: Heartbeat exceeded > > It would br great to understand what happened on vdsm side because the > engine was still trying to connect at: 2017-09-13 09:30:46,275-05 > > In vdsm logs you provided I see that they start at 2017-09-13 > 09:01:08,895-0500 and end at 2017-09-13 09:53:24,760-0500. > > Please provide vdsm logs from the time the issue occurred. > > Thanks, > Piotr > > On Wed, Sep 13, 2017 at 5:09 PM, Bryan Sockel <bryan.soc...@altn.com> wrote: >> >> Hi >> >> Having an issue where i frequently have a server that is set to not >> responsive. VM's are set to unknown status, but still continue to run. >> This issue is isolated to just a single host. My Setup is currently a 2 >> Data Center Configuration with 2 servers in each data center. Issue is >> occurring at my remote site. >> >> The primary storage volumes are setup on dedicated hardware, with the >> arbiter running on the server that is having issues. There is also >> another >> gluster replica volume hosted on this box, the replica is the other >> dedicated server. >> >> The logs are showing: >> >> [org.ovirt.engine.core.vdsbroker.vdsbroker.GetCapabilitiesVDSCommand] >> (DefaultQuartzScheduler8) [] Command 'GetCapabilitiesVDSCommand(HostName = >> vm-host-colo-1, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', >> hostId='e75d4446-9bfc-47cb-8bf8-a2e681720b66', >> vds='Host[vm-host-colo-1,e75d4446-9bfc-47cb-8bf8-a2e681720b66]'})' >> execution >> failed: java.rmi.ConnectException: Connection timeout >> >> [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] >> (DefaultQuartzScheduler8) [] Failure to refresh host 'vm-host-colo-1' >> runtime info: java.rmi.ConnectException: Connection timeout. >> >> >> I have attached the vdsm.log from the server with issues and the >> engine.log. >> >> Thanks >> >> Bryan Sockel >> >> >> _______________________________________________ >> Users mailing list >> Users@ovirt.org >> http://lists.ovirt.org/mailman/listinfo/users >> _______________________________________________ Users mailing list Users@ovirt.org http://lists.ovirt.org/mailman/listinfo/users