[ovirt-users] Re: Daily reboots of Hosted Engine?
On Tue, Mar 19, 2019 at 3:40 PM Juhani Rautiainen wrote:
>
> > while true; do
> >     ping -c 1 -W 2 10.168.8.1 > /dev/null; echo $?; sleep 0.5;
> > done
>
> I'll try this tomorrow during the expected failure time.

And I found the reason. Nothing wrong with oVirt. A big file transfer
goes through the firewall every fifteen minutes, and during it the
firewall's ping response times become horrible. And it's an
enterprise-level firewall. Sorry for wasting your time, and thanks for
the help,

Juhani
___
Users mailing list -- users@ovirt.org
To unsubscribe send an email to users-le...@ovirt.org
Privacy Statement: https://www.ovirt.org/site/privacy-policy/
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/users@ovirt.org/message/BP6KFMNYIOMH3YS55UEF5NZEWKQH34KH/
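In hindsight the tcpdump capture and the broker warnings are consistent:
the monitor pings with a 2-second deadline (-W 2), so a reply that
arrives late still makes ping exit non-zero, even though tcpdump records
a request/reply pair. A minimal sketch of that classification, using
hypothetical round-trip times loosely modeled on the capture elsewhere
in the thread (the values and names here are illustrative, not measured):

```python
# The -W 2 deadline used by the monitoring loop: ping exits non-zero
# when no reply arrives within 2 seconds, so a slow-but-answered probe
# still counts as a failure.
DEADLINE_S = 2.0

def probe_ok(rtt_s, deadline_s=DEADLINE_S):
    """True if a reply arrived before the deadline; None means no reply."""
    return rtt_s is not None and rtt_s < deadline_s

# Hypothetical RTTs (seconds) during the 15-minute transfer window;
# some replies take longer than the deadline:
rtts = [0.0002, 2.42, 0.094, 1.83, 2.53]
results = [probe_ok(r) for r in rtts]
print(results)  # the slow replies register as failed probes
```

This is why the firewall load shows up as "failed" pings rather than
as visible packet loss.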
[ovirt-users] Re: Daily reboots of Hosted Engine?
On Tue, Mar 19, 2019 at 3:01 PM Simone Tiraboschi wrote:
>
> > No failed pings to be seen. So how does that ping.py decide that
> > 4 out of 5 failed?
>
> It's just calling the system ping utility as an external process and
> checking the exit code. I don't see any issue with that approach.

I was looking at the same thing, but I can also see that the packets
reach the host NIC. I just read the times again, and it seems that the
first ping was delayed (it took over 2 seconds). So is that "4 out of 5"
the number of pings that succeeded? Because I read it the other way.

> Can you please try executing:
>
> while true; do
>     ping -c 1 -W 2 10.168.8.1 > /dev/null; echo $?; sleep 0.5;
> done

I'll try this tomorrow during the expected failure time.

Thanks,
Juhani
[ovirt-users] Re: Daily reboots of Hosted Engine?
On Tue, Mar 19, 2019 at 1:32 PM Juhani Rautiainen <juhani.rautiai...@gmail.com> wrote:
>
> I don't know much Python, but I think there is a problem with
> broker/ping.py. I noticed that these ping failures happen every
> fifteen minutes:
>
> [root@ovirt01 ~]# grep Failed /var/log/ovirt-hosted-engine-ha/broker.log
> Thread-1::WARNING::2019-03-19 14:04:44,898::ping::63::ping.Ping::(action) Failed to ping 10.168.8.1, (4 out of 5)
> Thread-1::WARNING::2019-03-19 14:19:38,891::ping::63::ping.Ping::(action) Failed to ping 10.168.8.1, (4 out of 5)
>
> I monitored the firewall and the network traffic on the host, and ping
> works, but that ping.py somehow thinks that it did not get replies. I
> can't see anything obvious in the code. But this is from tcpdump from
> that last failure time frame:
>
> [...]
>
> No failed pings to be seen. So how does that ping.py decide that 4 out
> of 5 failed?

It's just calling the system ping utility as an external process and
checking the exit code. I don't see any issue with that approach.

Can you please try executing:

while true; do
    ping -c 1 -W 2 10.168.8.1 > /dev/null; echo $?; sleep 0.5;
done
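For reference, the exit-code approach described here can be sketched
roughly as follows. This is a hand-written approximation, not the actual
broker/ping.py source; the function and variable names are made up:

```python
import subprocess

def ping_once(address, timeout_s=2):
    # Invoke the system ping as an external process; a non-zero exit
    # code means no reply arrived within the -W deadline.
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def failed_count(outcomes):
    # The broker warning reports failures out of 5 probes, e.g.
    # "Failed to ping 10.168.8.1, (4 out of 5)".
    return sum(1 for ok in outcomes if not ok)
```

Note that under this scheme a reply delayed beyond the deadline is
indistinguishable from a dropped packet, which is how tcpdump can show
request/reply pairs while the monitor still reports failures.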
[ovirt-users] Re: Daily reboots of Hosted Engine?
On Tue, Mar 19, 2019 at 1:33 PM Juhani Rautiainen wrote:
>
> On Tue, Mar 19, 2019 at 12:46 PM Juhani Rautiainen
>
> It seems that either our firewall is not responding to pings or
> something else is wrong. Looking at the broker.log this can be seen.
> The curious thing is that the reboot happens even when the ping comes
> back in a couple of seconds. Is there a timeout in ping, or does it
> fire them in quick succession?

I don't know much Python, but I think there is a problem with
broker/ping.py. I noticed that these ping failures happen every
fifteen minutes:

[root@ovirt01 ~]# grep Failed /var/log/ovirt-hosted-engine-ha/broker.log
Thread-1::WARNING::2019-03-19 14:04:44,898::ping::63::ping.Ping::(action) Failed to ping 10.168.8.1, (4 out of 5)
Thread-1::WARNING::2019-03-19 14:19:38,891::ping::63::ping.Ping::(action) Failed to ping 10.168.8.1, (4 out of 5)

I monitored the firewall and the network traffic on the host, and ping
works, but that ping.py somehow thinks that it did not get replies. I
can't see anything obvious in the code. But this is from tcpdump from
that last failure time frame:

14:19:22.598518 IP ovirt01.virt.local > gateway: ICMP echo request, id 19055, seq 1, length 64
14:19:22.598705 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19055, seq 1, length 64
14:19:23.126800 IP ovirt01.virt.local > gateway: ICMP echo request, id 19056, seq 1, length 64
14:19:23.126978 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19056, seq 1, length 64
14:19:23.653544 IP ovirt01.virt.local > gateway: ICMP echo request, id 19057, seq 1, length 64
14:19:23.653731 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19057, seq 1, length 64
14:19:24.180846 IP ovirt01.virt.local > gateway: ICMP echo request, id 19058, seq 1, length 64
14:19:24.181042 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19058, seq 1, length 64
14:19:24.708083 IP ovirt01.virt.local > gateway: ICMP echo request, id 19065, seq 1, length 64
14:19:24.708274 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19065, seq 1, length 64
14:19:32.743986 IP ovirt01.virt.local > gateway: ICMP echo request, id 19141, seq 1, length 64
14:19:35.160398 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19141, seq 1, length 64
14:19:35.271171 IP ovirt01.virt.local > gateway: ICMP echo request, id 19152, seq 1, length 64
14:19:35.365315 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19152, seq 1, length 64
14:19:35.892716 IP ovirt01.virt.local > gateway: ICMP echo request, id 19154, seq 1, length 64
14:19:36.002087 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19154, seq 1, length 64
14:19:36.529263 IP ovirt01.virt.local > gateway: ICMP echo request, id 19156, seq 1, length 64
14:19:38.359281 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19156, seq 1, length 64
14:19:38.887231 IP ovirt01.virt.local > gateway: ICMP echo request, id 19201, seq 1, length 64
14:19:38.889774 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19201, seq 1, length 64
14:19:42.923684 IP ovirt01.virt.local > gateway: ICMP echo request, id 19234, seq 1, length 64
14:19:42.923951 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19234, seq 1, length 64
14:19:43.450788 IP ovirt01.virt.local > gateway: ICMP echo request, id 19235, seq 1, length 64
14:19:43.450968 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19235, seq 1, length 64
14:19:43.977791 IP ovirt01.virt.local > gateway: ICMP echo request, id 19237, seq 1, length 64
14:19:43.977965 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19237, seq 1, length 64
14:19:44.504541 IP ovirt01.virt.local > gateway: ICMP echo request, id 19238, seq 1, length 64
14:19:44.504715 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19238, seq 1, length 64
14:19:45.031570 IP ovirt01.virt.local > gateway: ICMP echo request, id 19244, seq 1, length 64
14:19:45.031752 IP gateway > ovirt01.virt.local: ICMP echo reply, id 19244, seq 1, length 64

No failed pings to be seen. So how does that ping.py decide that 4 out
of 5 failed?

Thanks,
Juhani
[ovirt-users] Re: Daily reboots of Hosted Engine?
On Tue, Mar 19, 2019 at 12:46 PM Juhani Rautiainen wrote:
>
> Couldn't find anything that jumps out as a problem, but another post
> on the list made me check the ha-agent logs. This is the reason for
> the reboot:
>
> MainThread::INFO::2019-03-19 12:04:41,262::states::135::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to gateway status
> MainThread::INFO::2019-03-19 12:04:41,263::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 1800)
> MainThread::ERROR::2019-03-19 12:04:51,283::states::435::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ovirt02.virt.local (id 2) score is significantly better than local score, shutting down VM on this host
> MainThread::INFO::2019-03-19 12:04:51,467::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-EngineStop) sent? sent
> MainThread::INFO::2019-03-19 12:04:51,624::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStop (score: 3400)
>
> So the HA agent does the reboot. Now the question is: what does that
> 'Penalizing score by 1600 due to gateway status' mean? Other HA VMs
> don't seem to have any problems.

It seems that either our firewall is not responding to pings or
something else is wrong. Looking at the broker.log this can be seen.
The curious thing is that the reboot happens even when the ping comes
back in a couple of seconds. Is there a timeout in ping, or does it
fire them in quick succession?

Thread-1::INFO::2019-03-19 12:04:20,244::ping::60::ping.Ping::(action) Successfully pinged 10.168.8.1
Thread-2::INFO::2019-03-19 12:04:20,567::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-5::INFO::2019-03-19 12:04:24,729::engine_health::242::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2019-03-19 12:04:29,745::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-3::INFO::2019-03-19 12:04:30,166::mem_free::51::mem_free.MemFree::(action) memFree: 340451
Thread-5::INFO::2019-03-19 12:04:34,843::engine_health::242::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2019-03-19 12:04:39,926::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-3::INFO::2019-03-19 12:04:40,287::mem_free::51::mem_free.MemFree::(action) memFree: 340450
Thread-1::WARNING::2019-03-19 12:04:40,389::ping::63::ping.Ping::(action) Failed to ping 10.168.8.1, (0 out of 5)
Thread-1::INFO::2019-03-19 12:04:43,474::ping::60::ping.Ping::(action) Successfully pinged 10.168.8.1
Thread-5::INFO::2019-03-19 12:04:44,961::engine_health::242::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-2::INFO::2019-03-19 12:04:50,154::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-3::INFO::2019-03-19 12:04:50,415::mem_free::51::mem_free.MemFree::(action) memFree: 340454
Thread-1::INFO::2019-03-19 12:04:51,616::ping::60::ping.Ping::(action) Successfully pinged 10.168.8.1
Thread-5::INFO::2019-03-19 12:04:55,076::engine_health::242::engine_health.EngineHealth::(_result_from_stats) VM is up on this host with healthy engine
Thread-4::INFO::2019-03-19 12:04:59,197::cpu_load_no_engine::126::cpu_load_no_engine.CpuLoadNoEngine::(calculate_load) System load total=0.0247, engine=0.0004, non-engine=0.0243
Thread-2::INFO::2019-03-19 12:05:00,434::mgmt_bridge::62::mgmt_bridge.MgmtBridge::(action) Found bridge ovirtmgmt with ports
Thread-3::INFO::2019-03-19 12:05:00,541::mem_free::51::mem_free.MemFree::(action) memFree: 340433
Thread-1::INFO::2019-03-19 12:05:01,763::ping::60::ping.Ping::(action) Successfully pinged 10.168.8.1
Thread-7::INFO::2019-03-19 12:05:06,692::engine_health::203::engine_health.EngineHealth::(_result_from_stats) VM not running on this host, status Down

Thanks,
Juhani
[ovirt-users] Re: Daily reboots of Hosted Engine?
On Tue, Mar 19, 2019 at 12:39 PM Kaustav Majumder wrote:
>
> It should not affect this.
>
> > Can this cause problems? I noticed that this message was in the
> > events an hour before the reboot:
>
> @Sahina Bose what can cause such?
>
> > Invalid status on Data Center Default. Setting status to Non Responsive.
> >
> > Same event happened just after the reboot.
> >
> > -Juhani
>
> Can you also check the vdsm logs for any anomaly around the time of
> the reboot.

Couldn't find anything that jumps out as a problem, but another post
on the list made me check the ha-agent logs. This is the reason for
the reboot:

MainThread::INFO::2019-03-19 12:04:41,262::states::135::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 1600 due to gateway status
MainThread::INFO::2019-03-19 12:04:41,263::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 1800)
MainThread::ERROR::2019-03-19 12:04:51,283::states::435::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ovirt02.virt.local (id 2) score is significantly better than local score, shutting down VM on this host
MainThread::INFO::2019-03-19 12:04:51,467::brokerlink::68::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineUp-EngineStop) sent? sent
MainThread::INFO::2019-03-19 12:04:51,624::hosted_engine::493::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStop (score: 3400)

So the HA agent does the reboot. Now the question is: what does that
'Penalizing score by 1600 due to gateway status' mean? Other HA VMs
don't seem to have any problems.

Thanks,
Juhani
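The log lines quoted in this message are at least arithmetically
consistent with a simple reading. As a sketch only: the base value of
3400 is inferred from the scores shown in the log, not taken from the
agent source.

```python
# Inferred from the quoted log: a healthy host scores 3400 (the score
# shown after EngineStop) and a failing gateway check costs 1600.
BASE_SCORE = 3400
GATEWAY_PENALTY = 1600

local_score = BASE_SCORE - GATEWAY_PENALTY  # matches "EngineUp (score: 1800)"
remote_score = BASE_SCORE                   # ovirt02, whose gateway pings succeed

# With the remote score "significantly better" than the local one,
# the agent stops the engine VM locally so it can start on the peer.
print(local_score, remote_score)
```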
[ovirt-users] Re: Daily reboots of Hosted Engine?
On Tue, Mar 19, 2019 at 4:00 PM Juhani Rautiainen <juhani.rautiai...@gmail.com> wrote:
>
> On Tue, Mar 19, 2019 at 12:21 PM Kaustav Majumder wrote:
> >
> > Hi,
> > Can you check if the HE VM FQDN resolves to its IP from all the hosts?
>
> I checked both hosts and DNS resolution works fine. It just occurred
> to me that I also added the addresses to /etc/hosts in case DNS fails.

It should not affect this.

> Can this cause problems? I noticed that this message was in the events
> an hour before the reboot:

@Sahina Bose what can cause such?

> Invalid status on Data Center Default. Setting status to Non Responsive.
>
> Same event happened just after the reboot.
>
> -Juhani

Can you also check the vdsm logs for any anomaly around the time of
the reboot.

--
Thanks,
Kaustav Majumder
[ovirt-users] Re: Daily reboots of Hosted Engine?
On Tue, Mar 19, 2019 at 12:21 PM Kaustav Majumder wrote:
>
> Hi,
> Can you check if the HE VM FQDN resolves to its IP from all the hosts?

I checked both hosts and DNS resolution works fine. It just occurred
to me that I also added the addresses to /etc/hosts in case DNS fails.
Can this cause problems? I noticed that this message was in the events
an hour before the reboot:

Invalid status on Data Center Default. Setting status to Non Responsive.

The same event happened just after the reboot.

-Juhani
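To double-check the resolution question on each host, something like
the following can be run. The engine FQDN below is a placeholder;
resolution via getaddrinfo goes through NSS, so /etc/hosts entries are
consulted just as they would be for the HA services:

```python
import socket

def resolve(fqdn):
    # Returns the sorted set of addresses the resolver reports for a
    # name (via NSS, so /etc/hosts is honored), or None on failure.
    try:
        infos = socket.getaddrinfo(fqdn, None)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return None

# Placeholder name -- substitute the real engine FQDN:
# print(resolve("engine.virt.local"))
```

If the two hosts return different addresses, or /etc/hosts and DNS
disagree, that is worth fixing even if it turns out not to be the
cause here.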
[ovirt-users] Re: Daily reboots of Hosted Engine?
Hi,
Can you check if the HE VM FQDN resolves to its IP from all the hosts?

On Tue, Mar 19, 2019 at 3:48 PM Juhani Rautiainen <juhani.rautiai...@gmail.com> wrote:
>
> Hi!
>
> The hosted engine reboots itself almost daily. Is this by design? If
> not, where should I be searching for clues as to why it shuts down?
> Someone is giving a reboot order to the HE, because /var/log/messages
> contains this:
>
> Mar 19 12:05:00 ovirtmgr qemu-ga: info: guest-shutdown called, mode: powerdown
> Mar 19 12:05:00 ovirtmgr systemd: Started Delayed Shutdown Service.
>
> And I'm still running v4.3.0, because the upgrade to it was a bit
> painful and I haven't dared another round yet.
>
> Thanks,
> -Juhani

--
Thanks,
Kaustav